Backstory
A few days ago at work, I had to do regular expression matching on some UTF-8 or Unicode[1] encoded text. That made me read perlunitut. After reading it, I learned that UTF-8 & Unicode are not the same[2]. There are many Unicode encodings, & UTF-8 is the most commonly used one.
Today, once again I was bitten by my code. This is what I did there (in brief):
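To make the difference concrete, here is a minimal sketch using only the core Encode module. The character U+0161 is just an illustrative choice (an assumption, not the character from my data): one Unicode character comes out as different byte sequences depending on the encoding.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# One Unicode character: U+0161, LATIN SMALL LETTER S WITH CARON.
# (Illustrative choice; any character above U+00FF would do.)
my $char = "\x{161}";

# The same character in two different Unicode encodings.
my $utf8_bytes  = encode('UTF-8',    $char);   # bytes C5 A1
my $utf16_bytes = encode('UTF-16BE', $char);   # bytes 01 61

# Decoding the UTF-8 bytes gives back the original character string.
my $decoded = decode('UTF-8', $utf8_bytes);
```

So "Unicode" names the characters, while "UTF-8" is one of several ways to turn them into bytes.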
- I store details of an event in a YAML file.
- Construct a HAML file using data from the YAML file.
- Construct a .tt (Template Toolkit) file from the HAML file.
- Finally, generate an .html file from that .tt file.
In the YAML file, one of the speakers has a Unicode character in his name, which appeared malformed in the browser. The same character also appeared malformed in the email.
When I tried to print the variable holding the UTF-8 text, I got this warning:
Wide character in print
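The warning can be reproduced in a few lines. This is a sketch under assumptions: the name is made up, and an in-memory filehandle stands in for STDOUT so that the warning can be captured. Perl warns when a string containing a character above U+00FF is printed to a handle that has no encoding layer.

```perl
use strict;
use warnings;

# Hypothetical speaker name containing U+0161, which does not fit in one byte.
my $name = "Ale\x{161}";

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# An in-memory handle with no encoding layer stands in for a plain STDOUT.
my $buffer = '';
open my $fh, '>', \$buffer or die "Cannot open in-memory handle: $!";
print {$fh} $name;    # triggers "Wide character in print"
close $fh;
```

After this runs, @warnings holds a single message starting with "Wide character in print".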
Solutions
In my code, I was using quite a few modules, namely:
- IO::All for writing text to .haml file.
- Text::Haml for rendering text of .haml file.
- Template for rendering .tt file.
- Email::Simple for sending emails.
IO::All
I was using IO::All::io for reading & writing files. You can enable UTF-8 encoding in IO::All in two ways (from the man page of IO::All):
- Enable UTF-8 for a single operation:
$contents = io('file.txt')->utf8->all; # Turn on utf8
- Enable UTF-8 for all operations:
use IO::All -utf8;                  # Turn on utf8 for all io
$contents = io('file.txt')->all;    # by default in this package.
Text::Haml
In Text::Haml, you can set the text encoding when creating a Text::Haml object, like this:
my $haml = Text::Haml->new( encoding => 'utf-8' );
Template
In Template, you can specify the output encoding as an argument to process, like this:
$tt->process(
    'input.tt',
    { key => "value" },
    'output.html',
    binmode => ':utf8',    # set encoding to UTF-8
);
Email::Simple
In Email::Simple, you can specify the encoding in the email headers, like this:
my $email = Email::Simple->create(
    header => [
        To             => $to,
        From           => $from,
        Subject        => $subject,
        'Content-Type' => "text/plain; charset=utf-8",
    ],
    body => $email_body,
);
print statements
To get rid of this warning:
Wide character in print
You can do this:
# This allows you to write in UTF-8 in your program
use utf8;
# This tells STDOUT that it should see everything as UTF-8, instead of bytes
binmode STDOUT, ':utf8';
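The effect of binmode can be checked with a small sketch. Assumptions here: an in-memory handle is used in place of STDOUT so the result can be inspected, the name is made up, and the stricter ':encoding(UTF-8)' layer is used instead of ':utf8'. Once the handle has an encoding layer, the wide character is encoded on output and no warning is raised.

```perl
use strict;
use warnings;

# Hypothetical speaker name containing U+0161.
my $name = "Ale\x{161}";

# Collect any warnings so we can confirm none are raised.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# In-memory handle standing in for STDOUT.
my $buffer = '';
open my $out, '>', \$buffer or die "Cannot open in-memory handle: $!";
binmode $out, ':encoding(UTF-8)';    # same idea as binmode STDOUT, ':utf8'
print {$out} $name;
close $out;

my $char_count = length $name;      # 4 characters
my $byte_count = length $buffer;    # 5 bytes, since U+0161 takes two bytes in UTF-8
```

The same binmode call works on any filehandle, not only STDOUT.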
Better solution?
The bad part of these solutions is that I had to enable UTF-8 for every module separately. I would like a way to enable UTF-8 throughout the script for everything. I’ve not found any way to do this :(. So, if you know of one, please let me know :).
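One partial option from core Perl is the open pragma, which sets default I/O layers for the standard handles and for open() calls in its lexical scope. The sketch below assumes a temporary file and a made-up name; it does not reach into CPAN modules that do their own I/O, so it is not the script-wide switch I was hoping for.

```perl
use strict;
use warnings;
use open qw(:std :encoding(UTF-8));    # default layer for STDIN/STDOUT/STDERR and open()
use File::Temp qw(tempfile);

# Hypothetical speaker name containing U+0161.
my $name = "Ale\x{161}";

# Temporary file used only for this demonstration.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# No layer given here: the pragma supplies :encoding(UTF-8).
open my $out, '>', $path or die "Cannot write $path: $!";
print {$out} $name;
close $out or die "Cannot close $path: $!";

# Reading back without a layer decodes the text again.
open my $in, '<', $path or die "Cannot read $path: $!";
my $round_trip = <$in>;
close $in;

# An explicit :raw layer overrides the default and shows the bytes on disk.
open my $raw, '<:raw', $path or die "Cannot read $path: $!";
my $bytes = do { local $/; <$raw> };
close $raw;
```

Here $round_trip equals the original string, and $bytes is its 5-byte UTF-8 encoding.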
[1] Don’t dare stop reading here; if you do, you may stay in the dark.
[2] Reason behind [1]
Have you asked this question on Stack Overflow? Might be a good spot to ask, if it has not already been answered.
No, not yet!
I will, but after doing some research (probably next weekend) 🙂
There is utf8::all, which would save you some Perl-interpreter-only work, but you are on your own wrt. CPAN modules – there is no standard way to request use of utf8 (perhaps if Exporter, Sub::Exporter etc. provided it…).
Thanks, I didn’t know about utf8::all
To maintain compatibility with older code, Perl 5 still prefers latin1 byte strings over Unicode strings, and you can experience arbitrary downgrades of strings you thought were Unicode. There is currently no way in Perl 5 to say "I really mean all strings should be Unicode strings, and at my option latin1 strings need to be either upgraded for all string ops or be fatal errors." Perl 6 won’t have this baggage, but that probably won’t help you with the code you’re writing today. My boilerplate for new scripts includes utf8::all, which is a module that turns on as much Unicode as possible.
I’ve not tried utf8::all, but the other person (Jakub) commented that it doesn’t work with other modules I use in my program.
Thanks! 🙂
Here’s the classic Stack Overflow answer on UTF-8
http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default/#answer-6163129
Try not to weep.
Thanks for sharing it, very informative!
Just don’t do the “binmode(STDOUT, ':utf8')” and then create a thread in a threaded program… https://rt.perl.org/Public/Bug/Display.html?id=31923
Thanks, very informative!