UTF-8 & perl

Backstory

A few days ago at work, I had to do regular expression matching on some UTF-8 or Unicode[1] encoded text. That led me to read perlunitut, where I learned that UTF-8 & Unicode are not the same thing[2]: Unicode is a character set, there are several encodings for it, & UTF-8 is the most commonly used one.
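To make the difference concrete, here is a minimal sketch using the core Encode module. The sample name is made up for illustration. It shows that the same text has one length as characters and another as UTF-8 bytes, and that a regular expression only treats ë as a word character once the bytes have been decoded.

```perl
use strict;
use warnings;
use utf8;                                # string literals in this file are UTF-8
use Encode qw(encode decode);

my $chars = "Zoë";                       # hypothetical sample: a string of characters
my $bytes = encode('UTF-8', $chars);     # the same text as UTF-8 encoded bytes

my $char_count = length $chars;          # counts characters
my $byte_count = length $bytes;          # counts bytes; ë takes two bytes in UTF-8

# Decode incoming bytes before matching, so \w sees characters, not bytes.
my $decoded       = decode('UTF-8', $bytes);
my $decoded_match = $decoded =~ /\AZo\w\z/ ? 1 : 0;
my $bytes_match   = $bytes   =~ /\AZo\w\z/ ? 1 : 0;
```

The rule of thumb from perlunitut is to decode on input, work with characters in the program, and encode on output.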

Today, I was bitten by my code once again. This is what it does (in brief):

  • Store the details of an event in a YAML file.
  • Construct a HAML file using data from the YAML file.
  • Construct a .tt (Template Toolkit) file from the HAML file.
  • Finally, generate an .html file from that .tt file.
  • Also, send an email, in which the Unicode character appeared malformed too.

In the YAML file, one of the speakers has a Unicode character in his name, which appeared malformed in the browser.

[Screenshot: the malformed name as rendered in the browser]

When I tried to print the variable holding the UTF-8 text, I got this warning:

Wide character in print

Solutions

My code used quite a few modules, namely:

  • IO::All for writing text to .haml file.
  • Text::Haml for rendering text of .haml file.
  • Template for rendering .tt file.
  • Email::Simple for building the emails to send.

IO::All

I was using the io function from IO::All for reading & writing files. You can enable UTF-8 encoding in IO::All in two ways (from the man page of IO::All):

  • Enable UTF-8 for a single operation:
$contents = io('file.txt')->utf8->all;  # Turn on utf8
  • Enable UTF-8 for all operations:
use IO::All -utf8;                      # Turn on utf8 for all io
$contents = io('file.txt')->all;        # by default in this package.
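For comparison, roughly the same effect can be had without IO::All, using core Perl's open with an encoding layer. This is only a sketch of the idea, not a description of what IO::All does internally; the sample text and the temporary file are made up for illustration.

```perl
use strict;
use warnings;
use utf8;                                # string literals in this file are UTF-8
use File::Temp qw(tempfile);

my $text = "Zoë ☺\n";                    # hypothetical sample text

# Hypothetical temporary file standing in for file.txt.
my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

# Encode characters to UTF-8 bytes on the way out...
open my $out, '>:encoding(UTF-8)', $path or die "Cannot write $path: $!";
print {$out} $text;
close $out or die "Cannot close $path: $!";

# ...and decode UTF-8 bytes back to characters on the way in.
open my $in, '<:encoding(UTF-8)', $path or die "Cannot read $path: $!";
my $contents = do { local $/; <$in> };   # slurp the whole file
close $in;
```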

Text::Haml

In Text::Haml, you can set the text encoding when creating the object, like this:

my $haml   = Text::Haml->new( encoding => 'utf-8' );

Template

In Template, you can specify the output encoding as an argument to process, like this:


$tt->process(
    'input.tt',
    { key => "value" },
    'output.html',
    binmode => ':utf8'  # write output.html as UTF-8
) or die $tt->error;

Email::Simple

In Email::Simple, you can declare the encoding in the email headers, like this:


my $email = Email::Simple->create(
    header => [
        To             => $to,
        From           => $from,
        Subject        => $subject,
        'Content-Type' => "text/plain; charset=utf-8",
    ],
    body => $email_body,
);
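One thing to watch: as far as I know, the Content-Type header only declares the charset, and Email::Simple does not encode anything for you. A minimal sketch of preparing the values with the core Encode module follows. The subject and body are made up, and whether your mail setup needs exactly this is an assumption.

```perl
use strict;
use warnings;
use utf8;                                # string literals in this file are UTF-8
use Encode qw(encode decode);

my $subject    = "Talk by Zoë";          # hypothetical subject line
my $email_body = "Speaker: Zoë\n";       # hypothetical message body

# Header values must be ASCII, so wrap non-ASCII text in MIME encoded-words.
my $subject_header = encode('MIME-Header', $subject);

# The body goes out as bytes, matching the declared charset=utf-8.
my $body_bytes = encode('UTF-8', $email_body);
```

$subject_header and $body_bytes would then be passed as the Subject header and the body in the create call.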

print statements

To get rid of this warning:

Wide character in print

you can do this:

# This tells Perl that the source code of your program is written in UTF-8
use utf8;

# This tells STDOUT to encode everything printed to it as UTF-8
binmode STDOUT, ':utf8';
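Here is a small sketch that reproduces the warning and shows the layer fixing it. It prints to a temporary file instead of STDOUT so the warning can be captured; the sample text is made up. It uses ':encoding(UTF-8)', the stricter spelling of the same idea. Note that the warning only fires for characters above U+00FF (the ☺ here); a lone ë would be silently written as a single latin1 byte, which is its own kind of malformed output.

```perl
use strict;
use warnings;
use utf8;                                # string literals in this file are UTF-8
use File::Temp qw(tempfile);

my $name = "Zoë ☺";                      # hypothetical sample; ☺ is above U+00FF

my ($fh, $path) = tempfile(UNLINK => 1);

# Collect warnings instead of printing them, so they can be inspected.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

print {$fh} $name;                       # no encoding layer yet: this warns
binmode $fh, ':encoding(UTF-8)';
print {$fh} $name;                       # layer in place: no warning
close $fh;

my $warning_count = scalar @warnings;
```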

Better solution?

The downside of these solutions is that I had to enable UTF-8 for every module separately. I would like a way to enable UTF-8 throughout the script for everything, but I’ve not found one :(. So, if you know of any such way, please let me know :).
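A partial answer from core Perl is the open pragma, which sets default layers for the standard handles and for every open call in its lexical scope. The sketch below uses made-up sample text and a temporary file. Because the pragma is lexically scoped, it does not reach into other modules such as IO::All or Template, so those still need their own settings.

```perl
use strict;
use warnings;
use utf8;                                # source code is UTF-8
use open qw(:std :encoding(UTF-8));      # standard handles and open() default to UTF-8
use File::Temp qw(tempfile);

my $text = "Zoë ☺\n";                    # hypothetical sample text

my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

# Collect warnings to confirm that no "Wide character" warning appears.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# No explicit layer here: the pragma supplies :encoding(UTF-8).
open my $out, '>', $path or die "Cannot write $path: $!";
print {$out} $text;
close $out or die "Cannot close $path: $!";

open my $in, '<', $path or die "Cannot read $path: $!";
my $contents = do { local $/; <$in> };   # slurp the whole file
close $in;
```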

[1] Don’t you dare stop reading here; if you do, you may remain in the dark.
[2] The reason behind [1].

11 thoughts on “UTF-8 & perl”

  1. To maintain compatibility with older code, Perl 5 still prefers latin1 byte strings over Unicode strings, and you can experience arbitrary downgrades of strings you thought were Unicode. There is currently no way in Perl 5 to say “I really mean that all strings should be Unicode strings” and to choose whether latin1 strings are upgraded for all string ops or treated as fatal errors. Perl 6 won’t have this baggage, but that probably won’t help you with the code you’re writing today. My boilerplate for new scripts includes utf8::all, a module that turns on as much Unicode as possible.
