Perl: Default to UTF-8 encoding

The UTF-8 (Unicode) character encoding is a well-supported alternative to the older ISO-8859-1 (Latin-1) encoding, and makes it easier to work with special characters and multiple languages.  Many developers can exercise sufficient control over their system to ensure that:

  • All Perl source code is encoded in UTF-8
  • All text input files and streams are encoded in UTF-8
  • Interactions with browsers are encoded in UTF-8
  • Interactions with all other interfaces are UTF-8 based
  • All other text data is encoded in UTF-8 by default

If, like me, you have decided to standardize your code on UTF-8 and are not particularly concerned with exceptions such as alternate encodings, then it’s useful (and reasonably safe) to set up scripts to default to UTF-8 character encoding for all text.

Perl versions >5.8 automatically tag text as UTF-8 whenever the interpreter knows for certain that it is UTF-8 encoded.  Examples of this include using chr() with a code point above 255, a string containing the \x{code-point} syntax with such a code point, data from files opened with a :utf8 encoding layer, or any data explicitly tagged as UTF-8 using (for example) utf8::decode().

Unfortunately, for reasons of backwards compatibility with older code, there are many cases where UTF-8 encoded text is not automatically tagged as UTF-8.  Examples of this include UTF-8 encoded string constants in Perl code, filehandles that are not explicitly tagged as UTF-8, environment variables, command-line arguments, data fetched from a database, and CGI input including unescaped URIs, cookies, parameters, and POSTed data.

Strings that contain UTF-8 encoded characters but are not tagged as UTF-8 are treated as Latin-1 text (equivalent to binary), and may produce different/unexpected results when used in functions such as length() and sort(), and in regular expressions.  Additionally, mixing tagged UTF-8 data with untagged UTF-8 data may lead to double-encoding.  One solution is to manually tag all text strings as UTF-8.  However, assuming as above that the system only ever interacts with UTF-8 encoded text, it is desirable to set that as the default in as many cases as it is safe to do so.
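The difference between tagged and untagged strings, and the double-encoding that results from mixing them, can be seen in a few lines.  This is only a minimal sketch: the variable names are made up for illustration, and the untagged string is written with \x escapes to stand in for bytes read from an undecoded source.

```perl
use strict;
use warnings;

my $untagged = "\xC3\xA9";      # the two UTF-8 bytes of "é", not decoded
my $tagged   = $untagged;
utf8::decode ( $tagged );       # decode in place: now the single character U+00E9

# length() counts bytes for the untagged copy and characters for the tagged one.
printf ( "untagged: utf8=%d length=%d\n",
         utf8::is_utf8 ( $untagged ) ? 1 : 0, length ( $untagged ) );
printf ( "tagged:   utf8=%d length=%d\n",
         utf8::is_utf8 ( $tagged ) ? 1 : 0, length ( $tagged ) );

# Mixing the two: Perl treats the untagged bytes as Latin-1 characters,
# so the "é" that was never decoded gets encoded a second time on output.
my $mixed   = $tagged . $untagged;
my $encoded = $mixed;
utf8::encode ( $encoded );
printf ( "mixed:    length=%d, %d bytes once encoded\n",
         length ( $mixed ), length ( $encoded ) );
```

Two copies of "é" should occupy four bytes in UTF-8; the mixed string needs six, which is the double-encoding described above.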

The following script serves as a useful test case for demonstrating Perl’s default UTF-8 support features, and additional features that can be turned on to provide additional defaults:

#!/usr/bin/perl

use CGI;
no bytes;

open ( my $in, "utf8.txt" );
CGI::param ( "UTF8TEST" => '٭' );

my @tests = ( "\x{3a}",                           # Not UTF-8
              "a",                                # Not UTF-8
              "\xE2",                             # Automatic
              chr(0x00B2),                        # Automatic
              "\x{00DA}",                         # Automatic
              chr ( 140 ),                        # Automatic
              chr(0x05d0),                        # Automatic
              "\x{263a}",                         # Automatic
              x,                                  # use utf8
              'é',                                # use utf8
              '□',                                # use utf8
              $ARGV[0],                           # export PERL5OPT=-CA
              $ENV{UTF8TEST},                     # untaint()
              CGI::unescape(CGI::escape('س')),    # CGI override
              CGI::param ( 'UTF8TEST' ),          # CGI override
              <$in>,                              # use open
              `echo -n €`,                        # use open
              <STDIN>,                            # use open
              substr ( <DATA>, 0, -1 ),           # use utf8
              "\xD8" . 'Ø',                       # use utf8
              join ( "", a => "\xE2" ),           # use utf8
              join ( "", "a" => "\xE2" ) );       # Not UTF-8

close ( $in );

my $i = 0;
foreach my $test ( @tests )
{
  print sprintf ( "%.2d: ", ++$i )
      . join ( "; ", "char=$test",
                     "utf8=" . ( utf8::is_utf8 ( $test ) +0 ) ,
                     "length=" . length ( $test ),
                     "bytes=" . bytes::length ( $test ),
                     "alpha=" . ( $test =~ /\w/ || 0 ),
                     "ord=" . join ( ",", unpack ( "C*", $test ) ) ) . "\n";
}

__DATA__
↔

The script is invoked as follows:

>echo -n Ɣ | env UTF8TEST=⁒ utf8 Ω

And here is the result:

01: char=:; utf8=0; length=1; bytes=1; alpha=0; ord=58
02: char=a; utf8=0; length=1; bytes=1; alpha=1; ord=97
03: char=�; utf8=0; length=1; bytes=1; alpha=0; ord=226
04: char=�; utf8=0; length=1; bytes=1; alpha=0; ord=178
05: char=�; utf8=0; length=1; bytes=1; alpha=0; ord=218
06: char=�; utf8=0; length=1; bytes=1; alpha=0; ord=140
Wide character in print at ./utf8 line 37, <DATA> line 1.
07: char=א; utf8=1; length=1; bytes=2; alpha=1; ord=1488
Wide character in print at ./utf8 line 37, <DATA> line 1.
08: char=☺; utf8=1; length=1; bytes=3; alpha=0; ord=9786
09: char=x; utf8=0; length=1; bytes=1; alpha=1; ord=120
10: char=é; utf8=0; length=2; bytes=2; alpha=0; ord=195,169
11: char=□; utf8=0; length=3; bytes=3; alpha=0; ord=226,150,161
12: char=Ω; utf8=0; length=2; bytes=2; alpha=0; ord=206,169
13: char=⁒; utf8=0; length=3; bytes=3; alpha=0; ord=226,129,146
14: char=س; utf8=0; length=2; bytes=2; alpha=0; ord=216,179
15: char=٭; utf8=0; length=2; bytes=2; alpha=0; ord=217,173
16: char=Ʃ; utf8=0; length=2; bytes=2; alpha=0; ord=198,169
17: char=€; utf8=0; length=3; bytes=3; alpha=0; ord=226,130,172
18: char=Ɣ; utf8=0; length=2; bytes=2; alpha=0; ord=198,148
19: char=↔; utf8=0; length=3; bytes=3; alpha=0; ord=226,134,148
20: char=�Ø; utf8=0; length=3; bytes=3; alpha=0; ord=216,195,152
21: char=a�; utf8=0; length=2; bytes=2; alpha=1; ord=97,226
22: char=a�; utf8=0; length=2; bytes=2; alpha=1; ord=97,226

Although the characters do (for the most part) display correctly, this is only because the viewer (shell terminal in this case) interprets the byte sequences correctly as UTF-8.  Most of these test cases are not automatically tagged as UTF-8 (utf8=0), and hence mixing them with text that is tagged may lead to the issues noted above.  Also note that the two cases that are tagged as UTF-8 (utf8=1) produce a “Wide character in print” warning, which indicates the same problem – they are also displaying correctly only because of the viewer.
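The warning appears because the output handle has no encoding layer, so Perl has to guess how to write a character above 255.  As a rough illustration (not part of the test script), the sketch below prints the same smiley through a handle that does have an explicit layer; an in-memory handle and the names $smiley and $buffer are assumptions made so that the resulting bytes can be inspected.

```perl
use strict;
use warnings;

my $smiley = "\x{263a}";        # one character, tagged as UTF-8
my $buffer = "";

# An in-memory handle with an explicit UTF-8 output layer.
open ( my $out, ">:encoding(UTF-8)", \$buffer )
  or die "Cannot open in-memory handle: $!";
print $out $smiley;             # no "Wide character" warning here
close ( $out );

printf ( "%d character written as %d bytes: %s\n",
         length ( $smiley ), length ( $buffer ),
         join ( ",", unpack ( "C*", $buffer ) ) );
```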

The premise here is that in a controlled environment that is standardized on UTF-8, all of the above text can be assumed by default to be UTF-8 encoded, unless Perl is explicitly told otherwise.  How do we do this?

1. Tell Perl that all code, including string constants, is UTF-8 encoded by adding:

use utf8;

2. Tell Perl that all input and output files, handles, and streams are UTF-8 encoded by adding:

use open ( ":encoding(UTF-8)", ":std" );

3. Tag all environment variables as UTF-8 in a loop at the beginning of the script.  If you are running in taint mode (as you should), then you can combine this with untainting, e.g.:

sub untaint
{
  foreach ( keys ( %ENV ) )
  {
    $ENV{$_} = $1 if $ENV{$_} =~ /^([^;{}"`|&<>\n\r]*)$/;
    utf8::decode ( $ENV{$_} );
  }
}

4. Tell Perl that all command-line arguments are UTF-8 encoded using:

>export PERL5OPT=-CA

5. Tell the browser to communicate using UTF-8 encoded text by adding “charset=UTF-8” to the Content-Type header.  This can be done for example in the Apache config file using:

AddDefaultCharset UTF-8

6. Decode CGI params using:

use CGI ( "-utf8" );

7. Connect to MySQL using a UTF-8 encoded interface:

my $connection = DBI->connect ( "dbi:mysql:$database", $username, $password, { mysql_enable_utf8 => 1 } );

8. Remember to set binmode() on non-text (binary) filehandles, and use the Encode module where necessary to make exceptions.

9. Since this list only covers Perl, other interfaces are not addressed here, but don’t forget about the shell’s LANG environment variable, your terminal program, your editor, email you send out via Perl, etc.
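Steps 1 to 3 can be combined into a preamble along the following lines.  This is only a sketch under a few assumptions: the scratch file, the %env copy, and the value assigned to UTF8TEST exist purely so that it runs on its own, and untainting is left out.  Decoding into a copy rather than into %ENV itself is my choice here, so that the sketch does not depend on how a given Perl version stores values in %ENV.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                               # step 1: this source file is UTF-8
use open qw( :std :encoding(UTF-8) );   # step 2: default layer for open() and the STD handles
use File::Temp qw( tempfile );

# Step 1: a literal is now one character rather than two bytes.
my $literal = 'é';

# Step 2: write raw UTF-8 bytes to a scratch file, then read them back
# through the default layer installed by "use open".
my ( $tmp, $path ) = tempfile ( UNLINK => 1 );
binmode ( $tmp );                       # raw bytes, no encoding layer
print $tmp "\xE2\x82\xAC\n";            # the three bytes of the euro sign
close ( $tmp );

open ( my $in, "<", $path ) or die "Cannot open $path: $!";
chomp ( my $from_file = <$in> );
close ( $in );

# Step 3: decode environment variables into a copy of %ENV.
# UTF8TEST is set here only to keep the sketch self-contained.
$ENV{UTF8TEST} = "\xE2\x81\x92";
my %env = map { my $value = $ENV{$_}; utf8::decode ( $value ); ( $_ => $value ) }
              keys ( %ENV );

printf ( "%-8s length=%d ord=%d\n", $_->[0], length ( $_->[1] ), ord ( $_->[1] ) )
  foreach ( [ literal => $literal ],
            [ file    => $from_file ],
            [ env     => $env{UTF8TEST} ] );
```

Each of the three values should report a length of 1, matching the utf8=1 rows in the output below.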

The goal of all this is to automatically tag strings as UTF-8 encoded by default.  With all these in place, the test script produces the following output:

01: char=:; utf8=0; length=1; bytes=1; alpha=0; ord=58
02: char=a; utf8=0; length=1; bytes=1; alpha=1; ord=97
03: char=â; utf8=0; length=1; bytes=1; alpha=0; ord=226
04: char=²; utf8=0; length=1; bytes=1; alpha=0; ord=178
05: char=Ú; utf8=0; length=1; bytes=1; alpha=0; ord=218
06: char=; utf8=0; length=1; bytes=1; alpha=0; ord=140
07: char=א; utf8=1; length=1; bytes=2; alpha=1; ord=1488
08: char=☺; utf8=1; length=1; bytes=3; alpha=0; ord=9786
09: char=x; utf8=1; length=1; bytes=1; alpha=1; ord=120
10: char=é; utf8=1; length=1; bytes=2; alpha=1; ord=233
11: char=□; utf8=1; length=1; bytes=3; alpha=0; ord=9633
12: char=Ω; utf8=1; length=1; bytes=2; alpha=1; ord=937
13: char=⁒; utf8=1; length=1; bytes=3; alpha=0; ord=8274
14: char=س; utf8=1; length=1; bytes=2; alpha=1; ord=1587
15: char=٭; utf8=1; length=1; bytes=2; alpha=0; ord=1645
16: char=Ʃ; utf8=1; length=1; bytes=2; alpha=1; ord=425
17: char=€; utf8=1; length=1; bytes=3; alpha=0; ord=8364
18: char=Ɣ; utf8=1; length=1; bytes=2; alpha=1; ord=404
19: char=↔; utf8=1; length=1; bytes=3; alpha=0; ord=8596
20: char=ØØ; utf8=1; length=2; bytes=4; alpha=1; ord=216,216
21: char=aâ; utf8=1; length=2; bytes=3; alpha=1; ord=97,226
22: char=aâ; utf8=0; length=2; bytes=2; alpha=1; ord=97,226

Now most of the test cases are tagged as UTF-8, with the exception of the non-UTF-8 cases, which are at least displaying correctly now.

Notes:

  • The term “tagged as UTF-8” refers to the “UTF-8 flag” – Perl’s internal flag that indicates that a string has a known decoding.  This is checked in the test script above using utf8::is_utf8(), and is useful for debugging purposes only.
  • The “use” pragmas above have a lexical scope, which means that every script needs them.  In a web server (CGI) environment, this means every web page.  A good content management system should be able to automate this.  It may also be a good idea to use a boilerplate such as utf8::all.
  • In some cases it may be desirable to ensure that correct encoding is used by die()ing, instead of just logging a warning, when it isn’t:
    use warnings ( "FATAL" => "utf8" );
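  • Be sure to set up MySQL to use UTF-8 by adding the following to /etc/my.cnf:
    [client]
    default-character-set = utf8
    [mysqld]
    default-character-set = utf8

    … and converting all existing tables to UTF-8 as well.  (From MySQL 5.5 onwards, the [mysqld] section no longer accepts default-character-set; use character-set-server = utf8 there instead.)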

  • CGI’s -utf8 option only automatically decodes string values returned by CGI::param().  It does not decode:
      • Parameter names
      • Values returned by other CGI functions, such as cookie(), url_param(), unescape(), or Vars()
      • File names

A more comprehensive solution is described below, which decodes parameter names and values from other CGI functions.  However, file names must still be decoded manually when used – as described in the CGI documentation.

This solution overrides some existing CGI methods with UTF-8 friendly versions.  It encodes all incoming arguments and decodes all outgoing results, so that CGI itself only ever handles encoded byte strings (not tagged as UTF-8).  An exception is made for CGI::param(), where only the parameter name is encoded and any values being set are left decoded, so as to prevent double-encoding problems with input fields.

our %originals = ();

foreach my $sub ( "unescape", "cookie",
                  "param", "url_param", "Vars", "delete", "upload" )
{
  next if exists ( $originals{$sub} );

  no strict;

  my $s = "CGI::$sub"; $originals{$sub} = \&$s; $originals{$sub}->();
  *$s = sub
  {
    # Prevent CGI from using overridden functions internally:
    return ( wantarray ? ( $originals{$sub}->( @_ ) )
                       : scalar ( $originals{$sub}->( @_ ) ) )
      if (caller())[0] eq "CGI";

    # Encode incoming arguments so as not to confuse CGI:
    my @args = @_; foreach ( @args )
    {
      utf8::encode ( $_ ) if ! ref ( $_ ) and utf8::is_utf8 ( $_ );
      last if $sub eq "param";
    }

    # Decode results to be returned:
    my @result = map { ref ( $_ ) ? $_ : do { utf8::decode ( $_ ); $_; } }
                     wantarray ? ( $originals{$sub}->( @args ) )
                               : scalar ( $originals{$sub}->( @args ) );

    # Special exception for Vars(), which may return a hashref:
    if ( ! wantarray and ref ( $result[0] ) eq "HASH" )
    {
      my %result = ();
      foreach ( keys ( %{$result[0]} ) )
      {
        my $key = $_; utf8::decode ( $_ );
        $result{$_} = $result[0]->{$key};
        utf8::decode ( $result{$_} );
      }
      $result[0] = \%result;
    }

    return ( wantarray ? @result : $result[0] );
  };
}
  • There are currently many bugs in Perl and modules with respect to UTF-8 support, so not everything goes smoothly.  Perhaps one of the most significant shortcomings is that several file handling functions, including opendir(), readdir(), and features that rely on them such as <globbing*>, File::Find, etc., do not respect the use open pragma for file handling.  This means that file/directory names are not automatically decoded, and there is currently no way to change that default behaviour.  If you are using a UTF-8 encoded file system with encoded file/directory names, then you will have to decode them manually:
    foreach ( <*> )
    {
      utf8::decode($_);
    
      ...
    }

    Another example is Data::Dumper – a module often used to encode and decode complex variables – which does not handle character encodings, so these are lost if not handled manually.  Consider using JSON for this purpose instead.
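As a rough illustration of the JSON suggestion, the sketch below round-trips a structure containing non-ASCII strings.  It assumes JSON::PP, which ships with recent versions of Perl; the data and variable names are made up for the example, and other JSON modules may behave differently.

```perl
use strict;
use warnings;
use JSON::PP;

my $data = { name => "\x{263a}", tags => [ "\x{5d0}", "a" ] };

# ->utf8 makes encode() return UTF-8 bytes and decode() expect them.
my $json  = JSON::PP->new->utf8->canonical;
my $bytes = $json->encode ( $data );    # byte string, safe to write to a raw handle
my $copy  = $json->decode ( $bytes );   # strings come back as decoded characters

printf ( "serialized: %d bytes, utf8 flag=%d\n",
         length ( $bytes ), utf8::is_utf8 ( $bytes ) ? 1 : 0 );
printf ( "name: length=%d ord=%d\n",
         length ( $copy->{name} ), ord ( $copy->{name} ) );
```

The serialized form is plain bytes, so it can be stored or transmitted without an encoding layer, and the strings come back as single characters after decoding.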
