The UTF-8 (Unicode) character encoding is a well-supported alternative to the older ISO-8859-1 (Latin-1) encoding that can make it easier to work with special characters and multiple languages. Many developers can exercise sufficient control over their systems to ensure that:
- All Perl source code is encoded in UTF-8
- All text input files and streams are encoded in UTF-8
- Interactions with browsers are encoded in UTF-8
- Interactions with all other interfaces are UTF-8 based
- All other text data is encoded in UTF-8 by default
If, like me, you have decided to standardize your code on UTF-8 and, like me, are not particularly concerned with exceptions such as alternate encodings, then it’s useful (and reasonably safe) to set up scripts to default to UTF-8 character encoding for all text.
Perl versions 5.8 and later automatically tag text as UTF-8 whenever the interpreter knows for certain that it is UTF-8 encoded. Examples include calling chr() with a code point above 255, a string containing the \x{code-point} syntax, data from files opened with a :utf8 encoding layer, and any data explicitly tagged as UTF-8 using (for example) utf8::decode().
Unfortunately, for reasons of backwards compatibility with older code, there are many cases where UTF-8 encoded text is not automatically tagged as UTF-8. Examples include UTF-8 encoded string constants in Perl code, filehandles that are not explicitly tagged as UTF-8, environment variables, command-line arguments, data fetched from a database, and CGI input including unescaped URIs, cookies, parameters, and POSTed data.
Strings that contain UTF-8 encoded characters but are not tagged as UTF-8 are treated as Latin-1 text (effectively as binary), and may produce different or unexpected results when used in functions such as length() and sort(), and in regular expressions. Additionally, mixing tagged UTF-8 data with untagged UTF-8 data may lead to double-encoding. One solution is to manually tag all text strings as UTF-8. However, assuming as above that the system only ever interacts with UTF-8 encoded text, it is desirable to set that as the default in as many cases as is safe to do so.
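For example, here is a minimal, standalone sketch (not part of the test script below, and assuming no UTF-8 pragmas are in effect) of how the same byte sequence behaves differently depending on whether it is tagged:

my $bytes = "\xD7\x90";          # "א" as raw UTF-8 bytes, not tagged
my $chars = $bytes;
utf8::decode ( $chars );         # the copy is now tagged as UTF-8

print length ( $bytes ), "\n";   # 2 - treated as two Latin-1 characters
print length ( $chars ), "\n";   # 1 - one character
print $bytes =~ /\w/ ? "match\n" : "no match\n";   # no match
print $chars =~ /\w/ ? "match\n" : "no match\n";   # match - א is a letter

# Concatenating the two upgrades $bytes as if it were Latin-1 text,
# which is how double-encoding creeps in:
my $mixed = $bytes . $chars;     # 3 characters instead of the expected 2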
The following script serves as a useful test case for demonstrating Perl’s default UTF-8 support, and the additional features that can be turned on to provide better defaults:
#!/usr/bin/perl

use CGI;
no bytes;

open ( my $in, "utf8.txt" );
CGI::param ( "UTF8TEST" => '٭' );

my @tests = (
    "\x{3a}",                            # Not UTF-8
    "a",                                 # Not UTF-8
    "\xE2",                              # Automatic
    chr(0x00B2),                         # Automatic
    "\x{00DA}",                          # Automatic
    chr ( 140 ),                         # Automatic
    chr(0x05d0),                         # Automatic
    "\x{263a}",                          # Automatic
    x,                                   # use utf8
    'é',                                 # use utf8
    '□',                                 # use utf8
    $ARGV[0],                            # export PERL5OPT=-CA
    $ENV{UTF8TEST},                      # untaint()
    CGI::unescape(CGI::escape('س')),     # CGI override
    CGI::param ( 'UTF8TEST' ),           # CGI override
    <$in>,                               # use open
    `echo -n €`,                         # use open
    <STDIN>,                             # use open
    substr ( <DATA>, 0, -1 ),            # use utf8
    "\xD8" . 'Ø',                        # use utf8
    join ( "", a => "\xE2" ),            # use utf8
    join ( "", "a" => "\xE2" ) );        # Not UTF-8

close ( $in );

my $i = 0;
foreach my $test ( @tests ) {
    print sprintf ( "%.2d: ", ++$i ) . join ( "; ",
        "char=$test",
        "utf8="   . ( utf8::is_utf8 ( $test ) + 0 ),
        "length=" . length ( $test ),
        "bytes="  . bytes::length ( $test ),
        "alpha="  . ( $test =~ /\w/ || 0 ),
        "ord="    . join ( ",", unpack ( "C*", $test ) ) ) . "\n";
}

__DATA__
↔
The script is invoked as follows:
>echo -n Ɣ | env UTF8TEST=⁒ utf8 Ω
And here is the result:
01: char=:; utf8=0; length=1; bytes=1; alpha=0; ord=58
02: char=a; utf8=0; length=1; bytes=1; alpha=1; ord=97
03: char=�; utf8=0; length=1; bytes=1; alpha=0; ord=226
04: char=�; utf8=0; length=1; bytes=1; alpha=0; ord=178
05: char=�; utf8=0; length=1; bytes=1; alpha=0; ord=218
06: char=�; utf8=0; length=1; bytes=1; alpha=0; ord=140
Wide character in print at ./utf8 line 37, <DATA> line 1.
07: char=א; utf8=1; length=1; bytes=2; alpha=1; ord=1488
Wide character in print at ./utf8 line 37, <DATA> line 1.
08: char=☺; utf8=1; length=1; bytes=3; alpha=0; ord=9786
09: char=x; utf8=0; length=1; bytes=1; alpha=1; ord=120
10: char=é; utf8=0; length=2; bytes=2; alpha=0; ord=195,169
11: char=□; utf8=0; length=3; bytes=3; alpha=0; ord=226,150,161
12: char=Ω; utf8=0; length=2; bytes=2; alpha=0; ord=206,169
13: char=⁒; utf8=0; length=3; bytes=3; alpha=0; ord=226,129,146
14: char=س; utf8=0; length=2; bytes=2; alpha=0; ord=216,179
15: char=٭; utf8=0; length=2; bytes=2; alpha=0; ord=217,173
16: char=Ʃ; utf8=0; length=2; bytes=2; alpha=0; ord=198,169
17: char=€; utf8=0; length=3; bytes=3; alpha=0; ord=226,130,172
18: char=Ɣ; utf8=0; length=2; bytes=2; alpha=0; ord=198,148
19: char=↔; utf8=0; length=3; bytes=3; alpha=0; ord=226,134,148
20: char=�Ø; utf8=0; length=3; bytes=3; alpha=0; ord=216,195,152
21: char=a�; utf8=0; length=2; bytes=2; alpha=1; ord=97,226
22: char=a�; utf8=0; length=2; bytes=2; alpha=1; ord=97,226
Although the characters do (for the most part) display correctly, this is only because the viewer (a shell terminal in this case) interprets the byte sequences as UTF-8. Most of these test cases are not automatically tagged as UTF-8 (utf8=0), and mixing them with text that is tagged may lead to the issues noted above. Also note that the two cases that are tagged as UTF-8 (utf8=1) produce a “Wide character in print” warning, which indicates the same problem: they too display correctly only because of the viewer.
The premise here is that in a controlled environment that is standardized on UTF-8, all of the above text can be assumed by default to be UTF-8 encoded, unless Perl is explicitly told otherwise. How do we do this?
1. Tell Perl that all code, including string constants, is UTF-8 encoded by adding:
use utf8;
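As a quick illustration (a trivial sketch, nothing assumed beyond the pragma itself), a non-ASCII literal in the source is then read as characters rather than bytes:

use utf8;

my $str = "é";
print length ( $str ), "\n";   # 1 (one character, not two bytes)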
2. Tell Perl that all input and output files, handles, and streams are UTF-8 encoded by adding:
use open ( ":encoding(UTF-8)", ":std" );
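For example (a sketch, assuming a hypothetical UTF-8 encoded file named utf8.txt), handles opened under the pragma return decoded characters, and STDOUT encodes them again on the way out:

use open ( ":encoding(UTF-8)", ":std" );

open ( my $fh, "<", "utf8.txt" ) or die "open: $!";
my $line = <$fh>;   # already decoded into characters
close ( $fh );
print $line;        # re-encoded to UTF-8 bytes on STDOUT, with no “Wide character” warning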
3. Tag all environment variables as UTF-8 in a loop at the beginning of the script. If you are running in taint mode (as you should be), you can combine this with untainting, e.g.:
sub untaint {
    foreach ( keys ( %ENV ) ) {
        $ENV{$_} = $1 if $ENV{$_} =~ /^([^;{}"`|&<>\n\r]*)$/;
        utf8::decode ( $ENV{$_} );
    }
}
4. Tell Perl that all command-line arguments are UTF-8 encoded using:
>export PERL5OPT=-CA
5. Tell the browser to communicate using UTF-8 encoded text by adding “charset=UTF-8” to the Content-Type header. This can be done for example in the Apache config file using:
AddDefaultCharset UTF-8
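Alternatively (or additionally), the charset can be set per response from the script itself; CGI.pm's header() accepts a -charset argument:

use CGI;
my $cgi = CGI->new;
print $cgi->header ( -type => "text/html", -charset => "UTF-8" );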
6. Decode CGI params using:
use CGI ( "-utf8" );
7. Connect to MySQL using a UTF-8 encoded interface:
my $connection = DBI->connect ( "dbi:mysql:$database", $username, $password, { mysql_enable_utf8 => 1 } );
8. Remember to set binmode() on non-text (binary) filehandles, and use the Encode module where necessary to make exceptions.
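For example (a sketch with hypothetical file names), a binary handle can be exempted from the default text layers, and Encode can handle one-off conversions from other encodings:

use Encode;

# Binary data: strip the text encoding layers added by "use open":
open ( my $img, "<", "logo.png" ) or die "open: $!";
binmode ( $img );   # or binmode ( $img, ":raw" )

# A one-off exception: decode text that arrives as Latin-1:
my $latin1_bytes = "caf\xE9";                                # "café" encoded as Latin-1
my $text = Encode::decode ( "ISO-8859-1", $latin1_bytes );   # decoded characters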
9. Since this list only covers Perl, other interfaces are not addressed here, but don’t forget about: The shell’s LANG environment variable, your terminal program, your editor, email you send out via Perl, etc.
The goal of all this is to automatically tag strings as UTF-8 encoded by default. With all these in place, the test script produces the following output:
01: char=:; utf8=0; length=1; bytes=1; alpha=0; ord=58
02: char=a; utf8=0; length=1; bytes=1; alpha=1; ord=97
03: char=â; utf8=0; length=1; bytes=1; alpha=0; ord=226
04: char=²; utf8=0; length=1; bytes=1; alpha=0; ord=178
05: char=Ú; utf8=0; length=1; bytes=1; alpha=0; ord=218
06: char=; utf8=0; length=1; bytes=1; alpha=0; ord=140
07: char=א; utf8=1; length=1; bytes=2; alpha=1; ord=1488
08: char=☺; utf8=1; length=1; bytes=3; alpha=0; ord=9786
09: char=x; utf8=1; length=1; bytes=1; alpha=1; ord=120
10: char=é; utf8=1; length=1; bytes=2; alpha=1; ord=233
11: char=□; utf8=1; length=1; bytes=3; alpha=0; ord=9633
12: char=Ω; utf8=1; length=1; bytes=2; alpha=1; ord=937
13: char=⁒; utf8=1; length=1; bytes=3; alpha=0; ord=8274
14: char=س; utf8=1; length=1; bytes=2; alpha=1; ord=1587
15: char=٭; utf8=1; length=1; bytes=2; alpha=0; ord=1645
16: char=Ʃ; utf8=1; length=1; bytes=2; alpha=1; ord=425
17: char=€; utf8=1; length=1; bytes=3; alpha=0; ord=8364
18: char=Ɣ; utf8=1; length=1; bytes=2; alpha=1; ord=404
19: char=↔; utf8=1; length=1; bytes=3; alpha=0; ord=8596
20: char=ØØ; utf8=1; length=2; bytes=4; alpha=1; ord=216,216
21: char=aâ; utf8=1; length=2; bytes=3; alpha=1; ord=97,226
22: char=aâ; utf8=0; length=2; bytes=2; alpha=1; ord=97,226
Now most of the test cases are tagged as UTF-8, with the exception of the non-UTF-8 cases, which are at least displaying correctly now.
Notes:
- The term “tagged as UTF-8” refers to the “UTF-8 flag”, Perl’s internal flag indicating that a string has been decoded and is being treated as character data. It is checked in the test script above using utf8::is_utf8(), and is useful for debugging purposes only.
- The “use” pragmas above have a lexical scope, which means that every script needs them. In a web server (CGI) environment, this means every web page. A good content management system should be able to automate this. It may also be a good idea to use a boilerplate such as utf8::all.
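For example, a minimal boilerplate along these lines (a sketch; adjust to your own defaults) could be generated at the top of every script, or replaced with utf8::all:

use utf8;                                   # source code and literals are UTF-8
use open ( ":encoding(UTF-8)", ":std" );    # files, streams, and the std handles
use CGI ( "-utf8" );                        # decode CGI parameter values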
- In some cases it may be desirable to die(), rather than just log a warning, when an incorrect encoding is detected:
use warnings ( "FATAL" => "utf8" );
- Be sure to set up MySQL to use UTF-8 by adding the following to /etc/my.cnf:
[client]
default-character-set = utf8

[mysqld]
default-character-set = utf8
… and converting all existing tables to UTF-8 as well.
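For example, each existing table can be converted with an ALTER TABLE statement, issued either from the mysql client or through the DBI handle created above (the table name here is hypothetical):

$connection->do ( "ALTER TABLE mytable CONVERT TO CHARACTER SET utf8" );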
- CGI’s -utf8 option only automatically decodes string values returned by CGI::param(). It does not decode:
- Parameter names
- Values returned by other CGI functions, such as cookie(), url_param(), unescape(), or Vars()
- File names
A more comprehensive solution is described below; it also decodes parameter names and the values returned by other CGI functions. However, file names must still be decoded manually when used, as described in the CGI documentation.
This solution overrides some existing CGI functions with UTF-8 friendly versions. It encodes all incoming arguments and decodes all outgoing results, so that CGI itself only ever deals with encoded (untagged) strings. An exception is made for CGI::param(): values being set are left decoded, to prevent double-encoding problems with input fields.
our %originals = ();
foreach my $sub ( "unescape", "cookie", "param", "url_param", "Vars", "delete", "upload" ) {
    next if exists ( $originals{$sub} );
    no strict;
    my $s = "CGI::$sub";
    $originals{$sub} = \&$s;
    $originals{$sub}->();
    *$s = sub {
        # Prevent CGI from using overridden functions internally:
        return ( wantarray
            ? ( $originals{$sub}->( @_ ) )
            : scalar ( $originals{$sub}->( @_ ) ) )
            if (caller())[0] eq "CGI";
        # Encode incoming arguments so as not to confuse CGI:
        my @args = @_;
        foreach ( @args ) {
            utf8::encode ( $_ ) if ! ref ( $_ ) and utf8::is_utf8 ( $_ );
            last if $sub eq "param";
        }
        # Decode results to be returned:
        my @result = map { ref ( $_ ) ? $_ : do { utf8::decode ( $_ ); $_; } }
            wantarray ? ( $originals{$sub}->( @args ) ) : scalar ( $originals{$sub}->( @args ) );
        # Special exception for Vars(), which may return a hashref:
        if ( ! wantarray and ref ( $result[0] ) eq "HASH" ) {
            my %result = ();
            foreach ( keys ( %{$result[0]} ) ) {
                my $key = $_;
                utf8::decode ( $_ );
                $result{$_} = $result[0]->{$key};
                utf8::decode ( $result{$_} );
            }
            $result[0] = \%result;
        }
        return ( wantarray ? @result : $result[0] );
    };
}
- There are currently many bugs in Perl and its modules with respect to UTF-8 support, so not everything goes smoothly. Perhaps one of the most significant shortcomings is that several file handling functions, including opendir() and readdir(), and features that rely on them such as globbing (<*>), File::Find, etc., do not respect the use open pragma. This means that file and directory names are not automatically decoded, and there is currently no way to change that default behaviour. If you are using a UTF-8 encoded file system with encoded file/directory names, then you will have to decode them manually:
foreach ( <*> ) { utf8::decode($_); ... }
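The same applies to readdir(); a sketch assuming a UTF-8 encoded file system:

opendir ( my $dir, "." ) or die "opendir: $!";
foreach my $name ( readdir ( $dir ) ) {
    utf8::decode ( $name );    # manually decode each file/directory name
    # ... use $name as a decoded character string
}
closedir ( $dir );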
Another example is Data::Dumper, a module often used to serialize and deserialize complex data structures: it does not handle character encodings, so they are lost unless handled manually. Consider using JSON for this purpose instead.
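For example (a sketch using the JSON module's default exports), character data survives the round trip because encode_json() and decode_json() work with UTF-8 encoded byte strings:

use utf8;
use JSON;

my $bytes = encode_json ( { name => "Ω" } );   # UTF-8 encoded JSON text (bytes)
my $data  = decode_json ( $bytes );            # $data->{name} is a decoded "Ω" again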