The UTF-8 (Unicode) character encoding system is a well-supported alternative to the older ISO-8859-1 (Latin-1) system, and it can make it easier to work with special characters and multiple languages. Many developers can exercise sufficient control over their systems to ensure that:
- All Perl source code is encoded in UTF-8
- All text input files and streams are encoded in UTF-8
- Interactions with browsers are encoded in UTF-8
- Interactions with all other interfaces are UTF-8 based
- All other text data is encoded in UTF-8 by default
If, like me, you have decided to standardize your code on UTF-8 and, like me, are not particularly concerned with exceptions such as alternate encodings, then it’s useful (and reasonably safe) to set up scripts to default to UTF-8 character encoding for all text.
Perl versions >5.8 automatically tag text as UTF-8 whenever the interpreter knows for certain that it is UTF-8 encoded. Examples include chr() called with a code point above 255, a string constant using the \x{code-point} syntax, data read from files opened with a :utf8 encoding layer, and any data explicitly tagged as UTF-8 using (for example) utf8::decode().
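For illustration, here is a minimal sketch of the automatic tagging (the variable names are mine, and utf8::is_utf8() is used purely for inspection):

my $wide  = chr ( 0x263A );                 # code point above 255: automatically tagged as UTF-8
my $bytes = "\xE2\x98\xBA";                 # the same character as raw UTF-8 bytes: not tagged
utf8::decode ( $bytes );                    # explicitly tag the byte string as UTF-8
print utf8::is_utf8 ( $wide ) + 0, "\n";    # 1
print utf8::is_utf8 ( $bytes ) + 0, "\n";   # 1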
Unfortunately, however, for reasons of backwards compatibility with older code, there are many cases where UTF-8 encoded text is not automatically tagged as UTF-8. Examples include UTF-8 encoded string constants in Perl source code, filehandles that are not explicitly given a UTF-8 layer, environment variables, command-line arguments, data fetched from a database, and CGI input, including unescaped URIs, cookies, parameters, and POSTed data.
Strings that contain UTF-8 encoded characters but are not tagged as UTF-8 are treated as Latin-1 text (equivalent to binary), and may produce different or unexpected results when used in functions such as length() and sort(), and in regular expressions. Additionally, mixing tagged UTF-8 data with untagged UTF-8 data may lead to double-encoding. One solution is to manually tag all text strings as UTF-8. However, assuming as above that the system only ever interacts with UTF-8 encoded text, it is desirable to set that as the default in as many cases as is safe to do so.
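As a quick illustration of how an untagged string misbehaves (again a minimal sketch, with variable names of my own choosing):

my $raw = "\xC3\xA9";             # the two raw bytes of 'é' in UTF-8, not tagged
my $dec = $raw;
utf8::decode ( $dec );            # the same text, tagged as UTF-8
print length ( $raw ), "\n";      # 2 - counted as two Latin-1 characters
print length ( $dec ), "\n";      # 1 - counted as one character, as intended
my $mixed = $raw . $dec;          # the untagged bytes are upgraded as Latin-1: "\x{C3}\x{A9}\x{E9}" ("Ã©é"), a double-encoding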
The following script serves as a useful test case for demonstrating Perl’s default UTF-8 support, and the additional features that can be turned on to provide further UTF-8 defaults:
#!/usr/bin/perl

use CGI;
no bytes;    # loads bytes.pm so that bytes::length() is available, without enabling the pragma

open ( my $in, "utf8.txt" );
CGI::param ( "UTF8TEST" => '٭' );

my @tests = ( "\x{3a}",                                # Not UTF-8
              "a",                                     # Not UTF-8
              "\xE2",                                  # Automatic
              chr ( 0x00B2 ),                          # Automatic
              "\x{00DA}",                              # Automatic
              chr ( 140 ),                             # Automatic
              chr ( 0x05d0 ),                          # Automatic
              "\x{263a}",                              # Automatic
              'x',                                     # use utf8
              'é',                                     # use utf8
              '□',                                     # use utf8
              $ARGV[0],                                # export PERL5OPT=-CA
              $ENV{UTF8TEST},                          # untaint()
              CGI::unescape ( CGI::escape ( 'س' ) ),   # CGI override
              CGI::param ( 'UTF8TEST' ),               # CGI override
              <$in>,                                   # use open
              `echo -n €`,                             # use open
              <STDIN>,                                 # use open
              substr ( <DATA>, 0, -1 ),                # use utf8
              "\xD8" . 'Ø',                            # use utf8
              join ( "", a => "\xE2" ),                # use utf8
              join ( "", "a" => "\xE2" ) );            # Not UTF-8
close ( $in );

my $i = 0;
foreach my $test ( @tests )
{
    print sprintf ( "%.2d: ", ++$i )
        . join ( "; ", "char=$test",
                 "utf8=" . ( utf8::is_utf8 ( $test ) + 0 ),
                 "length=" . length ( $test ),
                 "bytes=" . bytes::length ( $test ),
                 "alpha=" . ( $test =~ /\w/ || 0 ),
                 "ord=" . join ( ",", unpack ( "C*", $test ) ) ) . "\n";
}

__DATA__
↔
The script is invoked as follows:
>echo -n Ɣ | env UTF8TEST=⁒ utf8 Ω
And here is the result:
01: char=:; utf8=0; length=1; bytes=1; alpha=0; ord=58
02: char=a; utf8=0; length=1; bytes=1; alpha=1; ord=97
03: char=�; utf8=0; length=1; bytes=1; alpha=0; ord=226
04: char=�; utf8=0; length=1; bytes=1; alpha=0; ord=178
05: char=�; utf8=0; length=1; bytes=1; alpha=0; ord=218
06: char=�; utf8=0; length=1; bytes=1; alpha=0; ord=140
Wide character in print at ./utf8 line 37, <DATA> line 1.
07: char=א; utf8=1; length=1; bytes=2; alpha=1; ord=1488
Wide character in print at ./utf8 line 37, <DATA> line 1.
08: char=☺; utf8=1; length=1; bytes=3; alpha=0; ord=9786
09: char=x; utf8=0; length=1; bytes=1; alpha=1; ord=120
10: char=é; utf8=0; length=2; bytes=2; alpha=0; ord=195,169
11: char=□; utf8=0; length=3; bytes=3; alpha=0; ord=226,150,161
12: char=Ω; utf8=0; length=2; bytes=2; alpha=0; ord=206,169
13: char=⁒; utf8=0; length=3; bytes=3; alpha=0; ord=226,129,146
14: char=س; utf8=0; length=2; bytes=2; alpha=0; ord=216,179
15: char=٭; utf8=0; length=2; bytes=2; alpha=0; ord=217,173
16: char=Ʃ; utf8=0; length=2; bytes=2; alpha=0; ord=198,169
17: char=€; utf8=0; length=3; bytes=3; alpha=0; ord=226,130,172
18: char=Ɣ; utf8=0; length=2; bytes=2; alpha=0; ord=198,148
19: char=↔; utf8=0; length=3; bytes=3; alpha=0; ord=226,134,148
20: char=�Ø; utf8=0; length=3; bytes=3; alpha=0; ord=216,195,152
21: char=a�; utf8=0; length=2; bytes=2; alpha=1; ord=97,226
22: char=a�; utf8=0; length=2; bytes=2; alpha=1; ord=97,226
Although the characters do (for the most part) display correctly, this is only because the viewer (a shell terminal in this case) happens to interpret the byte sequences as UTF-8. Most of these test cases are not automatically tagged as UTF-8 (utf8=0), and so mixing them with text that is tagged may lead to the issues noted above. Also note that the two cases that are tagged as UTF-8 (utf8=1) produce a “Wide character in print” warning, which indicates the same problem: they too display correctly only because of the viewer.
The premise here is that in a controlled environment that is standardized on UTF-8, all of the above text can be assumed by default to be UTF-8 encoded, unless Perl is explicitly told otherwise. How do we do this?
1. Tell Perl that all code, including string constants, is UTF-8 encoded by adding:
use utf8;
2. Tell Perl that all input and output files, handles, and streams are UTF-8 encoded by adding:
use open ( ":encoding(UTF-8)", ":std" );
3. Tag all environment variables as UTF-8 in a loop at the beginning of the script. If you are running in taint mode (as you should be), then you can combine this with untainting, e.g.:
sub untaint
{
    foreach ( keys ( %ENV ) )
    {
        $ENV{$_} = $1 if $ENV{$_} =~ /^([^;{}"`|&<>\n\r]*)$/;
        utf8::decode ( $ENV{$_} );
    }
}
4. Tell Perl that all command-line arguments are UTF-8 encoded using:
>export PERL5OPT=-CA
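Note that PERL5OPT may not be honoured in every environment (for instance, it is ignored when taint checks are enabled), in which case the command-line arguments can be tagged in the script itself:

utf8::decode ( $_ ) foreach @ARGV;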
5. Tell the browser to communicate using UTF-8 encoded text by adding “charset=UTF-8” to the Content-Type header. This can be done for example in the Apache config file using:
AddDefaultCharset UTF-8
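If the server configuration cannot be changed, the charset can also be sent from the script itself; a minimal sketch using CGI.pm’s header():

print CGI::header ( -type => "text/html", -charset => "UTF-8" );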
6. Decode CGI params using:
use CGI ( "-utf8" );
7. Connect to MySQL using a UTF-8 encoded interface:
my $connection = DBI->connect ( "dbi:mysql:$database", $username, $password, { mysql_enable_utf8 => 1 } );
8. Remember to set binmode() on non-text (binary) filehandles, and use the Encode module where necessary to make exceptions.
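A minimal sketch of such exceptions (the file names and the legacy encoding are purely illustrative):

use Encode qw ( decode encode );
open ( my $img, "<", "photo.jpg" ) or die "photo.jpg: $!";
binmode ( $img );                              # raw bytes, overriding the use open default
open ( my $legacy, "<:encoding(ISO-8859-1)", "legacy.txt" ) or die "legacy.txt: $!";
my $text  = decode ( "ISO-8859-1", "\xE9" );   # a raw Latin-1 byte becomes a tagged 'é'
my $bytes = encode ( "UTF-8", $text );         # and back to raw UTF-8 bytes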
9. Since this list only covers Perl, other interfaces are not addressed here, but don’t forget about: The shell’s LANG environment variable, your terminal program, your editor, email you send out via Perl, etc.
The goal of all this is to automatically tag strings as UTF-8 encoded by default. With all these in place, the test script produces the following output:
01: char=:; utf8=0; length=1; bytes=1; alpha=0; ord=58
02: char=a; utf8=0; length=1; bytes=1; alpha=1; ord=97
03: char=â; utf8=0; length=1; bytes=1; alpha=0; ord=226
04: char=²; utf8=0; length=1; bytes=1; alpha=0; ord=178
05: char=Ú; utf8=0; length=1; bytes=1; alpha=0; ord=218
06: char=; utf8=0; length=1; bytes=1; alpha=0; ord=140
07: char=א; utf8=1; length=1; bytes=2; alpha=1; ord=1488
08: char=☺; utf8=1; length=1; bytes=3; alpha=0; ord=9786
09: char=x; utf8=1; length=1; bytes=1; alpha=1; ord=120
10: char=é; utf8=1; length=1; bytes=2; alpha=1; ord=233
11: char=□; utf8=1; length=1; bytes=3; alpha=0; ord=9633
12: char=Ω; utf8=1; length=1; bytes=2; alpha=1; ord=937
13: char=⁒; utf8=1; length=1; bytes=3; alpha=0; ord=8274
14: char=س; utf8=1; length=1; bytes=2; alpha=1; ord=1587
15: char=٭; utf8=1; length=1; bytes=2; alpha=0; ord=1645
16: char=Ʃ; utf8=1; length=1; bytes=2; alpha=1; ord=425
17: char=€; utf8=1; length=1; bytes=3; alpha=0; ord=8364
18: char=Ɣ; utf8=1; length=1; bytes=2; alpha=1; ord=404
19: char=↔; utf8=1; length=1; bytes=3; alpha=0; ord=8596
20: char=ØØ; utf8=1; length=2; bytes=4; alpha=1; ord=216,216
21: char=aâ; utf8=1; length=2; bytes=3; alpha=1; ord=97,226
22: char=aâ; utf8=0; length=2; bytes=2; alpha=1; ord=97,226
Now most of the test cases are tagged as UTF-8, with the exception of the non-UTF-8 cases, which are at least displaying correctly now.
Notes:
- The term “tagged as UTF-8” refers to the “UTF-8 flag” – Perl’s internal flag that indicates that a string has a known decoding. This is checked in the test script above using utf8::is_utf8(), and is useful for debugging purposes only.
- The “use” pragmas above have a lexical scope, which means that every script needs them. In a web server (CGI) environment, this means every web page. A good content management system should be able to automate this. It may also be a good idea to use a boilerplate such as utf8::all.
- In some cases it may be desirable to ensure that correct encoding is used by die()ing, instead of just logging a warning, when it isn’t:
use warnings ( "FATAL" => "utf8" );
- Be sure to set up MySQL to use UTF-8 by adding the following to /etc/my.cnf:
[client]
default-character-set = utf8

[mysqld]
default-character-set = utf8
… and converting all existing tables to UTF-8 as well.
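For example, an existing table (mytable is a placeholder name) can be converted through the $connection handle from step 7 with a statement along these lines:

$connection->do ( "ALTER TABLE mytable CONVERT TO CHARACTER SET utf8" );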
- CGI’s -utf8 option only automatically decodes string values returned by CGI::param(). It does not decode:
- Parameter names
- Values returned by other CGI functions, such as cookie(), url_param(), unescape(), or Vars()
- File names
A more comprehensive solution is described below, which decodes parameter names and values from other CGI functions. However, file names must still be decoded manually when used – as described in the CGI documentation.
This solution overrides some existing CGI methods with UTF-8 friendly versions. It encodes all incoming arguments and decodes all outgoing results, so that CGI itself only ever deals with byte-encoded (untagged) strings. An exception is made for CGI::param(): values being set are left as UTF-8 decoded strings, so as to prevent double-encoding problems with input fields.
our %originals = ();
foreach my $sub ( "unescape", "cookie",
                  "param", "url_param", "Vars", "delete", "upload" )
{
    next if exists ( $originals{$sub} );
    no strict;
    my $s = "CGI::$sub";
    $originals{$sub} = \&$s;
    # Call once so that CGI, which compiles many of its subs on demand, defines the real sub:
    $originals{$sub}->();
    *$s = sub
    {
        # Prevent CGI from using overridden functions internally:
        return ( wantarray ? ( $originals{$sub}->( @_ ) )
                           : scalar ( $originals{$sub}->( @_ ) ) )
            if (caller())[0] eq "CGI";
        # Encode incoming arguments so as not to confuse CGI:
        my @args = @_;
        foreach ( @args )
        {
            utf8::encode ( $_ ) if ! ref ( $_ ) and utf8::is_utf8 ( $_ );
            # For param(), only the parameter name is encoded; values being set stay decoded:
            last if $sub eq "param";
        }
        # Decode results to be returned:
        my @result = map { ref ( $_ ) ? $_ : do { utf8::decode ( $_ ); $_; } }
                     wantarray ? ( $originals{$sub}->( @args ) )
                               : scalar ( $originals{$sub}->( @args ) );
        # Special exception for Vars(), which may return a hashref:
        if ( ! wantarray and ref ( $result[0] ) eq "HASH" )
        {
            my %result = ();
            foreach ( keys ( %{$result[0]} ) )
            {
                my $key = $_;
                utf8::decode ( $_ );
                $result{$_} = $result[0]->{$key};
                utf8::decode ( $result{$_} );
            }
            $result[0] = \%result;
        }
        return ( wantarray ? @result : $result[0] );
    };
}
- There are currently many bugs in Perl and in various modules with respect to UTF-8 support, so not everything goes smoothly. Perhaps one of the most significant shortcomings is that several file-handling functions, including opendir() and readdir(), and features that rely on them such as globbing (<*>), File::Find, etc., do not respect the use open pragma. This means that file and directory names are not automatically decoded, and there is currently no way to change that default behaviour. If you are using a UTF-8 encoded file system with UTF-8 encoded file and directory names, then you will have to decode them manually:
foreach ( <*> ) { utf8::decode ( $_ ); ... }

Another example is Data::Dumper, a module often used to serialize and restore complex data structures: it does not handle character encodings, so these are lost if not handled manually. Consider using JSON for this purpose instead.
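For completeness, here is a sketch of the same manual decoding when using opendir()/readdir() directly (the $dir variable is assumed to hold the directory name):

opendir ( my $dh, $dir ) or die "Cannot open $dir: $!";
my @names = readdir ( $dh );
closedir ( $dh );
# readdir() returns raw bytes, so tag them as UTF-8 before using them as text:
utf8::decode ( $_ ) foreach @names;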