utf8::decode() may actually unset the UTF-8 flag

According to the documentation, utf8::decode() should turn on the UTF-8 flag for a string that contains multi-byte characters.  However, there are circumstances under which utf8::decode() will not only fail to set the flag, but will actually turn it off for such a string.

The following script contains 4 test cases to demonstrate this:

#!/usr/bin/perl

use strict;
use warnings;
use utf8;
use open ( ":encoding(UTF-8)", ":std" );
use Encode;

# "\xC3\xA9" is the two-byte UTF-8 encoding of é; "é" (under "use utf8")
# is the single character U+00E9.
foreach my $orig ( "\xC3\xA9", "é" )
{
  my $char = $orig;   # copy: literals are read-only, and utf8::decode() modifies in place
  print "$char before utf8::decode(): utf8=".(utf8::is_utf8($char)||0)."\n";
  utf8::decode ( $char );
  print "$char after utf8::decode(): utf8=".(utf8::is_utf8($char)||0)."\n";
}

foreach my $char ( "\xC3\xA9", "é" )
{
  print "$char before Encode::decode(): utf8=".(utf8::is_utf8($char)||0)."\n";
  # Encode::decode() returns a new string instead of modifying in place
  my $decoded = Encode::decode ( 'UTF-8', $char );
  print "$decoded after Encode::decode(): utf8=".(utf8::is_utf8($decoded)||0)."\n";
}

Results:

é before utf8::decode(): utf8=0
é after utf8::decode(): utf8=1
é before utf8::decode(): utf8=1
é after utf8::decode(): utf8=0
é before Encode::decode(): utf8=0
é after Encode::decode(): utf8=1
é before Encode::decode(): utf8=1
� after Encode::decode(): utf8=1

The 1st test case gives utf8::decode() a string that is not flagged as UTF-8 (utf8=0), and utf8::decode() turns on the flag for this string, as expected (utf8=1).

The 2nd test case gives utf8::decode() a string that is automatically flagged as UTF-8 (utf8=1), because "use utf8" makes the literal é a decoded character string, and utf8::decode() unexpectedly turns off the flag for this string (utf8=0)!
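
You can watch Perl's internal representation change with the core module Devel::Peek (a quick sketch; the comments paraphrase the relevant parts of its output, which goes to STDERR):

#!/usr/bin/perl

use strict;
use warnings;
use utf8;
use Devel::Peek;

my $char = "é";
Dump ( $char );        # FLAGS includes UTF8; PV holds the two bytes 0xC3 0xA9
utf8::decode ( $char );
Dump ( $char );        # the UTF8 flag is gone; PV holds the single byte 0xE9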

For comparison, the 3rd and 4th test cases run the same two inputs through Encode::decode().  Given the unflagged byte string, it behaves as expected, but given the already-decoded character string, it double-decodes it, producing garbage (though leaving the UTF-8 flag set).  This is one reason why I prefer utf8::decode().
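
Where does the garbage come from?  Encode::decode() treats the already-decoded character é (U+00E9) as the lone octet 0xE9, which is not valid UTF-8, so its default fallback substitutes U+FFFD (�).  A quick sketch (the variable names are mine) showing that passing Encode::FB_CROAK turns the silent substitution into a fatal error:

#!/usr/bin/perl

use strict;
use warnings;
use Encode;

my $char = "\x{E9}";                          # the character é, already decoded

my $bad = Encode::decode ( 'UTF-8', $char );  # silently yields U+FFFD (�)

my $copy = $char;                             # copy: FB_CROAK modifies its argument
my $ok = eval { Encode::decode ( 'UTF-8', $copy, Encode::FB_CROAK ) };
print defined $ok ? "decoded OK\n" : "caught: $@";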

At first I thought this was a bug in utf8::decode() – I mean, why would it ever turn off the UTF-8 flag, when its purpose suggests that it should only ever turn it on?  However, its real job is to decode the string into Perl's internal representation, which is UTF-8 only for strings that cannot be encoded with single bytes.  For strings that can be, Perl may or may not use a UTF-8 encoding internally, and the documentation for utf8::decode() says as much: the flag is turned on only if the string contains multi-byte characters.  So while utf8::encode() always unsets the flag (utf8=0), utf8::decode() may set it, unset it, or leave it as-is (utf8=?).
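
A quick way to see all of this at once (a minimal sketch; utf8::upgrade() forces the internal UTF-8 representation):

#!/usr/bin/perl

use strict;
use warnings;

# Force a pure-ASCII string into the internal UTF-8 representation:
my $str = "abc";
utf8::upgrade ( $str );
print utf8::is_utf8 ( $str ) || 0, "\n";   # 1

# Every character fits in a single byte, so utf8::decode() is free to
# drop the internal UTF-8 representation again:
utf8::decode ( $str );
print utf8::is_utf8 ( $str ) || 0, "\n";   # 0

# utf8::encode(), by contrast, always leaves a byte string:
my $chr = "\x{E9}";                        # the character é
utf8::encode ( $chr );                     # now the two bytes 0xC3 0xA9
print utf8::is_utf8 ( $chr ) || 0, "\n";   # 0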

If you (like me) are using a module such as CGI that depends on Perl's internal representation (i.e., it uses the utf8::is_utf8() function), then this can cause unexpected results.  For example, CGI::escape() produces different results depending on Perl's seemingly arbitrary choice of internal representation.  Hence this code:

#!/usr/bin/perl

use strict;
use warnings;
use utf8;
use open ( ":encoding(UTF-8)", ":std" );
use CGI;

foreach my $orig ( "\xC3\xA9", "é" )
{
  my $char = $orig;   # copy: literals are read-only, and utf8::decode() modifies in place
  utf8::decode ( $char );
  print "$char=" . CGI::escape ( $char ) . "\n";
}

Produces the following results:

é=%C3%A9
é=%E9

The fix is to avoid CGI::escape(), which is buggy in this regard, and use URI::Escape::uri_escape_utf8() instead, since it operates on characters rather than on the internal representation.  Alternatively, use utf8::upgrade() or Encode::decode() to ensure that the UTF-8 flag is set before calling CGI::escape().
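
For example, reworking the script above along those lines (a sketch, assuming URI::Escape is installed) prints %C3%A9 for both inputs:

#!/usr/bin/perl

use strict;
use warnings;
use utf8;
use open ( ":encoding(UTF-8)", ":std" );
use CGI;
use URI::Escape;

foreach my $orig ( "\xC3\xA9", "é" )
{
  my $char = $orig;
  utf8::decode ( $char );

  # uri_escape_utf8() works on characters, so the internal
  # representation no longer matters:
  print "$char=" . URI::Escape::uri_escape_utf8 ( $char ) . "\n";

  # Alternatively, normalize the representation before CGI::escape():
  utf8::upgrade ( $char );
  print "$char=" . CGI::escape ( $char ) . "\n";
}

With the representation normalized, both inputs escape identically.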
