According to the documentation, the utf8::decode() function should turn on the UTF-8 flag for a string that contains multi-byte characters. However, there are apparently circumstances under which utf8::decode() will not only fail to set the flag, but will actually turn it off for a string that contains multi-byte characters.
The following script contains 4 test cases to demonstrate this:
```perl
#!/usr/bin/perl

use utf8;
use open ( ":encoding(UTF-8)", ":std" );
use Encode;

foreach my $char ( "\xC3\xA9", "é" ) {
    print "$char before utf8::decode(): utf8=" . ( utf8::is_utf8 ( $char ) || 0 ) . "\n";
    utf8::decode ( $char );
    print "$char after utf8::decode(): utf8=" . ( utf8::is_utf8 ( $char ) || 0 ) . "\n";
}

foreach my $char ( "\xC3\xA9", "é" ) {
    print "$char before Encode::decode(): utf8=" . ( utf8::is_utf8 ( $char ) || 0 ) . "\n";
    my $char = Encode::decode ( 'UTF-8', $char );
    print "$char after Encode::decode(): utf8=" . ( utf8::is_utf8 ( $char ) || 0 ) . "\n";
}
```
Results:
```
é before utf8::decode(): utf8=0
é after utf8::decode(): utf8=1
é before utf8::decode(): utf8=1
é after utf8::decode(): utf8=0
é before Encode::decode(): utf8=0
é after Encode::decode(): utf8=1
é before Encode::decode(): utf8=1
� after Encode::decode(): utf8=1
```
The 1st test case gives utf8::decode() a string that is not flagged as UTF-8 (utf8=0), and utf8::decode(), as expected, turns on the flag for this string (utf8=1).

The 2nd test case gives utf8::decode() a string that is automatically flagged as UTF-8 (utf8=1, because the literal appears under use utf8), and utf8::decode() unexpectedly turns off the flag for this string (utf8=0)!
For comparison, the Encode::decode() function, given the same input in the 4th test case, double-decodes the string, producing garbage (though it does leave the UTF-8 flag set). This is one reason why I prefer utf8::decode().
At first I thought this was a bug in utf8::decode(): why would it ever turn off the UTF-8 flag, when its name suggests that it should only ever turn it on? However, its real purpose is to decode the string into Perl's internal representation, which is UTF-8 for strings that cannot be represented in single bytes. For strings that can be, Perl may or may not use a UTF-8 encoding internally. So while utf8::encode() always unsets the flag (utf8=0), utf8::decode() may set it, unset it, or leave it as is (utf8=?).
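To illustrate, here is a minimal sketch using only the core utf8 functions (variable names are mine): the flag after utf8::decode() depends on the content, while utf8::encode() always clears it.

```perl
#!/usr/bin/perl

use strict;
use warnings;

my $ascii = "abc";          # valid UTF-8, but single-byte representable
my $multi = "\xC3\xA9";     # the UTF-8 octets for "é"

utf8::decode ( $ascii );    # all single-byte characters: flag stays off
utf8::decode ( $multi );    # multi-byte character: flag turns on
print "ascii: utf8=" . ( utf8::is_utf8 ( $ascii ) || 0 ) . "\n";    # utf8=0
print "multi: utf8=" . ( utf8::is_utf8 ( $multi ) || 0 ) . "\n";    # utf8=1

utf8::encode ( $multi );    # back to octets: flag always ends up off
print "multi: utf8=" . ( utf8::is_utf8 ( $multi ) || 0 ) . "\n";    # utf8=0
```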
If you (like me) are using a module such as CGI that depends on Perl's internal representation (i.e., it calls utf8::is_utf8()), this can cause unexpected results. For example, CGI::escape() produces different output depending on Perl's seemingly arbitrary choice of internal representation. Hence this code:
```perl
#!/usr/bin/perl

use utf8;
use open ( ":encoding(UTF-8)", ":std" );
use CGI;

foreach my $char ( "\xC3\xA9", "é" ) {
    utf8::decode ( $char );
    print "$char=" . CGI::escape ( $char ) . "\n";
}
```
Produces the following results:
```
é=%C3%A9
é=%E9
```
The fix is to avoid CGI::escape(), which is buggy in this regard; use URI::Escape::uri_escape_utf8() instead. Alternatively, use utf8::upgrade() or Encode::decode() to ensure that the UTF-8 flag is set when desired.
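For instance, normalizing with utf8::upgrade() before escaping makes both inputs behave identically. This is a sketch using only core functions; the copy into a lexical is mine, to avoid modifying the read-only loop literals in place.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use utf8;
use open ( ":encoding(UTF-8)", ":std" );

foreach my $literal ( "\xC3\xA9", "é" ) {
    my $char = $literal;        # copy: the loop variable aliases read-only literals
    utf8::decode ( $char );
    utf8::upgrade ( $char );    # force the internal UTF-8 representation
    print "$char: utf8=" . ( utf8::is_utf8 ( $char ) || 0 ) . "\n";    # utf8=1 both times
}
```

With the flag consistently on, CGI::escape() would produce %C3%A9 for both inputs.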