Assumptions
- Generally one would like to use UTF-8 and forget about the details.
- XML::LibXML and XML::LibXSLT is the sane way to munge XML in Perl.
- MySQL will be involved somewhere. The cool kids probably use postgreSQL but still!
Tricks
The database
MySQL is fairly happy with UTF-8 if you set the relevant options. It seems to be enough to set
character-set-server = utf8
DBI
You need to tell DBD::mysql to use UTF-8. It's easiest to do this at connect time:
$dbh = DBI->connect(..., { RaiseError => 1, mysql_enable_utf8 => 1, });
XML::LibXML
Sadly XML::LibXML's serialization code will sometimes encode non-ASCII characters as numeric entities and not UTF-8. For example, you might see pâte sucrée instead of pâte sucrée. Obviously this is valid XML but it plays havoc with e.g. MySQL's sorting!
Most perniciously this transformation can be invisible if you look at the data in web browser, and of course it depends on both the data being serialized and (apparently) the version of the underlying libxml2.
The solution is, of course, easy:
my $txt = $node->toString;
$txt =~ s/&#x([0-9a-f]+);/chr(hex($1))/gie;
$txt =~ s/&#([0-9]+);/chr $1/ge;
Handy debugging hints
I think this stuff can be quite a pain to debug. There seem to be ample scope for confusion because transformations sometimes happen automatically or invisibly. Ultimately some sort of hex dump (or the like) gives us unambiguous data.
It also seems to be handy to memorize the byte sequences you might see. fileformat.info1 has good Unicode pages.
For example, consider é2, which has codepoint 0xe9. I managed to create these sequences:
- The correct UTF-8 encoding: 0xc3, 0xa9.
- A broken 'double' UTF-8 encoding: 0xc3, 0x83, 0xc2, 0xa9. To get this I managed to run the data through the UTF-8 encoder twice.
Look in the mysql .MYD file
For applications which load data into a database, then display it, the mysql data files are a convenient place to bisect the problem.
If you use MySQL's default MyISAM storage engine, then the data in a table are stored in a .MYD file called $datadir/$database/$table.MYD. It's a binary file, but you can always dump it. On MacOS X:
$ sudo od -c /usr/local/mysql/data/foo/bar.MYD
...
00024000 65 74 20 49 6e 74 65 72 6e c3 a9 65 73 20 64 65
^^^^^^
From which we can probably conclude that the data are being correctly inserted into the database.
References
- 1. http://www.fileformat.info/info/unicode/
- 2. http://www.fileformat.info/info/unicode/char/00e9/index.htm