Assumptions

  1. Generally one would like to use UTF-8 and forget about the details.
  2. XML::LibXML and XML::LibXSLT is the sane way to munge XML in Perl.
  3. MySQL will be involved somewhere. The cool kids probably use postgreSQL but still!

Tricks

The database

MySQL is fairly happy with UTF-8 if you set the relevant options. It seems to be enough to set

character-set-server = utf8

DBI

You need to tell DBD::mysql to use UTF-8. It's easiest to do this at connect time:

$dbh = DBI->connect(..., { RaiseError => 1, mysql_enable_utf8 => 1, });

XML::LibXML

Sadly XML::LibXML's serialization code will sometimes encode non-ASCII characters as numeric entities and not UTF-8. For example, you might see pâte sucrée instead of pâte sucrée. Obviously this is valid XML but it plays havoc with e.g. MySQL's sorting!

Most perniciously this transformation can be invisible if you look at the data in web browser, and of course it depends on both the data being serialized and (apparently) the version of the underlying libxml2.

The solution is, of course, easy:

my $txt = $node->toString;

$txt =~ s/&#x([0-9a-f]+);/chr(hex($1))/gie;
$txt =~ s/&#([0-9]+);/chr $1/ge;

Handy debugging hints

I think this stuff can be quite a pain to debug. There seem to be ample scope for confusion because transformations sometimes happen automatically or invisibly. Ultimately some sort of hex dump (or the like) gives us unambiguous data.

It also seems to be handy to memorize the byte sequences you might see. fileformat.info1 has good Unicode pages.

For example, consider é2, which has codepoint 0xe9. I managed to create these sequences:

Look in the mysql .MYD file

For applications which load data into a database, then display it, the mysql data files are a convenient place to bisect the problem.

If you use MySQL's default MyISAM storage engine, then the data in a table are stored in a .MYD file called $datadir/$database/$table.MYD. It's a binary file, but you can always dump it. On MacOS X:

$ sudo od -c /usr/local/mysql/data/foo/bar.MYD
...
00024000    65  74  20  49  6e  74  65  72  6e  c3  a9  65  73  20  64  65
                                                ^^^^^^

From which we can probably conclude that the data are being correctly inserted into the database.