Mastering Character Sets in Linux (Weird Characters, part 2)
gucharmap and recode

Akkana Peck
Wednesday, November 25, 2009 09:55:05 AM
In the last article I talked about Unicode, character sets and encoding --
how accented and special characters are transferred in email and web pages,
and why you sometimes see funny characters when the process goes wrong.
But can you fix it when it does go wrong?
And if you're a programmer, how should you be handling all these encodings?
gucharmap
First, when you're testing anything involving character encoding,
gucharmap is invaluable (Figure 1).

figure 1
Every Unicode character is in some category, shown in the list on
the left -- in addition to Basic Latin, Latin-1 Supplement
(accented characters), Greek, Cyrillic, Katakana etc. there are categories
for Braille, Cuneiform, punctuation, mathematics, music and so forth.
The Character Details tab tells you the character's Unicode code point
and its UTF-8, UTF-16 and XML/HTML representations.
If you have a character from a web page or email and don't know what it is,
just paste it into gucharmap's Search->Find field (Figure 2).

figure 2
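If you want the same kind of details from a script, Python's standard library can produce them. The following is a minimal sketch, not a description of how gucharmap works internally; the function name describe and the output layout are invented for this example.

```python
import unicodedata

def describe(ch):
    """Return details for one character, similar in spirit to what
    gucharmap's Character Details tab shows. (Hypothetical helper.)"""
    utf16 = ch.encode("utf-16-be")
    # Group the big-endian UTF-16 bytes into 16-bit code units.
    units = [int.from_bytes(utf16[i:i + 2], "big")
             for i in range(0, len(utf16), 2)]
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "code point": "U+%04X" % ord(ch),
        "UTF-8": " ".join("0x%02X" % b for b in ch.encode("utf-8")),
        "UTF-16": " ".join("0x%04X" % u for u in units),
        "XML": "&#%d;" % ord(ch),
    }

# Example: the accented a used in the test string later in this article.
for key, value in describe("\u00e1").items():
    print("%-10s %s" % (key, value))
```

Characters outside the Basic Multilingual Plane come out as two UTF-16 code units (a surrogate pair), which is why the sketch groups bytes instead of assuming one unit per character.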
recode
You can fix some encoding problems with a simple command-line tool,
recode, which is probably available as a package for your Linux distribution.
To experiment with recode, you'll need some test data.
You can make a file containing UTF-8 by pasting something from Firefox,
which usually pastes UTF-8
even if you copy from a page with another encoding, like this one.
$ cat >voila-utf8
"Voilá!" <-- paste this string
^D <-- Type Ctrl-D on a new line
$ cat >curly-utf8
“Curly quotes” <-- paste this string
^D <-- Type Ctrl-D on a new line
Once you have test data, run recode like this:
$ recode utf8..iso8859-15 <voila-utf8 >voila-8859
$
Now voila-8859 should contain an ISO-8859-15 (Latin-9, a close relative of
Latin-1) version of the original UTF-8 string. Of course, you can go the
other way too:
$ recode iso8859-15..utf8 <voila-8859 >voila2-utf8
$ diff voila-utf8 voila2-utf8
$ <-- no differences
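The same round trip can be sketched in Python, which may help if recode isn't installed. This is only an illustration of the idea using Python's own codecs; it is not how recode is implemented, and recode_bytes is a hypothetical helper name.

```python
def recode_bytes(data, src, dst):
    """Decode bytes from the src encoding and re-encode them as dst."""
    return data.decode(src).encode(dst)

# The pasted test string, as the UTF-8 bytes cat would have written.
voila_utf8 = '"Voil\u00e1!"\n'.encode("utf-8")

voila_8859 = recode_bytes(voila_utf8, "utf-8", "iso-8859-15")
voila2_utf8 = recode_bytes(voila_8859, "iso-8859-15", "utf-8")

print(len(voila_utf8), len(voila_8859))  # the Latin-9 copy is one byte shorter
print(voila_utf8 == voila2_utf8)         # True: the round trip is lossless here
```

The round trip is lossless only because every character in the test string exists in ISO-8859-15.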
You can examine the files and compare them with a binary dump program
like od, then use gucharmap to verify which characters are which.
I find od output a bit hard to read, so I wrote a Python equivalent,
bdump.
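For readers who want to roll their own, here is a minimal hex dump sketch in Python. It is not the bdump script mentioned above, whose code isn't shown here; the function name hexdump and the column layout are assumptions made for this example.

```python
def hexdump(data, width=8):
    """Return a list of lines showing offset, hex bytes and printable ASCII."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hexpart = " ".join("%02x" % b for b in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        # Pad the hex column so the text column lines up on short rows.
        lines.append("%04x  %-*s  %s" % (offset, width * 3 - 1, hexpart, text))
    return lines

# Compare the UTF-8 and ISO-8859-15 encodings of the same string.
for encoding in ("utf-8", "iso-8859-15"):
    print(encoding)
    for line in hexdump("Voil\u00e1!\n".encode(encoding)):
        print("  " + line)
```

In the UTF-8 dump the accented character shows up as the two bytes c3 a1, while the ISO-8859-15 dump has the single byte e1 in the same position.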
recode can even map curly quotes ("smartquotes") to regular quotes,
if its output format is one that doesn't include curly quotes, like
ASCII or ISO8859-15:
$ recode utf8..iso8859-15 <curly-utf8
"Curly quotes"
recode translates those curly quotes to straight ASCII quotes.
Very useful! Of course, that means that the translation has lost
information -- you can't go back to the original UTF-8. To prevent that,
use recode's -s (strict) option.
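To see why the quote conversion is lossy, here is a small Python sketch of the same trade-off. It is an illustration only and does not reproduce recode's mapping tables or the exact behavior of its -s option; QUOTE_MAP, to_latin9 and the strict parameter are names invented for this example.

```python
# Curly quotes have no slot in ISO-8859-15, so map them to ASCII look-alikes.
QUOTE_MAP = str.maketrans({
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
})

def to_latin9(text, strict=False):
    """Encode text as ISO-8859-15.

    With strict=False, curly quotes are first replaced by straight quotes
    (a lossy step). With strict=True, no substitution is made and any
    unencodable character raises UnicodeEncodeError.
    """
    if not strict:
        text = text.translate(QUOTE_MAP)
    return text.encode("iso-8859-15")

curly = "\u201cCurly quotes\u201d"
print(to_latin9(curly))  # b'"Curly quotes"'
try:
    to_latin9(curly, strict=True)
except UnicodeEncodeError as err:
    print("strict mode refused:", err.reason)
```

Once both the left and right curly quotes have collapsed into the same straight quote, nothing in the output says which was which, so the original can't be reconstructed.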