LinuxPlanet
Mastering Character Sets in Linux (Weird Characters, part 2)
gucharmap and recode

Akkana Peck
Wednesday, November 25, 2009 09:55:05 AM

In the last article I talked about Unicode, character sets and encoding -- how accented and special characters are transferred in email and web pages, and why you sometimes see funny characters when the process goes wrong.

But can you fix it when it does go wrong? And if you're a programmer, how should you be handling all these encodings?

gucharmap

First, when you're testing anything involving character encoding, gucharmap is invaluable (Figure 1).

figure 1

Every Unicode character is in some category, shown in the list on the left -- in addition to Basic Latin, Latin-1 Supplement (accented characters), Greek, Cyrillic, Katakana etc. there are categories for Braille, Cuneiform, punctuation, mathematics, music and so forth.

The Character Details tab tells you the Unicode, UTF-8, UTF-16 and XML/HTML codes for the character.
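If you want the same kind of information from a script, the following is a minimal sketch in Python 3. It does not use gucharmap; the function name char_details is made up for illustration, and it only approximates what the Character Details tab displays.

```python
import unicodedata

def char_details(ch):
    """Return the kinds of codes gucharmap's Character Details tab lists."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "unicode": "U+%04X" % ord(ch),
        # Hex bytes of the UTF-8 and big-endian UTF-16 encodings.
        "utf8": " ".join("%02x" % b for b in ch.encode("utf-8")),
        "utf16": " ".join("%02x" % b for b in ch.encode("utf-16-be")),
        # Decimal numeric character reference, usable in XML and HTML.
        "xml": "&#%d;" % ord(ch),
    }

# U+00E1 is the accented "a" used in the test data later in this article.
for key, value in char_details("\u00e1").items():
    print(key, value)
```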

If you have a character from a web page or email and don't know what it is, just paste it into gucharmap's Search->Find field (Figure 2).

figure 2

recode

You can fix some encoding problems using a simple command-line tool: recode, which is probably available as a package in your Linux distribution.

To experiment with recode, you'll need some test data. You can make a file containing UTF-8 by pasting something from Firefox, which usually pastes UTF-8 even if you copy from a page with another encoding, like this one.

$ cat >voila-utf8
"Voilá!"       <-- paste this string
^D             <-- Type Ctrl-D on a new line

$ cat >curly-utf8
“Curly quotes” <-- paste this string
^D             <-- Type Ctrl-D on a new line

Once you have test data, run recode like this:

$ recode utf8..iso8859-15 <voila-utf8 >voila-8859
$ 

Now voila-8859 should contain an ISO 8859-15 (Latin-9, a close relative of Latin-1) version of the original UTF-8 string. Of course, you can go the other way too:

$ recode iso8859-15..utf8 <voila-8859 >voila2-utf8
$ diff voila-utf8 voila2-utf8
$               <-- no differences
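
The same round trip can be sketched in Python 3 without recode. This is only an illustration of the decode-then-encode idea: the helper name recode_bytes is hypothetical, and the test string is built in memory rather than read from the voila-utf8 file.

```python
def recode_bytes(data, source, target):
    """Decode bytes from one encoding and re-encode them in another."""
    return data.decode(source).encode(target)

# The same string as the voila-utf8 test file, including the trailing newline.
utf8_bytes = "\"Voil\u00e1!\"\n".encode("utf-8")

latin9_bytes = recode_bytes(utf8_bytes, "utf-8", "iso8859-15")
round_trip = recode_bytes(latin9_bytes, "iso8859-15", "utf-8")

# The accented character takes two bytes in UTF-8 but one in ISO 8859-15.
print(len(utf8_bytes), len(latin9_bytes))
# True when the round trip reproduces the original bytes, like an empty diff.
print(round_trip == utf8_bytes)
```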

You can examine the files and compare them with a binary dump program like od, then use gucharmap to verify which characters are which. I find od output a bit hard to read, so I wrote a Python equivalent, bdump.
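
For a rough idea of what such a dump involves, here is a small Python 3 sketch. It is not the bdump script mentioned above; hex_dump is a hypothetical helper that simply pairs each byte's hex value with its printable ASCII character.

```python
def hex_dump(data, width=8):
    """Return lines pairing hex byte values with printable ASCII characters."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join("%02x" % b for b in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        # Pad the hex column so the text column lines up on short rows.
        lines.append("%04x  %-*s  %s" % (offset, width * 3 - 1, hex_part, text_part))
    return lines

# Dump the UTF-8 bytes of the test string; the accented letter shows as two bytes.
for line in hex_dump("\"Voil\u00e1!\"\n".encode("utf-8")):
    print(line)
```

Non-ASCII bytes appear as dots in the text column, so the hex column is where to look when checking them against gucharmap.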

recode can even map curly quotes ("smartquotes") to regular quotes, if its output format is one that doesn't include curly quotes, like ASCII or ISO8859-15:

$ recode utf8..iso8859-15 <curly-utf8
"Curly quotes"

recode translates those curly quotes to straight ASCII quotes. Very useful! Of course, that means that the translation has lost information -- you can't go back to the original UTF-8. To prevent that, use recode's -s (strict) option.
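
The trade-off between a lossy conversion and a strict one can be illustrated in Python 3. This sketch does not call recode and does not reproduce its exact substitution rules; the STRAIGHT_QUOTES table and the to_latin9 function are assumptions made for the example. Python's encoders are strict by default, so the quote substitution has to be requested explicitly.

```python
# Hypothetical mapping from curly quotes to their straight ASCII counterparts.
STRAIGHT_QUOTES = str.maketrans({
    "\u201c": '"', "\u201d": '"',
    "\u2018": "'", "\u2019": "'",
})

def to_latin9(text, lossy=False):
    """Encode text as ISO 8859-15, optionally straightening curly quotes first."""
    if lossy:
        text = text.translate(STRAIGHT_QUOTES)
    # Python's default error handling is strict: unencodable characters raise.
    return text.encode("iso8859-15")

curly = "\u201cCurly quotes\u201d"

# Lossy: the curly quotes become straight quotes and cannot be recovered.
print(to_latin9(curly, lossy=True))

# Strict: the conversion is refused instead of silently losing information.
try:
    to_latin9(curly)
except UnicodeEncodeError as err:
    print("strict encoding refused:", err.reason)
```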
