Editor's Note: Nobody Expects the ISO-8859-1 Inquisition!
"Mr Wentworth just told me to come in here and say that there was trouble at the mill..."

Michael Hall
Tuesday, September 4, 2001 03:16:45 PM
There are times when you're not only wrong... you're wrong in
such a way that the darkness of your ignorance spreads out and
touches others. A few weeks ago, I was about this wrong, and
wrong over something Linux advocates (me included) have been so
self-righteous and unpleasant about that it brings heat to the
back of my neck as I type.
One of the unhappy parts of working over on LinuxToday is actually preparing
stories for posting on that site. I have to traverse a lot of
different corners of the web, created with a variety of tools, to
keep LinuxToday moving along. Some fairly prominent sites depend
on Microsoft tools for production, and have a hit-and-miss
attitude toward the ever-present menace of "smart quotes,"
a.k.a. character entities #145 through #148, a.k.a. "curly
quotes," a.k.a. "those quotes that look like question marks
(sometimes) under Linux (depending on the application)."
Perl super-hacker Tom Christiansen calls them
something more incisive:
"intentional errors designed to destroy the web by subverting
open standards and thus secure Microsoft's hegemony."
In particular, Mr. Christiansen is referring to how the
specification for the ISO-8859-1 character set works.
Character codes 128 through 159 aren't supposed to display anything
in particular because they're supposed to be control characters.
According to a Usenet post from Jamie Zawinski of Netscape fame,
which a bit of searching unearthed, Microsoft created a superset of
ISO-8859-1 called "ISO-8859-1-Windows-3.1-Latin-1," which helps
itself to some characters ISO-8859-1 doesn't define, including the
smart quotes, by assigning them to the range of codes above 128,
where non-displaying control characters are supposed to live. In other
words, as noted in Mr. Christiansen's commentary (and just about
every other person to ever comment on smart quotes), Microsoft
"embraced and extended" a standard. To add to the soup of
character set standards, by the way, a freshly saved HTML
document produced in Microsoft Word reveals its declared
character set to be windows-1252 (which appears to be a variant
on the Western European (1252-c) character set).
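The difference between the two character sets can be seen in a few lines of Python. This is only a sketch: it assumes Python's standard "latin-1" and "cp1252" codecs are fair stand-ins for ISO-8859-1 and windows-1252, and the sample bytes are made up for the illustration.

```python
import unicodedata

# Bytes 0x93 and 0x94 are decimal 147 and 148, two of the codes at issue.
raw = bytes([0x93]) + b"smart" + bytes([0x94])

# Read as ISO-8859-1, the two bytes map to non-displaying control characters.
as_latin1 = raw.decode("latin-1")

# Read as windows-1252, the same bytes map to curly double quotes.
as_cp1252 = raw.decode("cp1252")

print(unicodedata.category(as_latin1[0]))  # Unicode category of the first character
print(unicodedata.name(as_cp1252[0]))      # name of the opening quote
print(unicodedata.name(as_cp1252[-1]))     # name of the closing quote
```

The bytes never change; only the character set the reader assumes does, which is why the same page can look fine in one client and sprout question marks in another.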
So, ISO-8859-1-Windows-3.1-Latin-1 is ISO-8859-1-Latin-1,
with the exception of some characters Microsoft chose to store up
in the attic. The result of this bit of divergence is a
deviation from prescribed practice: Web clients are left to their
own devices when it comes to rendering these character codes, and
all of Microsoft's clients know to render them as smart quotes. Other
platforms and clients may or may not. Up until very recently,
Linux clients were reliable in their failure to render smart
quotes.
Smart quotes are considered bad manners among many people,
even people who wouldn't touch anything besides Microsoft
software. Web design pages that address the existence of the
smart quote usually include tips on how to turn off smart quotes
in assorted HTML-producing or -exporting software so as to avoid
an appearance of thoughtlessness or (worse, on the Web) blithe
ignorance of the (strict standards compliance|impoverishment) of
non-Microsoft clients.
More than bad manners, an attempt to take over the Web, or
badges of a content author's ignorance, smart quotes are, or
were, the "Microsoft detectors" of the Linux world: liberal
sprinklings of question marks throughout a document are, or were,
a dead giveaway that a Microsoft product was present somewhere in
the production pipeline. The reaction of many who start noticing
them around the web once they make the move to Linux is vaguely
akin to "Rowdy" Roddy Piper's in John Carpenter's They
Live as he dons his special alien-spotting glasses and
realizes the world is in the grip of a vast conspiracy of
skeletoid monsters. The effect is amplified when they visit a
few sites that trumpet independence from Microsoft products but
show the tell-tales of the conspiracy to destroy the web right on
their own pages.
On a Linux site, the presence of smart quotes is often cause
for severe reactions. A site like LinuxToday, which is 90%
cut-and-pasted content from all over the Web, has to be
especially careful because sites vary wildly even internally when
it comes to their use of smart quotes, and it's easy to miss a
single tell-tale question mark in the midst of three or four
paragraphs of text, especially when you spend all day reading
sites that require you subconsciously substitute the appropriate
characters.
Sadly, though, it's time to note that the days of curly
quotes and their mis-rendering on a Linux browser as an indicator
of OS purity are over, depending on the tool that produced them
and depending on your browser.
As an unhappy experience a few weeks ago indicated, a few
open source tools (notably AbiWord) now produce Unicode character
entities beyond ISO-8859-1-Latin1's range (using characters above
255, in compliance with "internationalized HTML" as specified in
the HTML 4.0 standard) to provide smart quotes in documents
exported as HTML.
An informal trial using character codes in keeping with
Microsoft's extended ISO-8859-1 character set also indicates that
some open source tools (Mozilla and Konqueror 2.2) and proprietary tools running
under Linux (Opera and Netscape 4.7) have thrown up their hands
and decided to exercise their option (as provided under the HTML
standard) to render character codes 147 and 148 the way Microsoft
intends: as smart quotes. Amaya doesn't show anything at all.
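A rough illustration of that lenient reading, assuming Python's standard html module as a stand-in for a forgiving client (it is not one of the browsers tried above, and the sample strings are invented):

```python
import html

# Numeric references using the Microsoft-style codes 147 and 148.
lenient = html.unescape("&#147;quoted&#148;")

# Numeric references using the Unicode code points for the same quotes.
unicode_refs = html.unescape("&#8220;quoted&#8221;")

# html.unescape maps 147 and 148 to curly quotes, much as the lenient
# clients do, so both strings end up holding the same characters.
print(lenient == unicode_refs)
print([hex(ord(ch)) for ch in (lenient[0], lenient[-1])])
```

A strict client would be within its rights to treat 147 and 148 as control characters and show nothing for them.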
I've even provided a screenshot of an AbiWord document (which
is compliant with the "internationalized HTML" standard) opened
in Opera, Konqueror, Netscape 4.7, Mozilla's latest nightly as of
the 30th of August, and the W3C's own testbed browser - Amaya (v
5.0). The example, by the way, uses the "right single quote"
character, Unicode #x2019. As the screenshot shows, Konqueror had problems, while
all the others did not. Konqueror's problems went away, I should
note, once I told it to consider the document's native character
set as Unicode with (UTF-8) and it promptly
picked an illegible font with which to render the page.
The long and short of it? Not all clients you can run under
Linux are "pure" anymore when it comes to dealing with some
characters. Some, in fact, have decided just to honor the
Microsoft version of the convention even if it isn't strictly
standards-compliant. And some don't seem to honor the
internationalized HTML standard, which does allow for characters
over and above the 256 allowed by ISO-8859-1-Latin1. Some will
also forgive a document mis-declaring the character set it uses
in its headers as ISO-8859-1 and go ahead and render characters
that properly belong to the Unicode (UTF-8) character set.
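Why the declared character set sometimes matters and sometimes doesn't can be sketched as follows. The sample text is invented, and Python's html module and codecs are assumed to behave like a reasonable client: a numeric reference is spelled in plain ASCII and survives either declaration, while the same character sent as raw UTF-8 bytes is mangled when read as ISO-8859-1.

```python
import html

# A page that spells the right single quote as a numeric reference.
# Every byte is plain ASCII, so the declared charset hardly matters.
page = b"It&#x2019;s legal"
via_latin1 = html.unescape(page.decode("latin-1"))
via_utf8 = html.unescape(page.decode("utf-8"))

# The same character written as raw UTF-8 bytes instead of a reference.
raw = "It\u2019s legal".encode("utf-8")
misread = raw.decode("latin-1")  # wrong charset: three junk characters
correct = raw.decode("utf-8")    # right charset: one curly quote

print(via_latin1 == via_utf8 == correct)
print(len(misread) - len(correct))  # extra characters from the misreading
```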
"Cardinal, Read the Charges."
If it seems like I put too much time into rooting out the issues
behind this whole quote mark mess, it's because character
mis-renderings provide for a special kind of tyranny familiar to
anyone who ever worked a Linux site, made the mistake of leaving
a thrown character in something he cut-and-pasted into a buffer
on some back-end, and dealt with an inbox full of flame: what I've
come to think of as the Tool Taliban, a.k.a. the Platform Purity
Squad, a.k.a. "the Same People Who Complain About Cookies Being
Intrusive But Think Nothing of Demanding That You Explain What
You're Running on Your Computer If They Think They Might Not Like
It."
These people probably deserve to be ignored... but they're so
excitable one welcomes the opportunity to poke at them through
the bars of their cages with a sharp stick, and they need to
understand that a.) the software I use to get my job done is none
of their business and b.) they may need to check the "purity"
(with regards to standards compliance) of their own tools before
embarking on jihad against someone who might actually be using a
100% standards-compliant, 100% Linux-based production pipeline.
I also spent some time looking into it because a few weeks ago
a bit of AbiWord-produced HTML caused the issue to rear its head
here on LinuxPlanet when a column written in that word processor
and exported to HTML produced the Unicode character entities for
smart quotes we've been discussing, which some browsers (Mozilla,
Netscape, Opera) handled fine and others (Konqueror) did not.
The ensuing consternation in LinuxToday's talkbacks led me to
ignorantly malign AbiWord:
"One wouldn't suspect that a prominent open source
project would cause these sorts of problems, either: I guess it's
clear now that Microsoft's approach to this issue is gaining
converts."
It doesn't appear that AbiWord actually caused any "problems"
at all, with the exception of producing HTML source that used
legal (as of the Internationalized HTML standard to be found in
HTML 4) character entities that some Linux clients don't know how
to read. In fairness, the AbiWord output declared ISO-8859-1 as
its character set, which might have confused Konqueror, but even
changing that to UTF-8 in the document headers and reloading the
document did no good.
A reader also suggested use of 'demoroniser'
to get the bugs out of the document in question, which produced
an interesting result: the venerable sanitizer of
Microsoft-mangled HTML is silent on the question of the character
entities involved, because they simply aren't "moronic." They're
legal. Legal enough, anyhow, that complaining about them is
something best left to the real purists, who think it was a
terrible mistake to ever cave in to the layout-oriented people as
much as HTML 4 did in the first place.
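The distinction can be sketched in a few lines of Python. This is not demoroniser's actual code: the function name plain_quotes and its replacement table are hypothetical, chosen only to show why Microsoft-style codes get rewritten while a Unicode reference passes through untouched.

```python
import re

# Hypothetical table: Microsoft-style codes mapped to plain ASCII quotes.
REPLACEMENTS = {145: "'", 146: "'", 147: '"', 148: '"'}

def plain_quotes(text):
    """Rewrite codes 145-148, whether raw characters or numeric references."""
    # Raw characters sitting in the control range (for example chr(147)).
    for code, ascii_quote in REPLACEMENTS.items():
        text = text.replace(chr(code), ascii_quote)

    # Decimal references such as &#147; any other reference is left alone.
    def swap(match):
        code = int(match.group(1))
        return REPLACEMENTS.get(code, match.group(0))

    return re.sub(r"&#(\d+);", swap, text)

print(plain_quotes("&#147;moronic&#148; but &#8217; is legal"))
```

The reference to 8217 survives because there is nothing to repair: it already names a legitimate Unicode character.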
What's It Mean?
In brief, I was right: some applications to be found in Linux,
including an open source app of some prominence, have capitulated
on the "smart quote" issue. Others haven't, but have their own
snags when it comes to producing some Unicode character entities,
anyhow. But I was also wrong: when a Linux-using reader
complains that they see question marks where there were supposed
to be certain quote marks, it isn't an indication that yet
another standard has fallen to the creeping insinuations of
Redmond.
The interesting question in all of this is how to confront
the issue of apparent compliance with a non-standard on the
part of open source developers and their projects: something
Mozilla has appeared to go ahead with. That's a question I'll
leave to the reader.