It's certainly not necessary to use Unicode to represent an
The problem is that MS Word has an annoying habit of replacing perfectly
appropriate apostrophes with a different character, the right-facing
single quote, aka "curly quote". That then requires non-ASCII
handling. As Word has always (IMHE, but Jukka reports differently)
used the character at 8200-something, this also requires Unicode
rather than -1252.
On Wed, 5 May 2010, Andy Dingley wrote:
Please explain what a “perfectly appropriate apostrophe” is.
MS Word (like other programs) does this only when you have the option
“smart quotes” selected.
This *is* the apostrophe (’).
0x2019 = 8217
No. U+2019 is included in code page 1252:
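For readers who want to check the numbers themselves, here is a minimal Python sketch. It assumes only the standard codecs shipped with CPython ("cp1252" and "latin-1"); the variable names are made up for illustration.

```python
# U+2019 RIGHT SINGLE QUOTATION MARK, written as an escape so the
# source file itself stays plain ASCII.
curly = "\u2019"

# The code point as a number: hexadecimal 2019 is decimal 8217.
code_point = ord(curly)
print(hex(code_point), code_point)

# Windows-1252 has a slot for this character: the single byte 0x92.
cp1252_byte = curly.encode("cp1252")
print(cp1252_byte)

# ISO-8859-1 (Latin-1), by contrast, has no slot for it at all.
try:
    curly.encode("latin-1")
    in_latin1 = True
except UnicodeEncodeError:
    in_latin1 = False
print(in_latin1)
```

So the character is available in code page 1252, but not in plain Latin-1 or ASCII.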
From the New World:
A quote. Different character to an apostrophe.
Which is the default option. More annoyingly, it's the default option
on my customer's users' copies of Word (and there are several hundred
of these people).
No, it isn't. That is (in M$oft parlance) a "straight quote" rather
than a "curly quote".
The Unicode codepoint has an equivalent character, but the octets for
its Unicode representation aren't the same as the bytes representing
it in CP-1252
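To make the "same character, different octets" point concrete, here is a small Python sketch that serializes U+2019 in three encodings. The choice of encodings is mine, purely for illustration.

```python
curly = "\u2019"  # RIGHT SINGLE QUOTATION MARK

# Serialize the same character under three different encodings.
encodings = ["cp1252", "utf-8", "utf-16-le"]
serialized = {name: curly.encode(name) for name in encodings}

for name, raw in serialized.items():
    # bytes.hex() prints the octets as hexadecimal digits
    print(f"{name:10} {raw.hex()}")

# Each byte sequence decodes back to the very same character,
# provided the reader uses the encoding the writer used.
round_trips = all(raw.decode(name) == curly
                  for name, raw in serialized.items())
print(round_trips)
```

The octets differ in every case; what stays constant is the character they decode to when the right encoding is applied.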
A user typing (or pasting) a straight apostrophe character into Word
can (for common versions of Word, with their default settings) see
this serialized into an rtf document in a form that makes the document
dependent on being processed in a Unicode-aware form downstream. This
is A Bad Thing and unnecessary.
On Wed, 5 May 2010, Andy Dingley wrote:
I see it’s pointless to talk to you. Nevertheless I will write
something for the other readers here.
That is completely irrelevant; the character remains the same.
For example, the euro sign has many different byte representations
even in 8-bit character sets:
But there is only one euro character.
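The euro example can be checked the same way. This sketch assumes three 8-bit codecs that ship with CPython (Windows-1252, ISO-8859-15 and Windows-1251); other character sets place the euro sign elsewhere or lack it entirely.

```python
euro = "\u20ac"  # EURO SIGN

# Three 8-bit character sets that each put the euro at a different byte.
charsets = ["cp1252", "iso8859_15", "cp1251"]
euro_bytes = {name: euro.encode(name) for name in charsets}

for name, raw in euro_bytes.items():
    print(f"{name:12} 0x{raw.hex().upper()}")

# Decoding each byte with its own charset yields one and the same character.
decoded = {raw.decode(name) for name, raw in euro_bytes.items()}
print(decoded == {euro})
```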
In memoriam Alan J. Flavell
Firstly, the character doesn't remain the same. My users press the
apostrophe key, Word delivers them a single curly quote mark.
Secondly Word then delivers this Unicode codepoint to a context (not
necessarily HTML) where Unicode is poorly supported. I presume that
they think support for legacy systems is unimportant and anything that looks
vaguely like their HTML version's behaviour is acceptable, no matter
how poorly formed.
As to -1252 compatibility, read the list you yourself cited:
0x82 0x201A #SINGLE LOW-9 QUOTATION MARK
Same character / codepoint, different octets. It's not appropriate to
write 0x201A into a document claiming to be -1252, but Word does this.
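What goes wrong when the declared encoding and the actual octets disagree can be sketched in Python. This is only an illustration of the mismatch in general, not a reproduction of what any particular version of Word writes.

```python
low_quote = "\u201a"  # SINGLE LOW-9 QUOTATION MARK

# In a document that really is Windows-1252, the character is one byte.
as_cp1252 = low_quote.encode("cp1252")
print(as_cp1252.hex())

# A Unicode serialization (UTF-8 here) uses three different octets.
as_utf8 = low_quote.encode("utf-8")
print(as_utf8.hex())

# If those UTF-8 octets end up in a file labelled as Windows-1252,
# a downstream reader decodes them as three unrelated characters.
misread = as_utf8.decode("cp1252")
print(ascii(misread))
```

The byte 0x82 is correct only under the -1252 label, and the three-byte form only under a UTF-8 label; mixing label and octets gives the familiar garbage.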
On Wed, 5 May 2010, Andy Dingley wrote:
The main problem is: when are two characters the same?
There is no simple answer. Is a comma the same as a 9-shaped single quote
at the bottom (in German texts an opening single quote)? Is an apostrophe
the same as a 9-shaped single quote at the top (in English texts a closing
single quote)? Is an apostrophe as a letter (as in Swahili or Breton) the
same as apostrophe as a punctuation mark?
The current Unicode standard says no to the first question, yes to the
second, and no to the third. A bad choice for a number of reasons. A
former Unicode standard identified the apostrophe with another character,
to wit the modifier letter apostrophe U+02BC, which is not really better.
There are some ASCII characters that were used for more than one purpose
at the time of ASCII and typewriters: the hyphen also used for dash and
minus, the double quote used for both left and right double quotes, the
apostrophe also used for both left and right single quotes, and probably
some more. On typewriters, these had compromise glyphs: the hyphen longer
than a hyphen but shorter than a dash or minus, and the apostrophe/quotes
straight down to avoid any direction -- always in order that these
characters fit all their purposes approximately but no purpose exactly.
Two choices would have been consistent:
a) Abandon the multi-purpose ASCII character and invent new non-ASCII
characters with special meanings.
b) Use the ASCII character for only one of the old purposes and invent new
non-ASCII characters with special meanings for the remaining purposes.
Experience has shown that (a) does not really work: nobody uses the hyphen
U+2010, and hardly anybody the line separator U+2028 because there are
fine ASCII characters for the purpose. In the same spirit, most people use
the ASCII apostrophe U+0027 as apostrophe. And that would also have been
the most straightforward choice for Unicode.
Unfortunately, for the apostrophe, Unicode has followed a third strategy: the
ASCII character was abandoned for all purposes, but another character has
got a double meaning: first the modifier letter apostrophe U+02BC, then
the right single quotation mark U+2019. That's only weird.
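The characters under discussion can be listed with Python's unicodedata module. This sketch just prints what the Unicode database bundled with the interpreter reports (name and general category); the list of candidates is my own selection from this thread.

```python
import unicodedata

# Characters mentioned in this thread, as escapes to keep the source ASCII.
candidates = ["\u0027", "\u02bc", "\u2019", "\u201a", "\u002c"]

for ch in candidates:
    # general category: Po = other punctuation, Lm = modifier letter,
    # Pf = final quote, Ps = open punctuation
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}  {unicodedata.name(ch)}")
```

Note that U+02BC is classified as a letter, while U+0027 and U+2019 are punctuation, which matters to software that splits text into words.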
A related problem is the double meaning of U+201C as opening quote for
English texts and as closing quote for German texts. This double meaning
is font-dependent. In antiqua fonts, the two look the same (to wit a
raised 66-shape), in sans-serif fonts, they are often distinct (`` for
``English´´ and ´´ for ,,German´´), yielding often the typographically
wrong ,,German`` even when the text has the proper Unicode characters in it.
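The double role of U+201C is easy to show in code. The two helper functions below are hypothetical, written only to illustrate the English and German quoting conventions described above.

```python
def quote_english(text: str) -> str:
    # English convention: U+201C opens, U+201D closes.
    return "\u201c" + text + "\u201d"


def quote_german(text: str) -> str:
    # German convention: U+201E opens (low), U+201C closes (high).
    return "\u201e" + text + "\u201c"


english = quote_english("English")
german = quote_german("German")

# The English opening quote and the German closing quote are
# one and the same code point.
same_code_point = english[0] == german[-1]
print(ascii(english), ascii(german), same_code_point)
```

Whether the glyph for that single code point looks right in both roles is then entirely up to the font.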
That's quite another problem. Most keyboards have no distinct left and
right quotes, so there is really something smart about smart quotes. Like
any smartness, it often needs to be possible to turn it off.
Andy Dingley wrote:
Are you seriously confusing U+0027 with U+2019? The latter, U+2019, is the
preferred punctuation apostrophe, according to the Unicode Standard and any
serious typographer. The former, U+0027, is what you see in the "apostrophe
key" on most keyboards and also the character it produces - though
application programs may then change the character to something else. U+0027
carries the _name_ apostrophe in ASCII, Unicode, etc., but such names are
just identifiers and sometimes misleading. The only good use for U+0027 is
No, Word is.
I rarely get the luxury of working with a serious typographer. My users
are in finance, and they just want their PDFs to work. This requires
Word to not screw up an export to a format (and then a long toolchain)
that doesn't support Unicode characters, by putting Unicode in it.
Or in this case, SQL. Not all of the cases affected relate to SQL,
but Word is breaking that too.
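For a toolchain that really cannot take anything beyond ASCII, one possible workaround is to fold the typographic quotes back to their ASCII look-alikes before the text goes downstream. The sketch below is an assumption on my part, not something Word or any tool in that chain provides; the mapping table and the function name are hypothetical.

```python
# Hypothetical fallback table: typographic quote -> ASCII look-alike.
ASCII_FALLBACKS = str.maketrans({
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK (Word's "smart" apostrophe)
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
})


def to_ascii_quotes(text: str) -> str:
    """Replace curly quotes with straight ASCII quotes."""
    return text.translate(ASCII_FALLBACKS)


sample = "the customer\u2019s \u201cfinal\u201d report"
cleaned = to_ascii_quotes(sample)
print(cleaned)

# Raises UnicodeEncodeError if anything non-ASCII is left over.
cleaned.encode("ascii")
```

This throws away the typographic distinction argued about in this thread, and for SQL the resulting straight apostrophe still has to be passed through bound parameters or proper escaping.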
On Wed, 5 May 2010, Jukka K. Korpela wrote:
Yes, this is what the Unicode Standard says, at least in the current
Any "serious typographer" will agree that in most fonts, the right single
quote in English text (other languages have other quotes) has a shape that
resembles the shape of an apostrophe, just as the left single quote in
German text (other languages have other quotes) has a shape that resembles
the shape of a comma. Whether that should lead to the unification of the
two characters in a standard like Unicode is a question the typographer
won't try to answer if he is serious.
I consider it a bad idea to regard the two characters as the same, in both
cases: for the apostrophe where it was done, and for the comma where it
was not done. Apostrophes have a range of applications in languages: as
a sign for elided letters as in English "don't" or German "tu's"; as a
separator between stem and ending as in English "Peter's" or Turkish
"Almanya'da"; as a diacritic as in Breton "yec'hed" or Swahili "ng'ombe";
as a substitute for Semitic Ayin or Alef in Latin script as in the German
transcription "Ma'ariv" (an Israeli newspaper), and as a tone marker in
more exotic languages. Unicode regards these as not the same (the first
now "preferredly" U+2019, the second undefined, the others more or less
Given the mess in the standard, that's probably true. Better would have
been to define the apostrophe as apostrophe, and to let the serious
typographers design the glyphs for the apostrophe -- whether or not the
outcome resembles the glyph for a single quote, which is an entirely