
html tidy, word 2003 and "smart quotes"

comp.infosystems.www.authoring.html
Posted by Ron on April 13, 2005, 7:30 pm


Hello, I'm having an aggravating time getting the "html" spewed by Word
2003 to display correctly in a webpage.

The situation here is that the people creating the documents only know
Word, and aren't very computer savvy. I created a system where they
can save their Word documents as "html" and upload them to a certain
directory, and the web page dynamically runs them through tidylib using
the tidy extension to php4, thus causing the document to display
correctly. I also run the files through a couple sed expressions to
remove xml tags that have no business being there.

It alllllmost works. The resulting document follows the page's css
rules and displays correctly, except for those durned "smart quotes".

As you know, Word defaults to replacing straight quotes with fancy
quotes using an encoding that doesn't work on web pages. When you
"save as html", the resulting code doesn't display correctly. You can
turn off "smart quotes" (which I have suggested) but that only counts
towards *new* documents -- existing documents still have the problem.

Now when I use TidyUI on Windows XP, I can SEE the fancy quotes turn
into straight quotes. But when I use tidy on the command line or
tidylib through the php extension, the substitution does *not* take
place. (Freshly downloaded version of tidy in every case.)

On the Linux box I have "bare", "clean" and "word-2000" turned on.
(The code looks different if I turn any of them off, so I'm sure
they're getting turned on.) What it seems to come down to is that
tidy, with the same options, cleans up *different* things on Linux than
it does on Windows.

What are my options at this point? The users will continue to use Word
2003 -- no help there. My web server is Apache on Linux -- that's not
going to change. How do I get from here to there, dynamically, with no
user intervention?

Thanks very much for any and all suggestions. If I can solve this,
I've made it that much less likely that we'll switch to IIS.



Ron (ronc@europa.com)


Posted by Lachlan Hunt on April 14, 2005, 5:37 am


Ron wrote:
> Hello, I'm having an aggravating time getting the "html" spewed by Word
> 2003 to display correctly in a webpage.

Not a good idea to use Word for HTML at all, but at least you're trying to
clean it up.

> I created a system where they can save their Word documents as "html"
> and upload them to a certain directory, and the web page dynamically
> runs them through tidylib...
>
> It alllllmost works. The resulting document follows the page's css
> rules and displays correctly, except for those durned "smart quotes".

There's nothing inherently wrong with the curly quotes; the problem is
only that people fail to understand the character encoding issues
properly. Word documents are saved in the Windows-1252 encoding by
default. The quotes you are referring to are in the positions 145 (‘),
146 (’), 147 (“) and 148 (”). However, these code points (and all
others in the range from 128 to 159) are control codes in ISO-8859-1 and
others. Thus, the main problem is only caused by declaring the
incorrect character encoding.
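
To make the mismatch concrete, here is a small illustrative sketch. It is written in Python purely for demonstration and is not part of the original post; the helper name `describe` is made up for the example. It decodes the same four bytes once as Windows-1252 and once as ISO-8859-1, using Python's standard `cp1252` and `latin-1` codecs.

```python
# Bytes 0x91-0x94 (decimal 145-148) hold the curly quotes in Windows-1252.
raw = bytes([0x91, 0x92, 0x93, 0x94])

as_cp1252 = raw.decode("cp1252")   # the punctuation Word intended
as_latin1 = raw.decode("latin-1")  # what a page declared as ISO-8859-1 yields


def describe(text):
    """Return the code points of text as U+XXXX strings."""
    return ["U+%04X" % ord(ch) for ch in text]


print(describe(as_cp1252))  # ['U+2018', 'U+2019', 'U+201C', 'U+201D']
print(describe(as_latin1))  # ['U+0091', 'U+0092', 'U+0093', 'U+0094']
```

The bytes are identical in both cases; only the declared encoding decides whether they are read as quotation marks or as control codes.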

Although declaring the encoding as Windows-1252 in the HTTP headers will
work, it is not recommended because Windows-1252 is a proprietary
encoding designed for Windows only (although support may have been added
to other systems too, but that's not guaranteed).

The best options are to either save the files as UTF-8 and declare that
encoding in the HTTP headers, or continue to use ISO-8859-1 and replace
the quotes (and other special Windows-1252 chars) with numeric character
references. I think Word does have an option to save files as UTF-8,
which I recommend.
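
The second option, keeping ISO-8859-1 and falling back to numeric character references, might be sketched roughly as follows. This is an illustration in Python rather than anything from the thread; the function name `to_latin1_with_ncrs` and the sample string are hypothetical, and the sketch assumes the uploaded bytes really are Windows-1252.

```python
def to_latin1_with_ncrs(raw_bytes):
    """Decode Windows-1252 bytes, then emit ISO-8859-1 bytes in which any
    character ISO-8859-1 cannot represent becomes a &#NNNN; reference."""
    # Note: the five byte values Windows-1252 leaves unassigned
    # (0x81, 0x8D, 0x8F, 0x90, 0x9D) raise UnicodeDecodeError here.
    text = raw_bytes.decode("cp1252")
    return text.encode("latin-1", errors="xmlcharrefreplace")


# Hypothetical sample: curly double quotes, an en dash, a curly apostrophe,
# and an e-acute that ISO-8859-1 can already represent.
sample = b"\x93Hello\x94 \x96 it\x92s caf\xe9 time"
print(to_latin1_with_ncrs(sample))
# b'&#8220;Hello&#8221; &#8211; it&#8217;s caf\xe9 time'
```

Characters that ISO-8859-1 already covers pass through unchanged, so only the Windows-specific punctuation is rewritten.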

More information about Windows-1252 and the numeric character references
is available at:
http://www.cs.tut.fi/~jkorpela/www/windows-chars.html

--
Lachlan Hunt
http://lachy.id.au/
http://GetFirefox.com/ Rediscover the Web
http://GetThunderbird.com/ Reclaim your Inbox

Posted by Benjamin Niemann on April 14, 2005, 9:25 am


Lachlan Hunt wrote:

> Ron wrote:
>> Hello, I'm having an aggravating time getting the "html" spewed by Word
>> 2003 to display correctly in a webpage.
>
> Not a good idea to use Word for HTML at all, but at least you're trying to
> clean it up.
That was not his idea...
The same problem occurs when you use a WYSIWYG editor component (like
HTMLArea) on a webpage and people copy and paste from Word. I hate these
things (besides the fact that the web and the WYSIWYG concept are
completely incompatible, they only cause problems), but I was not able to
prevent the decision to embed WYSIWYG editors :(


>> I created a system where they can save their Word documents as "html"
>> and upload them to a certain directory, and the web page dynamically
>> runs them through tidylib...
>>
>> It alllllmost works. The resulting document follows the page's css
>> rules and displays correctly, except for those durned "smart quotes".
>
> There's nothing inherently wrong with the curly quotes, the problem with
> them is only that people fail to understand the character encoding
> issues properly. Word documents are saved in the Windows-1252 encoding
> by default. The quotes you are referring to are in the positions 145
> (‘), 146 (’), 147 (“) and 148 (”). However, these code points (and all
> others in the range from 128 to 159) are control codes in ISO-8859-1 and
> others. Thus, the main problem is only caused by declaring the
> incorrect character encoding.
>
> Although declaring the encoding as Windows-1252 in the HTTP headers will
> work, it is not recommended because Windows-1252 is a proprietary
> encoding designed for Windows only (although support may have been added
> to other systems too, but that's not guaranteed).
>
> The best options are to either save the files as UTF-8 and declare that
> encoding in the HTTP headers or, continue to use ISO-8859-1 and replace
> the quotes (and other special windows-1252 chars) with numeric character
> references. I think word does have an option to save files as UTF-8,
> which I recommend.
>
> More information about Windows-1252 and the numeric character references
> is available at:
> http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
I have had this problem myself often enough, and I usually used a list of
str_replace expressions to turn these characters into the correct &#...;
counterparts. After reading Lachlan's comment, an untested idea popped
into my head: you could try using the iconv module of PHP to convert the
Windows-1252 into UTF-8 on the fly.
I have neither Word nor Windows available, so I can't test it now...
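
As a rough illustration of that idea (in Python rather than PHP's iconv, and equally untested against real Word output), the conversion amounts to decoding the upload as Windows-1252 and re-encoding it as UTF-8. The function name `cp1252_to_utf8` and the `CONTENT_TYPE` constant are hypothetical names chosen for this sketch.

```python
def cp1252_to_utf8(raw_bytes):
    """Transcode Windows-1252 bytes to UTF-8.

    errors="replace" turns the few byte values that Windows-1252 leaves
    unassigned into U+FFFD instead of raising an exception.
    """
    text = raw_bytes.decode("cp1252", errors="replace")
    return text.encode("utf-8")


# Assumption: the page is then served with a matching declaration.
CONTENT_TYPE = "text/html; charset=utf-8"

converted = cp1252_to_utf8(b"\x93smart quotes\x94")
print(converted)  # b'\xe2\x80\x9csmart quotes\xe2\x80\x9d'
```

Any charset declaration inside the uploaded file itself would presumably also need to agree with the new encoding, otherwise the mismatch described earlier in the thread simply moves elsewhere.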

--
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://www.odahoda.de/


Posted by Alan J. Flavell on April 14, 2005, 10:39 am


On Thu, 14 Apr 2005, Lachlan Hunt wrote:

> There's nothing inherently wrong with the curly quotes, the problem
> with them is only that people fail to understand the character
> encoding issues properly. Word documents are saved in the
> Windows-1252 encoding by default. The quotes you are referring to
> are in the positions 145 (), 146 (), 147 () and 148 ().

Thereby neatly presenting yet another demonstration of the problem ;-}

> However, these code points (and all others in the range from 128 to
> 159) are control codes in ISO-8859-1 and others. Thus, the main
> problem is only caused by declaring the incorrect character
> encoding.

agreed

> Although declaring the encoding as Windows-1252 in the HTTP headers
> will work, it is not recommended because Windows-1252 is a
> proprietary encoding designed for Windows only (although support may
> have been added to other systems too, but that's not guaranteed).

in fact, support is pretty widespread, but I'd still counsel against
using it.

> The best options are to either save the files as UTF-8 and declare
> that encoding in the HTTP headers or, continue to use ISO-8859-1 and
> replace the quotes (and other special windows-1252 chars) with
> numeric character references.

I just wanted to make sure that nobody reading this thought that you
meant character references such as &#145; etc. Funnily enough,
historically MS software seems to have generated those undefined
references more enthusiastically than the actual 8-bit characters, but
the undefined references are quite bogus from Unicode's point of view.

The correct Unicode code points for these characters are all greater
than 255, as you obviously already know (there's a somewhat official
table of them with hex equivalents at
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT )
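
For documents that already contain the undefined references, one possible repair is to renumber them using the Windows-1252 mapping. The following is only a sketch in Python, not something proposed in the thread; the function name `fix_bogus_refs` is hypothetical, and Python's built-in `cp1252` codec stands in for the mapping table cited above.

```python
import re


def fix_bogus_refs(html):
    """Rewrite decimal references &#128; to &#159; so that they point at the
    code point Windows-1252 assigns to that byte value."""

    def repl(match):
        n = int(match.group(1))
        if 128 <= n <= 159:
            try:
                ch = bytes([n]).decode("cp1252")
            except UnicodeDecodeError:
                # Byte value unassigned in Windows-1252: leave it untouched.
                return match.group(0)
            return "&#%d;" % ord(ch)
        return match.group(0)

    return re.sub(r"&#(\d+);", repl, html)


print(fix_bogus_refs("&#147;quoted&#148; and &#233;"))
# &#8220;quoted&#8221; and &#233;
```

References outside the 128 to 159 range are left alone, since those already name the intended characters.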

> I think word does have an option to save files as UTF-8, which I
> recommend.

I guess it depends on what version you're using. The subject line
mentioned 2003, but plenty of folks aren't there yet.

> http://www.cs.tut.fi/~jkorpela/www/windows-chars.html

good cite.

all the best

Posted by Henri Sivonen on April 14, 2005, 1:54 pm



> > Although declaring the encoding as Windows-1252 in the HTTP headers
> > will work, it is not recommended because Windows-1252 is a
> > proprietary encoding designed for Windows only (although support may
> > have been added to other systems too, but that's not guaranteed).
>
> in fact, support is pretty widespread, but I'd still counsel against
> using it.

As a matter of principle or based on practical concerns?

--
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html

