
convert non-western languages to HTML from Word

Posted by annalisa on January 18, 2008, 11:11 pm
I really need some help with language conversion to HTML. My
translators are translating into Word and I need to convert Word to
HTML. It's been a while since I've worked with Unicode, and I know that
not all the fonts being used are Unicode. Is there a way to strip all
the junk out of the MS Word filtered (yeah right) HTML? All I want are
the basic formatting tags: no spans, fonts, divs, or CSS. But I don't
want to lose any language identification in the doctype or meta tags, or
any directionality. Any help would be greatly appreciated. This is for a
highly ranked (Google) non-profit site.
Thanks

Posted by Ben C on January 19, 2008, 4:55 am
> I really need some help with language conversion to HTML. My
> translators are translating into Word and I need to convert Word to
> HTML. It's been a while since I've worked with Unicode, and I know that
> not all the fonts being used are Unicode.

Never mind the fonts. What you want out of the Word docs is the
characters. You need to figure out how Word has encoded the output and
then probably transcode it to UTF-8 (you don't have to use UTF-8 but
it's simpler).

A good transcoding program is "iconv".
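
Where iconv is not at hand, the same transcoding step can be sketched with Python's standard library. This is only a sketch: the function name is made up for illustration, and windows-1256 is an assumed example of a source encoding. The real source encoding has to be read from the file's meta tag or from Word's settings.

```python
def transcode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Decode bytes from source_encoding and re-encode them as UTF-8."""
    # The default error handling is strict: bytes that are invalid in
    # the assumed source encoding raise UnicodeDecodeError, which is a
    # useful hint that the encoding guess was wrong.
    text = data.decode(source_encoding)
    return text.encode("utf-8")


# Example input: four Arabic letters stored in an assumed legacy encoding.
legacy_bytes = "\u0627\u0628\u062a\u062c".encode("windows-1256")
utf8_bytes = transcode_to_utf8(legacy_bytes, "windows-1256")
```

After transcoding, the charset named in the document's meta tag has to be changed to match, otherwise browsers will still decode the file with the old encoding.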

> Is there a way to strip all the junk out of the MS Word filtered (yeah
> right) HTML?

I am lucky enough not to be speaking from experience of having had to do
that but I would start with Python and BeautifulSoup.
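
As a starting point, here is a minimal sketch of such a clean-up. It uses the standard library's html.parser rather than BeautifulSoup, purely so that it runs without installing anything. The tag and attribute whitelists are assumptions, not a definitive list, and the class and function names are invented for the example. Spans are dropped unless they carry lang or dir, in which case they are kept with only those two attributes, so the language and direction information survives.

```python
from html import escape
from html.parser import HTMLParser

# Assumed whitelists; adjust to taste.
KEEP_TAGS = {"html", "head", "title", "meta", "body", "p", "br", "a",
             "h1", "h2", "h3", "ul", "ol", "li", "b", "i", "em", "strong",
             "table", "tr", "td", "th"}
KEEP_ATTRS = {"lang", "dir", "href", "http-equiv", "content"}
VOID_TAGS = {"meta", "br"}
SKIP_CONTENT = {"style", "script"}


class WordHtmlCleaner(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0     # > 0 while inside <style> or <script>
        self.span_stack = []    # one bool per open span: was it kept?

    def _attrs(self, attrs):
        return "".join(' %s="%s"' % (k, escape(v or "", quote=True))
                       for k, v in attrs if k in KEEP_ATTRS)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)   # keep the doctype as it is

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth:
            return
        elif tag == "span":
            kept = self._attrs([a for a in attrs if a[0] in ("lang", "dir")])
            self.span_stack.append(bool(kept))
            if kept:
                self.out.append("<span%s>" % kept)
        elif tag in KEEP_TAGS:
            self.out.append("<%s%s>" % (tag, self._attrs(attrs)))

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth:
            return
        elif tag == "span":
            if self.span_stack and self.span_stack.pop():
                self.out.append("</span>")
        elif tag in KEEP_TAGS and tag not in VOID_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def clean_word_html(source):
    parser = WordHtmlCleaner()
    parser.feed(source)
    parser.close()
    return "".join(parser.out)
```

Note that character references in the input come out as real characters, so the result should either be saved as UTF-8 (with the meta tag changed to match) or be encoded with the charset the meta tag names, using errors="xmlcharrefreplace".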

> All I want are
> the basic formatting tags: no spans, fonts, divs, or CSS. But I don't
> want to lose any language identification in the doctype or meta tags, or
> any directionality.

Directionality should just work-- the characters are stored from "start"
to "end" and it's up to the browser to lay them out right-to-left or
left-to-right where appropriate.

An interesting question though is whether your authors have used special
characters like RLO and RLE, and whether, if they have, Word will save
them out as the Unicode characters.

Then you have to decide whether to leave them in the output, or to
replace them with the equivalent unicode-bidi properties. I don't know
which has better browser support.
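
Whether any of these characters are present at all can be checked mechanically once the text is available as a Unicode string. A minimal sketch, assuming the explicit embedding and override controls U+202A to U+202E are the ones of interest; the function name is made up for the example:

```python
import unicodedata

# LRE, RLE, PDF, LRO, RLO: the explicit embedding/override controls.
BIDI_CONTROLS = "\u202a\u202b\u202c\u202d\u202e"


def find_bidi_controls(text):
    """Return (index, Unicode name) for each bidi control found in text."""
    return [(i, unicodedata.name(ch))
            for i, ch in enumerate(text)
            if ch in BIDI_CONTROLS]
```

An empty result for every document would mean the question of how to represent them in HTML does not arise.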

Posted by Jukka K. Korpela on January 20, 2008, 3:39 pm
Scripsit Ben C:

>> I really need some help with language conversion to HTML. My
>> translators are translating into Word and I need to convert Word to
>> HTML. It's been a while since I've worked with Unicode, and I know that
>> not all the fonts being used are Unicode.
>
> Never mind the fonts. What you want out of the Word docs is the
> characters.

Yes, but fonts _might_ be involved. In the bad old days, people wrote
some languages using "ad hoc" fonts, e.g. 8-bit encoded fonts where some
national characters were placed. The documents were supposed to be
viewed using that very font only; using any other font changed the
content to gibberish.

If this is the case, you would need to find out what the font is
supposed to contain and replace all the data by proper characters.
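
Once the font's layout is known, the replacement itself is a simple table lookup. The sketch below assumes the text has already been extracted from the runs that use the legacy font. The mapping shown is entirely hypothetical; a real table has to be built from the documentation or glyph chart of the actual font.

```python
# HYPOTHETICAL mapping: the code point Word reports for a character
# typed in the legacy font, mapped to the character the font displays.
HYPOTHETICAL_FONT_MAP = {
    0x00E1: "\u0627",   # shown as Arabic alef in the imaginary font
    0x00E2: "\u0628",   # shown as Arabic beh in the imaginary font
}


def remap_font_text(text, font_map):
    """Replace font-specific code points with proper Unicode characters."""
    # str.translate accepts a dict keyed by code point; characters
    # without an entry pass through unchanged.
    return text.translate(font_map)
```

This must only be applied to text that was formatted in that font, since the same code points in any other font are ordinary Latin-1 letters.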

More often, the characters are correct but the output from Word is a
mess, containing an absurd amount of tricky markup and CSS code, a
nightmare to maintain. There are two possible approaches:
1) Clean it up. First use "filtered HTML" when creating HTML from Word.
This removes much of the crap, but far from everything. You might still
have e.g. lots of hard-wired pixel dimensions for table cells. You would
probably want to find a utility, like a programmable editor, for
removing them.
2) Don't create the mess. Instead, use "Save As" and select plain text
in Word. Then use the plain text file as a basis for creating an HTML
document (just add simple markup, then use CSS for styling), or maybe
use an HTML document template (with just markup, no content) and copy
and paste the content piecewise there.

> You need to figure out how Word has encoded the output and
> then probably transcode it to UTF-8 (you don't have to use UTF-8 but
> it's simpler).

Probably not. When you do a Save As HTML (Save As Web Page) in Word,
Word uses the encoding defined in its Web settings (somewhere in the
Tools/Settings/General menu), and any characters that cannot be
represented in it will appear as character references like &#1575;.
Somewhat inconvenient for editing on an editor that does not grok them,
but works on browsers. You can also change the encoding to UTF-8 in Word
and avoid this, but then any postprocessing needs a UTF-8 enabled tool.
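
For postprocessing, both directions of this conversion are available in Python's standard library. A small sketch, with invented function names and iso-8859-1 assumed as the target encoding:

```python
import html


def to_charrefs(text, encoding="iso-8859-1"):
    """Encode text, writing unrepresentable characters as &#NNNN;."""
    return text.encode(encoding, errors="xmlcharrefreplace")


def from_charrefs(text):
    """Turn character references in a piece of text into characters."""
    return html.unescape(text)
```

html.unescape also converts references such as &lt; and &amp;, so it is only safe on text content, not on a whole document including its markup.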

>> Is there a way to strip all the junk out of the MS Word filtered
>> (yeah right) HTML?
>
> I am lucky enough not to be speaking from experience of having had to
> do that but I would start with Python and BeautifulSoup.

I have usually done it by using Emacs on the filtered HTML output. Not
so convenient, but tolerable if you don't need to do it every day.
Surely a general-purpose tool like Python or Perl would be nice, but
sadly enough, not everyone is fluent in using them.

>> All I want are
>> the basic formatting tags: no spans, fonts, divs, or CSS. But I don't
>> want to lose any language identification in the doctype or meta tags, or
>> any directionality.

Note: There is absolutely no language identification in the doctype
declaration, except about the language of the document type definition,
which is English, and you can't change that.

> Directionality should just work-- the characters are stored from
> "start" to "end" and it's up to the browser to lay them out
> right-to-left or left-to-right where appropriate.

It's not quite that simple. Characters have inherent directionality, and
browsers are supposed to observe it, but they sometimes fail, and then
there's the problem that directionality is more than that. It also
involves things like table column layout direction, default alignment
(left vs. right), placement of vertical scroll bar, etc. Thus, anyone
authoring in a right to left language should use <html dir="rtl">, and
any texts with the opposite direction should have their own dir
attribute.
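
The inherent directionality of characters can be inspected with Python's unicodedata module, which may help when deciding which converted documents need dir="rtl". This is a rough sketch with made-up function names. The "first strong character" rule in looks_rtl is an assumed heuristic, not something a browser is known to apply to documents without a dir attribute.

```python
import unicodedata


def bidi_classes(text):
    """Pair each character with its Unicode bidirectional class."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


def looks_rtl(text):
    """True if the first strongly directional character is R or AL."""
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls in ("R", "AL"):      # Hebrew-type or Arabic-type letter
            return True
        if cls == "L":              # left-to-right letter
            return False
    return False                    # no strong character found
```

Digits, spaces, and punctuation have weak or neutral classes, so they are skipped until a letter settles the question.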

I just made a simple text using Word 2002. I entered some Arabic
letters, then asked Word to save it in HTML format, as filtered. Here's
a key part of the output:

<meta http-equiv=Content-Type content="text/html; charset=iso-8859-1">
...
<p class=MsoNormal><span lang=AR-SA
dir=RTL>&#1575;&#1576;&#1578;&#1580;</span></p>

So Word decided, on my behalf, that the text is in Arabic as used in
Saudi Arabia, and it inserted both a lang attribute and a dir attribute.
Since it's using iso-8859-1 (due to its defaults), it converted the
Arabic letters to character references. This is not that bad.

> An interesting question though is whether your authors have used
> special characters like RLO and RLE, and whether if they have Word
> will save them out as the Unicode characters.

That might be a problem... but in my test, RLO doesn't seem to work even
in Word, so why would an author use it?

--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/


Posted by Ben C on January 20, 2008, 4:11 pm
> Scripsit Ben C:
[...]
>> Directionality should just work-- the characters are stored from
>> "start" to "end" and it's up to the browser to lay them out
>> right-to-left or left-to-right where appropriate.
>
> It's not quite that simple. Characters have inherent directionality, and
> browsers are supposed to observe it, but they sometimes fail, and then
> there's the problem that directionality is more than that. It also
> involves things like table column layout direction, default alignment
> (left vs. right), placement of vertical scroll bar, etc. Thus, anyone
> authoring in a right to left language should use <html dir="rtl">, and
> any texts with the opposite direction should have their own dir
> attribute.

I'm not saying browsers are perfect! They mostly don't implement the
rules for rtl quite correctly-- for example, they don't all alter
margin-left rather than margin-right when width properties are
over-constrained.

The point is just that I would expect the character order to be the same
in the Word document and in the HTML, and to correspond to the order in
which the author typed the characters. So it shouldn't present any
new or peculiar problems.

OP might have been worried that Word would have rearranged the actual
character order to do right-to-left. That isn't how it works, but it
isn't always so obvious unless you have more experience with these
things.

> I just made a simple text using Word 2002. I entered some Arabic
> letters, then asked Word to save it in HTML format, as filtered. Here's
> a key part of the output:
>
><meta http-equiv=Content-Type content="text/html; charset=iso-8859-1">
> ...
><p class=MsoNormal><span lang=AR-SA
> dir=RTL>&#1575;&#1576;&#1578;&#1580;</span></p>
>
> So Word decided, on my behalf, that the text is in Arabic as used in
> Saudi Arabia, and it inserted both a lang attribute and a dir attribute.
> Since it's using iso-8859-1 (due to its defaults), it converted the
> Arabic letters to character references. This is not that bad.

It could be worse.

>> An interesting question though is whether your authors have used
>> special characters like RLO and RLE, and whether if they have Word
>> will save them out as the Unicode characters.
>
> That might be a problem... but in my test, RLO doesn't seem to work even
> in Word, so why would an author use it?

Then all is well and the OP need not worry about these characters.

They do work (at least to some extent, I have not done extensive tests)
in Firefox and Opera. But on the www it's perhaps better to use
unicode-bidi properties instead: CSS 2.1 says they are supposed to work,
but as far as I know there's nothing that says that HTML UAs have to
support the RLO etc. characters.
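
If the control characters do turn up, replacing them can be sketched as below. This version emits HTML markup (dir attributes and bdo elements) rather than CSS declarations, on the assumption that the markup is an acceptable stand-in for the embed and bidi-override values of unicode-bidi. The function name is invented, and the input is assumed to be the plain text of a single paragraph, since an embedding cannot sensibly span block boundaries.

```python
# Opening controls and the markup assumed to replace them.
OPENERS = {
    "\u202a": ('<span dir="ltr">', "</span>"),   # LRE
    "\u202b": ('<span dir="rtl">', "</span>"),   # RLE
    "\u202d": ('<bdo dir="ltr">', "</bdo>"),     # LRO
    "\u202e": ('<bdo dir="rtl">', "</bdo>"),     # RLO
}
PDF = "\u202c"   # POP DIRECTIONAL FORMATTING closes the innermost one


def bidi_controls_to_markup(text):
    """Replace explicit bidi controls in one paragraph with markup."""
    out = []
    closers = []                     # stack of pending closing tags
    for ch in text:
        if ch in OPENERS:
            open_tag, close_tag = OPENERS[ch]
            out.append(open_tag)
            closers.append(close_tag)
        elif ch == PDF:
            if closers:              # a stray PDF is simply dropped
                out.append(closers.pop())
        else:
            out.append(ch)
    while closers:                   # close anything left open
        out.append(closers.pop())
    return "".join(out)
```

The function does not escape markup characters, so it should run on text that is already safe to place in HTML.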

Posted by Andreas Prilop on January 22, 2008, 12:07 pm
On Sun, 20 Jan 2008, Jukka K. Korpela wrote:

>> An interesting question though is whether your authors have used
>> special characters like RLO and RLE, and whether if they have Word
>> will save them out as the Unicode characters.
>
> That might be a problem... but in my test, RLO doesn't seem to work
> even in Word,

You must first install
http://www.microsoft.com/globaldev/handson/user/xpintlsupp.mspx
http://www.microsoft.com/globaldev/handson/user/2kintlsupp.mspx

> so why would an author use it?

Authors should avoid these Unicode characters in HTML:
http://www.unics.uni-hannover.de/nhtcapri/bidirectional-text#control
You could use character references like &#8235; . Check at
http://www.unics.uni-hannover.de/nhtcapri/right-to-left.html
whether they work in your browser.

--
In memoriam Alan J. Flavell
http://groups.google.com/groups/search?q=author:Alan.J.Flavell
