
Tidy MS Word output as pure XHTML without CSS styles and font styles

Posted by Martin Bretschneider on July 10, 2007, 5:10 am


Hi,

MS Word should output XHTML without any CSS styles. Tidy
(http://tidy.sourceforge.net/) helps quite a lot but leaves the CSS
styles in place, like the following:

<p class="P11 c2">foo</p>
<ul class="c4">
<li class="P11 c3">
<p class="P11 c2">bar</p>
</li>
</ul>

And I do want to have:

<p>foo</p>
<ul>
<li>
<p>bar</p>
</li>
</ul>

In other words: I want all the attributes to be deleted.

Is there an option for Tidy to achieve this, or another small app that can?


TIA Martin

PS: I could do this with XSLT, but the input must be XML and I have not
used XSLT for some years...
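
For illustration, here is a minimal Python sketch of the transformation
asked for above. It assumes Tidy has already produced a well-formed XHTML
fragment with no namespace declaration and no named entities such as
&nbsp;. The function name strip_attributes is made up for this example,
and only the standard library is used.

```python
import xml.etree.ElementTree as ET


def strip_attributes(xhtml: str) -> str:
    """Return the XHTML with every attribute removed from every element.

    Assumption: the input is well-formed XML (for example Tidy output),
    otherwise ET.fromstring raises a ParseError.
    """
    root = ET.fromstring(xhtml)
    for element in root.iter():
        # attrib is a plain dict, so clearing it drops class, style, etc.
        element.attrib.clear()
    return ET.tostring(root, encoding="unicode")


sample = '<ul class="c4"><li class="P11 c3"><p class="P11 c2">bar</p></li></ul>'
print(strip_attributes(sample))  # prints <ul><li><p>bar</p></li></ul>
```

A full Tidy document carries an xmlns declaration, which ElementTree treats
as a namespace rather than an attribute, so it would need separate handling
(for instance ET.register_namespace) to serialize cleanly.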
--
http://www.bretschneidernet.de/me/contact OpenPGP-key: 0x4EA52583
_o)(o_ Philip R. Zimmermann:
-./\//\.- If privacy is outlawed,
_\_VV_/_ only outlaws will have privacy.

Posted by Pavel Lepin on July 10, 2007, 6:55 am



> ms word should output xhtml without any css style. Tidy
> (http://tidy.sourceforge.net/) helps quite a lot but
> leaves the css styles like the following:

[...]

> In other words: I want all the attributes to be deleted.
>
> Is there an option for tidy to achieve this or another
> small app?
>
> ps.: I could do this with xslt but the input must be xml
> and I have not used xslt for some years...

At least some of the XSLT processors don't care what their
input is, as long as it's something that looks like a
DOMDocument. xsltproc (comes with libxslt) has a --html
switch specifically for transforming HTML documents.

I believe this very problem (or an extremely similar one)
was discussed at some length a few months ago either here,
on c.i.w.a.s, or on comp.text.xml. I'd recommend searching
Google Groups' archives to see if you can find that thread.
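
The same idea (use a parser that tolerates plain HTML, then re-emit the
elements without attributes) can also be sketched outside XSLT. The
following is not xsltproc and not an XSLT stylesheet; it swaps in Python's
standard html.parser module as the lenient parser. The class and function
names are hypothetical, and the sketch silently drops comments and the
doctype and does not treat script or style content specially.

```python
from html import escape
from html.parser import HTMLParser


class AttributeStripper(HTMLParser):
    """Re-emit the parsed tags and text, leaving out all attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # attrs is ignored on purpose: that is the whole point.
        self.parts.append(f"<{tag}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing input such as <br/> stays self-closing.
        self.parts.append(f"<{tag} />")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Character references were decoded by the parser; re-escape &, <, >.
        self.parts.append(escape(data, quote=False))


def strip_attributes_html(html: str) -> str:
    parser = AttributeStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(strip_attributes_html('<p class=P11 align="left">foo<br/>bar &amp; baz</p>'))
```

Note that the input above has an unquoted attribute value, which an XML
parser would reject but an HTML parser accepts.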

--
...the pleasure of obedience is pretty thin compared with
the pleasure of hearing a rotten tomato hit someone in the
rear end. -- Garrison Keillor

Posted by David Stone on July 10, 2007, 7:55 am


Pavel Lepin wrote:

> > ms word should output xhtml without any css style. Tidy
> > (http://tidy.sourceforge.net/) helps quite a lot but
> > leaves the css styles like the following:
>
> [...]
>
> > In other words: I want all the attributes to be deleted.
> >
> > Is there an option for tidy to achieve this or another
> > small app?
> >
> > ps.: I could do this with xslt but the input must be xml
> > and I have not used xslt for some years...
>
> At least some of the XSLT processors don't care what their
> input is, as long as it's something that looks like a
> DOMDocument. xsltproc (comes with libxslt) has a --html
> switch specifically for transforming HTML documents.
>
> I believe this very problem (or an extremely similar one)
> was discussed at some length a few months ago either here,
> on c.i.w.a.s, or on comp.text.xml. I'd recommend searching
> Google Groups' archives to see if you can find that thread.

When I was faced with a similar problem, someone recommended Beautiful
Soup:

http://www.crummy.com/software/BeautifulSoup/

I never got around to trying it (found a different way), so I
don't know how well it works.

Posted by Jukka K. Korpela on July 10, 2007, 1:44 pm


Scripsit Martin Bretschneider:

> ms word should output xhtml without any css style.

I can't see what you mean by that statement. In which sense "should" MS Word
do something like that?

> Tidy
> (http://tidy.sourceforge.net/) helps quite a lot but leaves the css
> styles like the following:

It seems that you actually mean that MS Word does _not_ output XHTML the way
you want. You're not complaining about CSS but about class attributes.

> <p class="P11 c2">foo</p>
> <ul class="c4">
> <li class="P11 c3">
> <p class="P11 c2">bar</p>
> </li>
> </ul>

The class attributes might be redundant, but are they really a problem? They
obfuscate the HTML code a bit, but not seriously. Besides, some day you - or
maybe even a user, playing with a user style sheet - might find the class
attributes useful, for the purposes of styling.

> In other words: I want all the attributes to be deleted.

Are you sure? Even if the author of the MS Word document used styles (in the
MS Word sense), selecting descriptive names, and had these names copied into
the HTML markup generated by MS Word? I think that would mean going
backwards.

If MS Word outputs CSS rules using the classes, as I suspect it does, then
it's a matter of deleting (or modifying) those rules, rather than the class
attributes. The real problem is that MS Word wants to transmit the fixed
font face, font size, table cell width etc. assignments into HTML and CSS
code. That's what makes the result less suitable for the WWW; but the class
attributes as such have _no_ effect.
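
As a rough sketch of that alternative (drop the CSS, keep the class
attributes), assuming well-formed XHTML without a namespace declaration:
the function name drop_css_keep_classes is invented for illustration, and
only Python's standard library is used.

```python
import xml.etree.ElementTree as ET


def drop_css_keep_classes(xhtml: str) -> str:
    """Remove <style> elements and style attributes; leave class alone."""
    root = ET.fromstring(xhtml)
    # ElementTree has no parent pointers, so visit each element as a
    # potential parent. The list() copies avoid mutating while iterating.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "style":
                parent.remove(child)
    for element in root.iter():
        # Inline declarations go; class attributes are left untouched.
        element.attrib.pop("style", None)
    return ET.tostring(root, encoding="unicode")


page = (
    '<html><head><style>p.c2 { font-size: 12pt }</style></head>'
    '<body><p class="c2" style="font-family: Arial">foo</p></body></html>'
)
print(drop_css_keep_classes(page))
```

With the rules gone, the remaining class names carry no presentation of
their own and stay available for a later style sheet.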

--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/


Posted by Adrienne Boswell on July 11, 2007, 3:05 pm


Gazing into my crystal ball I observed "Jukka K. Korpela" writing:

>> In other words: I want all the attributes to be deleted.
>
> Are you sure? Even if the author of the MS Word document used styles
> (in the MS Word sense), selecting descriptive names, and had these
> names copied into the HTML markup generated by MS Word? I think that
> would mean going backwards.
>
> If MS Word outputs CSS rules using the classes, as I suspect it does,
> then it's a matter of deleting (or modifying) those rules, rather than
> the class attributes. The real problem is that MS Word wants to
> transmit the fixed font face, font size, table cell width etc.
> assignments into HTML and CSS code. That's what makes the result less
> suitable for the WWW; but the class attributes as such have _no_
> effect.
>
>

Say Word does something like:
<p style="font-weight:bold">Bold</p>
<p style="font-style:italic">Italic</p>

HTML Tidy will then produce:

<p class="c1">Bold</p>
<p class="c2">Italic</p>

To the OP: IIRC, there is a way to tell Tidy not to add any attributes.
Go to Tidy's page at SourceForge for directions.

--
Adrienne Boswell at Home
Arbpen Web Site Design Services
http://www.cavalcade-of-coding.info
Please respond to the group so others can share

