Why do they do this? And what to do about it?

Why do word processors insist on inserting a lot of extraneous stuff
into the HTML documents they produce?

And is there a simple way to get rid of this junk automatically?

MS Word is probably the worst, but OpenOffice Writer does it too. It puts
CRs at the end of each line, starts each paragraph with a lot of
redundant font information, etc.

If you want clean HTML all that stuff has to be combed out. When you're
dealing with 3000-5000 word articles this is time-consuming. Editors, at
least the ones I've encountered, quite reasonably insist this is the
author's job.

Why comb it out? Because I'm not writing for my own pages. I'm writing
to editorial specification and every editor I've ever dealt with hates
this stuff.

What I'd like to find is either a) some kind of setting that turns this
stuff off (not likely) or b) some way to set up a filter to get rid of it.

Re: Why do they do this? And what to do about it?

Rick Cook wrote:

Because they're word processors, not text or real HTML editors.

Don't write HTML with word processors and you won't have anything to get
rid of.

Everyone will suggest his or her favorite.  I do 99% of my work in Linux,
but since your headers imply you're using Windows, I'll just mention that
for my occasional Windows HTML editing I use Crimson Editor.  Any text
editor will work for you, but be sure it has syntax highlighting for
HTML/CSS.  You might also consider an HTML/CSS editor that includes
project/file management; it's convenient to be able to upload your pages
with a click or two from within your HTML editor.

Killing all posts from Google Groups
The Usenet Improvement Project - http://improve-usenet.org

Re: Why do they do this? And what to do about it?

Scripsit Rick Cook:

Probably because the vendors want to preserve as much formatting as
possible, even including formatting as per word processor defaults (thus
often not _intentionally_ chosen by its user). The reason behind this is
that they don't understand web publishing but see it as desktop

Some of it. Using "filtered output" in Word helps somewhat, but Word
still inserts a bulky stylesheet (easy to delete of course) and lots of
width and height attributes for table cells and other oddities. The Tidy
software is claimed to clean things up, but it's not reliable; it also
messes things up.

What I have done, after "filtered output", is some simple Perl-based
processing or, depending on my mood and the phase of the moon, some
Emacs processing with keyboard macros. But those tools are not for
everyone. And it's not possible to automate it all, since some of the
presentational-looking markup should _not_ be removed since it reflects
structural intentions, such as highlighting.
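
A rough sketch of the kind of scripted cleanup described above, written in Python rather than Perl. It is not the poster's actual script: the function name `clean_word_html` and the choice of what to strip are assumptions made for illustration. It removes the embedded stylesheet, `width` and `height` attributes, and `<font>` tags, and deliberately leaves everything else (for example `<b>`, `<i>`, `<span>`) alone, since some of that markup may carry real emphasis and needs a human decision. Regular expressions are not a real HTML parser, so treat this as a first pass only.

```python
import re

# Matches an opening tag such as <td width="200">. Assumes attribute
# values never contain ">" (usually true for word-processor output).
OPEN_TAG = re.compile(r"<[a-zA-Z][^>]*>")

# Matches a width or height attribute with a quoted or unquoted value.
SIZE_ATTR = re.compile(
    r"""\s+(?:width|height)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""",
    re.IGNORECASE,
)


def clean_word_html(markup: str) -> str:
    """Strip some common word-processor clutter from an HTML string."""
    # 1. Drop the embedded stylesheet(s).
    markup = re.sub(
        r"<style\b[^>]*>.*?</style>", "", markup,
        flags=re.IGNORECASE | re.DOTALL,
    )
    # 2. Drop width/height attributes, but only inside tags,
    #    so ordinary text that mentions "width=..." is untouched.
    markup = OPEN_TAG.sub(lambda m: SIZE_ATTR.sub("", m.group(0)), markup)
    # 3. Unwrap <font> tags, keeping the text they enclose.
    markup = re.sub(r"</?font\b[^>]*>", "", markup, flags=re.IGNORECASE)
    return markup


if __name__ == "__main__":
    sample = '<td width="200" height=10><font face="Arial">Hi</font></td>'
    print(clean_word_html(sample))
```

Running it on the output of Word's "filtered" export and then reviewing the result by eye keeps the manual part small without pretending the whole job can be automated.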

That's not serious; CR is just whitespace.

That's worse, but what's much worse is <td width="200"> and things like
that, which create rigid layout. Redundant stuff is just useless, not

It might actually be faster to tell the word processor to save it as
plain text, then add adequate markup "by hand".
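
The plain-text route can be partly scripted too. The sketch below is one possible approach, not something from the thread: the function name `text_to_html` is made up, and it assumes paragraphs in the saved text file are separated by blank lines. It joins hard-wrapped lines, escapes `&`, `<` and `>`, and wraps each paragraph in `<p>`; headings, emphasis and links would still be added by hand.

```python
import html
import re


def text_to_html(text: str) -> str:
    """Turn blank-line-separated plain text into <p> paragraphs."""
    # Normalize Windows (CR LF) and old Mac (CR) line endings to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    paragraphs = []
    # A paragraph break is one or more blank (or whitespace-only) lines.
    for block in re.split(r"\n\s*\n", text):
        # Join hard-wrapped lines into a single line of text.
        body = " ".join(
            line.strip() for line in block.splitlines() if line.strip()
        )
        if body:
            # Escape &, < and > so the text is safe inside HTML.
            paragraphs.append("<p>" + html.escape(body, quote=False) + "</p>")
    return "\n".join(paragraphs)


if __name__ == "__main__":
    print(text_to_html("First line\r\nwraps here.\r\n\r\nSecond paragraph.\r\n"))
```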

Jukka K. Korpela ("Yucca")

Re: Why do they do this? And what to do about it?

Rick Cook wrote:

I wrote a Win32 utility called "Xtag" that you can find at
http://industrologic.com/basic/ I threw it together to convert
some documents that people gave me. I'll try to improve it if I
get feedback from those using it. It can't work miracles, but at
least it is something.

Gary Peek
