How can I programmatically validate html ?

I am importing text from a column of a database table to display as
part of a web page in  There are about 7000 rows in the table.

About 10% of the rows hold HTML in that column, and about 10% of those
have badly broken HTML. The broken ones generally use <tr> and <td>
content with no enclosing <table>.

I have two alternatives.

1. I could write a program to create a page from each occurrence of HTML
content in a row and validate that against an HTML parser.

Has anyone done this? If so, how could it be done? Which parser could be

2. I can replace all the table tags in the database:

<tr -> <div
</tr> -> </div>
<table -> <div
</table> -> </div>
<td -> <span
</td> -> </span>

That will also give me broken html but not nearly as bad because
afterwards at least everything will display in the correct order that
the document author intended. That's what doesn't always happen now;
generally when there is <tr> and <td> content with no enclosing <table>.
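
Alternative 2 could be sketched roughly like this. This is only an illustration in Python (the thread does not say which language the import is written in), and the names `TAG_MAP` and `replace_table_tags` are made up for the example. It rewrites only the tag names listed above, leaves attributes alone, and, being a regex, will also rewrite matching text inside comments or scripts.

```python
import re

# Mapping taken from the list above: table markup -> neutral containers.
TAG_MAP = {"table": "div", "tr": "div", "td": "span"}

# Matches "<tr", "</td", "<TABLE" and so on, but only when the tag name
# ends right there, so "<track" or "<tbody" is left untouched.
_TAG_RE = re.compile(r"<(/?)(table|tr|td)\b", re.IGNORECASE)


def replace_table_tags(html):
    """Swap table, tr and td tag names for div/span, keeping any attributes."""
    def swap(match):
        slash, name = match.group(1), match.group(2).lower()
        return "<" + slash + TAG_MAP[name]
    return _TAG_RE.sub(swap, html)


if __name__ == "__main__":
    print(replace_table_tags('<tr><td class="x">a</td></tr>'))
```

Elements not in the list, such as <th> or <tbody>, pass through unchanged, so they would need their own entries if they occur in the data.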

Re: How can I programmatically validate html ?

< will validate HTML for you. No need to
reinvent the wheel.

Michael F. Stemper
#include <Standard_Disclaimer>
Always use apostrophe's and "quotation marks" properly.

Re: How can I programmatically validate html ?

mark4asp wrote:

I ran a very crude SQL script against that column so I now know the IDs
of reports with this dramatically broken html.

Nevertheless, I'm still interested in getting some help to solve the
problem addressed by the post, if there's anyone who's actually
done this with the help of available tools out there.

Re: How can I programmatically validate html ?

lovely and talented mark4asp broadcast on

I have not done this, not even on TV.

From what I read down thread, if you can automate this at all, your best bet
probably is to pass stuff through tidy (which is not a validator, but
which can fix many kinds of brokenness) and then through a real parser like
nsgmls, from either the SP or the OpenSP package.

You have to decide on a DOCTYPE because otherwise "validate" is meaningless.
In your circumstance 4.01 loose seems reasonable.  So it looks like this:

1) Slap your doctype and a TITLE element on the string from the database. In
HTML 4.01, open and close HTML, HEAD, and BODY tags are optional, but the
TITLE element is required.  You can also take this opportunity to groom the
empty tags.  Parsing HTML with regexes is in general a bad idea, but fixing
the empty tags is a piece of cake.  This might be a good place to look for
markdown and common types of wikisms such as naïve users might have
introduced and filter for them if you can determine which type they are.

(An all-wiki diagnoser and filter would be a useful contribution to the

At this point you have something that purports to be an HTML document.  

2) Send it through tidy (with appropriate tidy configuration or arguments).
Check to see if tidy died.  If it did, write the unique ID to a list of
things that need manual intervention and go to the next record.  Tidy is
very chatty, but you may want to save its error output anyway. You should
take this opportunity to get tidy to close tags and use lowercase in tags
and attribute names in case XHTML is your ultimate target or might ever
become your target one day.

3) (Tidy did not die) Send your tidified document through nsgmls.  Discard
the output and look at the errors.  You are looking for zilch in the error
file.  If there are errors, record the unique ID as needing manual
intervention. You probably want to save the nsgmls errors as nsgmls is not
chatty and when it says something, it means it.

4) (Passed validation).  Put the stuff through a filter to remove the
doctype and TITLE element and any extraneous stuff tidy may have added at
this point (which will be HEAD, BODY, and HTML tags if you got tidy to add
optional tags).  If 4.01 loose is not your target you may have to add stuff
to the filter to conform to what you want (such as closing empty tags if you
are going for XHTML).  Write the now valid fragment back to the database.
(You are not crazy enough to do any of this without extensive testing and
backing up the database first, right?)

5) Examine the exceptions.  You may find enough commonality among some failures
to devise stuff that will fix most of them in the filter at step 1.  Tidy is
the weak link --- when it does what you want, it's great.  When it doesn't,
perhaps you can convince it with stuff in step 1.

6) Don't sue me if you screw it up.
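
Steps 1 through 4 might be sketched along these lines. This is a rough, untested-against-your-data illustration in Python, not a finished tool. Every function name here is invented for the example, the tidy options shown are a guess at a sensible configuration, and it assumes nsgmls (often installed as onsgmls with OpenSP) can already find the HTML 4.01 DTD through your SGML catalog. Tidy's exit status is 0 for clean, 1 for warnings, and 2 for errors, so 2 or more is treated as "tidy died".

```python
import re
import subprocess

DOCTYPE = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" '
           '"http://www.w3.org/TR/html4/loose.dtd">')
TITLE = "<title>fragment under test</title>"

# Assumed command lines; adjust to your installation and catalog setup.
TIDY_CMD = ["tidy", "-q", "--tidy-mark", "no",
            "--uppercase-tags", "no", "--uppercase-attributes", "no"]
NSGMLS_CMD = ["nsgmls", "-s"]  # -s: suppress output, still report errors


def wrap_fragment(fragment):
    """Step 1: prepend doctype and TITLE so the fragment looks like a document."""
    return DOCTYPE + "\n" + TITLE + "\n" + fragment


def run_tool(cmd, document):
    """Feed the document to a command on stdin; return (exit code, stdout, stderr)."""
    proc = subprocess.run(cmd, input=document, capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr


_WRAPPER_RE = re.compile(
    r"<!DOCTYPE[^>]*>"                   # the doctype declaration
    r"|<title\b[^>]*>.*?</title\s*>"     # the TITLE element and its content
    r"|</?(?:html|head|body)\b[^>]*>",   # optional tags tidy may have added
    re.IGNORECASE | re.DOTALL,
)


def strip_wrapper(document):
    """Step 4: remove doctype, TITLE, and HTML/HEAD/BODY tags again."""
    return _WRAPPER_RE.sub("", document).strip()


def check_record(fragment, tidy_cmd=TIDY_CMD, nsgmls_cmd=NSGMLS_CMD):
    """Return (clean fragment, "") or (None, error text) for manual intervention."""
    code, tidied, tidy_err = run_tool(tidy_cmd, wrap_fragment(fragment))
    if code >= 2:                        # step 2: tidy died
        return None, tidy_err
    _, _, sgml_err = run_tool(nsgmls_cmd, tidied)
    if sgml_err.strip():                 # step 3: anything in the error output fails
        return None, sgml_err
    return strip_wrapper(tidied), ""
```

You would call check_record once per row, write the cleaned fragment back when the first item is not None, and otherwise append the row's unique ID and the error text to the manual-intervention list for step 5.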

Lars Eighner <
              War on Terrorism:  Bad News from the Sanity Front
"There's one thing ... that I do like about Rumsfeld, he's just a little bit
             crazy, OK"? --Thomas Friedman, _The New York Times_
