Re: How can I programmatically validate html ?




It's hard to programmatically validate HTML. You need to use Jade,
because Mere Mortals don't get to go near the code that does it
otherwise. Even that's not easy.

An easier way is to make valid code, then check it, just the once, by
loading it into a browser that supports validation, such as Firefox with
Marc Gueury's plugin. Provided that your code is actually valid (within
not many attempts), this is workable.


First define what the DB can contain. Match one of the HTML productions,
such as %block;, TR or (TR)+.

Your content won't be HTML (according to the DTD) unless it uses <html>
as the one and only root element. It won't do this. It _can't_ do this,
not if you want to join rows together. So if you have to use a fragment,
then make it a well-defined fragment.

If the content is always one thing (e.g. %block;) then that's easy. If
it isn't always the same, then work out what it is. %block; | (TR)+
is quite workable - your current content might even be valid already
(just not valid HTML...). If you have to work with one of two entities
like this, then it makes it a little harder to assemble on reading it,
but not impossible.
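
As a rough illustration of that assembly step, here is a minimal Python sketch. It assumes each stored fragment already carries a label saying which content model it follows; the labels "block" and "rows" and the function name assemble are made up for the example. Runs of consecutive row fragments are joined and wrapped in a single table, and block fragments pass straight through.

```python
from itertools import groupby

def assemble(fragments):
    """Join (content_model, html) pairs into one piece of page markup.

    content_model is a hypothetical label: "block" for %block; content,
    "rows" for one or more bare <tr> elements.
    """
    parts = []
    # groupby only groups *consecutive* items, which is what we want here:
    # adjacent row fragments end up inside the same table.
    for model, group in groupby(fragments, key=lambda pair: pair[0]):
        html = "\n".join(text for _, text in group)
        if model == "rows":
            html = "<table>\n" + html + "\n</table>"
        parts.append(html)
    return "\n".join(parts)
```

Two row fragments in a row come out inside one table rather than two, which is the "join rows together" case.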

I'd suggest adding another column to indicate just which content model
it follows.  Querying that would be easy.   If you can't do this, then
find a way to tell what the content model is, such as a regex to look
for bare <tr> start tags at the front. This will be slower than
retrieving a value you calculated earlier, but still workable.   The
rest is just a coding exercise.
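
A minimal sketch of the regex fallback, in Python. The labels "rows" and "block" and the function name content_model are assumptions for illustration only; the idea is just to look for a bare <tr> start tag at the front of the fragment.

```python
import re

# A <tr> start tag at the very front, ignoring leading whitespace and case.
# The \b stops it matching longer names such as <track>.
BARE_TR = re.compile(r"\s*<tr\b", re.IGNORECASE)

def content_model(fragment):
    """Guess which content model a stored fragment follows."""
    return "rows" if BARE_TR.match(fragment) else "block"
```

If you do add the extra column, you could run something like this once over the existing rows to fill it in, then query the column from then on.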

 Cats have nine lives, which is why they rarely post to Usenet.

Re: How can I programmatically validate html ?

Andy Dingley wrote:


Thanks Andy,

Your post gave me a few ideas to chew over.

It doesn't solve the immediate problem, which may be unsolvable. I'll
write a SQL script to fix the very badly broken code.

I already decided to add extra validation for the CMS.

By the way, this code will never validate against a DTD because it's
created by users who don't know the various definitions and are not
very technical. Some of it is copied by them from a public site which
seems to be written by a (broken) robot.

I had hopes once upon a time of having this website produce perfectly
valid xhtml, now I just want to ensure the html is not really badly
broken (i.e. that the layout is not very silly).

Re: How can I programmatically validate html ?


I'd be wary of solving the immediate problem first.

Take a step back. _Design_ something, don't rush into coding it.

Q: What do you really need to store here?

Q: How is that best represented as HTML fragments?

_After_ you've answered those, think about other questions like:

Q: With a database full of those, how would I query them?

Q: How do I get my current content to resemble this ideal?

Q: How do I handle user input in the future, so that the content
created is always correct?

You seem to have a DB full of "nodes", where the presentation of a
node can vary, but the overall document is always a simple, single
linear list of nodes and doesn't need to handle trees of nodes,
multiple categorized lists etc.

The structure (not just presentation) of a node is going to be some
set of properties. Is this consistent across nodes, are some
properties null for some nodes, or is the structure radically
different for each node? In the worst case, there's no consistent
structure across nodes (but you could represent each as a sub-list of

From your answers here, then you can choose how to represent a node in
a row of the DB.

In some cases, you break each property type out into its own column
(maybe even its own table). This is the case when you want full
DB-level processing based on individual property values. It also copes
with nodes that have widely differing or unpredictable property sets.
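
To make the "properties in their own columns" end of the scale concrete, here is a small sketch using Python's sqlite3 module. The table name node and the columns title, price and body_html are invented for the example, not taken from the real database; the point is only that a property held in its own column can be selected and sorted on directly.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory table with hypothetical per-property columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE node (
               id        INTEGER PRIMARY KEY,
               title     TEXT NOT NULL,
               price     REAL,   -- NULL when a node has no price
               body_html TEXT    -- presentational fragment, stored separately
           )"""
    )
    conn.executemany(
        "INSERT INTO node (title, price, body_html) VALUES (?, ?, ?)",
        [
            ("Widget", 4.50, "<p>A small widget.</p>"),
            ("Gadget", 12.00, "<p>A larger gadget.</p>"),
            ("Notice", None, "<p>Closed on Mondays.</p>"),
        ],
    )
    return conn

def titles_cheaper_than(conn, limit):
    """Select on a property value without ever parsing the HTML."""
    rows = conn.execute(
        "SELECT title FROM node WHERE price < ? ORDER BY price", (limit,)
    )
    return [title for (title,) in rows]
```

Nodes where the property is null simply drop out of the comparison, so the same table copes with nodes that lack it.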

At the other extreme, you store only the presentational version of the
node, in some finalised HTML structure. This has many disadvantages:
it's tightly coupled to presentation (hard to change, and hard to
manipulate in an abstract manner),
and it's difficult to query for selecting rows on particular property

If you expose the stored structure of the nodes to the content
authors, does this help or hinder them?  IMHE, it's OK to expose HTML
fragments here, providing that:

* It's simple enough to teach, ab initio, on one side of A4 paper.

* The fragments the authors work on are short. Don't ask too much.

* You validate before storing. You check syntactic well-formedness and
you also check validity.  Your validation _MUST_ give good error
messages that highlight how to fix the error, not impenetrable
messages about validation (usually at a point long after the error).

* If you resemble HTML, then you must _be_ HTML. Something that's "a
bit like" HTML, except that the <foo> element is also necessary, <div>
is forbidden and <span> has the wrong content model will only confuse
those authors who know a bit of HTML already. It's extremely useful
to build in class and style attributes (maybe <div> and <span> too) as
a simple pass-through model so that "power users" can make it dance
for those few rare occasions when it's needed.

* You can work with either HTML or XHTML / XML syntax. Make it clear
though, and train accordingly. Don't let authors develop
misunderstandings such as "<p> means para break" or "<br><br><br>"
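
For the validate-before-storing point in the list above, a rough Python sketch follows. It assumes you have settled on XHTML / XML syntax, so a stock XML parser can do the well-formedness check, and it stands in for real DTD validity with a simple allow-list of element names. The allow-list, the function name check_fragment and the message wording are all made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat

# Hypothetical allow-list; a real one would follow the chosen content model.
ALLOWED = {"p", "ul", "ol", "li", "em", "strong", "a", "br", "div", "span"}

def check_fragment(fragment):
    """Return a list of plain-English problems; an empty list means it passed."""
    # Wrap on separate lines so reported columns still match the author's text.
    wrapped = "<root>\n" + fragment + "\n</root>"
    try:
        root = ET.fromstring(wrapped)
    except ET.ParseError as err:
        line, column = err.position   # line is 1-based, column is 0-based
        line -= 1                     # discount the <root> wrapper line
        if line > len(fragment.splitlines()):
            # The parser only noticed at the wrapper's end tag, long after
            # the real mistake, so say what to look for instead.
            return ["An element was opened but never closed. "
                    "Check that every start tag has a matching end tag."]
        reason = expat.ErrorString(err.code)
        return [f"Line {line}, column {column + 1}: {reason}."]
    problems = []
    for element in root.iter():
        if element is not root and element.tag not in ALLOWED:
            problems.append(f"<{element.tag}> is not allowed; use one of: "
                            + ", ".join(sorted(ALLOWED)))
    return problems
```

The unclosed-element case is handled separately because the parser would otherwise report a position past the end of the author's text, which is exactly the kind of message that doesn't help anyone fix anything.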

In your specific case, the main problem seems to be whether a "node"
is going to be permitted to be marked up as %block;, (TR)+ or both. If
there are columns inside a <tr>, are they in a particular order? This
question depends on usage for query and the idealised results you
expect (do you want tables generated, or a list of <div>? Work out which
is actually best, then make that your goal). Once you've _decided_ that
(don't just let it be a result by default), then code up the content
entry interface to support that.
