Tokenize an HTML page.


I would like to index a whole bunch of HTML documents on my site to speed
up my internal searches (I currently use 'LIKE "%...%"', which is not
very efficient).

My understanding is that I would need to:
1) Remove the HTML (with strip_tags( ... ))
2) Walk the string and, every time I come across a stop character
(<space>, ', ", ?, ! etc.), treat the text before it as a word.
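
To make the idea concrete, here is a minimal sketch of those two steps. The function name naive_tokenize() and the exact list of stop characters are only placeholders I made up for illustration, and it assumes the input is valid UTF-8:

```php
<?php
// Naive tokenizer: strip the tags, then split on a fixed set of stop characters.
// naive_tokenize() and the stop-character list are illustrative assumptions.
function naive_tokenize(string $html): array
{
    // 1) Remove the markup and decode entities such as &amp;.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');

    // 2) Split on whitespace and common punctuation, dropping empty pieces.
    //    The /u modifier treats the pattern and the subject as UTF-8.
    $words = preg_split('/[\s\'",.?!;:()]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure (for example, invalid UTF-8 input).
    return $words === false ? [] : $words;
}
```

Because the single quote is in the stop list, a word such as "don't" comes back as two tokens, "don" and "t".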

The above solution is overly simplistic, as it does not work for many
languages (Hebrew, for example, uses the single quote as part of the word).

Also, stripping HTML assumes that it is properly formatted, something I
cannot really guarantee (and in any case, I might want to keep certain
items, such as the URLs inside href='' attributes).
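
One thing I could try for the malformed-markup problem is PHP's DOM extension, whose HTML parser tries to recover from broken markup. A rough sketch, assuming the DOM extension is installed; extract_text_and_links() is a name I made up, not a library function:

```php
<?php
// Sketch: parse possibly malformed HTML with DOMDocument and keep href values.
// extract_text_and_links() is a hypothetical helper, not part of PHP.
function extract_text_and_links(string $html): array
{
    // loadHTML() rejects an empty string, so handle that case up front.
    if ($html === '') {
        return ['text' => '', 'links' => []];
    }

    $doc = new DOMDocument();

    // Collect parser warnings about bad markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Gather the href attribute of every <a> element.
    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href !== '') {
            $links[] = $href;
        }
    }

    // textContent is the concatenated text of the element tree, without tags.
    $root = $doc->documentElement;
    $text = $root === null ? '' : trim($root->textContent);

    return ['text' => $text, 'links' => $links];
}
```

This is only a starting point: the contents of script and style elements would still end up in the text, and the character encoding of each page would need to be handled before non-Latin text could be trusted.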

So, before I re-invent the wheel, can someone suggest a
script/class/code that is able to tokenize html content?

Any suggestions?

Many thanks
