cutting out the tags



For a program I am writing, I need a function that takes an HTML file and
strips the tags out of it, leaving only the text.
The obvious way (or so it seems) is to search for <, then for >, and cut
everything between them.

Is the use of the < and > characters restricted in any way? I know that they
should normally be replaced with &lt; and &gt; (AFAIR) in plain-text data, but I
guess the HTML documents you see on the net are far from ideal, and many HTML
authors write documents that violate the rules yet are still displayed
normally by all the well-known browsers.

Is <!-- <<<<< --> a valid comment, or <img src="aaa.jpg" alt="<<evil alt><">
a valid image tag, for example? (by valid, I mean usable without errors in
this case ;) )

Are "\<" and "\>" treated as the plain-text "<" and ">" characters themselves
in HTML files?

And the last one: are there any other special cases that I have to think
about if I use this simple method (cutting out everything between < and >)?

Re: cutting out the tags

Raven wrote:


lynx --dump /




\ has no escaping function in HTML; that's what entities are for.
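
To illustrate the points above, here is a minimal sketch in Python (not something from this thread) that uses the standard library's html.parser instead of a hand-rolled search for < and >. The names TextOnly and strip_tags are hypothetical, chosen only for this example. It assumes you want nothing but the character data, with entities decoded, and it is a rough illustration rather than a robust converter.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Hypothetical helper: collects character data, ignores tags and comments."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &lt; before handle_data
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; comments and tags never reach this method
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML string (illustrative sketch only)."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


if __name__ == "__main__":
    print(strip_tags('a<!-- <<<<< -->b'))
    print(strip_tags('x<img src="aaa.jpg" alt="<<evil alt><">y'))
    print(strip_tags(r'2 \&lt; 3'))
```

A real parser tracks quoted attribute values and comment delimiters, so the stray < and > characters in the two examples from the question should not confuse it, and the backslash in the last call stays in the output as an ordinary character. Note that this sketch would also keep the text inside script and style elements, which you probably want to drop.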

David Dorward                            /

Re: cutting out the tags



There is already a PHP function along these lines that may be of some

although note the disclaimer, '<i>tries</i> to return a string'.

