- Posted on
- how to get the list of words from html files
October 8, 2005, 11:53 pm
- Gunnar Hjalmarsson
October 9, 2005, 10:08 am
Re: how to get the list of words from html files
No it doesn't. You should read the surrounding text more carefully.
Did you copy/paste that code? That "vertical bar" (the "or" operator)
isn't the correct vertical bar character.
There does not exist a regular expression that is "right" for
reliably removing HTML markup.
You might be able to find a regex that is "good enough", knowing
that it will occasionally fail. Only you can decide how robust
it must be for your application.
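
As a rough illustration of "good enough, but occasionally fails", here is a minimal sketch. The sub name strip_tags_naive and the sample strings are made up for this example; this is not the code from the original post.

```perl
use strict;
use warnings;

# Naive tag stripper (hypothetical helper): deletes everything
# from a '<' up to the next '>'.
sub strip_tags_naive {
    my ($html) = @_;
    $html =~ s/<[^>]*>//g;
    return $html;
}

# Simple markup comes out clean.
print strip_tags_naive('<p>Hello <b>world</b></p>'), "\n";
# prints: Hello world

# A '>' inside an attribute value ends the match too early,
# so part of the tag leaks into the output.
print strip_tags_naive('<img alt="a > b" src="x.png">text'), "\n";
# prints:  b" src="x.png">text
```

Comments, scripts, and CDATA sections trip it up in similar ways, so whether this is acceptable depends on the HTML you expect to see.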
(There is no need to backslash the double quote character there.)
To remove "some punctuation" you mean. There are lots of punctuation
characters that you do not remove.
You might want to turn it around to say what characters you want
to keep, rather than what characters you want to discard...
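
A sketch of the "say what to keep" idea, assuming you want only letters, digits, and whitespace. That keep-list and the sub name keep_word_chars are assumptions for illustration; adjust the list to whatever your application counts as part of a word.

```perl
use strict;
use warnings;

# Hypothetical keep-list: ASCII letters, digits, and whitespace.
sub keep_word_chars {
    my ($text) = @_;
    # The leading ^ negates the class: delete every character NOT listed.
    $text =~ s/[^A-Za-z0-9\s]//g;
    return $text;
}

print keep_word_chars(q{Hello, "world"! It's 2005.}), "\n";
# prints: Hello world Its 2005
```

Note that the apostrophe in "It's" is dropped as well, which may or may not be what you want for a word list.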
You don't even need (or want) regular expressions for
replacing "characters" (rather than "strings").
For replacing characters you probably should use:
perldoc -f tr
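
For readers who have not used tr before, a minimal sketch of deleting characters with it follows. The character set and the sub name delete_punct are hypothetical, chosen only to show the syntax; they are not taken from the s///g under discussion.

```perl
use strict;
use warnings;

# Hypothetical set of punctuation characters to delete.
sub delete_punct {
    my ($text) = @_;
    # With /d and an empty replacement list, tr deletes every
    # listed character. The characters are literal, not a regex.
    $text =~ tr/.,;:!?"//d;
    return $text;
}

print delete_punct('Wait, what?! "Yes": done.'), "\n";
# prints: Wait what Yes done

# Without /d and with an empty replacement list, tr only counts
# the listed characters and leaves the string unchanged.
my $line  = 'a, b, c.';
my $count = ($line =~ tr/.,//);
print "$count\n";
# prints: 3
```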
Here's one that removes the same characters as your s///g does:
Tad McClellan SGML consulting
email@example.com Perl programming
Fort Worth, Texas