Parsing out text from in between HTML tags

Hello -

I'm new to perl and am having a tough time trying to complete a
theoretically simple statement.  What I'm trying to do is write a very
simple search engine that searches an html file for a given
searchQuery.  The way it's set up now is that if the searchQuery is
something like "java," every single page is a hit because the word
"javascript" is in the code in the form of the "<script
language="javascript">" etc.  I want to specify that $searchQuery
should be surrounded like so:


In other words, the searchQuery has to be in between two HTML tags.
Here's what I have at this point (the wrong way):

return unless ($fileName =~ /\Q$searchQuery\E/i);

Any help would be greatly appreciated!


Re: Parsing out text from in between HTML tags

Most likely you'll only get a partly working solution if you approach
this problem with a regular expression. There are all kinds of
things that can go wrong. I'd leave the parsing of the HTML to
a module that knows what it's doing, like HTML::TreeBuilder.

use HTML::TreeBuilder;

my $t = HTML::TreeBuilder->new_from_file( "input.html" );
# or my $t = HTML::TreeBuilder->new_from_content( $html );
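# as_text returns only the text content, without tags or attributes.
# Note that index() is case-sensitive, unlike your /i match.
if( index( $t->as_text, $searchQuery ) >= 0 ) {
   # ... found ...
}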

Using this module, you could also search for your query in
different attributes, e.g. link titles:

my $foundlinks = $t->look_down(
   '_tag',  'a',
   'title', qr/\Q$searchQuery\E/   # \Q...\E so the query is matched literally
);
if( $foundlinks ) {
   # ... had a hit ...
}
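
To tie this back to the original "java" vs. "javascript" problem, here is a minimal sketch of the text-only search, assuming HTML::TreeBuilder is installed. The helper name page_matches and the sample $page string are made up for illustration. The match is case-insensitive, like the /i in the original attempt.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: true if $query occurs in the text content of $html.
sub page_matches {
    my ($html, $query) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $text = $tree->as_text;   # text nodes only, no tag names or attributes
    $tree->delete;               # free the tree explicitly
    return $text =~ /\Q$query\E/i ? 1 : 0;
}

# Made-up sample page: "javascript" appears only inside a tag attribute.
my $page = '<html><head><script language="javascript"></script></head>'
         . '<body><p>Java tutorials</p></body></html>';

print "java:   ", ( page_matches($page, 'java')   ? "hit" : "miss" ), "\n";
print "script: ", ( page_matches($page, 'script') ? "hit" : "miss" ), "\n";
```

With this sample page, "java" should be found in the paragraph text, while "script" should not match, since it only occurs in markup.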

