# Needs help with Matching Logic

I am comparatively new to Perl.
I am working out the logic to display snippets of text that match a
'keyword' in a text file, just as Google does in its search
results.

I have the content of the text file in the variable \$file_content.
And I have the 'keyword' in \$keyword.

I need to build a string like the one Google shows when displaying search
results.
When I match the \$keyword in the \$file_content, I want to also pull 5
words before and 5 words after so I can show that snippet of the file
where the matching of the keyword occurs.

I searched in the google groups for a few days, but couldn't find
anything to help me.

I really appreciate any help I can get.

Thanks!
Kishore

## Re: Needs help with Matching Logic

On Tue, 20 Jul 2004, Kishore wrote:

> I am comparatively a newbie in Perl.
> I am working a logic to display the snippets matched results of a
> 'keyword' from a text file just like google would do in the search
> results.
>
> I have the content of the text file in the variable \$file_content.
> And I have the 'keyword' in \$keyword.
>
> I need to get the string like google does when displaying the search
> results..
> When I match the \$keyword in the \$file_content, I want to also pull 5
> words before and 5 words after so I can show that snippet of the file
> where the matching of the keyword occurs.
>
> I searched in the google groups for a few days, but couldn't find
> anything to help me.
>
> I really appreciate any help I can get.

m/((?:\S+\s+){0,5})(\$keyword)((?:\s+\S+){0,5})/

Using that, \$1 is the series of up to five words before the match, \$2 is
the match, and \$3 is the series of up to five words after the match.

It'd probably have to be tweaked a bit to get exactly what you want, but
it should at least give you a starting point.
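As a minimal sketch of how that pattern could be used, assuming a `{0,5}` quantifier on each non-capturing group (which the "up to five words" description implies) and some made-up sample text in place of the real file content:

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the poster's variables.
my $file_content = "one two three four five six seven KEY eight nine ten eleven twelve thirteen";
my $keyword      = "KEY";

# {0,5} lets each group repeat up to five times:
# "word + whitespace" before the keyword, "whitespace + word" after it.
my ($before, $match, $after) =
    $file_content =~ m/((?:\S+\s+){0,5})($keyword)((?:\s+\S+){0,5})/;

# In list context a failed match returns an empty list, so $match is undef.
my $snippet = defined $match ? "...$before$match$after..." : "";

print "$snippet\n";
# prints: ...three four five six seven KEY eight nine ten eleven twelve...
```

Here `$keyword` is interpolated as a regex, so this sketch only behaves as shown for keywords without metacharacters.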

Paul Lalli

## Re: Needs help with Matching Logic

>
> m/((?:\S+\s+){0,5})(\$keyword)((?:\s+\S+){0,5})/
>
> Using that, \$1 is the series of up to five words before the match, \$2 is
> the match, and \$3 is the series of up to five words after the match.
>

It works really great.

Thank you very much.

What is the colon (:) for? I don't believe I saw this in the books I have
been referring to so far.

Thanks!
- Kishore.

## Re: Needs help with Matching Logic

> > how about something like:
> >
> > m/((?:\S+\s+){0,5})(\$keyword)((?:\s+\S+){0,5})/
> >
>
> It works really great.
>
> What is the colon (:) for? I don't believe I saw this in the books I have
> been referring to so far.

(?:...)

look up 'Extended Patterns' in
perldoc perlre

gnari

## Re: Needs help with Matching Logic

Quoth krishnakishore.r.challa.lzi1@statefarm.com (Kishore):
> > how about something like:
> >
> > m/((?:\S+\s+){0,5})(\$keyword)((?:\s+\S+){0,5})/
>
> What is the colon (:) for? I don't believe I saw this in the books I have
> been referring to so far.

The construction is (?: ... ), to be contrasted with ( ... ); it modifies
the parens so that they just group without capturing. See perldoc
perlre or perldoc perlretut.

[as a side note, I would *always* use /x on a regex with (?:) in, just
because things get lost:

/( (?: \S+\s+ ){0,5} ) (\$keyword) ( (?: \s+\S+ ){0,5} )/x

]
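To make the grouping-versus-capturing difference concrete, here is a small sketch with made-up sample text. The only change between the two patterns is `( ... )` versus `(?: ... )` around the repeated part:

```perl
use strict;
use warnings;

my $text = "alpha beta gamma";

# Plain parentheses group AND capture: the inner group becomes $2
# and holds whatever its last repetition matched.
my @capturing = $text =~ /((\S+\s+){2})(gamma)/;

# (?: ... ) only groups, so it does not take up a capture number.
my @grouping = $text =~ /((?:\S+\s+){2})(gamma)/;

print scalar(@capturing), " captures: ", join("|", @capturing), "\n";
# prints: 3 captures: alpha beta |beta |gamma
print scalar(@grouping), " captures: ", join("|", @grouping), "\n";
# prints: 2 captures: alpha beta |gamma
```

With the non-capturing form, the keyword stays in `$2` no matter how the repeated group is written.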

Ben

--
"If a book is worth reading when you are six,                * ben@morrow.me.uk
it is worth reading when you are sixty." - C.S.Lewis

## Re: Needs help with Matching Logic

>
> m/((?:\S+\s+){0,5})(\$keyword)((?:\s+\S+){0,5})/
>
> Using that, \$1 is the series of up to five words before the match, \$2 is
> the match, and \$3 is the series of up to five words after the match.

Note that if \$keyword is supposed to be a plain string rather than a
regex, you'll need to escape metacharacters in it.  An easy way to do
this is:

m/((?:\S+\s+){0,5})(\Q\$keyword\E)((?:\s+\S+){0,5})/
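A short illustration of why the escaping matters, using a made-up keyword that contains a regex metacharacter:

```perl
use strict;
use warnings;

# Hypothetical keyword containing a metacharacter (".").
my $keyword = "a.c";
my $text    = "see abc and then a.c here";

# Unescaped: "." matches any character, so "abc" is found first.
my ($raw) = $text =~ /($keyword)/;

# \Q...\E escapes the metacharacters, so only a literal "a.c" matches.
my ($quoted) = $text =~ /(\Q$keyword\E)/;

print "raw: $raw, quoted: $quoted\n";
# prints: raw: abc, quoted: a.c
```

`\Q...\E` applies the same escaping as the built-in `quotemeta` function.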

Also, this regex can be optimized a bit by noting that the only way \$1
can contain less than 5 words is if the match occurs at the very
beginning of the string.  Separating that special case, we get:

m/((?:\S+\s+){5}|^\s*(?:\S+\s+){0,4})(\Q\$keyword\E)((?:\s+\S+){0,5})/

This is noticeably faster if the first occurrence of \$keyword isn't
near the beginning, since it saves the regex engine some needless
backtracking.
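The following is only a sanity check, not a benchmark. It assumes the simple form uses `{0,5}` and the split form uses `{5}` for the general case plus an anchored `{0,4}` for the start-of-string case, which is one reading of the description above. The sample strings are made up.

```perl
use strict;
use warnings;

my $keyword = "KEY";

# Assumed quantifiers: {0,5} in the simple form; {5} or an anchored
# {0,4} in the split form.
my $simple = qr/((?:\S+\s+){0,5})(\Q$keyword\E)((?:\s+\S+){0,5})/;
my $split  = qr/((?:\S+\s+){5}|^\s*(?:\S+\s+){0,4})(\Q$keyword\E)((?:\s+\S+){0,5})/;

my (@simple_before, @split_before);
for my $text ("w1 w2 w3 w4 w5 w6 w7 KEY w8", "w1 w2 KEY w3") {
    # List context: take only the first capture (the leading context).
    my ($s) = $text =~ $simple;
    my ($o) = $text =~ $split;
    push @simple_before, $s;
    push @split_before,  $o;
    print "simple: [$s]  split: [$o]\n";
}
```

For these two inputs both patterns capture the same leading words. The two forms are not identical in every case: with leading whitespace at the start of the string, the `^\s*` branch includes that whitespace in `$1`.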

Also note that, if you use global matching to extract multiple
snippets from the text, the results can be unexpected if there are
multiple occurrences of \$keyword near each other.  In particular, if
there are less than 5 words between two occurrences, the second one
will be swallowed in the 5 words matched after the first one.

The easiest way to fix that is to use negative look-ahead:

m/((?:\S+\s+){0,5}?)(\Q\$keyword\E)((?:\s+(?!\Q\$keyword\E)\S+){0,5})/g

Oddly enough, optimizing this regex the same way as before doesn't
seem to help, and seems to tickle a perl bug (probably related to \G
handling?) when used in scalar context.
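A sketch of the swallowing problem and the look-ahead fix, using made-up text with two occurrences close together. The `{0,5}` and `{0,5}?` quantifiers and the helper `snippets` are assumptions for illustration:

```perl
use strict;
use warnings;

my $keyword = "KEY";
my $text    = "a b c KEY d e KEY f g h i j k";

# Hypothetical helper: one "before + keyword + after" string per match.
sub snippets {
    my ($string, $re) = @_;
    my @found;
    while ($string =~ /$re/g) {
        push @found, "$1$2$3";
    }
    return @found;
}

# Plain version: the trailing group happily consumes the second KEY.
my $plain = qr/((?:\S+\s+){0,5})(\Q$keyword\E)((?:\s+\S+){0,5})/;

# Guarded version: (?!...) stops the trailing group before another KEY,
# and the lazy {0,5}? keeps the leading group from skipping over a KEY.
my $guarded = qr/((?:\S+\s+){0,5}?)(\Q$keyword\E)((?:\s+(?!\Q$keyword\E)\S+){0,5})/;

my @plain_hits   = snippets($text, $plain);
my @guarded_hits = snippets($text, $guarded);

print "plain:   [$_]\n" for @plain_hits;    # one snippet
print "guarded: [$_]\n" for @guarded_hits;  # two snippets
```

With the guarded pattern the second snippet has no leading context, because the words before it were already consumed by the first match.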

Oh, and you probably want case-insensitive matching, and should
probably allow punctuation around \$keyword, something like:

m/((?:\w+\W+){0,5})(\Q\$keyword\E)((?:\W+\w+){0,5})/i

or (optimized):

m/((?:\w+\W+){5}|^\W*(?:\w+\W+){0,4})(\Q\$keyword\E)((?:\W+\w+){0,5})/i

or for global matching:

m/((?:\w+\W+){0,5}?)(\Q\$keyword\E)((?:\W+(?!\Q\$keyword\E)\w+){0,5})/ig
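To show what the `\w`/`\W` form buys when the keyword is wrapped in punctuation, here is a sketch comparing it with the whitespace-based form on a made-up sentence (again assuming `{0,5}` quantifiers):

```perl
use strict;
use warnings;

my $keyword = "perl";
my $text    = "The quick, brown fox (Perl!) jumps over lazy dogs.";

# Whitespace-based: "(" and "!" touch the keyword, so no context is picked up.
my @by_space = $text =~ /((?:\S+\s+){0,5})(\Q$keyword\E)((?:\s+\S+){0,5})/i;

# Word-based: any run of non-word characters may separate the words.
my @by_word = $text =~ /((?:\w+\W+){0,5})(\Q$keyword\E)((?:\W+\w+){0,5})/i;

print "by space: [", join("", @by_space), "]\n";
# prints: by space: [Perl]
print "by word:  [", join("", @by_word), "]\n";
# prints: by word:  [The quick, brown fox (Perl!) jumps over lazy dogs]
```

The `/i` flag is what lets the lowercase keyword match "Perl" here.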

--
Ilmari Karonen

## Re: Needs help with Matching Logic

> >
> > m/((?:\S+\s+){0,5})(\$keyword)((?:\s+\S+){0,5})/
> >
> > Using that, \$1 is the series of up to five words before the match, \$2 is
> > the match, and \$3 is the series of up to five words after the match.
>
> Note that if \$keyword is supposed to be a plain string rather than a
> regex, you'll need to escape metacharacters in it.  An easy way to do
> this is:
>
>   m/((?:\S+\s+){0,5})(\Q\$keyword\E)((?:\s+\S+){0,5})/

> Also note that, if you use global matching to extract multiple
> snippets from the text, the results can be unexpected if there are
> multiple occurrences of \$keyword near each other.  In particular, if
> there are less than 5 words between two occurrences, the second one
> will be swallowed in the 5 words matched after the first one.
>
> The easiest way to fix that is to use negative look-ahead:
>
>   m/((?:\S+\s+){0,5}?)(\Q\$keyword\E)((?:\s+(?!\Q\$keyword\E)\S+){0,5})/g

Er, no, it would be easier and more idiomatic to put the third capture
inside a look-ahead:

m/((?:\S+\s+){0,5}?)(\Q\$keyword\E)(?=((?:\s+\S+){0,5}))/g


## Re: Needs help with Matching Logic

>>
>> Also note that, if you use global matching to extract multiple
>> snippets from the text, the results can be unexpected if there are
>> multiple occurrences of \$keyword near each other.  In particular, if
>> there are less than 5 words between two occurrences, the second one
>> will be swallowed in the 5 words matched after the first one.
>>
>> The easiest way to fix that is to use negative look-ahead:
>>
>>   m/((?:\S+\s+){0,5}?)(\Q\$keyword\E)((?:\s+(?!\Q\$keyword\E)\S+){0,5})/g
>
> Er, no, it would be easier and more idiomatic to put the third capture
> inside a look-ahead:
>
> m/((?:\S+\s+){0,5}?)(\Q\$keyword\E)(?=((?:\s+\S+){0,5}))/g

Those two don't do the same thing.  With your version the snippets may
overlap, with mine they can't.  Deciding which solution is better is
really up to the OP.
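The difference can be seen on a small made-up input. This sketch assumes `{0,5}` and `{0,5}?` quantifiers in both patterns; the helper `snippets` is hypothetical:

```perl
use strict;
use warnings;

my $keyword = "KEY";
my $text    = "a b c KEY d e KEY f g h i j k";

# Hypothetical helper: one "before + keyword + after" string per match.
sub snippets {
    my ($string, $re) = @_;
    my @found;
    while ($string =~ /$re/g) {
        push @found, "$1$2$3";
    }
    return @found;
}

# Trailing context captured inside (?=...): it is not consumed, so the
# next match may reuse those words as its leading context.
my $lookahead = qr/((?:\S+\s+){0,5}?)(\Q$keyword\E)(?=((?:\s+\S+){0,5}))/;

# Trailing context consumed, but stopped before the next keyword.
my $negative = qr/((?:\S+\s+){0,5}?)(\Q$keyword\E)((?:\s+(?!\Q$keyword\E)\S+){0,5})/;

my @overlapping = snippets($text, $lookahead);
my @disjoint    = snippets($text, $negative);

print "look-ahead capture:  [$_]\n" for @overlapping;
print "negative look-ahead: [$_]\n" for @disjoint;
```

On this input the first pattern yields `a b c KEY d e KEY f g` and `d e KEY f g h i j`, which share words, while the second yields `a b c KEY d e` and `KEY f g h i j`, which do not.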

--
Ilmari Karonen

## Re: Needs help with Matching Logic

>
> Oh, and you probably want case-insensitive matching, and should
> probably allow punctuation around \$keyword, something like:
>
>   m/((?:\w+\W+){0,5})(\Q\$keyword\E)((?:\W+\w+){0,5})/i
>

I was having problems with punctuation.
This code solved the problem.
Thanks very much.