RegEx - matching previous match

Posted by j ellings on February 27, 2008, 5:12 pm
Hello.

I have an html file converted from PDF that includes the following
sample lines:

(html has been converted)

<i><b>Z & A Newsstand</b></i><br>
<i>Retail Food: Mobile Food Vendor</i><br>
<i>2 N 10th St</i><br>
<i>Philadelphia, PA 19107</i><br>
<b>Inspection Date</b><br>
<i>4/11/07</i><br>
No Critical Violations<br>
<i>4/11/07</i><br>
No Critical Violations<br>
<i>11/28/06</i><br>
No Critical Violations<br>
<i>4/24/06</i><br>
No Critical Violations<br>
<i><b>Newstand</b></i><br>
<i>Retail Food: Mobile Food Vendor</i><br>
<i>32 N 10th St</i><br>
<i>Philadelphia, PA 19107</i><br>
<b>Inspection Date</b><br>
<i>7/2/07</i><br>
No Critical Violations<br>
<i><b>Pudgies Deli</b></i><br>
<i>Retail Food: Restaurant, Eat-in</i><br>
<i>46 N 10th St</i><br>
<i>Philadelphia, PA 19107</i><br>
<b>Inspection Date</b><br>
<i>1/11/07</i><br>
No Critical Violations<br>
<i>9/25/06</i><br>
No Critical Violations<br>
<i>8/7/06</i><br>
No Critical Violations<br>


I am trying to capture the information between the <i><b>
tags as these are the only unique delimiters between entries.

My regex is as follows:

while ($html =~ mgs) {
#do something
}

Unfortunately, the regex will match the first instance (Z & A
Newsstand), but ignore the second (Newstand) and then match on the
third (Pudgies Deli).

I can see that the match is working according to what I wrote; I am
trying to fine tune it so that I can grab every match. Is there a way
to include the previous <i><b> in the next match such that
it will not skip a potential match?
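
The pattern inside the loop above did not survive archiving. As a minimal sketch of the symptom, assuming the pattern matched from one opening <i><b> pair to the next (the pattern, the trimmed sample data, and the fourth entry "Example Cafe" are all assumptions for illustration), the following shows why every other entry is skipped: the second <i><b> is consumed by the match, so the next search starts after it.

```perl
use strict;
use warnings;

# Hypothetical sample in the shape of the data above; the fourth entry is invented.
my $html = join '',
    "<i><b>Z & A Newsstand</b></i><br>\n<i>2 N 10th St</i><br>\n",
    "<i><b>Newstand</b></i><br>\n<i>32 N 10th St</i><br>\n",
    "<i><b>Pudgies Deli</b></i><br>\n<i>46 N 10th St</i><br>\n",
    "<i><b>Example Cafe</b></i><br>\n<i>50 N 10th St</i><br>\n";

# Assumed pattern: the trailing <i><b> is part of the match, so it is
# consumed and cannot serve as the opening delimiter of the next entry.
my @matched;
while ( $html =~ m/<i><b>(.*?)<i><b>/gs ) {
    my $entry = $1;
    my ($name) = $entry =~ m{\A(.*?)</b>}s;   # text up to the first </b>
    push @matched, $name;
}

print "$_\n" for @matched;   # prints "Z & A Newsstand" and "Pudgies Deli"
```

With this sample, "Newstand" is skipped because its opening tags were consumed as the end of the first match, and "Example Cafe" is skipped because no further <i><b> follows it.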

Any suggestions or advice would be most appreciated.

John


Posted by Gunnar Hjalmarsson on February 27, 2008, 8:21 pm
j ellings wrote:
>
> (html has been converted)

Yes, but why on earth did you post the data in that format?

<non-html data snipped>

> I am trying to capture the information between the &lt;i&gt;&lt;b&gt;
> tags as these are the only unique delimiters between entries.
>
> My regex is as follows:
>
> while ($html =~ mgs) {
> #do something
> }
>
> Unfortunately, the regex will match the first instance( Z &amp; A
> Newsstand), but ignore the second (Newstand) and then match on the
> third (Pudgies Deli).
>
> I can see that the match is working according to what I wrote; I am
> trying to fine tune it so that I can grab every match. Is there a way
> to include the previous &lt;i&gt;&lt;b&gt; in the next match such that
> it will not skip a potential match?

A zero-width positive look-ahead assertion may be what you are after;
see "perldoc perlre".

while ($html =~ mgs) {
---------------------------------^^^------^
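
The pattern in the line above was lost in archiving, but the caret marks suggest a (?=...) group around the second delimiter. A minimal sketch of that idea, assuming the entries are delimited by <i><b> as described (the exact pattern and the trimmed sample data are assumptions; the |\z alternative is an addition so that the last entry is captured too):

```perl
use strict;
use warnings;

# Hypothetical sample in the shape of the posted data.
my $html = join '',
    "<i><b>Z & A Newsstand</b></i><br>\n<i>2 N 10th St</i><br>\n",
    "<i><b>Newstand</b></i><br>\n<i>32 N 10th St</i><br>\n",
    "<i><b>Pudgies Deli</b></i><br>\n<i>46 N 10th St</i><br>\n";

my @entries;
# (?=<i><b>|\z) only checks that the next delimiter (or end of input)
# follows; it is not consumed, so the next match can start on it.
while ( $html =~ m/<i><b>(.*?)(?=<i><b>|\z)/gs ) {
    push @entries, $1;
}

print scalar(@entries), " entries captured\n";
```

Because the look-ahead is zero-width, the match position stays in front of the next <i><b>, and no entry is skipped.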

Another approach that doesn't slurp the whole file into a scalar variable:

local $/ = '<i><b>';            # input record separator: one read per entry
while ( my $html = <> ) {       # each read ends at the next <i><b>
    #do something
}
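
A runnable sketch of the record-separator approach, under a few assumptions: an in-memory filehandle stands in for the real file, the sample data is trimmed, and the chomp and empty-record check are additions that are not in the snippet above.

```perl
use strict;
use warnings;

# Hypothetical sample data; an in-memory filehandle replaces the real file.
my $data = "<i><b>Z & A Newsstand</b></i><br>\n<i>2 N 10th St</i><br>\n"
         . "<i><b>Newstand</b></i><br>\n<i>32 N 10th St</i><br>\n";
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";

my @records;
{
    local $/ = '<i><b>';            # each read stops after the next <i><b>
    while ( my $record = <$fh> ) {
        chomp $record;              # chomp strips a trailing $/ if present
        next if $record eq '';      # nothing precedes the first delimiter
        push @records, $record;
    }
}
close $fh;

print scalar(@records), " records read\n";
```

Each record then holds one entry without its opening <i><b>, and only one entry is in memory at a time.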

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl

Posted by j ellings on February 27, 2008, 9:09 pm

>
> A zero-width positive look-ahead assertion may be what you are after;
> see "perldoc perlre".
>
> while ($html =~ mgs) {
> ---------------------------------^^^------^
>
> Another approach that doesn't slurp the whole file into a scalar variable:
>
> local $/ = '<i><b>';
> while ( my $html = <> ) {
> #do something
> }
>
> --
> Gunnar Hjalmarsson

Thanks Gunnar, this worked perfectly; apologies for the formatting.

Posted by Tad J McClellan on February 27, 2008, 8:21 pm
> Hello.
>
> I have an html file converted from PDF that includes the following
> sample lines:
>
> (html has been converted)


Why has HTML been converted?

This is a plain-text medium...


> &lt;i&gt;&lt;b&gt;Z &amp; A Newsstand&lt;/b&gt;&lt;/i&gt;&lt;br&gt;
^^ ^^
^^ ^^


> My regex is as follows:
>
> while ($html =~ mgs) {


End tags have slash characters in them that your pattern will not match.

Your data closes the bold before the italic, but your regex looks
for the italic close before the bold close.


> I can see that the match is working according to what I wrote;


You have a strange definition of "working" then...


> trying to fine tune it so that I can grab every match. Is there a way
> to include the previous &lt;i&gt;&lt;b&gt; in the next match such that
> it will not skip a potential match?


You do not need a way to include the previous <i><b> in the next match.


> Any suggestions or advice would be most appreciated.


while ($html =~ mgs) {
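
The pattern in the line above was also lost in archiving. Going by the remarks about end tags, one plausible reading is a match from the opening <i><b> to the closing </b></i>, which captures only the name of each entry. A minimal sketch under that assumption (the pattern and the trimmed sample data are illustrative, not the original code):

```perl
use strict;
use warnings;

# Hypothetical sample in the shape of the posted data.
my $html = join '',
    "<i><b>Z & A Newsstand</b></i><br>\n<i>2 N 10th St</i><br>\n",
    "<i><b>Newstand</b></i><br>\n<i>32 N 10th St</i><br>\n",
    "<i><b>Pudgies Deli</b></i><br>\n<i>46 N 10th St</i><br>\n";

my @names;
# m{...} avoids escaping the slashes in the end tags; the bold is closed
# before the italic, as in the data.
while ( $html =~ m{<i><b>(.*?)</b></i>}gs ) {
    push @names, $1;
}

print "$_\n" for @names;
```

This captures the bold-italic heading of every entry, but not the address and inspection lines that follow it.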


--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.noitatibaher0cmdat/"

Posted by j ellings on February 27, 2008, 9:22 pm
>
> You do not need a way to include the previous <i><b> in the next match.
>
> > Any suggestions or advice would be most appreciated.
>
> while ($html =~ mgs) {
>
> --
> Tad McClellan
> email: perl -le "print scalar reverse qq/moc.noitatibaher0cmdat/"

Tad,

Thanks for the suggestion. Your regex matches from the opening <i><b>
tags of an entry to that entry's closing tags; what I needed was to
match from one opening <i><b> pair to the next. My original regex did
capture between two opening pairs, but it skipped every other entry.
