Regex Nested Backreferences

For my web-based PHP regex find/replace do-hickey, I need to match
individual back references and wrap a tag around each one so it stands out
from the rest of the match for individual color markup.  Initially this
would seem easy enough, however not all of a potential regex match is
going to be within a back reference.  So it's necessary to replace the
back reference, and only the back reference, while preserving the context
of the match.  For example, if I were to search the text

fish this fish fish

looking for
.*?(?<=this )(fish).*

I'd match everything, capturing the second instance of fish into the back
reference.  I can't simply take the match and run a replace for fish in
order to apply the highlighting, because then I'd end up with 3
highlighted "fish", 2 of which weren't supposed to be.  I also couldn't
simply return the back reference with the markup, as that wouldn't return
the non-back referenced stuff.
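To make the over-highlighting concrete, here is a minimal sketch of the naive approach. The helper name naive_highlight and the [ ] markers are made up for illustration; they stand in for whatever markup gets applied.

```php
<?php
// Hypothetical helper showing the naive approach described above.
function naive_highlight(string $pattern, string $text): string
{
    if (!preg_match($pattern, $text, $m)) {
        return $text;
    }
    // Replaces the captured text wherever it occurs in the whole match,
    // not just at the position where the group actually matched.
    return str_replace($m[1], '[' . $m[1] . ']', $m[0]);
}

echo naive_highlight('/.*?(?<=this )(fish).*/', 'fish this fish fish'), "\n";
// [fish] this [fish] [fish]
```

All three "fish" get marked even though only the one after "this " was captured.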

My initial solution was to run the original find text over the match to
get the back references, using an extra flag to have it return the offset
of each back reference.  So now I have the location of the text within the
string, and can get its length from the captured text itself.  Going
backwards so as not to mess with the numeric locations within the string,
it wraps the back references without losing context or data.  Perfect.
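The offset approach could look roughly like the sketch below. It assumes the "extra flag" is PREG_OFFSET_CAPTURE, that the groups are not nested, and that they appear left to right in the subject. The function name and the <gN> pseudo-markup are placeholders, not the code from this post.

```php
<?php
// Hypothetical helper: wraps each capture group in <gN>...</gN> pseudo-markup.
// Assumes groups are not nested and occur in left-to-right order.
function wrap_groups_backwards(string $pattern, string $text): string
{
    if (!preg_match($pattern, $text, $m, PREG_OFFSET_CAPTURE)) {
        return $text;
    }
    // Walk the groups from last to first so earlier offsets stay valid.
    for ($n = count($m) - 1; $n >= 1; $n--) {
        [$captured, $offset] = $m[$n];
        if ($offset < 0) {
            continue; // group did not take part in the match
        }
        $text = substr_replace(
            $text,
            "<g$n>" . $captured . "</g$n>",
            $offset,
            strlen($captured)
        );
    }
    return $text;
}

echo wrap_groups_backwards('/.*?(?<=this )(fish).*/', 'fish this fish fish'), "\n";
// fish this <g1>fish</g1> fish
```

Because each replacement happens to the right of every offset still waiting to be used, none of the remaining offsets shift.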

. . . until back references are nested.

In this example:
(.*?(?<=this )(fish).*)

back reference 1 would be fish this fish fish, back reference 2 would be  
fish -- here's where the problem surfaces.
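For reference, a small sketch that prints what the offsets look like for the nested pattern, again assuming PREG_OFFSET_CAPTURE. The helper name group_spans is made up for this example.

```php
<?php
// Hypothetical helper: returns [group number => [start offset, length, text]].
function group_spans(string $pattern, string $text): array
{
    $spans = [];
    if (preg_match($pattern, $text, $m, PREG_OFFSET_CAPTURE)) {
        foreach ($m as $n => [$captured, $offset]) {
            $spans[$n] = [$offset, strlen($captured), $captured];
        }
    }
    return $spans;
}

$spans = group_spans('/(.*?(?<=this )(fish).*)/', 'fish this fish fish');
foreach ($spans as $n => [$offset, $length, $captured]) {
    echo "group $n: offset $offset, length $length, text \"$captured\"\n";
}
// group 0: offset 0, length 19, text "fish this fish fish"
// group 1: offset 0, length 19, text "fish this fish fish"
// group 2: offset 10, length 4, text "fish"
```

Group 2 sits entirely inside group 1, so any markup added for one changes the length the other was measured against.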

If I wrap back reference 2 in the markup, then when I apply back reference
1's markup the end tag lands in the wrong place, since the string has grown
and the originally calculated length no longer applies.  If I replace back
reference 1 first, same problem.  Having exhausted a bunch of complex
attempts to compensate for it, I'm sure there's some obvious, simple
solution I'm overlooking.  Any fresh perspectives on the best way to mark
up nested groups while preserving the integrity of the return?
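The drift can be reproduced with a short sketch. It inserts tags at the offsets from the original match with no correction for earlier inserts; the helper name and the <gN> pseudo-markup are invented for the demonstration.

```php
<?php
// Hypothetical helper: inserts tags using offsets from the original match,
// processing groups in the given order, without adjusting for earlier inserts.
function wrap_uncorrected(string $text, array $m, array $order): string
{
    foreach ($order as $n) {
        [$captured, $offset] = $m[$n];
        $end = $offset + strlen($captured);
        $text = substr_replace($text, "</g$n>", $end, 0);   // close tag first
        $text = substr_replace($text, "<g$n>", $offset, 0); // then open tag
    }
    return $text;
}

$subject = 'fish this fish fish';
preg_match('/(.*?(?<=this )(fish).*)/', $subject, $m, PREG_OFFSET_CAPTURE);

echo wrap_uncorrected($subject, $m, [2, 1]), "\n";
// <g1>fish this <g2>fish<</g1>/g2> fish
echo wrap_uncorrected($subject, $m, [1, 2]), "\n";
// <g1>fish t<g2>his </g2>fish fish</g1>
```

Either order puts at least one tag in the wrong spot, because the second group's offsets still describe the string as it was before the first group's tags went in.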

Below is the function the matches are being passed through.  You'll see I'm
using preg_match_all to get the capture groups as well as the match
location, and then using substr_replace to insert the pseudo-markup.

function hltr($text,$find) {
   if ( isset($_POST['debug']) || isset($_GET['debug']) ) {
     echo "<pre>";
     // [...]
     echo "</pre>";
   }
   // [...]
   $text = $hlight[0][0][0];
   while ( $n > 0 ) {
     $text = /* [...] */;
   }
   return('<strong class="result">'.$text.'</strong>');
}

To see it highlight backreferences correctly:
And failing on nested groups

Thanks . . .


Re: Regex Nested Backreferences

Re: Regex Nested Backreferences

I don't believe you read my message, Bob -- I'm not asking for help with
regex, I know regex.  My problem is that I'm trying to take a regex and
highlight various aspects of the match, in this case the different sub
groups.  Had you read the post, you'd have seen that what I'm working on
can do everything the tool you linked to does, and more.  Thanks


Re: Regex Nested Backreferences

I skimmed. I saw you wanted to do some highlighting of regex matches.
This guy (Rob Locher) wrote a nice regex highlighter. Thought you could
possibly get something useful out of it (i.e. analyze his algorithm).
You're welcome anyway.

Re: Regex Nested Backreferences

I'd have appreciated that explanation -- at any rate, I'm sorry for my
curt response, I'd spent too many hours with code to be any good with
people.  I did put together a solution; the working model is linked
below.  I might have to check out his source to see if there's anything I
can glean from it anyway. Thanks.
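For anyone landing here with the same problem, one way to handle nested groups (not necessarily what the linked tool does) is to never modify the string between inserts: collect every open and close tag keyed by its offset in the original string, then build the output in a single pass. The sketch below assumes PREG_OFFSET_CAPTURE, groups that are either properly nested or disjoint, and the same made-up <gN> pseudo-markup; it skips empty captures and does no HTML escaping.

```php
<?php
// Hypothetical helper, not the poster's code: wraps every non-empty capture
// group in <gN>...</gN>, using only offsets taken from the original match.
function wrap_nested_groups(string $pattern, string $text): string
{
    if (!preg_match($pattern, $text, $m, PREG_OFFSET_CAPTURE)) {
        return $text;
    }

    // Collect [start, end, group number] for each group that captured text.
    $spans = [];
    for ($n = 1; $n < count($m); $n++) {
        [$captured, $offset] = $m[$n];
        if ($offset >= 0 && $captured !== '') {
            $spans[] = [$offset, $offset + strlen($captured), $n];
        }
    }

    // Outer groups first: earlier start, then later end, then lower number.
    usort($spans, fn($a, $b) => [$a[0], -$a[1], $a[2]] <=> [$b[0], -$b[1], $b[2]]);

    $open = [];
    $close = [];
    foreach ($spans as [$start, $end, $n]) {
        $open[$start] = ($open[$start] ?? '') . "<g$n>";  // outer opens first
        $close[$end] = "</g$n>" . ($close[$end] ?? '');   // inner closes first
    }

    // Single pass over the untouched original string.
    $out = '';
    $len = strlen($text);
    for ($i = 0; $i <= $len; $i++) {
        $out .= ($close[$i] ?? '') . ($open[$i] ?? '');
        if ($i < $len) {
            $out .= $text[$i];
        }
    }
    return $out;
}

echo wrap_nested_groups('/(.*?(?<=this )(fish).*)/', 'fish this fish fish'), "\n";
// <g1>fish this <g2>fish</g2> fish</g1>
```

Since every offset refers to the original string, the order in which groups are processed no longer matters, and nothing has to be recalculated after a tag goes in.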


A Web based regular expressions powered find/replace utility
