How to do variable-width look-behind?

Do you have a question? Post it now! No Registration Necessary.  Now with pictures!

Threaded View


   I have a Perl script that processes multi-line input.  The problem
is, sometimes this input has newlines stuck in arbitrary places (such
as right in the middle of a valid token).  This makes the input out-of-
spec, but I have no control over this, so I want to correct it if I
can.  What's more is, sometimes this newline breaks a token in two,
where the first half still looks like a valid token while the other
does not, and vice-versa.

   I'm trying to modify my Perl script so that it reviews every
newline and see if it should be discarded.  The logic I want to use is
to throw out every newline UNLESS it is flanked (on both sides) by
valid tokens.  I would like to be able to do something like this:

   # Create a regular expression that matches tokens
   # like "N50E40", "N50 E40", "N5000 E4000",
   # "50N40E", "50N 40E", and "5000N4000E":
   my $tokenRegExp = qr/\b(?:[NS]\d+\s*[EW]\d+|\d+[NS]\s*\d+[EW])\b/;

   # Remove newlines that are not surrounded by valid tokens:
   $input =~ s/(?<!$tokenRegExp)\n(?=$tokenRegExp)//g;  # no token
   $input =~ s/(?<=$tokenRegExp)\n(?!$tokenRegExp)//g;  # no token
   $input =~ s/(?<!$tokenRegExp)\n(?!$tokenRegExp)//g;  # no tokens

   The problem is is that the look-behind assertions (both positive
and negative) only work for fixed-width expressions, according to
"perldoc perlre".  Unfortunately, it would be so useful for me to be
able to match a string with a variable look-behind, that I'm hoping
there's a logical work-around to this limitation.

   Is there any way for me to work around this limitation?


   -- Jean-Luc

Re: How to do variable-width look-behind?

Quoted text here. Click to load it

In this case I would recommend adding "\n" as a token, and when you find
you've just parsed TOKEN "\n" TOKEN go back and join those two tokens
together before processing them further. You will need to turn the "\n"
token back into whitespace (or whatever they should usually be treated
as) afterwards, of course.

Alternatively, if there is a character that is guaranteed not to appear
in your input ("" might be a good choice, or if your data is already
Unicode or you can afford the performance hit of upgrading everything
you could use one of the Unicode private-use characters) you could use
the usual solution for positive look-behind:


If you have 5.10 or Regexp::Keep this can be simplified to



Re: How to do variable-width look-behind?

Quoted text here. Click to load it

IIUC you could use 5.10's more efficient counterparts for $`,$',$& :

while ( $input =~ m/ \n /gpx ) {  # note /p switch: perldoc perlre

      my( $pre, $post ) = ( $, $ );

      unless ( $pre  =~ / $tokenRegExp $/x   and
               $post =~ / ^ $tokenRegExp /x  )
          substr($input, pos($input)-1, 1, "" );

Charles DeRykus

Re: How to do variable-width look-behind?

Quoted text here. Click to load it

Ben Morrow replied:
Quoted text here. Click to load it

   Ah.  Thanks for showing me.  Yes, that would work for my purposes,
as null bytes (and a whole lot of other characters) are guaranteed to
not appear in my $text.

   Before I had read your reply, I came up with a solution of my own.
It's not as simple as yours, but it did appear to work correctly:
First, I split the $text into an array of lines, and then looped
through every pair of lines.  If the a line does not end in the token
or its next line does not begin with the token, then I chomp() that
line.  Then I set $text to the lines.

   Here's essentially what I did:

   $text = do {
      my @lines = split m/(?<=3D\n)/, $text;

      foreach my $i (0 .. $#lines-1)
         my ($current, $next) = @lines[$i, $i+1];

         # Skip removing newline if surrounded by tokens:
         next  if     $current =~ m/$tokenRegExp$/
                  and $next =~ m/^$tokenRegExp/;

         # This is a linebreak we want to remove:

      join '', @lines  # "returns" the lines into $text

   It's not quite as elegant or simple as your solution, but it did
appear to work well.

Charles DeRykus replied:
Quoted text here. Click to load it

   Oh, wow!  I never knew about the /p switch!  (That's especially odd
since I look up "perldoc perlre" fairly frequently.)  Thanks for
telling me about it.  Since the script I'm working on must run on
machines that aren't guaranteed to have Perl 5.10, I won't use it
right now, but I'll keep it in mind for scripts of my own use.

   Since $input is being modified inside its own while($input =~ m//g)
loop, I might suggest considering saving off pos($input) before the
substr() and then restoring it right after.  Otherwise, the while-
match will start back at the beginning of $input.

   (That might not be a problem in this case, but saving and restoring
the pos() at least ensures that the while-match loop won't revisit
parts of $input that have already been processed.)

   Anyway, thanks for your help and "$input", Ben and Charles.  I
appreciate it!

   -- Jean-Luc

Re: How to do variable-width look-behind?

On Tue, 9 Feb 2010 15:23:34 -0800 (PST), ""

Quoted text here. Click to load it

If this is the case, that you have '\n' within arbitrary locations within
an otherwise valid token, then you have much bigger problems than worrying
about look behind.

So, I'm not sure what you mean by 'arbitrary locations'. I would have to
asume it could be anywhere.

So, your real problem is actually *framing* a valid token, not newlines.
Which is easier said than done. You would need some kind of analizer that
can distinguish a particular token pattern from other stream info, separate
them, clean them up, and put them back together.

Something like below might help. But, if your token stream is a little less
arbitrary, this could be reduced to something simpler.
I was just going to do/keep it for my own curiosity, but I thought I would post
here anyway.


# -sln, 2/10
use strict;
use warnings;

my $input = qq(

N53 E44
  this is some other token text

N 5 and 38N11E more stuff 28N 99   E
  N5000 E4000

56N 410   E
5000N4000 E

  my @tokens = ();
  my $new_input = '';
  my $digs = '\s*\d[\d\s]*';
  my $TOKenRegExp = qr/(?:\b(?:[NS]$digs[EW]$digs|$digs[NS]$digs[EW])\b)/;

  print "Input:\n",'-'x8,"\n";
  print $input,"\n";

  print "New Input:\n",'-'x8,"\n";

  while ($input =~ /( ($TOKenRegExp) | (?:(?!$TOKenRegExp)[^\n])+\n? )/xsg)
    my $tok = $1;
    my $istok = defined $2;
    $tok =~ s/^\s*|\s*$//g;
    next if !length( $tok );

    if ($istok) {
        $tok =~ s/\n//g;
        $tok =~ s/(?<=\d)\s+(?=\d)//;
        $tok =~ s/^[NS]\K\s+(?=\d)//;
        $tok =~ s/\d\K\s+(?=[EW]$)//;
        $tok =~ s/\s+/ /g;
        push @tokens, $tok;
        $tok .= "\n";
    } else {
        $tok .= ' ';
    $new_input .= $tok;
  print $new_input,"\n";

  print "Found Tokens:\n",'-'x8,"\n";
  for (@tokens) {
    print "'$_'\n";
  print '-'x8,"\n";




N53 E44
  this is some other token text

N 5 and 38N11E more stuff 28N 99   E
  N5000 E4000

56N 410   E
5000N4000 E

New Input:
55N 40E
N51 E42
N53 E44
this is some other token text N 5 and 38N11E
more stuff 28N 99E
N5000 E4000
59N 48E
56N 410E

Found Tokens:
'55N 40E'
'N51 E42'
'N53 E44'
'28N 99E'
'N5000 E4000'
'59N 48E'
'56N 410E'

Site Timeline