preg_match at offset


I want to split a given string into tokens which are defined by regexes:

// example tokens - a bit more complex in real
$tokens = array(
     'NUMBER' => '~^\d+~',
     'NAME'   => '~^[A-Za-z]+~',
     'ANY'    => '~^.~s' ); // "s" lets "." match newlines too, so there is always a match

while ( $input !== '' ) {
     foreach ( $tokens as $name => $regex ) {
          if ( preg_match( $regex, $input, $data ) ) {
               addNewTokenToResult( $name, $data );
               // remove the matched token from the input
               $input = substr( $input, strlen( $data[0] ) );
               break; // start again with the first token type
          }
     }
}

(Code just to illustrate my approach, not actually tested.)

The problem is that I must cut off each found token from the input
string. This requires the string to be copied in memory and is therefore
not very efficient. Also, there is the extra substr() call each time. So
I thought it would be more elegant if preg_match could match a token at
a specified offset within the input (i.e. the "^" anchor would not match
the start of the input string, but at the offset position).

In other words, I would like something like this:

     $string = "abcfoobarfoo";
     $offset = 3; // corresponds to the first "f" in the string

     // return true having matched the first "foo"
     preg_match_at_offset( $offset, '/^foo/', $string );
     // return false, as "abc" is not at offset 3
     preg_match_at_offset( $offset, '/^abc/', $string );

At first I thought the offset parameter of preg_match could do this, but
the manual says:

"Note: Using offset is not equivalent to passing substr($subject,
$offset) to preg_match() in place of the subject string, because pattern
  can contain assertions such as ^, $ or (?<=x)."
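
To make the manual's note concrete, here is a minimal sketch using the example string from this post. The variable names are only illustrative. It shows that "^" keeps referring to the start of the whole subject when an offset is passed, whereas it matches on a substr() copy:

```php
<?php
$string = "abcfoobarfoo";
$offset = 3; // position of the first "f"

// "^" still means "start of the whole subject", so this fails
// even though "foo" begins exactly at $offset.
$withOffset = preg_match('/^foo/', $string, $m1, 0, $offset);

// On a copy that starts at $offset, "^" matches as hoped.
$withSubstr = preg_match('/^foo/', substr($string, $offset), $m2);

var_dump($withOffset, $withSubstr); // int(0), int(1)
```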

So the only way to achieve what I want seems to be to use substr() to
cut off everything before the offset and do a preg_match on the rest,
using the "^" anchor.

Is there no way to have a pattern match at a specific offset only?


Just because many of them are wrong does not make them right!

Re: preg_match at offset

On Sat, 01 Nov 2008 14:06:24 +0100, Thomas Mlynarczyk wrote:

It might be more helpful to let us know what you're really working on.


The NUMBER pattern would fail on +7, -7, 5.123, -5.123e-10, .5, etc.


And if a name contains characters like á, é, ö, ü, or ä? You should
use "\w", which is locale-specific.  Although, in your case, this
regex would be more suitable:

  // exclude "\d" and "_" from "\w"
  'NAME'  => '~[^\W\d_]+~';
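
As a rough illustration of that character class, the sketch below applies it to an ASCII string. For accented letters it swaps in \p{L} with the "u" modifier, which is not what the post suggests; it assumes UTF-8 input rather than relying on locale settings. The "^" anchor and the variable names are additions for the example:

```php
<?php
// The class from the post: word characters minus digits and underscore.
preg_match('~^[^\W\d_]+~', 'foo_bar42', $ascii);
// $ascii[0] is "foo": matching stops at the underscore.

// Assumption: the input is UTF-8. With the "u" modifier, \p{L}
// matches any Unicode letter, independent of the current locale.
$subject = "M\u{00FC}ller42"; // "Müller42"
preg_match('~^\p{L}+~u', $subject, $unicode);
// $unicode[0] is "Müller": matching stops at the first digit.
```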


It seems like your tokenizer will be quite inefficient with so much
use of regex.  Is there any way you can approach your problem so that
you don't need to use regex for simple tokens?  For example, using
is_numeric() would be faster than a regex.  Instead of using the
'ANY' approach, maybe you could just use a 'continue' statement in
your loop.
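
A small sketch of what a regex-free check could look like. Note that is_numeric() judges a whole string, so it cannot pick a number out of the middle of the input by itself. The strspn() call is swapped in here as one possible way to scan at an offset; it is not mentioned in the post, and the variable names are made up for the example:

```php
<?php
// is_numeric() accepts signed, fractional and exponent forms ...
$wholeStrings = ['+7', '-5.123e-10', '.5', '12abc'];
$numeric = array_map('is_numeric', $wholeStrings); // true, true, true, false

// ... but only for an entire string. strspn() counts how many
// characters from the given set occur starting at an offset,
// without copying the input first.
$input  = 'abc123 x';
$offset = 3;
$digits = strspn($input, '0123456789', $offset); // length of the digit run
$token  = substr($input, $offset, $digits);      // the digits themselves
```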


Read up on the "\G" assertion for keeping track of match positions.
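
For what it's worth, a minimal sketch of "\G" combined with the offset parameter of preg_match, using the string from the original question. "\G" is true only at the position where matching starts, so the pattern cannot slide forward to a later occurrence. The variable names are only illustrative:

```php
<?php
$string = "abcfoobarfoo";
$offset = 3; // position of the first "f"

// "foo" starts exactly at the offset, so this matches.
$fooAtOffset = preg_match('/\Gfoo/', $string, $m, 0, $offset);

// "abc" is not at offset 3, so this does not match.
$abcAtOffset = preg_match('/\Gabc/', $string, $m, 0, $offset);

// Without \G the search moves on and finds "bar" at position 6;
// with \G it is rejected because "bar" does not start at offset 3.
$barUnanchored = preg_match('/bar/', $string, $m, 0, $offset);
$barAnchored   = preg_match('/\Gbar/', $string, $m, 0, $offset);

var_dump($fooAtOffset, $abcAtOffset, $barUnanchored, $barAnchored);
```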

$email = str_replace('sig.invalid', '', $from);

Re: preg_match at offset

On Sun, 02 Nov 2008 06:59:38 GMT, Curtis wrote:
Sorry, forgot to get rid of the semi-colon:

  'NAME'  => '~[^\W\d_]+~',

$email = str_replace('sig.invalid', '', $from);

Re: preg_match at offset

On Mon, 03 Nov 2008 23:52:51 GMT, dyer85@sig.invalid wrote:
Seems I misread the manual, and your post.  My above quoted text is
rubbish, sorry.

I ran a test using both the /A modifier and the \A assertion at the
beginning of the pattern, and the two behaved differently:

$s = 'abcfoobar';
$m = array();

// pass 0 rather than null for the flags argument
echo preg_match('/foobar/A', $s, $m, 0, 3); // 1
echo preg_match('/\Afoo/', $s, $m, 0, 3); // 0
echo preg_match('/\Afoo/', substr($s,3), $m, 0, 0); // 1

So, ISTM, regardless of offset, \A will look at the start of the
string, so the assertion seems to fail when an offset is greater than
0.  Yet the /A modifier seems to adjust with the offset.
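
Building on that observation, here is a hedged sketch of how the tokenizer from the original question might use the /A modifier with an offset instead of cutting the input with substr(). The function name tokenize() and the [name, text] result format are hypothetical, chosen only for this example; the patterns are the simplified ones from the question with the "^" removed:

```php
<?php
// Hypothetical helper: returns a list of [name, text] pairs.
function tokenize(string $input): array
{
    // No "^" in the patterns; the A modifier anchors each match
    // at the offset passed to preg_match().
    $tokens = [
        'NUMBER' => '~\d+~A',
        'NAME'   => '~[A-Za-z]+~A',
        'ANY'    => '~.~As', // "s" lets "." match newlines too
    ];

    $result = [];
    $offset = 0;
    $length = strlen($input);

    while ($offset < $length) {
        foreach ($tokens as $name => $regex) {
            if (preg_match($regex, $input, $data, 0, $offset)) {
                $result[] = [$name, $data[0]];
                // advance the offset instead of copying the rest of the string
                $offset += strlen($data[0]);
                break; // start again with the first token type
            }
        }
    }

    return $result;
}

$result = tokenize("abc123 x");
// $result holds NAME "abc", NUMBER "123", ANY " ", NAME "x"
```

Since 'ANY' always consumes one character, the offset advances on every pass and the loop terminates.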

Sorry for the mistake.

$email = str_replace('sig.invalid', '', $from);
