Regex: deleting non-matching words

I have input strings where some words start with an underscore. The plan
is to remove all words that do NOT start with an underscore and simply
keep the rest. So for example starting with
"word1 word2 _word3 word4 word5 _word6 _word7 word8"

I'm trying to end up with
"_word3 _word6 _word7"

The expression I have got so far is s/.*?(_[a-z0-9]+).*?/ $1/gi;
and my understanding is as follows:
The first ".*?" part removes everything up to the first matching RE
The "(_[a-z0-9]+)" matches any letter/number combination that starts
with an underscore [sidenote: yes, I know: \w+]
The final ".*?" removes everything up to the next match, or up to
the end of the string.

Here's how I have the RE in a program
s/.*?(_[a-z0-9]+).*?/ $1/gi;
print "Have: $_";

and here's how I run it:
echo "word1 word2 _word3 word4 word5 _word6 _word7 word8" | perl

and here's the output I get:
Have:  _word3 _word6 _word7 word8

Question: Why didn't "word8" get eaten like all its predecessors? And
what do I have to do to match it for removal?

If you have time, I'm looking for enlightenment more than solutions. I
am obviously missing something crucial, but all the online tutorials
I've found stop short of explaining this sort of thing.

Re: Regex: deleting non-matching words

pete wrote:
> The first ".*?" part removes everything up to the first matching RE

An RE does not remove anything, as you have suggested. An RE matches
something. A substitute replaces whatever is matched with the
replacement string.

After you have matched '_word3', '_word6', and '_word7', nothing else in
the string matches your RE, so no further substitutions are made.

When a string can be split into words, and each word evaluated based on
its first character, I wouldn't use REs.


use strict;
use warnings;

my $original_string = 'word1 word2 _word3 word4 word5 _word6 _word7 word8';
my @word_list;

# Keep only the words whose first character is an underscore.
for my $word (split ' ', $original_string) {
    push @word_list, $word if index($word, '_') == 0;
}

my $new_string = join ' ', @word_list;
print $new_string;

Re: Regex: deleting non-matching words

What will your program do if a word has an "interior" underscore, like:

    word1 word2 _word3 word4_and_four word5 _word6 _word7 word8

Try your program on it. Is that what you want to happen in that case?

If not, then you probably want to make use of a "word boundary" (\b)
assertion. See the "Assertions" section in:

    perldoc perlre
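
A minimal sketch of the difference, using the sample string above (the variable names are only illustrative). Since "_" counts as a word character, \b cannot match between the "4" and the "_" in "word4_and_four", so the anchored pattern skips interior underscores:

```perl
use strict;
use warnings;

# Sample input from above, including one word with interior underscores.
my $input = 'word1 word2 _word3 word4_and_four word5 _word6 _word7 word8';

# Without an anchor, the pattern also picks up pieces inside word4_and_four.
my @unanchored = $input =~ /(_[a-z0-9]+)/gi;

# With \b, a match can only begin where a word begins.
my @anchored = $input =~ /\b(_[a-z0-9]+)/gi;

print "unanchored: @unanchored\n";   # _word3 _and _four _word6 _word7
print "anchored:   @anchored\n";     # _word3 _word6 _word7
```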

> The first ".*?" part removes everything up to the first matching RE

.*? *matches* everything up to the first word that starts with
an underscore.

It is part of a single RE; there is no "first RE" or "second RE".

The s/// operator takes a *single* regular expression.

Regular expressions never "remove" anything, they only "match"
or "do not match".

It is the s/// *operator* that does the "removing".

> The final ".*?" removes everything up to the next match, or up to
> the end of the string.

The final .*? has no effect whatsoever. You get the same output
if you remove it.

.*? means "match zero or more, preferring the shortest", so here it
always matches zero characters (only because it is last in
your particular regular expression).
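
This is easy to check. A small sketch, assuming the sample string from the original question without its trailing newline, runs the substitution with and without the trailing .*? and compares the results:

```perl
use strict;
use warnings;

my $input = 'word1 word2 _word3 word4 word5 _word6 _word7 word8';

# The original substitution, with the trailing lazy .*?
(my $with_tail = $input) =~ s/.*?(_[a-z0-9]+).*?/ $1/gi;

# The same substitution with the trailing .*? removed.
(my $without_tail = $input) =~ s/.*?(_[a-z0-9]+)/ $1/gi;

print "with:    '$with_tail'\n";
print "without: '$without_tail'\n";
print "identical\n" if $with_tail eq $without_tail;
```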

> Question: Why didn't "word8" get eaten like all its predecessors?

Because it was never matched by anything.

The s/// operator does nothing when it fails to match.

> I am obviously missing something crucial, but all the online tutorials
> I've found stop short of explaining this sort of thing.

Because "this sort of thing" is highly dependent on both the pattern
being matched, and the string that it is being matched against.

You need to "become the regex engine" and walk through its operation
on your particular string. The first match is:

word1 word2 _word3 word4 word5 _word6 _word7 word8
^^^^^^^^^^^^^^^^^^

after the 1st iteration of s///g you are left with

 _word3 word4 word5 _word6 _word7 word8
       ^

with the regex's pos() pointer as marked, match again from that pos():

 _word3 word4 word5 _word6 _word7 word8
       ^^^^^^^^^^^^^^^^^^^

then do the substitution leaving

 _word3 _word6 _word7 word8
              ^

match yet again:

 _word3 _word6 _word7 word8
              ^^^^^^^

then do the substitution (" _word7" is replaced by " _word7") leaving

 _word3 _word6 _word7 word8
                     ^

match again: the match fails because no underscore word remains after
pos(), so no substitution is performed and s///g is all done.
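
The same walk can be reproduced with a short script. This is only a sketch: it uses m//g in scalar context together with the @- and @+ offset arrays to record what each iteration matches, and the variable names are made up for illustration. Note that the offsets it prints refer to the original, unmodified string.

```perl
use strict;
use warnings;

my $input = 'word1 word2 _word3 word4 word5 _word6 _word7 word8';

my @spans;          # [start, end, captured word] for every successful match
my $last_end = 0;   # offset where the last successful match ended

# In scalar context, m//g resumes from pos() on each iteration.
while ($input =~ /.*?(_[a-z0-9]+).*?/gi) {
    push @spans, [ $-[0], $+[0], $1 ];
    $last_end = $+[0];
}

for my $span (@spans) {
    my ($start, $end, $word) = @$span;
    printf "matched [%2d,%2d) '%s' captured '%s'\n",
        $start, $end, substr($input, $start, $end - $start), $word;
}

# Whatever follows the last match was never matched, so s///g leaves it alone.
printf "untouched tail: '%s'\n", substr($input, $last_end);
```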

If this task was for me to do, I would either use a m//g in list context:

    $_ = join ' ', /\b(_[a-z0-9]+)/g;

or separate out the words and find which ones start with an underscore:

    $_ = join ' ', grep /^_/, split;

Tad McClellan
email: perl -le "print scalar reverse qq/moc.liamg0cm.j.dat/"
The above message is a Usenet post.
I don't recall having given anyone permission to use it on a Web site.

Re: Regex: deleting non-matching words

Thanks Charles, Scott and Tad
I do appreciate the time you've all taken to put me right. The big thing
I was missing was the regex pointer and how it iterates through the input
string. I did think that the trailing ".*?" would match everything after
the last bracketed part of the RE but now I think I understand better
why it doesn't.

I have to admit, I've never used Perl's grep - though I use the Unix/Linux
versions frequently. Looks like I have a new tool to play with!


Re: Regex: deleting non-matching words

The problem is that once you've matched a
target substring, i.e., _[a-z0-9]+, then the
regex .*? lazily stops as soon as possible,
since .*? says match any character 0 or more
times minimally (also termed lazily). So the
regex lazily chooses 0 and completes a match.

That works, but the only glitch is that
the lazy .*? fails to consume the rest of the
string once the final target _word7 is found,
and you're left with ' word8'.

One way to fix that:

 s/ .*?               # match minimally
   ( _[a-z0-9]+ | $ ) # up to target or eol
  / $1/gix;

Now the regex matches _word7, but then
tries to match one of two alternatives:

  Either: _[a-z0-9]+
  or:     end-of-line

The former isn't found but the latter is, and
the rest of the string is consumed up to
the end-of-line just before \n.
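
A runnable sketch of this fix on the sample string (without a trailing newline). One assumption added here that is not part of the fix itself: because every match is replaced by a space plus $1, and $1 is empty when the end-of-line alternative matches, the result carries padding spaces, so the sketch trims them before printing.

```perl
use strict;
use warnings;

my $input = 'word1 word2 _word3 word4 word5 _word6 _word7 word8';

(my $result = $input) =~
    s/ .*?               # match minimally
       ( _[a-z0-9]+ | $ ) # up to target or eol
     / $1/gix;

# Trim the padding spaces introduced by the " $1" replacement (added step).
$result =~ s/^\s+|\s+$//g;

print "Have: $result\n";   # Have: _word3 _word6 _word7
```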

Charles DeRykus
