How to find all the strings in a long that are at most of n different characters of a tes...

Do you have a question? Post it now! No Registration Necessary.  Now with pictures!

•  Subject
• Author
• Posted on
Hi,

Suppose I have a long string \$a, and a test string \$b.

I want to fine all the substrings in \$a, whose length is the same as
\$b with at most n mismatches.

For example, string 'abcdef' and string 'aacdxf' have two mismatches
at the 2nd character and the 5th character.

I'm wondering if this can be done easily in perl. Can I use regular
expression to solve this problem?

Thanks,
Peng

Re: How to find all the strings in a long that are at most of n different characters of a test string?

Not easily, and probably fairly slow.
You need some heuristic algorithym.

sln

----------------------------

use strict;
use warnings;

my \$str     = 'aacdxfo   sdfbsabcrxfodfbdfb';
my \$pattern = 'abcdef';
my \$misses  = 2;

my @tmp = split '',\$pattern;
my \$pstr;
for (@tmp) {
\$pstr .= "(?:\$_|(.))";
}
\$pstr = '('.\$pstr.')';
my \$rxpattern = qr/\$pstr/i;

print @tmp,"\n",\$rxpattern,"\n\n";

while (\$str =~ /\$rxpattern/g )
{
my \$cnt = 0;
for my \$i (2..(@tmp+1)) {
last if ((\$cnt += defined( \$-[\$i])) > \$misses);
}
if (\$cnt > \$misses) {
pos(\$str) = \$-[0]+1;
} else {
print "\$cnt bad chars, but found a close match: '\$1'\n";
}
}

__END__

abcdef
(?i-xsm:((?:a|(.))(?:b|(.))(?:c|(.))(?:d|(.))(?:e|(.))(?:f|(.))))

2 bad chars, but found a close match: 'aacdxf'
2 bad chars, but found a close match: 'abcrxf'

Re: How to find all the strings in a long that are at most of n different characters of a test string?

On Nov 19, 7:55=A0pm, s...@netherlands.com wrote:

Hi,

I can not find what '\$-[\$i]' means in your code on online perl
documentation. Would you please let me know its meaning? Where is its
documentation?

Thanks,
Peng

Thanks,
Peng

Re: How to find all the strings in a long that are at most of n different characters of a test string?

It is accessing the array named @-

perldoc perlvar

--
email: perl -le "print scalar reverse qq/moc.noitatibaher0cmdat/"

Re: How to find all the strings in a long that are at most of n different characters of a test string?

Peng Yu wrote:

You can find out about the @- array in:

perldoc perlvar

John
--
Perl isn't a toolbox, but a small machine shop where you
can special-order certain sorts of tools at low cost and
in short order.                            -- Larry Wall

Re: How to find all the strings in a long that are at most of n different characters of a test string?

It seems that the perldoc is not easy to understand by a newbie. While
it explains @-, the perldoc keeps on referring to others things that I
don't understand. Can somebody help explain @- for me with more
examples?

BTW, I'd think that the perldoc be made more readable for newbies. I'm
not sure whether other newbies would have the same feeling.

Thanks,
Peng

Re: How to find all the strings in a long that are at most of n different characters of a test string?

PY> BTW, I'd think that the perldoc be made more readable for
PY> newbies. I'm not sure whether other newbies would have the same
PY> feeling.

sources of documentation for newbies, but precious few for experts, and
so I think the perldocs are at a very sweet spot.

the documentation that you don't understand.  Indeed, that's how many of
us learned as much as we did.

Charlton

--
Charlton Wilbur
cwilbur@chromatico.net

Re: How to find all the strings in a long that are at most of n different characters of a test string?

I total understand what you mean.

At this moment, I only be interested in understanding how @- works. I
tried to read that perldoc page, but I feel it is hard to follow.
Since that section of perldocs is not very readable to newbies, I'm
wondering if there is any more introductory material that explains it
with more examples.

I tried to look the more introductory materials on @- online but I
don't find any.

I think as I understand perl more, I shall be able to read perldoc
more easily.

Thanks,
Peng

Re: How to find all the strings in a long that are at most of n different characters of a test string?

...

I looked at the perldoc perlvar page to remind myself what it said. I am
unable to figure out how one could make it clearer:

<blockquote>
This array holds the offsets of the beginnings of the last
successful submatches in the currently active dynamic scope.
\$-[0] is the offset into the string of the beginning of the
entire match. The *n*th element of this array holds the offset
of the *n*th submatch, so \$-[1] is the offset where \$1 begins,
\$-[2] the offset where \$2 begins, and so on.
</blockquote>

Consider the example below:

C:\DOCUME~1\asu1\LOCALS~1\Temp> cat t.pl
#!/usr/bin/perl

use strict;
use warnings;

my \$s = '123abcABC';

if ( \$s =~ /^([0-9]+([0-9]))[a-z]+([A-Z]([A-Z]+))\$/ ) {
print "@-\n";
print "\$_\n" for \$1, \$2, \$3, \$4;
}

__END__

In this case, a successful match would produce

\$1 = 123 (offset = 0)
\$2 = 3   (offset = 2)
\$3 = ABC (offset = 6)
\$4 = BC  (offset = 7)

The entire match starts at offset 0, so we have @- = (0, 0, 2, 6, 7).

Sinan
--
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc /

Re: How to find all the strings in a long that are at most of n different characters of a test string?

Thank you the example. I see what it means. In general, I think the

I agree that adding more examples would make the manual to long to be
read by experts. But I think that this can be solved by having two
sets of document---one is with examples, the other is without
examples.

Thanks,
Peng

Re: How to find all the strings in a long that are at most of n different characters of a test string?

It is a reference, concise and accurate (as much as possible), it is not
and never will be a tutorial.

Then you will have to look for a tutorial or dig down into those other
things. Very often it is not possible or not desirable to explain
something without making references to other concepts. And @- is
certainly not the right place to explain grouping in regular
expressions.

But there are examples in that section:
<quote>
After a match against some variable \$var:

"\$`" is the same as "substr(\$var, 0, \$-[0])"
"\$&" is the same as "substr(\$var, \$-[0], \$+[0] - \$-[0])"
"\$'" is the same as "substr(\$var, \$+[0])"
"\$1" is the same as "substr(\$var, \$-[1], \$+[1] - \$-[1])"
"\$2" is the same as "substr(\$var, \$-[2], \$+[2] - \$-[2])"
"\$3" is the same as "substr \$var, \$-[3], \$+[3] - \$-[3])"
</quote>

You will find that either you are repeating yourself all over the place
or you start using references all over the place because Perl's concept
are so intertwined and cannot be explained stand-alone.

Not to mention the ensuing maintenance nightmare when you have to fix
something all over the place.

jue

Re: How to find all the strings in a long that are at most of n different characters of a test string?

What are you defending, a reference?? A reference is and has always been
another word for a POS! It is just a list of prototypes.
A reference has been and always will be nothing of significance at all.
Used by somebody who already knows it all to adjust for typo's and syntax.
I hope those people don't actually write the referece.

Why the f*ck not? Can't explain something you have no idea where you
learned it from, then blowhard to someone else needing answer's? Get off man..

You my friend should not be involved in Perl. You have cactus up your
ass, every move you make when reading posts on this group drive's that
POINT home brother!!!

Go fly a friggin kite friend..

sln

Re: How to find all the strings in a long that are at most of n different characters of a test string?

I didn't understand the above examples because it refers to \$-[0],
which was exactly what I wanted to understand.

I agree your points. But I think that the perldoc shall have as less
cyclic references as possible. If the example that you mentioned in
the perldoc is not cyclic referred, it would easier to read.

Thanks,
Peng

Re: How to find all the strings in a long that are at most of n different characters of a test string?

<snip>

Are you joking? How can one explain what @- is without explaining the
meaning of its elements?

You do know that \$x[1] refers to the second element of @x, right?

I don't see such cyclic references in the documentation for @-.

Sinan

--
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc /

Re: How to find all the strings in a long that are at most of n different characters of a test string?

Did you quote from a newer version than 5.10.0? It's slightly different
in my docs (only checked 5.8.8 and 5.10.0).

If it didn't contain \$-[0], it wouldn't be an example for the use of
\$-[0], would it?

I don't see any cyclic references in the section about @-. \$-[0] is
explained in the very first sentence:

@-      \$-[0] is the offset of the start of the last successful match.

To understand the examples, it helps to know what \$`, \$&, etc. are.
These are explained elsewhere in the same document, and none of them
is explained by referring to @-. No cycles there.

hp

Re: How to find all the strings in a long that are at most of n different characters of a test string?

C:\Documents and Settings\asu1> perl -v

This is perl, v5.10.0 built for MSWin32-x86-multi-thread
(with 5 registered patches, see perl -V for more detail)

Binary build 1004 [287188] provided by ActiveState
http://www.ActiveState.com
Built Sep  3 2008 13:16:37

Sinan

--
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc /

Re: How to find all the strings in a long that are at most of n different characters of a test string?

On Thu, 20 Nov 2008 01:55:07 GMT, sln@netherlands.com wrote:

Follow-up:

Below is a modified version that chops the time in half.

The fastest solution conforming to the original problem statement
is from John Krahn. Because you can't fight math when posing
1-dimensional problems like this.

However, I'm posting this modified version incase the problem set
expands to multi-character heuristics.

The modifications include staying away from the @- array altogether.
In my opinion, it is just a eval{} in disguise anyway, don't know
don't care. Position is set every time. The pos() function is free,
position array, on the otherhand, incurrs 5-10 times the overhead.
Its to be avoided at all costs on performance related regexp issues.

Staying away from the @- array but keeping the algorithym means doing
something thats not supposed to be done, but it works:

\$i = 1;  # is not a reference
\$\$i      # \$1

I didn't see any side affects except you have to manage the pos() each time.
See: FAQ 7.29: How can I use a variable as a variable name?

That being said Expanding notation to multi-character substrings
and to add some intelligence (possibly) to dynamically generated regexp's,
might be a help in the future.

For instance, you could expand the definition of matching items like so:

@terms = (
#  string - required - generated rxterm
#  ----------------------------------------
'Merry ' , 1,      #  (?:Merry )
'C',     , 0,      #  (?:C|(.))
'h',     , 0,      #  (?:h|(.))
'r',     , 0,      #  (?:r|(.))
'i',     , 0,      #  (?:i|(.))
's',     , 0,      #  (?:s|(.))
't',     , 0,      #  (?:t|(.))
'mas',   , 0,      #  (?:mas|(.))
);

But then your getting into heuristics.
Good luck!

sln

--------------------------------

use strict;
use warnings;

use Benchmark ':hireswallclock';

my (\$t0,\$t1,\$tdif);

my \$pattern = 'aacdxf';
my \$misses  = 2;

my @tmp = split '',\$pattern;
my \$pstr;
for (@tmp) {
\$pstr .= "(?:\$_|(.))";
}
\$pstr = '('.\$pstr.')';

my \$rxpattern = qr/\$pstr/i;
my \$extent = @tmp+1;
my (\$i,\$cnt,\$lastpos) = (0,0,0);

print @tmp,"\n",\$rxpattern,"\n\n";

my \$fname = 'c:\tempMEG_FILE.txt';
open my \$fh, \$fname or die "can't open \$fname...";

\$t0 = new Benchmark;

while (<\$fh>)
{
chomp;
\$lastpos = 0;

while ( /\$rxpattern/g )
{
\$cnt = 0;
for \$i (2..\$extent) {
last if ((\$cnt += defined( \$\$i)) > \$misses);
}
if (\$cnt <= \$misses) {
print "\$cnt bad chars, but found a close match: '\$1'\n";
}
pos() = ++\$lastpos;
}
}
\$t1 = new Benchmark;

close \$fh;

\$tdif = timediff(\$t1, \$t0);
print "the code took:",timestr(\$tdif),"\n";

__END__

Re: How to find all the strings in a long that are at most of n different characters of a test string?

wrote:

Not too hard the simple way.
-------------------------------------------------------------------
#!/usr/bin/perl
use strict; use warnings;

my \$a =
"abcdefbacdefabbdefaaaaacdxfaaacdefcdefbacdefabbdefaaaaacdxfaaacdefaacdxfaacdfx";
my \$b = "aacdxf";

my @a = split //,\$a;
my @b = split //,\$b;
my \$limit = 2;
my \$cnt;
my \$lenb = length \$b;
my @substrings = ();

print "\nmatching '\$a' against '\$b'\n";

OUTER:
for (my \$i = 0; \$i <= \$#a-\$lenb+1; \$i++) {
\$cnt = 0;
for (my \$j = 0; \$j <= \$#b; \$j++) {
\$cnt++ unless \$a[\$i+\$j] eq \$b[\$j];
next OUTER if \$cnt > \$limit;
}
my \$sub = substr(\$a,\$i,\$lenb);
#    push @substrings, \$sub;            # alternate output
print "match '\$sub' at offset \$i\n";
}

__END__

matching
'abcdefbacdefabbdefaaaaacdxfaaacdefcdefbacdefabbdefaaaaacdxfaaacdefaacdxfaacdfx'
against 'aacdxf'
match 'abcdef' at offset 0
match 'bacdef' at offset 6
match 'aacdxf' at offset 21
match 'aacdef' at offset 28
match 'bacdef' at offset 38
match 'aacdxf' at offset 53
match 'aacdef' at offset 60
match 'aacdxf' at offset 66
match 'aacdfx' at offset 72

---------------------------------------------------------------
--
"Usenet is like a herd of performing elephants with diarrhea:
massive, difficult to redirect, awe-inspiring, entertaining, and a
source of mind-boggling amounts of excrement when you least
expect it." (Gene Spafford, 1992).

Re: How to find all the strings in a long that are at most of n different characters of a test string?

If \$a could be very long, say, of the order of MB, will this approach
be much slower than the code given by sln?

Thanks,
Peng

Re: How to find all the strings in a long that are at most of n different characters of a test string?

Pilcrow wrote:

"abcdefbacdefabbdefaaaaacdxfaaacdefcdefbacdefabbdefaaaaacdxfaaacdefaacdxfaacdfx";

Or more simpler as:

#!/usr/bin/perl
use strict;
use warnings;

my \$a =
'abcdefbacdefabbdefaaaaacdxfaaacdefcdefbacdefabbdefaaaaacdxfaaacdefaacdxfaacdfx';
my \$b = 'aacdxf';

my \$limit = 2;
my \$lenb = length \$b;

print "\nmatching '\$a' against '\$b'\n";

while ( \$a =~ /(?=(.))/sog ) {
next if ( \$1 ^ \$b ) =~ tr///c > \$limit;
print "match '\$1' at offset \$-[0]\n";
}

__END__

John
--
Perl isn't a toolbox, but a small machine shop where you
can special-order certain sorts of tools at low cost and
in short order.                            -- Larry Wall