# Best way to search for a string which has N% in a character class?

#### Do you have a question? Post it now! No Registration Necessary.  Now with pictures!

•  Subject
• Author
• Posted on
Hi,

Suppose that I want to search for a substring which has say 50%
letters are in a letter class say [A-D]. Note that there is some
ambiguity at the two ends of the substring. But other than that, this
problem is well defined.

It seems that this problem can not (or can not easily, please let me
know if there is a way) be formulated in regex. Since perl is strong
in processing string, I think that there might be a good way to search
for such strings in perl. Does anybody have some good way in search
this type of substring?

Regards,
Peng

## Re: Best way to search for a string which has N% in a character class?

On 3/2/2012 17:29, Peng Yu wrote:

/[A-D][^A-D]|[^A-D][A-D]/

Kiuhnm

## Re: Best way to search for a string which has N% in a character class?

On 3/2/2012 18:25, Kiuhnm wrote:

A more interesting problem is find the longest substring with that
property (mine was the shortest).
Let's simplify the problem and consider binary strings:
X =     0  1  1  1  1  0  0  0  1  0  0  1  1  0  0  1
Let
Y = (0) 1  0 -1 -2 -3 -2 -1  0 -1  0  1  0 -1  0  1  0
be the array where the i-th element is
(the number of 0s seen so far) -
(the number of 1s seen so far)
If Y[m] = Y[n] then from m to n we must have run through a sequence with
an equal number of 0s and 1s.
Finding the longest substring boils down to finding m and n such that
Y[m] = Y[n]
and m-n is maximized.
In our example, the longest substring is the entire string (indeed we
have (0).....0 in Y).

If you don't want 50/50 but 33.(3)/66.(6), use
2 * (the number of 0s seen so far) -
(the number of 1s seen so far)
or vice versa:
X =     0  1  1  1  1  0  0  0  1  0  0  1  1  0  0  1
Y = (0) 2  1  0 -1 -2  0  2  4  3  5  7  6  5  7  9  8
The best we can do is from 2 (excluded) to 2:
1  1  1  1  0  0

--->
#!/usr/bin/perl

use 5.010;

say (my \$str = 'aaaaaaaaabcdefabcdefabcdefaaaaaaaa');
my @v = split //, \$str;
my @y = (0, map { state \$v;
my \$cur = \$v += (/[a-c]/ ? -1 : 1)
} @v
);
my %m;
my \$maxIdx;
for (@y) {
state \$cnt = 0;
\$m->[0] //= \$cnt;
\$m->[1] = \$cnt - \$m->[0];
\$maxIdx = \$_ if \$m->[1] > \$m->[1];
\$cnt++;
}
my \$from = \$m->[0];
my \$to = \$from + \$m->[1] - 1;
say "The longest substring goes from \$from to \$to:\n  ", @v[\$from..\$to];
<---

Kiuhnm

## Re: Best way to search for a string which has N% in a character class?

On 03/02/12 10:29, Peng Yu wrote:

What have you tried?????????????????

Using 'tr' and 'length' would probably help you.

From perldoc perlop:

y/SEARCHLIST/REPLACEMENTLIST/cds
[...]Transliterates all occurrences of the characters found in the
search list with the corresponding character in the replacement list.
It returns the number of characters replaced or deleted.

Using that you can get the number of characters in the class.
e.g. \$cnt = tr/[A-D]/[A-D]/;

Using 'length' you can find how many characters are in the string.

perldoc -f length

Divide one by the other, multiply by 100 and you have the percent.

## Re: Best way to search for a string which has N% in a character class?

"man perlop" continues

Note that "tr" does not do regular expression character classes
such as "\d" or "[:lower:]".  The <tr> operator is not equivalent
to the tr(1) utility.  If you want to map strings between
lower/upper cases, see "lc" in perlfunc and "uc" in perlfunc, and
in general consider using the "s" operator if you need regular
expressions.

The expression
tr/[A-D]/[A-D]/;
will translate [ to [ and ] to ], so they will be included in the
count.  A-D works because that's a special case in tr.  Also,

If the "/d" modifier is used, the REPLACEMENTLIST is always
interpreted exactly as specified.  Otherwise, if the
REPLACEMENTLIST is shorter than the SEARCHLIST, the final
character is replicated till it is long enough.  If the
REPLACEMENTLIST is empty, the SEARCHLIST is replicated.  This
latter is useful for counting characters in a class or for
squashing character sequences in a class.

So if you really want a range of characters like A thru D,
tr/A-D//
works.  If you want all digits, or all alphabetics, or some other
character class, you need to use s/// instead.

--
Tim McDaniel, tmcd@panix.com

## Re: Best way to search for a string which has N% in a character class?

On 03/02/12 13:06, Tim McDaniel wrote:

[...]

Thanks for the correction.

## Re: Best way to search for a string which has N% in a character class?

wrote:

I don't think that you understand my question.

Suppose that I have a string \$str which the concatenation of \$str1,
\$str2 and \$str3, where both \$str1 and \$str3 have less than 50% of [A-
D] and \$str2 have more than 50% of [A-D].

I need to discovered from \$str where \$str2 starts and ends. I don't
see how tr and length alone can address this question.

## Re: Best way to search for a string which has N% in a character class?

On 3/2/2012 21:53, Peng Yu wrote:

That's not clear at all.

Kiuhnm

## Re: Best way to search for a string which has N% in a character class?

[snip]

[snip]

%50 of what? Without boundry conditions, the type of regex solution
your thinking of is impossible.

The way you state your problem is that [A-D] can exist randomly
in sequence or between [^A-D] characters.

The the only thing you state as known is the total length of random
length strings after cattenation and before the %50 over/under content
of each.

You can slide a regex frame over the final string but ther is not enough
information about boundry conditions to get real information.
There is just more unknowns than there are equations.

For instance,
-  if the length of each substring were the same it could be
solved, but this way would not need a regex.
-  if the [A-D] were adjacent, still the start/end could not be
determined, only the knowledge that this match of  > %50 is in
the substring that needs to be found, but still no begin/end information

I think it was a nice try though, futile, but nice.

-sln

## Re: Best way to search for a string which has N% in a character class?

ISTM you are right the regex engine is a good match for this sort of
thing: it's already got the run-through-the-string and the backtracking
logic you need.

our \$d;
m{ (
(?>
(?{ \$d = 0 })
(?:
[A-D] (?{ \$d++ })
|   .     (?{ \$d-- })
)*
)
(?(?{ \$d }) (?!))
) }sx;

(It's necessary for \$d to be 'our' due to a longstanding bug in the
implementation of (?{}).)

I think that works correctly?

Ben