Do you have a question? Post it now! No Registration Necessary. Now with pictures!
- Robert Inder
April 6, 2005, 3:27 pm
I'm interested in comparing the "coverage" of regular expressions.
In particular, if I have two regular expressions, I want to know
whether it is possible to construct a string that will match
both, or indeed that will match one and not the other.
So, for instance, suppose I believe that if an object has a
descripiton string matching regexp R1, it is of type A.
Someone now tells me that an object with a
descripiton string matching regexp R2 is of type B. What can I tell
about (the sets of things of type) A and B by looking at R1 and R2?
So if R1= "^a.*$" and R2 = "^b.*$", I know that A and B are disjoint.
Whereas if R1= "foo" and R2 = "foobar", I know that B is a subset of A.
But how do I formalise (i.e. program) this comparison process?
This strikes me as a very general problem/issue and there
just has to be some work on formalising and solving it. But I don't
know how to find it: Google searches for things like "regular
expression comparison" are swamped by algorithms for matching regexps
against strings and so forth, which isn't what I want.
Can anyone point me in the right direction?
Robert.
--
__ To avoid the spam trap, mail me
|_) _ |_ _ ._ |- | _ _| _ ._ at bcs.org.uk, not deadspam.com.
| \(_)|_)(-'| |_ || |(_|(-'| '
Best viewed in Ebriated.
Re: Comparing Regular Expressions
>
> I'm interested in comparing the "coverage" of regular expressions.
>
> In particular, if I have two regular expressions, I want to know
> whether it is possible to construct a string that will match
> both, or indeed that will match one and not the other.
[snip]
That will be utterly non-trivial.
Consider the partial problem of finding a string that a given regex
matches. Since there are patterns that don't match anything (/$.^/),
there can't be a general solution. At the very least you'll need a
complete regex parser before you can tackle this problem.
Anno
Re: Comparing Regular Expressions
Robert Inder wrote:
> I'm interested in comparing the "coverage" of regular expressions.
>
> In particular, if I have two regular expressions, I want to know
> whether it is possible to construct a string that will match
> both, or indeed that will match one and not the other.
My instinct says that will be "hard". Not just "hard" in the common
sense, but hard in the CompSci sense of being equivalent to the halting
problem (i.e. insoluble in the general case in finite time).
Note I'm reading this in clp.misc. I suspect the nettizens of
comp.theory will be much more clued-up on this.
Re: Comparing Regular Expressions
My recollection is that this is not the case. Certainly equivalence
of recursive languages is undecidable, but regular languages are
highly restricted special cases.
I may be remembering wrong, but I think regex equivalence is
NP-complete. If so, this would mean that there would be an
exponential-time algorithm for solving it. A quick google search
finds various mentions of various versions of the problem being in
PSPACE or being EXP-complete, which means that it is not undecidable.
So the problem is hard, but not impossible.
Re: Comparing Regular Expressions
> Subject: Re: Comparing Regular Expressions
> Date: Wed, 13 Apr 2005 23:22:52 GMT
> Mark Jason Dominus wrote:
>>> My instinct says that will be "hard". Not just "hard" in the common
>>> sense, but hard in the CompSci sense of being equivalent to the
>>> halting problem (i.e. insoluble in the general case in finite time).
>>
>> My recollection is that this is not the case. Certainly equivalence
>> of recursive languages is undecidable, but regular languages are
>> highly restricted special cases.
> Except that Perl's REs (as most other REs that are in common use
> today) are not"regular" in the Computer Science classification
> any longer but have been extended.
Right now, I'm just interested in the what can and can't be done.
So if some restrictions on the regexps would let me get a handle on
comparing them, I'd be intrigued.
Even some pointers to work that "merely" dealt with regular RE's
would be a great help.
> jue
Robert.
--
__ To avoid the spam trap, mail me
|_) _ |_ _ ._ |- | _ _| _ ._ at bcs.org.uk, not deadspam.com.
| \(_)|_)(-'| |_ || |(_|(-'| '
Best viewed in Ebriated.
Re: Comparing Regular Expressions
I think both Torben Ægidius Mogensen
(7zhdikt6sn.fsf@app-2.diku.dk) and me
(pxbu0mk9l0w.fsf@news.bourguet.org) have outlined the
algorithms you want. It would help to give you better
responses if you state what you think is missing.
The algorithm for going from a RE to a FSM is probably in
every introduction book on compiler.
Yours,
--
Jean-Marc
Re: Comparing Regular Expressions
flex gives a warning when a rule can never be matched. So
you have an easy way to check if two regular expressions are
identical: input two files consisting simply of the two
expressions in both order. If flex gives that warning for
both files, they are identicals, if flex gives a warning
only for one file, the second in that file regexp is a
subset of the first, if flex never give a warning, both
expressions are able to match strings that the other can
not.
Yours,
--
Jean-Marc
Re: Comparing Regular Expressions
I think this has been covered before:
1) Make minimal DFA's for both RE's.
2) Compare the DFA's for equality (equivalent graphs).
Alternatively,
1) Make NFA's for both RE's.
2) Check bisimulation of the two NFA's.
Both have exponential worst-case time.
The above is compairing for equality. If you want to comapre for
subset, you just replace bisimulation with simulation in the second
method or you construct the difference automaton between the two DFA's
and check for emptyness.
Torben
Re: Comparing Regular Expressions
> I'm interested in comparing the "coverage" of regular expressions.
For which definition or regular expressions? I see this is crossposted
to comp.lang.perl.misc and comp.theory and the default meaning of
regular expression is quite different in the two groups.
> In particular, if I have two regular expressions, I want to know
> whether it is possible to construct a string that will match both,
> or indeed that will match one and not the other.
For the comp.theory adapted definition, it is possible to construct
FSM accepting the language described by a regular expressions.
Starting from the FSMs and using a variant of the algorithm used to
minimize them it is easy to construct an FSM which accept the union of
the two languages. As there is a know relation between the states in
the new FSM and the states in the old, it is easy to classify the
accepting states of the new FSM as having correspondance in only one
or both of the original FSM.
Yours,
--
Jean-Marc
Re: Comparing Regular Expressions
> I'm interested in comparing the "coverage" of regular expressions.
>
> In particular, if I have two regular expressions, I want to know
> whether it is possible to construct a string that will match
> both, or indeed that will match one and not the other.
Regular languages are closed under intersection and set difference, so
you can not only find strings (if any exist) in the intersection and
difference sets, you can construct regular expressions or DFA's that
describe all such strings.
> So, for instance, suppose I believe that if an object has a
> descripiton string matching regexp R1, it is of type A.
> Someone now tells me that an object with a
> descripiton string matching regexp R2 is of type B. What can I tell
> about (the sets of things of type) A and B by looking at R1 and R2?
>
> So if R1= "^a.*$" and R2 = "^b.*$", I know that A and B are disjoint.
>
> Whereas if R1= "foo" and R2 = "foobar", I know that B is a subset of A.
Say what? The set containing only the string "foobar" is not a subset
of the set containing only "foo". Or do you consider a string to
match a regular expression if a prefix of the string does?
> But how do I formalise (i.e. program) this comparison process?
>
> This strikes me as a very general problem/issue and there
> just has to be some work on formalising and solving it. But I don't
> know how to find it: Google searches for things like "regular
> expression comparison" are swamped by algorithms for matching regexps
> against strings and so forth, which isn't what I want.
If you have a DFA D1 with states s_0 ... s_m for A and a DFA D2 with
states t_0 ... t_n for B, you can construct a DFA D3 for the
intersection of A and B in the following way:
1. The start state of D3 is the pair (s_0,t_0) of starting states of
D1 and D2.
2. If D1 has a transition on symbol c from s_i to s_j and D2 has a
transition on symbol c from t_k to t_l, D3 has a transition from
(s_i,t_k) to (s_j,t_l) on c.
3. If s_i is accepting in D1 and t_k is accepting in D2, (s_i,t_k) is
accepting in D3.
For A\B (A minus B), the construction is:
0. Add a non-accepting state t_(n+1) to D2. If there is a state t_k
in D2 and a symbol c such that t_k has no transition on c, add a
transition from t_k to t_(n+1) on c. There is a transition from
t_(n+1) to t_(n+1) on all symbols. This way, D2 has transitions
on all symbols from all states.
1. The start state of D3 is the pair (s_0,t_0) of starting states of
D1 and D2.
2. If D1 has a transition on symbol c from s_i to s_j and D2 has a
transition on symbol c from t_k to t_l, D3 has a transition from
(s_i,t_k) to (s_j,t_l) on c.
3. If s_i is accepting in D1 and t_k is not accepting in D2,
(s_i,t_k) is accepting in D3.
Torben
Site Timeline
- » FAQ 1.3 Which version of Perl should I use?
- — Next thread in » PERL Discussions
- » FAQ 6.7 How can I match a locale-smart version of "/[a-zA-Z]/"?
- — Previous thread in » PERL Discussions
- » s suffix question
- — Newest thread in » PERL Discussions
- » Anyone Using ESET NOD32??
- — The site's Newest Thread. Posted in » Anti-Virus Software