Regular expression help needed

I am writing in PHP and trying to work with regular expressions
on records in a multilanguage database. I understand regexp basics,
but have bitten off more than I can chew here and really need help.

The problem is to do with generating all strings that match a
pattern defined in terms of brackets and slashes.  Here, brackets
mean that something is optional and slashes are used to show alternatives.

In the examples below, all the matches I need to generate are
shown below the original string.

Here are some simple brackets:
homard (à la) bordelaise
   homard bordelaise
   homard à la bordelaise

And brackets used in combination with slashes:

Words can also be bracketed:
céréale(s) (froide(s))
   céréale froide
   céréales froides
   céréale froides  (even though these last
   céréales froide   two aren't proper French)

And slashes used for whole word alternatives:
jablecn zvin/trdl
  jablecn zvin
  jablecn trdl
cuerpo, con/de
  cuerpo, con
  cuerpo, de

Maybe the expression would have to decide on a whole-word
substitution rather than a single-letter one - e.g. "cuerpo, con/de"
isn't "cuerpo, cde" - if the "slashed" term is more than a single letter.

I think I can see how to generate a set of rules based on the
above, but I have no idea how to implement the logic of them using
preg functions.
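
For illustration, the generation step described above can be sketched as a small recursive function that branches once per slash and once per bracket. The rules it assumes are only inferred from the examples in this post: a one-letter alternative replaces the last letter of the word before the slash, anything longer is a whole-word alternative. The name expandPattern() is hypothetical, and a bracket that sits directly against a slash is not handled.

```php
<?php
// Sketch only: expand one pattern into every string it stands for.
// Assumed rules (inferred from the examples, not a specification):
//   (x)  x is optional
//   a/b  alternatives; a one-letter b replaces the last letter of a
function expandPattern(string $pattern): array
{
    if (preg_match('~([^\s()/]+)/([^\s()/]+)~u', $pattern, $m, PREG_OFFSET_CAPTURE)) {
        // A slash between two plain words: branch on each alternative.
        [$whole, $offset] = $m[0];
        $left  = $m[1][0];
        $right = $m[2][0];
        if (preg_match('~^.$~u', $right)) {
            // One-letter alternative: swap the final letter of the left word.
            $right = preg_replace('~.$~u', '', $left) . $right;
        }
        $branches = [$left, $right];
    } elseif (preg_match('~\(([^()]*)\)~u', $pattern, $m, PREG_OFFSET_CAPTURE)) {
        // An innermost bracket: branch on "absent" and "present".
        [$whole, $offset] = $m[0];
        $branches = ['', $m[1][0]];
    } else {
        // Nothing left to expand: collapse the gaps left by removed brackets.
        return [trim(preg_replace('~\s+~u', ' ', $pattern))];
    }

    $results = [];
    foreach ($branches as $branch) {
        // Offsets from PREG_OFFSET_CAPTURE are in bytes, as substr_replace() expects.
        $variant = substr_replace($pattern, $branch, $offset, strlen($whole));
        $results = array_merge($results, expandPattern($variant));
    }
    return array_values(array_unique($results));
}

print_r(expandPattern('cuerpo, con/de'));
print_r(expandPattern('homard (à la) bordelaise'));
```

Under these assumptions 'cuerpo, con/de' yields "cuerpo, con" and "cuerpo, de". Note that nested brackets are taken literally: an outer bracket around a whole word makes that word optional too, so a pattern with nested brackets may produce more strings than the hand-written lists above.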

And thanks in advance...


Re: Regular expression help needed

I'd like more context about how you will be using these grammars. Can't
say a whole lot without knowing more.

If the human-readable form is not carved in stone, I can offer these suggestions.

Using square brackets rather than parens for optional content would be
consistent with what people see in spreadsheet textbooks and other
computer training aids:

 aardappel(en)   becomes  aardappel[en]

Using the vertical pipe character for alternatives would also be more conventional:

  agrio/a   becomes agrio|a

Since parens are no longer used for optional content, they can be used
for grouping, which improves readability:

  agrio|a  can also be written agri(o|a)

and it also makes it easy to disambiguate things

  cuerpo, con/de  becomes  cuerpo, (con|de)

If it is possible to do without a human-readable form, then all these
rules could be expressed directly in regex strings, suitable for
plugging into preg_match() and preg_replace():

 aardappel(en)   becomes  '/aardappel(en)?/'
 cuerpo, con/de  becomes  '/cuerpo, (con|de)/'
 douillon(s)/douillen(s) becomes '/douill[eo]ns?/'

PS: I think 'agrio/a/(s)' would also match 'agri' and 'agris' according
to the rules? I think what was wanted was ' agrio/a(s)'.
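
As a quick check, the three regex strings above can be tried against a few variants with preg_match(). This is only a sketch: the ^ and $ anchors are an addition (without them the patterns also match inside longer strings), and the helper name matchesRule() is hypothetical.

```php
<?php
// The regexes from the post, keyed by the database pattern they replace.
// Assumption: anchors are added so that only whole strings match.
$regexes = [
    'aardappel(en)'           => '/^aardappel(en)?$/',
    'cuerpo, con/de'          => '/^cuerpo, (con|de)$/',
    'douillon(s)/douillen(s)' => '/^douill[eo]ns?$/',
];

// Hypothetical helper: true when the candidate is one of the rule's variants.
function matchesRule(string $regex, string $candidate): bool
{
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return preg_match($regex, $candidate) === 1;
}

var_dump(matchesRule($regexes['aardappel(en)'], 'aardappelen'));
var_dump(matchesRule($regexes['cuerpo, con/de'], 'cuerpo, cde'));
```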

Re: Regular expression help needed

Will Woodhull wrote:

Hi Will

Very many thanks for taking the time to reply.  I should have
explained more about the context of things, sorry.  A friend has
created the database and I am implementing its searchable, web-based interface.

The search patterns wouldn't be seen by the users.  I could try to
talk my friend into changing them to regular-expression-friendly
versions, but I am not sure how readable she would find them for her
own use.

Thanks for your help - and good point about agrio/a(s)!

Best wishes and thanks again,


Re: Regular expression help needed

Karin Jensen wrote:

She wouldn't find regexes very useful in desk work.

The end result has to be a function that for each of your friend's
'search patterns' would convert the many variants to a single token.
Both "aardappel" and "aardappelen" are converted to some arbitrary and
unique value like "". This tokenizer function is applied to
both the search string given by the visitor and to each target as it is
pulled from the database, then the tokenized versions of these are compared.

The tokenizer has to work with preg_replace() regex, but the database
provides the search patterns in a different syntax-- and sometimes
there are ambiguities in that syntax. For this and for a number of
other reasons it makes sense to build the logic into a look-up table,
such as a predefined array:

 # look-up from database search string to regex
   $rule['aardappel(en)'] = '/aardappel(en)?/';
   $rule['agrio/a(s)'] = '/agri[oa]s?/';
   $rule['cuerpo, con/de'] = '/cuerpo,\s(con|de)/';

A second predefined look-up table using the same keys can hold the tokens:

   $token['aardappel(en)'] = '';
   $token['agrio/a(s)'] = '';
   $token['cuerpo, con/de'] ='';

Then the PHP logic would be something like

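   function tokenize($givenstring) {
     // the look-up tables are defined outside the function
     global $rule, $token;
     $tokenized = $givenstring;
     // apply every rule in turn, replacing each variant with its token
     foreach ($rule as $key => $regex) {
         $tokenized = preg_replace($regex, $token[$key], $tokenized);
     }
     return $tokenized;
   }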

This approach would be easy to debug, maintain, and extend. It would
also be possible to move the look-up tables out of PHP and into the
database-- which might be a good or bad thing to do.
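
To make the comparison step concrete, here is a minimal sketch of tokenizing both the visitor's search string and a database target and then testing them for equality. It folds the two look-up tables into one regex-to-token map for brevity, which is a simplification of the layout above. The token values and the function names tokenizeWith() and sameAfterTokenizing() are hypothetical; any arbitrary strings work as tokens as long as no rule can match them.

```php
<?php
// Assumption: one map from regex to token instead of two parallel tables.
// The token strings are placeholders invented for this sketch.
$rules = [
    '/aardappel(en)?/'    => '{AARDAPPEL}',
    '/agri[oa]s?/'        => '{AGRIO}',
    '/cuerpo,\s(con|de)/' => '{CUERPO}',
];

// Replace every variant found in $text with its token.
// preg_replace() applies the patterns one after another when given arrays.
function tokenizeWith(array $rules, string $text): string
{
    return preg_replace(array_keys($rules), array_values($rules), $text);
}

// Two strings are treated as the same entry when their tokenized forms agree.
function sameAfterTokenizing(array $rules, string $query, string $target): bool
{
    return tokenizeWith($rules, $query) === tokenizeWith($rules, $target);
}

var_dump(sameAfterTokenizing($rules, 'aardappelen', 'aardappel'));
var_dump(sameAfterTokenizing($rules, 'agrio', 'aardappel'));
```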


Re: Regular expression help needed

Karin Jensen schrieb:
Hi Karin,
Hi Will

I think the problem can't really be solved with regular expressions,
perhaps because the creators of human language didn't know PHP ;-)

Just to put in my 2 cents:

1. Generate an index with the soundex() of the field you want to search
(because most differences between the words are at the end, I suggest
using only the first few characters of the word).

soundex has been adapted for several languages.

2. You will find many matches in your database.
   Now use levenshtein() to sort the results.

Surely it won't handle all queries. But as a way to find different
declensions of words it might be sufficient, and it is much
easier than implementing a rule set for a natural language in PHP.
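
The two steps can be sketched with PHP's built-in soundex() and levenshtein() functions. This is a rough illustration only: it filters an in-memory array rather than a database index, it uses the whole word rather than its first few characters, and the function name fuzzySearch() is hypothetical.

```php
<?php
// Step 1: keep candidates whose soundex key equals that of the query.
// Step 2: sort the survivors by edit distance to the query, closest first.
function fuzzySearch(string $query, array $candidates): array
{
    $key  = soundex($query);
    $hits = array_filter($candidates, fn($c) => soundex($c) === $key);

    // usort() also reindexes the array from 0.
    usort($hits, fn($a, $b) => levenshtein($query, $a) <=> levenshtein($query, $b));
    return $hits;
}

print_r(fuzzySearch('aardappel', ['agrio', 'aardappelen', 'aardappels']));
```

PHP's soundex() follows the English algorithm, so how well the first step works on other languages would need to be checked against real data.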



Re: Regular expression help needed

Hi Jo,

Joachim Weiß wrote:


I haven't worked with soundex, but I don't think it is the solution here.
Soundex would return very different values for some of the expressions
that Karin will be working with. At other times a soundex approach
would run into a new problem with homophones-- words that
have the same sound but very different meanings-- which I think would
cause serious trouble.

Historically the approach to this kind of variant recognition problem
is to reduce all the variants in the target and in each possible match
to the same tokens and perform the test on these tokenized
representations. In this particular case, the problem is compounded
because the variants are expressed in a type of human-recognizable
grammar that will need to be translated, somehow, into expressions the
computer understands. These are all pattern recognition problems in
text strings-- which is exactly the kind of problem that regular
expressions were designed to handle.
