Posted by Ted Byers on November 18, 2008, 8:15 pm
Imagine I have two hashes, one with a name as the key and an integer
ID as the value. The names are guaranteed to be unique and correct.
The second hash also has names as the key (and the value in this one
doesn't matter), but being manually typed they are not guaranteed to
be correct. Several keys in the second hash may even correspond to a
given key in the first hash. This many-to-one potential arises
because typos (and different abbreviations) can alter a given string
in different ways.
A major complication is that, because the data in the second hash
comes from a different feed using a different protocol, it is
guaranteed that there will never be a perfect match between any key in
the first hash and any key in the second hash. The only guarantee
regarding the data in the second hash is that each of its keys
corresponds to exactly one key in the first. One pattern
we see a lot is that in some cases, the name string includes
whitespace between the names provided, while in others there is no
whitespace: so FredEdwardSmith would need to be recognized as the same
as Fred Edward Smith. Another pattern includes an arbitrary number of
digits before the name, after it, or both. And then there are issues with
different spelling conventions (e.g. color vs colour) and regular
typos (e.g. FredEdwardSmyth).
The problem is to create a hash that maps all keys in the second hash
to the ID used as the value in the first hash. This is in a context
where nothing is known until run time: at run time, both sets of data
have been loaded into a DB, and our script retrieves the data from
there. This data is dynamic so there is little chance of seeing the
same data twice (but the second data feed changes much more frequently
than the first).
Now, when we actually look at the data ourselves, it is obvious which
correct name applies to the names from the second feed. Our problem
is how to make a script that is as good at seeing correct matches
between the first and second sets of data as the human eye is.
My first thought was to use regular expressions for this, but nothing
I have read so far sheds light on how to use them on imperfect data.
Are regular expressions able to deal with this, or is there a Perl
package that is better suited to this problem?
Re: Can regular expressions be used to choose among several imperfect matches?
Canonicalize the data by eliminating all whitespace and leading/following
digits.
For the remaining typos and spelling variants, maybe String::Approx or
the other modules discussed in the perldoc for String::Approx.
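
To make that concrete, here is a rough sketch of the two steps: strip whitespace and surrounding digits, then pick the closest known name. Everything in it is an assumption for illustration. The sample hashes, the sub names (canonicalize, edit_distance, map_feed_to_ids), the lowercasing, and the cutoff of 2 edits are all made up, and a hand-written Levenshtein distance stands in for String::Approx so that it runs with core Perl only.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical sample data; in the real script both hashes come from the DB.
my %id_for = ('Fred Edward Smith' => 1, 'Mary Ann Jones' => 2);
my %feed   = ('12FredEdwardSmyth' => 1, 'Mary Ann Jones 007' => 1);

# Strip all whitespace and leading/trailing digits.
# Lowercasing is an extra assumption, not something the feed guarantees.
sub canonicalize {
    my ($name) = @_;
    $name =~ s/\s+//g;
    $name =~ s/^\d+//;
    $name =~ s/\d+$//;
    return lc $name;
}

# Plain Levenshtein edit distance, keeping only two rows of the table.
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @cur = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min($prev[$j] + 1, $cur[$j - 1] + 1, $prev[$j - 1] + $cost);
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Map each feed key to the ID of the closest canonical name, or undef
# when nothing is within $max_dist edits (the cutoff is a guess to tune).
sub map_feed_to_ids {
    my ($id_for, $feed, $max_dist) = @_;
    my %canon = map { (canonicalize($_) => $id_for->{$_}) } keys %$id_for;
    my %result;
    for my $key (keys %$feed) {
        my $c = canonicalize($key);
        my ($best, $best_d);
        for my $cand (sort keys %canon) {
            my $d = edit_distance($c, $cand);
            ($best, $best_d) = ($cand, $d)
                if !defined($best_d) || $d < $best_d;
        }
        $result{$key} = (defined($best_d) && $best_d <= $max_dist)
            ? $canon{$best}
            : undef;
    }
    return \%result;
}

my $ids = map_feed_to_ids(\%id_for, \%feed, 2);
printf "%-20s => %s\n", $_, $ids->{$_} // 'no match' for sort keys %$ids;
```

Keys that come back undef are the ones worth eyeballing by hand. A fixed cutoff will probably misfire on very short names, so scaling it with the name length may work better.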