SELECT duplicates without looping

Do you have a question? Post it now! No Registration Necessary.  Now with pictures!

Threaded View
Hi All,

I have a fairly large table with approx 30K rows that updates every
night via a cron script that automatically downloads the 2 new csv's.

The problem is the files are downloaded from two completely different
systems and I've recently found duplicates!

To make matters worse, the two feeds do not share a common unique key.

The following snippet is an example of the data where row 1 is from
feed 1 and row 2 is from feed 2

12345|2069|Happy St.|NULL|A-2|

So my question is: How can I identify duplicate records using multiple
columns without looping through the entire recordset in PHP?

I have successfully found other duplicates by COUNT(key) ...GROUP BY
key HAVING COUNT(key) >= 2.

I tried using CONCAT to combine the columns into one string but didn't
work because CONCAT fails if any column is NULL.

Any help would be greatly appreciated.


Re: SELECT duplicates without looping wrote:
Quoted text here. Click to load it

You can do a self-join:

SELECT t1.key, t2.key
FROM tbl AS t1 INNER JOIN tbl AS t2 ON
  CONCAT_WS(' ', t1.streetnum, t1.streetname, t1.streetsuffix, t1.unitnum) =
  CONCAT_WS(' ', t2.streetnum, t2.streetname, t2.streetsuffix, t2.unitnum)

That catches the specific example you give above.  Unfortunately,
finding duplicates in such free-form data is one of those limitless
problems.  There are so many ways that the data can be "the same" that
it's hard to automate.

For instance, what if `streetname` contains the string such as "Happy
St., suite A-2"? Is that equal to "Happy St #A2"?  How can you make an
expression to compare these?

Database analysts are often assigned with multi-week projects to de-dupe
vast sets of data.  It's unlikely that you'll ever catch 100% of the
duplicates through a single query, but you can catch some reasonable set
of them.  Then it becomes a question of how much work is it worth to
catch the last few cases?

Sometimes the most successful method is to sort the dataset by
streetname (or other likely key) and then use your eyeballs to spot
duplicates.  A very tedious task, but much more likely to spot all the
uncommon cases of duplicates.

Quoted text here. Click to load it

Read the description of CONCAT_WS() here:

That function skips NULL arguments, instead of returning NULL for the
whole expression.

Bill K.

Re: SELECT duplicates without looping

Wow, that's awesome!

I appreciate your help.

While I've got your attention, what are some good resources for
advanced query writing?

I feel like I have grasped the basics of MySQL and am looking to go to
the next level.

Again, I thank you for you help.


Re: SELECT duplicates without looping wrote:
Quoted text here. Click to load it

I learned some from Joe Celko's books "SQL for Smarties" and "SQL
Programming Style".  He also has a book "SQL Puzzles & Answers" but I
haven't read that one.

And also reading the questions and answers on comp.databases and various
MySQL newsgroups is very worthwhile.

There's no substitute for experience.  Try exercises of complex queries.

But don't feel like you have to write the most complex query when you're
actually committing it in code for a project!  The simplest code that
gets the job done is often the best, to help improve the code's
readability, extensibility, and maintenance.

Also, keep in mind that not every problem is best solved with a single
huge query that does everything.  SQL was designed to be combined with
application code, and often that's the best way to massage query results
into an application data structure.  Also, don't assume that a single
query always works faster than a multi-query solution.

I think these points are just as important for advanced usage of SQL as
learning every complex syntax you can use in a query.  You're advanced
if you can recognize when it's appropriate and necessary to use a
complex query, and when you can do the job with a simpler one (or more
than one).

Bill K.

Site Timeline