problem w/ duplicates


Let's say I have a table of users, and each user has a list of
categories.  I could store each user's categories as TEXT with
delimiters like "cat1|cat2|cat3".

But then I need to be able to get a full list of everyone's categories,
without duplicates.  Retrieving all the categories, exploding them, and
then removing the duplicates is a bit slow.  Is there a better method?
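The fetch-explode-deduplicate round trip described above can be sketched like this; a minimal illustration using Python's sqlite3, where the table and column names are assumptions rather than the poster's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (userid INTEGER PRIMARY KEY, categories TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(0, "life|work|web"),
     (1, "life|work|starcraft"),
     (2, "work|starcraft|programming")],
)

# With delimited TEXT, the database cannot deduplicate for us: the
# application must fetch every row, split each string, and dedupe in code.
all_cats = set()
for (cats,) in conn.execute("SELECT categories FROM users"):
    all_cats.update(cats.split("|"))

print(sorted(all_cats))
```

This works, but every query over categories pays the cost of pulling and parsing every user row, which is exactly the slowness being asked about.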

Re: problem w/ duplicates


Your solution violates first normal form and leads to many problems. Create a
table with 2 columns, userid and categoryid, and make these columns the
composite primary key.


Re: problem w/ duplicates

Rich Ryan wrote:

make what the key? the "these" columns?? i don't understand what you mean.

and if i did it that way, wouldn't there be a lot of excessive data?

userid | category
0 | life
0 | work
0 | web
1 | life
1 | work
1 | starcraft
2 | work
2 | starcraft
2 | programming

something like that..? i mean.. i guess it works, but it seems like
wasted space. i thought there might be a way to group everyone who has
the same category.
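With a two-column table like the one sketched above, that grouping is one query away; a minimal sqlite3 sketch (table and column names are assumed for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_categories ("
    "  userid INTEGER, category TEXT,"
    "  PRIMARY KEY (userid, category))"
)
rows = [(0, "life"), (0, "work"), (0, "web"),
        (1, "life"), (1, "work"), (1, "starcraft"),
        (2, "work"), (2, "starcraft"), (2, "programming")]
conn.executemany("INSERT INTO user_categories VALUES (?, ?)", rows)

# One query replaces the fetch-explode-deduplicate round trip entirely:
distinct = [c for (c,) in conn.execute(
    "SELECT DISTINCT category FROM user_categories ORDER BY category")]
print(distinct)
```

The repeated strings do take space, but the database can now do the deduplication (and any per-category grouping) itself instead of the application.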

Re: problem w/ duplicates

| userID | user   |   (userID is Primary)
| 01     | john   |
| 02     | paul   |
| 03     | george |
| 04     | ringo  |

| categoryID | category       |   (categoryID is Primary)
| 01         | lead vocals    |
| 02         | lead guitar    |
| 03         | keyboard       |
| 04         | harmonica      |
| 05         | backing vocals |
| 06         | drums          |
| 07         | rhythm guitar  |
| 08         | bass guitar    |

| userID | categoryID |   (the primary is made from both together!)
| 01     | 01         |
| 01     | 03         |
| 01     | 04         |
| 01     | 05         |
| 01     | 07         |
| 02     | 01         |
| 02     | 05         |
| 02     | 07         |
| 02     | 08         |
| 03     | 02         |
| 03     | 05         |
| etc    | etc        |
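The three-table design above can be put into runnable form; a minimal sqlite3 sketch, with only a few of the sample rows and with the link-table name assumed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users      (userID INTEGER PRIMARY KEY, user TEXT);
CREATE TABLE categories (categoryID INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE user_category (
    userID     INTEGER REFERENCES users(userID),
    categoryID INTEGER REFERENCES categories(categoryID),
    PRIMARY KEY (userID, categoryID)  -- the primary made from both together
);
""")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "john"), (2, "paul"), (3, "george"), (4, "ringo")])
conn.executemany("INSERT INTO categories VALUES (?, ?)",
                 [(1, "lead vocals"), (6, "drums"), (8, "bass guitar")])
conn.executemany("INSERT INTO user_category VALUES (?, ?)",
                 [(1, 1), (2, 1), (2, 8), (4, 6)])

# The distinct categories actually in use, one JOIN away:
used = [c for (c,) in conn.execute("""
    SELECT DISTINCT c.category
    FROM categories c
    JOIN user_category uc ON uc.categoryID = c.categoryID
    ORDER BY c.category
""")]
print(used)
```

Each category string is stored exactly once; the link table holds only small integer pairs, and the composite primary key guarantees a user can't be given the same category twice.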

Re: problem w/ duplicates

hm. thank you for clearing that up. is having 3 tables faster/more
efficient/does it save more space than having two tables? i guess there
are fewer strings stored, but there are twice as many rows necessary..

Re: problem w/ duplicates

Mark wrote:

Well, I'm definitely not qualified to comment on efficiency but I'm
sure there must be lots out there on the performance comparisons of
flat tables vs normalized dbs.

In this simple example, there's probably not a lot in it. By putting
the categories in a separate table, I'm reducing the risk of errors in
user input - or at least making those errors more consistent! If the
categories also had descriptions, for instance, then the performance
benefits would become more apparent.

But the db I've suggested is a poor example for demonstrating the real
benefits of (at least some degree of) normalization. I'd write a better
one - but there are SO many well-written tutorials already out there on
db construction and normalization that it hardly seems worth it :-)

Re: problem w/ duplicates

Indeed, in the example provided the three-table design is faster. Overall,
though, the performance of your tables depends on the number of rows. If the
number of rows in any of the tables is not going to surpass maybe 400,000 or
500,000, the denormalized solution is the way to go; however, there is a
point of diminishing returns that you need to watch for by examining the
logs and the performance of your database.

Hope this helps.

On 7/11/06 02:33, in article, "Mark" wrote:


Re: problem w/ duplicates

Mark wrote:


As Rich first pointed out, normalization is important for data quality
and performance. The number of tables and rows alone does not determine
performance. Proper primary and foreign keys between related tables,
along with indexes on columns used for conditions in WHERE clauses, have
a greater effect on performance, since they control how the query is
parsed and how the data engine determines how to fetch the data.
Hacking delimiters into nested values, by contrast, will tend to slow
things down.
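The effect of an index on a WHERE-clause column can be seen directly in a query plan; a minimal sqlite3 sketch (the thread never names a specific engine, and the table and index names here are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_category (userID INTEGER, categoryID INTEGER)")

# Without an index, a lookup by categoryID must scan the whole table:
plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT userID FROM user_category WHERE categoryID = 3"
).fetchall()

# With an index on the WHERE-clause column, the engine can seek instead:
conn.execute("CREATE INDEX idx_category ON user_category (categoryID)")
plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT userID FROM user_category WHERE categoryID = 3"
).fetchall()

print(plan_before)  # plan mentions a table SCAN
print(plan_after)   # plan mentions a SEARCH using idx_category
```

The exact wording of the plan text varies between SQLite versions, but the shift from a full scan to an index search is the point: the key and index structure, not the raw table count, is what the query planner works from.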

Imagine that you went to the library and the books were simply stacked
on the shelves in whatever order could cram in the most books on the
least shelves using the least staff. That'd be easy for the library to
file the books, but a hassle for patrons to find the books...


Re: problem w/ duplicates

Skarjune wrote:

Thanks for explaining all this to me guys :) I knew there was a proper
or better method to approach this problem, but I guess I'm still sort
of new to databases, and wasn't sure what it was.  Perhaps I'll google
some stuff on normalization and find out more.  Anyways, this should
solve my problem.  Thanks a ton!
