
## Re: hash

Hashing itself is not really an algorithm; it's a data structure.

Some particular thing you do with hashing may well be slower than binary search, but without knowing what the OP is doing it's difficult to speculate.

[...]

--

Keith Thompson (The_Other_Keith) kst-u@mib.org <http://www.ghoti.net/~kst>

Will write code for food.

"We must do something. This is something. Therefore, we must do this."

-- Antony Jay and Jonathan Lynn, "Yes Minister"

## Re: hash

True,

a hash is a data structure, and there is internal logic for how to retrieve values. If there is a big number of keys, then the hash may have conflicting keys; these conflicts are stored as lists, where retrieval is slow, O(n). Retrieving btree values is much faster, O(log n). There is no particular problem; I had a conversation with my colleague discussing this, and maybe there is a need for a module to make it easy.
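As a sketch of the collision lists described above, here is a minimal chained hash table in Python (an illustration only, not how Perl implements hashes; the bucket count and use of Python's built-in `hash` are arbitrary choices):

```python
# Minimal hash table with separate chaining: each bucket holds a list of
# (key, value) pairs, so a lookup scans one bucket's list linearly.
# Worst case (everything in one bucket) degenerates to O(n).
class ChainedHash:
    def __init__(self, nbuckets=8):
        self.buckets = [[] for _ in range(nbuckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def get(self, key):
        for k, v in self._bucket(key):   # O(chain length) scan
            if k == key:
                return v
        raise KeyError(key)

h = ChainedHash()
h.put("apple", 1)
h.put("pear", 2)
print(h.get("pear"))  # 2
```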

## Re: hash

That has nothing to do with the number of keys, but only with how the keys are distributed across the buckets. Degenerate hashes, where all the keys land in the same bucket, can happen with any number of keys. But the probability is very small, so on average hashes are significantly faster than trees.

If they are stored as simple lists, then yes. But there are other ways buckets can be organized. Whether Perl uses a simple list or a more complex structure, I do not know.

If Perl uses simple lists, then I presume the problem of hash key conflicts is not very relevant.

Unfortunately for trees, O(log n) holds in all cases, including the best case and the typical case, while hashes are O(1) in both the best and the typical case.

jue

## Re: hash

That's the worst case. The average case depends on the load factor.

Perl uses simple lists - see PerlGuts Illustrated[1]. This is not a problem: since perl keeps the load factor (the ratio between the number of elements and slots) below 1, and it can change the seed of the hash function if a hash degenerates, the lists are always very short - certainly shorter than the depth of a binary tree with the same number of elements.

hp

[1] http://cpansearch.perl.org/src/RURBAN/illguts-0.42/index-14.html
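The claim that the chains stay shorter than a balanced tree's depth can be sanity-checked with a quick simulation (a Python sketch, not perl's actual hash; the random draw stands in for a well-distributed hash function and the seed is arbitrary):

```python
import math
import random

# Throw n keys into n buckets (load factor 1); random.randrange stands in
# for a well-distributed hash function.
random.seed(42)
n = 1024
chains = [0] * n
for _ in range(n):
    chains[random.randrange(n)] += 1

max_chain = max(chains)
nonempty = sum(1 for c in chains if c)
avg_chain = n / nonempty        # average length of the non-empty chains
tree_depth = math.log2(n)       # depth of a balanced binary tree: 10
print(max_chain, round(avg_chain, 2), tree_depth)
```

With a decent hash the longest chain grows only on the order of log n / log log n, so for n = 1024 it typically stays well below the tree depth of 10.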

--

_ | Peter J. Holzer | Deprecating human carelessness and

|_|_) | Sysadmin WSR | ignorance has no successful track record.

| | | hjp@hjp.at |

__/ | http://www.hjp.at/ | -- Bill Code on asrg@irtf.org

## Re: hash

[...]

This doesn't exactly make sense: Let's assume I'm storing 1024 key-value pairs in a hash table and what I end up with are 128 lists of length 8. Ignoring the overhead of calculating the hash values, this means that I can locate each of these 1024 pairs with at most 8 key comparisons. If the same 1024 key-value pairs were organized as some kind of balanced binary search tree, that would be log2(1024) = 10 key comparisons. Since '128 pointers' isn't exactly a lot of data nowadays, it should be possible to double the size of the table and thus end up with 256 lists of length 4, and so on.

NB: This is a theoretical consideration and 'in practice' things aren't that simple. But it should be sufficient to show that 'searching in a linked list is slow while ...' doesn't hold water.
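The arithmetic in the example can be checked directly (a throwaway calculation using the numbers above):

```python
import math

pairs = 1024
buckets = 128
chain_len = pairs // buckets     # at most 8 key comparisons per lookup
tree_cmps = math.log2(pairs)     # 10 comparisons in a balanced binary tree
print(chain_len, tree_cmps)      # 8 10.0

# Doubling the table halves the chains:
print(pairs // 256)              # 4
```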

## Re: hash

On Sun, 13 May 2012 19:11:39 +0100, Rainer Weikusat wrote:

If you end up with so many lists of that length, something is seriously wrong with your hash table. It's either too small, or you have a very bad hashing function.

George was talking about a btree, not a binary tree. Those are fairly different beasts.

HTH,
M4


## Re: hash

The perl hash implementation always keeps the load factor <= 1, so a hash table with 1024 elements has at least 1024 slots. Of course it's possible that 896 of these slots are empty and 128 slots have 8 elements each, but as you say that would mean a very bad hash function or a deliberate attack. To guard against the latter possibility, perl >= 5.8 (IIRC) monitors the chain length and changes the seed of the hash function if a chain grows too long.

But I think Rainer was trying to come up with a worst-case scenario where a hash is still faster (requires fewer comparisons) than a binary tree, as the next paragraph shows:

Yes, I noted that too in an earlier posting, but btrees are normally used for on-disk data structures and they don't implement a binary search, so it seems likely that George really meant binary trees, not btrees.

hp

--

_ | Peter J. Holzer | Deprecating human carelessness and

|_|_) | Sysadmin WSR | ignorance has no successful track record.

| | | hjp@hjp.at |

__/ | http://www.hjp.at/ | -- Bill Code on asrg@irtf.org

## Re: hash

On Mon, 14 May 2012 10:00:55 +0200, Peter J. Holzer wrote:

Btrees are used all over the place, not just on-disk. OTOH, in memory you are much more likely to actually use a red-black tree or similar, but the interface and performance characteristics are so similar to a btree that they are often called btrees as well.

Possible, even probable.

M4


## Re: hash

Since a B-tree is a more general tree structure than a binary search tree, every binary search tree is also a B-tree, just not vice versa. While that's not consistent with a statement you made in another subthread, it is surely 'not incorrect'. But this "data structures I heard of!"-bingo is IMO pretty pointless.

## Re: hash

And I 'feel' that you neither wrote anything even remotely related to the original lookup question nor anything which could be considered 'a point' at all, just a bunch of rambling statements about various data structures. This may be wrong, but if you can't be bothered to express yourself clearly in the face of a misunderstanding, whatever you meant to express doesn't matter.

## Re: hash

This was an example supposed to illustrate that 'searching in a linked list is slow while ...' doesn't make sense.

Apart from that: If I have a fixed-size table (128 slots in this example) and my key/value pairs are distributed evenly onto these 128 slots (yielding 128 lists of length 8), my hash function quite obviously did the best job it is theoretically capable of doing.

And I was writing about a data structure with the performance characteristics he mentioned. Even if he was really referring to 'a B-tree' and not just using btree as an abbreviation for 'binary tree' (which I very strongly suspect), that doesn't matter for this example.

## Re: hash

Big-O notation only tells you how an algorithm scales with increasing n. It doesn't tell you how it performs for any particular n (especially not for a small n).

<nitpick>
The data structure is called a "hash table". There is also the "hash function". Both are frequently abbreviated as "hash". "Hashing" may refer to computing a hash function or to filling a hash table. In both cases it is an algorithm.
</nitpick>

A degenerate hash is not necessary.

A successful lookup of an element in a hash consists of the following steps:

1) compute the hash of the lookup key (cost: c1 * length(lookup_key))

2) compute the location of the element in the hash (cost: negligible)

3) compare the lookup key to the element key (cost: c2 * length(lookup_key) if successful, almost zero if unsuccessful, because two strings that collide on the hash are unlikely to have a common prefix).

4) If unsuccessful, compute the location of the next element (this may be a simple pointer lookup or involve the computation of a new hash) and continue at 3.

If the hash is properly constructed, step 4 is very rare, so the access time is

    c1 * length(lookup_key) + c2 * length(lookup_key)
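The four lookup steps can be sketched in code (a Python illustration of the scheme, not perl's implementation; the djb2-style hash and the stored-hash comparison in step 3 are illustrative choices that make the cheap rejection of colliding keys explicit):

```python
def my_hash(key):
    # Toy string hash (djb2 variant) standing in for the real hash
    # function; cost is proportional to length(key), as in step 1.
    h = 5381
    for ch in key:
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF
    return h

def build(pairs, nbuckets=8):
    # Each bucket is a chain of (hash, key, value) entries.
    table = [[] for _ in range(nbuckets)]
    for k, v in pairs:
        h = my_hash(k)
        table[h % nbuckets].append((h, k, v))
    return table

def lookup(table, key):
    h = my_hash(key)                  # step 1: cost ~ c1 * length(key)
    bucket = table[h % len(table)]    # step 2: negligible
    for eh, ek, ev in bucket:         # step 4: walk the chain (rarely long)
        if eh == h and ek == key:     # step 3: full compare only on hash match
            return ev
    raise KeyError(key)

t = build([("foo", 1), ("bar", 2)])
print(lookup(t, "bar"))  # 2
```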

For a binary tree, you have to descend d levels, and do a string comparison at every level, so the cost is

    c2 * length(common_prefix(lookup_key, node1_key)) +
    c2 * length(common_prefix(lookup_key, node2_key)) +
    ...
    c2 * length(lookup_key)

(the common prefix is likely to get longer as we descend down the tree)

Note that the last term is the same, so we can eliminate it.

It follows that the hash access is faster than the binary tree[1] access if the computation of the hash is faster than the (d-1) partial string compares. This depends on the length of lookup_key, the distribution of the keys in the dictionary, the hash algorithm, the depth of the tree, how much of your data is already in the CPU cache, etc.

In short, it is impossible to say that a hash is faster than a binary tree or vice versa. You have to benchmark particular implementations for particular data.
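A throwaway benchmark along these lines might look as follows (Python's built-in dict standing in for the hash table, a sorted list with bisect standing in for the balanced tree; the key count and key shape are arbitrary, and the relative timings will vary by machine and data):

```python
import bisect
import timeit

# 100_000 synthetic string keys of uniform shape.
keys = ["key%08d" % i for i in range(100_000)]
d = dict.fromkeys(keys, 0)      # hash-table lookup
sorted_keys = sorted(keys)      # binary search over a sorted array

probe = keys[12345]
t_hash = timeit.timeit(lambda: d[probe], number=100_000)
t_tree = timeit.timeit(lambda: bisect.bisect_left(sorted_keys, probe),
                       number=100_000)
print(f"hash lookup: {t_hash:.3f}s  binary search: {t_tree:.3f}s")
```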

There are some general observations, though:

The tree gets slower with increasing depth while the hash is quite insensitive to the number of elements, so the hash has an advantage for a large number of elements.

For a hash lookup you have to compute a hash of the entire key up front, while for the tree you only have to compare the common prefix (+1 byte), so the tree has an advantage if the keys are long (and don't have a common prefix).

The hash is likely to be more cache friendly because you only need to look at two contiguous memory locations, while in the tree you need to look at (d+1) memory locations.

Also ACK.

hp

[1] George also wrote "btree", which is not a binary tree. For a btree it gets even more complicated, but btrees are normally used for disk data structures, not in memory, although some in-memory data structures (e.g. Judy arrays) use similar techniques to exploit caching.

