List:Internals« Previous MessageNext Message »
From:Michael Widenius Date:October 30 2008 2:21am
Subject:Re: preferred size of hash array with hash_init()
View as plain text  
Hi!

>>>>> "MARK" == MARK CALLAGHAN <mdcallag@stripped> writes:

<cut>

>> Based on what I just read from
>> http://dev.mysql.com/sources/doxygen/mysql-5.1/libmysql__r_2hash_8c-source.html
>> 
>> it looks like we use a dynamic array for the buckets and a linked list for
>> the data elements within the bucket. Did I read that correctly?

MARK> See earlier in the thread for the reference to coalesced hashing. The
MARK> links are between buckets. When two items hash to the same bucket, the
MARK> duplicates must find an empty bucket and the buckets are 'linked'.

MARK> The implementation misses an optimization. The original hash value for
MARK> entries is not saved, so when the chain of entries is followed, key
MARK> comparisons are always done rather than only done when the (unmasked)
MARK> hash values match.

This is just done to save space; We need to do one hash calculation once
for the first lookup to check if we had a match or not. For other
entries in the linked list we don't have to do this.

MARK> Putting N entries into N buckets makes it very likely that there will
MARK> be conflicts. Using a 'sparse' array makes that less likely at the
MARK> cost of a larger array. For a small amount of extra memory, conflicts
MARK> can be reduced and as the dynamic array already has allocated that
MARK> extra memory in most cases this is worth exploring.

Yes, there is some conflicts but there are other advantages:

- We can grow dynamicly with O(1)
- We don't need a separate space for conflicts
- We can do any number of inserts+updates+deletes without degrading
  the hash.
- At any point, we know exact how much memory we need.

MARK> There is a lot of great analysis of hash functions on the web now.
MARK> Search for burtlebob or visit wikipedia. I am not a whiz on this, but
MARK> there are hash functions designed to work with a mask that is a power
MARK> of 2 rather than use mod prime to compute the bucket number.

I wrote the current hash functions 15 years ago so it's not perfect.

However, I did then spend days on it and the current hash functions
was the best one that I found that works for all the following data:

- CHAR fields
- 4 byte Integers with increasing value
- Random integers
- Floats

Regards,
Monty
Thread
preferred size of hash array with hash_init()MARK CALLAGHAN28 Oct
  • Re: preferred size of hash array with hash_init()Sergei Golubchik28 Oct
    • Re: preferred size of hash array with hash_init()MARK CALLAGHAN28 Oct
      • Re: preferred size of hash array with hash_init()Michael Widenius30 Oct
    • Re: preferred size of hash array with hash_init()MARK CALLAGHAN29 Oct
      • Re: preferred size of hash array with hash_init()Sergei Golubchik29 Oct
        • Re: preferred size of hash array with hash_init()MARK CALLAGHAN29 Oct
          • Re: preferred size of hash array with hash_init()Shawn Green29 Oct
            • Re: preferred size of hash array with hash_init()MARK CALLAGHAN29 Oct
              • Re: preferred size of hash array with hash_init()Michael Widenius30 Oct
            • Re: preferred size of hash array with hash_init()Michael Widenius30 Oct
Re: preferred size of hash array with hash_init()Sergei Golubchik29 Oct