Hi!
>>>>> "MARK" == MARK CALLAGHAN <mdcallag@stripped> writes:
<cut>
>> Based on what I just read from
>> http://dev.mysql.com/sources/doxygen/mysql-5.1/libmysql__r_2hash_8c-source.html
>>
>> it looks like we use a dynamic array for the buckets and a linked list for
>> the data elements within the bucket. Did I read that correctly?
MARK> See earlier in the thread for the reference to coalesced hashing. The
MARK> links are between buckets. When two items hash to the same bucket, the
MARK> duplicates must find an empty bucket and the buckets are 'linked'.
MARK> The implementation misses an optimization. The original hash value for
MARK> entries is not saved, so when the chain of entries is followed, key
MARK> comparisons are always done rather than only done when the (unmasked)
MARK> hash values match.
This is just done to save space; We need to do one hash calculation once
for the first lookup to check if we had a match or not. For other
entries in the linked list we don't have to do this.
MARK> Putting N entries into N buckets makes it very likely that there will
MARK> be conflicts. Using a 'sparse' array makes that less likely at the
MARK> cost of a larger array. For a small amount of extra memory, conflicts
MARK> can be reduced and as the dynamic array already has allocated that
MARK> extra memory in most cases this is worth exploring.
Yes, there is some conflicts but there are other advantages:
- We can grow dynamicly with O(1)
- We don't need a separate space for conflicts
- We can do any number of inserts+updates+deletes without degrading
the hash.
- At any point, we know exact how much memory we need.
MARK> There is a lot of great analysis of hash functions on the web now.
MARK> Search for burtlebob or visit wikipedia. I am not a whiz on this, but
MARK> there are hash functions designed to work with a mask that is a power
MARK> of 2 rather than use mod prime to compute the bucket number.
I wrote the current hash functions 15 years ago so it's not perfect.
However, I did then spend days on it and the current hash functions
was the best one that I found that works for all the following data:
- CHAR fields
- 4 byte Integers with increasing value
- Random integers
- Floats
Regards,
Monty