> > At a previous job, we tested a 32-bit hash function by running it
> > against hundreds of thousands of unique URL's stored in our
> > database. We found one collision. A 64-bit hash is billions of
> > times better (4 billion, to be exact).
>
> Good to know. I wonder how many collisions I'd find if I ran it over
> every URL listed in the directory www.yahoo.com.
>
> Which 64 bit hash function did you use? Invent your own, or something
> "off the shelf"?
>
We found a public domain one on the net see
http://www.burtleburtle.net/bob/hash/evahash.html for some sample code. It's only a
32-bit hash though. However, that same page appears to have instructions for a 64-bit
hash function as well, but I haven't tried it at all. I'd be curious to know how many
collisions you find hashing all the URL's in yahoo's database :) I don't know how long
that would take, but if you do it I'd like to hear the results.
Since the hash function takes a key and an initial value, you could try running it with
two different initial values and/or keys. This would give you effectively a 128-bit
hash, which you could store across two fields in MySQL. I'm guessing that the 64-bit
hash will probably be good enough though.
Steve Meyers