52.219 Hashing
So far, to find something in a tree, or in a list, we have searched.
However, there is another technique called hashing.
Assume we are
looking for some element e in a set S, where
S may be implemented as
a vector. We apply some function to e, hash(e), and this delivers the
position of e in S, and we can then go directly to
that location to get e
or information on e. For example, e might be the key to a record,
such as someone's name, and we wish to extract details/info on that
person. e might be a telephone number and we want to know address, or
e might be address and we want telephone number
(Question: consider this ... how do you think directory enquiries
works? You know, you phone up BT and ask for the phone number of
Brad Cox, who lives in Anstruther, I think. If it's an on-line system
(ie, they don't lug around all the telephone directories for all of
the cities and regions in the UK, do they? It's just not physically
possible, is it?) how does it work? Gulp!)
There are two broad kinds of hashing: open hashing and closed
hashing.
Open Hashing
The essential idea is that the (possibly infinite) set of potential
set members is partitioned into a finite number of classes.
We have a vector, let's say of n elements, where each element of the
vector points to a list. So, we have n classes, and each class is
stored as a list. Therefore, hash(e) delivers the class of e.
We will call e the "key" and hash(e) "the hash value" of
e.
Sometimes this vector is referred to as a "bucket table" where each
element can be thought of as a bucket containing a group of entries.
What we now want to do is define a hash function that distributes
keys evenly (uniformly) over the bucket table. Otherwise we
under-utilise the bucket table and have buckets that contain too many
entries, resulting in too much sequential search. In the worst case
everything would hash to one bucket and lookup would degenerate into a
purely sequential search.
It is not always clear that we can select a function hash so that a
typical set will have its members distributed fairly evenly across
the buckets.
see /home/s7/pat/scheme/pub/hash-open.scm
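The real code lives in that file. As a rough sketch of the idea (my own
procedure names, not necessarily those in hash-open.scm, with the hash
function passed in as a parameter), an open hash table might look like this:

; An open hash table is a vector of n buckets, each bucket a list of keys.
(define (make-table n)
  (make-vector n '()))

; hash is assumed to map a key and the bucket count to a bucket index.
(define (insert! table key hash)
  (let ((i (hash key (vector-length table))))
    (if (not (member key (vector-ref table i)))      ; keep each key only once
        (vector-set! table i (cons key (vector-ref table i))))))

(define (lookup table key hash)
  (member key (vector-ref table (hash key (vector-length table)))))

Insertion and lookup within a bucket are just the list operations we have
already covered; the hash value simply tells us which list to use.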
Closed Hashing
A closed hash table keeps the members of the set in the bucket table
rather than using that table to store list headers. Consequently only
one element is in any bucket. So, what happens if hash(x) = hash(y)
when x and y are different? That is, what happens when we have a
collision?
We have a "rehash strategy". A "collision" occurs when we have i = hash(x)
and the ith element of the table is already occupied (by something
other than x). The rehash strategy chooses a sequence of alternative
locations, hash-1(x), hash-2(x), hash-3(x), ..., within the table.
We try each one of these locations in turn until we encounter an
empty (unoccupied) location. If none exists, and we have exhausted the
rehash strategy, we assume that the table is full and that we cannot
insert x.
A simple rehashing strategy, called "linear rehashing", is as
follows:
hash-i(x) = (hash(x) + i) mod B
where there are B buckets in the hash table.
Initially the table is empty, and each bucket holds a special value
"empty", ie a value different from any value that we might insert
into the table.
see /home/s7/pat/scheme/pub/hash-closed.scm
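Again, the real code is in that file. A bare-bones sketch of closed hashing
with linear rehashing (my own names and layout, assuming the hash function
is passed in and the symbol 'empty marks a free bucket) might be:

; A closed hash table keeps the keys in the vector itself.
(define (make-closed-table b)
  (make-vector b 'empty))

; Linear rehashing: hash-i(x) = (hash(x) + i) mod B
(define (rehash h i b)
  (modulo (+ h i) b))

(define (closed-insert! table key hash)
  (let ((b (vector-length table)))
    (let try ((i 0))
      (if (= i b)
          #f                                   ; every bucket tried: table is full
          (let ((j (rehash (hash key b) i b)))
            (cond ((eq? (vector-ref table j) 'empty)
                   (vector-set! table j key)
                   #t)
                  ((equal? (vector-ref table j) key) #t)   ; already present
                  (else (try (+ i 1)))))))))

(define (closed-lookup table key hash)
  (let ((b (vector-length table)))
    (let try ((i 0))
      (if (= i b)
          #f                                   ; rehashes exhausted: not present
          (let ((slot (vector-ref table (rehash (hash key b) i b))))
            (cond ((eq? slot 'empty) #f)       ; reached an empty bucket: not present
                  ((equal? slot key) #t)
                  (else (try (+ i 1)))))))))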
More Complex Rehash Strategy
The above linear rehashing strategy is simple (and that's nice) but
probably not great. What might be better is to have an ordered list
of hashing functions. That is, if the first hashing function results in
a collision, use the second, and so on, until the item is found, or an
empty location is found, or we have tried all of the hashing functions.
One technique is to generate a random list of integers, for example
'(9 2 6 1 5 3 4 7 8), when creating the hash table, and store this list
with the hash table. When a collision occurs on hash(e) we add the first
number from this list to hash(e) (modulo the table size); if that location
also collides we add the second number, and so on until the list has been
exhausted.
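As a sketch (the offsets and the procedure name are my own, not taken from
the course code), the stored list of random offsets just defines the
sequence of alternative locations to try:

; offsets is the random list stored with the table, e.g. '(9 2 6 1 5 3 4 7 8).
; Given the original hash value h and B buckets, return the alternative
; locations to probe, in order.
(define (rehash-sequence h offsets b)
  (map (lambda (k) (modulo (+ h k) b)) offsets))

; For example, (rehash-sequence 13 '(9 2 6 1 5 3 4 7 8) 20)
; gives (2 15 19 14 18 16 17 0 1).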
Deletion from Hash Table
When we delete from an open hash table, we just hash to the class,
and then delete from a list (and this has already been covered).
When it is a closed hash table, things are a wee bit more
complicated. We hash (and possibly rehash) to the bucket and mark the
contents of the bucket as deleted, using some symbol that cannot
occur as a valid key. So, assume we have deleted item Ex. Fair enough,
but suppose we later search for an item Ey whose key hashes to the same
bucket as the deleted key Ex, and that bucket is marked as deleted. Do we
assume that Ey has been deleted? No, we must continue searching for Ey
until we find it, reach an unoccupied entry, or run out of rehashes. The
same holds if we later search for Ex. It will eventually hash to the
deleted bucket, but the marker does not tell us that it was Ex that was
deleted, so we must again try all the rehashes to be sure it is not present.
We could get smart and keep the deleted key in its bucket, along with a
flag saying whether the entry is present or deleted. Consequently if we
hash to Ex's bucket and find its key, but the flag says deleted, we know
for sure that Ex is not present.
Similarly, when we hash Ey and collide with a deleted key, we can
determine if it was Ey that was deleted. Of course, we have a small
space overhead with the flag.
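A rough sketch of deletion from a closed table with a 'deleted marker,
using the closed-table and rehash sketches above (again, my own names,
not the ones in hash-closed.scm):

; Delete by overwriting the bucket with 'deleted rather than 'empty, so that
; a later search knows it must keep rehashing past this bucket.
(define (closed-delete! table key hash)
  (let ((b (vector-length table)))
    (let try ((i 0))
      (if (= i b)
          #f                                   ; key was never in the table
          (let ((j (rehash (hash key b) i b)))
            (cond ((eq? (vector-ref table j) 'empty) #f)
                  ((equal? (vector-ref table j) key)
                   (vector-set! table j 'deleted)
                   #t)
                  (else (try (+ i 1)))))))))

; Note that the closed-lookup sketch above already probes past a 'deleted
; bucket, since 'deleted is neither 'empty nor the key being sought.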
Deletions are a bit of a pain, and we can eradicate them by
rebuilding the hash table every so often. That is, get the
non-deleted entries in the table, and hash them into a new table.
Really, this is yet another form of garbage collection.
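Rebuilding could be as simple as this sketch (assuming the procedures from
the earlier sketches):

; Copy every live key (neither 'empty nor 'deleted) into a fresh table,
; discarding the tombstones.
(define (rebuild-table old hash)
  (let ((new (make-closed-table (vector-length old))))
    (do ((i 0 (+ i 1)))
        ((= i (vector-length old)) new)
      (let ((slot (vector-ref old i)))
        (if (not (or (eq? slot 'empty) (eq? slot 'deleted)))
            (closed-insert! new slot hash))))))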
An Example of Open Hashing
The following text (an abstract from a paper I wrote) was hashed into
an open hash table of 20 buckets, using the hash function hash-1 (see
source file). That is, we convert each character in a word to its
ASCII code, sum these codes, and take the sum modulo the number of
buckets. Here is the text:
An empirical study of randomly generated binary constraint satisfaction
problems reveals that for problems with a given number of variables, domain
size, and connectivity there is a critical level of constraint tightness at
which a phase transition occurs. At the phase transition problems change from
being soluble to insoluble, and the difficulty of problems increases
dramatically. A theory developed by Williams and Hogg, and independently
developed by Smith, predicts where the hardest problems should occur. It is
shown that the theory is in close agreement with the empirical results, except
when problems are small and involve few constraints.
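The hash function described above (sum of the character codes of the word,
modulo the number of buckets) might look something like the following
sketch; the real hash-1 and its exact signature are in the source file, so
take this as an illustration only:

; Sum the character codes of the word, then take that modulo the bucket count.
(define (hash-1 word n-buckets)
  (modulo (apply + (map char->integer (string->list word)))
          n-buckets))

; For example, (hash-1 "soluble" 20) gives 18, which agrees with the trace
; below, where "soluble" sits alone in bucket 18.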
What now follows is a trace of the Scheme session. The open hashing
software is loaded (and again, please look at the source as it is
part of the notes), then a 20 bucket open hash table is created,
where the keys are strings (the test for membership is string=?) and
the hashing function is hash-1 (see source file). The above data file
is then hashed into the 20 bucket open hash table t20 (via the
function file->table). We then look at the contents of the hash table
(via show t20). Each displayed line gives a list, which has the bucket
number (the first bucket is zero), the number of items in the bucket,
and then the contents of the bucket.
> (load "pub/hash-open")
;loading "pub/hash-open"
;done loading "pub/hash-open.scm"
;Evaluation took 83 mSec (0 in gc) 978 cons work
#
> (define t20 (make-open-hash-table 20 hash-1 string=?))
;Evaluation took 0 mSec (0 in gc) 44 cons work
#
> (file->table "pub/hash.data" t20)
;Evaluation took 200 mSec (0 in gc) 17475 cons work
#
> (show t20)
(0 2 (occur is))
(1 3 (it hogg the))
(2 2 (few predicts))
(3 3 (critical size generated))
(4 1 (with))
(5 2 (tightness binary))
(6 2 (results williams))
(7 6 (hardest theory to and for an))
(8 3 (constraints problems satisfaction))
(9 5 (except smith phase number study))
(10 2 (randomly empirical))
(11 4 (involve dramatically which connectivity))
(12 4 (are agreement developed domain))
(13 6 (insoluble at variables that constraint of))
(14 4 (when close change reveals))
(15 5 (in should independently difficulty occurs))
(16 3 (from level there))
(17 5 (small increases being given a))
(18 1 (soluble))
(19 4 (shown where by transition))
;Evaluation took 16 mSec (0 in gc) 426 cons work
#
>
Conclusion
It might look dark for hashing, and for closed hashing in particular.
Don't interpret it this way please. There are good hashing
strategies, and we need to explore/investigate/experiment before we
come up with a good one for a given application. If we get a good
one, then we can reduce sorting (in that application) to O(n) and
searching to O(1) on average. That is optimal! Even if we haven't got
a good hashing function, we can mix and match techniques to get really
good performance. For example, if we are dealing with alphanumeric
keys, we might use an open hashing strategy with a hash value that is
the integer conversion of the first character of the key. For keys
that begin with a letter this gives us 26 classes, and we can arrange
each class as a sorted list or vector. We can then search within each
class using a binary chop. This could be very quick indeed.
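A sketch of that first-character hash (my own code, just to illustrate the
idea):

; Hash on the first character only: keys beginning with 'a'..'z' fall into
; (up to) 26 classes; the modulo keeps any other first character in range.
(define (first-char-hash word n-buckets)
  (modulo (- (char->integer (char-downcase (string-ref word 0)))
             (char->integer #\a))
          n-buckets))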
You should also realise that the hash table might be a static thing.
It might be built once and then used over and over again, for many
millions of queries, over a period of weeks, months, or years.
Consequently, making it highly efficient might result in a huge saving.
Some Things to Consider
I don't know the answers to these questions, but I think they are
interesting. Have you used the online facilities in the library? How
are they implemented? Have you ever used BT's directory enquiries?
How does that work, given that there are more than 60 million people
in the UK? There are many (and an increasing number of) very large
databases out there, and we want fast access, and access on multiple
attributes. Getting information out quickly is of great importance.