# Search by Hashing

## Searching

Searching a list of values is a common task. An application program might retrieve a student record, bank account record, credit record, or any other type of record using a search algorithm. Some of the most common search algorithms are serial search, binary search, and search by hashing. The tool for comparing the performance of different algorithms is called run-time analysis. Here, we present search by hashing and discuss the performance of this method. But first, we present a simple search method, the serial search, along with its run-time analysis.

### Serial Search

In a serial search, we step through an array (or list) one item at a time looking for a desired item. The search stops when the item is found or when the search has examined each item without success. This technique is probably the easiest to implement and is applicable to many situations.
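As a sketch, a serial search over an array of integers might look like the following (the function name and signature are our own illustration, not part of the given code):

```cpp
#include <cassert>
#include <cstddef>

// Return the index of target in data[0..n-1], or -1 if it is absent.
// Each loop iteration performs exactly one array access, which is the
// operation counted in the run-time analysis below.
int serial_search(const int data[], std::size_t n, int target)
{
    for (std::size_t i = 0; i < n; ++i)
    {
        if (data[i] == target)   // one array access per iteration
            return static_cast<int>(i);
    }
    return -1;                   // not found: all n elements were accessed
}
```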

#### Serial Search - Analysis

The running-time of serial search is easy to analyze. We will count the number of operations required by the algorithm, rather than measuring the actual time. For searching an array, a common approach is to count one operation each time that the algorithm accesses an element of the array.

#### 1. Worst-case

Usually, when we discuss running times, we consider the "hardest" inputs, for example, a search that requires the algorithm to access the largest number of array elements. This is called the worst-case running time.

For serial search, the worst-case running time occurs when the desired item is not in the array. In this case, the algorithm accesses every element.

Thus, for an array of n elements, the worst-case time for serial search requires n array accesses.

#### 2. Average-case

An alternative to the worst-case running time is the average-case running time, which is obtained by averaging the running times over all inputs of a particular kind. For example, suppose our array contains ten elements. If we are searching for the target that occurs at the first location, then there is just one array access. If we are searching for the target that occurs at the second location, then there are two array accesses. And so on through the final target, which requires ten accesses. The average of all these searches is:

```
(1+2+3+4+5+6+7+8+9+10)/10 = 5.5
```

We can generalize this expression for an array of size n:

```
(1+2+...+n)/n = (n*(n+1)/2)/n = (n+1)/2
```

Thus, for an array of n elements, the average-case running time for serial search requires (n+1)/2 array accesses.

Note: Both worst-case time and average-case time are O(n), but nevertheless, the average case is about half the time of the worst-case.

#### 3. Best-case

A third way to measure running time is called best-case, and as the name suggests, it takes the most optimistic view. The best-case running time is defined as the smallest of all the running times on inputs of a particular size. For serial search, the best-case occurs when the target is found at the front of the array, requiring only one array access.

Thus, for an array of n elements, the best-case time for serial search requires just 1 array access.

Unless the best-case behavior occurs with high probability, the best-case running time is generally not used during analysis.

## Hashing

Hashing has a worst-case behavior that is linear for finding a target, but with some care, hashing can be dramatically fast in the average-case. Hashing also makes it easy to add and delete elements from the collection that is being searched.

Suppose we want to store information about each student in a database, so that we can later retrieve information about any student simply using his/her student ID. To be specific, suppose the information about each student is an object of the following form, with the student ID stored in the key field:

```
struct Student
{
    int key;         // the student ID
    long phone;      // phone number
};
```

We call each of these objects a record. Of course, there might be other information in each student record.

If student IDs are all in the range `0..99`, we could store the records in an array of the following type, placing student ID k in location `data[k]`:

```
Student data[100];  // array of 100 records
```

The record for student ID k can be retrieved immediately since we know it is in `data[k]`.

What, however, if the student IDs do not form a neat range like `0..99`? Suppose we know only that there will be a hundred or fewer IDs, distributed somewhere in the range `0..9999`. We could then use an array with 10,000 components, but that seems wasteful since only a small fraction of the array would be used. It appears that we would have to store the records in an array with 100 elements and use a serial search through this array whenever we wish to find a particular student ID. But if we are clever, we can store the records in a relatively small array and still retrieve students by ID much faster than we could by serial search.

To illustrate this, suppose that we know that the student IDs will be:

```
0, 100, 200, ... , 9800, 9900
```

In this case, we can store the records in an array called `data` with only 100 components. We'll store the record with student ID k at location:

```
data[k/100]
```

Note: We are using integer division, so `/` gives the quotient.

If we want to retrieve information for student ID 700, we compute `700/100` and obtain the index 7. The record for student ID 700 is stored in array component `data[7]`.

This general technique is called hashing. Each record requires a unique value called its key. In our example the student ID is the key, but other, more complex keys are sometimes used. A function, called the hash function, maps keys to array indices.

Suppose we name our hash function `hash`. If a record has a key of k, then we will try to store that record at location `data[hash(k)]`. Using the hash function to compute the correct array index is called hashing the key to an array index. The hash function must be chosen so that its return value is always a valid index for the array. In our example:

```
hash(k) = k / 100
```

Given this hash function and keys that are multiples of 100, every key produces a different index when it is hashed. Thus, `hash` is a perfect hash function. Unfortunately, a perfect hash function cannot always be found.
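A minimal sketch of this perfect-hash scheme follows; the function and variable names are our own illustration:

```cpp
#include <cassert>

struct Student
{
    int  key;    // the student ID
    long phone;  // phone number
};

const int SIZE = 100;
Student data[SIZE];

// Perfect hash: the IDs are 0, 100, ..., 9900, so k/100 is a
// distinct index in 0..99 for every valid key.
int hash_id(int k) { return k / 100; }

void insert(const Student& s) { data[hash_id(s.key)] = s; }

Student retrieve(int k) { return data[hash_id(k)]; }
```

For instance, the record for student ID 700 is stored at, and retrieved from, `data[7]` in a single step.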

### Collisions

Suppose we no longer have a student ID 400, but we have 399 instead. The record with student ID 300 will be stored in `data[3]` as before, but where will student ID 399 be placed? Student ID 399 hashes to `data[399/100] = data[3]`. So there are now two different records that belong in `data[3]`. This situation is known as a collision.

In this case, we could redefine our hash function to avoid the collision, but in practice you do not know the exact numbers that will occur as keys, and therefore, you cannot design a hash function that is guaranteed to be free of collisions. Typically, though, you do know an upper bound on how many keys there will be.

#### Solution

The usual approach is to use an array size that is larger than needed. The extra array positions make the collisions less likely. A good hash function will distribute the keys uniformly throughout the locations of the array. If the array indices range from 0 to 99, then you might use the following hash function to produce an array index for a record with a given key:

```
hash(k) = k % 100
```

This hash function always produces a value in the range 0 to 99.

One way to resolve collisions is to place the colliding record in another location that is still open. This storage algorithm is called open addressing. Open addressing requires that the array be initialized so that the program can test whether an array position already contains a record.

With this method of resolving collisions, we still must decide how to choose the locations to search for an open position when a collision occurs. There are two main ways to do so.

#### 1. Linear Probing

One way of dealing with collisions is the following algorithm:

1. For a record with key given by key, compute the index `hash(key)`.
2. If `data[hash(key)]` does not already contain a record, then store the record in `data[hash(key)]` and end the storage algorithm.
3. If the location `data[hash(key)]` already contains a record then try `data[hash(key)+1]`. If that location already contains a record, try `data[hash(key)+2]`, and so forth until a vacant position is found. When the highest numbered array position is reached, simply go to the start of the array. For example, if the array indices are `0..99`, and `98` is the key, then try `98,99,0,1` and so on, in that order. Searching for a vacant spot in this manner is called linear probing.
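The three steps above can be sketched as follows. For simplicity the table stores bare keys, and the array size, sentinel value, and function names are our own illustration:

```cpp
#include <cassert>

const int SIZE = 100;
const int NEVER_USED = -1;          // marks a vacant position

int table[SIZE];                    // stores keys only, for simplicity

void init_table()
{
    for (int i = 0; i < SIZE; ++i)
        table[i] = NEVER_USED;      // every position starts vacant
}

int hash_key(int key) { return key % SIZE; }

// Insert a key using linear probing; assumes the table is not full.
void insert(int key)
{
    int i = hash_key(key);          // step 1: compute hash(key)
    while (table[i] != NEVER_USED)  // step 3: location already occupied
        i = (i + 1) % SIZE;         // try the next slot, wrapping at the end
    table[i] = key;                 // step 2: store in the vacant slot
}
```

With `SIZE = 100`, inserting the keys 98, 198, and 298 (all hashing to 98) fills slots 98, 99, and then, after wrapping around, slot 0.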

#### 2. Double Hashing

There is a problem with linear probing. When several different keys hash to the same location, the result is a cluster of elements, one after another. As the table approaches its capacity, these clusters tend to merge into larger and larger clusters. This is the problem of clustering.

Clustering makes insertions take longer because the insert function must step all the way through a cluster to find a vacant location. Searches require more time for the same reason. The most common technique to avoid clustering is called double hashing. Replace step 3. in the above algorithm with the following step:

3. Let `i = hash(key)`. If the location `data[i]` already contains a record, then let `i = (i + hash2(key)) % SIZE` and try the new `data[i]`. If that location already contains a record, then let `i = (i + hash2(key)) % SIZE` again and try that `data[i]`, and so forth until a vacant position is found.

Searching for a vacant spot in this manner is called double hashing.

Important Note:

• With double hashing, we could return to our starting position before we have examined every available location. An easy way to avoid this problem is to make sure that the array size is relatively prime with respect to the value returned by `hash2()` (in other words, these two numbers must not have any common factor, apart from 1).

• `hash2(key)` must return a value in the range [1..SIZE-1]. Two possible implementations are:

```
1. hash2(key) = 1 + (key % (SIZE - 2))
2. hash2(key) = max(1, (key / SIZE) % SIZE)
```

Keep in mind that there exist many possible candidates.
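As an illustration, here is the first candidate `hash2` together with the probe sequence it generates under double hashing (the table size and the names are our own assumptions; a prime SIZE is relatively prime to every step in 1..SIZE-1):

```cpp
#include <cassert>

const int SIZE = 101;   // illustrative prime table size

// First candidate above: always returns a value in 1..SIZE-2.
int hash2(int key) { return 1 + (key % (SIZE - 2)); }

// Compute the n-th probe location for key under double hashing:
// start at hash(key) and advance by hash2(key) each step, modulo SIZE.
int probe(int key, int n)
{
    int i = key % SIZE;             // hash(key)
    for (int k = 0; k < n; ++k)
        i = (i + hash2(key)) % SIZE;
    return i;
}
```

For key 5, `hash2(5)` is 6, so the probe sequence is 5, 11, 17, ... instead of the consecutive slots 5, 6, 7, ... that linear probing would examine.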

### Chained hashing

In open addressing, each array element can hold just one entry. When the array is full, no more records can be added to the table. One possible solution is to resize the array and rehash all the entries. This would require a careful choice of new size and probably require each entry to have a new hash value computed.

A better approach is to use a different collision resolution method called chained hashing, or simply chaining, in which each component of the hash table's array can hold more than one entry. We still hash the key of each entry, but upon collision, we simply place the new entry in its proper array component along with other entries that happened to hash to the same array index.

The most common way to implement chaining is to have each array element be a linked list. The nodes in a particular linked list will each have a key that hashes to the same value.
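A minimal chained hash table can be sketched with standard-library containers; the table size and function names here are our own illustration:

```cpp
#include <cassert>
#include <list>
#include <vector>

const int SIZE = 10;

// One linked list per array component; colliding keys share a list.
std::vector<std::list<int>> table(SIZE);

int hash_key(int key) { return key % SIZE; }

// On collision, the new key simply joins the list at its index.
void insert(int key) { table[hash_key(key)].push_back(key); }

bool contains(int key)
{
    for (int k : table[hash_key(key)])  // search only the one chain
        if (k == key)
            return true;
    return false;
}
```

Because each component can hold any number of entries, the table never becomes "full" in the open-addressing sense; long chains simply make searches slower.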

### Run-Time Analysis of Hashing

The worst-case for hashing occurs when every key hashes to the same array index. In this case, we may end up searching through all the records to find the target just as in serial search.

The average-case performance of hashing is complex, particularly if deletions are allowed. We will give three different formulas for the three versions of hashing: open addressing with linear probing, open addressing with double hashing, and chained hashing. The three formulas depend on how many records are in the table. When the table has many records, there are many collisions and the average time for a search is longer. We define the load factor `alpha` as follows:

```
alpha = # occupied table locations / table size
```

For open address hashing, each array element holds at most one item, so the load factor can never exceed 1. But with chaining, each array position can hold many records, and the load factor might be higher than 1. In the following table, we give formulas for the average-case performance of the three hashing schemes along with numerical examples.

| Load factor (alpha) | Open addressing with linear probing: (1+1/(1-alpha))/2 | Open addressing with double hashing: -ln(1-alpha)/alpha | Chained hashing: 1+alpha/2 |
|---|---|---|---|
| 0.5 | 1.50 | 1.39 | 1.25 |
| 0.6 | 1.75 | 1.53 | 1.30 |
| 0.7 | 2.17 | 1.72 | 1.35 |
| 0.8 | 3.00 | 2.01 | 1.40 |
| 0.9 | 5.50 | 2.56 | 1.45 |
| 1.0 | N/A | N/A | 1.50 |
| 2.0 | N/A | N/A | 2.00 |
| 4.0 | N/A | N/A | 3.00 |

Notes:

• All formulas give the approximate average number of table elements examined in a successful search.
• The first two formulas assume a non-full hash table and no deletions. The third formula is valid even with deletions.
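The three formulas can be evaluated directly. This sketch (function names are ours) reproduces, for example, the alpha = 0.5 row of the table:

```cpp
#include <cassert>
#include <cmath>

// Approximate average number of table elements examined
// in a successful search, as a function of the load factor.
double linear_probing(double alpha) { return (1 + 1 / (1 - alpha)) / 2; }
double double_hashing(double alpha) { return -std::log(1 - alpha) / alpha; }
double chained(double alpha)        { return 1 + alpha / 2; }
```

Note that the first two functions are only meaningful for alpha < 1, matching the N/A entries in the table.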

You are given a template implementation of a hash table using open addressing with linear probing.

Compile the main file `testtab.cpp`, and run the program with different load factors. Write your results on a piece of paper.

The program prompts the user for a load factor, then inserts keys into a hash table and searches for them. While searching, the program counts the total number of accesses to array elements over all the keys searched, and then outputs the average number of accesses. This "experimental" value is then displayed alongside the "theoretical" value.

```
// Fill in -
```

In `table.tem`, you will have to implement the functions `hash2()` and `next_index2()`, and change two calls to `next_index()` to `next_index2()`.
In `testtab.cpp`, you will have to change the function `print_average_case()`.