INDEX ABOUT HASHING IN GENERAL

1. URL
2. INTRODUCTION
3. HASH TABLE STRUCTURE
    OPERATIONS
        CREATE A TABLE
        DELETE A TABLE
        STRING LOOKUP
        INSERT A STRING
4. HASHING FUNCTIONS
    INTRODUCTION
    SORTS OF HASHING FUNCTIONS
        DIVISION METHOD
        MULTIPLICATION METHOD
        VARIABLE STRING ADDITION METHOD
5. COLLISIONS AND SOLUTIONS
    CHAINING
    RE-HASHING
    LINEAR PROBING
    OVERFLOW AREA
6. EXAMPLE

1. URL

HASH TABLE
cuments/sman/volume/hashtables_files/s_has.htm

2. INTRODUCTION

In computer science, a hash table is a data structure that provides fast lookup of a record indexed by a key where the domain of the key is too large for simple indexing. Like arrays, hash tables can look up an element in time that is independent of the size of the structure.

Hash tables are often used to implement associative arrays. Conceptually, an associative array is composed of a collection of keys and a collection of values, and each key is associated with one value. Hash tables allow reasonably simple implementations of the essential operations of associative arrays: looking up a record given a key, setting a record at a given key, and removing a record with a given key.

Hash tables use an array to hold, or reference, the stored records. Each slot of the array is indexed by a range reduction of the key. The range reduction is achieved by a hash function. However, different keys can map to the same array index, so a collision resolution strategy is needed.

Note, however, that hash tables do not naturally support any particular ordering of the records in the table, unlike a binary search tree, which keeps its records in key order but requires that keys be orderable.

3. HASH TABLE STRUCTURE

A hash table is made up of two parts: an array (the actual table where the data to be searched is stored) and a mapping function, known as a hash function. The hash function is a mapping from the input space to the integer space that defines the indices of the array. In other words, the hash function provides a way of assigning numbers to the input data such that the data can then be stored at the array index corresponding to the assigned number.

Collisions are resolved here with the separate chaining method, which requires a slight modification to the data structure. Instead of storing the data elements directly in the array, they are stored in linked lists. Each slot in the array then points to one of these linked lists. When an element hashes to a value, it is added to the linked list at that index in the array. This kind of hash table is the best known and most widely used.

[Figure: example of a hash table with separate chaining]

The hash table structure used in the separate chaining method is shown next. The list node is declared first so that the table structure can refer to it, and its data field is named str, which is the name used by the functions that follow:

    #include <stdlib.h>   /* malloc, free */
    #include <string.h>   /* strcmp, strdup */

    typedef struct _list_t_ {
        char *str;               /* a pointer to the stored data */
        struct _list_t_ *next;   /* a pointer to the next data */
    } list_t;

    typedef struct _hash_table_t_ {
        int size;         /* the size of the table */
        list_t **table;   /* the table elements */
    } hash_table_t;

3.1 OPERATIONS

CREATE A TABLE

    hash_table_t *create_hash_table(int size)
    {
        hash_table_t *new_table;
        int i;

        if (size < 1) return NULL;   /* invalid size for table */

        /* Attempt to allocate memory for the table structure */
        if ((new_table = malloc(sizeof(hash_table_t))) == NULL) {
            return NULL;
        }

        /* Attempt to allocate memory for the table itself */
        if ((new_table->table = malloc(sizeof(list_t *) * size)) == NULL) {
            free(new_table);   /* do not leak the table structure */
            return NULL;
        }

        /* Initialize the elements of the table */
        for (i = 0; i < size; i++) new_table->table[i] = NULL;

        /* Set the table's size */
        new_table->size = size;

        return new_table;
    }

This function creates and returns a new empty hash table whose size is given by the parameter passed by value (int size). It returns NULL if the size is invalid or if memory cannot be allocated. The running time of this operation grows linearly with the size of the table, because every slot has to be initialized.

DELETE A TABLE

    void free_table(hash_table_t *hashtable)
    {
        int i;
        list_t *list, *temp;

        if (hashtable == NULL) return;

        /* Free the memory for every item in the table, including the
         * strings themselves.
         */
        for (i = 0; i < hashtable->size; i++) {
            list = hashtable->table[i];
            while (list != NULL) {
                temp = list;
                list = list->next;
                free(temp->str);
                free(temp);
            }
        }

        /* Free the table itself */
        free(hashtable->table);
        free(hashtable);
    }

This function frees the memory used to store the hash table. As the comments in the code explain, the memory associated with every item in the table is released first, and only then is the table itself freed (the array whose slots each pointed to a list of items). The running time of this operation depends on the size of the hash table and on the number of items stored in it.

Before explaining the insertion and lookup functions, we introduce the hash function that both operations use:

    unsigned int hash(hash_table_t *hashtable, char *str)
    {
        unsigned int hashval;

        /* we start our hash out at 0 */
        hashval = 0;

        /* for each character: hashval = hashval * 31 + character */
        for (; *str != '\0'; str++) hashval = *str + (hashval << 5) - hashval;

        /* reduce the hash value to a valid slot of the table */
        return hashval % hashtable->size;
    }

For each character of the string to be hashed, we multiply the old hash by 31 and add the current character. Recall that shifting a number left is equivalent to multiplying it by 2 raised to the number of places shifted, so shifting hashval 5 places left multiplies it by 32. Subtracting hashval once from that result leaves hashval multiplied by 31. A shift and a subtraction have traditionally been cheaper than a multiplication, which is why the code is written this way (modern compilers usually perform this optimisation on their own).

The returned value is the slot of the hash table where the string will be stored. It is calculated as hashval modulo the size of the hash table, where hashval is the value accumulated by the shift-and-add loop.

STRING LOOKUP

    list_t *lookup_string(hash_table_t *hashtable, char *str)
    {
        list_t *list;
        unsigned int hashval = hash(hashtable, str);

        /* Walk the list stored in the slot the string hashes to */
        for (list = hashtable->table[hashval]; list != NULL; list = list->next) {
            if (strcmp(str, list->str) == 0) return list;
        }

        return NULL;
    }

This function goes to the correct list based on the hash value and checks whether str is in that list. If it is, a pointer to the list element is returned. If it isn't, NULL is returned.

In the worst case, every data element hashes to the same value, so a lookup is really a straight linear search on a single linked list. The search operation then goes back to depending on the size of the structure. With a good hash function, however, the probability of that happening is very small, and most lookups take time that is independent of the size of the hash table. The best case occurs when each slot stores at most one item.

INSERT A STRING

    int add_string(hash_table_t *hashtable, char *str)
    {
        list_t *new_list;
        list_t *current_list;
        unsigned int hashval = hash(hashtable, str);

        /* Does item already exist? */
        current_list = lookup_string(hashtable, str);
        if (current_list != NULL) return 2;   /* item already exists, don't insert it again */

        /* Attempt to allocate memory for list */
        if ((new_list = malloc(sizeof(list_t))) == NULL) return 1;

        /* Insert into list */
        new_list->str = strdup(str);
        new_list->next = hashtable->table[hashval];
        hashtable->table[hashval] = new_list;

        return 0;
    }

Inserting a string is almost the same as looking up a string. First the string is hashed, then the correct slot in the array is accessed, and finally the new string is inserted at the beginning of that slot's list. The function returns 0 on success, 1 if memory cannot be allocated and 2 if the string is already stored. The duplicate check is made before the new node is allocated so that no memory is leaked when the string already exists. This operation takes essentially the same steps as the string lookup.
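The four operations above can be put together in one compact sketch. This is an illustrative condensation under stated assumptions, not the code of the original text: the names node_t, table_t, table_create, table_hash, table_lookup, table_add and table_free are hypothetical, and the string is copied with malloc and strcpy instead of strdup so that only standard C is needed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names; a condensed separate-chaining table. */
typedef struct node {
    char *str;            /* private copy of the stored string */
    struct node *next;    /* next element in the same slot */
} node_t;

typedef struct {
    size_t size;          /* number of slots */
    node_t **buckets;     /* one list head per slot */
} table_t;

static table_t *table_create(size_t size)
{
    table_t *t;
    if (size == 0) return NULL;
    if ((t = malloc(sizeof *t)) == NULL) return NULL;
    /* calloc zero-fills the array, so every slot starts out empty */
    if ((t->buckets = calloc(size, sizeof *t->buckets)) == NULL) {
        free(t);
        return NULL;
    }
    t->size = size;
    return t;
}

static size_t table_hash(const table_t *t, const char *str)
{
    unsigned int h = 0;
    for (; *str != '\0'; str++)
        h = (unsigned char)*str + (h << 5) - h;   /* h * 31 + character */
    return h % t->size;
}

static node_t *table_lookup(const table_t *t, const char *str)
{
    node_t *n;
    assert(t != NULL && str != NULL);
    for (n = t->buckets[table_hash(t, str)]; n != NULL; n = n->next)
        if (strcmp(n->str, str) == 0) return n;
    return NULL;
}

/* Returns 0 on success, 1 if out of memory, 2 if already stored. */
static int table_add(table_t *t, const char *str)
{
    node_t *n;
    size_t slot = table_hash(t, str);
    if (table_lookup(t, str) != NULL) return 2;
    if ((n = malloc(sizeof *n)) == NULL) return 1;
    if ((n->str = malloc(strlen(str) + 1)) == NULL) {
        free(n);
        return 1;
    }
    strcpy(n->str, str);
    n->next = t->buckets[slot];   /* insert at the head of the list */
    t->buckets[slot] = n;
    return 0;
}

static void table_free(table_t *t)
{
    size_t i;
    if (t == NULL) return;
    for (i = 0; i < t->size; i++) {
        node_t *n = t->buckets[i];
        while (n != NULL) {
            node_t *next = n->next;
            free(n->str);
            free(n);
            n = next;
        }
    }
    free(t->buckets);
    free(t);
}
```

A typical call sequence would be table_create(12), then table_add(t, "Steve"), which returns 0 the first time and 2 on a repeated call, then table_lookup(t, "Steve"), and finally table_free(t).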

4. HASHING FUNCTIONS

4.1 INTRODUCTION

A hashing function maps keys to integers, usually to get an even distribution on a smaller set of values. If the hash function is uniform, that is, if it distributes the data keys equally among the hash table indices, then hashing effectively subdivides the list to be searched. The worst-case behaviour occurs when all keys hash to the same index: we then simply have a single linked list that must be searched sequentially. Consequently, it is important to choose a good hash function.

There are four main characteristics of a good hash function:

1) The hash value is fully determined by the data being hashed.
2) The hash function uses all the input data.
3) The hash function "uniformly" distributes the data across the entire set of possible hash values.
4) The hash function generates very different hash values for similar strings.

Let's examine why each of these is important:

Rule 1: If something else besides the input data is used to determine the hash, then the hash value is not as dependent upon the input data, thus allowing for a worse distribution of the hash values.

Rule 2: If the hash function doesn't use all the input data, then slight variations in the input data would produce an inappropriate number of similar hash values, resulting in too many collisions.

Rule 3: If the hash function does not uniformly distribute the data across the entire set of possible hash values, a large number of collisions will result, cutting down on the efficiency of the hash table.

Rule 4: In real-world applications, many data sets contain very similar data elements. We would like these data elements to still be spread out over the hash table.

So let's take as an example the following hash function:

    int hash(char *str, int table_size)
    {
        int sum = 0;

        // Make sure a valid string passed in
        if (str == NULL) return -1;

        // Sum up all the characters in the string
        for ( ; *str; str++) sum += *str;

        // Return the sum mod the table size
        return sum % table_size;
    }

Which rules does it break and satisfy?

Rule 1: Satisfies. The hash value is fully determined by the data being hashed. The hash value is just the sum of all the input characters.

Rule 2: Satisfies. Every character is summed.

Rule 3: Breaks. It isn't obvious from looking at it that it fails to distribute the strings uniformly, but if you were to analyze this function for a large input you would see certain statistical properties that are bad for a hash function.

Rule 4: Breaks. Hash the string "bog". Now hash the string "gob". They're the same. Slight variations in the string should result in different hash values, but with this function they often don't.

So this hash function isn't so good.

SORTS OF HASHING FUNCTIONS

DIVISION METHOD (TABLESIZE = PRIME)

A HashValue, from 0 to (HashTableSize - 1), is computed by dividing the key value by the size of the hash table and taking the remainder. For example:

    typedef int HashIndexType;

    HashIndexType Hash(int Key)
    {
        return Key % HashTableSize;
    }

Selecting an appropriate HashTableSize is important to the success of this method. To obtain a more random scattering, HashTableSize should be a prime number not too close to a power of two.
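The advice about a prime table size can be checked with a small experiment. The sketch below is only an illustration under assumed names (div_hash and count_used_slots are hypothetical helpers, not part of the original text): it hashes the keys 0, stride, 2*stride, ... with the division method and counts how many different slots are hit.

```c
#include <assert.h>
#include <stdlib.h>

/* Division-method hash: the remainder of key divided by the table size. */
static unsigned int div_hash(unsigned int key, unsigned int table_size)
{
    assert(table_size > 0);
    return key % table_size;
}

/* Hashes the n_keys keys 0, stride, 2*stride, ... and returns the number
 * of distinct slots they occupy, or -1 if memory cannot be allocated. */
static int count_used_slots(unsigned int table_size, unsigned int stride,
                            unsigned int n_keys)
{
    unsigned char *used = calloc(table_size, 1);   /* one flag per slot */
    unsigned int i;
    int count = 0;

    if (used == NULL) return -1;
    for (i = 0; i < n_keys; i++) {
        unsigned int slot = div_hash(i * stride, table_size);
        if (!used[slot]) {
            used[slot] = 1;
            count++;
        }
    }
    free(used);
    return count;
}
```

With 13 keys that are all multiples of 16, count_used_slots(16, 16, 13) returns 1, because every key lands in slot 0 of a 16-slot table, whereas count_used_slots(13, 16, 13) returns 13: with the prime size 13 each key gets a slot of its own. Keys with a regular pattern are common in practice, which is one reason a prime size tends to scatter them better.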

4.2.2 MULTIPLICATION METHOD (TABLESIZE = 2^n)

The multiplication method may be used for a HashTableSize that is a power of 2. The Key is multiplied by a constant, and then the necessary bits are extracted to index into the table. Knuth recommends using the fractional part of the product of the key and the golden ratio, or (sqrt(5) - 1)/2, which is approximately 0.618. For example, assuming a word size of 8 bits, the golden ratio is multiplied by 2^8 to obtain 158. The product of the 8-bit key and 158 results in a 16-bit integer. For a table size of 2^5, the 5 most significant bits of the least significant word are extracted for the hash value.

One of the following sets of definitions may be used for the multiplication method, depending on the width of the index:

    /* 8-bit index */
    typedef unsigned char HashIndexType;
    static const HashIndexType K = 158;

    /* 16-bit index */
    typedef unsigned short int HashIndexType;
    static const HashIndexType K = 40503;

    /* 32-bit index */
    typedef unsigned long int HashIndexType;
    static const HashIndexType K = 2654435769;   /* 2^32 * (sqrt(5) - 1)/2, rounded down */

    /* w=bitwidth(HashIndexType), size of table=2**m */
    static const int S = w - m;
    HashIndexType HashValue = (HashIndexType)(K * Key) >> S;

For example, if HashTableSize is 1024 (2^10), then a 16-bit index is sufficient and S would be assigned a value of 16 - 10 = 6. Thus, we have:

    typedef unsigned short int HashIndexType;

    HashIndexType Hash(int Key)
    {
        static const HashIndexType K = 40503;
        static const int S = 6;

        /* multiply as unsigned values so that the product cannot overflow a signed int */
        return (HashIndexType)((unsigned int)K * (unsigned int)Key) >> S;
    }
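The 16-bit case can be written with fixed-width types so that the "least significant word" is explicit. This is a minimal sketch under assumptions: mul_hash16, WORD_BITS and TABLE_BITS are hypothetical names, the key is taken to be a 16-bit unsigned value, and the table size is 2^10 as in the example above.

```c
#include <assert.h>
#include <stdint.h>

#define WORD_BITS  16u   /* width of the hash word */
#define TABLE_BITS 10u   /* table size = 2^10 = 1024 slots */

/* Multiplication-method hash with the 16-bit constant K = 40503. */
static uint16_t mul_hash16(uint16_t key)
{
    const uint32_t K = 40503u;

    /* Multiply in 32 bits, then keep only the low 16 bits of the product. */
    uint16_t low_word = (uint16_t)((K * key) & 0xFFFFu);

    /* The TABLE_BITS most significant bits of that word are the index. */
    uint16_t index = (uint16_t)(low_word >> (WORD_BITS - TABLE_BITS));

    assert(index < (1u << TABLE_BITS));
    return index;
}
```

For the keys 1, 2 and 3 this gives the indices 632, 241 and 874: consecutive keys are scattered across the table rather than placed in neighbouring slots, which is the effect the golden-ratio constant is meant to achieve.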

VARIABLE STRING ADDITION METHOD (TABLESIZE = 256)

To hash a variable-length string, each character is added, modulo 256, to a total. A HashValue in the range 0-255 is computed:

    typedef unsigned char HashIndexType;

    HashIndexType Hash(char *str)
    {
        HashIndexType h = 0;

        /* h is 8 bits wide, so the sum wraps around modulo 256 */
        while (*str) h += *str++;

        return h;
    }

5. COLLISIONS AND SOLUTIONS

In the (ideally small) number of cases where multiple keys map to the same integer, elements with different keys may be stored in the same "slot" of the hash table. It is clear that when the hash function is used to locate a potential match, it will be necessary to compare the key of that element with the search key. Various techniques are used to manage this problem:

5.1. CHAINING

One simple scheme is to chain all collisions in lists attached to the appropriate slot. This allows an unlimited number of collisions to be handled and doesn't require a priori knowledge of how many elements are contained in the collection. The tradeoff is the same as with linked-list versus array implementations of collections: the links cost extra space and, to a lesser extent, extra time.

RE-HASHING

Re-hashing schemes use a second hashing operation when there is a collision. If there is a further collision, we re-hash until an empty "slot" in the table is found.

LINEAR PROBING

One of the simplest re-hashing functions is +1 (or -1), i.e. on a collision, look in the neighbouring slot in the table.
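The +1 re-hash can be sketched as follows. This is an illustration under assumptions rather than code from the original text: probe_table_t, probe_hash, probe_insert, probe_find and the capacity PROBE_TABLE_SIZE are hypothetical, the table stores the caller's string pointers without copying them, and deletion is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PROBE_TABLE_SIZE 8u   /* hypothetical fixed capacity */

typedef struct {
    const char *slots[PROBE_TABLE_SIZE];   /* NULL marks an empty slot */
} probe_table_t;

/* Simple additive hash, reduced to a slot index. */
static unsigned int probe_hash(const char *str)
{
    unsigned int sum = 0;
    for (; *str != '\0'; str++) sum += (unsigned char)*str;
    return sum % PROBE_TABLE_SIZE;
}

/* Stores str and returns the slot used, or -1 if the table is full. */
static int probe_insert(probe_table_t *t, const char *str)
{
    unsigned int home, i;
    assert(t != NULL && str != NULL);
    home = probe_hash(str);
    for (i = 0; i < PROBE_TABLE_SIZE; i++) {
        unsigned int slot = (home + i) % PROBE_TABLE_SIZE;   /* +1 re-hash */
        if (t->slots[slot] == NULL) {
            t->slots[slot] = str;
            return (int)slot;
        }
        if (strcmp(t->slots[slot], str) == 0) return (int)slot;   /* already stored */
    }
    return -1;
}

/* Returns the slot holding str, or -1 if it is not in the table. */
static int probe_find(const probe_table_t *t, const char *str)
{
    unsigned int home, i;
    assert(t != NULL && str != NULL);
    home = probe_hash(str);
    for (i = 0; i < PROBE_TABLE_SIZE; i++) {
        unsigned int slot = (home + i) % PROBE_TABLE_SIZE;
        if (t->slots[slot] == NULL) return -1;   /* an empty slot ends the search */
        if (strcmp(t->slots[slot], str) == 0) return (int)slot;
    }
    return -1;
}
```

With an empty table, inserting "bog" uses slot 0 and inserting "gob", which has the same character sum, collides there and ends up in the neighbouring slot 1. Unlike chaining, the table can fill up, so its capacity has to be chosen in advance.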

[Figure: demonstration of the effect of linear probing, with a quadratic re-hash function for comparison]

In the figure, h(j) = h(k), so the next hash function, h1, is used. A second collision then occurs, so h2 is used.

OVERFLOW AREA

Another scheme divides the pre-allocated table into two sections: the primary area, to which keys are mapped, and an area for collisions, normally termed the overflow area. When a collision occurs, a slot in the overflow area is used for the new element and a link from the primary slot is established, as in a chained system. This is essentially the same as chaining, except that the overflow area is pre-allocated and thus possibly faster to access. As with re-hashing, the maximum number of elements must be known in advance, but in this case two parameters must be estimated: the optimum sizes of the primary and overflow areas.
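A pre-allocated overflow area might look like the sketch below. All names (ov_slot_t, ov_table_t, ov_init, ov_hash, ov_insert, ov_contains) and both area sizes are assumptions made for the illustration; links are stored as indices into the overflow array, and stored strings are not copied.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PRIMARY_SIZE  8    /* hypothetical size of the primary area */
#define OVERFLOW_SIZE 4    /* hypothetical size of the overflow area */
#define NO_LINK (-1)

typedef struct {
    const char *key;   /* NULL marks an unused slot */
    int next;          /* index into the overflow area, or NO_LINK */
} ov_slot_t;

typedef struct {
    ov_slot_t primary[PRIMARY_SIZE];
    ov_slot_t overflow[OVERFLOW_SIZE];
    int overflow_used;   /* overflow slots handed out so far */
} ov_table_t;

static void ov_init(ov_table_t *t)
{
    int i;
    assert(t != NULL);
    for (i = 0; i < PRIMARY_SIZE; i++) {
        t->primary[i].key = NULL;
        t->primary[i].next = NO_LINK;
    }
    for (i = 0; i < OVERFLOW_SIZE; i++) {
        t->overflow[i].key = NULL;
        t->overflow[i].next = NO_LINK;
    }
    t->overflow_used = 0;
}

/* Additive hash that maps a key to a slot of the primary area. */
static unsigned int ov_hash(const char *str)
{
    unsigned int sum = 0;
    for (; *str != '\0'; str++) sum += (unsigned char)*str;
    return sum % PRIMARY_SIZE;
}

/* Returns 0 on success, 1 if key is already stored, 2 if the overflow
 * area is full. */
static int ov_insert(ov_table_t *t, const char *key)
{
    ov_slot_t *slot = &t->primary[ov_hash(key)];
    int idx;

    if (slot->key == NULL) {          /* no collision: use the primary slot */
        slot->key = key;
        return 0;
    }
    for (;;) {                        /* walk to the end of the chain */
        if (strcmp(slot->key, key) == 0) return 1;
        if (slot->next == NO_LINK) break;
        slot = &t->overflow[slot->next];
    }
    if (t->overflow_used == OVERFLOW_SIZE) return 2;
    idx = t->overflow_used++;         /* take the next free overflow slot */
    t->overflow[idx].key = key;
    t->overflow[idx].next = NO_LINK;
    slot->next = idx;                 /* link it from the end of the chain */
    return 0;
}

/* Returns 1 if key is stored in the table, 0 otherwise. */
static int ov_contains(const ov_table_t *t, const char *key)
{
    const ov_slot_t *slot = &t->primary[ov_hash(key)];
    if (slot->key == NULL) return 0;
    for (;;) {
        if (strcmp(slot->key, key) == 0) return 1;
        if (slot->next == NO_LINK) return 0;
        slot = &t->overflow[slot->next];
    }
}
```

Both area sizes are fixed at compile time here, which mirrors the point made above: the sizes of the primary and overflow areas have to be estimated in advance, and ov_insert reports failure once the overflow area is exhausted.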

6. EXAMPLE

Let's take a simple example. First, we start with a hash table array of strings. Let's say the hash table size is 12.

Next we need a hash function. Let's assume a simple hash function that takes a string as input. The returned hash value will be the sum of the ASCII codes of the characters that make up the string, mod the size of the table:

    int hash(char *str, int table_size)
    {
        int sum = 0;

        /* Make sure a valid string passed in */
        if (str == NULL) return -1;

        /* Sum up all the characters in the string */
        for ( ; *str; str++) sum += *str;

        /* Return the sum mod the table size */
        return sum % table_size;
    }

Now that we have a framework in place, let's try using it. First, let's store a string into the table: "Steve". We run "Steve" through the hash function and find that hash("Steve", 12) yields 3, so the string is stored at index 3.
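The value for "Steve" can be checked by hand: the ASCII codes are 83 + 116 + 101 + 118 + 101 = 519, and 519 mod 12 = 3. The sketch below repeats the additive hash under a hypothetical name, sum_hash, with the sum explicitly initialised to 0, so that the figure can be reproduced.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of the character codes of str, reduced modulo table_size.
 * Returns -1 if str is NULL. */
static int sum_hash(const char *str, int table_size)
{
    int sum = 0;

    assert(table_size > 0);
    if (str == NULL) return -1;

    for (; *str != '\0'; str++) sum += (unsigned char)*str;

    return sum % table_size;
}
```

Note that the result depends on capitalisation: sum_hash("Steve", 12) is 3, whereas the all-lowercase "steve" sums to 551 and hashes to 11.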

Let's try another string: "Spark". For the purposes of this illustration, suppose that running the string through the hash function gives hash("Spark", 12) = 6. We insert it into the hash table at index 6.

Let's try another: "Notes". Suppose that running "Notes" through the hash function gives hash("Notes", 12) = 3, the same index as "Steve". We try to insert it into the hash table at index 3.

We can see that a hash function doesn't guarantee that every input will map to a different output, so there is always the chance that two inputs will hash to the same output. This indicates that both elements should be inserted at the same place in the array, and this is impossible. As explained before, this phenomenon is known as a collision.

To resolve the collision, the separate chaining method is used, which requires a slight modification to the data structure. Instead of storing the data elements directly in the array, they are stored in linked lists. Each slot in the array then points to one of these linked lists. When an element hashes to a value, it is added to the linked list at that index in the array.

Let's look at the above example again, this time with our modified data structure. Again, let's try adding "Steve", which hashes to 3. It becomes the first element of the linked list at index 3.

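And "Spark", which hashes to 6, goes into the linked list at index 6. Now we add "Notes", which hashes to 3, just like "Steve". It is simply added to the same linked list at index 3, so both strings can be stored despite the collision.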