## hash objects
* hash tables (also known as hash maps) are data structure that organizes data using hash functions, in order to support quick insertion and search (map keys to buckets).
---
### collision
* ideally, a perfect hash function will be a 1-1 mapping between the key and the buckets. however, in most cases, a hash function is not perfect and there is a tradeoff between the amount of buckets and the capacity of a bucket. if the hash function is not a 1-1 mapping, collisions must be handled:
- how to organize the values in the same bucket?
- what if too many values are assigned to the same bucket?
- how to search for a target value in a specific bucket?
* popular collision techniques are:
- separate chaining: a linked list is used for each value, so that it stores all the collided items.
- open addressing: all entry records are stored in the bucket array itself. when a new entry has to be inserted, the buckets are examined, starting with the hashed-to slot and proceeding in some probe sequence, until an unoccupied slot is found.
----
### keys
* when the order of each element in the string/array doesn't matter, you can use the sorted string/array to calculate a key.
* if you only care about the offset of each value, you can use it as the key.
* in a tree, you might want to directly use the `Node()` class as key sometimes, but in general, the serialization of the subtree might be a better idea.
* in a matrix, you might want to use the row index or the column index as key. sometimes you might want to aggregate the values in the same diagonal line.
---
### implementing a hash set
* the difference between a hash set and a hash map is that the set can never have repeated elements.
* to implement a hash set data structure, you need to implement:
- a hash function (to assign an address to store a given value), and
- a collision handling (since the nature of a hash function is to map a value from a space A to a corresponding smaller space B).
* overall, there are several strategies to resolve the collisions:
- separate chaining: for value with the same hash key, we keep them in a bucket, and each bucket is independent of each other.
- open addressing: whenever there is a collision, we keep on probing the main space with certain strategy until a free slot is found.
- 2-choices hashing: we use two hash functions rather than one, and pick the generated address with fewer collisions.
* if we were to implement separate chaining, the primary storage underneath a hashset is a continuous memory as array, where each element in this array corresponds to a bucket that stores the actual values.
* given a value, first we generate a key for the value via the hash function (the generated key serves as the index to locate the bucket).
* once the bucket is located, we then perform the desired operation on the bucket (such as add, remove, and contain).
* use a prime number as the base of the module to reduce collisions.
```python
class HashSet:
def __init__(self, size):
self.size = size
self.bucket = [Bucket() for _ in range(self.size)]
def _get_hash_key(self, key):
return key % self.size
def add(self, element: int):
bucket_index = self._get_hash_key(element)
self.bucket[bucket_index].insert(element)
def remove(self, element: int)L
bucket_index = self._get_hash_key(element)
self.bucket[bucket_index].delete(element)
def contains(self, element: int:
bucket_index = self._get_hash_key(element)
return self.bucket[bucket_index].exists(element)
````
#### buckets as linked lists
* a good choice for buckets are linked lists, as their time complexity for `insert` and `delete` is constant (once the position to be updated is located).
* you just need to be sure you never insert repeated elements.
* time complexity for search is `O(N/K)` where `N` is the number of all possible values and `K` is the number of predefined buckets (the average size of bucket is `N/K`).
* space complexity is `O(K+M)`, where `K` is the number of predefined buckets, and `M` is the number of unique values that have been inserted in the HashSet.
* lastly, to optimize search, we could maintain the buckets as sorted lists (and obtain `O(log(N))` time complexity for the lookup operation).
* however, `insert` and `delete` are linear time (as elements would need to be shifted).
```python
class Node:
def __init__(self, value=None, next=None):
self.value = value
self.next = next
class Bucket:
def __init__(self):
self.head = Node(0)
def insert(self, value):
if not self.exists(value):
self.head.next = Node(value, self.head.next)
else:
print(f'node {value} already exists')
def delete(self, value):
prev = self.head
current = self.head.next
while current is not None:
if current.value == value:
prev.next = current.next
return True
prev = current
current = current.next
return False
def exists(self, value):
current = self.head.next
while current is not None:
if current.value == value:
return True
current = current.next
return False
```
#### buckets as binary search trees
* another option for a bucket is a binary search tree, with `O(log(N))` time complexity for search, insert, and delete. in addition, bst can not hold repeated elements, just like sets.
* time complexity for search is `O(log(N/K)`, where `N` is the number of all possible values and `K` is the number of predefined buckets.
* space complexity is `O(K+M)` where `K` is the number of predefined buckets, and `M` is the number of unique values in the hash set.
```python
class Node:
def __init__(self, value=None):
self.val = value
self.left = None
self.right = None
class Bucket:
def __init__(self):
self.tree = BSTree()
def insert(self, value):
self.tree.root = self.tree.insert(self.tree.root, value)
def delete(self, value):
self.tree.root = self.tree.delete(self.tree.root, value)
def exists(self, value):
return (self.tree.search(self.tree.root, value) is not None)
class BSTree():
def __init__(self):
self.root = None
def search(self, root, value) -> Node:
if root is None or value == root.val:
return root
return self.search(root.left, value) if val < root.value \
else self.search(root.right, value)
def insert(self, root, value) -> Node:
if not root:
return Node(value)
if value > root.val:
root.right = self.insert(root.right, value)
elif value == root.val:
return root
else:
root.left = self.insert(root.left, value)
def successor(self, root):
# one step right and then all left
root = root.right
while root.left:
root = root.left
return root.value
def predecessor(self, root):
# one step left and then always right
root = root.left
while root.right:
root = root.right
return root.value
def delete(self, root, key) -> Node:
if not root:
return None
if key > root.val:
root.right = self.delete(root.right, key)
elif key < root.val:
root.left = self.delete(root.left, key)
else:
if not (root.left or root.right):
root = None
elif root.right:
root.val = self.sucessor(root)
root.right = self.delete(root.right, root.val)
else:
root.val = self.predecessor(root)
root.left = self.delete(root.left, root.val)
return root
```
----
### implementing a hash map
* same as before, we need to tackle two main issues: hash function design and collision handling.
* a good approach is using a module function with an array or linked list. at this time, there is no constraint for repeated numbers.
```python
class Bucket:
def __init__(self):
self.bucket = []
def get(self, key):
for (k, v) in self.bucket:
if k == key:
return v
return -1
def put(self, key, value):
found = False
for i, k in enumerate(self.bucket):
if key == k[0]:
self.bucket[i] = (key, value)
found = True
break
if not found:
self.bucket.append((key, value))
def remove(self, key):
for i, k in enumerate(self.bucket):
if key == k[0]:
del self.bucket[i]
```
* del is an `O(N)` operation, as we would copy all the `i` elements. to make it `O(1)` we could swap the element we want to remove with the last element in the bucket.