## hash objects
* hash tables (also known as hash maps) are data structures that organize data using a hash function, mapping keys to buckets in order to support quick insertion and search.
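a minimal sketch of the key-to-bucket mapping behind a hash table, using python's built-in `hash()` and the modulo operator (the function name and bucket count are illustrative):

```python
# a minimal sketch of the key-to-bucket mapping behind a hash table
# (the function name and bucket count are illustrative)
def bucket_index(key, num_buckets=8):
    # hash() maps the key to an integer; modulo folds it into a bucket slot
    return hash(key) % num_buckets

# small integers hash to themselves in cpython, so 42 lands in slot 42 % 8
idx = bucket_index(42)
```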


---

### collision
* ideally, a perfect hash function would be a 1-1 mapping between keys and buckets. in most cases, however, the hash function is not perfect and there is a tradeoff between the number of buckets and the capacity of each bucket. if the hash function is not a 1-1 mapping, collisions must be handled:
  - how to organize the values in the same bucket?
  - what if too many values are assigned to the same bucket?
  - how to search for a target value in a specific bucket?
* popular collision-handling techniques are:
  - separate chaining: a linked list is kept for each bucket, storing all the collided items.
  - open addressing: all entry records are stored in the bucket array itself. when a new entry has to be inserted, the buckets are examined, starting with the hashed-to slot and proceeding in some probe sequence, until an unoccupied slot is found.
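open addressing with linear probing (the simplest probe sequence) could be sketched like this; the `LinearProbingSet` class is illustrative, and a real table would also resize itself and handle deletions:

```python
# a minimal sketch of open addressing with linear probing
# (class name is illustrative; assumes the table never fills up)
class LinearProbingSet:
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.slots = [None] * capacity  # entries live in the array itself

    def _probe(self, value):
        # start at the hashed-to slot and step forward until we find
        # either the value or an unoccupied slot
        i = value % self.capacity
        while self.slots[i] is not None and self.slots[i] != value:
            i = (i + 1) % self.capacity
        return i

    def add(self, value):
        self.slots[self._probe(value)] = value

    def contains(self, value):
        return self.slots[self._probe(value)] == value
```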
---

### keys
* when the order of the elements in a string/array doesn't matter, you can use the sorted string/array to calculate a key.
* if you only care about the offset of each value, you can use the offset as the key.
* in a tree, you might want to directly use the `Node()` object as the key sometimes, but in general, a serialization of the subtree might be a better idea.
* in a matrix, you might want to use the row index or the column index as the key. sometimes you might want to aggregate the values on the same diagonal line.
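the first and last ideas above can be sketched as follows (`group_anagrams` and `group_diagonals` are illustrative names):

```python
from collections import defaultdict

def group_anagrams(words):
    # when character order doesn't matter, the sorted string works as a key
    groups = defaultdict(list)
    for w in words:
        groups[''.join(sorted(w))].append(w)
    return list(groups.values())

def group_diagonals(matrix):
    # cells on the same diagonal share the same row - column offset,
    # so that offset works as a key to aggregate a diagonal line
    diagonals = defaultdict(list)
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            diagonals[r - c].append(value)
    return diagonals
```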
---

### implementing a hash set
* the difference between a hash set and a hash map is that a set can never hold repeated elements.
* to implement a hash set data structure, you need to implement:
  - a hash function (to assign an address at which to store a given value), and
  - collision handling (since the nature of a hash function is to map values from a space A to a corresponding smaller space B).
* overall, there are several strategies to resolve collisions:
  - separate chaining: values with the same hash key are kept in a bucket, and each bucket is independent of the others.
  - open addressing: whenever there is a collision, we keep probing the main space with a certain strategy until a free slot is found.
  - 2-choice hashing: we use two hash functions rather than one, and pick the generated address with fewer collisions.
* if we implement separate chaining, the primary storage underneath a hash set is a contiguous array, where each element corresponds to a bucket that stores the actual values.
* given a value, we first generate a key for it via the hash function (the generated key serves as the index to locate the bucket).
* once the bucket is located, we perform the desired operation on it (such as add, remove, and contains).
* use a prime number as the modulus to reduce collisions.
```python
class HashSet:
    def __init__(self, size):
        self.size = size
        self.bucket = [Bucket() for _ in range(self.size)]

    def _get_hash_key(self, key):
        return key % self.size

    def add(self, element: int):
        bucket_index = self._get_hash_key(element)
        self.bucket[bucket_index].insert(element)

    def remove(self, element: int):
        bucket_index = self._get_hash_key(element)
        self.bucket[bucket_index].delete(element)

    def contains(self, element: int):
        bucket_index = self._get_hash_key(element)
        return self.bucket[bucket_index].exists(element)
```
#### buckets as linked lists
* a good choice for buckets is a linked list, whose time complexity for `insert` and `delete` is constant (once the position to be updated is located).
* you just need to be sure you never insert repeated elements.
* time complexity for search is `O(N/K)`, where `N` is the number of all possible values and `K` is the number of predefined buckets (the average bucket size is `N/K`).
* space complexity is `O(K+M)`, where `K` is the number of predefined buckets, and `M` is the number of unique values that have been inserted into the hash set.
* lastly, to optimize search, we could maintain each bucket as a sorted list (and obtain `O(log(N/K))` time complexity for the lookup operation).
* however, `insert` and `delete` then become linear time (as elements need to be shifted).
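the sorted-list variant could be sketched with python's `bisect` module (the `SortedBucket` class is illustrative):

```python
import bisect

# a sketch of a sorted-list bucket: logarithmic lookup via binary search,
# but linear-time insert/delete because elements must shift
class SortedBucket:
    def __init__(self):
        self.items = []  # kept in sorted order

    def exists(self, value):
        i = bisect.bisect_left(self.items, value)  # binary search
        return i < len(self.items) and self.items[i] == value

    def insert(self, value):
        if not self.exists(value):  # no repeated elements in a set
            bisect.insort(self.items, value)  # shifts elements: O(n)

    def delete(self, value):
        if self.exists(value):
            self.items.remove(value)  # also O(n)
```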
```python
class Node:
    def __init__(self, value=None, next=None):
        self.value = value
        self.next = next


class Bucket:
    def __init__(self):
        self.head = Node(0)  # sentinel head node

    def insert(self, value):
        if not self.exists(value):
            self.head.next = Node(value, self.head.next)
        else:
            print(f'node {value} already exists')

    def delete(self, value):
        prev, current = self.head, self.head.next
        while current is not None:
            if current.value == value:
                prev.next = current.next
                return True
            prev, current = current, current.next
        return False

    def exists(self, value):
        current = self.head.next
        while current is not None:
            if current.value == value:
                return True
            current = current.next
        return False
```
#### buckets as binary search trees
* another option for a bucket is a binary search tree, with `O(log(N))` average time complexity for search, insert, and delete. in addition, a bst cannot hold repeated elements, just like a set.
* time complexity for search is `O(log(N/K))`, where `N` is the number of all possible values and `K` is the number of predefined buckets.
* space complexity is `O(K+M)`, where `K` is the number of predefined buckets, and `M` is the number of unique values in the hash set.
```python
class Node:
    def __init__(self, value=None):
        self.val = value
        self.left = None
        self.right = None


class Bucket:
    def __init__(self):
        self.tree = BSTree()

    def insert(self, value):
        self.tree.root = self.tree.insert(self.tree.root, value)

    def delete(self, value):
        self.tree.root = self.tree.delete(self.tree.root, value)

    def exists(self, value):
        return self.tree.search(self.tree.root, value) is not None


class BSTree:
    def __init__(self):
        self.root = None

    def search(self, root, value) -> Node:
        if root is None or value == root.val:
            return root
        return self.search(root.left, value) if value < root.val \
            else self.search(root.right, value)

    def insert(self, root, value) -> Node:
        if not root:
            return Node(value)
        if value > root.val:
            root.right = self.insert(root.right, value)
        elif value < root.val:
            root.left = self.insert(root.left, value)
        return root  # value == root.val: duplicate, nothing to insert

    def successor(self, root):
        # one step right and then all the way left
        root = root.right
        while root.left:
            root = root.left
        return root.val

    def predecessor(self, root):
        # one step left and then all the way right
        root = root.left
        while root.right:
            root = root.right
        return root.val

    def delete(self, root, key) -> Node:
        if not root:
            return None
        if key > root.val:
            root.right = self.delete(root.right, key)
        elif key < root.val:
            root.left = self.delete(root.left, key)
        else:
            if not (root.left or root.right):
                root = None
            elif root.right:
                root.val = self.successor(root)
                root.right = self.delete(root.right, root.val)
            else:
                root.val = self.predecessor(root)
                root.left = self.delete(root.left, root.val)
        return root
```
---

### implementing a hash map
* same as before, we need to tackle two main issues: hash function design and collision handling.
* a good approach is a modulo hash function with an array of buckets, each holding a list of `(key, value)` pairs. this time, there is no constraint on repeated values.
```python
class Bucket:
    def __init__(self):
        self.bucket = []  # list of (key, value) pairs

    def get(self, key):
        for (k, v) in self.bucket:
            if k == key:
                return v
        return -1

    def put(self, key, value):
        for i, (k, _) in enumerate(self.bucket):
            if k == key:
                self.bucket[i] = (key, value)  # update existing key
                return
        self.bucket.append((key, value))

    def remove(self, key):
        for i, (k, _) in enumerate(self.bucket):
            if k == key:
                del self.bucket[i]
                return
```
* `del` is an `O(N)` operation, since all elements after index `i` are shifted down by one. to make the deletion step `O(1)`, we could swap the element we want to remove with the last element in the bucket and pop from the end.
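a sketch of that swap-with-last trick (`remove_swap` is an illustrative helper; the linear scan to find the key remains `O(N)`, only the deletion step itself becomes `O(1)`):

```python
# swap-with-last removal from a list-backed bucket of (key, value) pairs
# (helper name is illustrative; order inside the bucket is not preserved)
def remove_swap(bucket, key):
    for i, (k, _) in enumerate(bucket):
        if k == key:
            bucket[i] = bucket[-1]  # overwrite the target with the last entry
            bucket.pop()            # popping from the tail is O(1)
            return True
    return False
```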