mirror of
https://github.com/autistic-symposium/master-algorithms-py.git
synced 2025-04-30 12:46:11 -04:00
299 lines
9.3 KiB
Markdown
299 lines
9.3 KiB
Markdown
## hash objects
|
|
|
|
|
|
<br>
|
|
|
|
* hash tables (also known as hash maps) are data structure that organizes data using hash functions, in order to support quick insertion and search (map keys to buckets).
|
|
|
|
<br>
|
|
|
|
|
|
|
|
<p align="center">
|
|
<img src="https://github.com/go-outside-labs/master-python-with-algorithms-py/assets/138340846/aa798e45-d53b-45b9-9f95-0e508eb923d7" width="60%" align="center" style="padding:1px;border:1px solid black;">
|
|
</p>
|
|
|
|
|
|
<br>
|
|
|
|
---
|
|
|
|
### collision
|
|
|
|
<br>
|
|
|
|
* ideally, a perfect hash function will be a 1-1 mapping between the key and the buckets. however, in most cases, a hash function is not perfect and there is a tradeoff between the amount of buckets and the capacity of a bucket. if the hash function is not a 1-1 mapping, collisions must be handled:
|
|
- how to organize the values in the same bucket?
|
|
- what if too many values are assigned to the same bucket?
|
|
- how to search for a target value in a specific bucket?
|
|
|
|
* popular collision techniques are:
|
|
- separate chaining: a linked list is used for each value, so that it stores all the collided items.
|
|
- open addressing: all entry records are stored in the bucket array itself. when a new entry has to be inserted, the buckets are examined, starting with the hashed-to slot and proceeding in some probe sequence, until an unoccupied slot is found.
|
|
|
|
|
|
<br>
|
|
|
|
----
|
|
|
|
### keys
|
|
|
|
<br>
|
|
|
|
* when the order of each element in the string/array doesn't matter, you can use the sorted string/array to calculate a key.
|
|
|
|
* if you only care about the offset of each value, you can use it as the key.
|
|
|
|
* in a tree, you might want to directly use the `Node()` class as key sometimes, but in general, the serialization of the subtree might be a better idea.
|
|
|
|
* in a matrix, you might want to use the row index or the column index as key. sometimes you might want to aggregate the values in the same diagonal line.
|
|
|
|
<br>
|
|
|
|
---
|
|
|
|
### implementing a hash set
|
|
|
|
<br>
|
|
|
|
* the difference between a hash set and a hash map is that the set can never have repeated elements.
|
|
|
|
* to implement a hash set data structure, you need to implement:
|
|
- a hash function (to assign an address to store a given value), and
|
|
- a collision handling (since the nature of a hash function is to map a value from a space A to a corresponding smaller space B).
|
|
|
|
|
|
* overall, there are several strategies to resolve the collisions:
|
|
- separate chaining: for value with the same hash key, we keep them in a bucket, and each bucket is independent of each other.
|
|
- open addressing: whenever there is a collision, we keep on probing the main space with certain strategy until a free slot is found.
|
|
- 2-choices hashing: we use two hash functions rather than one, and pick the generated address with fewer collisions.
|
|
|
|
|
|
* if we were to implement separate chaining, the primary storage underneath a hashset is a continuous memory as array, where each element in this array corresponds to a bucket that stores the actual values.
|
|
* given a value, first we generate a key for the value via the hash function (the generated key serves as the index to locate the bucket).
|
|
* once the bucket is located, we then perform the desired operation on the bucket (such as add, remove, and contain).
|
|
* use a prime number as the base of the module to reduce collisions.
|
|
|
|
<br>
|
|
|
|
```python
|
|
class HashSet:
|
|
|
|
def __init__(self, size):
|
|
self.size = size
|
|
self.bucket = [Bucket() for _ in range(self.size)]
|
|
|
|
def _get_hash_key(self, key):
|
|
return key % self.size
|
|
|
|
def add(self, element: int):
|
|
bucket_index = self._get_hash_key(element)
|
|
self.bucket[bucket_index].insert(element)
|
|
|
|
def remove(self, element: int)L
|
|
bucket_index = self._get_hash_key(element)
|
|
self.bucket[bucket_index].delete(element)
|
|
|
|
def contains(self, element: int:
|
|
bucket_index = self._get_hash_key(element)
|
|
return self.bucket[bucket_index].exists(element)
|
|
````
|
|
|
|
<br>
|
|
|
|
|
|
#### buckets as linked lists
|
|
|
|
<br>
|
|
|
|
* a good choice for buckets are linked lists, as their time complexity for `insert` and `delete` is constant (once the position to be updated is located).
|
|
* you just need to be sure you never insert repeated elements.
|
|
|
|
* time complexity for search is `O(N/K)` where `N` is the number of all possible values and `K` is the number of predefined buckets (the average size of bucket is `N/K`).
|
|
|
|
* space complexity is `O(K+M)`, where `K` is the number of predefined buckets, and `M` is the number of unique values that have been inserted in the HashSet.
|
|
|
|
* lastly, to optimize search, we could maintain the buckets as sorted lists (and obtain `O(log(N))` time complexity for the lookup operation).
|
|
* however, `insert` and `delete` are linear time (as elements would need to be shifted).
|
|
|
|
<br>
|
|
|
|
```python
|
|
class Node:
|
|
def __init__(self, value=None, next=None):
|
|
self.value = value
|
|
self.next = next
|
|
|
|
|
|
class Bucket:
|
|
def __init__(self):
|
|
self.head = Node(0)
|
|
|
|
def insert(self, value):
|
|
if not self.exists(value):
|
|
self.head.next = Node(value, self.head.next)
|
|
else:
|
|
print(f'node {value} already exists')
|
|
|
|
def delete(self, value):
|
|
prev = self.head
|
|
current = self.head.next
|
|
while current is not None:
|
|
if current.value == value:
|
|
prev.next = current.next
|
|
return True
|
|
prev = current
|
|
current = current.next
|
|
return False
|
|
|
|
def exists(self, value):
|
|
current = self.head.next
|
|
while current is not None:
|
|
if current.value == value:
|
|
return True
|
|
current = current.next
|
|
return False
|
|
```
|
|
|
|
|
|
<br>
|
|
|
|
#### buckets as binary search trees
|
|
|
|
<br>
|
|
|
|
* another option for a bucket is a binary search tree, with `O(log(N))` time complexity for search, insert, and delete. in addition, bst can not hold repeated elements, just like sets.
|
|
|
|
* time complexity for search is `O(log(N/K)`, where `N` is the number of all possible values and `K` is the number of predefined buckets.
|
|
|
|
* space complexity is `O(K+M)` where `K` is the number of predefined buckets, and `M` is the number of unique values in the hash set.
|
|
|
|
<br>
|
|
|
|
```python
|
|
class Node:
|
|
def __init__(self, value=None):
|
|
self.val = value
|
|
self.left = None
|
|
self.right = None
|
|
|
|
|
|
class Bucket:
|
|
def __init__(self):
|
|
self.tree = BSTree()
|
|
|
|
def insert(self, value):
|
|
self.tree.root = self.tree.insert(self.tree.root, value)
|
|
|
|
def delete(self, value):
|
|
self.tree.root = self.tree.delete(self.tree.root, value)
|
|
|
|
def exists(self, value):
|
|
return (self.tree.search(self.tree.root, value) is not None)
|
|
|
|
|
|
class BSTree():
|
|
def __init__(self):
|
|
self.root = None
|
|
|
|
def search(self, root, value) -> Node:
|
|
if root is None or value == root.val:
|
|
return root
|
|
|
|
return self.search(root.left, value) if val < root.value \
|
|
else self.search(root.right, value)
|
|
|
|
def insert(self, root, value) -> Node:
|
|
if not root:
|
|
return Node(value)
|
|
|
|
if value > root.val:
|
|
root.right = self.insert(root.right, value)
|
|
elif value == root.val:
|
|
return root
|
|
else:
|
|
root.left = self.insert(root.left, value)
|
|
|
|
def successor(self, root):
|
|
# one step right and then all left
|
|
root = root.right
|
|
while root.left:
|
|
root = root.left
|
|
return root.value
|
|
|
|
def predecessor(self, root):
|
|
# one step left and then always right
|
|
root = root.left
|
|
while root.right:
|
|
root = root.right
|
|
return root.value
|
|
|
|
def delete(self, root, key) -> Node:
|
|
if not root:
|
|
return None
|
|
|
|
if key > root.val:
|
|
root.right = self.delete(root.right, key)
|
|
|
|
elif key < root.val:
|
|
root.left = self.delete(root.left, key)
|
|
|
|
else:
|
|
if not (root.left or root.right):
|
|
root = None
|
|
elif root.right:
|
|
root.val = self.sucessor(root)
|
|
root.right = self.delete(root.right, root.val)
|
|
else:
|
|
root.val = self.predecessor(root)
|
|
root.left = self.delete(root.left, root.val)
|
|
|
|
return root
|
|
```
|
|
|
|
<br>
|
|
|
|
----
|
|
|
|
### implementing a hash map
|
|
|
|
<br>
|
|
|
|
* same as before, we need to tackle two main issues: hash function design and collision handling.
|
|
|
|
* a good approach is using a module function with an array or linked list. at this time, there is no constraint for repeated numbers.
|
|
|
|
<br>
|
|
|
|
```python
|
|
class Bucket:
|
|
|
|
def __init__(self):
|
|
self.bucket = []
|
|
|
|
def get(self, key):
|
|
for (k, v) in self.bucket:
|
|
if k == key:
|
|
return v
|
|
return -1
|
|
|
|
def put(self, key, value):
|
|
found = False
|
|
for i, k in enumerate(self.bucket):
|
|
if key == k[0]:
|
|
self.bucket[i] = (key, value)
|
|
found = True
|
|
break
|
|
if not found:
|
|
self.bucket.append((key, value))
|
|
|
|
def remove(self, key):
|
|
for i, k in enumerate(self.bucket):
|
|
if key == k[0]:
|
|
del self.bucket[i]
|
|
```
|
|
<br>
|
|
|
|
* del is an `O(N)` operation, as we would copy all the `i` elements. to make it `O(1)` we could swap the element we want to remove with the last element in the bucket.
|
|
|