mirror of
https://github.com/GrapheneOS/hardened_malloc.git
synced 2024-12-19 12:54:33 -05:00
add documentation on scalability design choices
parent
41df5005e8
commit
e4061899aa
101
README.md
@ -404,6 +404,107 @@ size for 2048 byte spacing and the next spacing class matches the page size of
classes required to avoid substantial waste from rounding. Further slab
allocation size classes may be offered as an option in the future.

## Scalability

## Small (slab) allocations

As a baseline form of fine-grained locking, the slab allocator has entirely
separate allocators for each size class. Each size class has a dedicated lock,
CSPRNG and other state.

The slab allocator's scalability will primarily come from dividing up the slab
allocation region into separate arenas assigned to threads. The arenas will
essentially be entirely separate slab allocators with the same sub-regions for
each size class. Having 4 arenas will simply require reserving a region 4 times
as large and choosing the correct metadata based on address, similar to how
finding the slab and slot index within the slab already works. The part that's
still open to different design choices is how arenas are assigned to threads.
One approach is static assignment, either round-robin like the standard
jemalloc implementation or to a randomly chosen arena. Another option is
dynamic load balancing via a heuristic like `sched_getcpu` for per-CPU arenas,
which would offer better performance than randomly choosing an arena each time
while being more predictable for an attacker. There are some security benefits
from this assignment being completely static, since it isolates threads from
each other. Static assignment can also reduce memory usage since threads may
have varying usage of size classes.
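
As a rough illustration of choosing metadata by address and of static
round-robin assignment, here is a minimal sketch. The constants (`N_ARENA`,
`N_SIZE_CLASSES`, `CLASS_REGION_SIZE`) and the helpers `locate` and
`get_thread_arena` are hypothetical names made up for this example, not taken
from the allocator's source, and a 64-bit address space is assumed.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical layout constants, chosen only for illustration.
#define N_ARENA 4
#define N_SIZE_CLASSES 49
#define CLASS_REGION_SIZE ((uint64_t)1 << 36) // bytes reserved per size class
#define ARENA_SIZE (N_SIZE_CLASSES * CLASS_REGION_SIZE)

struct slab_location {
    unsigned arena;
    unsigned size_class;
};

// Recover the arena and size class purely from the offset of a pointer into
// the reserved slab region, without reading any inline metadata.
static struct slab_location locate(uintptr_t region_start, uintptr_t p) {
    uint64_t offset = (uint64_t)(p - region_start);
    struct slab_location loc = {
        .arena = (unsigned)(offset / ARENA_SIZE),
        .size_class = (unsigned)((offset % ARENA_SIZE) / CLASS_REGION_SIZE),
    };
    return loc;
}

// Static round-robin assignment: each thread picks an arena once, on first
// use, and keeps it for its lifetime.
static atomic_uint next_arena;
static _Thread_local unsigned thread_arena = N_ARENA; // N_ARENA = unassigned

static unsigned get_thread_arena(void) {
    if (thread_arena == N_ARENA) {
        thread_arena = atomic_fetch_add_explicit(&next_arena, 1,
                                                 memory_order_relaxed) % N_ARENA;
    }
    return thread_arena;
}
```

With this layout, adding arenas only changes the size of the reservation and
adds one more division to the lookup. A per-CPU variant would replace the body
of `get_thread_arena` with something based on `sched_getcpu`, at the cost of
the isolation that static assignment provides.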

When there's substantial allocation or deallocation pressure, the allocator
does end up calling into the kernel to purge / protect unused slabs by
replacing them with fresh `PROT_NONE` regions, and to unprotect slabs when
partially filled and cached empty slabs are depleted. The number of cached
empty slabs will be configurable, but it's not entirely a performance vs.
memory trade-off since memory protecting unused slabs is a nice opportunistic
boost to security. However, it's not really part of the core security model or
features, so it's quite reasonable to use much larger empty slab caches when
the memory usage is acceptable. It would also be reasonable to attempt to use
heuristics for dynamically tuning the size, but there's not a great
one-size-fits-all approach so it isn't currently part of this allocator
implementation.
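
The purge / protect and unprotect steps map onto two system calls. The sketch
below shows one plausible shape for them using `mmap` and `mprotect`. The
function names `purge_slab` and `unprotect_slab` are hypothetical, and the
slab is assumed to be page-aligned with a size that is a multiple of the page
size.

```c
#define _DEFAULT_SOURCE // for MAP_ANONYMOUS under strict -std=c11 with glibc
#include <stddef.h>
#include <sys/mman.h>

// Replace an unused slab with a fresh inaccessible mapping. The kernel drops
// the old pages, and any stale access to the slab faults until it is reused.
static int purge_slab(void *slab, size_t slab_size) {
    void *p = mmap(slab, slab_size, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    return p == MAP_FAILED ? -1 : 0;
}

// Make a purged slab usable again once the partially filled and cached empty
// slabs have run out. The pages come back zero-filled.
static int unprotect_slab(void *slab, size_t slab_size) {
    return mprotect(slab, slab_size, PROT_READ | PROT_WRITE);
}
```

A larger cache of empty slabs means fewer round trips through this pair of
calls, which is where the performance side of the trade-off comes from.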

### Thread caching (or lack thereof)

Thread caches are a commonly implemented optimization in modern allocators but
aren't very suitable for a hardened allocator even when implemented via arrays
like jemalloc rather than free lists. They would prevent the allocator from
having perfect knowledge about which memory is free in a way that's both
race-free and works with fully out-of-line metadata. They would also interfere
with the quality of fine-grained randomization even with randomization support
in the thread caches. The caches would also end up with much weaker protection
than the dedicated metadata region. Potentially worst of all, they're
inherently incompatible with the important quarantine feature.

The primary benefit from a thread cache is performing batches of allocations
and batches of deallocations to amortize the cost of the synchronization used
by locking. The issue is not contention but rather the cost of synchronization
itself. Performing operations in large batches isn't necessarily a good thing
in terms of reducing contention to improve scalability. Large thread caches
like those in TCMalloc are a legacy design choice and aren't a good approach
for a modern allocator. In jemalloc, thread caches are fairly small and have a
form of garbage collection to clear them out when they aren't being heavily
used. Since this is a hardened allocator with a bunch of small costs for the
security features, the synchronization is already a smaller percentage of the
overall time compared to a much leaner performance-oriented allocator. The
batching benefits could be obtained via allocation queues and deallocation
queues, which would avoid bypassing the quarantine and wouldn't have as much of
an impact on randomization. However, deallocation queues would also interfere
with having global knowledge about what is free. An allocation queue alone
wouldn't have many drawbacks, but it isn't currently planned even as an
optional feature since it probably wouldn't be enabled by default and isn't
worth the added complexity.

The secondary benefit of thread caches is being able to avoid the underlying
allocator implementation entirely for some allocations and deallocations when
they're mixed together rather than many allocations being done together or many
frees being done together. The value of this depends a lot on the application
and it's entirely unsuitable / incompatible with a hardened allocator since it
bypasses all of the underlying security and would destroy much of the security
value.

## Large allocations

The expectation is that the allocator does not need to perform well for large
allocations, especially in terms of scalability. When the performance for large
allocations isn't good enough, the approach will be to enable more slab
allocation size classes. Doubling the maximum size of slab allocations only
requires adding 4 size classes while keeping internal waste bounded below 20%.
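
The 20% bound follows from the spacing: with 4 evenly spaced classes per
doubling, the spacing is a quarter of the previous maximum, so the first new
class is 5 spacings wide and the worst case wastes just under 1 spacing out of
5. The sketch below works this through. The helper names are hypothetical, and
the 16384 byte starting point used in the note after it is an assumption
inferred from the 2048 byte spacing mentioned earlier.

```c
#include <stddef.h>

#define CLASSES_PER_DOUBLING 4

// Fill `out` with the size classes covering (max, 2 * max], evenly spaced.
static void next_size_classes(size_t max, size_t out[CLASSES_PER_DOUBLING]) {
    size_t spacing = max / CLASSES_PER_DOUBLING;
    for (int i = 0; i < CLASSES_PER_DOUBLING; i++) {
        out[i] = max + spacing * (size_t)(i + 1);
    }
}

// Worst-case internal waste as a fraction of the size class: the smallest
// request that no longer fits in `prev` is rounded all the way up to `size`.
static double worst_case_waste(size_t prev, size_t size) {
    return (double)(size - (prev + 1)) / (double)size;
}
```

Starting from 16384, this yields 20480, 24576, 28672 and 32768. The worst case
is a 16385 byte request landing in the 20480 byte class, which wastes 4095
bytes, just under 20%. The later classes in the same doubling waste
proportionally less.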

Large allocations are implemented as a wrapper on top of the kernel memory
mapping API. The addresses and sizes are tracked in a global data structure
with a global lock. The current implementation is a hash table and could easily
use fine-grained locking, but it would have little benefit since most of the
locking is in the kernel. Most of the contention will be on the `mmap_sem` lock
for the process in the kernel. Ideally, it could simply map memory when
allocating and unmap memory when freeing. However, this is a hardened allocator
and the security features require extra system calls due to lack of direct
support for this kind of hardening in the kernel. Randomly sized guard regions
are placed around each allocation, which requires mapping a `PROT_NONE` region
including the guard regions and then unprotecting the usable area between them.
The quarantine implementation requires clobbering the mapping with a fresh
`PROT_NONE` mapping using `MAP_FIXED` on free to hold onto the region while
it's in the quarantine, until it's eventually unmapped when it's pushed out of
the quarantine. This means there are twice as many system calls for allocating
and freeing as there would be if the kernel supported these features directly.
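
To make the system call count concrete, here is a minimal sketch of the
sequence described above. The function names are hypothetical, `size` and
`guard` are assumed to be multiples of the page size, and the guard size is
passed in by the caller rather than chosen randomly as the allocator would.

```c
#define _DEFAULT_SOURCE // for MAP_ANONYMOUS under strict -std=c11 with glibc
#include <stddef.h>
#include <sys/mman.h>

// Allocation, call 1 of 2: reserve the guards and usable area as PROT_NONE.
// Allocation, call 2 of 2: unprotect only the usable area between the guards.
static void *map_with_guards(size_t size, size_t guard) {
    char *base = mmap(NULL, guard + size + guard, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        return NULL;
    }
    if (mprotect(base + guard, size, PROT_READ | PROT_WRITE) != 0) {
        munmap(base, guard + size + guard);
        return NULL;
    }
    return base + guard;
}

// Free, call 1 of 2: clobber the usable area with a fresh PROT_NONE mapping so
// the address range stays reserved while it sits in the quarantine.
static int quarantine_region(void *p, size_t size) {
    void *r = mmap(p, size, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    return r == MAP_FAILED ? -1 : 0;
}

// Free, call 2 of 2: once the region is pushed out of the quarantine, release
// everything, including the guards.
static int release_region(void *p, size_t size, size_t guard) {
    return munmap((char *)p - guard, guard + size + guard);
}
```

Without the hardening, allocation would be a single `mmap` and free a single
`munmap`, which is where the doubling comes from.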

## Memory tagging

Integrating extensive support for ARMv8.5 memory tagging is planned and this