lib/sdt_alloc: introduce stack and buddy page allocators #1657

Open
wants to merge 15 commits into main from arena-buddy-allocator

Conversation

Contributor

@etsal etsal commented Apr 10, 2025

Following up from #1626, introduce an arena memory buddy allocator along with a secondary stack allocator. These allocators remove the need to create a unique TID for each allocation of data structures with interchangeable instances, e.g., cpumasks. The buddy allocator handles arbitrary allocation sizes; currently allocations are page-aligned, but this limitation will be removed shortly. The stack allocator is used to bootstrap the main buddy allocator and doubles as a way to temporarily buffer freed allocator chunks so that later allocations can reuse them.

@etsal etsal requested a review from htejun April 10, 2025 19:57
@etsal etsal force-pushed the arena-buddy-allocator branch from 081e9ad to 6d580ca on April 10, 2025 20:35
static void scx_stk_push(struct scx_stk *stack)
{
scx_ring_t *ring = stack->current;
int rind = stack->cind;
Contributor

ring and rind look a bit too similar. Maybe use ridx for index?

};

/*
* We bring memory into the allocator 1MiB at a time.
Contributor

Can this be smaller - e.g. like 128k? Is this dictated by the maximum size of allocation that we wanna support?

Contributor Author
@etsal etsal Apr 10, 2025

Yes, right now this is tied to the maximum allocation size, but we can make it smaller. The rationale is that, since we are doing page-sized allocations and have 8 order bits, the maximum allocation is 2^20 bytes. We can make fewer orders available, e.g., 5, to cap allocations at 128 KiB (2^(12 + 5) bytes).
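
For reference, a small sketch of the arithmetic above, assuming 4 KiB (2^12) pages; the macro names are illustrative only, not taken from the patch:

#define PAGE_SHIFT	12	/* page-sized minimum allocation: 4 KiB		*/
#define MAX_ORDERS	8	/* 2^(12 + 8) = 1 MiB maximum allocation	*/
#define CAPPED_ORDERS	5	/* 2^(12 + 5) = 128 KiB maximum allocation	*/

/* Largest allocation supported with a given number of orders. */
#define MAX_ALLOC_BYTES(orders)	(1UL << (PAGE_SHIFT + (orders)))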


/* Populate with the elements. */
for (i = zero; i < nelems && can_loop; i++) {
ring->elems[i] = (void __arena *)data;
Contributor

Hmmm.. so we have the link structs separate from the allocated memory regions. Would it be possible to use the free regions to link the elements? The memory isn't being used while free anyway and can be used to carry any per-element metadata.

Contributor Author

Got it, so basically typecast one of the elements into a ring and consume it when we need additional elements. I like the idea of typecasting freely between allocatable objects and rings: it solves the current problem where we allocate a new ring every time we bring new data into the allocator, just to avoid ever having to allocate extra rings during a free operation. This way we do not end up with overly long, mostly empty rings.
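
A minimal sketch of that idea, with purely illustrative names and layout rather than the PR's actual scx_ring_t: the memory of one free element is reinterpreted as the ring that tracks further free elements, so adding memory never requires a separate ring allocation.

#define RING_CAPACITY 62

struct free_ring {
	struct free_ring *next;		/* next ring in the stack	*/
	unsigned long nelems;		/* live entries in elems[]	*/
	void *elems[RING_CAPACITY];	/* free elements tracked here	*/
};

/* Reuse one of the freshly added elements as the ring that tracks the rest. */
static struct free_ring *ring_from_free_element(void *elem)
{
	struct free_ring *ring = elem;

	ring->next = NULL;
	ring->nelems = 0;

	return ring;
}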

/* Check if the buddy is not in the free list. */
if (chunk->order_indices[order] != buddy_index &&
buddy_header->prev_index == SCX_BUDDY_CHUNK_ITEMS &&
buddy_header->next_index == SCX_BUDDY_CHUNK_ITEMS)
Contributor

Hmm... wouldn't it be more compact to represent free state with bitmaps?

Contributor Author

IIUC, does that mean keeping a bit for each min allocation? We could do that, but then how would we keep order information and find an appropriately large allocation in O(1)? Would we not have to scan the map/bitmap in the allocation path?

Contributor

Would keeping a bitmap per order work? Let's say, for simplicity, we only support 32-byte, 64-byte, and 128-byte allocations:

128b bitmap: ___1___0
 64b bitmap: _0_0_1_0
 32b bitmap: 00000010

The total size of the bitmaps we need to keep is ~2 * MAX_ALLOC / MIN_ALLOC bits, right? For allocating and freeing, we can still use the per-level object lists so that we don't have to scan, but when checking whether the buddy is free, we can just look at the neighboring bit at the level of the object. I think the only metadata that needs to be maintained for an allocated area is the order of the allocation, which can be pretty compact (4 bits?).
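
A small standalone sketch of the buddy check under that scheme, assuming a 32-byte minimum block as in the example above; names and layout are illustrative, not the PR's:

#include <stdbool.h>
#include <stdint.h>

#define MIN_SHIFT 5	/* 32-byte minimum allocation, as in the example */

/*
 * free_bits is the bitmap for this order: one bit per block of that order,
 * set when the block is free.
 */
static bool buddy_is_free(const uint64_t *free_bits, uintptr_t offset,
			  unsigned int order)
{
	uintptr_t idx = offset >> (MIN_SHIFT + order);	/* block index at this order	*/
	uintptr_t buddy = idx ^ 1;			/* buddy differs in the low bit	*/

	return free_bits[buddy / 64] & (1ULL << (buddy % 64));
}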

Contributor

Alternatively, the level can be kept in a separate set of bitmaps too, where a set bit indicates that the object is at that level:

128b bitmap: ___1___0
 64b bitmap: _0_0_1_0

I.e., the above would indicate that the first 128 bytes are a single area, followed by a 64-byte area and then two 32-byte areas. Note that we don't need the last level, as that's the only size left if all other bits are zero. This would reduce the cost of the representation to ~1/2 bit per min-size area, I think.
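
As a rough illustration of how an object's order could be recovered under that encoding (names are illustrative, not the PR's): walk the level bitmaps from the largest order down; the first set bit covering the object gives its level, and the minimum level is implied when none is set.

#include <stdint.h>

#define MIN_SHIFT 5	/* 32-byte minimum area, as in the example */

/* level_bits[order] is the bitmap for that order; no bitmap exists for order 0. */
static unsigned int object_order(uint64_t *const level_bits[], uintptr_t offset,
				 unsigned int max_order)
{
	unsigned int order;

	for (order = max_order; order > 0; order--) {
		uintptr_t idx = offset >> (MIN_SHIFT + order);

		if (level_bits[order][idx / 64] & (1ULL << (idx % 64)))
			return order;
	}

	return 0;	/* no bit set at any higher level: minimum-size area */
}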

Contributor

Oh, ~3/4 bit per min size area, right?

Contributor Author

After thinking about this a bit, I think a bitmap-based approach would work for the order bits but not for removing the *prev pointer (unless I'm missing some way the bitmap can help with the linked-list operations). Assuming that we are moving towards the following:

scx_bitmap_t freestate[];

typedef struct scx_buddy_header {
	u8 next_ind;
} scx_buddy_header_t;

We save space on the order, by encoding it in 3 bits instead of a whole u8, and on the *prev pointer, which we now remove completely. These gains are orthogonal, and the order-bit savings are very easy to get with the current design through a simple refactoring.

Removing the *prev pointer makes things trickier and is a bit of a tradeoff:

  1. The bitmap is now the source of truth on whether the min memory chunk is allocated or free

  2. Freeing is O(1): we just go to the bitmap and mark the memory as free, then coalesce as needed. We do not remove any elements from the linked lists because, without a *prev pointer, removal would be an O(n) operation.

  3. Allocating is trickier, because now it is not guaranteed that the elements in the linked list are actually free or of the right order. We can leave the now-allocated elements in the linked list and scan on allocation until we find the first valid element, removing already-allocated blocks from the free list as we encounter them (see the sketch after this list).

  4. A big problem is that if elements in the linked list get coalesced into higher-order blocks, their *next pointer may be overwritten, which completely breaks the linked lists. The solution is to coalesce in waves by essentially garbage collecting freed but still-in-list elements, but I am not sure whether that is worth it in terms of performance and complexity.

Overall this saves us half the bits we are using right now, at the cost of extra complexity (each header field should be a u8). The order bitmap gets us from 24 bits to 20; I do not think saving the extra bit is worth the complexity of working with 3-bit bitmap values, and we can expand the available orders to 16 to make use of it. Removing the extra pointer gets us from 20 to 12, but it really complicates the design and may introduce non-obvious performance bottlenecks.
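
To make point 3 above concrete, here is a rough sketch of lazy removal on the allocation path, assuming the bitmap is the source of truth; all names, types, and sizes are illustrative, not the PR's actual ones.

#include <stdbool.h>
#include <stdint.h>

#define NO_INDEX	UINT8_MAX
#define NUM_ORDERS	16

/* Illustrative layout only. */
struct hdr {
	uint8_t next_ind;			/* singly linked free list, no *prev	*/
};

struct chunk {
	uint8_t order_head[NUM_ORDERS];		/* list head index per order		*/
	uint64_t freestate[NUM_ORDERS][4];	/* 1 bit per block, per order		*/
	struct hdr headers[256];		/* one header per minimum block		*/
};

static bool test_and_clear_free(uint64_t *bits, uint8_t idx)
{
	uint64_t mask = 1ULL << (idx % 64);
	bool was_free = bits[idx / 64] & mask;

	bits[idx / 64] &= ~mask;
	return was_free;
}

/* Pop the first entry that the bitmap still reports free; drop stale ones. */
static uint8_t pop_free_block(struct chunk *chunk, unsigned int order)
{
	uint8_t idx = chunk->order_head[order];

	while (idx != NO_INDEX) {
		uint8_t next = chunk->headers[idx].next_ind;

		chunk->order_head[order] = next;	/* unlink head; no *prev needed	*/
		if (test_and_clear_free(chunk->freestate[order], idx))
			return idx;			/* genuinely free, hand it out	*/

		idx = next;				/* stale entry, keep scanning	*/
	}

	return NO_INDEX;
}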

@etsal etsal force-pushed the arena-buddy-allocator branch from 6d580ca to fb84b03 on April 17, 2025 02:05
Contributor Author

etsal commented Apr 17, 2025

Optimized the stack allocator and refactored order tracking for blocks. Still need to roll the headers into the freed buddy allocator blocks.

@etsal etsal marked this pull request as ready for review April 18, 2025 15:26
@etsal etsal force-pushed the arena-buddy-allocator branch from 02caca3 to 814e2a9 on April 23, 2025 19:38
@etsal etsal force-pushed the arena-buddy-allocator branch from 2cb91f8 to b1dca33 on April 24, 2025 18:19