lib/sdt_alloc: introduce stack and buddy page allocators #1657

Open
wants to merge 15 commits into main from arena-buddy-allocator

Conversation

Contributor

@etsal etsal commented Apr 10, 2025

Following up from #1626, introduce an arena memory buddy allocator along with a secondary stack allocator. These allocators remove the need to create a unique TID for each allocation of data structures with interchangeable instances, e.g., cpumasks. The buddy allocator handles arbitrary allocation sizes; currently allocations are page-aligned, but this limitation will be removed shortly. The stack allocator is used to bootstrap the main buddy allocator and doubles as a way to temporarily buffer freed allocator chunks so that later allocations can reuse them.

@etsal etsal requested a review from htejun April 10, 2025 19:57
@etsal etsal force-pushed the arena-buddy-allocator branch from 081e9ad to 6d580ca on April 10, 2025 20:35
static void scx_stk_push(struct scx_stk *stack)
{
scx_ring_t *ring = stack->current;
int rind = stack->cind;
Contributor

ring and rind look a bit too similar. Maybe use ridx for index?

};

/*
* We bring memory into the allocator 1MiB at a time.
Contributor

Can this be smaller - e.g. like 128k? Is this dictated by the maximum size of allocation that we wanna support?

Contributor Author
@etsal etsal Apr 10, 2025

Yes, right now this is tied to the maximum allocation size, but we can make it smaller. The rationale is that, since we are doing page-sized allocations and have 8 order bits, the maximum allocation is 2^20 bytes. We can make fewer orders available, e.g., 5, to cap allocations at 128 KiB (2^(12 + 5) bytes).
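
For reference, a small sketch of the arithmetic above, assuming 4 KiB (2^12) pages; the macro names are illustrative only, not taken from the patch:

#define PAGE_SHIFT	12	/* page-sized minimum allocation: 4 KiB		*/
#define MAX_ORDERS	8	/* 2^(12 + 8) = 1 MiB maximum allocation	*/
#define CAPPED_ORDERS	5	/* 2^(12 + 5) = 128 KiB maximum allocation	*/

/* Largest allocation supported with a given number of orders. */
#define MAX_ALLOC_BYTES(orders)	(1UL << (PAGE_SHIFT + (orders)))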


/* Populate with the elements. */
for (i = zero; i < nelems && can_loop; i++) {
ring->elems[i] = (void __arena *)data;
Contributor

Hmmm.. so we have the link structs separate from the allocated memory regions. Would it be possible to use the free regions to link the elements? The memory isn't being used while free anyway and can be used to carry any per-element metadata.

Contributor Author

Got it, so basically typecast one of the elements into a ring and consume it when we need additional elements. I like the idea of typecasting freely between allocatable objects and rings: it solves the current problem where we allocate a new ring every time we bring new data into the allocator, just to avoid ever having to allocate extra rings during a free operation. This way we do not end up with overly long, mostly empty rings.
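
A minimal sketch of that idea, with purely illustrative names and layout rather than the PR's actual scx_ring_t: the memory of one free element is reinterpreted as the ring that tracks further free elements, so adding memory never requires a separate ring allocation.

#define RING_CAPACITY 62

struct free_ring {
	struct free_ring *next;		/* next ring in the stack	*/
	unsigned long nelems;		/* live entries in elems[]	*/
	void *elems[RING_CAPACITY];	/* free elements tracked here	*/
};

/* Reuse one of the freshly added elements as the ring that tracks the rest. */
static struct free_ring *ring_from_free_element(void *elem)
{
	struct free_ring *ring = elem;

	ring->next = NULL;
	ring->nelems = 0;

	return ring;
}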

/* Check if the buddy is not in the free list. */
if (chunk->order_indices[order] != buddy_index &&
buddy_header->prev_index == SCX_BUDDY_CHUNK_ITEMS &&
buddy_header->next_index == SCX_BUDDY_CHUNK_ITEMS)
Contributor

Hmm... wouldn't it be more compact to represent free state with bitmaps?

Contributor Author

IIUC, does that mean keeping a bit for each min allocation? We could do that, but then how would we keep order information and find an appropriately large allocation in O(1)? Would we not have to scan the map/bitmap in the allocation path?

Contributor

Would keeping a bitmap per order work? Let's say, for simplicity, we only support 32-byte, 64-byte, and 128-byte allocations:

128b bitmap: ___1___0
 64b bitmap: _0_0_1_0
 32b bitmap: 00000010

The total size of the bitmaps we need to keep is ~2 * MAX_ALLOC / MIN_ALLOC bits, right? For allocating and freeing, we can still use the per-level object lists so that we don't have to scan, but when checking whether the buddy is free, we can just look at the neighboring bit at the level of the object. I think the only metadata that needs to be maintained for an allocated area is the order of the allocation, which can be pretty compact (4 bits?).
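
A small standalone sketch of the buddy check under that scheme, assuming a 32-byte minimum block as in the example above; names and layout are illustrative, not the PR's:

#include <stdbool.h>
#include <stdint.h>

#define MIN_SHIFT 5	/* 32-byte minimum allocation, as in the example */

/*
 * free_bits is the bitmap for this order: one bit per block of that order,
 * set when the block is free.
 */
static bool buddy_is_free(const uint64_t *free_bits, uintptr_t offset,
			  unsigned int order)
{
	uintptr_t idx = offset >> (MIN_SHIFT + order);	/* block index at this order	*/
	uintptr_t buddy = idx ^ 1;			/* buddy differs in the low bit	*/

	return free_bits[buddy / 64] & (1ULL << (buddy % 64));
}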

Contributor

Alternatively, the level can be kept in a separate set of bitmaps too, where a set bit indicates that the object is at that level:

128b bitmap: ___1___0
 64b bitmap: _0_0_1_0

I.e., the above would indicate that the first 128 bytes are a single area, followed by a 64-byte area and then two 32-byte areas. Note that we don't need the last level, as that's the only size left if all other bits are zero. This would reduce the cost of the representation to ~1/2 bit per min-size area, I think.
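
As a rough illustration of how an object's order could be recovered under that encoding (names are illustrative, not the PR's): walk the level bitmaps from the largest order down; the first set bit covering the object gives its level, and the minimum level is implied when none is set.

#include <stdint.h>

#define MIN_SHIFT 5	/* 32-byte minimum area, as in the example */

/* level_bits[order] is the bitmap for that order; no bitmap exists for order 0. */
static unsigned int object_order(uint64_t *const level_bits[], uintptr_t offset,
				 unsigned int max_order)
{
	unsigned int order;

	for (order = max_order; order > 0; order--) {
		uintptr_t idx = offset >> (MIN_SHIFT + order);

		if (level_bits[order][idx / 64] & (1ULL << (idx % 64)))
			return order;
	}

	return 0;	/* no bit set at any higher level: minimum-size area */
}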

Contributor

Oh, ~3/4 bit per min size area, right?

Contributor Author

After thinking about this a bit, I think a bitmap-based approach would work for the order bits but not for removing the *prev pointer (unless I'm missing some way the bitmap can help with the linked-list operations). Assuming that we are moving towards the following:

scx_bitmap_t freestate[];

typedef struct scx_buddy_header {
	u8 next_ind;
} scx_buddy_header_t;

We save space on the order, by encoding it in 3 bits instead of a whole u8, and on the *prev pointer, which we now remove completely. These gains are orthogonal, and the order-bit savings are very easy to get with the current design through a simple refactoring.

Removing the *prev pointer makes things trickier and is a bit of a tradeoff:

  1. The bitmap is now the source of truth on whether the min memory chunk is allocated or free

  2. Freeing is O(1): we just go to the bitmap and mark the memory as free, then coalesce as needed. We do not remove any elements from the linked lists because, without a *prev pointer, removal would be an O(n) operation.

  3. Allocating is trickier, because now it is not guaranteed that the elements in the linked list are actually free or of the right order. We can leave the now-allocated elements in the linked list and scan on allocation until we find the first valid element, removing already-allocated blocks from the free list as we encounter them (see the sketch after this list).

  4. A big problem is that if elements in the linked list get coalesced into higher-order blocks, their *next pointer may be overwritten, which completely breaks the linked lists. The solution is to coalesce in waves by essentially garbage collecting freed but still-in-list elements, but I am not sure whether that is worth it in terms of performance and complexity.

Overall this saves us half the bits we are using right now, at the cost of extra complexity (each header field should be a u8). The order bitmap gets us from 24 bits to 20; I do not think saving the extra bit is worth the complexity of working with 3-bit bitmap values, and we can expand the available orders to 16 to make use of it. Removing the extra pointer gets us from 20 to 12, but it really complicates the design and may introduce non-obvious performance bottlenecks.
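
To make point 3 above concrete, here is a rough sketch of lazy removal on the allocation path, assuming the bitmap is the source of truth; all names, types, and sizes are illustrative, not the PR's actual ones.

#include <stdbool.h>
#include <stdint.h>

#define NO_INDEX	UINT8_MAX
#define NUM_ORDERS	16

/* Illustrative layout only. */
struct hdr {
	uint8_t next_ind;			/* singly linked free list, no *prev	*/
};

struct chunk {
	uint8_t order_head[NUM_ORDERS];		/* list head index per order		*/
	uint64_t freestate[NUM_ORDERS][4];	/* 1 bit per block, per order		*/
	struct hdr headers[256];		/* one header per minimum block		*/
};

static bool test_and_clear_free(uint64_t *bits, uint8_t idx)
{
	uint64_t mask = 1ULL << (idx % 64);
	bool was_free = bits[idx / 64] & mask;

	bits[idx / 64] &= ~mask;
	return was_free;
}

/* Pop the first entry that the bitmap still reports free; drop stale ones. */
static uint8_t pop_free_block(struct chunk *chunk, unsigned int order)
{
	uint8_t idx = chunk->order_head[order];

	while (idx != NO_INDEX) {
		uint8_t next = chunk->headers[idx].next_ind;

		chunk->order_head[order] = next;	/* unlink head; no *prev needed	*/
		if (test_and_clear_free(chunk->freestate[order], idx))
			return idx;			/* genuinely free, hand it out	*/

		idx = next;				/* stale entry, keep scanning	*/
	}

	return NO_INDEX;
}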

@etsal etsal force-pushed the arena-buddy-allocator branch from 6d580ca to fb84b03 on April 17, 2025 02:05
Contributor Author

etsal commented Apr 17, 2025

Optimized the stack allocator and refactored order tracking for blocks. Still need to roll the headers into the freed buddy allocator blocks.

@etsal etsal marked this pull request as ready for review April 18, 2025 15:26
@etsal etsal force-pushed the arena-buddy-allocator branch from 02caca3 to 814e2a9 on April 23, 2025 19:38
@etsal etsal force-pushed the arena-buddy-allocator branch from 2cb91f8 to b1dca33 on April 24, 2025 18:19