Custom Allocators Demystified

saagarjha · on Oct 13, 2020

Do note that "custom allocator" can often just be "take what malloc gives you and carve it up yourself" rather than "interpose malloc"; there's usually no need to drop the system allocator completely. This lets you experiment with your own custom pools or arenas without having to go all in and make something general-purpose that works for every object in your program.

> Over the years, I’ve wasted many hours debugging memory issues. With just a hash table and a set of linked lists, you can track the history of all your allocations. That makes it pretty easy to track down use after free errors, double free errors, and memory leaks. Using guard pages around your allocations, you can detect overflows.

No, just use address sanitizer–it's meant for this. Odds are your custom allocator might have a few bugs in itself and they will make you life more annoying if you're trying to use it for only this.

klyrs · on Oct 13, 2020

I just went from "wtf is an address sanitizer" to "what's the difference between this and valgrind" to an ASAN acolyte in about 1 minute.

http://btorpey.github.io/blog/2014/03/27/using-clangs-addres...

Article mentions JVM noise, but my actual experience is with python noise. Good riddance; this is going into my CI test suite today.

gpderetta · on Oct 13, 2020

Yes! The sanitizers have been a huge game changer for me.

phkahler · on Oct 13, 2020

Having used sanitizers to find a couple C++ bugs I am convinced that Rust is a better solution to that problem. I would never have found the C++ issues without extra tooling, so just use Rust, learn the rules and let the compiler make you write correct code. To clarify, yes ASAN is awesome but using a language where you don't need it is better.

gpderetta · on Oct 13, 2020

How does rust help fix bugs in a C++ code base?

jonhohle · on Oct 14, 2020

When there's no more C++ there will be no bugs in the C++ code base ;-)

j/k, though I've fixed memory bugs, overflow, use after free, etc. written by people much smarter than me. without static analyzers, valgrind, and afl its possible that some would still exist nearly a decade later.

millstone · on Oct 14, 2020

Any custom allocator in Rust will use a lot of unsafe code.

phkahler · on Oct 14, 2020

I would hope all the analysis Rust does can inform the choice of built-in allocator so you dont need to roll your own.

Also, any custom C++ allocator is going to contain a bunch of unsafe code too.

millstone · on Oct 14, 2020

I'm not sure what you mean by "all the analysis"; by default Rust just calls malloc.

It's true that a C++ custom allocator will have lots of "unsafe" code, but this is really where C++ shines. The tooling is more complete, the aliasing rules are better documented and understood, the language fights you less, the stdlib has explicit support, it is easier to control the codegen. Rust cannot bring its strengths to bear on this sort of problem.

SloopJon · on Oct 14, 2020

> No, just use address sanitizer

In fairness, the next sentence in the post is:

> Techniques like this can be a nice complement to external tools like Valgrind or AddressSanitizer.

saagarjha · on Oct 14, 2020

I feel like it just mentioned Valgrind when I read it.

OnlyOneCannolo · on Oct 14, 2020

You can do interesting things by replacing malloc, like this compacting allocator:

https://github.com/plasma-umass/Mesh

scottlamb · on Oct 14, 2020

One of these things is not like the others.

Slabs and bump/arena allocators have significant advantages over just using the system allocator. You use them in a constrained way which allows for a more efficient implementation—slabs by having only one size of allocation and bump/arena by freeing things all at once (or maybe in LIFO order if you're fancy). Thus, they make sense in a lot of performance-oriented programs.

A buddy allocator has no such constraints. It's general-purpose. Thus, there's no obvious opportunity to do better than the system allocator. And it's pretty wasteful if your allocations aren't close to sizes of 2 (internal fragmentation => wasted RAM and cache). If you have a system allocator available that uses say size classes instead, I have no idea why you'd prefer to use a buddy allocator. For that matter, if your system allocator is a buddy allocator (or more likely: some kind of hybrid), I have no idea why you'd prefer your own implementation.

One constraint might be that you promise your buddy allocator's usage is single-threaded, and thus no locking is necessary. But a production-quality system allocator likely uses thread-local or CPU-local caches to accomplish much the same thing, so I don't think this constraint buys you much efficiency.

If you're implementing a bare-metal system on a microcontroller and need a very small memory allocator implementation for light usage, implementing your own buddy allocator might make sense. But that's a lot of caveats...

dragontamer · on Oct 14, 2020

Thread safety, or the lack of it.

The system-allocator needs to be thread-safe in a large variety of situations.

Your custom allocator doesn't have to be: maybe you know only 1 thread is running. Heck, even knowing that only 32-threads or 64-threads will ever run can make gross simplifications to the synchronization code.

System-allocator needs to be thread-safe with an unbounded number of threads (because you may call pthread_create at any time).

Your custom allocator may know more specifics about when, and where, pthreads are created, as well as whether or not they interact with each other.

---------

Case in point: lets say you malloc a 256MB region per thread. You know that only one thread would ever access this 256MB region, so you write it without any synchronization primitives what so ever.

But you still want a general purpose, multi-size supported allocator to split up the data inside the 256MB region.

You call void* fooptr = malloc(256MB), because grabbing from the system allocator should remain thread-safe (who knows how many other threads are calling malloc?). But afterwards, you can make very specific assumptions about the access patterns of that fooptr.

scottlamb · on Oct 14, 2020

Given that the system allocator likely has a thread-local or CPU-local cache, does that constraint really make things noticeably more efficient? It simplifies your implementation, sure, but the system allocator already exists, and one implementation is simpler than two...

dragontamer · on Oct 14, 2020

Hmm... my multithreaded strategy typically uses task-based parallelism with a thread-pool.

In particular: #pragma omp task

Those omp-tasks can "float" between threads, depending on your implementation. If some task enters a blocking situation (usually a task-barrier), it could switch to another pthread during the implementation.

Ex: Task A is running on Hardware-Thread#10. Task A mallocs something from Thread#10's local malloc. Task A calls barrier, which means Thread#10 "gives up Task A" back to the work queue.

Later, all tasks hit the barrier, and Task A can run again. But which threads run TaskA is left up to the runtime. Thread#25 might be running the task now. At this point, Thread#25 calls free(fooptr), but it now has to be a global-synchronization, since the data came from Thread#10's pool.

----------

Its probably not safe to assume thread-local storage to be sufficient for task-based parallelism.

scottlamb · on Oct 14, 2020

> At this point, Thread#25 calls free(fooptr), but it now has to be a global-synchronization, since the data came from Thread#10's pool.

I don't think this is true. As I understand it, it just gets put into Thread#25's pool rather than returned to Thread#10s. If there's a long-running imbalance—like a producer-consumer pattern in which all the mallocs are from Thread#10 and all the frees are from Thread#25—that will lead to more global synchronization because the allocator will have to repeatedly refill Thread#10's pool and empty Thread#25's pool. But if it's just that there's some shuffling of threads between the allocations and frees but stuff is generally balanced, there's little if any additional cost.

I think you're suggesting using a set of "task-local" custom allocators, and each of those can ignore threads? I suppose that would work, as each task is only on one thread at once, and there has to be a barrier anyway when it hops from thread to thread. But I'm skeptical it's faster than the system allocator. I'd love

nitrogen · on Oct 14, 2020

A buddy allocator has no such constraints. It's general-purpose. Thus, there's no obvious opportunity to do better than the system allocator.

The system allocator isn't guaranteed to be optimal for your workload. Way back in college one of the assignments was to write a custom allocator. The fastest allocator got extra credit. Our school lab was running Linux on SGIs with AMD Opteron CPUs. The native malloc would get 3000 allocs per second in the grading program's benchmark. I managed to top 10000/s with a buddy list allocator with coalescing, if I recall correctly.

One of the weird things was that replacing all structs with pointer arithmetic (managed by preprocessor macros) gave a pretty big boost in speed. I didn't dig too much into the assembly because it was already more than fast enough.

scottlamb · on Oct 14, 2020

I think the system allocator on most platforms is a lot better now than it was then, and if it's not, you can swap in jemalloc easily. I'd recommend this over implementing your own buddy allocator.

nitrogen · on Oct 14, 2020

For sure, writing your own allocator is the last thing you reach for after all else fails. I've never needed to do so in practice. My allocator was fast because it had no thread safety, no debugging features, raw pointer arithmetic (hard to read code), and no/optional error checking.

59nadir · on Oct 14, 2020

> For sure, writing your own allocator is the last thing you reach for after all else fails.

Writing your own general purpose allocator might be the last thing, but custom allocators for well-understood limited/localized behavior is both trivial and something one should do even for simplification in many cases. Freeing an entire arena/resetting an allocation pointer is both much faster and simpler than `malloc`/`new` & `free`/`delete`.

People often misunderstand "custom allocators" to mean "making your own `jemalloc`" and it feels like your comment runs the risk of fueling that fire. Creating your own `jemalloc` is both wasteful in that `jemalloc` performs worse than even the most simple of custom allocators and a waste of your time in that general purpose allocators already exist so if you need something like that you probably can save your time for something else.

dragontamer · on Oct 13, 2020

Hmm, a few more allocators to mention.

0. Garbage Collection: Reference Counting -- C++, Python, Rust. Reference counting augments any of the allocators discussed in the blogpost with a simple garbage-collection scheme.

1. Garbage Collection: Stop and Copy -- This is the easiest garbage collection to implement. This is closely related to the Linear-Allocator, except "free" compiles into a nop. Only "new" matters. When "new" runs out of space (which it will inevitably do), you scan the entire memory space, and copy all "current live objects" to the 2nd heap. (An identically sized heap).

Stop and copy uses up 50% of the space (If you have 32GB available, you can only have 2x16GB heaps, and can only use up to 16GBs total). Still, the gross simplicity of stop-and-copy is cool.

2. Garbage Collection -- Mark and Sweep: is a bit complicated in my experience. It works well in practice, but its harder to program (especially if you want to do it efficiently, with tri-color invariants and all that).

--------

Because of the gross simplicity of alloc() in stop-and-copy, it can be very fast in some use cases! If you can't figure out how to make free() work with linear allocators, just stop-and-copy the darn thing.

-------

Garbage collection is reserved for when your links start to point to each other in ways far more complicated thank linked lists.

If you have cycles in your linked lists ... or your linked lists act like cons pairs (aka: car / cdr) from Lisp, garbage collection is basically almost inevitable. There's no real way to know when to delete a pointer safely if you're using Lisp-style car/cdr pairs. Even reference counting is bad, because cycles are pretty common in that style of programming.

ece · on Oct 14, 2020

The newest JVM GCs are of the Stop and Copy type AFAIK, Z GC and Shenandoah. Though I'd really like to see tcmalloc/jemalloc tied into this.

chrchang523 · on Oct 13, 2020

I've gotten a lot of mileage out of the following two extensions to the linear allocator:

1. A "pop-to" operation. If you're ready to free ALL memory allocated at or after allocation X, you can do this by moving the start-of-free-space pointer back to the start of allocation X.

2. An end-of-free-space pointer, with its own push and pop-to operations.

In combination, these are often enough to get maximum value out of a fixed-size memory workspace. A function can allocate all returned data structures on one side of the workspace, and use the other side of the workspace for all temporary allocations. This approach does require tight coupling so there are a lot of settings where it clearly doesn't belong, but I've found it to be very effective for high-performance scientific code.

dbandstra · on Oct 13, 2020

This is a really nice system. Quake used it back in 1996 to fit the game in 8MB. They called it a "hunk"[0] and it worked pretty much exactly as you said.

The main drawback I find is that you can't use some generic data structure implementations that expect to be able to free memory out of order, unless you're fine with leaking before the final cleanup (if integrated into a generic allocator interface, the "free" method will probably be a no-op). For example, maybe you need a temporary hashmap to help with parsing a text file. It can be interesting to come up with implementations that work without freeing out of order.

Of course, you can always opt to use another, more general purpose allocator, on a block of memory retrieved from the hunk (see Quake's "zones").

[0] https://github.com/id-Software/Quake/blob/master/WinQuake/zo...

snicker7 · on Oct 13, 2020

Would recommend watching the "What's a Memory Allocator, Anyway?" talk from Zig contributor Benjamin Feng:

https://www.youtube.com/watch?v=vHWiDx_l4V0

scottlamb · on Oct 14, 2020

From TFA:

> The trick is to use the allocations themselves as linked list nodes so you don’t have to waste any extra memory for tracking.

There was an article that I'm struggling to find—posted to hn iirc—recommending against this approach. IIRC there were two reasons stated:

* Efficiency due to CPU cache. Freed allocations might have been unused for a long time and thus out of cache. Writing the pointer to the linked list on free puts it back into cache—a whole cache line when you only need 8 bytes or whatever for the singly-linked list—possibly at the expense of something hotter. (I think there are special instructions on many platforms to store without adding to cache, but that's probably not wise either because a bunch of frees might come together and thus need to access a previous one's book-keeping info. Better to keep all the bookkeeping info together in a dense region so you use more of the cache line.)

* Debugging troubles on buggy applications. They're more likely to overwrite adjacent allocations and thus the memory allocator's book-keeping, resulting in hard-to-debug failures. (I recall not being super convinced of this reason—I think having external tracking is insufficient for making it easy to debug this kind of error. I think you'd want to make it possible for ASAN to work well, thus you'd want to implement its "manual poisoning" interface.)

edit: I found this tcmalloc issue talking about not wanting to use singly-linked lists because of cache effects when moving stuff from the thread cache to the central list. Kind of similar. https://github.com/gperftools/gperftools/issues/951

dragontamer · on Oct 13, 2020

One more allocator to mention:

* Fixed size allocator

If you don't need multiple sizes (like malloc(size)), then a fixed-size allocator is significantly simpler to implement, to the point of absurdity.

Step 1: Put all valid memory locations into a stack.

Step 2: alloc with stack.pop();

Step 3: free with stack.push();

The end. You get memory locality, cache-friendliness, O(1) operations, zero external fragmentation. The works. Bonus points: stack.push() and stack.pop() can be implemented with SIMD with help of stream-compaction (http://www.cse.chalmers.se/~uffe/streamcompaction.pdf), and can therefore serve as a GPU-malloc as long as one-size is sufficient.

--------------

The only downsize of "fixed size" allocation is the significant amount of internal fragmentation per node. (Ex: if you only use 8-bytes per node, your 64-byte reservation is mostly wasted). But if you're making a custom-allocator, fixed-size is probably one of the easiest to implement in practice.

-------------

You can extend this to multithreaded operations by simply using atomic-swap, atomic-and, and atomic-or across a shared bitset between threads. 1-bit per fixed-size block. (bit == 1 means something has been alloc(). bit == 0 means something is free()). The stack.push() and stack.pop() operations can run thread-local.

I recommend 64-bytes or 128-bytes as your "fixed size", because 64-bytes is the smallest you get before false-sharing becomes a thing.

---------

If you have a fixed-size allocator but need items to extend beyond a fixed-size array, then learn to use linked lists.

gpderetta · on Oct 13, 2020

Well, if your stack is using a dynamic array as backing store, push is only amortized O(1). You can simply chain all all freed blocks in a singly linked free list without using additional space.

Edit: ... which is what is described in the article which most definitely I had read when I originally made this comment!

dragontamer · on Oct 13, 2020

> Well, if your stack is using a dynamic array as backing store, push is only amortized O(1). You can simply chain all all freed blocks in a singly linked free list without using additional space.

Given the higher latency on linked lists, and the relative simplicity of calculating the stack-size needed, it doesn't seem like going through the linked-list traversal is the best plan to me.

If you repeatedly need extra space for "additional slabs", I suggest an unrolled linked list (https://en.wikipedia.org/wiki/Unrolled_linked_list). Each 1MB slab with 64B blocks gets associated with a new std::array<void*, 16384>. This keeps the relative simplicity of a linked list (O(1) push / pop operations), while keeping the cache-friendliness of a normal array.

gpderetta · on Oct 13, 2020

In this case the linked list will actually be lower latency, as you'll only be touching two locations: the root and the node itself while with the queue you'll have an additional hop through it.

And yes, of course you would allocate your nodes from a linear contiguous (but possibly chunked) fix size bump allocator first.

dragontamer · on Oct 14, 2020

I'm just making this up as I go, but I'm only seeing 1x pointer dereference in the hot-loop (the "if-side" of the following if statement. The "else-side" is more complex, but not really bad)

    if(this->size >= 0){
      return lastLinkedSlab[this->size--];
    } else {
      auto old = lastLinkedSlab;
      lastLinkedSlab =  lastLinkedSlab->next;
      delete old;
      if(lastLinkedSlab == nullptr){
          lastLinkedSlab = new LinkedSlab();
      }
      this->size = 16384 - 1;
      return lastLinkedSlab->heap[this->size--];
    }

Or something like that. I'm not really seeing how this scheme incurs many indirect pointer references.

> And yes, of course you would allocate your nodes from a linear contiguous (but possibly chunked) fix size bump allocator first.

Hmm, that's probably a bit more efficient, but probably not a big deal because its a "once ever" optimization. You usually want an optimization to loop over itself a few times...

gpderetta · on Oct 14, 2020

so, what I have in mind is this. Let's assume in both cases the allocator is static. In the list case you have:

    struct node { node* next; };
    static node * root;

    void * alloc() {
         node * result = root; // 1 load
         if (result) { // fast path
           root = result->next; // 1 load + 1 store
           return result;
         } else {
           //slow path
         }
    }

There is a single load in the critical path (loading the content of result); the additional load of result next and the store are not in the critical path as the result does not depend on them. Also loading the content of result->next will touch a cacheline that will be accessed very soon by the caller of alloc anyway.

In the queue case you have:

    struct queue  { void ** data; int size; }
    static queue q;
    void * alloc() {
         int sz = q.size -1; // 1 ld [a] + 1 sub 
         if (sz >= 0) { // fast path
              size = sz; // 1 store
              void ** data = q.data; // 1 ld [b]
              return data[sz]; // 1 ld [c]
         } else { /* slow path */ }
    }

In this case the chain b->c is in the critical path (a can be in parallel with b, and the sub has minimal latency); the store is not in the critical path. So as you can see there is an additional dereference. Also it will touch the additional cacheline pointed by data+sz.

If you are doing many allocations in a very tight loop, then the store in the linked list example suddenly becomes part of the critical path and can become a problem, while the compiler can extract some parallelism from the size computation in the queue example. If that scenario is important for you then the queue can be preferable.

dragontamer · on Oct 14, 2020

Okay, I had something wrong earlier, but I deleted my earlier, incorrect post. That's what I get for rushing and trying to do this in one go. I think I got it correct this time:

    struct Node{
        char rep[64]; // 64-bytes per node
    };

    struct StackLink {
      struct StackLink* prev;
      Node* nodePtrs[4096];
    };

    static StackLink* last; // Points to nullptr when empty
    static int stackLinkSize;

It'd be a stack, not a queue. So I changed the name to "stack" to better represent the strategy. stackLinkSize doesn't need to be part of the link structure. That's why the fast path is simply:

    return root->nodePtrs[stackLinkSize--];

This basically breaks down into:

    dereference(root + offsetOf(nodePtrs) + stackLinkSize * sizeof(Node))

Just one dereference, if I'm counting correctly.

I think I agree with you that "size" is dereferenced (for dereference #2), but since its "Hot" I don't think it would be the bottleneck for anything. L1 references are very fast.

gpderetta · on Oct 14, 2020

You also need to load root. If you do not count that, then in the linked list case there are no dereferences (root is litterally the pinter to the allocated block).

The stack case will touch three cachelines (root, nodeptrs and the final block). The linked list only two (there is no nodeptrs).

TBH it would be very hard to design a nonncompletely artificial benchmark were une desig make a difference compared to the other. Only way is test on an actual application and check what fits best.

dragontamer · on Oct 14, 2020

> TBH it would be very hard to design a nonncompletely artificial benchmark were une desig make a difference compared to the other. Only way is test on an actual application and check what fits best.

I think I can agree to that.

I should note that I designed this allocator to be used on the GPU. "Size" is shared between a wavefront / SIMD unit. popcnt(exec_mask) determines how many items to alloc at a time. (if 200 threads call alloc, then they will collectively grab 200 pointers all at once, by doing size -= popcount(exec_mask)).

gpderetta · on Oct 15, 2020

oh, in that contest, being able to parallelize allocation this way must be a big win. Of course it wouldn't work with a free list.

imtringued · on Oct 14, 2020

when you're unrolling your linked lists you don't have to go that far. I'd say at least 128 elements and 1024 elements at most. To be fair, this is just based on intuition.

hinkley · on Oct 13, 2020

Isn't this a variant/degenerate case of slab allocation?

dragontamer · on Oct 13, 2020

Hmmm... I'm pretty sure that when I hear "slab allocation", I imagine multiple sizes supported.

In the case of "fixed-size" allocations, you support all "smaller sizes" by returning a large size. Ex: malloc(4) returns a 64-byte block. malloc(20) returns a 64-byte block. malloc(64) returns a 64-byte block. malloc(65) causes assert( size < 64) fails, and your program exits.

malloc(65) would probably create a "new slab", supporting 128-byte blocks under a slab allocator. At least, based on how I normally hear the term used. malloc(34) may return a 64-byte block, but malloc(31) may return 32-byte block (supporting true sizes as power-of-2).

-------------

The "slab allocator" discussed in the blogpost seems to be a fixed-size slab allocator though. So the blogpost's terminology doesn't match my own.

malkia · on Oct 13, 2020

For C++, I walked (halfway) the really long road of replacing std:: with std::pmr:: where I can, safely pluging jemalloc - and solve the performance issues we had with the standard heap allocator (Windows). But std::pmr::string now being 8 bytes bigger caused us slowdowns unexpectedly. Also tons of changes, different types, overall not great.

Then discovered mimalloc, which hooks at low-level malloc/free, and at the same time a coworker also did similar job and now I'm no longer in love with "pmr".

I wish it was success story, but it's not. It probably is, if everyone is on board with it. But hardly this was my case. Also libraries are not - Qt is not, and many others.

pmr is great for things like - try to alloc on the stack, and if not enough continue with heap (under the hood), but the extra 8 bytes, and the fact that do_alloc is virtual call is problem (last one maybe not perf enough, but still hard to sell to performance oriented programmers).

I wish it was designed in way where it could've been sold as such.

gpderetta · on Oct 13, 2020

Hum, pmr is for type erasing allocators, i.e. you need multiple allocators and you do not want to instantiate your templates more than once.

If you want to use jemalloc everywere, just create your own (stateless) allocator wrapper and instantiate your string with it.

In fact you probably don't even need to do that. Doesnt jemalloc provide a dll interposer mode that transparently replaces malloc/free?

malkia · on Oct 14, 2020

No it doesn't. It can't replace allocators in DLLS (e.g. it's possible, but the jemalloc implementation was not able to do, unlike say mimalloc, though special precaution had to be taken during linking with mimalloc).

Aardwolf · on Oct 13, 2020

I know custom allocators are important for tiny platforms and so on, but what's a reason C/C++ can't allocate memory in a satisfactory way by default on those platforms? And how do other languages get away with it?

jeffbee · on Oct 13, 2020

C++ doesn't come out of the box with an allocator at all. Implementations have to provide it. But this article isn't talking about the difference between, say, jemalloc and mimalloc. It's talking about cases where you want to minimize calls to global operator new, or cases where you want to make a lot of allocations but you don't want to have to delete anything. The latter is often a massive advantage in speed. For example if you need to use a std::set<int> within a scope, and it doesn't escape that scope, it will be much faster to provide an arena allocator that allocates the nodes used by std::set, both because it will minimize the necessary calls to global new -- it may even eliminate them if you can safely use enough stack space -- and especially because there isn't a corresponding deallocation for every allocation. You simply discard the entire arena at the end of the scope.

Fans of Java may rightly point out that garbage collection also has this property, but GC brings other costs. Nothing in Java even remotely approximates the performance of an STL container backed by an arena allocator.

pdpi · on Oct 13, 2020

"Satisfactory" means different things for different people in different contexts, so a general-purpose malloc might not do as well as you'd like.

The simplest and fastest allocator you could possibly write is something where malloc just increments a pointer into a chunk of memory, and free is a no-op. This is obviously terrible as your system-wide allocator, but you can allocate an arena from the system malloc, then use the simple allocator to chunk out that arena. This is useful for example if you know that a web request has a reasonable upper bound on memory usage. You just allocate that much memory from the system at the start of the request, use the increment allocator for all request-bound objects, then release the arena at the end of the request. One single call to the system's somewhat heavy malloc/free, many calls to a trivially simple malloc and a no-op free.

Other languages don't so much "get away with it" as they're simply not used for workloads where it matters.

jasonzemos · on Oct 13, 2020

There's no single confluence where every application's needs are provided by a single C/C++ implementation's default allocator. I'll just speak to one example off the top of my head: the default glibc malloc() stores an allocation size near or below the actual allocation data itself. In contrast, an alternative like jemalloc tracks the size in a separate control block. When freeing allocations with the former, that memory has to be touched; with the latter, the control block has to be touched instead. Similarly it can lead to less or more efficient packing of aligned data. All of this yields better or worse performance depending on the application.

gpderetta · on Oct 13, 2020

It is very hard to beat a good off the shelf allocator in the general case, but it is easy in specialized cases, simply because you can make tradeoffs and rely on knowledge the allocator doesn't have.

skohan · on Oct 14, 2020

> but what's a reason C/C++ can't allocate memory in a satisfactory way by default on those platforms?

Custom allocators are important when memory performance is relevant. Allocation/deallocation is really slow compared to using memory which is already allocated to a program, so this can often be one of the biggest performance syncs in a program, especially one which churns through a lot of data (e.g. videogames, simulation, ML training etc.)

Custom allocators can also help with memory coherency: you can pack objects you're using together close together in memory, which minimizes the amount of CPU cache misses, which can also be one of the most expensive parts of execution. It may be difficult or impossible for a compiler to design a more optimal memory layout than a programmer with knowlege of how the program will use the data.

> And how do other languages get away with it?

Custom allocators are most relevant in performance-critical contexts, so you're probably already using C++ because a higher-level language was already too slow for your use-case. In other words, other languages pay the same cost for memory management, but if you're programming in Python you're probably working in a domain where the performance difference doesn't matter.

Even programs written in a "fast" GC'd language like GO will have a performance ceiling largely dictated by memory churn.

alfalfasprout · on Oct 13, 2020

It's not just tiny platforms. For example, you often want allocators with some predictable characteristic. Eg; constant time allocation for a given request size. Or you might want an allocator that keeps certain items contiguous so they'll tend to already be in cache. Custom allocators tend to be used when you have a particular requirement that a general purpose allocator isn't optimal for.

For tiny platforms often it's bare-metal so there's only one running "process" and so something like "malloc" that needs to be aware of memory usage across the system isn't relevant. Moreover, embedded systems tend to care about deterministic performance so an allocator specific to the application can be a better match.

59nadir · on Oct 14, 2020

> And how do other languages get away with it?

General purpose allocators are (basically) never the best for speed, fragmentation, and so on. They can at most be good enough. Basic knowledge about how your memory is going to be used that'll inform the use of a few simple custom allocators is likely to give you several times the performance that you'd see with a general purpose allocator. In the best case you'll literally reset an allocation pointer once per loop and just overwrite memory that you know isn't used anymore, for example, making allocation a write into an already existing buffer and "free" the resetting of the pointer. This has nothing to do with platforms, but rather is just basic removal of dumb code that shouldn't be running. There's zero reason a an actual `free`/`delete` should be used in those cases and using it is likely to slow down that loop considerably.

> And how do other languages get away with it?

Much like general purpose allocators are never the best, there is zero evidence to suggest that garbage collection is ever the optimal choice; it can only ever be good enough. There's also a lot of lore around garbage collectors being magic and doing lots of great things for you to ensure less fragmentation and nice cache locality but people are usually just amazed at how it's "not that bad" in the end.

There's no silver bullet: Garbage collectors and general purpose allocators aren't magic pieces of code written by developers who can conjure up the best code for every use case. Like most general code they're "not bad" at everything but not actually very good at anything. Running less code for your allocation and eliminating things based on your knowledge of how things are going to be used is always going to be better.

phkahler · on Oct 13, 2020

>> I know custom allocators are important for tiny platforms and so on...

For tiny platforms like micro controllers used all over your car, standard practice is to never use dynamic memory allocation. When you have 1K or even 64K of RAM and need to run for a while the only way to be sure you never run out of memory is not to use malloc at all. Fragmentation is a thing. This may mean managing fixed collections of objects much like the allocators in the article, but the linker figures out where to put them at compile time.

imtringued · on Oct 14, 2020

>And how do other languages get away with it?

They don't get away with it. They suffer the consequences of bad performance. Of course depending on your metric they are "faster" in some situations that are common to that language but C++ often lets you avoid temporary heap allocations by using the stack which further increases performance.

detaro · on Oct 14, 2020

> And how do other languages get away with it?

They either bring similar mechanisms, bring a selection of their own allocators that covers enough space or just don't allow influencing it much and get away with it because many applications/users do not care.

fsociety · on Oct 13, 2020

You can allocate pools of memory so that you have a fixed memory footprint. That way you have (almost) no risk of OOMing later on in the process.

This can also give you better control over cache locality of memory.

dnautics · on Oct 13, 2020

It's not just useful for tiny platforms. Consider the erlang virtual machine, which has something like 11 custom allocators internally, for different performance characteristics.

mac01021 · on Oct 13, 2020

Which other languages do you have in mind?

favorited · on Oct 13, 2020

Andrei Alexandrescu's 2015 talk about C++'s allocators is great. He's a very entertaining presenter, and he makes case for extremely composable templated allocators (and why std::allocator isn't that).

https://www.youtube.com/watch?v=LIb3L4vKZ7U

chromatin · on Oct 13, 2020

Great talk.

He and others made these principles manifest in Dlang's std.experimental.allocator [0] which offers composability and is really quite nice, allowing for you to use Dlang's GC, malloc, custom malloc, various strategies for metering out the allocated blocks, and any combination thereof.

The documentation can be hard to navigate and it is easy to miss the submodules listed in the tree on the left hand side, so I would especially draw attention to `building blocks` [1] and `mallocator`, `gc_allocator`, etc.

[0] https://dlang.org/phobos/std_experimental_allocator.html [1] https://dlang.org/phobos/std_experimental_allocator_building...

mhh__ · on Oct 14, 2020

Really should be std.allocator now.

I might bring it up on the forum later.

dtornabene · on Oct 14, 2020

if people are interested in this I'd recommend Silvio Cesare's new hackerspace blog. He's been publishing what appears to be a running series for a year and change now on heap attacks, highly illuminating. Several of the posts are attacks on embedded and allocators other than the main one in glibc

https://blog.infosectcbr.com.au/

fizixer · on Oct 13, 2020

Tangential, but I just learned (TIL) that C is platform-independent in the following way, compared to Java:

- C is "write once, compile everywhere, run everywhere"

- Java is "write once, compile once, run everywhere"

jjtheblunt · on Oct 13, 2020

maybe taking the JIT-compiler inside the JVM into account...

- Java is "write once, partially compile once, finish compiling everywhere, run everywhere"

?

saagarjha · on Oct 13, 2020

Not every JVM includes a JIT; the statement you're replying to is fairly accurate.

fizixer · on Oct 13, 2020

I agree (was just oversimplifying).

I'm starting to think the biggest selling point for Java and dotNET platforms is, not platform-independence, but code obfuscation.

In mid-1990s closed source was big, and Java provided a platform to obfuscate the code before shipment, while retaining the advantages of hardware-specific code-gen (final phase of a compiler), to make it harder for the user to copy the ideas.

These days open source is king, no need to obfuscate, and a reason C is making a comeback of sorts.

to11mtm · on Oct 13, 2020

> I'm starting to think the biggest selling point for Java and dotNET platforms is, not platform-independence, but code obfuscation.

I wouldn't agree for that in the case of .NET Framework.

On one hand, the safety of both languages does make it easier for obfuscators to pull certain tricks i.e. swapping out direct calls for Delegates (memory safe pointers for those unfamiliar with .NET terms) or throwing a bunch of indirect method calls, etc etc because in most cases object lifetime is something you don't think about in LOB .NET.

On the other hand, the problem is that almost any IL can be pulled back into a representation that may not be -fully- comprehensible, but again there are tools to even help with that b/c the C# language spec is well defined enough you know what you need to strip that isn't truly needed for a decompile.

Java/C# got big because of (1) hype, (2) somewhat-filled promises of xplat, (3) somewhat filled promises of better productivity because you don't have to think about most object lifetimes.

We're seeing a shift back to Go/C/Rust/Etc because people are realizing as data grows that a lot of their C#/Java code suddenly isn't so great at huge scale when a GC is churning all the time.

Both C# and Java are working on this in their own way of course; Java is doing a lot of work on their GC, .NET is doing a lot of work to make sure that their IO pipelines for things like sockets are better handled; after that it's up to the dev to decide if they want to go struct-happy.

imtringued · on Oct 14, 2020

Java and .NET have mature decompilers. Obfuscators or minifiers replace meaningful symbol names with gibberish. Meanwhile in natively compiled applications that information never existed in the first place unless it was explicitly added into the symbol table and not removed.

If you want to obfuscate your codebase then write your software in Delphi. The native code it outputs will drive any reverse engineer insane even though there was no attempt at obfuscation...