# Balm in GILead

*Fast string construction for CPython extensions. 14 Dec 2023.*

> Speed is found in the minds of people
>
> -- Andrei Alexandrescu

Python isn't slow. The core eval loop, while slower than a JIT, is no slouch when it comes to dispatching bytecode. There is no reason that business logic written in Python, orchestrating the operation of highly optimized extension libraries, should be a bottleneck.

> **Time is precious, get to the point:** Code and benchmarks are here, fair winds and following seas. Also, hey, if you're real busy because you have a job and you're hiring, I also would like a job. Resume is here.

Well, there are two reasons. The first is the infamous Global Interpreter Lock, the GIL, which forces serialization onto operations that might otherwise have been executed in parallel by extensions. This is widely recognized and improving; the future appears GIL-less, or at least GIL-optional. Yet the second reason I rarely see discussed outside of hardcore extension dev circles: building PyObjects for blessed types like strings and dictionaries is damn expensive. Every call to PyUnicode_New() makes me quiver with fear, no matter how optimized CPython's small object allocator is. Worse yet, as long as the GIL still walks among us, one must hold the GIL to create objects of the blessed types.

The common solution is to abandon the world of blessed types. No PyLists, no PyUnicodes, not even a measly PyLong. Building your own types makes you strong, like ox, or Mark Shannon. Once a type exists outside the blessed Python type system you are free to create your own allocators and deallocators for it, and you may optimize its operations as you see fit. Of course, you now must do a great deal of work to ensure that TimmysDict interops with all the other Python types, and god forbid it ever encounters a PetersDict. You're also banned by international treaty (and several local state laws) from ever interacting with Python standards like WSGI, which dictate that only pure-bred Python types may be used.

## GIL Balm'ing

The uncommon solution to this problem is something my ~~dumb~~ ~~challenged~~ creative friends and I call GIL Balm'ing (the name, of course, is derived from a sacred text of the Italian American native faith: raised Catholic). In a sentence, GIL Balm'ing is the act of intrusively adapting a native CPython type to be used outside the context of the GIL. In the simplest case, this means replacing the tp_new, tp_free, and tp_dealloc of the type you are Balm'ing with something more suitable (often NULL).

Every Python type is a little different and no two applications are the same, so the exact steps to get this working depend on the use case. The Python str type serves as a useful example because it covers two major use cases that lots of applications benefit from (the GIL-held baseline both are trying to beat is sketched just after this list):

* Converting string views into strs without performing a copy
* Pooling compact str objects for use without grabbing the GIL or allocating
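To make the cost concrete, here is the conventional path: take the GIL, have CPython allocate a fresh str, and copy the bytes into it. This is my sketch, not code from the post; `buf` and `len` are illustrative stand-ins.

```c
/* Conventional, GIL-held construction: one lock acquisition, one
 * allocation, and one copy per string. This is the baseline that
 * both Balm use cases try to avoid. */
PyGILState_STATE gstate = PyGILState_Ensure();
PyObject* s = PyUnicode_FromStringAndSize(buf, (Py_ssize_t) len);
PyGILState_Release(gstate);
```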
## Caveats & Complications

A natural question might be: if this is so good for string views and compact strings, why not also use it for really big strings where an allocation is going to be necessary? At least we'll get to avoid the GIL, right? Ya, sure, but the entire point of doing something like this is to avoid the allocation at all. If you're going to take the hit on the allocation anyway, doing one more operation while holding the GIL (to allocate an actual string) is not going to matter that much.

And as we'll get to later, GIL Balm'ing is not completely transparent; it involves deciding which parts of the CPython contract you're going to honor. If the performance gain is negligible, you should use the blessed CPython type, not a Balm'd type.

Python str is also useful because its structure is amenable to various Balm'ing strategies.

## The Measure of a String

In order to Balm the str, we must first understand the str. The CPython source code is extensively commented, and from it we learn there are three string structures:

* Compact ASCII (PyASCIIObject): Promises to store only ASCII bytes, and stores the data immediately following the object in memory
* Compact (PyCompactUnicodeObject): Similar to the above, but for non-ASCII data
* Legacy String (PyUnicodeObject): Stores data in a separate buffer pointed to by the object; may or may not be ASCII

This brings us to the second point about GIL Balm'ing: it is a general technique, not a specific library or code recommendation. Python strings are flexible structures that can be adapted to many applications; what you choose to do with them is up to you. Will you support Unicode or only ASCII? Do you want to take advantage of the compact string representations or just plug pointers into legacy strings?

Now that we know the string, we can adapt it. For this demonstration we'll be targeting the string view and pooling uses described above (with a fallback for strings that don't fit inside our compact representation), but only for ASCII strings. Figure 1 demonstrates the structure we'll be exploring.

**Figure 1**: *Assume there are headers for anything that requires one*

```c
typedef struct {
    size_t ref_count;
    char data[];
} RefCountedData;

typedef union {
    PyASCIIObject ascii;
    struct {
        // Force alignment on data[] to be correct
        struct {
            PyObject_HEAD;
            Py_ssize_t length;
            Py_hash_t hash;
            struct {
                unsigned int interned : 2;
                unsigned int kind : 3;
                unsigned int compact : 1;
                unsigned int ascii : 1;
                unsigned int statically_allocated : 1;
                unsigned int : 6;
                // Balm state data, lives in unused padding bytes
                unsigned int balm : 2;
                unsigned int balm_offset : 16;
            } state;
        };
        char data[];
    };
    struct {
        PyUnicodeObject uc;
        RefCountedData* rd;
    };
} BalmString;
```

The first thing to note is that creating a new C type is not a necessary part of GIL Balm'ing; often we can re-use the structures from CPython directly. Here we do so in order to have one unified string type for both of our use cases, string views and compact strings. The second is that we're replicating the internals of the PyASCIIObject inside an anonymous structure. Sometimes this sort of replication is necessary, either because CPython doesn't expose the necessary structures (looking at you, dict), or because we want to do a little smuggling.

## Stop Right There, Criminal Scum!

*Union type punning? Storing things in unused padding? Replicating private internal CPython data structures? This is a recipe for disaster!*

I mean, ya. None of this stuff is safe. There's a reason you won't find this in the official Python documentation. We're leaps and bounds outside the official API and associated safety guarantees. Do what you want 'cause a pirate is free.

For BalmString, we want to store some state data inside the structure, and PyASCIIObject very conveniently has a bunch of unused padding bits. It's a simple thing to use some of those for our state data.
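Since we're punning against a private layout, it's worth pinning those assumptions down at compile time. These tripwires are my addition, not the post's code, and they assume your compiler accepts offsetof on anonymous-struct members (GCC and Clang do):

```c
#include <assert.h>  /* static_assert (C11) */
#include <stddef.h>  /* offsetof */

/* If CPython ever rearranges PyASCIIObject, fail the build instead
 * of silently corrupting string headers at runtime. */
static_assert(offsetof(BalmString, state) == offsetof(PyASCIIObject, state),
              "replicated state bits must overlay the real ones");
static_assert(offsetof(BalmString, data) == sizeof(PyASCIIObject),
              "compact data[] must begin exactly where CPython expects");
```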
## The Author Hasn't Read Knuth

There is probably a very efficient way to collect and dispatch a group of pre-allocated objects. If you know of one, please implement it and use it and stuff, that sounds great. We're going to use this list-stack thing in Figure 2.

**Figure 2**

```c
typedef struct BalmStringNode {
    struct BalmStringNode* next;
    BalmString str;
} BalmStringNode;

#define GET_STRUCT_FROM_FIELD(pointer, s_type, field) \
    ((s_type*) (((char*) pointer) - offsetof(s_type, field)))

#define GET_NODE_PYOBJ(pointer) \
    GET_STRUCT_FROM_FIELD(pointer, BalmStringNode, str)

typedef struct {
    mtx_t lock;
    BalmStringNode* head;
} StrPool;

static struct {
    StrPool bigviews;
    StrPool compacts;
} pools;

static void balmstr_push(StrPool* pool, PyObject* str) {
    BalmStringNode* node = GET_NODE_PYOBJ(str);
    mtx_lock(&pool->lock);
    node->next = pool->head;
    pool->head = node;
    mtx_unlock(&pool->lock);
}

static BalmString* balmstr_pop(
    StrPool* pool, BalmStringNode* (*alloc)(size_t), size_t alloc_len
) {
    mtx_lock(&pool->lock);
    if(!pool->head) pool->head = alloc(alloc_len);
    BalmStringNode* node = pool->head;
    pool->head = node->next;
    mtx_unlock(&pool->lock);
    return &node->str;
}
```

I actually have no formal education on data structures or algorithms at all. It never slows me down, except when I need to know what to call these things. This feels like a singly-linked list, but we provide no capacity for random access, extraction, etc. You can only push and pop from it, and that seems like a stack. But there's no stack pointer or storage array, so a queue? But it's not FIFO. A stack-implemented-using-a-singly-linked-list-thingy.

I don't think this code needs too much commentary for those who can read C. For those who can't, thanks for tagging along, we're happy to have you. We're creating a list or a queue or something, whatever: a pool of pre-allocated BalmStrings. We can push to and pop from the pool instead of having to allocate and deallocate objects from the memory manager. If the pool is empty, we can allocate more BalmStrings using whatever allocation strategy the user wants.

The only thing of note is the layout of the BalmStringNode: the next pointer comes before the BalmString itself. This makes things a little more complicated than they need to be, because we have to do some pointer arithmetic to recover a BalmStringNode from a given BalmString instead of casting between them directly. However, recall the layout of a compact Python string: the data immediately follows the object in memory. If we placed the next pointer after the str, it would be considered part of the string. That's nonsense, so we make do with some macros to recover the BalmStringNode.
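To make that pointer arithmetic concrete, here's a quick round trip. This is my illustration, not the post's code; it assumes a surrounding function, the compact allocator from Figure 3 below, and an arbitrary refill size of 64.

```c
#include <assert.h>

/* Pop a string, recover the node that owns it, hand it back. The
 * offsetof math in GET_NODE_PYOBJ undoes the next-pointer offset. */
BalmString* str = balmstr_pop(&pools.compacts, compactbalmstr_block_alloc, 64);
BalmStringNode* node = GET_NODE_PYOBJ(str);
assert(&node->str == str);
balmstr_push(&pools.compacts, (PyObject*) str);  /* union's first member starts with a PyObject */
```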
More interesting are the block allocators in Figure 3.

**Figure 3**

```c
enum BalmStringType {
    BALM_STRING_VIEW,
    BALM_STRING_COMPACT,
    BALM_STRING_BIG,
};

BalmStringNode* balmstr_block_alloc(size_t len) {
    BalmStringNode* base = malloc(sizeof(*base) * len);
    BalmStringNode* prev = NULL;
    for(BalmStringNode* v = base; v < base + len; ++v) {
        v->next = prev;
        prev = v;
        v->str = (BalmString) {
            .ob_base.ob_type = &BalmString_Type,
            .state = {
                .kind = PyUnicode_1BYTE_KIND,
                .ascii = 1,
            },
        };
    }
    return prev;
}

// Powers of two are cool
#define BALM_COMPACT_ALLOCATION 128
#define BALM_COMPACT_MAX_STR \
    (BALM_COMPACT_ALLOCATION - sizeof(PyASCIIObject))

BalmStringNode* compactbalmstr_block_alloc(size_t len) {
    size_t sz = BALM_COMPACT_ALLOCATION * len;
    BalmStringNode* base = malloc(sz);
    char* cur = (char*) base;
    char* end = cur + sz;
    BalmStringNode* prev = NULL;
    for(
        BalmStringNode* v = base;
        cur < end;
        v = (BalmStringNode*) (cur += BALM_COMPACT_ALLOCATION)
    ) {
        v->next = prev;
        prev = v;
        v->str = (BalmString) {
            .ob_base.ob_type = &BalmStringCompact_Type,
            .state = {
                .kind = PyUnicode_1BYTE_KIND,
                .compact = 1,
                .ascii = 1,
                .balm = BALM_STRING_COMPACT,
            },
        };
    }
    return prev;
}
```

Both balmstr_block_alloc and compactbalmstr_block_alloc allocate len objects of their respective types, then iterate over the allocation, linking the objects together. What's interesting is the initialization we can do here. For the regular balmstr_block_alloc we can only set up the parts of the PyASCIIObject state that we know will always be true: its ob_type will be BalmString_Type, it will only contain ASCII strings, and the underlying data will occupy one byte per code unit. For compactbalmstr_block_alloc we have more information, so we can also initialize the balm state data that we smuggled into the padding bits. This tells us, given a random BalmString, the nature of the underlying data.

## Your Type is My Type

The entire purpose of GIL Balm'ing is to avoid the kosher practice of hand-crafting new Python types. That means we need to steal a pre-existing type and replace its allocators with our own. Figure 4 demonstrates exactly that.

**Figure 4**

```c
static PyTypeObject BalmString_Type;
static PyTypeObject BalmStringCompact_Type;

static void balmstr_dealloc(PyObject* str) {
    BalmString* balm = (BalmString*) str;
    RefCountedData* rd = balm->rd;
    if(!(--rd->ref_count)) free(rd);
    balmstr_push(&pools.bigviews, str);
}

static void balmstrcompact_dealloc(PyObject* str) {
    balmstr_push(&pools.compacts, str);
}

void balm_init() {
    BalmString_Type = PyUnicode_Type;
    BalmString_Type.tp_new = NULL;
    BalmString_Type.tp_free = NULL;
    BalmString_Type.tp_dealloc = balmstr_dealloc;

    BalmStringCompact_Type = BalmString_Type;
    BalmStringCompact_Type.tp_dealloc = balmstrcompact_dealloc;

    init_strpool(&pools.bigviews, balmstr_block_alloc);
    init_strpool(&pools.compacts, compactbalmstr_block_alloc);
}
```

In balm_init (which will be called from the module initialization method of whatever this gets embedded in) we copy PyUnicode_Type and replace its tp_new with NULL; it won't be valid to create objects of this type from within Python. Similarly, we replace tp_dealloc with the appropriate deallocation function, so "freed" strings return to their pool rather than the allocator. We also replace tp_free; I'm not entirely certain this is necessary, but we're already playing with fire here, so best not to tempt the gods.
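Note that balm_init also calls an init_strpool helper the post never shows. A minimal sketch consistent with the Figure 2 structures might look like the following; BALM_STRING_ALLOCATION_BLOCK_SIZE is the same (assumed) block-size constant Figure 5 passes to balmstr_pop.

```c
/* Hypothetical init_strpool, not shown in the post: create the C11
 * mutex and pre-fill the pool with one block from the allocator. */
static void init_strpool(StrPool* pool, BalmStringNode* (*alloc)(size_t)) {
    mtx_init(&pool->lock, mtx_plain);
    pool->head = alloc(BALM_STRING_ALLOCATION_BLOCK_SIZE);
}
```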
## Confessions of a Type Thief

These types are forgeries, but pretty good ones. Given we have a function called make_balm() which hands us a BalmString:

```
>>> type(make_balm())
<class 'str'>
>>> isinstance(make_balm(), str)
True
```

Pretty good, right? Enough to get through customs at least. Beware, however: they won't stand up under closer inspection from investigators who know what they're looking for:

```
>>> type(make_balm()) == type(str())
False
>>> id(type(make_balm()))
140086316104800
>>> id(str)
140086333381056
```

Additionally, we've violated several oaths of a PyUnicode object. We make no guarantees, for example, that our BalmStrings will be null-terminated. This could be problematic if they're consumed by another C extension.

To complete the ruse, we need to ensure an in-use BalmString carries the correct type and metadata with it. For that we have the functions in Figure 5.

**Figure 5**

```c
static void string_view(
    BalmString* str, RefCountedData* rd, char* data, size_t len
) {
    str->rd = rd;
    str->uc.data.any = data;
    str->uc._base.utf8 = data;
    str->uc._base.utf8_length = len;
    str->length = len;
}

Py_LOCAL_SYMBOL BalmString* New_BalmString(size_t len) {
    if(len <= BALM_COMPACT_MAX_STR) {
        BalmString* str = balmstr_pop(
            &pools.compacts,
            compactbalmstr_block_alloc,
            BALM_STRING_ALLOCATION_BLOCK_SIZE
        );
        str->length = len;
        return str;
    }

    RefCountedData* rd = malloc(sizeof(*rd) + len);
    rd->ref_count = 1;
    BalmString* str = balmstr_pop(
        &pools.bigviews,
        balmstr_block_alloc,
        BALM_STRING_ALLOCATION_BLOCK_SIZE
    );
    string_view(str, rd, rd->data, len);
    str->state.balm = BALM_STRING_BIG;
    return str;
}

Py_LOCAL_SYMBOL BalmString* New_BalmStringView(
    RefCountedData* rd, char* data, size_t len
) {
    BalmString* str = balmstr_pop(
        &pools.bigviews,
        balmstr_block_alloc,
        BALM_STRING_ALLOCATION_BLOCK_SIZE
    );
    string_view(str, rd, data, len);
    str->state.balm = BALM_STRING_VIEW;
    return str;
}
```

Again, nothing revolutionary here for people who can read C. We've created two functions, New_BalmString and New_BalmStringView, that pop a BalmString object from the appropriate pool depending on what type of string the user wants and how big it is. For compact strings, the only work we need to do is set up the length field; everything else was taken care of in the block allocator. For string views and big strings, we need to fill in the balm state data and populate the data, length, and utf8 fields appropriately.

Additionally, for this particular Balm'ing, we're using a reference count to keep track of when to free the underlying string data for the non-compact string types. This might not be appropriate for your application, especially for the string views. That's another reason why Balm'ing is a technique, not a library.
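That reference count carries a contract worth spelling out: balmstr_dealloc (Figure 4) decrements the backing buffer's count once per dying view and frees it at zero, so whoever mints views must set the count to match. A sketch of my own, with illustrative names (`text`, `text_len`, `n_lines`, `off`, `len`, `out`), assuming a surrounding function:

```c
/* Several views sharing one heap buffer: mint one reference per view
 * up front, so the buffer is freed exactly when the last view dies. */
RefCountedData* rd = malloc(sizeof(*rd) + text_len);
memcpy(rd->data, text, text_len);
rd->ref_count = n_lines;  /* one reference per view */
for (size_t i = 0; i < n_lines; ++i)
    out[i] = New_BalmStringView(rd, rd->data + off[i], len[i]);
```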
## Call Me Ishmael

What is an optimization? A miserable little pile of code. Unless it is proven, undeniably: trial by benchmark. Let the gods, Turing, Moore, Hennessy, and Patterson, decide which is the best approach.

For this trial I have divided Herman Melville's Moby Dick somewhat arbitrarily into 7515 lines. I have attached a binary header to the file which describes the offset and length of each line. The goal is to construct a string for each line and deliver it as a 7515-length tuple to the Python interpreter. Then repeat that exact process another 9999 times and we're done. In other words: construct roughly 7.5 million Python strings as fast as you can.

**Benchmark #1**: *[chart: uni_vs_bal]* (Benchmark repo on GitHub)

And we did it, case closed, 2x speed up with this one weird trick. Except, BalmStrings don't use a great layout for constructing millions of strings. If we allocated them in blocks we could go even faster...

**Benchmark #2**: *[chart: uni_vs_bal]*

And we did it, case closed, 5x speed up with this one weird trick. This benchmark is memory bound, so it won't benefit from threading; we would just be measuring the overhead of locking and unlocking the GIL. But hey, it might be fun. Let's split the 10k runs across 16 threads, 625 runs each.

**Benchmark #3**: *[chart: uni_vs_bal. Note that the scale has gained an order of magnitude]*

Disaster for our BalmString implementation. The "naive" UnicodeString baseline is fully serial (it never releases the GIL to begin with), so what we're measuring there is pure locking and contention overhead. BalmString becomes effectively serialized by the act of popping string views from the pool, and the overhead of the highly granular locking strategy is a crushing blow. BalmBlock, which allocates all its strings as a block up front and thus needs almost no locks at all, barely notices the GIL overhead. It is entirely memory bound, with ~100ms of variance from the OS scheduler. For highly parallel code, you could argue this is a 20x speed up.

## Final Thoughts and Applications

GIL Balm'ing isn't for everyone. It maybe shouldn't exist at all, and perhaps I am the fool for even suggesting it. Certainly, it is a fragile, precarious strategy for squeezing speed out of CPython, but it benefits from maintaining a very high level of "natural" compatibility with native Python code.

### Not Just Strings

While the focus here is on strings, in the benchmark repo you will also find a Balm'd tuple. This strategy is easily applied to any immutable Python type, and less easily applied to mutable ones. The problem with mutable types is that they love-love-love to allocate memory. For example, when Balm'ing a list, one must decide what to do with the append, extend, and insert methods. Each of these might resize the list, invoking the Python allocator, which means they must be replaced in a Balm'd type.

The codebases where I personally apply strategies like this are low-latency libraries and services. Application servers like gunicorn can add a full millisecond of latency to requests, when in principle they should be completely transparent. Some C extensions, notably FastWSGI, demonstrate that the act of packaging and forwarding HTTP requests to CPython should take only microseconds. But FastWSGI never releases the GIL, because it needs to construct Python objects; it cannot run the application in parallel with processing the HTTP headers for the next request. GIL Balm'ing then becomes an act of desperation, a last resort for when one is unwilling to sacrifice speed or convenience.

*Vito Gamberini, Deep Sea Noodle*