Parallelising Huffman decoding and x86 disassembly by synchronising non-self-synchronising prefix codes

Posted on July 30, 2022 by dougallj

Variable-length non-self-synchronising prefix codes (like x86 instructions and Huffman codes) are hard to decode in parallel, as each word must be decoded to find its length before the next can be found and decoded. I came up with a practical method for finding a synchronisation point in the middle of a run of such prefix codes (provided the code has a maximum length), allowing a stream to be split into parts that can be processed in parallel.

The fact that x86 tends to eventually synchronise is a fairly common observation (e.g. when disassembling incorrect addresses in a debugger). For example, consider the following instructions:

    488B4808    mov rcx,[rax+0x8]
    488B4918    mov rcx,[rcx+0x18]
    486310      movsxd rdx,dword [rax]
    488D3492    lea rsi,[rdx+rdx*4]
    4883C018    add rax,byte +0x18
    837CF12000  cmp dword [rcx+rsi*8+0x20],byte +0x0

If we skip two bytes, we first see some incorrect results, but we quickly get back to real instructions:

    4808488B    o64 or [rax-0x75],cl
    49184863    o64 sbb [r8+0x63],cl
    10488D      adc [rax-0x73],cl
    3492        xor al,0x92
    4883C018    add rax,byte +0x18
    837CF12000  cmp dword [rcx+rsi*8+0x20],byte +0x0

I like to think of decoding at an invalid offset as giving you an arbitrary result with an arbitrary length. Each time there's a chance you'll land on a valid offset, but if you don't, you'll get another arbitrary result and another chance. In real-world code this does work, and this is the premise of the method. However, in general, it is not guaranteed.
For example, the x86 instruction add [rax], al is encoded as 00 00, so if we had a stream of zeroes and we started in the wrong place, we'd never get back to the right instructions.

Method

I assume that the prefix code has a maximum length, n, which is 15 for x86 instructions. The idea is to start decoding at n sequential offsets. One of these decoders is guaranteed to be decoding at a valid offset (as invalid offsets are those within real words, so there can be at most n-1 contiguous invalid offsets). Each of these decoders will visit a different sequence of offsets in the message, but they'll hopefully all converge with the valid decoder. The synchronisation point is the smallest offset that every decoder can reach.

With a few tricks, the implementation ends up looking quite different. To avoid reading more of the message than necessary, we repeatedly find the decoder with the lowest offset and advance it. To avoid redundant decoding, if it lands on the same offset as another decoder, we can discard it, and we can repeat this process until only one decoder remains. These decoders will always be within n positions of each other, so each can be represented by a single set bit in an n-bit buffer.

The code below is simplified - in practice it would need bounds-checking, and an iteration limit to bail out on hard- or impossible-to-synchronise streams without hurting performance.
    size_t find_sync_after(uint8_t *data, size_t offset) {
        assert(2 <= MAX_CODE_LENGTH && MAX_CODE_LENGTH <= 63);
        uint64_t probes = (1ull << MAX_CODE_LENGTH) - 1;
        while (probes != 1) {
            uint64_t size = decode_length(data, offset);
            assert(1 <= size && size <= MAX_CODE_LENGTH);
            offset += 1;
            probes >>= 1;
            probes |= 1ull << (size - 1);
            offset += countTrailingZeros(probes);
            probes >>= countTrailingZeros(probes);
        }
        return offset;
    }

(I also came up with another method to find the first synchronisation point after a given offset p, by decoding backwards through the stream, maintaining a ring-buffer of n target locations (the positions after p where a decode starting from each of n contiguous locations would land), and continuing until all entries in the ring-buffer converge to the same value, which is the result. This is more complicated and requires more decoding, but I thought it deserved a mention, as it's interesting to think about, and it could be a drop-in replacement for an operation based on self-synchronising codes.)

Applications

This can be used to split the executable section of a large x86 binary into parts for multiple threads to work on, guaranteeing that each will see the same instructions as a single thread performing linear disassembly.

Huffman coding is generally latency bound, and decoding multiple streams at once makes it possible to take advantage of instruction-level parallelism (see Fabian Giesen's "Reading bits in far too many ways" series), but typically this is only practical in formats designed to have multiple independent streams. From Fabian's work, it's clear that this method could be applied to extract parallelism from large blocks of Huffman-coded data, especially where the size is known ahead of time. I have experimented with using this technique for Huffman-based DEFLATE (zlib/gzip) decoding, which unfortunately does not satisfy these criteria.
A DEFLATE stream typically consists of multiple blocks, each using a different Huffman encoding, where the block size isn't specified ahead of time. However, if we can guess an offset around half-way through the block and find a synchronisation point, we can exploit instruction-level parallelism to decode both halves of the block at the same time. This introduces a ton of complexity: the second half is decoded into temporary memory, and only after the first half completes can it be copied into place and its LZ77 references resolved. To make things worse, either the first or the second half may finish first, and the second half may have overshot the end of the current block and need to be discarded, requiring careful error handling. You'd also need some heuristic to guess how big the blocks are, perhaps assuming block sizes tend to be consistent (though currently I just jump to a hardcoded offset).

My experiments are preliminary and extremely dubious, using a single file and hand-tuned constants, but showed a roughly 25% speedup on an Apple M1. I'd like to say this shows that latency barriers in decoding existing file formats can be broken, but the result relies on the nature of the input data, and much more thorough measurement is needed to see whether such a method can improve performance in general, let alone be beneficial enough to justify the complexity.