Parallelising Huffman decoding and x86 disassembly by synchronising non-self-synchronising prefix codes

Posted on July 30, 2022 by dougallj

Variable-length non-self-synchronising prefix codes (like x86 instructions and Huffman codes) are hard to decode in parallel, as each word must be decoded to find its length before the next can be found and decoded. I came up with a practical method for finding a synchronisation point in the middle of a run of such prefix codes (provided the code has a maximum length), allowing a stream to be split into parts that can be processed in parallel.

The fact that x86 tends to eventually synchronise is a fairly common observation (e.g. when disassembling incorrect addresses in a debugger). For example, consider the following instructions:

    488B4808    mov rcx,[rax+0x8]
    488B4918    mov rcx,[rcx+0x18]
    486310      movsxd rdx,dword [rax]
    488D3492    lea rsi,[rdx+rdx*4]
    4883C018    add rax,byte +0x18
    837CF12000  cmp dword [rcx+rsi*8+0x20],byte +0x0

If we skip two bytes, we first see some incorrect results, but we quickly get back to real instructions:

    4808488B    o64 or [rax-0x75],cl
    49184863    o64 sbb [r8+0x63],cl
    10488D      adc [rax-0x73],cl
    3492        xor al,0x92
    4883C018    add rax,byte +0x18
    837CF12000  cmp dword [rcx+rsi*8+0x20],byte +0x0

I like to think of decoding at an invalid offset as giving you an arbitrary result with an arbitrary length. Each time there's a chance you'll land on a valid offset, but if you don't, you'll get another arbitrary result and another chance. In real-world code this does work, and this is the premise of the method. However, in general, it is not guaranteed.
For example, the x86 instruction add [rax], al is encoded as 00 00, so if we had a stream of zeroes and we started in the wrong place, we'd never get back to the right instructions.

Method

I assume that the prefix code has a maximum length, n, which is 15 for x86 instructions. The idea is to start decoding at n sequential offsets. One of these decoders is guaranteed to be decoding at a valid offset (as invalid offsets are those within real words, so there can be at most n-1 contiguous invalid offsets). Each of these decoders will visit a different sequence of offsets in the message, but they'll hopefully all converge with the valid decoder. The synchronisation point is the smallest offset that every decoder can reach.

With a few tricks, the implementation ends up looking quite different. To avoid reading more of the message than necessary, we repeatedly find the decoder with the lowest offset and advance it. To avoid redundant decoding, if it lands on the same offset as another decoder, we can discard it, and we can repeat this process until only one decoder remains. These decoders will always be within n positions of each other, so each can be represented by a single set bit in an n-bit buffer.

The code below is simplified - in practice it would need bounds-checking, and an iteration limit to bail out on hard- or impossible-to-synchronise streams without hurting performance.
    size_t find_sync_after(uint8_t *data, size_t offset) {
        assert(2 <= MAX_CODE_LENGTH && MAX_CODE_LENGTH <= 63);
        uint64_t probes = (1ull << MAX_CODE_LENGTH) - 1;
        while (probes != 1) {
            uint64_t size = decode_length(data, offset);
            assert(1 <= size && size <= MAX_CODE_LENGTH);
            offset += 1;
            probes >>= 1;
            probes |= 1ull << (size - 1);
            offset += countTrailingZeros(probes);
            probes >>= countTrailingZeros(probes);
        }
        return offset;
    }

(I also came up with another method to find the first synchronisation point after a given offset p, by decoding backwards through the stream, maintaining a ring-buffer of n target locations (the positions after p where a decode starting from each of n contiguous locations would land), and continuing until all entries in the ring-buffer converge to the same value, which is the result. This is more complicated and requires more decoding, but I thought it deserved a mention, as it's interesting to think about, and it could be a drop-in replacement for an operation based on self-synchronising codes.)

Applications

This can be used to split the executable section of a large x86 binary into parts for multiple threads to work on, guaranteeing that each will see the same instructions as a single thread performing linear disassembly.

Huffman coding is generally latency bound, and decoding multiple streams at once makes it possible to take advantage of instruction-level parallelism (see Fabian Giesen's "Reading bits in far too many ways" series), but typically this is only practical in formats designed to have multiple independent streams. From Fabian's work, it's clear that this method could be applied to extract parallelism from large blocks of Huffman-coded data, especially where the size is known ahead of time. I have experimented with using this technique for Huffman-based DEFLATE (zlib/gzip) decoding, which unfortunately does not satisfy these criteria.
A DEFLATE stream typically consists of multiple blocks, each using a different Huffman encoding, where the block size isn't specified ahead of time. However, if we can guess an offset around half-way through the block and find a synchronisation point, we can exploit instruction-level parallelism to decode both halves of the block at the same time. This introduces a ton of complexity: the second half is decoded into temporary memory, and only after the first half completes can it be copied into place and its LZ77 references resolved. To make things worse, either the first or the second half may finish first, and the second half may have overshot the end of the current block and need to be discarded, requiring careful error handling. You'd also need some heuristic to guess how big the blocks are, perhaps assuming block sizes tend to be consistent (though currently I just jump to a hardcoded offset).

My experiments are preliminary and extremely dubious, using a single file and hand-tuned constants, but showed a roughly 25% speedup on an Apple M1. I'd like to say this shows that latency barriers in decoding existing file formats can be broken, but the result relies on the nature of the input data, and much more thorough measurement is needed to see whether such a method can improve performance in general, let alone be beneficial enough to justify the complexity.