https://mattnakama.com/blog/go-branchless-coding/

  * about
  * blog
  * rss

Branchless Coding in Go

Apr 23 2021

So, I was recently inspired by a coworker's design to pack what would
have been JSON (JavaScript Object Notation) data into a bunch of bit
fields in an array of 64-bit elements. Binary arithmetic,
bit-packing, and high performance are fun for me, so I thought of
some ways to improve it, for learning purposes. Then I had an idea:
maybe I can unpack/pack bits without using conditional branching (if
statements).

Why does this matter?#

Performance and security. Modern CPUs will often do branch prediction
on reaching a conditional jump, which is essentially a guess. If the
CPU guesses wrong, it needs to unwind its changes then execute the
other branch of the if statement. A wrong guess is both a performance
hit and a potential information leak. Attackers running on the same
machine can time how long an operation takes to infer the CPU's
guess, such as with Spectre. For more advanced operations, such as
cryptography and server-side APIs, timing attacks can reveal
priviledged information.

The less branches we have in our code, the faster the CPU can execute
it. There are less guesses to get wrong =)

Simple unpack#

Unpacking a packed binary format usually requires a bunch of bitwise
operations. Assuming that we have an 8-bit binary structure that
holds 8 boolean config options, Here's how we might unpack it:

tests := make([]bool, 8)

...

var input uint8 = uint8(*num)

// set each boolean to a bit in the input
if input&(1<<0) != 0 {
        tests[0] = true
}
if input&(1<<1) != 0 {
        tests[1] = true
}
if input&(1<<2) != 0 {
        tests[2] = true
}
if input&(1<<3) != 0 {
        tests[3] = true
}
if input&(1<<4) != 0 {
        tests[4] = true
}
if input&(1<<5) != 0 {
        tests[5] = true
}
if input&(1<<6) != 0 {
        tests[6] = true
}
if input&(1<<7) != 0 {
        tests[7] = true
}

...

 Full source code

I shortened it a bit for readability. Essentially, we check each
input bit, one at a time, and store the result to a bool. Now, let's
look at the assembly output:

go build -o /tmp/binextract-if -gcflags='-S' binextract-if.go

Comments added for clarity:

404 b.go:19 MOVQ        86+96(SP), AX ; AX = input
409 b.go:19 NOP                   ;
416 b.go:19 TESTB       $1, AL        ; flag = AL & 0b0001
418 b.go:19 JEQ 862               ; if flag { jmp }
424 b.go:20 MOVQ        85+112(SP),CX ; CX = tests[]
429 b.go:20 MOVB        $1, (CX)      ; tests[0] = true
432 b.go:22 TESTB       $2, AL        ; flag = AL & 0b0010
434 b.go:22 JEQ 440               ; if flag { jmp }
436 b.go:23 MOVB        $1, 1(CX)     ; tests[1] = true
440 b.go:25 TESTB       $4, AL
442 b.go:25 JEQ 448
444 b.go:26 MOVB        $1, 2(CX)
448 b.go:28 TESTB       $8, AL
450 b.go:28 JEQ 456
452 b.go:29 MOVB        $1, 3(CX)
456 b.go:31 TESTB       $16, AL
458 b.go:31 JEQ 464
460 b.go:32 MOVB        $1, 4(CX)
464 b.go:34 TESTB       $32, AL
466 b.go:34 JEQ 472
468 b.go:35 MOVB        $1, 5(CX)
472 b.go:37 TESTB       $64, AL
474 b.go:37 JEQ 480
476 b.go:38 MOVB        $1, 6(CX)
480 b.go:40 TESTB       $-128, AL
482 b.go:40 JEQ 488
484 b.go:41 MOVB        $1, 7(CX)
488 b.go:44 MOVQ        CX, (SP)
...
862 b.go:23     MOVQ    85+112(SP),CX ; same MOVQ as location 418
867 b.go:19     JMP     432               ; (appendix A for explanation)

cheat sheet:

MOV  move value on left to location on right
NOP  no operation (do nothing)
TEST do a boolean AND comparison between left and right
JEQ  if last operation resulted in zero, jump to location
$    constant. $1==0b0001, $2==0b0010, $4==0b0100, etc.
AX   cpu register; used for comparisons here
AL   lower 8 bits of AX
CX   cpu register; stores address of the output array

 Full assembly dump

That's a lot of conditional branching. Where there's conditional
branching, there's CPU branch prediction, and where there's branch
prediction, there's performance loss and potential vectors for timing
attacks like spectre. For occasionally reading a config file, this is
negligible. If this is hot code that runs frequently, with sensitive
information, perhaps we should remove the branches.

Branchless unpack#

Can we unpack these boolean values without conditional jumps? Yep! We
can even do it in Go, and without a ternary operator =)

tests := make([]bool, 8)

...

var input uint8 = uint8(*num)

...

// set each boolean to a bit in the input
tests[0] = (input & (1 << 0)) != 0
tests[1] = (input & (1 << 1)) != 0
tests[2] = (input & (1 << 2)) != 0
tests[3] = (input & (1 << 3)) != 0
tests[4] = (input & (1 << 4)) != 0
tests[5] = (input & (1 << 5)) != 0
tests[6] = (input & (1 << 6)) != 0
tests[7] = (input & (1 << 7)) != 0

 Full "no if" source code

So, this is nearly the same thing, except we convert the bit fields
into booleans instead of using if statements. C would let us
automatically convert ints to booleans, but Go doesn't allow
converting booleans to/from anything. However, any expression that
you can use inside an if condition is also usable as a value for a
boolean variable ;-)

Assembly output:

404 n.go:19 MOVQ        86+96(SP), AX ; AX = input
409 n.go:19 TESTB       $1, AL        ; flag = AL & 0b0001
411 n.go:19 MOVQ        85+112(SP),CX ; CX = tests[]
416 n.go:19 SETNE       (CX)          ; tests[0] = flag
419 n.go:20 TESTB       $2, AL        ; flag = AL & 0b0010
421 n.go:20 SETNE       1(CX)         ; tests[1] = flag
425 n.go:21 TESTB       $4, AL
427 n.go:21 SETNE       2(CX)
431 n.go:22 TESTB       $8, AL
433 n.go:22 SETNE       3(CX)
437 n.go:23 TESTB       $16, AL
439 n.go:23 SETNE       4(CX)
443 n.go:24 TESTB       $32, AL
445 n.go:24 SETNE       5(CX)
449 n.go:25 TESTB       $64, AL
451 n.go:25 SETNE       6(CX)
455 n.go:26 TESTB       $-128, AL
457 n.go:26 SETNE       7(CX)
461 n.go:28 MOVQ        CX, (SP)


SETNE sets target to true (0x01) if previous test is non-zero

 Full "no if" assembly dump

Check it out, no branching! And with no branching, there's no branch
prediction - at least not in this code =). Essentially, Go takes
advantage of the SETNE x86 instruction to conditionally set the
boolean variable instead of using conditional jumps.

Branchless pack#

Now that we have a binary unpacker that runs in constant time, how do
we write the reverse? Go does not allow us to directly convert binary
values into bits. We can't just bitshift them as integers like we
would in C:

#include <stdio.h>
#include <stdbool.h> // C99 bool support

int main() {
        bool a=true;
        bool b=true;
        bool c=false;
        bool d=true;

        printf("%x %x %x %x\n", d, c, b, a);

        // only one line =D
        unsigned int out = a | (b<<1) | (c<<2) | (d<<3);

        printf("result: 0x%x\n", out);

        return 0;
}

 unbool.c

Aside: It's rare for C to beat Go in code golf; is this some kind of
achievement? Also, this code will only work with C99-style bools,
where true is always 1, and not any other non-zero integer.

In Go, we have to use if to convert bools into bits. I'm only going
to do 4 bits here in the interest of clarity.

var out uint32

pa := flag.Bool("a", false, "a")
pb := flag.Bool("b", false, "b")
pc := flag.Bool("c", false, "c")
pd := flag.Bool("d", false, "d")

flag.Parse()

a := *pa
b := *pb
c := *pc
d := *pd
fmt.Printf("%t %t %t %t\n", a, b, c, d)

// set each boolean to a bit in the input
if a {
        out |= 0x1
}
if b {
        out |= 0x2
}
if c {
        out |= 0x4
}
if d {
        out |= 0x8
}
fmt.Printf("result: %b\n", out)

 unbool.go

And the assembly dump:

go build -o /tmp/binextract-noif -gcflags='-S' bool.go

659 u.go:29 MOVBLZX     "".a+87(SP), AX ; AX = a
664 u.go:29 MOVL        AX, CX          ; CX = AX
666 u.go:29 ORL $2, AX              ; AX | 0b0010
669 u.go:35 MOVBLZX     "".b+86(SP), DX ; DX = b
674 u.go:35 TESTQ       DX, DX          ; flag = DX != 0
677 u.go:35 CMOVLNE     AX, CX          ; if flag { CX = AX }
680 u.go:32 MOVL        CX, AX          ; AX = CX
682 u.go:32 ORL $4, CX              ; CX | 0b0100
685 u.go:35 MOVBLZX     "".c+85(SP), DX ; DX = c
690 u.go:35 TESTQ       DX, DX          ; flag = DX != 0
693 u.go:35 CMOVLNE     CX, AX          ; if flag { AX = CX }
696 u.go:35 MOVL        AX, CX          ; CX = AX
698 u.go:35 ORL $8, AX              ; AX | 0b1000
701 u.go:38 MOVBLZX     "".d+84(SP), DX ; DX = d
706 u.go:38 TESTQ       DX, DX          ; flag = DX != 0
709 u.go:38 CMOVLNE     AX, CX          ; if flag { CX = AX }
712 u.go:38 MOVL        CX, (SP)        ; out = CX


CMOVLNE assign left to right if previous result is non-zero

 Full unbool.go assembly dump

Okay, this one is a bit harder to pick apart. Go is being clever. To
understand what's going on, know that Go treats the bool type as 0 or
1. It's like C99 bools, but much more strict.

When it assigns AX=a, a will either be 0 or 1.
MOVBLZX "".a+87(SP), AX

Next, we store the value of AX into CX.
MOVL AX, CX

Then we set the 0b0010 bit on AX. Yep, we set this before checking if
b is true.
RL $2, AX

Now we move (MOVBLXZ does a zero-extend) b into DX.
MOVBLZX "".b+86(SP), DX

and test if it's non-zero (true)
TESTQ DX, DX

If b is non-zero, set CX to AX. Remember that AX has the "if it's
true" value.
CMOVLNE AX, CX

Whatever the result, we reset AX to CX's current value.
MOVL CX, AX

And then we repeat this until we have all our booleans packed. Note
that AX and CX alternate roles.

It seems that the Go compiler is much smarter than I expected. It
sees the layout, figures out what we want to accomplish, then
replaces the branches in source code with conditional move
instructions! Once, again, constant-time code execution with no
branch prediction. Clever Go =)

Practical use-case#

With this new knowledge, I wrote some server-side JWT validation code
that runs in mostly constant-time. Since the JWT validation is part
of initial authentication, timing attacks could be used to determine
which checks are run against the JWT to search for potential
weaknesses. The code I wrote does the validation checks and stores
each error condition as one bit in an error status variable. After
all the checks are complete, a simple if valError != 0 check is run
to see if any errors did occur, then a success or error response is
returned to the caller. This also has the neat side effect of making
it easier to write unit tests that check for combinations of
validation errors.

The typical solution to defend against timing attacks on
authentication would be to set a minimum wait time. This way, any
short-circuit behavior would be masked, and the timing would be
identical whether all checks were run or not. The minimum wait time
is much more important if a database lookup is done for
authentication, such as when checking if a user account exists. The
timing difference will typically be larger, so it would be more
detectable over the network. To my knowledge, mysql and postgres do
not have any mechanism to prevent leaking information through timing.

Pros and Cons#

Minimum wait time:

  * easier to implement
  * can be used when your code depends on another service or database
    that does not run in constant time
  * requires you to time your code so you can set the timer to a
    little longer than max run time
  * if your code timing changes, minimum wait time may need to be
    updated, since code that takes longer than MWT will stand out

Constant-time:

  * harder to implement
  * will not protect services or databases your code calls if they
    expose information via timing
  * no MWT to configure and re-configure for code updates
  * you still need to test and time your code to make sure timing
    information is not leaking

Conclusion#

Defending against side-channel attacks is a challenge. Find the
critical security paths (such as login, authentication, validation,
etc) and harden those. For cryptography, it's vital to get this
right. Everything else, proceed normally and hope the Spectre bug
doesn't bite on that shared cloud hosting you're probably using.

For performance, remember that only the hot spots really need
optimization, and there is usually some low-hanging fruit that will
give a bigger performance boost than branch-less coding. Example:
using a non-garbage-collected language =P

Thanks to Matija Lesar for giving valuable feedback on this post.

Question/Comments? Found a mistake? Email me =)

Appendix A#

So... while writing this, I found an interesting assembly output in the
if version.

404 b.go:19 MOVQ        86+96(SP), AX
409 b.go:19 NOP
416 b.go:19 TESTB       $1, AL
418 b.go:19 JEQ 862               ; jump to 862
424 b.go:20 MOVQ        85+112(SP),CX ; this line is also at 862
429 b.go:20 MOVB        $1, (CX)      ; the line we needed to skip
432 b.go:22 TESTB       $2, AL        ; return here from 867
434 b.go:22 JEQ 440
436 b.go:23 MOVB        $1, 1(CX)
...
862 b.go:23     MOVQ    85+112(SP),CX ; same MOVQ as location 418
867 b.go:19     JMP     432               ; this could have been JEQ

Wait, why are 424 and 862 the same instruction? Why did we jump over
this just to repeat it and then jump back? Why jump to the end of the
code block? We only needed to skip the assignment on 429 What's up
with that?

I originally thought it was some clever compiler trick, or perhaps a
bug, but I couldn't think of a reason it should do this.

Then I went to #go-nuts on freenode to ask about it.

    2021-03-22 18:08
    <fizzie> I haven't been looking at Go assembly output, but if
    it's anything like C compilers, I think the problem is in
    assuming there would need to be a reason for a compiler to do
    something, other than "an implementation detail related to the
    compiler's internal data structures caused it".

    <fizzie> FWIW, the output I get is substantially similar, except
    without the extra NOP. Just a jump to the end, that "duplicate"
    operation, and a jump back. If I had to guess, it'd be something
    like, at the time when it was making that decision, it didn't
    realize that those two branches would end up with equivalent
    code, so it had to generate a separate "else" branch which it
    located past the end of the other block of code.

    <Gnum4n> but the mov happens before the branch. Go moved it to
    after the branch on its own

    <fizzie> Does the mov really "happen" at any point in particular?
    It's not like there's anything explicit in the source code that
    says "put this in a register".

    <Gnum4n> oh! I see what you mean. just because I created the
    array before line 20 in local function scope doesn't mean that Go
    knows that it should be loaded into a register before line 19, so
    it assumes that it needs to load it on line 20 or before 23

    <fizzie> Something like that, yes. FWIW, it doesn't exactly
    explain how it realized it can do that register load just once
    before all the other ifs (instead of translating them the same
    way), but that's the part that I think tends to boil down to "it
    made a lot of transformations through a lot of intermediate
    representations, and that's how the chips fell".

So, it's not what the compiler did (weird duplication), but what the
compiler didn't do, which is to de-duplicate the MOV and jump to
where we would expect.

Appendix B#

You may have noticed that input values are provided via command-line
flags. I initially hard-coded the values, but the Go compiler
converted them into one big constant and optimized away all the
decision-making. Clever compiler.

Previous Post Algorithms to Live by - ch1 Next Post Algorithms to
Live by - ch2
 
 
 

(c) 2021

  * Why does this matter?
  * Simple unpack
  * Branchless unpack
  * Branchless pack
  * Practical use-case
      + Pros and Cons
  * Conclusion
  * Appendix A
  * Appendix B