https://mattnakama.com/blog/go-branchless-coding/ * about * blog * rss Branchless Coding in Go Apr 23 2021 So, I was recently inspired by a coworker's design to pack what would have been JSON (JavaScript Object Notation) data into a bunch of bit fields in an array of 64-bit elements. Binary arithmetic, bit-packing, and high performance are fun for me, so I thought of some ways to improve it, for learning purposes. Then I had an idea: maybe I can unpack/pack bits without using conditional branching (if statements). Why does this matter?# Performance and security. Modern CPUs will often do branch prediction on reaching a conditional jump, which is essentially a guess. If the CPU guesses wrong, it needs to unwind its changes then execute the other branch of the if statement. A wrong guess is both a performance hit and a potential information leak. Attackers running on the same machine can time how long an operation takes to infer the CPU's guess, such as with Spectre. For more advanced operations, such as cryptography and server-side APIs, timing attacks can reveal priviledged information. The less branches we have in our code, the faster the CPU can execute it. There are less guesses to get wrong =) Simple unpack# Unpacking a packed binary format usually requires a bunch of bitwise operations. Assuming that we have an 8-bit binary structure that holds 8 boolean config options, Here's how we might unpack it: tests := make([]bool, 8) ... var input uint8 = uint8(*num) // set each boolean to a bit in the input if input&(1<<0) != 0 { tests[0] = true } if input&(1<<1) != 0 { tests[1] = true } if input&(1<<2) != 0 { tests[2] = true } if input&(1<<3) != 0 { tests[3] = true } if input&(1<<4) != 0 { tests[4] = true } if input&(1<<5) != 0 { tests[5] = true } if input&(1<<6) != 0 { tests[6] = true } if input&(1<<7) != 0 { tests[7] = true } ... Full source code I shortened it a bit for readability. Essentially, we check each input bit, one at a time, and store the result to a bool. Now, let's look at the assembly output: go build -o /tmp/binextract-if -gcflags='-S' binextract-if.go Comments added for clarity: 404 b.go:19 MOVQ 86+96(SP), AX ; AX = input 409 b.go:19 NOP ; 416 b.go:19 TESTB $1, AL ; flag = AL & 0b0001 418 b.go:19 JEQ 862 ; if flag { jmp } 424 b.go:20 MOVQ 85+112(SP),CX ; CX = tests[] 429 b.go:20 MOVB $1, (CX) ; tests[0] = true 432 b.go:22 TESTB $2, AL ; flag = AL & 0b0010 434 b.go:22 JEQ 440 ; if flag { jmp } 436 b.go:23 MOVB $1, 1(CX) ; tests[1] = true 440 b.go:25 TESTB $4, AL 442 b.go:25 JEQ 448 444 b.go:26 MOVB $1, 2(CX) 448 b.go:28 TESTB $8, AL 450 b.go:28 JEQ 456 452 b.go:29 MOVB $1, 3(CX) 456 b.go:31 TESTB $16, AL 458 b.go:31 JEQ 464 460 b.go:32 MOVB $1, 4(CX) 464 b.go:34 TESTB $32, AL 466 b.go:34 JEQ 472 468 b.go:35 MOVB $1, 5(CX) 472 b.go:37 TESTB $64, AL 474 b.go:37 JEQ 480 476 b.go:38 MOVB $1, 6(CX) 480 b.go:40 TESTB $-128, AL 482 b.go:40 JEQ 488 484 b.go:41 MOVB $1, 7(CX) 488 b.go:44 MOVQ CX, (SP) ... 862 b.go:23 MOVQ 85+112(SP),CX ; same MOVQ as location 418 867 b.go:19 JMP 432 ; (appendix A for explanation) cheat sheet: MOV move value on left to location on right NOP no operation (do nothing) TEST do a boolean AND comparison between left and right JEQ if last operation resulted in zero, jump to location $ constant. $1==0b0001, $2==0b0010, $4==0b0100, etc. AX cpu register; used for comparisons here AL lower 8 bits of AX CX cpu register; stores address of the output array Full assembly dump That's a lot of conditional branching. Where there's conditional branching, there's CPU branch prediction, and where there's branch prediction, there's performance loss and potential vectors for timing attacks like spectre. For occasionally reading a config file, this is negligible. If this is hot code that runs frequently, with sensitive information, perhaps we should remove the branches. Branchless unpack# Can we unpack these boolean values without conditional jumps? Yep! We can even do it in Go, and without a ternary operator =) tests := make([]bool, 8) ... var input uint8 = uint8(*num) ... // set each boolean to a bit in the input tests[0] = (input & (1 << 0)) != 0 tests[1] = (input & (1 << 1)) != 0 tests[2] = (input & (1 << 2)) != 0 tests[3] = (input & (1 << 3)) != 0 tests[4] = (input & (1 << 4)) != 0 tests[5] = (input & (1 << 5)) != 0 tests[6] = (input & (1 << 6)) != 0 tests[7] = (input & (1 << 7)) != 0 Full "no if" source code So, this is nearly the same thing, except we convert the bit fields into booleans instead of using if statements. C would let us automatically convert ints to booleans, but Go doesn't allow converting booleans to/from anything. However, any expression that you can use inside an if condition is also usable as a value for a boolean variable ;-) Assembly output: 404 n.go:19 MOVQ 86+96(SP), AX ; AX = input 409 n.go:19 TESTB $1, AL ; flag = AL & 0b0001 411 n.go:19 MOVQ 85+112(SP),CX ; CX = tests[] 416 n.go:19 SETNE (CX) ; tests[0] = flag 419 n.go:20 TESTB $2, AL ; flag = AL & 0b0010 421 n.go:20 SETNE 1(CX) ; tests[1] = flag 425 n.go:21 TESTB $4, AL 427 n.go:21 SETNE 2(CX) 431 n.go:22 TESTB $8, AL 433 n.go:22 SETNE 3(CX) 437 n.go:23 TESTB $16, AL 439 n.go:23 SETNE 4(CX) 443 n.go:24 TESTB $32, AL 445 n.go:24 SETNE 5(CX) 449 n.go:25 TESTB $64, AL 451 n.go:25 SETNE 6(CX) 455 n.go:26 TESTB $-128, AL 457 n.go:26 SETNE 7(CX) 461 n.go:28 MOVQ CX, (SP) SETNE sets target to true (0x01) if previous test is non-zero Full "no if" assembly dump Check it out, no branching! And with no branching, there's no branch prediction - at least not in this code =). Essentially, Go takes advantage of the SETNE x86 instruction to conditionally set the boolean variable instead of using conditional jumps. Branchless pack# Now that we have a binary unpacker that runs in constant time, how do we write the reverse? Go does not allow us to directly convert binary values into bits. We can't just bitshift them as integers like we would in C: #include #include // C99 bool support int main() { bool a=true; bool b=true; bool c=false; bool d=true; printf("%x %x %x %x\n", d, c, b, a); // only one line =D unsigned int out = a | (b<<1) | (c<<2) | (d<<3); printf("result: 0x%x\n", out); return 0; } unbool.c Aside: It's rare for C to beat Go in code golf; is this some kind of achievement? Also, this code will only work with C99-style bools, where true is always 1, and not any other non-zero integer. In Go, we have to use if to convert bools into bits. I'm only going to do 4 bits here in the interest of clarity. var out uint32 pa := flag.Bool("a", false, "a") pb := flag.Bool("b", false, "b") pc := flag.Bool("c", false, "c") pd := flag.Bool("d", false, "d") flag.Parse() a := *pa b := *pb c := *pc d := *pd fmt.Printf("%t %t %t %t\n", a, b, c, d) // set each boolean to a bit in the input if a { out |= 0x1 } if b { out |= 0x2 } if c { out |= 0x4 } if d { out |= 0x8 } fmt.Printf("result: %b\n", out) unbool.go And the assembly dump: go build -o /tmp/binextract-noif -gcflags='-S' bool.go 659 u.go:29 MOVBLZX "".a+87(SP), AX ; AX = a 664 u.go:29 MOVL AX, CX ; CX = AX 666 u.go:29 ORL $2, AX ; AX | 0b0010 669 u.go:35 MOVBLZX "".b+86(SP), DX ; DX = b 674 u.go:35 TESTQ DX, DX ; flag = DX != 0 677 u.go:35 CMOVLNE AX, CX ; if flag { CX = AX } 680 u.go:32 MOVL CX, AX ; AX = CX 682 u.go:32 ORL $4, CX ; CX | 0b0100 685 u.go:35 MOVBLZX "".c+85(SP), DX ; DX = c 690 u.go:35 TESTQ DX, DX ; flag = DX != 0 693 u.go:35 CMOVLNE CX, AX ; if flag { AX = CX } 696 u.go:35 MOVL AX, CX ; CX = AX 698 u.go:35 ORL $8, AX ; AX | 0b1000 701 u.go:38 MOVBLZX "".d+84(SP), DX ; DX = d 706 u.go:38 TESTQ DX, DX ; flag = DX != 0 709 u.go:38 CMOVLNE AX, CX ; if flag { CX = AX } 712 u.go:38 MOVL CX, (SP) ; out = CX CMOVLNE assign left to right if previous result is non-zero Full unbool.go assembly dump Okay, this one is a bit harder to pick apart. Go is being clever. To understand what's going on, know that Go treats the bool type as 0 or 1. It's like C99 bools, but much more strict. When it assigns AX=a, a will either be 0 or 1. MOVBLZX "".a+87(SP), AX Next, we store the value of AX into CX. MOVL AX, CX Then we set the 0b0010 bit on AX. Yep, we set this before checking if b is true. RL $2, AX Now we move (MOVBLXZ does a zero-extend) b into DX. MOVBLZX "".b+86(SP), DX and test if it's non-zero (true) TESTQ DX, DX If b is non-zero, set CX to AX. Remember that AX has the "if it's true" value. CMOVLNE AX, CX Whatever the result, we reset AX to CX's current value. MOVL CX, AX And then we repeat this until we have all our booleans packed. Note that AX and CX alternate roles. It seems that the Go compiler is much smarter than I expected. It sees the layout, figures out what we want to accomplish, then replaces the branches in source code with conditional move instructions! Once, again, constant-time code execution with no branch prediction. Clever Go =) Practical use-case# With this new knowledge, I wrote some server-side JWT validation code that runs in mostly constant-time. Since the JWT validation is part of initial authentication, timing attacks could be used to determine which checks are run against the JWT to search for potential weaknesses. The code I wrote does the validation checks and stores each error condition as one bit in an error status variable. After all the checks are complete, a simple if valError != 0 check is run to see if any errors did occur, then a success or error response is returned to the caller. This also has the neat side effect of making it easier to write unit tests that check for combinations of validation errors. The typical solution to defend against timing attacks on authentication would be to set a minimum wait time. This way, any short-circuit behavior would be masked, and the timing would be identical whether all checks were run or not. The minimum wait time is much more important if a database lookup is done for authentication, such as when checking if a user account exists. The timing difference will typically be larger, so it would be more detectable over the network. To my knowledge, mysql and postgres do not have any mechanism to prevent leaking information through timing. Pros and Cons# Minimum wait time: * easier to implement * can be used when your code depends on another service or database that does not run in constant time * requires you to time your code so you can set the timer to a little longer than max run time * if your code timing changes, minimum wait time may need to be updated, since code that takes longer than MWT will stand out Constant-time: * harder to implement * will not protect services or databases your code calls if they expose information via timing * no MWT to configure and re-configure for code updates * you still need to test and time your code to make sure timing information is not leaking Conclusion# Defending against side-channel attacks is a challenge. Find the critical security paths (such as login, authentication, validation, etc) and harden those. For cryptography, it's vital to get this right. Everything else, proceed normally and hope the Spectre bug doesn't bite on that shared cloud hosting you're probably using. For performance, remember that only the hot spots really need optimization, and there is usually some low-hanging fruit that will give a bigger performance boost than branch-less coding. Example: using a non-garbage-collected language =P Thanks to Matija Lesar for giving valuable feedback on this post. Question/Comments? Found a mistake? Email me =) Appendix A# So... while writing this, I found an interesting assembly output in the if version. 404 b.go:19 MOVQ 86+96(SP), AX 409 b.go:19 NOP 416 b.go:19 TESTB $1, AL 418 b.go:19 JEQ 862 ; jump to 862 424 b.go:20 MOVQ 85+112(SP),CX ; this line is also at 862 429 b.go:20 MOVB $1, (CX) ; the line we needed to skip 432 b.go:22 TESTB $2, AL ; return here from 867 434 b.go:22 JEQ 440 436 b.go:23 MOVB $1, 1(CX) ... 862 b.go:23 MOVQ 85+112(SP),CX ; same MOVQ as location 418 867 b.go:19 JMP 432 ; this could have been JEQ Wait, why are 424 and 862 the same instruction? Why did we jump over this just to repeat it and then jump back? Why jump to the end of the code block? We only needed to skip the assignment on 429 What's up with that? I originally thought it was some clever compiler trick, or perhaps a bug, but I couldn't think of a reason it should do this. Then I went to #go-nuts on freenode to ask about it. 2021-03-22 18:08 I haven't been looking at Go assembly output, but if it's anything like C compilers, I think the problem is in assuming there would need to be a reason for a compiler to do something, other than "an implementation detail related to the compiler's internal data structures caused it". FWIW, the output I get is substantially similar, except without the extra NOP. Just a jump to the end, that "duplicate" operation, and a jump back. If I had to guess, it'd be something like, at the time when it was making that decision, it didn't realize that those two branches would end up with equivalent code, so it had to generate a separate "else" branch which it located past the end of the other block of code. but the mov happens before the branch. Go moved it to after the branch on its own Does the mov really "happen" at any point in particular? It's not like there's anything explicit in the source code that says "put this in a register". oh! I see what you mean. just because I created the array before line 20 in local function scope doesn't mean that Go knows that it should be loaded into a register before line 19, so it assumes that it needs to load it on line 20 or before 23 Something like that, yes. FWIW, it doesn't exactly explain how it realized it can do that register load just once before all the other ifs (instead of translating them the same way), but that's the part that I think tends to boil down to "it made a lot of transformations through a lot of intermediate representations, and that's how the chips fell". So, it's not what the compiler did (weird duplication), but what the compiler didn't do, which is to de-duplicate the MOV and jump to where we would expect. Appendix B# You may have noticed that input values are provided via command-line flags. I initially hard-coded the values, but the Go compiler converted them into one big constant and optimized away all the decision-making. Clever compiler. Previous Post Algorithms to Live by - ch1 Next Post Algorithms to Live by - ch2 (c) 2021 * Why does this matter? * Simple unpack * Branchless unpack * Branchless pack * Practical use-case + Pros and Cons * Conclusion * Appendix A * Appendix B