[HN Gopher] Round a direction vector to an 8-way compass
       ___________________________________________________________________
        
       Round a direction vector to an 8-way compass
        
       Author : greghn
       Score  : 57 points
       Date   : 2022-07-25 13:11 UTC (9 hours ago)
        
 (HTM) web link (lemire.me)
 (TXT) w3m dump (lemire.me)
        
       | bee_rider wrote:
       | Is it possible to just ask... xinput or whatever joystick driver
       | is used nowadays... to produce the cardinal or 8-compass
       | directions? I guess it seems plausible, at least, that maybe for
       | some controllers we could skip the whole conversion to/from a
       | pair of floats thing, depending on what signals are sent down the
       | USB pipe.
        
       | RicoElectrico wrote:
       | I wonder if some amount of hysteresis would be helpful to avoid
       | switching back and forth on the edge.
        
         | ChrisLomont wrote:
         | Button debouncing is useful for all sorts of stuff like this,
         | but it requires much trickier code to get right, including
         | tracking some form of history.
        
       | iainmerrick wrote:
       | What's the motivation for optimising this? If it's for user
       | input, you only need to call this once per frame (or at most a
       | few times).
        
         | jesse__ wrote:
         | I can think of a few reasons:
         | 
          | 1. writing a game where movement happens along 8 axes and needs
          | to trace lots of rays

         | 
         | 2. several other reasons similar to (1)
         | 
         | 3. had a clever idea to optimize a thing and went ham, for fun
        
         | beckingz wrote:
         | faster parsing of user input means more time for doing graphics
         | and game state!
         | 
         | Though in this case 1.5 nanoseconds vs 40 nanoseconds is small
         | and won't matter for most games.
        
         | Cthulhu_ wrote:
         | More faster is more better; you only have 16 milliseconds per
         | tick assuming you want to read input 60 times a second, some
         | games poll even faster / have a higher frame rate.
        
           | nmilo wrote:
            | It's still a difference of 38.5 _nano_ seconds. You have a
            | budget of 16,000,000 nanoseconds per frame. It's like Jeff
           | Bezos writing about how to save money at Walmart; just pick
           | the readable and stupid approach and move on. Game loops are
           | made fast or slow by the efficiency of their tight loops: the
           | render loop, physics update loop, UI loop, etc. Something
           | that happens once per frame is barely ever worth considering.
        
           | sgtnoodle wrote:
           | Ok, but 40ns of 16ms is 0.00025% of your frame time.
        
         | dahart wrote:
         | It doesn't matter for user input, the 40ns won't make a
         | difference even at 60fps or higher with 4 players. (Have
         | written the controller processing loop for a console game
         | engine.) I suspect it would be less than 40ns if float was used
         | instead of double; there's no good reason to use double here at
         | all. But the proposed solution is better, and super easy
         | (arguably easier than atan2+sincos), so maybe no reason not to
          | use it and save a few microseconds if it happens to add up to
         | that much.
         | 
         | I could imagine other uses of this code, like a UI wind map, or
         | some kind of randomized maze, or something else. There might be
         | reasons to round a vector of directions all at once.
        
           | iainmerrick wrote:
           | Yeah, it's a cute optimisation, and might make sense in a
           | game engine; I just question whether this would ever be
           | worthwhile writing as custom code in a single game.
           | 
           | I'm not sure I agree it's more readable. The optimised
           | version is reasonably regular, but it's still longer and has
           | more constants, hence more opportunities for sign errors and
           | the like.
           | 
           | It feels sort of like jumping through hoops to avoid a matrix
           | inversion, say. Sure, matrix inversion is "slow", but unless
           | you're doing it millions of times a frame it doesn't remotely
           | matter. If matrix inversion or atan2 is the most direct way
           | to express a transformation, just do it that way.
        
             | dahart wrote:
             | I think the atan2+sincos version is more readable, so I
             | guess we actually do agree there. But I wouldn't (and
             | didn't) use that in a game engine. For rounding and
             | readability in a real engine, I'd name 4 of the 8
             | directions and look for the largest absolute value dot
             | product. That would at least corral the constants. And I'd
             | use (did use) f16 or integer arithmetic.
             | 
             | One reason I have to support this kind of optimization and
             | not reach for the trig method by default is the last time I
             | wrote a controller system, it wasn't limited to something
             | as small as deciding on a single dpad-like direction, it
             | was a much larger system that designers could author
             | controller recognizers for, up to and including drawing
             | shapes like a circle with the stick, or entering a time
              | based sequence of inputs (think Contra's famous up-up-down-
              | down-etc-etc cheat code). If you have sequences and
             | multiple players and high frame rate and many possible
             | patterns to identify, that's when efficiency of these small
             | operations really starts to add up. I actually spent the
             | most development time ensuring that the amount of time
             | spent _not_ recognizing controls was minimized, making sure
             | that a no-op pass through the system was measured in
             | nanoseconds and not microseconds.
        
       | moralestapia wrote:
       | This but in a 2-sphere space is what I need right now for a
       | quantum computing simulator I'm coding for fun/learning.
       | 
       | I'm doing a different thing tho, dot product vs. four cardinal
       | points.
        
       | jameshart wrote:
       | Seems to rest on the assumption that an input control provides a
       | vector on the unit circle, and tries to divide that into eight
       | regions, corresponding to unit vectors aligned to compass
       | directions. That seems like an odd assumption though - most
       | analog XY inputs produce a vector with x and y components
       | nominally distributed from -1 to 1, (although in practice
       | constrained by calibration, and not necessarily linear in control
       | stick deflection)
       | 
       | This approach effectively divides the 'unit square' of possible
       | input vectors up into a grid, based on cos(pi/8) and cos(3pi/8)
       | gridlines, with basically 9 possible areas in each quadrant
       | mapped to corresponding directional vectors.
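
The per-axis thresholding that grid amounts to can be written down directly; a sketch (my reconstruction, not the article's exact code — the constant is sin(pi/8) = cos(3pi/8)):

```c
/* Snap each axis independently to {-1, 0, +1}. The threshold sin(pi/8)
 * = cos(3*pi/8) ~= 0.3827 is where a unit-circle input crosses from an
 * axis-aligned compass sector into a diagonal one. */
static void round_to_compass(double x, double y, int *outx, int *outy)
{
    const double t = 0.38268343236508978; /* sin(pi/8) */
    *outx = (x > t) - (x < -t);
    *outy = (y > t) - (y < -t);
}
```

For inputs on the unit circle this reproduces the 8 compass directions; for an arbitrary point in the unit square it picks one of the 9 grid cells jameshart describes.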
       | 
       | It may not be a bad strategy in general to use this kind of grid
       | model, but it needs testing with actual controllers to see what
       | the 'feel' ends up being. While this mathematical model makes
       | sense in theory, in practice there's no reason to suspect that
       | these cosine values are necessarily good thresholds to use to
       | create an intuitive compass direction control that actually feels
       | intuitive.
       | 
       | In general in games, math needs to be a guide, but not a
       | straitjacket for how you approach things.
        
         | dahart wrote:
         | This is a good point; some controllers are neither circular nor
         | square. Check out the mapping ranges for a couple of Nintendo's
         | old controllers: https://electromodder.co.uk/
         | 
         | That said, Lemire's approach rewritten as dot products might
         | work nicely and even look a bit more clean, even if it took 3ns
         | instead of 1.5.
        
           | Arrath wrote:
           | > This is a good point; some controllers are neither circular
           | nor square. Check out the mapping ranges for a couple of
           | Nintendo's old controllers: https://electromodder.co.uk/
           | 
           | Very interesting, thanks! I've noticed myself that sometimes
           | if I'm holding the controller a bit skewed my natural impulse
           | will be to push in directions that don't exactly map to what
            | the controller wants. E.g. I think I'm pushing forward, but
            | I'm moving forward and slightly to the left. And it's not
            | just stick drift, it's my own user error.
        
             | dahart wrote:
             | I do the same. Some difficult and brutal platformers will
             | really help you notice how much user error you bring to the
             | table. :P Good point too, because for fast twitchy games
             | where the directions really matter, it's quite nice to be
             | able to calibrate and/or bias the direction rounding to
             | compensate for my own bias. This could be one good reason
             | to go with the trig code instead - offsetting the angle
             | becomes trivial. It could be done under the covers before
             | the rounding code sees it, but it'd still be doing trig
             | somewhere.
        
       | jansan wrote:
       | Two more optimizations:
       | 
        | - Use _else if_ to avoid unnecessary condition checking.
       | 
       | - Only check for _xneg_ / _yneg_ if _outx_ / _outy_ are non-zero.
        
         | dundarious wrote:
         | Almost certainly makes it worse to add additional conditional
         | dependencies, meaning branches upon branches.
         | 
         | Some examples:
         | 
          | 1. Branches with dependencies.
          | 
          |     static void minMax(u8* pxmin, u8* pxmax, const u8* xs, size_t n)
          |     {
          |       const u8 init = n > 0u ? xs[0] : (u8)0u;
          |       u8 xmin = init;
          |       u8 xmax = init;
          |       for (size_t i = 1u; i < n; ++i)
          |       {
          |         const u8 x = xs[i];
          |         if (x < xmin) { xmin = x; }
          |         else if (x > xmax) { xmax = x; }
          |       }
          |       *pxmin = xmin;
          |       *pxmax = xmax;
          |     }
         | 
          | 2. Branches w/o dependencies
          | 
          |     static void minMax(u8* pxmin, u8* pxmax, const u8* xs, size_t n)
          |     {
          |       const u8 init = n > 0u ? xs[0] : (u8)0u;
          |       u8 xmin = init;
          |       u8 xmax = init;
          |       for (size_t i = 1u; i < n; ++i)
          |       {
          |         const u8 x = xs[i];
          |         if (x < xmin) { xmin = x; }
          |         if (x > xmax) { xmax = x; }  /* not doing else if */
          |       }
          |       *pxmin = xmin;
          |       *pxmax = xmax;
          |     }
         | 
         | Compiling with auto-vectorization enabled (-O3 for most
         | versions of GCC, and I was using -march=native on a Zen1 AMD
          | CPU, so AVX2 capable), version 2 is 0.05x the runtime of
          | version 1 (so a _lot_ faster). Version 2 disassembly will show
         | lots of "packed" vector instructions.
         | 
          | 3. Branchless
          | 
          |     inline u8 minBranchless(u8 x, u8 y)
          |     {
          |       return x * (u8)(x <= y ? 1u : 0u)
          |            + y * (u8)(x > y ? 1u : 0u);
          |     }
          | 
          |     inline u8 maxBranchless(u8 x, u8 y)
          |     {
          |       return x * (u8)(x >= y ? 1u : 0u)
          |            + y * (u8)(x < y ? 1u : 0u);
          |     }
          | 
          |     static void minMax(u8* pxmin, u8* pxmax, const u8* xs, size_t n)
          |     {
          |       const u8 init = n > 0u ? xs[0] : (u8)0u;
          |       u8 xmin = init;
          |       u8 xmax = init;
          |       for (size_t i = 1u; i < n; ++i)
          |       {
          |         const u8 x = xs[i];
          |         xmin = minBranchless(xmin, x);
          |         xmax = maxBranchless(xmax, x);
          |       }
          |       *pxmin = xmin;
          |       *pxmax = xmax;
          |     }
         | 
         | Version 3 is probably a better comparison for the code from the
         | article -- there are only 1 or 2 joystick direction vectors
         | that need normalization, so perhaps not enough to show major
         | vectorization benefit, or to even prompt compiler auto-
         | vectorization. Version 3 will most likely be compiled into cmov
         | instructions on x86_64, which are branchless, but branches can
         | still be better, if they can be predicted well. The test data I
         | used was random, so cmov should be expected to be beneficial.
         | 
         | Last time I ran comparisons, Version 2 was 0.07x the runtime of
         | Version 3, so Version 3 was not that much faster than Version
         | 1, but still a good improvement.
         | 
         | As a final note, I should say I'm not a fan of _relying_ on
         | compiler auto-vectorization. If you depend on the performance
         | of vectorized code (and vectorizing code is one of the best
         | optimizations out there), then I'm a fan of explicitly
         | vectorizing. SSE2 is in nearly 100% of relevant x86_64 systems
         | (first coming out in 2003). If you're targeting consumers, AVX
         | is almost certainly worth it. If you're writing server code,
         | you should be taking full advantage of AVX2 and maybe 512. For
         | arm, NEON is the current vector extension of relevance, and you
          | can polyfill with DLTcollab/sse2neon or something like that,
         | until you go native. Writing vectorized code using SIMD
         | intrinsics (#include <immintrin.h>) is preferable. Especially
         | since compilers just cannot know how to auto-vectorize a _lot_
         | of floating point code, because of strict assumptions about
          | requirements around non-associativity, so they almost _never_
         | generate packed FMADDs.
         | 
         | As for the article code, I wouldn't make any assumptions about
         | whether the compiler is generating branches or cmov-s. I'd
         | guess cmov-s would be faster, but I haven't really thought
         | about it. All I'll say is that turning a lot of `if`s into
         | `else if`s would make it impossible to generate cmov-s, and
         | chaining branches just to avoid assigning to a local that's
         | almost certainly in a register is only going to make branching
         | slower.
        
           | jesse__ wrote:
           | Loved this, thanks.
        
           | WithinReason wrote:
           | Isn't the ternary operator a branch? Could you just write
           | (u8)(x <= y)?
        
             | dundarious wrote:
             | In x86_64, a branch is typically considered to be a
             | comparison and then a jump to different code (otherwise
             | fall through to following code). Test sets some control
             | flags, and the specific jump type jumps based on the value
             | of a specific control flag. You might have something like:
              |         mov  rax,rbx
              |         test rcx,rcx
              |         jnz  L1
              |         mov  rax,rdx
              |     L1: ret
             | 
             | A cmov instruction is a conditional move. It says, based on
             | specific control flags, do a mov, otherwise do nothing.
              |         mov   rax,rbx
              |         test  rcx,rcx
              |         cmovz rax,rdx
              |         ret
             | 
             | The conditional move is not really branching in the classic
             | meaning of the word, because there is no jump (the
             | instruction pointer unconditionally moves forward to the
             | next instruction).
             | 
             | Branches, even well-predicted ones, can hurt performance,
             | especially if the jumped to code is not in cache (needs to
             | be fetched). More detail can be found in this nice top-to-
             | bottom video from Casey Muratori, which goes over modern
             | CPU architecture (it's nothing too fancy, I think every
             | person doing optimization should know this stuff):
             | https://www.youtube.com/watch?v=8VakkEFOiJc
             | 
             | You can imagine the compiler generating either assembly
             | from if style code or from ternary style code. Writing the
             | arithmetic ternary style just happens to nudge the compiler
             | into using cmov-s more often than other ways. For simple
             | cases, a smart enough compiler could generate cmov-s from
             | the if-else case as well, but it's not smart enough for
             | even toy examples. Never mind that cmov-s are not
             | unconditionally better -- a well-trained branch predictor
             | is faster. The question is if your data is conducive to
             | that or not -- something the compiler may never infer. So
             | we're left to encode such information very indirectly via
             | the code patterns, or via all sorts of __builtin_expect
             | style intrinsics.
             | 
             | This all goes to show that C is no longer portable
             | assembler, and there are all sorts of crucial resources on
             | the CPU (cmov, SIMD, etc.) that may go unused based on
             | slight differences in the C code (e.g., even little things
             | that might seem correct, like using unsigned for an
             | accumulator, can ruin its ability to generate cmov-s).
             | 
             | Compilers are really good at turning code that multiplies
             | by a constant into a weird mov and lea that's cheaper than
             | a mul, but they're still very spotty on everything else,
             | and "everything else" covers a huge amount of the stuff
             | that has made CPUs faster in the last 15+ years.
             | Understanding your specific platform target(s) is still
             | crucial if you want to do better than just "get lucky with
             | the compiler sometimes".
        
           | jcranmer wrote:
           | > SSE2 is in nearly 100% of relevant x86_64 systems (first
           | coming out in 2003).
           | 
           | Not nearly, exactly 100%. The 64-bit ABI specifically
           | requires SSE2 support.
        
             | dundarious wrote:
             | Last time I talked with someone about this, they convinced
             | me there was some rare, possibly even research-only x86_64
             | chip that didn't support it, so my point is that if my hazy
             | memory of someone else's argument is correct, that it
             | doesn't matter, treat it as if it's exactly 100% anyway.
             | 
             | It's 100% of what matters, so in practice I'm in full
             | agreement with you.
        
         | WithinReason wrote:
         | Would _else if_ be actually faster on a modern CPU
         | architecture?
        
           | pklausler wrote:
           | Only measurement with real data would tell for sure, but I
           | would bet "yes" if I had to. With the "else", the "if"
           | predicate is being predicted over a smaller amount of
           | relevant cases; without the "else", the prediction is going
           | to have to cope with some amount of needless false results
           | and is likely to be less accurate.
        
         | adrian_b wrote:
         | As written, the compiler can implement the "if"s with
         | conditional move instructions and that would be faster than
         | with conditional branches, as those would be hard to predict in
         | this case.
         | 
         | Using "else if" is a valid optimization in general, but not
         | necessarily when your target is a branchless program. Using
         | "else if" may help in some cases, but you must have in mind
         | exactly which CMOV instructions will be generated by the
         | compiler, to be sure that there is any improvement, i.e. any
         | reduction in the total number of CMOV instructions, and that
         | you are not increasing the length of some chains of dependent
         | instructions (leading to a greater latency).
        
       | dahart wrote:
       | Is the problem with the sin/cos approach that it's using double
       | precision? That's surely completely unnecessary precision. These
       | days I don't know what happens in x86 land with sin & cos, but on
       | a gaming GPU the fp32 cosf() is typically 16 or 32 times faster
       | than fp64 cos(). Is that true on CPUs too? (I'm sure the simple
       | proposed solution is always faster even in fp32, I'm just curious
       | if the gap is much narrower, smaller than 25x.)
        
         | hansvm wrote:
         | The precision doesn't matter much on the CPU (roughly half the
          | throughput if you're doing a lot of them). You have a data
         | dependency on the arctangent, and transcendentals are slow
         | anyway, so you're not going to have a good time. You can even
         | compute the sin and cos together and still expect a bare
         | minimum latency of 50 clock cycles for the arctangent followed
         | by 50 for the sincos, usually implemented with a variable speed
         | algorithm that might be slower (i.e., lower bound of 30ns of
         | latency on a moderately fast single core, just for the trig).
         | 
         | Throughput can be a bit higher with vectorization and
         | pipelining, but more or less any such optimization you can make
         | to the transcendentals you can also make on the proposed
         | solution.
        
           | dahart wrote:
           | Thanks, good to know! So that does suggest to me that if
           | there are other threads in progress, the actual impact to the
           | runtime in practice probably would be lower with fp32 - the
           | half throughput rate does tell us how long we wait for any
           | instructions dependent on the operation, right? I'm saying I
           | suspect the total latency can be filled with other work, so
           | this isn't really 40ns that is fully devoted to atan2+sincos
           | and nothing else. Is that accurate? (On the GPUs I use, trig
           | functions use a different math pipe than FMA, and so they can
           | be 'free' if you don't use too many of them and there is
           | enough logic or FMA to hide the trig latency.)
           | 
           | Aren't the benefits of using fp32 bigger than the trig
           | instruction throughput too? Doubles eat more register space,
           | and double the bandwidth to memory if anything ends up in
           | memory. I just don't see any reason whatsoever to use doubles
           | for this operation. If we're going to worry about
           | nanoseconds, then these extra bytes matter too, no?
        
             | hansvm wrote:
             | The throughput comment was just that we have fixed-width
             | vector instructions which support being filled with and
             | doing trig on either 32-bit or 64-bit floats. They both
             | occupy the same floating point execution units (mostly, see
              | the AMD Bulldozer/Steamroller/... fiasco from a few years
             | ago), so you're either filling them with N doubles or 2N
             | floats. You'd be hard-pressed to squeeze other useful work
             | into that space. The actual difference between float/double
             | in the amount of work done is otherwise miniscule and
             | mostly related to teasing out the extra precision. With
             | fast-converging algorithms it doesn't wind up mattering
             | much in terms of total latency.
             | 
             | "Latency" is what I was referring to with respect to how
              | long you have to wait to execute a data-dependent
             | instruction. "Throughput" refers to how many you can have
             | in flight and how fast those execute.
             | 
             | Latency _can_ be filled with other work, but as I
             | mentioned, it can be filled better with other work using
             | the faster non-trig solution. Rule of thumb, you won't
             | squeeze in more than 4 in-flight instructions per core, so
              | you're probably not going to eke out more than 10ns average
             | throughput for the trig solution, worse than the proposed
             | optimization even assuming it couldn't be pipelined at all.
             | You're right that float/int execution units are usually
             | separate, but they do have some shared resources. If you're
             | doing enough logic relative to these computations then yes
             | these could sometimes be basically free.
             | 
             | The register/cache/RAM/disk space and bandwidth very well
             | might wind up dominating every other consideration. As
             | always, it depends what else you're doing and what exactly
             | you're doing with these computations. To give a useful
             | counter-example, consider matrix multiplication. It can be
             | structured such that even if you have to read/write the
             | matrices from disk because of their size it's still a CPU-
             | bound operation because you can efficiently use each data
             | access many times. In that case you really do just have
             | typically a 2x slowdown for doubles because you have half
             | the elements per vector.
             | 
             | Might f32 be sufficient for this problem? Almost certainly
             | IMO. It might not matter though (or it might -- as always,
             | measure).
        
               | dahart wrote:
               | FWIW, for a game controller, I'm absolutely certain that
               | f32 is already overkill and that a half float or even a
               | byte per axis, u16/u8, is sufficient. The only reason to
               | need f32 or f64 is if you're rounding direction vectors
               | for some other purpose than a game controller. But in
               | graphics we frequently use f16 for direction vectors even
               | when f32 is needed for positions. Normals and unit sphere
               | directions usually require less precision than other
               | operations. Using f64 for a game controller stick input
               | isn't justifiable, even if it isn't measurable in code
               | speed, I don't think it would fly or pass code review in
               | a real game engine. My next concern would be whether
               | using doubles willy-nilly leads to the compiler spilling
               | registers to memory more often. That could easily undo
               | the 40ns optimization, and it would be difficult to
               | measure or blame the direction rounding code, it would
               | diffuse the cost to other parts of the code.
        
       | phkahler wrote:
       | It may be useful to read about the CORDIC algorithms. It just
       | feels like there is an even simpler solution lurking here:
       | 
       | https://en.wikipedia.org/wiki/CORDIC
        
       | thexa4 wrote:
       | Taking the dot product of the [x,y] with the 8 directions and
       | picking the one with the largest magnitude might be faster in
       | some cases.
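
A sketch of that dot-product variant (my illustration, not the commenter's code); note the input needs no normalization, since every candidate direction has the same length:

```c
#include <math.h>

/* Score (x, y) against the 8 compass unit vectors and keep the best.
 * Returns 0 = E, 1 = NE, 2 = N, ..., 7 = SE. */
static int best_compass_dir(double x, double y)
{
    static const double d = 0.70710678118654752; /* 1/sqrt(2) */
    static const double dirs[8][2] = {
        { 1, 0}, { d,  d}, { 0,  1}, {-d,  d},
        {-1, 0}, {-d, -d}, { 0, -1}, { d, -d}
    };
    int best = 0;
    double bestDot = -INFINITY;
    for (int i = 0; i < 8; ++i) {
        double dot = x * dirs[i][0] + y * dirs[i][1];
        if (dot > bestDot) { bestDot = dot; best = i; }
    }
    return best;
}
```

The fixed-size loop over a constant table is also the shape that lends itself to the SIMD treatment discussed below.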
        
         | rocqua wrote:
         | You only need 4 directions, and can look at the sign of the
         | dot-product.
        
         | tgv wrote:
          | Good idea. Only four inner products and comparisons are needed:
          | the
         | other four have the same value, but negated. Might be SIMD-
         | able.
        
           | Someone wrote:
           | I was thinking about a more complicated approach (edit2: that
           | ends up simpler). A 4-way version of this can be done with
           | two comparisons ( _if x > y_ to check on what side of one
           | diagonal the point lies and _if x > -y_ to check the same for
           | the other diagonal, plus some bit twiddling to get the
           | required two bits out).
           | 
           | One can do that, rotate by 45 degrees, do that again, and do
           | some further bit manipulation to combine the two bits of the
           | first 4-way version with those of the second into the three
           | bits this should produce.
           | 
           | That would be four comparisons, a 2x2 matrix multiplication
           | (edit: that's a matrix-vector multiplication) with a known-
           | ahead rotation matrix (naively, that would be 8 (edit: 4)
           | multiplications and 4 (edit: 2) additions, but I think we can
           | lose those multiplications because all elements of a 45
            | degree rotation matrix have absolute value 1/√2 and the
           | problem is scale-invariant) and some bit twiddling.
           | 
           | Now, make that branchless...
           | 
           | Edit2: and of course, you can do both splits into quadrants
           | by just looking at the signs of x and y. That rotates the
           | four quadrants by 45 degrees, but that doesn't matter.
           | 
            | That makes the 4-way version simply _(x > 0) + 2*(y > 0)_:
            | two comparisons and a bit shift.
            | 
            |     twoBits = (x > 0) + 2*(y > 0)
            |     x' = x + y
            |     y' = x - y
            |     twoMoreBits = (x' > 0) + 2*(y' > 0)
            | 
            | (Some bit twiddling with 'twoBits' and 'twoMoreBits'. Feel
            | free to compute those bit pairs differently if it makes the
            | bit twiddling easier.)
        
       ___________________________________________________________________
       (page generated 2022-07-25 23:01 UTC)