[HN Gopher] Round a direction vector to an 8-way compass
___________________________________________________________________
Round a direction vector to an 8-way compass
Author : greghn
Score : 57 points
Date : 2022-07-25 13:11 UTC (9 hours ago)
(HTM) web link (lemire.me)
(TXT) w3m dump (lemire.me)
| bee_rider wrote:
| Is it possible to just ask... xinput or whatever joystick driver
| is used nowadays... to produce the cardinal or 8-compass
| directions? I guess it seems plausible, at least, that maybe for
| some controllers we could skip the whole conversion to/from a
| pair of floats thing, depending on what signals are sent down the
| USB pipe.
| RicoElectrico wrote:
| I wonder if some amount of hysteresis would be helpful to avoid
| switching back and forth on the edge.
| ChrisLomont wrote:
| Button debouncing is useful for all sorts of stuff like this,
| but it requires much trickier code to get right, including
| tracking some form of history.
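| A hypothetical sketch of what the history-tracking version might look
| like (the names, the 0.05 margin, and the assumption of a roughly
| unit-length input are all made up for illustration):
| 
|     typedef struct { double x, y; } Vec2;
| 
|     static const Vec2 COMPASS[8] = {
|         {  1.0,  0.0 }, {  0.70710678,  0.70710678 }, {  0.0,  1.0 },
|         { -0.70710678,  0.70710678 }, { -1.0,  0.0 },
|         { -0.70710678, -0.70710678 }, {  0.0, -1.0 },
|         {  0.70710678, -0.70710678 }
|     };
| 
|     /* prev/candidate are indices into COMPASS; candidate comes from
|        whatever plain rounding you already do. Only switch once the raw
|        reading favors the new direction by a margin, so we don't
|        flip-flop right on a sector boundary. */
|     static int applyHysteresis(double x, double y, int prev, int candidate) {
|         if (candidate == prev) return prev;
|         const double dNew  = x * COMPASS[candidate].x + y * COMPASS[candidate].y;
|         const double dPrev = x * COMPASS[prev].x + y * COMPASS[prev].y;
|         return (dNew - dPrev > 0.05) ? candidate : prev;
|     }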
| iainmerrick wrote:
| What's the motivation for optimising this? If it's for user
| input, you only need to call this once per frame (or at most a
| few times).
| jesse__ wrote:
| I can think of a few reasons:
|
| 1. writing a game where movement happens along 8 axes and needs
| to trace lots of rays
|
| 2. several other reasons similar to (1)
|
| 3. had a clever idea to optimize a thing and went ham, for fun
| beckingz wrote:
| faster parsing of user input means more time for doing graphics
| and game state!
|
| Though in this case 1.5 nanoseconds vs 40 nanoseconds is small
| and won't matter for most games.
| Cthulhu_ wrote:
| More faster is more better; you only have 16 milliseconds per
| tick assuming you want to read input 60 times a second; some
| games poll even faster or run at a higher frame rate.
| nmilo wrote:
| It's still a difference of 38.5 _nano_seconds. You have a
| budget of 16,000,000 nanoseconds per frame. It's like Jeff
| Bezos writing about how to save money at Walmart; just pick
| the readable and stupid approach and move on. Game loops are
| made fast or slow by the efficiency of their tight loops: the
| render loop, physics update loop, UI loop, etc. Something
| that happens once per frame is barely ever worth considering.
| sgtnoodle wrote:
| Ok, but 40ns of 16ms is 0.00025% of your frame time.
| dahart wrote:
| It doesn't matter for user input, the 40ns won't make a
| difference even at 60fps or higher with 4 players. (Have
| written the controller processing loop for a console game
| engine.) I suspect it would be less than 40ns if float was used
| instead of double; there's no good reason to use double here at
| all. But the proposed solution is better, and super easy
| (arguably easier than atan2+sincos), so maybe there's no reason
| not to use it and save a few microseconds if it happens to add
| up to that much.
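| For reference, the trig baseline is presumably something like this
| (untested fp64 sketch, not necessarily the article's exact code):
| 
|     #include <math.h>
| 
|     /* Snap the angle to the nearest multiple of 45 degrees and
|        convert back to a unit vector. */
|     static void round_via_trig(double x, double y,
|                                double *outx, double *outy) {
|         const double pi = 3.14159265358979323846;
|         const double a  = atan2(y, x);                      /* (-pi, pi] */
|         const double q  = round(a * 4.0 / pi) * (pi / 4.0); /* nearest 45 deg */
|         *outx = cos(q);
|         *outy = sin(q);
|     }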
|
| I could imagine other uses of this code, like a UI wind map, or
| some kind of randomized maze, or something else. There might be
| reasons to round a vector of directions all at once.
| iainmerrick wrote:
| Yeah, it's a cute optimisation, and might make sense in a
| game engine; I just question whether this would ever be
| worthwhile writing as custom code in a single game.
|
| I'm not sure I agree it's more readable. The optimised
| version is reasonably regular, but it's still longer and has
| more constants, hence more opportunities for sign errors and
| the like.
|
| It feels sort of like jumping through hoops to avoid a matrix
| inversion, say. Sure, matrix inversion is "slow", but unless
| you're doing it millions of times a frame it doesn't remotely
| matter. If matrix inversion or atan2 is the most direct way
| to express a transformation, just do it that way.
| dahart wrote:
| I think the atan2+sincos version is more readable, so I
| guess we actually do agree there. But I wouldn't (and
| didn't) use that in a game engine. For rounding and
| readability in a real engine, I'd name 4 of the 8
| directions and look for the largest absolute value dot
| product. That would at least corral the constants. And I'd
| use (did use) f16 or integer arithmetic.
|
| One reason I have to support this kind of optimization and
| not reach for the trig method by default is the last time I
| wrote a controller system, it wasn't limited to something
| as small as deciding on a single dpad-like direction, it
| was a much larger system that designers could author
| controller recognizers for, up to and including drawing
| shapes like a circle with the stick, or entering a time
| based sequence of inputs (think Contra's famous up-down-up-
| down-etc-etc cheat code). If you have sequences and
| multiple players and high frame rate and many possible
| patterns to identify, that's when efficiency of these small
| operations really starts to add up. I actually spent the
| most development time ensuring that the amount of time
| spent _not_ recognizing controls was minimized, making sure
| that a no-op pass through the system was measured in
| nanoseconds and not microseconds.
| moralestapia wrote:
| This but in a 2-sphere space is what I need right now for a
| quantum computing simulator I'm coding for fun/learning.
|
| I'm doing a different thing tho, dot product vs. four cardinal
| points.
| jameshart wrote:
| Seems to rest on the assumption that an input control provides a
| vector on the unit circle, and tries to divide that into eight
| regions, corresponding to unit vectors aligned to compass
| directions. That seems like an odd assumption though - most
| analog XY inputs produce a vector with x and y components
| nominally ranging from -1 to 1 (although in practice
| constrained by calibration, and not necessarily linear in control
| stick deflection).
|
| This approach effectively divides the 'unit square' of possible
| input vectors up into a grid, based on cos(pi/8) and cos(3pi/8)
| gridlines, with basically 9 possible areas in each quadrant
| mapped to corresponding directional vectors.
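| Concretely, for an input that really does lie on the unit circle,
| the sector test reduces to comparing each component against
| sin(pi/8) ~= 0.3827 - something like this sketch (untested, and not
| necessarily the article's exact code):
| 
|     #include <math.h>
| 
|     /* Assumes (x, y) is on or near the unit circle: a component snaps
|        away from zero once its magnitude exceeds sin(pi/8). */
|     static void round_to_compass(double x, double y,
|                                  double *outx, double *outy) {
|         const double t = 0.3826834323650898; /* sin(pi/8) = cos(3*pi/8) */
|         double ox = (fabs(x) > t) ? (x > 0 ? 1.0 : -1.0) : 0.0;
|         double oy = (fabs(y) > t) ? (y > 0 ? 1.0 : -1.0) : 0.0;
|         if (ox != 0.0 && oy != 0.0) { /* diagonal: rescale to unit length */
|             ox *= 0.70710678;
|             oy *= 0.70710678;
|         }
|         *outx = ox;
|         *outy = oy;
|     }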
|
| It may not be a bad strategy in general to use this kind of grid
| model, but it needs testing with actual controllers to see what
| the 'feel' ends up being. While this mathematical model makes
| sense in theory, in practice there's no reason to assume that
| these cosine values are necessarily good thresholds to use to
| create a compass direction control that actually feels intuitive.
|
| In general in games, math needs to be a guide, but not a
| straitjacket for how you approach things.
| dahart wrote:
| This is a good point; some controllers are neither circular nor
| square. Check out the mapping ranges for a couple of Nintendo's
| old controllers: https://electromodder.co.uk/
|
| That said, Lemire's approach rewritten as dot products might
| work nicely and even look a bit more clean, even if it took 3ns
| instead of 1.5.
| Arrath wrote:
| > This is a good point; some controllers are neither circular
| nor square. Check out the mapping ranges for a couple of
| Nintendo's old controllers: https://electromodder.co.uk/
|
| Very interesting, thanks! I've noticed myself that sometimes
| if I'm holding the controller a bit skewed my natural impulse
| will be to push in directions that don't exactly map to what
| the controller wants. E.g. I think I'm pushing forward, but I'm
| actually moving forward and slightly to the left. And it's not
| just stick drift, it's my own user error.
| dahart wrote:
| I do the same. Some difficult and brutal platformers will
| really help you notice how much user error you bring to the
| table. :P Good point too, because for fast twitchy games
| where the directions really matter, it's quite nice to be
| able to calibrate and/or bias the direction rounding to
| compensate for my own bias. This could be one good reason
| to go with the trig code instead - offsetting the angle
| becomes trivial. It could be done under the covers before
| the rounding code sees it, but it'd still be doing trig
| somewhere.
| jansan wrote:
| Two more optimizations:
|
| - Use _else if_ to avoid unnecessary condition checking.
|
| - Only check for _xneg_ / _yneg_ if _outx_ / _outy_ are non-zero.
| dundarious wrote:
| Almost certainly makes it worse to add additional conditional
| dependencies, meaning branches upon branches.
|
| Some examples:
|
| 1. Branches with dependencies.
| 
|     static void minMax(u8* pxmin, u8* pxmax, const u8* xs, size_t n) {
|         const u8 init = n > 0u ? xs[0] : (u8)0u;
|         u8 xmin = init;
|         u8 xmax = init;
|         for (size_t i = 1u; i < n; ++i) {
|             const u8 x = xs[i];
|             if (x < xmin) { xmin = x; }
|             else if (x > xmax) { xmax = x; }
|         }
|         *pxmin = xmin;
|         *pxmax = xmax;
|     }
|
| 2. Branches w/o dependencies.
| 
|     static void minMax(u8* pxmin, u8* pxmax, const u8* xs, size_t n) {
|         const u8 init = n > 0u ? xs[0] : (u8)0u;
|         u8 xmin = init;
|         u8 xmax = init;
|         for (size_t i = 1u; i < n; ++i) {
|             const u8 x = xs[i];
|             if (x < xmin) { xmin = x; }
|             if (x > xmax) { xmax = x; } /* not doing else if */
|         }
|         *pxmin = xmin;
|         *pxmax = xmax;
|     }
|
| Compiling with auto-vectorization enabled (-O3 for most
| versions of GCC, and I was using -march=native on a Zen1 AMD
| CPU, so AVX2 capable), version 2 is 0.05x the runtime of
| version 1 (so a _lot_ faster). Version 2 disassembly will show
| lots of "packed" vector instructions.
|
| 3. Branchless.
| 
|     inline u8 minBranchless(u8 x, u8 y) {
|         return x * (u8)(x <= y ? 1u : 0u) + y * (u8)(x > y ? 1u : 0u);
|     }
| 
|     inline u8 maxBranchless(u8 x, u8 y) {
|         return x * (u8)(x >= y ? 1u : 0u) + y * (u8)(x < y ? 1u : 0u);
|     }
| 
|     static void minMax(u8* pxmin, u8* pxmax, const u8* xs, size_t n) {
|         const u8 init = n > 0u ? xs[0] : (u8)0u;
|         u8 xmin = init;
|         u8 xmax = init;
|         for (size_t i = 1u; i < n; ++i) {
|             const u8 x = xs[i];
|             xmin = minBranchless(xmin, x);
|             xmax = maxBranchless(xmax, x);
|         }
|         *pxmin = xmin;
|         *pxmax = xmax;
|     }
|
| Version 3 is probably a better comparison for the code from the
| article -- there are only 1 or 2 joystick direction vectors
| that need normalization, so perhaps not enough to show major
| vectorization benefit, or to even prompt compiler auto-
| vectorization. Version 3 will most likely be compiled into cmov
| instructions on x86_64, which are branchless, but branches can
| still be better, if they can be predicted well. The test data I
| used was random, so cmov should be expected to be beneficial.
|
| Last time I ran comparisons, Version 2 was 0.07x the runtime of
| Version 3, so Version 3 was not that much faster than Version
| 1, but still a good improvement.
|
| As a final note, I should say I'm not a fan of _relying_ on
| compiler auto-vectorization. If you depend on the performance
| of vectorized code (and vectorizing code is one of the best
| optimizations out there), then I'm a fan of explicitly
| vectorizing. SSE2 is in nearly 100% of relevant x86_64 systems
| (first coming out in 2003). If you're targeting consumers, AVX
| is almost certainly worth it. If you're writing server code,
| you should be taking full advantage of AVX2 and maybe AVX-512.
| For ARM, NEON is the current vector extension of relevance, and
| you can polyfill with DLTcollab/sse2neon or something like that,
| until you go native. Writing vectorized code using SIMD
| intrinsics (#include <immintrin.h>) is preferable, especially
| since compilers just cannot know how to auto-vectorize a _lot_
| of floating point code, because of strict requirements around
| non-associativity, so they almost _never_ generate packed FMADDs.
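| For the min/max example above, an explicitly vectorized SSE2 version
| would look roughly like this (untested sketch; u8 is assumed to be a
| typedef for uint8_t):
| 
|     #include <emmintrin.h>  /* SSE2 */
| 
|     static void minMaxSSE2(u8* pxmin, u8* pxmax, const u8* xs, size_t n) {
|         u8 xmin = n > 0u ? xs[0] : (u8)0u;
|         u8 xmax = xmin;
|         size_t i = 0u;
|         if (n >= 16u) {
|             __m128i vmin = _mm_loadu_si128((const __m128i*)xs);
|             __m128i vmax = vmin;
|             for (i = 16u; i + 16u <= n; i += 16u) {
|                 const __m128i v = _mm_loadu_si128((const __m128i*)(xs + i));
|                 vmin = _mm_min_epu8(vmin, v);  /* packed unsigned byte min */
|                 vmax = _mm_max_epu8(vmax, v);  /* packed unsigned byte max */
|             }
|             u8 tmp[16];
|             _mm_storeu_si128((__m128i*)tmp, vmin);
|             for (int k = 0; k < 16; ++k) { if (tmp[k] < xmin) xmin = tmp[k]; }
|             _mm_storeu_si128((__m128i*)tmp, vmax);
|             for (int k = 0; k < 16; ++k) { if (tmp[k] > xmax) xmax = tmp[k]; }
|         }
|         for (; i < n; ++i) {  /* scalar tail for the last n % 16 elements */
|             if (xs[i] < xmin) xmin = xs[i];
|             if (xs[i] > xmax) xmax = xs[i];
|         }
|         *pxmin = xmin;
|         *pxmax = xmax;
|     }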
|
| As for the article code, I wouldn't make any assumptions about
| whether the compiler is generating branches or cmov-s. I'd
| guess cmov-s would be faster, but I haven't really thought
| about it. All I'll say is that turning a lot of `if`s into
| `else if`s would make it impossible to generate cmov-s, and
| chaining branches just to avoid assigning to a local that's
| almost certainly in a register is only going to make branching
| slower.
| jesse__ wrote:
| Loved this, thanks.
| WithinReason wrote:
| Isn't the ternary operator a branch? Could you just write
| (u8)(x <= y)?
| dundarious wrote:
| In x86_64, a branch is typically considered to be a
| comparison and then a jump to different code (otherwise
| fall through to following code). Test sets some control
| flags, and the specific jump type jumps based on the value
| of a specific control flag. You might have something like:
|         mov rax,rbx
|         test rcx,rcx
|         jnz L1
|         mov rax,rdx
|     L1: ret
|
| A cmov instruction is a conditional move. It says, based on
| specific control flags, do a mov, otherwise do nothing.
|         mov rax,rbx
|         test rcx,rcx
|         cmovz rax,rdx
|         ret
|
| The conditional move is not really branching in the classic
| meaning of the word, because there is no jump (the
| instruction pointer unconditionally moves forward to the
| next instruction).
|
| Branches, even well-predicted ones, can hurt performance,
| especially if the jumped-to code is not in cache (needs to
| be fetched). More detail can be found in this nice top-to-
| bottom video from Casey Muratori, which goes over modern
| CPU architecture (it's nothing too fancy, I think every
| person doing optimization should know this stuff):
| https://www.youtube.com/watch?v=8VakkEFOiJc
|
| You can imagine the compiler generating either assembly
| from if style code or from ternary style code. Writing the
| arithmetic ternary style just happens to nudge the compiler
| into using cmov-s more often than other ways. For simple
| cases, a smart enough compiler could generate cmov-s from
| the if-else case as well, but in practice they often aren't
| smart enough, even for toy examples. Never mind that cmov-s are not
| unconditionally better -- a well-trained branch predictor
| is faster. The question is if your data is conducive to
| that or not -- something the compiler may never infer. So
| we're left to encode such information very indirectly via
| the code patterns, or via all sorts of __builtin_expect
| style intrinsics.
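| A toy illustration of the two styles (what the compiler actually
| emits, cmov or a compare-and-jump, depends on the compiler, flags
| and target, so check the disassembly rather than assuming):
| 
|     /* Often nudges compilers toward cmov. */
|     static inline int pick_ternary(int a, int b, int cond) {
|         return cond ? a : b;
|     }
| 
|     /* May instead become a compare plus conditional jump. */
|     static inline int pick_branchy(int a, int b, int cond) {
|         if (cond) { return a; }
|         return b;
|     }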
|
| This all goes to show that C is no longer a portable
| assembler, and there are all sorts of crucial resources on
| the CPU (cmov, SIMD, etc.) that may go unused based on
| slight differences in the C code (e.g., even little things
| that might seem correct, like using unsigned for an
| accumulator, can ruin its ability to generate cmov-s).
|
| Compilers are really good at turning code that multiplies
| by a constant into a weird mov and lea that's cheaper than
| a mul, but they're still very spotty on everything else,
| and "everything else" covers a huge amount of the stuff
| that has made CPUs faster in the last 15+ years.
| Understanding your specific platform target(s) is still
| crucial if you want to do better than just "get lucky with
| the compiler sometimes".
| jcranmer wrote:
| > SSE2 is in nearly 100% of relevant x86_64 systems (first
| coming out in 2003).
|
| Not nearly, exactly 100%. The 64-bit ABI specifically
| requires SSE2 support.
| dundarious wrote:
| Last time I talked with someone about this, they convinced
| me there was some rare, possibly even research-only x86_64
| chip that didn't support it, so my point is that if my hazy
| memory of someone else's argument is correct, that it
| doesn't matter, treat it as if it's exactly 100% anyway.
|
| It's 100% of what matters, so in practice I'm in full
| agreement with you.
| WithinReason wrote:
| Would _else if_ actually be faster on a modern CPU
| architecture?
| pklausler wrote:
| Only measurement with real data would tell for sure, but I
| would bet "yes" if I had to. With the "else", the "if"
| predicate is being predicted over a smaller amount of
| relevant cases; without the "else", the prediction is going
| to have to cope with some amount of needless false results
| and is likely to be less accurate.
| adrian_b wrote:
| As written, the compiler can implement the "if"s with
| conditional move instructions and that would be faster than
| with conditional branches, as those would be hard to predict in
| this case.
|
| Using "else if" is a valid optimization in general, but not
| necessarily when your target is a branchless program. Using
| "else if" may help in some cases, but you must have in mind
| exactly which CMOV instructions will be generated by the
| compiler, to be sure that there is any improvement, i.e. any
| reduction in the total number of CMOV instructions, and that
| you are not increasing the length of some chains of dependent
| instructions (leading to a greater latency).
| dahart wrote:
| Is the problem with the sin/cos approach that it's using double
| precision? That's surely completely unnecessary precision. These
| days I don't know what happens in x86 land with sin & cos, but on
| a gaming GPU the fp32 cosf() is typically 16 or 32 times faster
| than fp64 cos(). Is that true on CPUs too? (I'm sure the simple
| proposed solution is always faster even in fp32, I'm just curious
| if the gap is much narrower, smaller than 25x.)
| hansvm wrote:
| The precision doesn't matter much on the CPU (doubles have roughly
| half the throughput if you're doing a lot of them). You have a data
| dependency on the arctangent, and transcendentals are slow
| anyway, so you're not going to have a good time. You can even
| compute the sin and cos together and still expect a bare
| minimum latency of 50 clock cycles for the arctangent followed
| by 50 for the sincos, usually implemented with a variable speed
| algorithm that might be slower (i.e., lower bound of 30ns of
| latency on a moderately fast single core, just for the trig).
|
| Throughput can be a bit higher with vectorization and
| pipelining, but more or less any such optimization you can make
| to the transcendentals you can also make on the proposed
| solution.
| dahart wrote:
| Thanks, good to know! So that does suggest to me that if
| there are other threads in progress, the actual impact to the
| runtime in practice probably would be lower with fp32 - the
| half throughput rate does tell us how long we wait for any
| instructions dependent on the operation, right? I'm saying I
| suspect the total latency can be filled with other work, so
| this isn't really 40ns that is fully devoted to atan2+sincos
| and nothing else. Is that accurate? (On the GPUs I use, trig
| functions use a different math pipe than FMA, and so they can
| be 'free' if you don't use too many of them and there is
| enough logic or FMA to hide the trig latency.)
|
| Aren't the benefits of using fp32 bigger than the trig
| instruction throughput too? Doubles eat more register space,
| and double the bandwidth to memory if anything ends up in
| memory. I just don't see any reason whatsoever to use doubles
| for this operation. If we're going to worry about
| nanoseconds, then these extra bytes matter too, no?
| hansvm wrote:
| The throughput comment was just that we have fixed-width
| vector instructions which support being filled with and
| doing trig on either 32-bit or 64-bit floats. They both
| occupy the same floating point execution units (mostly, see
| the AMD Bulldozer/Steamroller/... fiasco from a few years
| ago), so you're either filling them with N doubles or 2N
| floats. You'd be hard-pressed to squeeze other useful work
| into that space. The actual difference between float/double
| in the amount of work done is otherwise minuscule and
| mostly related to teasing out the extra precision. With
| fast-converging algorithms it doesn't wind up mattering
| much in terms of total latency.
|
| "Latency" is what I was referring to with respect to how
| long you have to wait to execute a data-dependent
| instruction. "Throughput" refers to how many you can have
| in flight and how fast those execute.
|
| Latency _can_ be filled with other work, but as I
| mentioned, it can be filled better with other work using
| the faster non-trig solution. As a rule of thumb, you won't
| squeeze in more than 4 in-flight instructions per core, so
| you're probably not going to eke out better than a 10ns average
| throughput for the trig solution, which is still worse than the
| proposed optimization even assuming it couldn't be pipelined at all.
| You're right that float/int execution units are usually
| separate, but they do have some shared resources. If you're
| doing enough logic relative to these computations then yes
| these could sometimes be basically free.
|
| The register/cache/RAM/disk space and bandwidth very well
| might wind up dominating every other consideration. As
| always, it depends what else you're doing and what exactly
| you're doing with these computations. To give a useful
| counter-example, consider matrix multiplication. It can be
| structured such that even if you have to read/write the
| matrices from disk because of their size it's still a CPU-
| bound operation because you can efficiently use each data
| access many times. In that case you really do just have
| typically a 2x slowdown for doubles because you have half
| the elements per vector.
|
| Might f32 be sufficient for this problem? Almost certainly
| IMO. It might not matter though (or it might -- as always,
| measure).
| dahart wrote:
| FWIW, for a game controller, I'm absolutely certain that
| f32 is already overkill and that a half float or even a
| byte per axis, u16/u8, is sufficient. The only reason to
| need f32 or f64 is if you're rounding direction vectors
| for some other purpose than a game controller. But in
| graphics we frequently use f16 for direction vectors even
| when f32 is needed for positions. Normals and unit sphere
| directions usually require less precision than other
| operations. Using f64 for a game controller stick input
| isn't justifiable, even if it isn't measurable in code
| speed, I don't think it would fly or pass code review in
| a real game engine. My next concern would be whether
| using doubles willy-nilly leads to the compiler spilling
| registers to memory more often. That could easily undo
| the 40ns optimization, and it would be difficult to
| measure or blame the direction rounding code, it would
| diffuse the cost to other parts of the code.
| phkahler wrote:
| It may be useful to read about the CORDIC algorithms. It just
| feels like there is an even simpler solution lurking here:
|
| https://en.wikipedia.org/wiki/CORDIC
| thexa4 wrote:
| Taking the dot product of [x, y] with each of the 8 directions
| and picking the one with the largest value might be faster in
| some cases.
| rocqua wrote:
| You only need 4 directions, and can look at the sign of the
| dot-product.
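| A minimal scalar sketch of that idea (hypothetical, untested):
| 
|     #include <math.h>
| 
|     /* Returns an index 0..7: 0 = east, counting counter-clockwise in
|        45-degree steps. Only 4 dot products are needed; the sign picks
|        between a direction and its opposite. */
|     static int nearest8(double x, double y) {
|         static const double dx[4] = { 1.0, 0.70710678, 0.0, -0.70710678 };
|         static const double dy[4] = { 0.0, 0.70710678, 1.0,  0.70710678 };
|         int best = 0, negate = 0;
|         double bestAbs = -1.0;
|         for (int i = 0; i < 4; ++i) {
|             const double d = x * dx[i] + y * dy[i];
|             const double a = fabs(d);
|             if (a > bestAbs) { bestAbs = a; best = i; negate = d < 0.0; }
|         }
|         return negate ? best + 4 : best; /* +4 selects the opposite direction */
|     }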
| tgv wrote:
| Good idea. Only four inner products and comparisons are needed:
| the other four have the same value, but negated. Might be
| SIMD-able.
| Someone wrote:
| I was thinking about a more complicated approach (edit2: that
| ends up simpler). A 4-way version of this can be done with
| two comparisons ( _if x > y_ to check on what side of one
| diagonal the point lies and _if x > -y_ to check the same for
| the other diagonal, plus some bit twiddling to get the
| required two bits out).
|
| One can do that, rotate by 45 degrees, do that again, and do
| some further bit manipulation to combine the two bits of the
| first 4-way version with those of the second into the three
| bits this should produce.
|
| That would be four comparisons, a 2x2 matrix multiplication
| (edit: that's a matrix-vector multiplication) with a known-
| ahead rotation matrix (naively, that would be 8 (edit: 4)
| multiplications and 4 (edit: 2) additions, but I think we can
| lose those multiplications because all elements of a 45
| degree rotation matrix have absolute value 1/√2 and the
| problem is scale-invariant) and some bit twiddling.
|
| Now, make that branchless...
|
| Edit2: and of course, you can do both splits into quadrants
| by just looking at the signs of x and y. That rotates the
| four quadrants by 45 degrees, but that doesn't matter.
|
| That makes the 4-way version simply _(x > 0) + 2*(y > 0)_: two
| comparisons and a bit shift.
| 
|     twoBits     = (x > 0) + 2*(y > 0)
|     x'          = x + y
|     y'          = x - y
|     twoMoreBits = (x' > 0) + 2*(y' > 0)
| 
|     (Some bit twiddling with 'twoBits' and 'twoMoreBits'. Feel free to
|     compute those bit pairs differently if it makes the bit twiddling
|     easier.)
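| One hypothetical way to finish that bit twiddling (untested): fold
| the four sign bits into a table index and look the octant up.
| Boundary cases (components exactly zero) and the sign combinations
| that can't occur aren't handled carefully here.
| 
|     /* Octant 0 = east, counting counter-clockwise in 45-degree steps. */
|     static int octant(double x, double y) {
|         const int sx = x > 0.0, sy = y > 0.0;
|         const int su = (x + y) > 0.0, sv = (x - y) > 0.0;
|         static const int lut[16] = {
|             /* index = sx | sy<<1 | su<<2 | sv<<3 */
|             4, 0, 3, 0, 0, 0, 2, 1, 5, 6, 0, 0, 0, 7, 0, 0
|         };
|         return lut[sx | (sy << 1) | (su << 2) | (sv << 3)];
|     }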
___________________________________________________________________
(page generated 2022-07-25 23:01 UTC)