[HN Gopher] You probably shouldn't use a lookup table (2022)
___________________________________________________________________
You probably shouldn't use a lookup table (2022)
Author : segfaultbuserr
Score : 68 points
Date : 2023-10-09 18:48 UTC (4 hours ago)
(HTM) web link (specbranch.com)
(TXT) w3m dump (specbranch.com)
| dragontamer wrote:
| This is a lot of words about things that probably don't matter.
|
| IMO, the bottleneck of lookup tables is concisely described as
| follows: modern CPUs can do 2 or 3 lookups in L1 cache per
| cycle, and 10x to 500x fewer depending on how far away the data
| is (L2, L3, near DDR, or remote DDR).
|
| Meanwhile, modern CPU instructions like AES, Multiply, XOR, ADD
| and more can effectively operate 3, 4, or even more times per
| clock tick regardless of circumstances.
|
| --------
|
| Don't even talk about cache hierarchies. You are already behind
| at the L1 cache level due to the relatively few load/store units
| in modern CPU (or GPU) cores. And hitting L2 or L3 cache gets
| exponentially worse.
| jjtheblunt wrote:
| Why exponentially?
| ilyt wrote:
| > Meanwhile, modern CPU instructions like AES, Multiply, XOR,
| ADD and more can effectively operate 3, 4, or even more times
| per clock tick regardless of circumstances.
|
| I don't think people are making lookup tables to add or xor
| 2 numbers...
| dragontamer wrote:
| Between BMI instructions (PEXT, PDEP, bit reverse, least
| significant set bit, etc.), AES, SHA1, add, xor, rotate,
| shift, and, or, not, multiply, carryless multiply and...
|
| Pshufb (4-bit / 16-byte lookup table)
|
| You might be surprised at how many operations can be done
| without touching the load/store units on CPUs.
|
| In my experience, it's usually worth the effort. Then again,
| it takes a LOT of effort.
|
| If you are just doing things simply and easily? Lookups are
| almost the easiest solution. But I probably wouldn't reach
| for lookup tables for a performance based problem today.
| pjz wrote:
| Here's an in-the-wild, long-standing example of a table used
| for bit reversals: https://gitlab.com/libtiff/libtiff/-/blob/
| master/libtiff/tif...
|
| The function to do the same for a byte is... what? 3 or 4
| lines long in C? I guess it takes a couple of temp variables,
| though, which may dissuade some devs.
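|
| A rough sketch of that (the classic shift-and-mask trick, not
| the libtiff code, just for illustration):
|
|     unsigned char reverse8(unsigned char b) {
|         /* swap nibbles, then bit pairs, then adjacent bits */
|         b = (unsigned char)((b & 0xF0) >> 4 | (b & 0x0F) << 4);
|         b = (unsigned char)((b & 0xCC) >> 2 | (b & 0x33) << 2);
|         b = (unsigned char)((b & 0xAA) >> 1 | (b & 0x55) << 1);
|         return b;
|     }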
| a1369209993 wrote:
| That's not "A lookup table is faster than a native bit
| reversal instruction."; that's "There's no portable way to
| make the compiler emit a native bit reversal instruction,
| and a lookup table is (presumably) faster than a couple
| dozen instructions to do it manually (at least on most
| target architectures).".
| jeffbee wrote:
| What about something like https://github.com/abseil/abseil-
| cpp/blob/master/absl/string... ? ASCII lowercase can be
| something like:
|
|     auto is_upper = (x > '@') & (x < '[');
|     return is_upper ? x + 32 : x;
|
| Which boils down to add, add, cmp, cmov on x86. It might
| save some cache traffic, but the lookup costs less code,
| since it's only a mov. The version that's code instead of
| lookups probably exacerbates register pressure problems in
| large programs? I don't have good intuition for which would
| be more efficient in practice.
| tgv wrote:
| Except that in Unicode, this doesn't work.
| dragontamer wrote:
| The example code is a 256-byte lookup table, meaning the
| original example won't work with Unicode either.
|
| In fact, I'm pretty sure that Unicode cannot be solved
| with a lookup table due to its variable length, but maybe
| you can prove me wrong? (Assuming UTF8 here)
| tgv wrote:
| 256 is indeed pretty limited.
|
| Unicode does have a limited space, but it cannot
| practically be stored in a single table. It currently runs
| up to 0x323AF, a bit over 200k code points, and most of
| the characters of course don't have a lower/uppercase
| mapping. The implementations I've seen do a few
| comparisons and then delegate to a table.
|
| But: horses for courses. If you have to normalize a lot
| of Latin-1 text (such as transforming case or stripping
| diacritics), you can probably write some vector
| instructions that run circles around a simple LUT. But
| it's not going to be as easy.
| nwellnhof wrote:
| Yes, this is a good example where LUTs are very likely to
| be slower, especially if you can avoid branch
| mispredictions with cmov.
| dragontamer wrote:
| Your 8-bit lookup can never be parallelized, while the
| add/cmp/cmov is really easy AVX512 that probably auto-
| inlines and auto-vectorizes.
|
| I dunno, by my instinct the plain code is ridiculously
| beneficial, while the lookup table looks pretty bad.
| jeffbee wrote:
| > can never be parallelized
|
| I mean, I'm not skilled enough in those ISA extensions to
| stick my neck out. It's not totally obvious to me that
| there is not some shuffle or permute facility that can
| load 64 bytes at a time from LUTs.
| dragontamer wrote:
| > It's not totally obvious to me that there is not some
| shuffle or permute facility that can load 64 bytes at a
| time from LUTs.
|
| You are talking about vgather and/or vscatter, which are
| well known to be very slow AVX2 or AVX512 instructions.
|
| Maybe a future CPU will make these instructions high
| performance. But no modern 2023-era CPU has a high-speed
| vgather.
|
| Like: vgather does its loads one at a time, slowly. You
| basically lose the parallelism: even if the vgather
| instruction describes what you want to do, it's not an
| effective parallel operation today (and may never be).
|
| --------
|
| Pshufb as a 4-bit LUT is an exception and is effectively
| a high speed (but very very small) lookup table. Like
| every cycle 64 bytes at a time fast.
|
| You are limited by the 16-byte lookup size (aka the size
| of an SSE register), maybe a bit bigger if there are new
| instructions I dunno about.
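|
| A minimal sketch of the trick (SSSE3 intrinsics; per-byte
| popcount via two 16-entry in-register nibble lookups; the
| function name is just for illustration):
|
|     #include <tmmintrin.h> /* SSSE3: _mm_shuffle_epi8 */
|
|     static __m128i popcount_bytes(__m128i v) {
|         /* lut[i] = number of set bits in nibble i */
|         const __m128i lut  = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3,
|                                            1, 2, 2, 3, 2, 3, 3, 4);
|         const __m128i mask = _mm_set1_epi8(0x0F);
|         __m128i lo = _mm_and_si128(v, mask);
|         __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), mask);
|         /* two pshufb lookups, no load/store unit involved */
|         return _mm_add_epi8(_mm_shuffle_epi8(lut, lo),
|                             _mm_shuffle_epi8(lut, hi));
|     }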
| gpderetta wrote:
| Vectorized gather loads can do just that, but they are
| currently pretty underwhelming on x86 machines.
| gpderetta wrote:
| Although it can't be easily vectorized, it can be
| pipelined, which can help both hide the access latency and
| keep the lookup table hot. The code is not going to be
| pretty, though.
| vardump wrote:
| > Which boils down to add, add, cmp, cmov on x86.
|
| By using SIMD instructions like AVX2 you can do 32
| characters in parallel. With AVX512, 64.
|
| Pretty easy to write something way faster by not using a
| LUT.
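|
| Roughly like this, as an untested sketch (AVX2 intrinsics,
| ASCII only; the function name is just for illustration):
|
|     #include <immintrin.h>
|
|     /* lowercase 32 bytes at once: add 0x20 where 'A' <= x <= 'Z' */
|     static __m256i tolower32(__m256i x) {
|         __m256i gt = _mm256_cmpgt_epi8(x, _mm256_set1_epi8('A' - 1));
|         __m256i lt = _mm256_cmpgt_epi8(_mm256_set1_epi8('Z' + 1), x);
|         __m256i is_upper = _mm256_and_si256(gt, lt);
|         return _mm256_add_epi8(x,
|             _mm256_and_si256(is_upper, _mm256_set1_epi8(0x20)));
|     }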
| robocat wrote:
| Comedy: did you notice that the next static definition is
| for a 256 byte lookup table for the unreversed bits too:
| TIFFNoBitRevTable[256] = { 0x00, 0x01, 0x02, 0x03,
| 0x04, 0x05, 0x06, 0x07, ... 0xFF }
|
| Maybe PSHUFB/VTBL to reverse each nibble and then put the
| reversed nibbles together would be wayyy faster?
|
| I'm guessing portability matters here, and there is not
| much commercial pressure to actually optimise this TIFF
| library?
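|
| Roughly the idea, as an untested sketch (SSSE3; reverse each
| nibble through a 16-entry in-register table, then swap the
| nibbles back together):
|
|     #include <tmmintrin.h>
|
|     /* bit-reverse each of 16 bytes using PSHUFB as a 4-bit LUT */
|     static __m128i bitrev_bytes(__m128i v) {
|         /* rev4[i] = i with its 4 bits reversed */
|         const __m128i rev4 = _mm_setr_epi8(
|             0x0, 0x8, 0x4, 0xC, 0x2, 0xA, 0x6, 0xE,
|             0x1, 0x9, 0x5, 0xD, 0x3, 0xB, 0x7, 0xF);
|         const __m128i mask = _mm_set1_epi8(0x0F);
|         __m128i lo = _mm_and_si128(v, mask);
|         __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), mask);
|         /* reversed low nibble becomes the new high nibble */
|         return _mm_or_si128(
|             _mm_slli_epi16(_mm_shuffle_epi8(rev4, lo), 4),
|             _mm_shuffle_epi8(rev4, hi));
|     }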
| capitainenemo wrote:
| A subset of (3), "When you want the results to be very
| predictable", is the reason for Hedgewars' sine lookup table.
| Deterministic lockstep has always had issues with floating point
| variations across platforms, and I imagine that applies to other
| things as well...
| fluoridation wrote:
| This is very long for something that's quite simple: If your
| input space is small enough that it fits into an 8- or 16-bit
| LUT, try it. If it's faster, use the LUT; if it's not, don't use
| it. Really, that's all that needs to be said. The cases where
| LUTs this small crop up are so rare that you lose nothing by
| trying. There's no use in trying to predict the performance when
| you can just measure it.
| remram wrote:
| I agree. Do a benchmark end-to-end. Measuring the lookup alone
| and theorizing is worth nothing.
| jameshart wrote:
| Not hugely convincing to put a toy model up against a micro
| benchmark and claim the toy model proves the micro benchmark
| wrong.
|
| The correct way to compare these approaches is with real code.
| vardump wrote:
| Most of the time the solution that pollutes L1 caches more is
| going to lose in real life use. LUTs do that a lot, but of
| course ballooned code can hit L1C cache badly as well.
|
| You can compute a ton on a modern CPU during what a memory
| access would take.
| pvorb wrote:
| The correct way to compare these two approaches is to measure
| the difference on _your individual program_ and not rely on
| benchmarks that measure different programs.
|
| But I also understand that you can't do that for every decision
| you need to make, so at least this article gives a hint that
| lookup tables are something that might be worth measuring.
| coolhand2120 wrote:
| In my work LUTs are a reaction to really poor performance, not a
| matter of course. Every time I reach for one my inner dialog says
| "you (probably) should use a lookup table".
| gpderetta wrote:
| For the most part I agree. This can be generalized with the
| observation that a) it is sometimes faster to recompute a function
| than to cache it, and b) cache is usually a more limited resource
| than CPU cycles.
| vardump wrote:
| Same. I've replaced a ton of LUTs with pure [vectorized] code
| and most of the time it's a big win. Some computations can of
| course be too heavy, but that's pretty rare nowadays.
| rented_mule wrote:
| I ran into CPU cache issues in a slightly different application
| 15+ years ago. I was working on indexing gigabytes of text on a
| mobile CPU (before smart phones caused massive investment in such
| CPUs). Word normalization logic (e.g., sky/skies/sky's -> sky)
| was very slow, so I used a word cache which sped it up immensely.
|
| The initial version of the word cache cleared the entire cache
| when it hit a certain number of entries, just as a placeholder.
| When I got around to adding some LRU eviction logic, it became
| faster on our desktop simulator, but far slower on the embedded
| device (slower than with no word cache at all). The difference
| came down to the CPU cache size / strategy differences between
| the desktop and mobile CPUs.
|
| We ended up shipping the "dumb" word cache logic because it was
| so much faster in practice. The word cache eviction function was
| only two lines of code plus a large comment that acknowledged
| that "yes, this looks dumb" and imploring that the speed of any
| future improvements be compared to the dumb version on the target
| device. Testing real data on the hardware used in practice is
| incredibly important.
| thot_experiment wrote:
| Interesting article, but it's very in the weeds on its use cases,
| and I wish the caveats on when it applies were more clearly stated.
| There are many situations where the computation being done is
| prohibitively expensive compared to the time horizon required;
| LUTs are great for that sort of thing. IMO people should use LUTs
| MORE often.
| dragontamer wrote:
| Yes use LUTs.
|
| But 4-LUT is also called pshufb and you never ever need to
| touch the relatively slow load/store units.
|
| The space of computations you can do with 4-LUT PSHUFB
| instructions, followed up with simple bit instructions (add,
| or, not, xor, multiply) is rather insane.
| tgv wrote:
| The space you can cover in a table with n values is much,
| much larger, even for small values of n (say 100).
| Tommah wrote:
| Eh, you should probably test it either way. Performance can be
| very unpredictable. I ran into a weird case years ago when I was
| writing a program that needed to calculate popcount (number of 1
| bits in a number). My method using a 256-entry lookup table
| turned out to be faster than using GCC's popcount intrinsic. (See
| [1] for a question from someone who encountered a similar
| situation. The solution provided there was to compile with
| -march=native, but I think I tried that without success.)
|
| I ran into a similar problem in another program where I needed
| 128-bit integers. I implemented them myself as a C struct
| containing two 64-bit integers. Then I discovered that GCC
| implemented its own 128-bit integer type. But that type was
| somehow slower than my implementation!
|
| [1] https://stackoverflow.com/questions/52161596/why-is-
| builtin-...
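|
| For reference, the table approach being compared is roughly
| this shape (an illustrative sketch, not the original code):
|
|     #include <stdint.h>
|
|     static unsigned char popcnt8[256]; /* filled once at startup */
|
|     static void init_popcnt8(void) {
|         for (int i = 0; i < 256; i++)
|             popcnt8[i] = (unsigned char)((i & 1) + popcnt8[i >> 1]);
|     }
|
|     static unsigned popcount32(uint32_t x) {
|         return popcnt8[x & 0xFF] + popcnt8[(x >> 8) & 0xFF]
|              + popcnt8[(x >> 16) & 0xFF] + popcnt8[x >> 24];
|     }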
| ChrisMarshallNY wrote:
| Lookup tables can be _insanely_ fast, if kept entirely in
| L[1|2] cache.
|
| We found this out, when we were optimizing image processing
| pipelines (image processing uses _lots_ of lookup tables).
|
| We found that the best way to optimize was to avoid moving out
| of cache, so we had to rewrite a lot of our stuff, so it never
| broke cache.
|
| That meant a lot of copy-and-paste repetition, strict (small)
| static functions, and hardcoded values, instead of properties
| or even constants.
|
| I don't remember all the stuff we learned, but it wasn't pretty
| (and applied only to x86 architecture. Not sure how well it
| would work, these days, on ARM).
| fluoridation wrote:
| Regardless of the arguments this article makes, it can be
| surprising. One time back in the Core 2 days I was optimizing
| an alpha blend in software and I had something like x * y /
| 255. That operation was a tiny bit slower than table[(x << 8)
| | y].
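|
| (For concreteness, a sketch of the kind of table meant here,
| assuming 8-bit values; the names are made up:)
|
|     #include <stdint.h>
|
|     static uint8_t blend_tab[256 * 256]; /* 64 KiB */
|
|     static void init_blend_tab(void) {
|         for (int x = 0; x < 256; x++)
|             for (int y = 0; y < 256; y++)
|                 blend_tab[(x << 8) | y] = (uint8_t)(x * y / 255);
|     }
|     /* then blend_tab[(x << 8) | y] replaces x * y / 255 */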
| vardump wrote:
| Perhaps a wrong link? Your link shows a case where the
| compiler's implementation is _better_ than a custom one.
| Thorrez wrote:
| Has the link been edited? And are you talking about the
| question or the answers? In the question, __builtin_popcount
| is the slowest of the 3 options on both compilers.
| o11c wrote:
| Was your problem perhaps because of the hardware bug where
| popcount unnecessarily stalled on the output as if it were an
| input?
|
| Later compilers added an explicit zero of the output to work
| around this issue.
| zschoche wrote:
| Well, it always depends on what you look up, i.e. how much time
| one needs to compute the entries of the given lookup table. If it
| takes O(2^n) time, where n is the input size, then the payback
| comes quickly. It's good advice to be very skeptical when someone
| drops general statements like that. Looking for algorithms with
| the best worst-case time complexity is a good start. It will
| always win. Theory works. The question is only how big the input
| has to be in order to see that. Here, the fun part starts, I
| guess.
___________________________________________________________________
(page generated 2023-10-09 23:01 UTC)