[HN Gopher] You probably shouldn't use a lookup table (2022)
       ___________________________________________________________________
        
       You probably shouldn't use a lookup table (2022)
        
       Author : segfaultbuserr
       Score  : 68 points
       Date   : 2023-10-09 18:48 UTC (4 hours ago)
        
 (HTM) web link (specbranch.com)
 (TXT) w3m dump (specbranch.com)
        
       | dragontamer wrote:
       | This is a lot of words about things that probably don't matter.
       | 
        | IMO: the bottleneck of lookup tables is concisely described as
        | follows: you get 2 or 3 lookups from L1 cache per cycle on
        | modern CPUs, and 10x to 500x fewer depending on how far away the
        | data is (L2, L3, near DDR, or remote DDR).
       | 
       | Meanwhile, modern CPU instructions like AES, Multiply, XOR, ADD
       | and more can effectively operate 3, 4, or even more times per
       | clock tick regardless of circumstances.
       | 
       | --------
       | 
       | Don't even talk about cache hierarchies. You are already behind
       | at the L1 cache level due to the relatively few load/store units
       | in modern CPU (or GPU) cores. And hitting L2 or L3 cache gets
       | exponentially worse.
        
         | jjtheblunt wrote:
         | Why exponentially?
        
         | ilyt wrote:
         | > Meanwhile, modern CPU instructions like AES, Multiply, XOR,
         | ADD and more can effectively operate 3, 4, or even more times
         | per clock tick regardless of circumstances.
         | 
          | I don't think people are making lookup tables to add or xor 2
          | numbers...
        
           | dragontamer wrote:
            | Between BMI instructions (PEXT, PDEP, bit reverse, least
            | significant set bit, etc.), AES, SHA1, add, xor, rotate,
            | shift, and, or, not, multiply, carryless multiply and...
           | 
           | Pshufb (4-bit / 16-byte lookup table)
           | 
           | You might be surprised at how many operations can be done
           | without touching the load/store units on CPUs.
           | 
            | In my experience, it's usually worth the effort. Then again,
            | it takes a LOT of effort.
            | 
            | If you are just doing things simply and easily? Lookups are
            | almost the easiest solution. But I probably wouldn't reach
            | for lookup tables for a performance-based problem today.
        
           | pjz wrote:
           | Here's an in-the-wild, long-standing example of a table used
           | for bit reversals: https://gitlab.com/libtiff/libtiff/-/blob/
           | master/libtiff/tif...
           | 
           | The function to do the same for a byte is... what? 3 or 4
           | lines long in C? I guess it takes a couple of temp variables,
           | though, which may dissuade some devs.
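            | 
            | Something along these lines, I think (a rough sketch of the
            | classic shift-and-mask trick, untested):
            | 
            |   #include <cstdint>
            | 
            |   // Reverse the bits of one byte: swap adjacent bits,
            |   // then bit pairs, then nibbles.
            |   uint8_t reverse_byte(uint8_t b) {
            |     b = (uint8_t)(((b & 0x55) << 1) | ((b >> 1) & 0x55));
            |     b = (uint8_t)(((b & 0x33) << 2) | ((b >> 2) & 0x33));
            |     b = (uint8_t)((b << 4) | (b >> 4));
            |     return b;
            |   }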
        
             | a1369209993 wrote:
             | That's not "A lookup table is faster than a native bit
             | reversal instruction."; that's "There's no portable way to
             | make the compiler emit a native bit reversal instruction,
             | and a lookup table is (presumably) faster than a couple
             | dozen instructions to do it manually (at least on most
             | target architectures).".
        
             | jeffbee wrote:
             | What about something like https://github.com/abseil/abseil-
             | cpp/blob/master/absl/string... ? ASCII lowercase can be
              | something like
              | 
              |   auto is_upper = (x > '@') & (x < '[');
              |   return is_upper ? x + 32 : x;
              | 
              | Which boils down to add, add, cmp, cmov on x86. It might
              | save some data-cache traffic, but the lookup version is
              | less code since it's only a mov. The version that's code
              | instead of lookups probably exacerbates register pressure
              | problems in large programs? I don't have good intuition
              | for which would be more efficient in practice.
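              | 
              | Spelled out a bit more (my own untested sketch, not the
              | Abseil code):
              | 
              |   #include <array>
              | 
              |   // Branchless: compiles down to compare + cmov.
              |   char to_lower_branchless(char c) {
              |     unsigned char x = (unsigned char)c;
              |     bool up = (x >= 'A') & (x <= 'Z');
              |     return (char)(up ? x + 32 : x);
              |   }
              | 
              |   // LUT: one indexed load from a 256-byte table.
              |   char to_lower_lut(char c) {
              |     static const auto t = [] {
              |       std::array<unsigned char, 256> a{};
              |       for (int i = 0; i < 256; ++i)
              |         a[i] = (i >= 'A' && i <= 'Z') ? i + 32 : i;
              |       return a;
              |     }();
              |     return (char)t[(unsigned char)c];
              |   }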
        
               | tgv wrote:
               | Except that in Unicode, this doesn't work.
        
               | dragontamer wrote:
               | The example code is a 256-byte lookup table, meaning the
               | original example won't work with Unicode either.
               | 
               | In fact, I'm pretty sure that Unicode cannot be solved
               | with a lookup table due to its variable length, but maybe
               | you can prove me wrong? (Assuming UTF8 here)
        
               | tgv wrote:
               | 256 is indeed pretty limited.
               | 
                | Unicode does have a limited space, but it cannot be
                | stored practically in a single table. It currently runs
                | up to 0x323AF, a bit over 200k, and most of the
                | characters of course don't have a lower/uppercase
                | mapping. The implementations I've seen do a few
                | comparisons and then delegate to a table.
                | 
                | But: horses for courses. If you have to normalize a lot
                | of Latin-1 text (such as transforming case or stripping
                | diacritics), you can probably write some vector
                | instructions that run circles around a simple LUT. But
                | it's not going to be as easy.
        
               | nwellnhof wrote:
               | Yes, this is a good example where LUTs are very likely to
               | be slower, especially if you can avoid branch
               | mispredictions with cmov.
        
               | dragontamer wrote:
                | Your 8-bit lookup can never be parallelized, while the
                | add/cmp/cmov is really easy AVX512 that probably auto-
                | inlines and auto-vectorizes.
                | 
                | I dunno, my instinct says the computed version is a huge
                | win, while the lookup table looks pretty bad.
        
               | jeffbee wrote:
               | > can never be parellized
               | 
               | I mean, I'm not skilled enough in those ISA extensions to
               | stick my neck out. It's not totally obvious to me that
               | there is not some shuffle or permute facility that can
               | load 64 bytes at a time from LUTs.
        
               | dragontamer wrote:
               | > It's not totally obvious to me that there is not some
               | shuffle or permute facility that can load 64 bytes at a
               | time from LUTs.
               | 
               | You are talking about vgather and/or vscatter, which are
               | well known to be very slow AVX2 or AVX512 instructions.
               | 
               | Maybe a future CPU will make these instructions high
               | performance. But no modern 2023-era CPU has a high-speed
               | vgather.
               | 
                | Like: vgather does its loads one at a time, slowly. You
                | basically lose the parallelism. Even if the vgather
                | instruction describes what you want to do, it's not an
                | effective parallel operation today (and may never be).
               | 
               | --------
               | 
               | Pshufb as a 4-bit LUT is an exception and is effectively
               | a high speed (but very very small) lookup table. Like
               | every cycle 64 bytes at a time fast.
               | 
               | You are limited by the 16-byte lookup size (aka the size
               | of an SSE register), maybe a bit bigger if there are new
               | instructions I dunno about.
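                | 
                | For a concrete flavor of the trick, here's a rough,
                | untested sketch of the classic in-register nibble LUT:
                | per-byte popcount with pshufb (SSSE3 intrinsics):
                | 
                |   #include <tmmintrin.h>  // SSSE3 pshufb
                | 
                |   // Popcount of each byte in v, using pshufb as a
                |   // 16-entry (4-bit) lookup table in a register.
                |   __m128i popcount_bytes(__m128i v) {
                |     const __m128i lut = _mm_setr_epi8(
                |         0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4);
                |     const __m128i m = _mm_set1_epi8(0x0F);
                |     __m128i lo = _mm_and_si128(v, m);
                |     __m128i hi =
                |         _mm_and_si128(_mm_srli_epi16(v, 4), m);
                |     return _mm_add_epi8(
                |         _mm_shuffle_epi8(lut, lo),
                |         _mm_shuffle_epi8(lut, hi));
                |   }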
        
               | gpderetta wrote:
                | Vectorized gather loads can do just that. But currently
                | they are pretty underwhelming on x86 machines.
        
               | gpderetta wrote:
                | Although it can't be easily vectorized, it can be
                | pipelined, which can help both hide the access latency
                | and keep the lookup table hot. The code is not going to
                | be pretty, though.
        
               | vardump wrote:
               | > Which boils down to add, add, cmp, cmov on x86.
               | 
               | By using SIMD instructions like AVX2 you can do 32
               | characters in parallel. With AVX512, 64.
               | 
               | Pretty easy to write something way faster by not using a
               | LUT.
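                | 
                | Roughly like this, sketched with plain SSE2 intrinsics
                | (16 bytes at a time; AVX2/AVX-512 are the same idea with
                | wider registers; untested):
                | 
                |   #include <emmintrin.h>  // SSE2
                | 
                |   // Lowercase 16 ASCII bytes at once, branch-free.
                |   __m128i to_lower_16(__m128i v) {
                |     __m128i gt =
                |         _mm_cmpgt_epi8(v, _mm_set1_epi8('A' - 1));
                |     __m128i lt =
                |         _mm_cmpgt_epi8(_mm_set1_epi8('Z' + 1), v);
                |     __m128i up = _mm_and_si128(gt, lt);
                |     __m128i d =
                |         _mm_and_si128(up, _mm_set1_epi8(32));
                |     return _mm_add_epi8(v, d);
                |   }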
        
             | robocat wrote:
             | Comedy: did you notice that the next static definition is
              | for a 256-byte lookup table for the unreversed bits too:
              |          TIFFNoBitRevTable[256] = {
              |              0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06,
              |              0x07, ... 0xFF }
             | 
             | Maybe PSHUFB/VTBL to reverse each nibble and then put the
             | reversed nibbles together would be wayyy faster?
             | 
             | I'm guessing portability matters here, and there is not
             | much commercial pressure to actually optimise this TIFF
             | library?
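              | 
              | Roughly like this, I imagine (an untested SSSE3 sketch;
              | portability is of course exactly the problem):
              | 
              |   #include <tmmintrin.h>  // SSSE3 pshufb
              | 
              |   // Bit-reverse each of 16 bytes: look up the reversed
              |   // value of each nibble, then swap the nibbles.
              |   __m128i bitrev_bytes(__m128i v) {
              |     const __m128i m = _mm_set1_epi8(0x0F);
              |     const __m128i rev4 = _mm_setr_epi8(  // rev of 0..15
              |         0, 8, 4, 12, 2, 10, 6, 14,
              |         1, 9, 5, 13, 3, 11, 7, 15);
              |     __m128i lo = _mm_and_si128(v, m);
              |     __m128i hi =
              |         _mm_and_si128(_mm_srli_epi16(v, 4), m);
              |     // Entries are <= 15, so shifting left by 4 within
              |     // 16-bit lanes can't spill into the next byte.
              |     __m128i new_hi =
              |         _mm_slli_epi16(_mm_shuffle_epi8(rev4, lo), 4);
              |     __m128i new_lo = _mm_shuffle_epi8(rev4, hi);
              |     return _mm_or_si128(new_hi, new_lo);
              |   }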
        
       | capitainenemo wrote:
        | A subset of (3) is the reason for Hedgewars' sine lookup table:
        | "When you want the results to be very predictable".
        | Deterministic lockstep has always had issues with floating point
        | variations across platforms, and I imagine that applies to other
        | things as well...
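        | 
        | For illustration (not Hedgewars' actual code): the usual trick
        | is to generate a fixed-point table once, check it in, and ship
        | it, so every platform indexes the exact same integers instead of
        | calling libm's sin() at runtime. A hypothetical generator:
        | 
        |   #include <cmath>
        |   #include <cstdio>
        | 
        |   // Run offline; commit the printed table to the game.
        |   int main() {
        |     const double kPi = 3.14159265358979323846;
        |     std::puts("static const short kSinQ15[256] = {");
        |     for (int i = 0; i < 256; ++i) {
        |       double s = std::sin(2.0 * kPi * i / 256.0);
        |       std::printf("  %ld,\n", std::lround(s * 32767.0));
        |     }
        |     std::puts("};");
        |   }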
        
       | fluoridation wrote:
       | This is very long for something that's quite simple: If your
       | input space is small enough that it fits into an 8- or 16-bit
       | LUT, try it. If it's faster, use the LUT; if it's not, don't use
       | it. Really, that's all that needs to be said. The cases where
       | LUTs this small crop up are so rare that you lose nothing by
       | trying. There's no use in trying to predict the performance when
       | you can just measure it.
        
         | remram wrote:
         | I agree. Do a benchmark end-to-end. Measuring the lookup alone
         | and theorizing is worth nothing.
        
       | jameshart wrote:
       | Not hugely convincing to put a toy model up against a micro
       | benchmark and claim the toy model proves the micro benchmark
       | wrong.
       | 
       | The correct way to compare these approaches is with real code.
        
         | vardump wrote:
          | Most of the time the solution that pollutes L1 caches more is
          | going to lose in real-life use. LUTs do that a lot, but of
          | course bloated code can hurt the L1 instruction cache badly as
          | well.
          | 
          | You can compute a ton on a modern CPU in the time a single
          | memory access takes.
        
         | pvorb wrote:
         | The correct way to compare these two approaches is to measure
         | the difference on _your individual program_ and not rely on
         | benchmarks that measure different programs.
         | 
         | But I also understand that you can't do that for every decision
         | you need to make, so at least this article gives a hint that
         | lookup tables are something that might be worth measuring.
        
       | coolhand2120 wrote:
       | In my work LUTs are a reaction to really poor performance, not a
       | matter of course. Every time I reach for one my inner dialog says
       | "you (probably) should use a lookup table".
        
       | gpderetta wrote:
        | For the most part I agree. This can be generalized with the
        | observation that a) sometimes it is faster to recompute a
        | function than to cache it, and b) cache is a limited resource,
        | usually more so than CPU cycles.
        
         | vardump wrote:
         | Same. I've replaced a ton of LUTs with pure [vectorized] code
         | and most of the time it's a big win. Some computations can of
         | course be too heavy, but that's pretty rare nowadays.
        
       | rented_mule wrote:
       | I ran into CPU cache issues in a slightly different application
       | 15+ years ago. I was working on indexing gigabytes of text on a
       | mobile CPU (before smart phones caused massive investment in such
       | CPUs). Word normalization logic (e.g., sky/skies/sky's -> sky)
       | was very slow, so I used a word cache which sped it up immensely.
       | 
       | The initial version of the word cache cleared the entire cache
       | when it hit a certain number of entries, just as a placeholder.
       | When I got around to adding some LRU eviction logic, it became
       | faster on our desktop simulator, but far slower on the embedded
       | device (slower than with no word cache at all). The difference
       | came down to the CPU cache size / strategy differences between
       | the desktop and mobile CPUs.
       | 
       | We ended up shipping the "dumb" word cache logic because it was
       | so much faster in practice. The word cache eviction function was
        | only two lines of code plus a large comment that acknowledged
        | that "yes, this looks dumb" and implored that the speed of any
       | future improvements be compared to the dumb version on the target
       | device. Testing real data on the hardware used in practice is
       | incredibly important.
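        | 
        | The "dumb" eviction is essentially just this (a hypothetical
        | sketch, not our actual code):
        | 
        |   #include <cstddef>
        |   #include <string>
        |   #include <unordered_map>
        | 
        |   // Word-normalization cache that simply clears itself when
        |   // full instead of tracking LRU order.
        |   class WordCache {
        |    public:
        |     explicit WordCache(std::size_t max) : max_(max) {}
        | 
        |     std::string Normalize(const std::string& word) {
        |       auto it = cache_.find(word);
        |       if (it != cache_.end()) return it->second;
        |       if (cache_.size() >= max_) cache_.clear();  // "dumb"
        |       return cache_[word] = NormalizeSlow(word);
        |     }
        | 
        |    private:
        |     // Stand-in for the expensive stemming logic.
        |     static std::string NormalizeSlow(const std::string& w) {
        |       return w;
        |     }
        |     std::size_t max_;
        |     std::unordered_map<std::string, std::string> cache_;
        |   };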
        
       | thot_experiment wrote:
        | Interesting article, but it's very in the weeds on its use cases
        | and I wish the caveats on when it applies were more clearly
        | stated. There are many situations where the computation being
        | done is prohibitively expensive compared to the time horizon
        | required; LUTs are great for that sort of thing. IMO people
        | should use LUTs MORE often.
        
         | dragontamer wrote:
         | Yes use LUTs.
         | 
         | But 4-LUT is also called pshufb and you never ever need to
         | touch the relatively slow load/store units.
         | 
         | The space of computations you can do with 4-LUT PSHUFB
         | instructions, followed up with simple bit instructions (add,
         | or, not, xor, multiply) is rather insane.
        
           | tgv wrote:
           | The space you can cover in a table with n values is much,
           | much larger, even for small values of n (say 100).
        
       | Tommah wrote:
       | Eh, you should probably test it either way. Performance can be
       | very unpredictable. I ran into a weird case years ago when I was
       | writing a program that needed to calculate popcount (number of 1
       | bits in a number). My method using a 256-entry lookup table
       | turned out to be faster than using GCC's popcount intrinsic. (See
       | [1] for a question from someone who encountered a similar
       | situation. The solution provided there was to compile with
       | -march=native, but I think I tried that without success.)
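        | 
        | The table version is just something like this (untested sketch,
        | not my original code):
        | 
        |   #include <array>
        |   #include <cstdint>
        | 
        |   // 256-entry table: popcount of every possible byte value.
        |   static const std::array<uint8_t, 256> kPop = [] {
        |     std::array<uint8_t, 256> t{};
        |     for (int i = 0; i < 256; ++i)
        |       t[i] = t[i >> 1] + (i & 1);
        |     return t;
        |   }();
        | 
        |   // Popcount of a 32-bit word: one lookup per byte.
        |   int popcount32(uint32_t x) {
        |     return kPop[x & 0xFF] + kPop[(x >> 8) & 0xFF] +
        |            kPop[(x >> 16) & 0xFF] + kPop[x >> 24];
        |   }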
       | 
       | I ran into a similar problem in another program where I needed
       | 128-bit integers. I implemented them myself as a C struct
       | containing two 64-bit integers. Then I discovered that GCC
       | implemented its own 128-bit integer type. But that type was
       | somehow slower than my implementation!
       | 
       | [1] https://stackoverflow.com/questions/52161596/why-is-
       | builtin-...
        
         | ChrisMarshallNY wrote:
         | Lookup tables can be _insanely_ fast, if kept entirely in
         | L[1|2] cache.
         | 
         | We found this out, when we were optimizing image processing
         | pipelines (image processing uses _lots_ of lookup tables).
         | 
          | We found that the best way to optimize was to avoid moving out
          | of cache, so we had to rewrite a lot of our stuff to make sure
          | it never broke cache.
         | 
         | That meant a lot of copy-and-paste repetition, strict (small)
         | static functions, and hardcoded values, instead of properties
         | or even constants.
         | 
          | I don't remember all the stuff we learned, but it wasn't pretty
          | (and it applied only to the x86 architecture; not sure how well
          | it would work, these days, on ARM).
        
           | fluoridation wrote:
           | Regardless of the arguments this article makes, it can be
           | surprising. One time back in the Core 2 days I was optimizing
           | an alpha blend in software and I had something like x * y /
           | 255. That operation was a tiny bit slower than table[(x << 8)
           | | y].
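            | 
            | i.e. something along these lines (reconstructed from memory,
            | untested):
            | 
            |   #include <array>
            |   #include <cstdint>
            | 
            |   // 64 KB table: entry (x << 8) | y holds x * y / 255.
            |   static const std::array<uint8_t, 65536> kBlend = [] {
            |     std::array<uint8_t, 65536> t{};
            |     for (int x = 0; x < 256; ++x)
            |       for (int y = 0; y < 256; ++y)
            |         t[(x << 8) | y] = (uint8_t)(x * y / 255);
            |     return t;
            |   }();
            | 
            |   uint8_t blend_mul(uint8_t x, uint8_t y) {
            |     return (uint8_t)(x * y / 255);    // compute
            |   }
            | 
            |   uint8_t blend_lut(uint8_t x, uint8_t y) {
            |     return kBlend[(x << 8) | y];      // 64 KB lookup
            |   }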
        
         | vardump wrote:
         | Perhaps a wrong link? Your link shows a case where the
         | compiler's implementation is _better_ than a custom one.
        
           | Thorrez wrote:
            | Has the link been edited? And are you talking about the
           | question or the answers? In the question, __builtin_popcount
           | is the slowest of the 3 options on both compilers.
        
         | o11c wrote:
         | Was your problem perhaps because of the hardware bug where
         | popcount unnecessarily stalled on the output as if it were an
         | input?
         | 
         | Later compilers added an explicit zero of the output to work
         | around this issue.
        
       | zschoche wrote:
        | Well, it always depends on what you look up, i.e. how much time
        | one needs to compute the entries of the given lookup table. If
        | it takes O(2^n) time, then the payback comes quickly, where n is
        | the input size. It's good advice to be very skeptical when
        | someone drops general statements like that. Looking for
        | algorithms with the best worst-case time complexity is a good
        | start. It will always win. Theory works. The question is only
        | how big the input has to be in order to see that. Here the fun
        | part starts, I guess.
        
       ___________________________________________________________________
       (page generated 2023-10-09 23:01 UTC)