[HN Gopher] Do Low-Level Optimizations Matter? (2020)
       ___________________________________________________________________
        
       Do Low-Level Optimizations Matter? (2020)
        
       Author : signa11
       Score  : 113 points
        Date   : 2021-07-17 05:21 UTC (1 day ago)
        
 (HTM) web link (cantrip.org)
 (TXT) w3m dump (cantrip.org)
        
       | dragontamer wrote:
       | Sorting is a good 'random problem' but matrix multiplication is
       | WOW when you get all the low level details right.
       | 
       | L1 blocking, prefetching, column vs row, etc etc. You can
       | probably spend a whole semester just optimizing matrix
       | multiplication and understanding each incremental low level
       | optimization.
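As a sketch of just one of those steps (a hedged illustration, not code from any particular course or library): cache blocking reorders the classic triple loop so each tile of A, B, and C stays resident in L1 while it is reused.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Tile size: chosen so three BLOCK x BLOCK double tiles fit comfortably in L1.
constexpr std::size_t BLOCK = 64;

// Textbook triple loop over row-major n x n matrices.
void matmul_naive(const std::vector<double>& a, const std::vector<double>& b,
                  std::vector<double>& c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

// Same arithmetic, iterated tile by tile so the working set stays cache-hot.
void matmul_blocked(const std::vector<double>& a, const std::vector<double>& b,
                    std::vector<double>& c, std::size_t n) {
    std::fill(c.begin(), c.end(), 0.0);
    for (std::size_t ii = 0; ii < n; ii += BLOCK)
        for (std::size_t kk = 0; kk < n; kk += BLOCK)
            for (std::size_t jj = 0; jj < n; jj += BLOCK)
                for (std::size_t i = ii; i < std::min(ii + BLOCK, n); ++i)
                    for (std::size_t k = kk; k < std::min(kk + BLOCK, n); ++k) {
                        double aik = a[i * n + k];  // hoist the reused element
                        for (std::size_t j = jj; j < std::min(jj + BLOCK, n); ++j)
                            c[i * n + j] += aik * b[k * n + j];
                    }
}
```

The blocking is only the first rung; the prefetching, packing, and vectorized microkernels dragontamer mentions each add more on top.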
        
         | duped wrote:
         | And it's incredibly frustrating that precious few languages
         | bake in matrix arithmetic so compilers can do this optimization
         | for you. There's Octave and MATLAB but they're slow for other
         | reasons. It's very frustrating when you have to either use some
         | crazy template library that takes forever to compile or do the
         | optimizations by yourself.
         | 
         | Julia is definitely promising for that reason
        
           | jpeloquin wrote:
           | I thought baking matrix arithmetic into the language was more
           | about making matrix-related code look nice than making it
           | faster. That is, most languages just use LAPACK for matrix
           | arithmetic, even Matlab and Julia, and they are distinguished
           | only in that they provide special syntax for it. "In Julia
           | (as in much of scientific computation), dense linear-algebra
           | operations are based on the LAPACK library, which in turn is
           | built on top of basic linear-algebra building-blocks known as
           | the BLAS" [0]. How does baking matrix arithmetic into the
           | language itself (maybe I misunderstood what you meant by
           | this) provide new opportunities for compilers to optimize the
           | code? Is it that if every bit of a program was implemented in
           | the same language (e.g., Fortran), the compiler can do whole-
           | program optimizations it otherwise could not?
           | 
            | [0] https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#BLAS-...
        
             | duped wrote:
             | With dedicated semantics it becomes possible for a compiler
             | to optimize things explicitly rather than implicitly. It's
             | sort of equivalent to how a language without generics or
             | inheritance can't perform certain optimizations (like
             | inlining method calls) because the compiler isn't aware of
             | those semantics in the first place.
             | 
             | The overhead of BLAS is significant for small matrices.
             | However, a compiler can tell if matrices are small at
             | compile time (or JIT time) and inline the expressions which
             | presents enormous speed gains. It can also make judgement
             | calls about when to use sparse or dense arrays but that's a
             | bit trickier. Compilers can do this without knowing that
             | matrices or linear algebra exist, but it usually takes PGO
             | which can be difficult to use.
             | 
             | Iirc there's a linear algebra library for Julia that can
             | beat the LAPACK backends for small matrices, which is a
             | pretty big deal.
             | 
             | There are also classes of optimizations like SIMD that
             | don't just depend on matrix size but the target platform,
             | which may not be the platform the software was compiled on.
             | The way to do this is with function multiversioning but not
             | all compilers/languages support it, this is kind of a
             | classic usecase for a JIT compiler.
             | 
             | One of the reasons Eigen is great is because it can use
             | template semantics to make these optimizations at compile
             | time. It's just slightly annoying in how it harms compile
             | times.
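A minimal sketch of the "dedicated semantics" idea (a hypothetical type, not any particular library's API): once the dimensions are template parameters, the compiler sees every loop bound at compile time and can unroll, vectorize, and inline the whole product, with none of the call overhead of a BLAS gemm on tiny matrices.

```cpp
#include <array>
#include <cstddef>

// Hypothetical fixed-size matrix: dimensions are template parameters, so the
// compiler knows every loop bound at compile time and can fully unroll and
// inline the product -- no runtime dispatch, no BLAS call overhead.
template <std::size_t R, std::size_t C>
struct Mat {
    std::array<double, R * C> v{};  // row-major storage, zero-initialized
    double& operator()(std::size_t i, std::size_t j) { return v[i * C + j]; }
    double operator()(std::size_t i, std::size_t j) const { return v[i * C + j]; }
};

template <std::size_t R, std::size_t K, std::size_t C>
Mat<R, C> operator*(const Mat<R, K>& a, const Mat<K, C>& b) {
    Mat<R, C> out;
    for (std::size_t i = 0; i < R; ++i)
        for (std::size_t k = 0; k < K; ++k)  // k-middle order is vectorizer-friendly
            for (std::size_t j = 0; j < C; ++j)
                out(i, j) += a(i, k) * b(k, j);
    return out;
}
```

This is essentially what Eigen's fixed-size types achieve with expression templates, at the cost of the compile times mentioned above.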
        
               | adgjlsfhk1 wrote:
                | yeah. It's Octavian:
                | https://raw.githubusercontent.com/JuliaLinearAlgebra/Octavia...
                | It can beat MKL convincingly
               | up to 100x100 and remains slightly ahead up to 1000x1000
               | (and performance is still improving).
        
             | _hl_ wrote:
             | I think the point they were making is that LAPACK et al are
             | a pain to use directly (they really are!), and abstractions
             | are limited by the capabilities of the language or heavy to
             | compile (not sure I agree, Eigen is pretty okay).
        
       | meisel wrote:
       | It appears that sorting is done on the memory-mapped buffer? That
       | seems like it could have extra costs (lazily paging in from disk)
       | that we wouldn't see for an in-memory buffer, and which could
       | affect the benchmark.
        
         | jandrewrogers wrote:
         | The MAP_POPULATE flag instructs mmap() to load the underlying
         | pages into physical memory immediately. There should be no
         | paging from disk during the benchmark run.
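The call looks roughly like this (a Linux-specific sketch; `map_prefaulted` is an illustrative helper with error handling trimmed, not the article's code):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <cstdint>

// Map a whole file read-only with its pages faulted in up front (Linux):
// MAP_POPULATE asks the kernel to populate the page tables immediately, so
// later reads don't stall on demand paging. Returns nullptr on any error;
// the mapping remains valid after the descriptor is closed.
const std::uint8_t* map_prefaulted(const char* path, std::size_t& len_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    off_t len = lseek(fd, 0, SEEK_END);  // file size
    void* p = mmap(nullptr, static_cast<std::size_t>(len), PROT_READ,
                   MAP_PRIVATE | MAP_POPULATE, fd, 0);
    close(fd);
    if (p == MAP_FAILED) return nullptr;
    len_out = static_cast<std::size_t>(len);
    return static_cast<const std::uint8_t*>(p);
}
```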
        
       | MrYellowP wrote:
       | Depends on the necessity. If you really need more performance,
       | then they absolutely matter.
       | 
       | In general it makes sense to program for the platform/CPU, not
       | the compiler. Respect the CPU first, the compiler second. This
       | turns into second nature just like everything else.
       | 
        | From my experience, 99% of the people out there disagree, and
        | they do so for a good reason. They code for the compiler, not the
        | system. When that's good enough, then that's what it is.
        
       | inciampati wrote:
       | For sorting, not really. The big development in compute power is
       | parallelism. ips4o facto (https://github.com/SaschaWitt/ips4o),
       | if you want to sort large vectors really fast it makes more sense
       | to sort in-place and in parallel. Parallel in-place radix sort is
        | also a solid choice, but way less flexible than comparison sort.
        
       | ot wrote:
       | > let us begin with a file of a hundred-million totally random
       | integers [...] This is objectively the worst case, containing the
       | greatest possible entropy; but also the best case, hiding no
       | surprises.
       | 
        | There is no single worst case for sorting; in fact, most
        | benchmarks use several different inputs, and even the best
        | algorithms win some and lose some. It's important not to overfit
        | to a specific distribution.
       | 
       | For example, fully random integers can benefit MSB-radix-type
       | sorts, because it's enough to look at a log(n) prefix of the
       | integer to have the data almost sorted, so the later passes have
       | very predictable behavior.
       | 
       | Also the chance of duplicates is negligible, and duplicates are
       | problematic for some quicksort implementations.
        
         | ot wrote:
         | For example, the radix sort implementation is cheating here:
         | for (auto& v : buckets) v.reserve((end - begin)/128);
         | 
         | It's assuming uniform distribution of the buckets, so it can
         | make a very accurate guess and avoid extra allocations. If that
         | line is removed for the general case, I would expect things to
         | get a lot slower. I would be curious to see a comparison with a
         | good implementation of in-place radix sort. For example there
         | is one available in boost.
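The shape of the pass in question, as a hedged sketch (bucket count and bit positions assumed for illustration, not copied from the article):

```cpp
#include <cstdint>
#include <vector>

// One MSD bucketing pass over 64-bit keys: 128 buckets selected by the top
// 7 bits. The reserve() line encodes the uniformity assumption -- with a
// skewed key distribution, some buckets blow past the guess and reallocate
// repeatedly while others waste their reserved capacity.
void msd_pass(const std::uint64_t* begin, const std::uint64_t* end,
              std::vector<std::vector<std::uint64_t>>& buckets) {
    buckets.assign(128, {});
    for (auto& v : buckets) v.reserve((end - begin) / 128);  // the uniformity guess
    for (auto p = begin; p != end; ++p)
        buckets[*p >> 57].push_back(*p);  // top 7 bits select the bucket
}
```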
        
           | dragontamer wrote:
            | The radix sort isn't really the crux of the article. They're
            | just trying to see how fast radix sort is ("unfairly"), and
            | then trying to improve a comparison sort to compete with it.
        
       | nightcracker wrote:
       | This article is missing the important reference to the prior work
       | by Edelkamp and Weiss, on branchless quicksort [1].
       | 
       | I've long since implemented that in pdqsort [2], along with other
       | techniques described by me in [3] which is a drop-in replacement
       | for std::sort that's ~twice as fast as std::sort for small data
       | with branchless comparisons. It's also available in Boost.sort
       | [4].
       | 
        | [1] https://arxiv.org/abs/1604.06697
        | [2] https://github.com/orlp/pdqsort
        | [3] https://arxiv.org/abs/2106.05123
        | [4] https://www.boost.org/doc/libs/1_76_0/libs/sort/doc/html/sor...
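The flavor of the branchless trick, sketched (not pdqsort's actual code): replace the per-element branch with arithmetic on the comparison result, so the store always happens and only the output index moves conditionally.

```cpp
#include <cstddef>

// Branchless selection in the style of block-partitioning quicksorts:
// always store the element, then advance the output index by the boolean
// comparison result, leaving nothing for the branch predictor to miss.
std::size_t collect_less_than(const int* in, std::size_t n, int pivot, int* out) {
    std::size_t k = 0;
    for (std::size_t i = 0; i < n; ++i) {
        out[k] = in[i];        // unconditional store
        k += (in[i] < pivot);  // bool arithmetic instead of an if
    }
    return k;                  // count of elements < pivot, now in out[0..k)
}
```

Elements not less than the pivot are simply overwritten by the next iteration, so the loop body is the same straight-line code for every element regardless of the data.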
        
         | Waterluvian wrote:
         | An ignorant question:
         | 
         | Why don't default sort implementations provide multiple
         | approaches under the hood and switch based on collection size?
        
           | abatilo wrote:
            | There's at least some precedent for adaptive, hybrid sorts in
            | standard libraries.
           | 
           | At least, in Python, the default sort is
           | https://en.m.wikipedia.org/wiki/Timsort
        
           | ot wrote:
           | They do, to a degree. Modern implementations generally use
           | hardcoded routines for very small numbers (say, up to 5),
           | then some quadratic sort like insertion sort, then quicksort.
           | The dispatching can happen at all levels of the recursion, so
           | quicksort's leaves are actually optimized sorts for small
           | collections.
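A toy version of that dispatch (thresholds and structure are illustrative, not taken from any real standard library):

```cpp
#include <algorithm>
#include <cstddef>

// Below this many elements, fall back to insertion sort (illustrative value).
constexpr std::size_t SMALL = 16;

void insertion_sort(int* a, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i) {
        int x = a[i];
        std::size_t j = i;
        while (j > 0 && a[j - 1] > x) { a[j] = a[j - 1]; --j; }
        a[j] = x;
    }
}

// Quicksort whose small partitions -- including the leaves of the
// recursion -- are handed to insertion sort.
void hybrid_sort(int* a, std::size_t n) {
    if (n <= SMALL) { insertion_sort(a, n); return; }
    int pivot = a[n / 2];
    // Three-way partition [less | equal | greater], so runs of duplicates
    // land in final position and the recursion always shrinks.
    int* lo = std::partition(a, a + n, [pivot](int v) { return v < pivot; });
    int* hi = std::partition(lo, a + n, [pivot](int v) { return v == pivot; });
    hybrid_sort(a, static_cast<std::size_t>(lo - a));
    hybrid_sort(hi, n - static_cast<std::size_t>(hi - a));
}
```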
        
           | amelius wrote:
           | Hopefully not just collection size, but also based on storage
           | medium. You might want to sort stuff which is stored on disk
           | or tape. Why should a standard library only offer solutions
           | for objects stored in memory?
        
           | Someone wrote:
           | I expect many do. Recursive ones in particular, switch when
           | things get relatively easy. For example glibc's _qsort_
            | switches to insertion sort for small sizes.
            | https://sourceware.org/git/?p=glibc.git;a=blob_plain;f=stdli...:
           | 
           |  _"Only quicksorts TOTAL_ELEMS / MAX_THRESH partitions,
           | leaving insertion sort to order the MAX_THRESH items within
           | each partition. This is a big win, since insertion sort is
           | faster for small, mostly sorted array segments."_
           | 
           | The comment on the definition of MAX_THRESH is very
           | funny/something worth crying about.
           | 
           |  _"This particular magic number was chosen to work best on a
           | Sun 4 /260."_
           | 
           | That was a 16.67MHz machine with 128 megabytes of RAM, tops.
            | If that magic constant has been checked since 1990, maybe
            | it's worth updating that comment.
        
           | anaphor wrote:
           | This is a thing, see
           | https://www.youtube.com/watch?v=FJJTYQYB1JQ (talk by Andrei
           | Alexandrescu at CppCon 2019 about this exact idea)
        
       | TchoBeer wrote:
       | The headline is misleading; the article isn't really about
        | whether or not low-level optimisations matter. That said, the
        | idea that one of the big speedups computers have gotten in the
        | past two decades is better caching and branch prediction is
        | interesting, and the idea that we use neural nets for the latter
        | was known to me but is still really cool.
        
         | atq2119 wrote:
         | My understanding is that the use of neural nets in branch
         | prediction is greatly overstated for marketing purposes. It's
         | really more that different branch predictors are combined into
         | a single one using a simple function (read: linear combination)
         | that is "learned" (read: coefficients are updated sightly
         | whenever the branch is encountered, based on the predicted vs
         | actual branch direction).
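In that spirit, a toy perceptron-style predictor (a teaching sketch in the vein of Jiménez's perceptron predictor, not any shipping CPU's design):

```cpp
#include <array>

// Toy perceptron predictor: the prediction is the sign of a dot product of
// small integer weights with recent branch outcomes, and the weights are
// nudged whenever the prediction was wrong or only barely confident.
struct PerceptronPredictor {
    static constexpr int H = 8;          // history length
    static constexpr int THRESHOLD = 12; // keep training while |score| is below this
    std::array<int, H + 1> w{};          // w[0] is a bias weight
    std::array<int, H> hist{};           // +1 = taken, -1 = not taken

    int score() const {
        int sum = w[0];
        for (int i = 0; i < H; ++i) sum += w[i + 1] * hist[i];
        return sum;
    }

    bool predict() const { return score() >= 0; }

    void update(bool taken) {
        int sum = score();
        int t = taken ? 1 : -1;
        if ((sum >= 0) != taken || (sum < THRESHOLD && sum > -THRESHOLD)) {
            w[0] += t;                                    // perceptron step
            for (int i = 0; i < H; ++i) w[i + 1] += t * hist[i];
        }
        for (int i = H - 1; i > 0; --i) hist[i] = hist[i - 1];
        hist[0] = t;                                      // record the outcome
    }
};
```

Feed it outcomes via update() and it locks onto simple patterns within a few dozen branches; real designs add per-branch weight tables indexed by PC, mixed global/local history, and more.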
        
           | dragontamer wrote:
           | AMD Zen 1 had a neural net branch predictor for real.
           | 
           | But Zen2 / Zen3 don't seem to have this idea referenced
           | anywhere. I can only imagine that AMD managed to beat their
           | neural-net branch predictor with some boring methodology.
           | 
            | EDIT:
            | https://fuse.wikichip.org/news/2458/a-look-at-the-amd-zen-2-...
           | 
           | Wikichip has done an article on it.
        
       | api wrote:
       | Branches matter a lot in many cases. In some cases compilers are
       | smart enough to optimize them out and in some cases they aren't.
        
       | wpietri wrote:
       | For those interested in the question, it's worth looking at how
       | zstd was made. I have to archive hundreds of TB of JSON about
       | tweets, so I did a compressor bake-off last year. I was surprised
       | at how much better zstd did. Turns out designing for the low-
       | level details of modern processors can make a big difference:
       | https://engineering.fb.com/2016/08/31/core-data/smaller-and-...
        
         | userbinator wrote:
         | Somehow, I suspect not using JSON in the first place might
         | yield better absolute results since the compressor wouldn't
         | have to deal with all the redundancy of JSON and instead focus
         | on the content only.
        
           | wpietri wrote:
           | For many applications, this one included, "the first place"
           | is out of reach. What I've got is an awful lot of incoming
           | JSON data. I'd like to keep a copy in case we need it later.
           | 
           | Is it possible for me to write and test a robust custom
           | compressor that does better in both storage and compression
           | costs than any of the modern compressors? Possibly.
           | 
           | Is that true even though one of FB's goals was specifically
           | optimizing JSON compression? Maybe.
           | 
           | Can I build it for the few hundred bucks per month in S3
           | costs I might save? Definitely not.
        
       | einpoklum wrote:
       | > Do Low-Level Optimizations Matter?
       | 
       | 1. They matter a _lot_.
       | 
       | 2. (Mentioned but not covered in the article) adaptation to known
       | features of your input distribution also matters a lot.
        
       | adrian_b wrote:
       | The history of the CMOV instructions mentioned in the article is
       | incorrect.
       | 
        | While ARM had such instructions from the beginning, and other
        | computers had them even earlier, in the Intel/AMD x86 instruction
        | set the CMOV instructions (including FCMOV for floating point)
        | were introduced with the Pentium Pro in November 1995, and all
        | later Intel models have them. AMD added the CMOV/FCMOV
        | instructions with the Athlon, in June 1999.
       | 
       | The 64-bit extension of x86 has just inherited them from the
       | 32-bit ISA and Itanium had them later than Pentium Pro.
       | 
       | The CMOV instructions were an important new feature of Pentium
       | Pro, besides the out-of-order execution and triple instruction
       | decoding and execution.
        
       | makapuf wrote:
        | The article is indeed interesting, underlining several cache-miss
        | effects and their performance impact, but sorting 1e8 ints in 7s
        | with the standard library vs 4s (with complex and subtle insights,
        | maybe sensitive to CPU arch) might not be worth it if you're
        | dealing with a database, with less experienced coworkers, or with
        | a more complex computation. So does 2x matter? The article
        | doesn't say, but it is interesting nonetheless.
        
       | reader_mode wrote:
        | >Why doesn't G++ generate cmov instructions? Older releases did,
        | in fact, often enough to generate bug reports. It turned out
       | that, any time a subsequent loop iteration depends on the result,
       | and the branch would have been predicted correctly, cmov may be
       | slower than a branch, sometimes much slower. cmov can stall
       | speculation all by itself.
       | 
       | >The latest designs from Intel and AMD are said to avoid the cmov
       | pipeline stall, often, but Gcc has not caught up with them yet.
       | 
       | This is why you don't optimize first - even things that you think
       | you know might not be true in the next HW iteration, compiler
       | version, etc. This pattern shows up all the time. The only sane
        | way to optimize is to measure and write tests to catch
        | regressions; otherwise, always write for readability first.
        
         | alex_smart wrote:
         | >This is why you don't optimize first
         | 
         | The internals of an optimizing compiler might very well be the
         | worst place to apply that principle. Who else is supposed to
         | implement these optimizations, if not the compiler?
        
         | pjmlp wrote:
         | I wish I got a penny for each C and C++ developer that fails to
         | follow that advice.
        
       ___________________________________________________________________
       (page generated 2021-07-18 23:01 UTC)