[HN Gopher] Do Low-Level Optimizations Matter? (2020)
___________________________________________________________________
Do Low-Level Optimizations Matter? (2020)
Author : signa11
Score : 113 points
Date : 2021-07-17 05:21 UTC (1 day ago)
(HTM) web link (cantrip.org)
(TXT) w3m dump (cantrip.org)
| dragontamer wrote:
| Sorting is a good 'random problem' but matrix multiplication is
| WOW when you get all the low level details right.
|
| L1 blocking, prefetching, column vs row, etc etc. You can
| probably spend a whole semester just optimizing matrix
| multiplication and understanding each incremental low level
| optimization.
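dragontamer's point about L1 blocking can be made concrete. Below is a minimal sketch of a cache-blocked matrix multiply; `BLOCK` is an illustrative tile size, not a tuned value, and the function name is mine:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative tile size: pick BLOCK so three BLOCK x BLOCK tiles of
// doubles fit comfortably in L1 cache.
constexpr std::size_t BLOCK = 64;

// C += A * B, all n x n row-major. The i/k/j order inside each tile
// means the innermost loop streams rows of B and C sequentially,
// which is the cache-friendly access pattern being discussed.
void matmul_blocked(const std::vector<double>& A,
                    const std::vector<double>& B,
                    std::vector<double>& C, std::size_t n) {
    for (std::size_t ii = 0; ii < n; ii += BLOCK)
        for (std::size_t kk = 0; kk < n; kk += BLOCK)
            for (std::size_t jj = 0; jj < n; jj += BLOCK)
                for (std::size_t i = ii; i < std::min(ii + BLOCK, n); ++i)
                    for (std::size_t k = kk; k < std::min(kk + BLOCK, n); ++k) {
                        double a = A[i * n + k];  // reused across the j loop
                        for (std::size_t j = jj; j < std::min(jj + BLOCK, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

Each of the further tricks mentioned (prefetching, vectorization, packing) layers on top of this basic tiling.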
| duped wrote:
| And it's incredibly frustrating that precious few languages
| bake in matrix arithmetic so compilers can do this optimization
| for you. There's Octave and MATLAB but they're slow for other
| reasons. So you have to either use some crazy template library
| that takes forever to compile or do the optimizations yourself.
|
| Julia is definitely promising for that reason
| jpeloquin wrote:
| I thought baking matrix arithmetic into the language was more
| about making matrix-related code look nice than making it
| faster. That is, most languages just use LAPACK for matrix
| arithmetic, even Matlab and Julia, and they are distinguished
| only in that they provide special syntax for it. "In Julia
| (as in much of scientific computation), dense linear-algebra
| operations are based on the LAPACK library, which in turn is
| built on top of basic linear-algebra building-blocks known as
| the BLAS" [0]. How does baking matrix arithmetic into the
| language itself (maybe I misunderstood what you meant by
| this) provide new opportunities for compilers to optimize the
| code? Is it that if every bit of a program was implemented in
| the same language (e.g., Fortran), the compiler can do whole-
| program optimizations it otherwise could not?
|
| [0] https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#BLAS-...
| duped wrote:
| With dedicated semantics it becomes possible for a compiler
| to optimize things explicitly rather than implicitly. It's
| sort of equivalent to how a language without generics or
| inheritance can't perform certain optimizations (like
| inlining method calls) because the compiler isn't aware of
| those semantics in the first place.
|
| The overhead of BLAS is significant for small matrices.
| However, a compiler can tell if matrices are small at
| compile time (or JIT time) and inline the expressions which
| presents enormous speed gains. It can also make judgement
| calls about when to use sparse or dense arrays but that's a
| bit trickier. Compilers can do this without knowing that
| matrices or linear algebra exist, but it usually takes PGO
| which can be difficult to use.
|
| Iirc there's a linear algebra library for Julia that can
| beat the LAPACK backends for small matrices, which is a
| pretty big deal.
|
| There are also classes of optimizations like SIMD that
| don't just depend on matrix size but the target platform,
| which may not be the platform the software was compiled on.
| The way to do this is with function multiversioning but not
| all compilers/languages support it, this is kind of a
| classic usecase for a JIT compiler.
|
| One of the reasons Eigen is great is because it can use
| template semantics to make these optimizations at compile
| time. It's just slightly annoying in how it harms compile
| times.
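The compile-time-size point can be sketched directly: when dimensions are template parameters (as in Eigen's fixed-size types), every trip count is a constant, so the compiler can unroll and inline the whole product with no BLAS call overhead. All names below are illustrative:

```cpp
#include <array>
#include <cstddef>

// Fixed-size matmul: M, K, N are known at compile time, so the three
// loops below have constant trip counts and can be fully unrolled,
// inlined, and vectorized at the call site.
template <std::size_t M, std::size_t K, std::size_t N>
std::array<double, M * N> matmul(const std::array<double, M * K>& a,
                                 const std::array<double, K * N>& b) {
    std::array<double, M * N> c{};  // zero-initialized result
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t k = 0; k < K; ++k)
            for (std::size_t j = 0; j < N; ++j)
                c[i * N + j] += a[i * K + k] * b[k * N + j];
    return c;
}
```

For a 3x3 or 4x4 product this typically compiles down to straight-line SIMD code, which is the "enormous speed gains" case for small matrices.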
| adgjlsfhk1 wrote:
| yeah. It's Octavian
| https://raw.githubusercontent.com/JuliaLinearAlgebra/Octavia...
| It can beat MKL convincingly up to 100x100 and remains slightly
| ahead up to 1000x1000 (and performance is still improving).
| _hl_ wrote:
| I think the point they were making is that LAPACK et al are
| a pain to use directly (they really are!), and abstractions
| are limited by the capabilities of the language or heavy to
| compile (not sure I agree, Eigen is pretty okay).
| meisel wrote:
| It appears that sorting is done on the memory-mapped buffer? That
| seems like it could have extra costs (lazily paging in from disk)
| that we wouldn't see for an in-memory buffer, and which could
| affect the benchmark.
| jandrewrogers wrote:
| The MAP_POPULATE flag instructs mmap() to load the underlying
| pages into physical memory immediately. There should be no
| paging from disk during the benchmark run.
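For reference, such a mapping looks roughly like this sketch (the flag is Linux-specific; the function name and error handling here are illustrative, not the article's code):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>
#include <cstdint>

// Map a file of 32-bit integers read-only, asking the kernel (via
// MAP_POPULATE, Linux-specific) to fault every page in up front, so a
// benchmark over the data pays no lazy page-in cost mid-run.
std::uint32_t* map_populated(const char* path, std::size_t& count) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
    count = static_cast<std::size_t>(st.st_size) / sizeof(std::uint32_t);
    void* p = mmap(nullptr, st.st_size, PROT_READ,
                   MAP_PRIVATE | MAP_POPULATE, fd, 0);
    close(fd);  // the mapping remains valid after closing the fd
    return p == MAP_FAILED ? nullptr : static_cast<std::uint32_t*>(p);
}
```

Without MAP_POPULATE, the first pass over the buffer would interleave page faults with the sort and skew the measurement.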
| MrYellowP wrote:
| Depends on the necessity. If you really need more performance,
| then they absolutely matter.
|
| In general it makes sense to program for the platform/CPU, not
| the compiler. Respect the CPU first, the compiler second. This
| turns into second nature just like everything else.
|
| From my experience, 99% of the people out there disagree, and for
| good reason: they code for the compiler, not the system. When
| that's good enough, then that's what it is.
| inciampati wrote:
| For sorting, not really. The big development in compute power is
| parallelism. ips4o facto (https://github.com/SaschaWitt/ips4o),
| if you want to sort large vectors really fast it makes more sense
| to sort in-place and in parallel. Parallel in-place radix sort is
| also a good choice, but way less flexible than comparison sort.
| ot wrote:
| > let us begin with a file of a hundred-million totally random
| integers [...] This is objectively the worst case, containing the
| greatest possible entropy; but also the best case, hiding no
| surprises.
|
| There is no single worst case for sorting, and in fact most
| benchmarks use several different inputs, and even the best
| algorithms win some and lose some. It's important not to overfit
| to a specific distribution.
|
| For example, fully random integers can benefit MSB-radix-type
| sorts, because it's enough to look at a log(n) prefix of the
| integer to have the data almost sorted, so the later passes have
| very predictable behavior.
|
| Also the chance of duplicates is negligible, and duplicates are
| problematic for some quicksort implementations.
| ot wrote:
| For example, the radix sort implementation is cheating here:
| for (auto& v : buckets) v.reserve((end - begin)/128);
|
| It's assuming uniform distribution of the buckets, so it can
| make a very accurate guess and avoid extra allocations. If that
| line is removed for the general case, I would expect things to
| get a lot slower. I would be curious to see a comparison with a
| good implementation of in-place radix sort. For example there
| is one available in boost.
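The quoted line fits into an MSD radix pass roughly like the sketch below (illustrative, not the article's actual implementation). The `/128` guess is double the uniform expectation of `size/256`, so it only avoids reallocations when the keys really are near-uniform:

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <vector>

// One MSD pass over 32-bit keys: scatter into 256 buckets by the top
// byte, finish each bucket, and copy back in bucket order.
void msd_radix_pass(std::uint32_t* begin, std::uint32_t* end) {
    std::array<std::vector<std::uint32_t>, 256> buckets;
    // The "cheating" guess: assumes uniform keys; skewed input makes
    // some buckets reallocate repeatedly and wastes memory on others.
    for (auto& v : buckets) v.reserve((end - begin) / 128);
    for (auto* p = begin; p != end; ++p)
        buckets[*p >> 24].push_back(*p);
    auto* out = begin;
    for (auto& v : buckets) {
        std::sort(v.begin(), v.end());  // stand-in for recursing on lower bytes
        out = std::copy(v.begin(), v.end(), out);
    }
}
```

An in-place radix sort (like the Boost one mentioned) sidesteps the allocation question entirely by permuting within the original buffer.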
| dragontamer wrote:
| The radix sort isn't really the crux of the article. They're
| trying to just see how fast radix sort is ("unfairly"), and
| then try to improve a different comparison-sort instead.
| nightcracker wrote:
| This article is missing the important reference to the prior work
| by Edelkamp and Weiss, on branchless quicksort [1].
|
| I implemented that long ago in pdqsort [2], along with other
| techniques I describe in [3]; pdqsort is a drop-in replacement
| for std::sort that's ~twice as fast for small data with
| branchless comparisons. It's also available in Boost.sort [4].
|
| [1] https://arxiv.org/abs/1604.06697
| [2] https://github.com/orlp/pdqsort
| [3] https://arxiv.org/abs/2106.05123
| [4] https://www.boost.org/doc/libs/1_76_0/libs/sort/doc/html/sor...
| Waterluvian wrote:
| An ignorant question:
|
| Why don't default sort implementations provide multiple
| approaches under the hood and switch based on collection size?
| abatilo wrote:
| There's at least some precedent for conditional execution in
| standard libraries.
|
| At least, in Python, the default sort is
| https://en.m.wikipedia.org/wiki/Timsort
| ot wrote:
| They do, to a degree. Modern implementations generally use
| hardcoded routines for very small numbers (say, up to 5),
| then some quadratic sort like insertion sort, then quicksort.
| The dispatching can happen at all levels of the recursion, so
| quicksort's leaves are actually optimized sorts for small
| collections.
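That dispatch can be sketched in a few lines (the threshold of 16 and the names are illustrative; real libraries tune the cutoff per platform):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>

// Illustrative cutoff below which insertion sort takes over.
constexpr std::ptrdiff_t SMALL = 16;

void hybrid_sort(int* lo, int* hi) {
    while (hi - lo > SMALL) {
        int pivot = lo[(hi - lo) / 2];
        // Three-way split: [< pivot) [== pivot) [> pivot). Keeping the
        // equal run aside guarantees progress even with many duplicates.
        int* mid1 = std::partition(lo, hi, [pivot](int x) { return x < pivot; });
        int* mid2 = std::partition(mid1, hi, [pivot](int x) { return x == pivot; });
        hybrid_sort(lo, mid1);  // recurse on the left partition
        lo = mid2;              // iterate on the right partition
    }
    // Leaves of the recursion: insertion sort wins on short segments.
    for (int* i = lo + 1; i < hi; ++i)
        for (int* j = i; j > lo && j[-1] > *j; --j)
            std::swap(j[-1], *j);
}
```

Every recursive leaf smaller than the cutoff goes through the insertion-sort path, which is exactly the "dispatch at all levels of the recursion" behavior described above.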
| amelius wrote:
| Hopefully not just collection size, but also based on storage
| medium. You might want to sort stuff which is stored on disk
| or tape. Why should a standard library only offer solutions
| for objects stored in memory?
| Someone wrote:
| I expect many do. Recursive ones in particular, switch when
| things get relatively easy. For example glibc's _qsort_
| switches to insertion sort for small sizes.
| https://sourceware.org/git/?p=glibc.git;a=blob_plain;f=stdli...:
|
| _"Only quicksorts TOTAL_ELEMS / MAX_THRESH partitions,
| leaving insertion sort to order the MAX_THRESH items within
| each partition. This is a big win, since insertion sort is
| faster for small, mostly sorted array segments."_
|
| The comment on the definition of MAX_THRESH is very
| funny/something worth crying about.
|
| _"This particular magic number was chosen to work best on a
| Sun 4 /260."_
|
| That was a 16.67MHz machine with 128 megabytes of RAM, tops.
| If that magic constant hasn't been checked since 1990, maybe it's
| worth updating that comment.
| anaphor wrote:
| This is a thing, see
| https://www.youtube.com/watch?v=FJJTYQYB1JQ (talk by Andrei
| Alexandrescu at CppCon 2019 about this exact idea)
| TchoBeer wrote:
| The headline is misleading; the article isn't really about
| whether or not low-level optimisations matter. With that being
| said, the idea that one of the big speedups computers have gotten
| in the past two decades comes from better caching and branch
| prediction is interesting, and the idea that we use neural nets
| for it, while known to me, is still really cool.
| atq2119 wrote:
| My understanding is that the use of neural nets in branch
| prediction is greatly overstated for marketing purposes. It's
| really more that different branch predictors are combined into
| a single one using a simple function (read: linear combination)
| that is "learned" (read: coefficients are updated sightly
| whenever the branch is encountered, based on the predicted vs
| actual branch direction).
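That update scheme can be sketched as a toy perceptron predictor; the history length and training threshold below are illustrative, not any real CPU's:

```cpp
#include <array>
#include <cstdint>

// A learned linear combination of recent branch outcomes. Predict by
// the sign of the weighted sum; nudge weights toward the actual
// direction when the prediction was wrong or only weakly confident.
struct PerceptronPredictor {
    static constexpr int HIST = 16;       // history length (illustrative)
    static constexpr int THRESHOLD = 28;  // training threshold (illustrative)
    std::array<std::int8_t, HIST + 1> w{};  // w[0] is the bias weight
    std::array<int, HIST> hist{};           // +1 taken, -1 not taken

    int weighted_sum() const {
        int sum = w[0];
        for (int i = 0; i < HIST; ++i) sum += w[i + 1] * hist[i];
        return sum;
    }

    bool predict() const { return weighted_sum() >= 0; }

    void update(bool taken) {
        int t = taken ? 1 : -1;
        int sum = weighted_sum();
        // Train on mispredictions and on low-confidence correct guesses.
        if ((sum >= 0) != taken || (sum < THRESHOLD && sum > -THRESHOLD)) {
            w[0] += t;
            for (int i = 0; i < HIST; ++i) w[i + 1] += t * hist[i];
        }
        for (int i = HIST - 1; i > 0; --i) hist[i] = hist[i - 1];
        hist[0] = t;
    }
};
```

The "neural net" is just this sign-of-a-dot-product plus incremental weight updates, which is why the marketing term overstates what the hardware does.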
| dragontamer wrote:
| AMD Zen 1 had a neural net branch predictor for real.
|
| But Zen2 / Zen3 don't seem to have this idea referenced
| anywhere. I can only imagine that AMD managed to beat their
| neural-net branch predictor with some boring methodology.
|
| EDIT: https://fuse.wikichip.org/news/2458/a-look-at-the-amd-zen-2-...
|
| Wikichip has done an article on it.
| api wrote:
| Branches matter a lot in many cases. In some cases compilers are
| smart enough to optimize them out and in some cases they aren't.
| wpietri wrote:
| For those interested in the question, it's worth looking at how
| zstd was made. I have to archive hundreds of TB of JSON about
| tweets, so I did a compressor bake-off last year. I was surprised
| at how much better zstd did. Turns out designing for the low-
| level details of modern processors can make a big difference:
| https://engineering.fb.com/2016/08/31/core-data/smaller-and-...
| userbinator wrote:
| Somehow, I suspect not using JSON in the first place might
| yield better absolute results since the compressor wouldn't
| have to deal with all the redundancy of JSON and instead focus
| on the content only.
| wpietri wrote:
| For many applications, this one included, "the first place"
| is out of reach. What I've got is an awful lot of incoming
| JSON data. I'd like to keep a copy in case we need it later.
|
| Is it possible for me to write and test a robust custom
| compressor that does better in both storage and compression
| costs than any of the modern compressors? Possibly.
|
| Is that true even though one of FB's goals was specifically
| optimizing JSON compression? Maybe.
|
| Can I build it for the few hundred bucks per month in S3
| costs I might save? Definitely not.
| einpoklum wrote:
| > Do Low-Level Optimizations Matter?
|
| 1. They matter a _lot_.
|
| 2. (Mentioned but not covered in the article) adaptation to known
| features of your input distribution also matters a lot.
| adrian_b wrote:
| The history of the CMOV instructions mentioned in the article is
| incorrect.
|
| While ARM had such instructions from the beginning and other
| computers had them even earlier, in the Intel/AMD x86 instruction
| set the CMOV instructions (including FCMOV for floating-point)
| were introduced in the Pentium Pro, in November 1995, and all
| later Intel models have them. AMD added the CMOV/FCMOV
| instructions in the Athlon, in June 1999.
|
| The 64-bit extension of x86 has just inherited them from the
| 32-bit ISA and Itanium had them later than Pentium Pro.
|
| The CMOV instructions were an important new feature of Pentium
| Pro, besides the out-of-order execution and triple instruction
| decoding and execution.
| makapuf wrote:
| The article is indeed interesting, underlining several cache-miss
| aspects and their performance impact, but going from sorting 1e8
| ints in 7s with the standard lib to 4s (with complex and subtle
| insights, maybe sensitive to CPU arch) might not be worth it if
| you're dealing with a database, with less experienced coworkers,
| or with a more complex computation. So does 2x matter? The
| article doesn't say, but it is interesting nonetheless.
| reader_mode wrote:
| >Why doesn't G++ generate cmov instructions? Older releases did,
| in fact, often enough to generate bug reports11. It turned out
| that, any time a subsequent loop iteration depends on the result,
| and the branch would have been predicted correctly, cmov may be
| slower than a branch, sometimes much slower. cmov can stall
| speculation all by itself.
|
| >The latest designs from Intel and AMD are said to avoid the cmov
| pipeline stall, often, but Gcc has not caught up with them yet.
|
| This is why you don't optimize first - even things that you think
| you know might not be true in the next HW iteration, compiler
| version, etc. This pattern shows up all the time. The only sane
| way to optimize is to measure and write tests to catch
| regressions; otherwise, always write for readability first.
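The trade-off being quoted, in miniature. Whether the ternary below actually compiles to a cmov depends on the compiler version and target, which is exactly the moving target the comment describes:

```cpp
#include <cstdint>
#include <vector>

// Branchy form: the predictor must guess each element's sign. Fast
// when the data is predictable, slow when it is random.
std::int64_t sum_positive_branchy(const std::vector<std::int64_t>& v) {
    std::int64_t s = 0;
    for (auto x : v)
        if (x >= 0) s += x;
    return s;
}

// Branchless form: a candidate for cmov/predication. No mispredicts,
// but it can serialize the dependency chain the quote warns about.
std::int64_t sum_positive_branchless(const std::vector<std::int64_t>& v) {
    std::int64_t s = 0;
    for (auto x : v)
        s += (x >= 0) ? x : 0;
    return s;
}
```

The two are semantically identical, so the only sane way to choose between them is to benchmark on the hardware and compiler you actually ship on.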
| alex_smart wrote:
| >This is why you don't optimize first
|
| The internals of an optimizing compiler might very well be the
| worst place to apply that principle. Who else is supposed to
| implement these optimizations, if not the compiler?
| pjmlp wrote:
| I wish I got a penny for each C and C++ developer that fails to
| follow that advice.
___________________________________________________________________
(page generated 2021-07-18 23:01 UTC)