[HN Gopher] Vectorized and performance-portable Quicksort
___________________________________________________________________
Vectorized and performance-portable Quicksort
Author : slackerIII
Score : 321 points
Date : 2022-06-04 16:48 UTC (6 hours ago)
(HTM) web link (opensource.googleblog.com)
(TXT) w3m dump (opensource.googleblog.com)
| DeathArrow wrote:
| This might interest someone: "Vectorized Operations extension for
| PostgreSQL"
|
| https://github.com/postgrespro/vops
| polskibus wrote:
| This looks awesome but could do with a bit of productionizing,
| for example ready-to-use binaries. Do you know if there are any
| plans to merge this work into core Postgres?
| orlp wrote:
| SIMD based sorting isn't new, e.g. djbsort:
| https://sorting.cr.yp.to/. I find it strange the paper doesn't
| cite it or compare to it at all.
|
| EDIT: to be fair it does seem that Bramas' first published work
| on SIMD sorting (that I could find) predates djbsort,
| https://arxiv.org/abs/1704.08579. A comparison would still be
| nice though.
| TontonNestor wrote:
| djbsort uses a sorting network, not a quicksort. It is also
| designed to be in constant time, which is not the case with the
| linked blog post. So the restrictions are different.
|
| This is not my field, so I can't judge the relevance of this,
| but the author cites the state of the art. It just seems that
| djbsort is not the state of the art and therefore does not need
| to be cited.
|
| I still don't understand the hype around Bernstein; he is quite
| relevant in cryptography and cryptographic implementations, but
| not everything he does is gold.
| janwas wrote:
| :) FWIW we also use a constant-time sorting network as the
| base case of the Quicksort recursion; that's a widely used
| optimization.
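|
| To illustrate the idea with a scalar sketch (the actual base
| case operates on whole SIMD vectors via min/max and handles
| many more elements, but the fixed compare-exchange pattern is
| the point):
|
|     #include <utility>
|
|     // Illustrative 5-comparator network for 4 elements; the
|     // sequence of compare-exchanges is fixed regardless of
|     // the data.
|     inline void CompareSwap(int& a, int& b) {
|       if (b < a) std::swap(a, b);
|     }
|
|     void SortingNetwork4(int v[4]) {
|       CompareSwap(v[0], v[1]); CompareSwap(v[2], v[3]);
|       CompareSwap(v[0], v[2]); CompareSwap(v[1], v[3]);
|       CompareSwap(v[1], v[2]);
|     }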
| janwas wrote:
| What exactly should be cited? I had a quick look and djbsort
| seems to be a constant-time algorithm, presumably a sorting
| network. They report about 2 GB/s for 768 int32 on 3 GHz
| Skylake; IIRC our sorting network reaches 3-4 GB/s, but for a
| smaller input size (256 int32 or int64). Hard to compare
| directly, but djbsort is likely slower, and the sorting network
| is anyway a small fraction of the total time in a Quicksort.
| explaingarlic wrote:
| The whole "stick SIMD into it and compare it with the stdlib"
| thing is a really common learning trope. Feels more like a
| portfolio filler, albeit probably not a useless one.
| janwas wrote:
| We also compare with state of the art platform-specific code.
| The interesting thing here is that they turn out to be slower
| than our approach using portable intrinsics
| (github.com/google/highway) :)
| mixedCase wrote:
| The article acknowledges SIMD based sorting and states what
| makes this novel.
| nostrademons wrote:
| It's not new within Google either. I remember poking through
| Jeff Dean's original MapReduce code (written c. 2003) and
| finding a SIMD-based QuickSort that was the in-memory portion
| of the external sort algorithm.
|
| It looks like this algorithm is interesting in the specifics of
| how it works, e.g. the use of a compress-store instruction for
| partitioning. The algorithm I mentioned above mostly focused on
| speeding up comparisons through the use of SIMD instructions,
| rather than changing the partitioning part of the QuickSort
| algorithm itself. Not sure how that compares to djbsort, I
| didn't see technical details of the algorithm on the page.
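|
| For anyone curious, the compress-store idea on AVX-512 looks
| roughly like this (my own untested sketch with made-up names,
| not code from the paper):
|
|     #include <immintrin.h>  // AVX-512F intrinsics
|     #include <cstddef>
|
|     // Partition 16 int32 around `pivot`: lanes < pivot are
|     // packed contiguously into lo_out, the rest into hi_out.
|     size_t PartitionBlock(const int* in, int pivot,
|                           int* lo_out, int* hi_out) {
|       const __m512i v = _mm512_loadu_si512(in);
|       const __m512i piv = _mm512_set1_epi32(pivot);
|       const __mmask16 lt = _mm512_cmplt_epi32_mask(v, piv);
|       _mm512_mask_compressstoreu_epi32(lo_out, lt, v);
|       _mm512_mask_compressstoreu_epi32(hi_out,
|                                        (__mmask16)~lt, v);
|       return (size_t)_mm_popcnt_u32(lt);  // count sent left
|     }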
| jeffbee wrote:
| The blog post does seem to be quite explicit about the fact
| that their contribution is in the field of portability, not
| novelty.
| janwas wrote:
| Interesting, I did not know that, thanks for the pointer.
| There is also an effort underway to update LLVM std::sort to
| BlockQuicksort, which can use some SIMD instructions
| similarly to what you describe.
| convolvatron wrote:
| nothing happened before google
|
| https://dl.acm.org/doi/10.1145/113379.113380
| janwas wrote:
| hah, our paper actually cites "Radix sort for vector
| multiprocessors" by the same author, also from 1991 :)
| skybrian wrote:
| Will browsers start using this code for typed array sorting?
| Maybe they already do?
| xxs wrote:
| Sorting integers is actually rare in practice, esp. not very
| useful for a browser.
| skybrian wrote:
| The method exists. They could make it faster without breaking
| anything.
|
| https://developer.mozilla.org/en-
| US/docs/Web/JavaScript/Refe...
| wly_cdgr wrote:
| SpaceChem players have known this for years
| unnouinceput wrote:
| Quote: "Today we're sharing open source code that can sort arrays
| of numbers about ten times as fast as the C++ std::sort...." and
| proceeds to explain about SIMD instructions.
|
| Well, to me it sounds like C++ std::sort is simply lacking an
| optimization which could very easily be implemented in the next
| iteration of the compiler/linker. I had high hopes this would be
| a real breakthrough algorithm, not just a missing optimization
| using some instruction set that the compiler/linker didn't
| implement. If they wanted, CPU vendors could make an even more
| awesome and faster instruction set, let's say SIMD2, which would
| leave this one in the dust.
| janwas wrote:
| CPU vendors have recently introduced SVE (Arm) and RVV (RISC-V)
| and we support those in Highway and thus also vqsort.
|
| It's definitely interesting to discuss std::sort benefitting
| from such optimizations, but "very easy" seems rather
| optimistic.
| benstrumental wrote:
| Do you expect these optimizations to make their way into
| std::sort eventually?
| janwas wrote:
| Author here, happy to discuss.
| tehsauce wrote:
| Do you have a sense of whether this fast SIMD sort could be
| applied to speed up sorting on GPUs?
| [deleted]
| TinkersW wrote:
| I just wanted to say awesome job, I'm going to use it, ignore
| all the ignorant SIMD haters;0
| bhl wrote:
| Regarding columnar databases, is there an optimal data
| structure in between storing data as rows and storing data as
| columns that would make it fast to sort and filter by either
| dimension? Some mix of B-trees or interval trees?
| nickysielicki wrote:
| Highway looks awesome from a quick glance. Definitely going to
| set some time aside to play with it and read it.
|
| Sort of tangentially related, what sort of tools are you using
| for disassembly in 2022? Is there anything better than objdump
| today? What I really want is a version of godbolt that reads
| compile_commands.json and can show me the godbolt-style color-
| coded disassembly for an entire binary, any function. I find
| that when I'm asking these sorts of questions I'm wasting a lot
| of time fitting my problem into godbolt so I can just get a
| rough idea of what's coming out, because it's so much faster
| (maybe 10x) to read the color-tagged version from godbolt than
| it is to do the same with objdump.
|
| Edit: along the same lines, what can people do (or rather, what
| do you do) about situations like this:
| https://github.com/google/highway/blob/master/hwy/examples/b...
| where things might get better in the future but for now we have
| to implement it another way? How do you avoid regressions and
| how do you track workarounds? I want some sort of database that
| tries competing implementations on every build and emits a
| compiler warning when things get better. How are you dealing
| with this?
| janwas wrote:
| Thanks! Feel free to raise Github issues if you'd like to
| ask/discuss anything.
|
| I'm also a huge fan of Godbolt/Compiler Explorer. Highway is
| integrated there, so you can just copy in your functions.
| Here's the last throwaway test that I was using:
| https://gcc.godbolt.org/z/5azbK95j9
|
| > things might get better in the future but for now we have
| to implement it another way
|
| There are several possible answers. 1) For the issue you
| pointed to, that we cannot have arrays of N vectors, it's a
| reasonable workaround to instead allocate an array of N
| vectors' worth of elements, and trust the compiler to elide
| unnecessary Load/Store. This often works using clang, and
| would avoid having to manually unroll here. I do prefer to
| minimize the amount of compiler magic required for good
| performance, though, so typically we're unrolling manually as
| shown in that code.
|
| 2) If there are compiler bugs, typically the workarounds have
| to stay in because in the open-source world people are still
| using that compiler years later.
|
| Automatically detecting when things get better is an
| interesting idea but I am not yet aware of such
| infrastructure.
| celrod wrote:
|     // Compiler doesn't make independent sum* accumulators, so unroll manually.
|     // We cannot use an array because V might be a sizeless type. For reasonable
|     // code, we unroll 4x, but 8x might help (2 FMA ports * 4 cycle latency).
|
| That code needs 2 loads per FMA. So a CPU with 2 FMA ports
| would need at least 4 load ports to be able to feed the 2
| FMA ports. Given that most CPUs with 2 FMA ports have just
| 2 load ports, unrolling by 4 should be more or less ideal.
|
| But, ideally, the compiler could make the decision based on
| the target architecture.
|
| Without enabling associative math, it isn't legal to
| duplicate floating point accumulators and change the order
| of the accumulation. Perhaps compiling under
| `-funsafe-math-optimizations` would help. If you're using GCC,
| you'll probably need `-fvariable-expansion-in-unroller`, too.
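|
| To make the unrolling concrete, a scalar sketch of the same
| idea (hypothetical example, not the Highway code): four
| independent partial sums let adds/FMAs from different
| iterations overlap instead of serializing on one accumulator.
|
|     #include <cstddef>
|
|     float DotUnrolled(const float* a, const float* b,
|                       size_t n) {
|       float s0 = 0, s1 = 0, s2 = 0, s3 = 0;  // 4 chains
|       size_t i = 0;
|       for (; i + 4 <= n; i += 4) {
|         s0 += a[i + 0] * b[i + 0];
|         s1 += a[i + 1] * b[i + 1];
|         s2 += a[i + 2] * b[i + 2];
|         s3 += a[i + 3] * b[i + 3];
|       }
|       for (; i < n; ++i) s0 += a[i] * b[i];  // leftovers
|       return (s0 + s1) + (s2 + s3);
|     }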
|
| I think highway looks great. I'm sure I'll procrastinate on
| something important to play with it reasonably soon.
| superjan wrote:
| Great work! Does it also support sorting object pointers by a
| numeric key?
| mchusma wrote:
| Very cool work! What do you see as the typical process of
| getting code like this into use in well-known codebases? I'm
| trying to get a sense of when a typical engineer (operating
| with higher-level languages or libraries) might get to take
| advantage of this.
|
| Thanks for open sourcing!
| janwas wrote:
| Thanks! Highway is included in Chrome and Firefox. Not sure
| what you mean by process? It's pretty easy for example using
| git submodules - you tell Git the Highway version you want to
| depend on, notify your CMake or Bazel build system of the hwy
| library dependency, and that's it.
|
| One requirement is C++11 or later - some people have asked
| about supporting C but I believe that's infeasible.
| DennisL123 wrote:
| Well done, Jan!
| BeeOnRope wrote:
| Interesting post and paper, thanks!
|
| Sometimes the state of the art is not found in another paper
| but somewhere else, e.g., there is vxsort by Dan Shechter
| (damageboy):
|
| https://github.com/damageboy/vxsort-cpp/tree/master
|
| He uses a similar approach and while I'm not sure how it
| compares to the older Blacher et al version, I expect it to be
| in the ballpark.
|
| He's written an excellent series of blog posts on the design &
| implementation:
|
| https://bits.houmus.org/2020-01-28/this-goes-to-eleven-pt1
| johntortugo wrote:
| I think it would be interesting to have a power consumption
| comparison as well.
| janwas wrote:
| Definitely, I have on my TODO to give turbostat a try.
| SemanticStrengh wrote:
| jart wrote:
| I wonder how fast it is compared to djbsort
| https://github.com/jart/cosmopolitan/blob/master/libc/nexgen...
| and longsort
| https://github.com/jart/cosmopolitan/blob/e011973593407f576d...
| djbsort is outrageously fast for 32-bit ints with avx2 (which,
| unlike avx512, is the avx we can reliably use in open source).
| But there's never been a clear instruction set to use on Intel /
| AMD for sorting 64-bit ints that's reliably fast. So if this
| thing can actually sort 64-bit integers 10x faster on avx2 I'd be
| so thrilled.
| wtallis wrote:
| > the avx we can reliably use in open source
|
| I'm not sure what you mean by that. You can't assume the
| presence of AVX or AVX2 without explicitly checking for it,
| because Intel was still disabling those features on new low-end
| Pentium and Celeron parts at least as recently as _Comet Lake_
| (2020). Sure, AVX2 support is much more widespread than AVX512
| support, but that has nothing to do with open source and it's
| a bit strange to describe that in terms of reliability.
| jart wrote:
| Some of us like to think of ourselves writing open source as
| serving the public interest. It's hard to do that if you're
| focusing on an ISA the public doesn't have. I haven't seen
| any consumer hardware that has AVX512.
| eklitzke wrote:
| Lots of consumer hardware has AVX512 (I have an 11th gen
| Intel laptop CPU that has it).
|
| Regardless, Clang and GCC both support function multi-
| versioning where you supply multiple versions of a function
| and specify which CPU features each implementation needs,
| and the best version of the function will be selected at
| runtime based on the results of cpuid. For example, you can
| use this to write a function that uses no vector
| instructions, SSE, AVX2, or AVX512 and all versions will be
| compiled into the executable and the best version you can
| actually use will be selected at runtime. This is how glibc
| selects the optimal version of functions like
| memset/memcpy/memcmp, as there are vector instructions that
| significantly speed these functions up.
|
| There's an LWN article about the feature if you're curious
| how it works: https://lwn.net/Articles/691932/
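|
| A minimal sketch of what that looks like with the GCC/Clang
| target attribute (an illustrative toy function, not glibc's
| actual code): the compiler emits each version and generates an
| ifunc resolver that picks one via cpuid at load time.
|
|     #include <cstddef>
|
|     __attribute__((target("default")))
|     long Sum(const int* v, size_t n) {
|       long s = 0;
|       for (size_t i = 0; i < n; ++i) s += v[i];
|       return s;
|     }
|
|     __attribute__((target("avx2")))
|     long Sum(const int* v, size_t n) {
|       long s = 0;                     // same source; compiler
|       for (size_t i = 0; i < n; ++i)  // may auto-vectorize
|         s += v[i];                    // this one with AVX2
|       return s;
|     }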
| akelly wrote:
| Intel 10th gen mobile and 11th gen mobile and desktop,
| excluding Pentium and Celeron, have AVX-512. And all 12th
| gen have it on the P cores but not the E cores. If the E
| cores are enabled then AVX-512 is unavailable.
| monocasa wrote:
| On 12th gen they disabled it on the P cores too, even with the
| E cores disabled, via a microcode update. A lot of newer
| systems don't have access to the older microcode, and
| microcode doesn't typically let you downgrade.
| wtallis wrote:
| There are workarounds for downgrading microcode, because
| the CPU itself doesn't actually have non-volatile storage
| for microcode updates and relies on the motherboard
| firmware to upload updates on each boot (and motherboard
| firmware can often be downgraded, possibly after changing
| a setting to allow that).
|
| Which is probably why Intel has changed to disabling
| AVX512 using fuses in more recently manufactured Alder
| Lake CPUs.
| monocasa wrote:
| My point with "a lot of newer systems" was that there are
| motherboards now that completely lack a version of their
| firmware with the microcode that allows avx-512. There's
| nothing to downgrade to without an exploit to allow
| making your own firmware images with mixed and matched
| components.
| robertlagrant wrote:
| > Some of us like to think of ourselves as
|
| I don't see how this is relevant to anything.
| janwas wrote:
| I agree AVX-512 is not exactly widespread on client CPUs
| but as akelly mentions, it does exist (e.g. Icelake).
|
| What we do is dispatch to the best available instruction
| set at runtime - that costs only an indirect branch, plus
| somewhat larger binary and longer compile time.
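|
| The underlying idea, as a generic sketch (made-up function
| names, not Highway's actual API):
|
|     #include <cstddef>
|
|     void SortAVX512(int* v, size_t n);  // hypothetical
|     void SortAVX2(int* v, size_t n);    // per-target
|     void SortScalar(int* v, size_t n);  // implementations
|
|     using SortFn = void (*)(int*, size_t);
|
|     SortFn ChooseSort() {
|       if (__builtin_cpu_supports("avx512f")) return SortAVX512;
|       if (__builtin_cpu_supports("avx2")) return SortAVX2;
|       return SortScalar;
|     }
|
|     void Sort(int* v, size_t n) {
|       static const SortFn kFn = ChooseSort();  // chosen once
|       kFn(v, n);  // the indirect branch mentioned above
|     }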
| wtallis wrote:
| Even if AVX512 was entirely constrained to server hardware
| (it's not), how would it be contrary to the public interest
| for open-source software to take advantage of those
| instructions?
| janwas wrote:
| Yes, we can sort 64-bit ints. The speedup on AVX2 is roughly
| 2/3 of the 10x we see on AVX-512. Longsort appears to be an
| autovectorized sorting network. That's only going to be
| competitive or even viable for relatively small arrays
| (thousands). See comments above on djbsort.
|
| Why not use whichever AVX the CPU has? Not a problem when using
| runtime dispatch :)
| heavenlyblue wrote:
| What about performance-per-watt?
| celrod wrote:
| The bigger the vectors, the better the performance per
| watt.
| olliej wrote:
| Re-implementation of stuff with SIMD is always amazing to me. I
| have done stuff with SIMD before: 4 element float vector
| operations; basic arithmetic on arrays of floats.
|
| Those are things that are super simple, once you get past the
| scary mystique of SIMD.
|
| It's stuff like this that should be getting all the witch burning
| :D
| MrYellowP wrote:
| Man, I was thinking about this problem and came up with some
| weird solutions.
|
| I actually thought all this stuff had already been solved?
|
| Is there a point in still trying to improve efficiency of
| sorting?
|
| I can do that! I'd love such a challenge if there was a point to
| it!
| joebob42 wrote:
| Sorting is one of the few things that computers spend so much
| time doing that even a relatively small improvement over the
| state of the art can have a big impact.
|
| On the one hand this means if you can come up with a better way
| to do it in some case that's awesome and could be very
| valuable. On the other hand it means a lot of people have
| thought pretty hard about it and the next improvement may be
| hard to come up with unless you pick a tight enough subset of
| the problem.
|
| Definitely go for it though :)
| jerryjerryjerry wrote:
| This is great, and can definitely improve sort performance in
| database and big data projects. I can immediately imagine this is
| a perfect match for my columnar processing engine project,
| TiFlash (https://github.com/pingcap/tiflash).
| BeeOnRope wrote:
| Are there instructions for building & reproducing the performance
| results?
| VHRanger wrote:
| I love work like this and SIMD-JSON, which vectorizes what
| "should be" sequential problems to find massive performance gains
| charcircuit wrote:
| Any nonsequential problem can be divided into sequential
| subproblems.
| singhrac wrote:
| Another one if you enjoy this sort of thing like me is Raph
| Levien's work on the stack monoid; this [1] blog post and this
| [2] arXiv paper are great places to look.
|
| [1]: https://raphlinus.github.io/gpu/2020/09/05/stack-
| monoid.html [2]: https://arxiv.org/abs/2205.11659
| alecco wrote:
| > By comparison, the standard library reaches 58/128/117 MB/s on
| the same CPU, so we have managed a 9-19x speedup depending on the
| type of numbers.
|
| That's pretty disingenuous. The standard implementation is known
| to be terribly slow because of constraints. They should compare
| against current state of the art.
| slackerIII wrote:
| The paper goes into a lot more details:
| https://arxiv.org/pdf/2205.05982.pdf
| MontyCarloHall wrote:
| I don't see any mention in the paper of thermal clock
| throttling concerns, which can really neuter performance of
| tools that sustain use of AVX operations over a period of
| time. For the quick benchmarks presented in the paper, of
| course it will be faster. What if I'm continuously hammering
| my CPU with AVX operations? I expect it to severely
| downclock.
| kadoban wrote:
| That might be an interesting benchmark, but assuming good
| cooling isn't exactly unreasonable either.
| MontyCarloHall wrote:
| In datacenter blade servers (i.e. on a cloud VM), I've
| noticed up to 50% downclocking due to thermal throttling
| when running sustained frequent AVX operations.
|
| I'm sure an exotic watercooled setup will fare much
| better, but those aren't generally what we run in
| production.
| hinkley wrote:
| My laptop has been throttling itself for a while, I
| recently discovered. I had been trying to benchmark some
| code changes and have given up and am letting the CI
| machine run them, because my numbers are all over the place
| and go down with each run.
| bee_rider wrote:
| One option would be to go into BIOS and see if there's
| some way of just locking your CPU to one of the lower
| clock speeds. This will give lower benchmarking numbers
| of course, but at least they should be fairly stable. (in
| Linux, it is also often possible to tinker with frequencies
| while the system is running).
|
| Even on a desktop this sort of thing is sometimes
| necessary, for example my CPU has different clock speeds
| depending on how many processors are running, so I have
| to lock it to the all-core clock if I want to see proper
| parallel speedups.
|
| This might be annoying for day-to-day usage (although,
| CPUs really are insanely performant nowadays so maybe it
| will not be too bad).
| jeffbee wrote:
| On Ice Lake Xeon the penalty for using the AVX-512 features
| on a single core is -100MHz. If we pessimistically use the
| slowest part Intel sells, that is a 5% performance penalty
| (2% on their fastest parts). The speedup from this work is
| 40-60% compared to AVX2. So you'd be a fool to take the
| side of the folk myth. AVX-512 works.
|
| By the way the performance penalty for using AVX-512 on
| multiple cores when the multiple cores were already active
| is zero. There is no penalty in most server scenarios.
| MontyCarloHall wrote:
| >On Ice Lake Xeon the penalty for using the AVX-512
| features on a single core is -100MHz.
|
| That is a penalty due to licensing [0], not thermal
| throttling. As I wrote elsewhere, I've seen my clockspeed
| get cut in half across all cores on a physical die when
| running AVX-heavy operations for a sustained period of
| time, due to thermal throttling.
|
| [0] https://travisdowns.github.io/blog/2020/08/19/icl-
| avx512-fre...
| mcronce wrote:
| The default AVX offset for Ice Lake is indeed only 100MHz
| (and it doesn't exist starting with Rocket Lake), but
| 512b SIMD instructions use a lot of power, and as a
| result generate a lot of heat - so they certainly can
| cause thermal throttling or throttling due to power
| limits
| ahh wrote:
| It's the transition that kills you. Are you doing this
| full time?
| jeffbee wrote:
| My full-time thing is more search-y and takes tens of
| milliseconds so I'm not really sweating power state
| transitions that take a few micros.
| Rizz wrote:
| Which they did and showed in the previous sentence. You cannot
| possibly have missed that.
| throw-away_42 wrote:
| Also disingenuous to claim they don't make that comparison when
| it's literally the sentence before the one you quoted.
| alecco wrote:
| My bad. Apologies.
| cbetti wrote:
| Not having played with SIMD much myself, does leveraging these
| instructions for an intensive operation like a sort push other
| workloads out of the CPU more aggressively than operating on 32
| or 64 bits at a time would?
|
| In other words, do you have to be more careful when integrating
| these wide operators to preserve some resources for other
| operations?
| jeffbee wrote:
| At least on Intel, AVX has its own functional units and
| register file, so I would say it's not a major concern. It's
| possible that running some AVX instructions could be almost or
| completely free if you weren't using execution port 5 anyway,
| to the extent that instruction-level parallelism can be
| considered "free".
|
| If you're really taking a microscope to performance, the main
| hazards would be intermittently using AVX for only a few
| instructions, because that might lead to the CPU stopping for a
| few microseconds to turn the power on and off on the functional
| units. If you're using them heavily the overall thermal
| situation might cause a core or package-wide clock rate
| degradation, but if you have a use case for sustained AVX-512
| usage this is likely to be a good tradeoff.
| Arkanosis wrote:
| Pushing them out of the CPU, I don't know, but some SIMD
| instruction sets on some CPUs have side effects that can
| negatively affect the performance of other operations. For
| example, the use of AVX2 / AVX-512 can cause some Intel CPUs to
| lower their base frequency, thus reducing the performance of
| simultaneous operations that are not using SIMD.
| [deleted]
| vlovich123 wrote:
| Is that still true? I was under the impression that post-
| Haswell CPUs don't have that particular issue.
| mcronce wrote:
| Not for recent hardware - Ice Lake eliminated the
| throttling on 256b ops (they already didn't exist for
| certain 256b ops and all smaller ops), and reduced the
| throttling to almost nothing for 512b ops. Rocket Lake
| eliminated the throttling for 512b ops.
|
| https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Do
| w...
|
| They do use a lot of power (and as a result, generate a lot
| of heat), so they can still cause thermal throttling, or
| throttling due to power limits - but there's no more "AVX
| offset"
| fulafel wrote:
| Generally yes to the first question, no to the second.
|
| If you want your code to have low perturbance on other
| concurrent work done by the system, implementing it in an
| inefficient way doesn't usually help with that goal, since then
| your code will just be executing all the time because it takes
| a long time to finish. And you still won't have good control of
| the execution resources compared to using normal tools like OS
| scheduling policies to address the goal.
| VHRanger wrote:
| They do emit a lot of heat, which might actually throttle the
| CPU overall.
|
| But to my knowledge they use different registers, and when
| properly pipelined they don't hog the CPU cache like
| unoptimized algorithms that constantly round trip to RAM.
| saltcured wrote:
| Right, though your use of the terms "different registers" and
| "properly pipelined" is quite nuanced for a non-specialist.
| It leaves a lot to the imagination or prior experience!
|
| If it is a compute-heavy code that was spilling to a stack
| and now can fit in the SIMD registers, there could be a speed
| up with less memory traffic. This is analogous to a program's
| working set either fitting in RAM or spilling to swap while
| it runs, but at a different level of the memory hierarchy. In
| the extreme case, your working set could be a fixed set of
| variables in expensive loops for some sequential algorithm,
| and so the compiler's register placement ability and number of
| registers available could be the difference between
| effectively O(1) or O(n) memory accesses with respect to the
| number of compute steps.
|
| Of course, you can find such transitions between registers/L1
| cache, L1/L2 cache, L2/L3 cache (if present), local/remote
| RAM (for modern NUMA machines), etc. This happens as you
| transition from a pure CPU-bound case to something with
| bigger data structures, where the program focus has to move
| to different parts of the data to make progress. Naively
| holding other things equal, an acceleration of a CPU-bound
| code will of course mean you churn through these different
| data faster, which means more cache spills as you pull in new
| data and have to displace something else. Going back to the
| earlier poster's question, this cache spilling can have a
| detrimental effect on other processes sharing the same cache
| or NUMA node, just like one extremely bloated program can
| cause a virtual memory swapping frenzy and hurt every other
| program on the machine.
|
| One form of "pipelining" optimization is to recognize when
| your loop is repeatedly fetching a lot of the same input data
| in multiple iterations and changing that to use
| variables/buffers to retain state between iterations and
| avoid redundant access. This often happens in convolution
| algorithms on arrays of multi-dimensional data, e.g. image
| processing and neural nets and many cellular-automata style
| or grid-based simulations. Your algorithm slides a "sampling
| window" along an array to compute an output value as a linear
| combination of all the neighboring values within the window
| using some filtering kernel (a fixed pattern of
| multiplicative scalars/weights). Naive code keeps loading
| this whole window once for each iteration of the loop to
| address one output position. It is effectively stateless
| except for the loops to visit every output. Pipelined code
| would manage a window buffer and fetch and replace only the
| fringe of the buffer while holding and reusing inputs from
| the previous iterations. While this makes the loops more
| stateful, it can dramatically reduce the number of memory
| reads and shift a memory-bound code back into a CPU-bound
| regime.
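|
| To make that concrete, an illustrative 1D example (hypothetical
| code, not from any particular library): a 3-tap moving average
| that carries its window in locals instead of reloading three
| inputs per output.
|
|     #include <cstddef>
|
|     void MovingAverage3(const float* in, float* out,
|                         size_t n) {
|       if (n < 3) return;
|       float a = in[0], b = in[1];    // window state reused
|       for (size_t i = 1; i + 1 < n; ++i) {
|         const float c = in[i + 1];   // one new load per output
|         out[i] = (a + b + c) * (1.0f / 3);
|         a = b;                       // slide the window
|         b = c;
|       }
|     }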
|
| Other major optimizations for these N-dimensional
| convolutions are done by the programmer/designer: some
| convolution kernels are "decomposable" and the expensive
| multi-dimensional operation can be strength-reduced to a
| sequence of 1-dimensional convolutions with appropriately
| chosen filter kernels to produce the same mathematical
| result. This optimization has nothing to do with SIMD but
| simplifies the work needed to carry out the operation: fewer
| input values to read (e.g. a 1D line of neighbors instead of a
| 2D box) and fewer arithmetic operations to do all the
| multiply-and-add steps that produce the output. Imagine a 2D
| filter that operates on an 8x8 window. The 2D convolution has
| to sample and combine 64 input neighbors per output, while
| the decomposed filter does 8 neighbors in each axis and so
| one quarter as many steps total after one pass for the X axis
| and one pass for the Y axis.
|
| Naively, decomposition is done as separate 1D passes over the
| data and so performs more memory writes and allocates an
| additional intermediate array between the original input and
| final output. This is often a big win in spite of the extra
| memory demands. It's a lot more coding, but this could also
| be "pipelined" to some degree in N-dimensions to avoid
| pushing the entire intermediate results arrays through main
| memory. Approaches differ, but you can make different
| tradeoffs for how many intermediate results to store or
| buffer versus redundant calculation in a less stateful
| manner.
|
| Generally, as you make the above kinds of optimizations your
| code becomes more "block-based", loading larger chunks of
| data with fewer changes to new "random" memory locations.
| This is very much like how databases and filesystems
| optimized their access to disk storage in prior decades to
| avoid random seeks for individual bytes or blocks and to
| instead use clever structures to support faster loading of
| sets of blocks or pages. When this is successful, your
| program does most of its memory access sequentially and
| achieves higher throughput that is bound by total memory bus
| bandwidth rather than random access latency limitations.
|
| The final form of "pipelining" matters once you really hit
| your stride with these optimized, block-based programs. All
| this access again can cause cache spilling as you sustain
| high bandwidth memory access. But if you can provide the
| right hints to the CPU, or if the CPU is clever enough to
| detect the pattern, it can shift into a different cache
| management regime. Knowing that you will only access each
| block of data once and then move on (because you are holding
| the reusable state in registers), the CPU can use a different
| prefetching and spilling policy to both optimize for your
| next memory access and to avoid churning the whole cache with
| these values you will only read once and then ignore. This
| reduces the impact of the code on other concurrent tasks
| which can still enjoy more reasonable caching behavior in
| spite of the busy neighbor.
| janwas wrote:
| The CPU has a certain upper bound on power, TDP. That limit
| is independent of whether it is reached by executing scalar
| code on lots of cores, or SIMD on a few. One little-known
| fact is that out of order CPUs burn lots of energy just on
| scheduling instructions, ~10x more than the operation itself.
| SIMD amortizes that per-instruction overhead over multiple
| data elements, and actually increases the amount of useful
| work done per joule. We're talking 5-10x increase in energy
| efficiency.
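|
| As a rough illustration (made-up but plausible numbers,
| consistent with the ~10x overhead above): if a scalar op costs
| ~1 unit of energy for the arithmetic plus ~10 units of
| instruction overhead, an 8-lane SIMD op costs ~8 + 10 units,
| i.e. ~2.25 units per element versus ~11, roughly a 5x
| improvement in useful work per joule.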
| gww wrote:
| I am always amazed at the algorithms re-implemented using SIMD.
| One of my favorites is the Striped Smith-Waterman approach used
| for sequence alignment. Does anyone have any good resources on
| learning to use SIMD? I've found it hard to make the "plunge".
| janwas wrote:
| :) How about Chapter 12 of Agner's awesome manual
| (https://agner.org/optimize/optimizing_cpp.pdf),
| http://const.me/articles/simd/simd.pdf,
| https://en.algorithmica.org/hpc/simd/ ?
|
| Does anyone have others to share?
| magicnubs wrote:
| The lectures and assignments for Oregon State's CS475 (Parallel
| Programming) are all available online. [0] There are lectures
| [1] and a project [2] about SIMD. I really enjoyed the entire
| course as a survey of parallel and high-performance computing.
| The full course covers multi-processing, multi-threading,
| caching, SIMD, GPUs (CUDA and OpenCL) and MPI. The projects are
| in C/C++ (along with OpenMP, CUDA and OpenCL). FYI, I think the
| last two projects use some large research GPU bank that you
| have to have special access to use, so you'd be out of luck on
| implementing the projects for those.
|
| [0] https://web.engr.oregonstate.edu/~mjb/cs575/ [1]
| https://media.oregonstate.edu/media/t/1_7wju0jtq [2]
| https://web.engr.oregonstate.edu/~mjb/cs575/Projects/proj04....
| thecompilr wrote:
| I'm the co-author of one of the papers referenced in the blog
| post (Fast Quicksort Implementation Using AVX Instructions). We
| did write the AVX512 code back in 2015; we just had nowhere to
| run it, at least publicly. The paper also very explicitly says
| that the lookup tables can instead be replaced by the AVX512
| compress instructions. The code for that paper is available at
| https://github.com/vkrasnov/avx_qsort
| slackerIII wrote:
| What would it take for something like this to make it into
| Postgres?
| polskibus wrote:
| See commment: https://news.ycombinator.com/item?id=31624451
| VHRanger wrote:
| Unlikely - Postgres stores data row-wise and this assumes
| sequentially stored columns. They even mention that issue in
| the blog post.
|
| It would be more likely to show up in something like Apache
| Arrow, which is designed around columnar storage to leverage
| these sorts of tricks.
| Denvercoder9 wrote:
| The data in indexes is stored columnar by most RDBMSes, as
| far as I know.
| wenc wrote:
| Typically a rowstore index is a B-tree data structure (on a
| column) that points to data pages in a rowstore dataset.
|
| It's not technically columnar as such.
|
| You can sort with a rowstore index today, but I imagine you
| have to materialize the index in some way to take advantage
| of vectorization.
| ddorian43 wrote:
| It actually is not.
| kjeetgill wrote:
| I mean, I guess you could call a B-Tree over a subset of
| columns "columnar"... And bit-map indexes are by columns
| too -- but that's really bending the definition of what
| most people mean by columnar storage in databases.
| janwas wrote:
| I don't know about "most", but here is a list:
| https://en.wikipedia.org/wiki/List_of_column-
| oriented_DBMSes
| ComputerGuru wrote:
| No, this is a list of columnar database systems - gp was
| talking about the index, which is often (but not always) stored
| separately from the main db records.
| janwas wrote:
| Ah, indeed, my mistake. FWIW I've observed increasing
| numbers of papers over the past couple of years
| mentioning entirely columnar databases.
| samwillis wrote:
| There are various Postgres extensions (such as Citus) that
| are columnar, so with them it should be possible.
| [deleted]
| aaaaaaaaaaab wrote:
| This works only with integers, right? Then radix sort is gonna be
| still faster on sufficiently large inputs...
| carbocation wrote:
| Looks like djb is giving it a spin:
| https://twitter.com/hashbreaker/status/1533201734369103872
| sydthrowaway wrote:
| And he shat all over Google yet again
| aaaaaaaaaaab wrote:
| But he's right.
| viraptor wrote:
| He's just investigating how to reproduce the claimed result.
| Not sure where you got your take from.
___________________________________________________________________
(page generated 2022-06-04 23:00 UTC)