[HN Gopher] Do low-level optimizations matter? Faster quicksort ...
___________________________________________________________________
Do low-level optimizations matter? Faster quicksort with cmov
(2020)
Author : fanf2
Score : 149 points
Date : 2024-08-21 20:42 UTC (20 hours ago)
(HTM) web link (cantrip.org)
(TXT) w3m dump (cantrip.org)
| mbroncano wrote:
| (2020)
| tialaramex wrote:
| Which of course makes it ironic that it says "We ought to be
| able to sort pretty fast, by now, using a Standard Library
| sort."
|
| The C++ standard promised (since 1998) that the provided sorts
| are O(n log n).
|
| But in fact popular C++ implementations ignored this and
| shipped quicksort - after all, it's "pretty fast" - but
| unfortunately this means that if bad guys can choose your
| input (or you get unlucky) you get O(n squared).
|
| Gradually they fixed this; libc++ updated to introsort (last
| century's best attempt to do this) in 2021, in response to a
| ticket opened by Orson Peters.
|
| These days you're also starting to see a trend towards the
| Rust-style "fast sort which implements a wide contract" design
| even in C++ implementations. Rust's sort doesn't actually
| _promise_ this wide contract arrangement, but it does promise
| safety and the wide contract turns out to be the simplest way
| to do that.
|
| What I mean there by "wide contract" is that these sorts don't
| blow up if your ordering rule is nonsense. They can't "sort"
| successfully, because your ordering rule is nonsense: if A > B
| && B > A then too bad, we can't sort A and B. But they _can_
| and do gracefully give up, while many older C++ stdlibs just
| cause uncontrolled mayhem in this case.
| gpderetta wrote:
| I'm pretty sure that introsort was already implemented in the
| original STL last century.
|
| In fact the standard didn't require O(n log n) until C++11;
| by then all implementations were already using introsort, so
| allowing the previous more permissive O(n^2) bound was no
| longer necessary.
| tialaramex wrote:
| I don't know which is the "original STL" in this context.
| Alexander Stepanov's proposal of a standard template
| library for C++ pre-dates Musser's invention of
| introspective sorting (introsort). Stepanov is a smart guy
| but he's also somewhat famous so I'd guess if he invented
| it first we'd know that.
|
| As I said, Orson filed a bug in LLVM bugzilla for libc++
| going quadratic for adversarial input _in 2014_, and it
| was eventually fixed (by using introsort) in 2021.
|
| There's a fun sideshow in that ticket. Remember, Orson
| actually wrote what was probably the best general-purpose
| unstable sort at the time - but all he's done is open a
| ticket saying libc++ doesn't meet the minimum requirements.
| After a while, with the team having dragged their feet and
| still not even landed the patch to just use introsort,
| somebody pipes up to report that they have benchmarked
| sorts and actually libc++ is faster than other popular C++
| sorts (in their benchmark). So, Orson politely asks them to
| try PDQsort. That's the first time he explicitly mentions
| his Pattern-Defeating Quicksort in the ticket, and maybe a
| harried LLVM bug handler wouldn't notice who filed the bug
| until then. Just another name.
|
| But it's OK, the LLVM devs are unperturbed and ignore him,
| focusing instead on the minutiae of the performance numbers
| for their _known defective_ unstable sort. Years later
| somebody eventually adds introsort to libc++ and closes the
| defect ticket. Nobody answers Orson's question; the answer
| of course is "Oh, yours is much faster".
| gpderetta wrote:
| Musser and Stepanov were working together on the STL. I
| think they were both at SGI at that time, and introsort
| was integrated into the SGI STL (around the same time the
| STL was standardized, I guess introsort was too new to
| require it in C++98); from there introsort made it into
| most other standard library implementations which are
| directly derived from SGI STL.
|
| libc++ is notable for being one of the few from-scratch
| implementations.
| karmakaze wrote:
| I would expect eliminating branches in a busy inner loop to
| matter.
|
| The interesting part is how that was done:
|
| > A New Primitive, swap_if
|
| > How can we use this method in our sort? First, let us make a
| swap_if:
|
|     inline bool swap_if(bool c, int& a, int& b) {
|         int ta = a, mask = -c;  // false -> 0, true -> 111..111
|         a = (b & mask) | (ta & ~mask);
|         b = (ta & mask) | (b & ~mask);
|         return c;
|     }
|
| > In our partition function, then, we can transform
|
|     if (*right <= pivot) {
|         int tmp = *left;
|         *left = *right, *right = tmp;
|         ++left;
|     }
|
| > into just
|
|     left += swap_if(*right <= pivot, *left, *right);
| magicalhippo wrote:
| Will we see x86 or similar CPUs replace hot instruction blocks
| with current-processor-specific optimized code during runtime,
| similar to how certain JIT VMs do?
|
| What I mean is, say you have a similar simple "x = (cond) ? a :
| b;" which the compiler has not translated to a CMOV.
|
| If this is in a hot loop then the CPU could, in theory, notice
| that "it's just doing a conditional move, I can do the CMOV
| faster" and then translate those code bytes at that memory
| location to a CMOV instead (ie during decoding or something
| like that).
|
| Not worth the complexity? I imagine it would slow down the
| decoder or wherever they insert this replacement logic. Or am I
| hopelessly out of date and they're already doing this?
| akira2501 wrote:
| I think it could only do that in spans of code where
| interrupts are disabled and purely between values already in
| registers. CPU state feels like it's in your control, but
| it's not at all in your control, and even if it was, multiple
| processes might exist and memory is never in your control.
| saagarjha wrote:
| Whether to use a branch or a conditional move isn't really
| dependent on what the source is doing but how likely the
| branch is and its dependencies. Simplifying a bit, an
| explicit cmov is basically telling the computer that the
| condition is not easy to predict and to not bother. Modern
| processors will typically do the same analysis themselves and
| perform similarly on a branch with a slight overhead.
| jeffbee wrote:
| > Will we see x86 or similar CPUs replace hot instruction
| blocks with current-processor-specific optimized code during
| runtime, similar to how certain JIT VMs do?
|
| Wasn't that one of Transmeta's things?
| gpderetta wrote:
| Look for "idiom detection". Macro uop fusion is a limited
| form of this.
|
| As a relevant example, POWER CPUs can convert some non-
| diverging branches into conditional execution if they are not
| well predictable.
|
| Many RISC CPUs can convert well known LL/SC code patterns
| into locked atomic operations to guarantee forward progress.
| kevinventullo wrote:
| There's a well-known in-place implementation of swap of the
| form: a ^= b b ^= a a ^= b
|
| (Here ^ denotes bitwise XOR)
|
| Allowing for the mask, one could do an in-place version of
| swap_if via:
|
|     a ^= (b & mask)
|     b ^= (a & mask)
|     a ^= (b & mask)
|
| The in-place version of swap is generally discouraged because
| compilers are smart
| (https://stackoverflow.com/questions/36906/what-is-the-
| fastes...) but I do wonder whether the masking in swap_if
| obscures intent to the compiler enough to close the gap.
|
| Assuming the mask is passed in, Godbolt puts OP's swap_if at 26
| instructions, versus the above swap_if at 17 instructions:
| https://godbolt.org/z/Wedco5hPv
| ack_complete wrote:
| You have the optimizer disabled and the functions are no-ops.
| Additionally, masking that way makes the XOR no longer a no-
| op, so only one masking operation is necessary -- but it
| seems that GCC already knows this trick:
|
| https://godbolt.org/z/qxMvcbrc8
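|
| For reference, the single-mask variant looks like this (a
| minimal sketch, assuming mask is 0 or all ones; not necessarily
| the exact code behind the godbolt link):
|
|     inline bool swap_if(bool c, int& a, int& b) {
|         int mask = -int(c);  // false -> 0, true -> all ones
|         a ^= b;
|         b ^= a & mask;       // b picks up old a only when c
|         a ^= b;              // undoes step 1, or completes the swap
|         return c;
|     }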
| kevinventullo wrote:
| Oof, thank you for the correction.
| Someone wrote:
| > The in-place version of swap is generally discouraged
| because compilers are smart
|
| Isn't that more because CPUs slow down when there are
| dependencies between instructions?
|
| Compilers could (and may even do) fairly easily detect that
| pattern (like they do with some versions of _popcnt_. See for
| example
| https://langdev.stackexchange.com/questions/3942/what-are-
| th..., discussed here in
| https://news.ycombinator.com/item?id=40987123) and compile it
| to whatever is fastest on the target CPU/in the given context
| (in some contexts, it can be compiled to nothing, just
| changing the mapping between local variables and the
| registers they're stored in)
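|
| For example, a classic bit-counting loop (a sketch; whether a
| given compiler recognizes it depends on version and flags):
|
|     unsigned popcount_loop(unsigned x) {
|         unsigned n = 0;
|         while (x) {
|             x &= x - 1;  // clear the lowest set bit
|             ++n;
|         }
|         return n;
|     }
|
| Recent GCC and Clang can spot this idiom and emit a single
| popcnt instruction when the target supports it.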
| gpderetta wrote:
| > Isn't that more because CPUs slow down when there are
| dependencies between instructions?
|
| Indeed, while reg-reg moves can be resolved at the rename
| stage and are effectively free latency-wise.
| Almondsetat wrote:
| CPUs have shortcuts in the pipeline to provide the results
| to the dependent operation in time
| a_e_k wrote:
| This sort of trick - turning conditionals into bitmasks and
| then shuffling masked values around - was fairly common back
| in the day when writing code with SSE intrinsics.
|
| You wouldn't want to have to check each lane individually since
| that would spoil the point of using SSE. So you'd write masking
| ops like this and keep things parallel, then judiciously use
| any() and all() type intrinsics to check whether the lanes
| could agree to skip certain blocks conditionally.
|
| (This was before ISPC and before autovectorizers could handle
| it for you.)
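|
| A minimal sketch of the pattern with SSE2 intrinsics (the
| function name here is illustrative):
|
|     #include <emmintrin.h>
|
|     // per-lane: a < b ? a : b, across four int32 lanes
|     __m128i select_min(__m128i a, __m128i b) {
|         __m128i mask = _mm_cmplt_epi32(a, b);  // all ones where a < b
|         return _mm_or_si128(_mm_and_si128(mask, a),
|                             _mm_andnot_si128(mask, b));
|     }
|
| The any()/all() style checks were typically built on
| _mm_movemask_epi8 over a comparison mask.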
| djmips wrote:
| Does ISPC + auto vectorization truly remove the need to
| attend to these details in the most optimized code?
| a_e_k wrote:
| They can, yes.
|
| For example, here's a simple Compiler Explorer link with a
| simple loop conditionally copying from one of two values:
|
| https://godbolt.org/z/TbeK99sx1.
|
| If you look at the assembly code produced, you'll see in
| lines 9-11, it's doing a similar thing as the swap_if()
| function: ANDing one value with the mask, ANDing the other
| with the negation of the mask, and then ORing them
| together. And it produced that automatically from the
| ternary in the loop.
|
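| The loop in question is along these lines (a sketch; the exact
| code behind the link may differ):
|
|     void select_loop(float* dst, const float* a, const float* b,
|                      const int* cond, int n) {
|         for (int i = 0; i < n; ++i)
|             dst[i] = cond[i] ? a[i] : b[i];
|     }
|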
| They can also do things like automatically generate
| epilogues with a serial loop to handle the leftover items
| if the loop count isn't an even multiple of the SIMD width.
| That's something else that I remember writing by hand
| before autovectorizers.
|
| That said, coaxing an autovectorizer to actually vectorize
| a loop can be an art in itself. They're notoriously
| finicky. My favorite used to be the one in the old
| ("classic") ICC compiler where you use compiler flags to
| get a nice optimization report which would explain exactly
| why it chose what it did for each loop. You could also set
| #pragmas which would noisily warn or error out if the
| associated loop failed to vectorize for some reason.
|
| (And of course, a lot of this stuff you might just use GPU
| compute for now.)
| mgaunard wrote:
| Normally there is an instruction for this called blend
| (since SSE4.1).
|
| There shouldn't be a need to do any bitwise magic.
| jcla1 wrote:
| You can get gcc to generate the blend instruction in the
| example given - you just have to 'force' it to use SIMD
| more aggressively by, for example, using -march=native.
| mgaunard wrote:
| That's not forcing anything, if you're not getting blend
| you're just targeting an old ISA without that
| instruction.
|
| It's up to you to choose your target adequately based on
| your deployment requirements.
| memset wrote:
| Fascinating! How does one learn to write C code that will make
| use of specific asm instructions?
|
| simdjson is another example that comes to mind. The conceit of C
| is that you don't have control over the underlying machine
| instructions without inlining them yourself. So how do people
| write C code with CPU optimizations like this?
| anonymoushn wrote:
| Use inline assembly or intrinsics. Read the code of stuff that
| does these things.
|
| Resources sufficient for implementing simdjson or similar
| vectorized parsers:
| https://www.intel.com/content/www/us/en/docs/intrinsics-guid...
| https://lemire.me/blog/ http://0x80.pl/articles/
| addaon wrote:
| There are three approaches:
|
| 1) Use intrinsics, if your platform provides intrinsics for the
| instructions you want to use.
|
| 2) Use inline assembly (which I suppose doesn't technically
| count as writing C code, but is very much part of the story).
|
| 3) Write C code carefully against a chosen compiler with chosen
| optimization flags, and inspect the assembly. Especially for
| small chunks of code, a tool like godbolt.org is indispensable.
| Basically, you're writing the assembly you want the compiler to
| generate (either explicitly, or just in your head), then
| writing C code that seems likely to generate that sequence of
| instructions, then tweaking the code until the output you want
| is generated. If this is a super important optimization, it's
| also reasonable to add a build step that inspects the generated
| assembly and fails the build if it doesn't match the desired
| pattern; but in that case, writing inline assembly is usually
| easier.
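|
| A small example of approach 3 (illustrative; the compiler is
| free to choose differently, so inspect the output):
|
|     // Written so the obvious lowering is a cmov; GCC and Clang
|     // at -O2 on x86-64 usually oblige for min/max shapes.
|     int clamp_max(int x, int hi) {
|         return x > hi ? hi : x;
|     }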
| vlovich123 wrote:
| Option 4 is sometimes compiler annotations like
| __builtin_expect_with_probability, which you could use to
| assign a branch a probability of 50%, which should coerce the
| compiler to pick cmov (same risk as 3, though, in that you
| need to inspect the compiler output).
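|
| Something like this (a sketch; the builtin exists in GCC 9+ and
| recent Clang):
|
|     int pick(int cond, int a, int b) {
|         // Declare the branch a coin flip, nudging the compiler
|         // toward cmov; it's a hint, not a guarantee.
|         if (__builtin_expect_with_probability(cond != 0, 1, 0.5))
|             return a;
|         return b;
|     }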
| addaon wrote:
| Yep, there's an entire book to be written on techniques
| that you can use to express your understanding of code to
| the compiler so that the compiler's chosen implementation
| of it matches yours. Plenty of __builtins and
| __attributes__, and even just the ordering of if/else
| statements, grouping statements into local scopes with {},
| using or not using a temporary variable...
| adelpozo wrote:
| I would say it is a moving target if the goal is to write C
| code that compiles to a seemingly magical CPU instruction. As
| others have pointed out, learning assembly and Compiler
| Explorer are useful things.
|
| As a way to show the wonderful complexities of compilers check
| the TOC of https://shop.elsevier.com/books/optimizing-
| compilers-for-mod...
| kimixa wrote:
| > Back in 2000, AMD included cmov in its 64-bit x86 ISA
| extensions. Then, Intel had to adopt them when Itanium flopped.
|
| Wasn't "cmov" one of the things added for the pentium pro? So it
| wasn't instruction compatible - hence the "i686" prefix to a lot
| of compiler triples?
| pbsd wrote:
| Yes, that is correct.
| tedunangst wrote:
| Intel was so embarrassed by the failure of itanium they
| invented a Time Machine and went back and added the instruction
| to a 1995 CPU. Deceptive and anti-competitive!
| basementcat wrote:
| Intel failed to predict the timeline branch in which i686
| arch had cmov so they had to roll back and replay it.
| winternewt wrote:
| Also known as Speculative CPU manufacturing
| dgl wrote:
| The most important bit of this is in the conclusion:
|
|     Before we conclude anything, we should remind ourselves of
|     its limitations. The tests run were on completely random
|     data. Truly random data seldom occurs in real life.
|
| Linus famously ranted about CMOV in
| https://yarchive.net/comp/linux/cmov.html (2007, so potentially
| more modern architectures are better at some of this) and he
| says:
|
|     if you KNOW the branch is totally unpredictable, cmov is
|     often good for performance. But a compiler almost never
|     knows that.
|
| As usual with optimizations like this you have to benchmark,
| and even then, if your sample isn't representative, it might
| not mean much.
| tialaramex wrote:
| You can see how dramatically the actual data changes sort
| performance in e.g. this (summary of the current unstable sort
| in Rust, ipnsort)
|
| https://github.com/Voultapher/sort-research-rs/blob/main/wri...
|
| Notice how random_s95 is _worse_ (not by much, but it's there)
| than fully random. random_s95 is 95% sorted data, but 5%
| unsorted, simulating a common "sort, do stuff, append, repeat"
| pattern we see in a lot of software.
|
| In contrast the sorted cases are almost instant, and random_d20
| (only 20 distinct values, chosen at random, but as a result the
| sorted output needs far fewer comparisons) is very fast.
| jnordwick wrote:
| When Linus made that comment cmov had something like a 6-cycle
| latency. For the last decade it has been 1 cycle, and I don't
| think there is any scenario where cmov is now slower than a
| branch.
| haberman wrote:
| The problem is not the 1 cycle latency, but the data
| dependency on both values. A correctly-predicted branch cuts
| the dependency on the value that is not used.
|
| I've definitely measured scenarios where cmov/branchless is
| slower than a branch for a given algorithm. Especially if the
| branchless version is doing a bit more work to avoid the
| branch.
| unnah wrote:
| Good point. It makes me wonder if modern out-of-order
| processors can skip performing unused computations
| altogether, if their result registers are overwritten by
| other data later (in program order).
| ants_a wrote:
| No, and it feels unlikely that they will either.
| gpderetta wrote:
| CPUs could in theory speculate CMOV, reintroducing
| prediction when predictable. But after Spectre, IIRC
| Intel now guarantees that CMOV is never speculated.
| gpderetta wrote:
| It is both though. At 6+ cycles there are only a few places
| where CMOV is a win. At 1 cycle you can be more liberal
| with its use and the lack of dependency breaking is a
| tradeoff.
| clausecker wrote:
| Are you sure? I recall cmov always having single cycle
| latency.
| BoardsOfCanada wrote:
| I know that it was decoded into 2 micro-ops and thus had to
| be decoded by the wide decoder, so perhaps 2 cycles?
| clausecker wrote:
| That could be. Agner's tables seem to confirm that.
| mgaunard wrote:
| A cmov-based approach is necessarily slower than a branch-
| based approach that was correctly predicted, since cmov
| requires computing both branches then selecting the result at
| the end.
| gpderetta wrote:
| Linus was specifically ranting about compilers inserting CMOV.
| These days it is actually a pain to get GCC or clang to
| generate a CMOV when you specifically want it.
|
| Also, CMOV did indeed get significantly better.
| pcwalton wrote:
| Note that random data is not a common case for sorting
| algorithms. It'd be interesting to see how the numbers change on
| partially-, mostly-, and fully-sorted data.
| kragen wrote:
| the article does mention this:
|
| > _What can we conclude from this discovery? Before we conclude
| anything, we should remind ourselves of its limitations. The
| tests run were on completely random data. Truly random data
| seldom occurs in real life._
|
| it's true that it's not a common case, but it's probably the
| simplest objectively justifiable case; any particular case of
| the others requires more elaborate justification for
| privileging it. why 90% sorted instead of 80%, say? or why do
| we give 20% weighting to the randomized-data case and 80%
| weighting to the 90%-sorted case, rather than some other
| weighting?
|
| and it does at least guarantee exploration of a good part of
| the algorithm's behavior space. i mean, bogosort is optimal on
| fully-sorted data, right? as long as you do the check for
| sortedness before the random shuffle instead of after it
|
| it's important that your sort not _explode_ on mostly-sorted or
| mostly-reverse-sorted data (myers's 'bog-standard quicksort'
| uses the last element as the pivot and consequently goes
| quadratic), but beyond that, the particular weighting of
| relative importance to assign to random input vs. 99% sorted
| vs. 90% sorted vs. 90% reverse sorted--that requires some
| application-specific justification
|
| aside from the ones you mentioned, another interesting case is
| a large array of records that all have equal keys. median-of-
| three quicksort or random-pivot quicksort does fine on mostly-
| or fully-sorted data, but still sucks when everything is equal!
| xiaodai wrote:
| The radix sort implementation is not optimal. Instead of sorting
| on one digit at a time, it should sort on 11 bits at a time to
| saturate the cache lines.
|
| So the baseline radix sort can be even faster.
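|
| Concretely, a 32-bit key splits into three 11-bit digits
| (11+11+10), giving 2048 counting buckets per pass (a sketch):
|
|     // digit for LSD radix pass 0, 1 or 2
|     unsigned digit(unsigned key, int pass) {
|         return (key >> (11 * pass)) & 0x7FF;
|     }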
| djmips wrote:
| From the article - "This radix sort would be easy to make even
| faster. Instead, let us make one that is slower"
|
| The radix sort was used to provide an estimate of how much
| room for improvement the quicksort had.
| orlp wrote:
| Rather than std::swap_if, which is rather niche, I would prefer
| to see std::select(cond, if_true, if_false), with the guarantee
| that unlike the ternary operator it eagerly evaluates both
| arguments and selects between them branchlessly if possible.
|
| Something similar is coming to Rust as well: https://doc.rust-
| lang.org/nightly/core/intrinsics/fn.select_...
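|
| In the spirit of the proposal, a minimal sketch (hypothetical
| API, not in any standard; the branchless lowering is best
| effort):
|
|     template <class T>
|     T select(bool cond, T if_true, T if_false) {
|         // Both arguments are evaluated before the call; a
|         // ternary over ready values is the usual cmov shape.
|         return cond ? if_true : if_false;
|     }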
| account42 wrote:
| Wouldn't the sensible lower-level primitive be Clang's
| __builtin_unpredictable? If you standardize that then a library
| can easily build your select function on top of it. I guess
| GCC's __builtin_expect_with_probability with probability 0.5
| should have the same meaning.
| gpderetta wrote:
| > I guess GCC's __builtin_expect_with_probability with
| probability 0.5 should have the same meaning.
|
| But it doesn't! The sequence 1010101010... takes either value
| with probability 50%, yet is completely predictable.
| tialaramex wrote:
| Also the compiler vendors have seen this movie before and
| know how it ends. Too often a "performance hint" is garbage
| and ignoring it will improve performance.
|
| Which is unfortunate because these are genuinely sometimes
| useful, but it's a "We can't have nice things" situation -
| for every expert who is carefully applying a hint to a hot
| path after careful reading of the manual for the specific
| CPU model they use, there are dozens of idiots who will
| copy-paste that hint believing that it's magic speed-up
| juice, never having measured and with no idea what it's
| really doing or why.
|
| The intrinsics have the advantage that at least you're
| expressing what you meant, and it is obvious whose fault it
| is when that's slower because you screwed up - hints make
| it too easy for programmers to justify a tantrum when their
| "hint" isn't magic speed-up juice after all.
| adrian_b wrote:
| A correction to the history mentioned in the article: CMOV was
| not added by AMD around 2000.
|
| CMOV was added by Intel in 1995, to the Pentium Pro. It was the
| most important addition to the x86 ISA made by the Pentium Pro.
| So CMOV was supported by the Pentium Pro and its successors,
| like the Pentium II and Pentium III, but not by the Pentium,
| Pentium MMX or AMD K6 CPUs. AMD added CMOV starting with the
| Athlon, in 1999.
|
| The Pentium Pro was the first Intel CPU with out-of-order
| execution and it did much more speculative execution than its
| predecessors. Therefore it needed CMOV to limit the performance
| loss caused by branch mispredictions.
| lynx23 wrote:
| Heh, nice! This article reminded me of a great talk I recently
| noticed:
|
| Rethinking Binary Search: Improving on a Classic with AI
| Assistance: https://www.youtube.com/watch?v=FAGf5Xr8HZU
|
| The gist is rather simple: assuming that a substantial amount of
| your searches results in _no result_, you can bias the binary
| search to jump farther than just half, improving runtime
| noticeably.
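|
| A sketch of the mechanism (illustrative only, not the talk's
| exact scheme):
|
|     // Probe past the midpoint; when most lookups miss, absent
|     // keys can fall out of the range in fewer steps.
|     int biased_search(const int* a, int n, int key) {
|         int lo = 0, hi = n;
|         while (lo < hi) {
|             int mid = lo + (hi - lo) * 3 / 4;  // bias toward 3/4
|             if (a[mid] < key)      lo = mid + 1;
|             else if (a[mid] > key) hi = mid;
|             else                   return mid;
|         }
|         return -1;  // not found
|     }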
| austin-cheney wrote:
| Matter to whom?
|
| If your primary audience is other developers then it absolutely
| does not matter. In fact it's a way to expose deep anger. All
| that matters is convenience and measurements just piss people
| off, because measuring anything is just too hard.
|
| If your primary audience is business, especially transactional
| finance, then absolutely yes. Squeeze absolutely every
| millisecond out, measure absolutely everything in terms of
| transaction quantity, focus on concurrent operations.
| jart wrote:
| You have to work on computing frontiers like LLMs where people
| are feeling a performance crunch. Measuring things is what
| separates software engineering from computer programming.
| Software engineering turns into social science when your
| hardware is 10x faster than the thing you're building, since
| then you have no need to measure things. Without extreme self-
| imposed discipline, your work in that kind of environment will
| devolve into bloated decadence, unless you move on to the next
| computing frontier. Convenience is the kind of thing a man
| values when he isn't pushing against any fundamental limits
| imposed by the universe.
| pronoiac wrote:
| Here's the article linked as https; I'll email the mods.
|
| https://cantrip.org/sortfast.html
| MatthiasPortzel wrote:
| It's weird to admit that this is an arbitrary dataset that
| doesn't mimic real-world use cases, then admit that radix sort is
| far faster for this situation, then conclude that low-level
| optimizations are important and a cmov-wrapper should be added to
| the standard library to make programs faster. It's a great post;
| I just disagree with the conclusion.
| snek_case wrote:
| True. You would think it's maybe not that hard to find a real
| dataset to sort somewhere.
| kazinator wrote:
| > I call this bog-standard, but the quicksorts most easily found
| online are always oddly pessimized, both more complicated and
| slower. This one uses the "Lomuto partition", which is simpler
| than Hoare's.
|
| The Lomuto partitioning is a naive vandalism of Hoare's original,
| which causes quicksort to have quadratic behavior on sorted
| input.
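|
| For reference, the textbook Lomuto partition (pivot = last
| element; on sorted input every split is maximally lopsided,
| hence the quadratic behavior):
|
|     int lomuto_partition(int* a, int lo, int hi) {
|         int pivot = a[hi], i = lo;
|         for (int j = lo; j < hi; ++j) {
|             if (a[j] <= pivot) {
|                 int t = a[i]; a[i] = a[j]; a[j] = t;
|                 ++i;
|             }
|         }
|         int t = a[i]; a[i] = a[hi]; a[hi] = t;
|         return i;  // pivot's final position
|     }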
___________________________________________________________________
(page generated 2024-08-22 17:01 UTC)