[HN Gopher] Understanding DeepMind's sorting algorithm
       ___________________________________________________________________
        
       Understanding DeepMind's sorting algorithm
        
       Author : jart
       Score  : 212 points
       Date   : 2023-06-12 17:30 UTC (5 hours ago)
        
 (HTM) web link (justine.lol)
 (TXT) w3m dump (justine.lol)
        
       | mlochbaum wrote:
       | Hang on, you can't just quote MB/s numbers for an O(n log(n))
       | sort. What length were these tests run at?
       | 
        | The code size might not end up quite as good (it also requires
        | malloc), but a branchless merge sort is a contender for a fast
        | and lightweight sort (sketch below, after the links). The
        | just-published tiny-sort-rs[0] cites 632 bytes and looks like
        | ~350MB/s at 1e4 elements on Zen 3. In my tests, my own pisort[1]
        | benches a little over twice as fast as LongSort, but it uses
        | sorting networks as the base case, so it's around 5KB. It's
        | roughly based on piposort[2], which has more complicated
        | recursion but a simpler base case.
       | 
        | 400 MB/s seems a bit slow for a radix sort on that hardware: I'm
        | hitting those numbers with my own radix sort on my i5-6200U,
        | which has less than half the clock rate. I'd recommend checking
        | ska_sort_copy from [3], as it has about the same performance.
       | 
       | [0] https://github.com/Voultapher/tiny-sort-rs
       | 
       | [1]
       | https://github.com/mlochbaum/SingeliSort/blob/master/src/mer...
       | 
       | [2] https://github.com/scandum/piposort
       | 
       | [3] https://github.com/skarupke/ska_sort
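        | 
        | For concreteness, here's a minimal sketch of the kind of
        | branchless merge step being described (my illustration, not the
        | tiny-sort-rs or SingeliSort code). Each iteration advances
        | exactly one of the two inputs, chosen by a flag rather than a
        | data-dependent branch, which compilers typically lower to
        | setcc/cmov:
        | 
        |   #include <stddef.h>
        | 
        |   // Merge sorted runs a[0..na) and b[0..nb) into out.
        |   // Stable: ties are taken from a.
        |   void merge_branchless(const int *a, size_t na,
        |                         const int *b, size_t nb, int *out) {
        |       size_t i = 0, j = 0, k = 0;
        |       while (i < na && j < nb) {
        |           int take_b = b[j] < a[i];         // 0 or 1
        |           out[k++] = take_b ? b[j] : a[i];  // usually a cmov
        |           i += !take_b;
        |           j += take_b;
        |       }
        |       while (i < na) out[k++] = a[i++];     // drain leftovers
        |       while (j < nb) out[k++] = b[j++];
        |   }
        | 
        | The hot loop's only branches are the loop bounds, so random
        | input doesn't wreck the branch predictor the way a plain merge
        | does.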
        
         | mlochbaum wrote:
         | Just realized that obviously you don't need stability if you're
         | using in-place quicksort, so the tiny-sort heapsort is a better
         | recommendation. 304 bytes, although the scaling to large arrays
         | is much worse because of the awful access patterns.
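          | 
          | To make the "awful access patterns" concrete, here's a sketch
          | of heapsort's sift-down step (my illustration, not the
          | tiny-sort code): each iteration jumps from index i to 2*i+1,
          | so for large arrays nearly every step lands in a different
          | cache line.
          | 
          |   #include <stddef.h>
          | 
          |   // Restore the max-heap property for the subtree rooted at
          |   // index i in a heap of n elements.
          |   static void sift_down(int *a, size_t i, size_t n) {
          |       while (2 * i + 1 < n) {
          |           size_t child = 2 * i + 1;        // far from i
          |           if (child + 1 < n && a[child] < a[child + 1])
          |               child++;                     // larger child
          |           if (a[i] >= a[child]) break;     // heap restored
          |           int tmp = a[i]; a[i] = a[child]; a[child] = tmp;
          |           i = child;                       // descend
          |       }
          |   }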
        
         | jstanley wrote:
         | > What length were these tests run at?
         | 
          | The first example is "assembly code they published for sorting
          | an array with three items" - this isn't an entire general-
          | purpose sorting algorithm; it's just the innermost part.
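          | 
          | As a rough illustration (my sketch, not the actual AlphaDev or
          | libc++ code), a branchless sort of three items is just a
          | three-comparator sorting network built from conditional swaps,
          | which compilers tend to lower to cmov rather than branches:
          | 
          |   // Compare-exchange: after the call, a <= b.
          |   static inline void cswap(int &a, int &b) {
          |       int lo = b < a ? b : a;   // typically a cmov
          |       int hi = b < a ? a : b;
          |       a = lo;
          |       b = hi;
          |   }
          | 
          |   static inline void sort3(int &a, int &b, int &c) {
          |       cswap(b, c);
          |       cswap(a, c);
          |       cswap(a, b);
          |   }
          | 
          | The assembly in the article is essentially a hand-tuned (and,
          | in AlphaDev's case, machine-searched) version of this kind of
          | tiny fixed-size kernel.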
        
           | mlochbaum wrote:
           | Second part of the article, starting at "I thought it'd be
           | useful to share something that's actually portable and
           | executable".
        
           | srcreigh wrote:
            | The AlphaDev post claims a 1.7% improvement on large
            | sequences (250k+).
        
             | refulgentis wrote:
              | Yes, and as both posts say, that's because sorting large
              | sequences is built up from sorting small sequences :)
        
       | tabtab wrote:
       | Several years ago I read about using genetic algorithms to
       | "evolve" better mini-sorts. I wonder how the two compare.
        
       | cammil wrote:
       | Why is sorting not done by specialised hardware? Or is it
       | already?
        
         | bmc7505 wrote:
         | It can be done for fixed-length lists. Optimal sorting networks
         | [1] are an active research topic with many interesting
         | connections to differentiable sorting and ranking [2].
         | 
         | [1]:
         | https://en.wikipedia.org/wiki/Sorting_network#Optimal_sortin...
         | 
         | [2]: https://arxiv.org/abs/2105.04019
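          | 
          | To make the hardware angle concrete: a sorting network is a
          | fixed sequence of compare-exchanges, and the ones within a
          | layer touch disjoint pairs, so in hardware each layer can run
          | as a single parallel step. A sketch of the optimal 4-input
          | network (my illustration, not taken from [1] or [2]):
          | 
          |   // Compare-exchange: after the call, a <= b.
          |   static inline void cswap(int &a, int &b) {
          |       int lo = b < a ? b : a;
          |       int hi = b < a ? a : b;
          |       a = lo;
          |       b = hi;
          |   }
          | 
          |   static void sort4(int &a, int &b, int &c, int &d) {
          |       cswap(a, b); cswap(c, d);  // layer 1: disjoint pairs
          |       cswap(a, c); cswap(b, d);  // layer 2: disjoint pairs
          |       cswap(b, c);               // layer 3
          |   }
          | 
          | Five comparators in three layers, so a hardware implementation
          | can sort four inputs in three comparator delays.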
        
       | fred256 wrote:
       | This reminds me of challenge 28 from the game Human Resource
       | Machine.
        
       | smodad wrote:
       | I just realized that Justine was the person responsible for the
       | massive reduction in the memory footprint of the Llama models
       | back in March.[1] Super impressive! These are my favorite kinds
       | of blog posts.
       | 
       | [1] https://github.com/ggerganov/llama.cpp/pull/613
        
         | gajnadsgjoas wrote:
          | You mean the one who was banned by the author because of all
          | the drama that followed.
        
           | jimsimmons wrote:
            | What drama? Out of curiosity.
        
             | pcj-github wrote:
             | https://github.com/github-drama/github-drama/pull/46
        
         | m00x wrote:
         | It's just using mmap, nothing too impressive. It's a nice
         | contribution nonetheless.
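          | 
          | For anyone curious what "just using mmap" means in practice,
          | here's a minimal sketch of the technique (my illustration, not
          | the actual llama.cpp loader): map the weights file read-only
          | and let the OS page it in on demand and share the pages
          | between processes, so resident memory stays low.
          | 
          |   #include <fcntl.h>
          |   #include <sys/mman.h>
          |   #include <sys/stat.h>
          |   #include <unistd.h>
          | 
          |   // Returns a read-only pointer to the file's bytes, or
          |   // nullptr on failure; *size receives the file size.
          |   const void *map_weights(const char *path, size_t *size) {
          |       int fd = open(path, O_RDONLY);
          |       if (fd < 0) return nullptr;
          |       struct stat st;
          |       if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
          |       void *p = mmap(nullptr, st.st_size, PROT_READ,
          |                      MAP_SHARED, fd, 0);
          |       close(fd);  // the mapping keeps the file referenced
          |       if (p == MAP_FAILED) return nullptr;
          |       *size = st.st_size;
          |       return p;
          |   }
          | 
          | The weights are never copied into malloc'd buffers, and pages
          | that are never touched never have to be resident at all.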
        
       | srcreigh wrote:
       | Is it strange that it's slower in jart's testing but claimed to
       | be faster in the AlphaDev blog post?
       | 
        | jart doesn't provide details about the length of the sequences
        | used in testing, and AlphaDev basically says that between 6 and
        | 249,999 elements the optimizations are slower (they only claim
        | improvements for very small and 250k+ element sequences).
       | 
        | The AlphaDev numbers are curious as well. AFAICT there's extra
        | branching when you splice in the tiny-sequence-optimized
        | versions (slower), but better sorting for the tiny sequences
        | themselves (faster).
       | 
        | Is it that branch prediction gets an edge when the leaf nodes of
        | the recursion are all sorting tiny sequences? In jart's code the
        | recursion is DFS, which I can only guess would trample a bit on
        | branch prediction. I wonder if a BFS traversal could be better.
        | 
        | No idea what would cause this, though; curious if anyone has
        | other ideas.
        
       | CalChris wrote:
       | mov %rdx,%rcx
       | 
        | Wouldn't this _mov_ instruction be handled by the register
        | renamer (Allocate/Rename/MoveElimination/ZeroIdiom) at
        | essentially 'zero' cost? Yet clearly they're measuring a
        | difference. I'll be curious what Agner Fog and Peter Cordes
        | think.
        | 
        | Answer: move elimination can fail if the operands aren't ready,
        | and then the mov isn't zero cost, just cheaper.
        
       ___________________________________________________________________
       (page generated 2023-06-12 23:00 UTC)