[HN Gopher] New book-sorting algorithm almost reaches perfection
       ___________________________________________________________________
        
       New book-sorting algorithm almost reaches perfection
        
       Author : isaacfrond
       Score  : 246 points
       Date   : 2025-01-24 15:50 UTC (1 day ago)
        
 (HTM) web link (www.quantamagazine.org)
 (TXT) w3m dump (www.quantamagazine.org)
        
       | baruchel wrote:
       | I recall presenting a problem to my students a few years ago
       | based on the 'Library Sort' algorithm. I still clearly remember
       | the title of the original paper: 'Insertion Sort is O(n log n)'.
        
         | deathanatos wrote:
         | Presumably this:
         | https://www3.cs.stonybrook.edu/~bender/newpub/BenderFaMo06-l...
         | 
         | Kind of a click-baity title.
        
           | foota wrote:
           | It's sort of click bait, but also true, right? This is just
           | insertion sort using non-fixed indexing?
        
         | zeroonetwothree wrote:
         | This is a different problem though, despite the similar name.
        
       | acbart wrote:
       | I was actually just looking at this problem last week, when I
       | realized I needed to keep items in a database table positioned
       | arbitrarily, ideally without needing to manipulate the rest of
       | the list. So if a user adds in a new element after element 5,
       | that element becomes 6, without needing to update the original
       | item that came after element 5. There are indeed very
       | sophisticated algorithms to manage this problem and minimize
        | theoretical bounds. But it also seemed that the simplest
        | solution for this particular version of the problem is just
        | to use fractional positions, and occasionally pay a penalty
        | to re-lay out the list.
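        | 
        | A minimal sketch of that simple approach (Python; the names
        | are illustrative, not from any of the papers):
        | 
        |     class FractionalList:
        |         """Order items by fractional position; re-lay out the
        |         whole list when two positions get too close."""
        |         MIN_GAP = 1e-9
        | 
        |         def __init__(self):
        |             self.items = []  # sorted list of (position, value)
        | 
        |         def insert_after(self, i, value):
        |             # Precondition: 0 <= i < len(self.items)
        |             lo = self.items[i][0]
        |             hi = (self.items[i + 1][0]
        |                   if i + 1 < len(self.items) else lo + 1.0)
        |             if hi - lo < self.MIN_GAP:
        |                 self._relayout()  # the occasional penalty
        |                 return self.insert_after(i, value)
        |             self.items.insert(i + 1, ((lo + hi) / 2, value))
        | 
        |         def _relayout(self):
        |             # Re-space positions evenly: O(n), but rare.
        |             self.items = [(float(k), v) for k, (_, v)
        |                           in enumerate(self.items)]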
        
         | tetris11 wrote:
         | Isn't that hash table chaining?
        
           | jon_richards wrote:
           | More like tree rebalancing.
        
         | jes5199 wrote:
         | like line numbers in an old BASIC program
        
         | williamdclt wrote:
         | Ha, I had this exact problem asked as an interview question.
         | 
          | IIRC their real-life solution was to leave gaps between
          | elements (e.g. index 0, 100, 200... instead of 0, 1, 2)
          | and re-index when needed. Probably works well enough; what
          | I came up with is (as you say) the idea of fractional
          | indexing, but dealing with decimals is a pain, so you can
          | instead represent the indexes as digit strings that you
          | sort lexicographically.
         | 
          | So an element inserted between elements 1 and 2 gets an
          | index 11 (anything from 11-19 is fine). Between 1 and 11 is
          | 101. Between 11 and 2 is 12 (or anything from 12-19). Note
          | that these indexes are not numbers; they're strings that
          | are compared lexicographically.
         | 
          | I'm sure there are downsides, e.g. it takes a lot more
          | memory to sort these indexes (strings are much larger than
          | numbers). It feels a bit too smart not to have unexpected
          | downsides.
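          | 
          | A minimal sketch of the digit-string scheme (Python;
          | assumes base-10 keys that never end in '0', so there is
          | always room below any key; names illustrative):
          | 
          |     DIGITS = "0123456789"
          | 
          |     def key_between(a, b):
          |         """Digit string strictly between a and b in
          |         lexicographic order (a < b; b == "" means no
          |         upper bound)."""
          |         if b:
          |             # Skip the common prefix, padding a with '0's.
          |             n = 0
          |             while n < len(b) and (a[n:n+1] or "0") == b[n]:
          |                 n += 1
          |             if n:
          |                 return b[:n] + key_between(a[n:], b[n:])
          |         da = int(a[0]) if a else 0   # missing digit = 0
          |         db = int(b[0]) if b else 10  # missing bound = base
          |         if db - da > 1:              # room for a mid digit
          |             return DIGITS[(da + db) // 2]
          |         if len(b) > 1:               # proper prefix of b
          |             return b[0]
          |         return DIGITS[da] + key_between(a[1:], "")
          | 
          |     # key_between("1", "2")   -> "15"
          |     # key_between("1", "11")  -> "105"
          |     # key_between("11", "2")  -> "15"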
        
           | Etheryte wrote:
           | Very similar to my first intuition on how to approach this
           | problem. An interesting question that comes to mind is how
           | should you pick index sizes and how often should you
           | rebalance the whole thing. Make the index too large and
           | you're wasting a lot of space you'll never need, too small
           | and you're forced to rebalance too often. I'm guessing an
           | ideal index size is such that you can rebalance at most once
           | a night with a cron job and then avoid rebalances the whole
           | rest of the day.
           | 
           | To be fair, this sounds like one of those classic problems
           | that someone for sure already figured out in the 50s or 60s,
           | just under a different name and/or context. Hash chaining is
           | a similar idea, but not quite the same.
        
           | zeroonetwothree wrote:
           | That is how I implemented it at work around 9 years ago. Use
           | strings for ordering and if you use the full byte range they
           | end up fairly compact (rather than just 1-9 as in your
           | example). There are some edge cases that can cause it to
           | balloon in size so there is a separate reordering step but it
           | almost never needs to be invoked.
        
           | WalterBright wrote:
            | > solution was to leave gaps between elements (e.g. index
            | 0, 100, 200... instead of 0, 1, 2)
           | 
           | Reminds me of my days writing BASIC programs.
        
           | thaumasiotes wrote:
            | > IIRC their real-life solution was to leave gaps between
            | elements (e.g. index 0, 100, 200... instead of 0, 1, 2)
            | and re-index when needed. Probably works well enough;
            | what I came up with is (as you say) the idea of
            | fractional indexing
           | 
           | Those are the same thing. Leaving gaps is fractional
           | indexing. It's just fixed-point rather than floating point.
           | 
           | > an element inserted between elements 1 and 2 gets an index
           | 11 (anything between 11-19 is fine). Between 1 and 11 is 101.
           | Between 11 and 2 is 12 (or anything between 12-19). Note that
           | these indexes are not numbers, they're string that are
           | compared lexicographically.
           | 
           | This reminds me of the most interesting method of generating
           | random integers in an arbitrary range from random bits:
           | interpret the bitstream as a string representing a real
           | number (in binary) between 0 and 1. If, for example, you want
           | to use bits to generate a number between 0 and 5, you divide
           | the unit interval into sixths, and examine bits until you've
           | determined conclusively that your number lies within one of
            | those intervals:
            | 
            |     +---- 0 = 0.000000000... ---+
            |     |         0.000 -> 0        |
            |     |         0.00100 -> 0      |
            |     +-- 1/6 = 0.001010101...  --+
            |     |         0.0011 -> 1       |
            |     |         0.0100 -> 1       |
            |     +-- 2/6 = 0.010101010...  --+
            |     |         0.011 -> 2        |
            |     +-- 3/6 = 0.100000000...  --+
            |     |         0.100 -> 3        |
            |     +-- 4/6 = 0.101010101...  --+
            |     |         0.1011 -> 4       |
            |     |         0.1100 -> 4       |
            |     +-- 5/6 = 0.110101010...  --+
            |     |         0.11011 -> 5      |
            |     |         0.111 -> 5        |
            |     +---------------------------+
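            | 
            | A runnable sketch of the method (Python; exact integer
            | arithmetic, names illustrative):
            | 
            |     import random
            | 
            |     def uniform_from_bits(n,
            |                           bit=lambda: random.getrandbits(1)):
            |         """Uniform integer in [0, n): read random bits as
            |         the binary expansion of a real in [0, 1) and stop
            |         once the interval fits inside one of n buckets."""
            |         num, den = 0, 1   # interval [num/den, (num+1)/den)
            |         while True:
            |             num = 2 * num + bit()  # one bit: halve interval
            |             den *= 2
            |             j = (num * n) // den   # bucket of the left end
            |             if (num + 1) * n <= (j + 1) * den:
            |                 return j  # whole interval inside bucket j
            | 
            |     # uniform_from_bits(6) -> 0..5, each with prob. 1/6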
        
           | macNchz wrote:
           | The "real life" solution of leaving gaps & reindexing is
           | actually my earliest programming memory (as a
           | teenager)/lesson in cleverness, of feeling like I _should_
           | have been able to come up with a more clever /optimal/
           | _~~scalable~~_ solution but settling for this awkward 100 200
            | 300 thing. Nowadays I've used that approach like...dozens of
           | times over the decades of real world projects since, and I'm
           | still maintaining that web app I made in 2003, so I'm very
           | glad I never came up with something unduly clever.
        
         | zeroonetwothree wrote:
         | Wikipedia has a section on this algorithm under exponential
         | labels: https://en.m.wikipedia.org/wiki/List-labeling_problem
         | 
         | Basically it works as long as your label space is large
         | compared to the number of items. The more sophisticated methods
         | are necessary when that isn't the case. For example, say you
         | have 4 bytes for the label and 1 billion items.
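          | 
          | A tiny sketch of the gap-labeling idea (Python; GAP and the
          | names are illustrative):
          | 
          |     GAP = 2 ** 32   # label space must dwarf the item count
          | 
          |     def label_between(lo, hi):
          |         # Midpoint label; None signals the gap is exhausted
          |         # and a (rare) relabeling pass is needed.
          |         return (lo + hi) // 2 if hi - lo > 1 else None
          | 
          |     def relabel(n):
          |         # Evenly re-space n items across the label space.
          |         return [(k + 1) * GAP for k in range(n)]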
        
           | crdrost wrote:
           | This sort of problem also occurs when you're trying to do
           | CRDTs, which can roughly be described also as "design
           | something that does Git better."
           | 
           | So e.g. to frame this, one approach to a CRDT is to just
           | treat the document as a list of facts, "line 1 is 'foo', line
           | 2 is 'bar'", and each fact has a number of times it has been
           | asserted, and to "merge" you just add together the assertion
           | counts, and then you can detect conflicts when a fact has
           | been asserted more than once or fewer than zero times. So a
           | patch says "change line 2 to 'baz'", this becomes "unassert
           | that line 2 is 'bar', assert that line 2 is 'baz'" and it
           | conflicts with a patch that says "change line 2 to 'quux'"
           | because the fact "line 2 is 'bar'" has an assertion count of
           | -1.
           | 
           | But anyway, in this context you might want to allow inserting
           | lines, and then you have the list-labeling problem, you don't
           | want the patch to unassert lines 4,5,6 just to insert a new
           | line after line 3. So then an obvious thing is to just use a
           | broader conception of numbers, say "line 3.5 is <X>" when you
           | insert, and then we hide the line numbers from the user
           | anyways, they don't need to know that internally the line
           | numbers of the 7 lines go "1, 2, 3, 3.5, 4, 5, 6".
           | 
           | So then you need a relabeling step because you eventually
           | have some line at 3.198246315 and you want to be able to say
           | "yeah, that's actually line 27, let's have some sanity again
           | in this thing."
           | 
           | This also maybe hints at the fun of adding randomization,
           | consider that one person might add line 3.5, then add line
           | 3.75, and then remove line 3.5; meanwhile someone else might
           | add a different line 3.5, add a line 3.25, and then remove
           | their line 3.5, and these patches would both amount to
           | "assert line 3.25 is A, assert line 3.75 is B", and would
           | merge without conflict. This means that _in general_ if two
           | people are messing with the same part of the same document
           | asynchronously, this model is _not_ able to consistently flag
           | a merge failure in that case, but will sometimes instead just
           | randomly order the lines that were added.
           | 
           | We can then just make that a feature rather than a fault: you
           | don't insert at 3.5, which is 3 + (4 - 3) / 2, rather you
            | insert at 3 + (4 - 3) * rand(). And then when two people
           | both try to insert 12 lines between line 3 and 4
           | independently, when you merge them together, you get 24 lines
           | from both, in their original orders but interleaved randomly,
           | and like that's not the end of the world, it might be
           | legitimately better than just declaring a merge failure and
           | harrumphing at the user.
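            | 
            | A minimal sketch of the fact-counting merge described
            | above (Python; names illustrative):
            | 
            |     from collections import Counter
            | 
            |     # State: (line, text) -> assertion count.
            |     # A patch is a Counter of deltas; merging adds counts.
            |     def merge(state, *patches):
            |         merged = Counter(state)
            |         for p in patches:
            |             merged.update(p)
            |         return merged
            | 
            |     def conflicts(state):
            |         # Asserted twice, or un-asserted below zero.
            |         return {f: c for f, c in state.items()
            |                 if c < 0 or c > 1}
            | 
            |     base = Counter({(2, "bar"): 1})
            |     a = Counter({(2, "bar"): -1, (2, "baz"): 1})
            |     b = Counter({(2, "bar"): -1, (2, "quux"): 1})
            |     print(conflicts(merge(base, a, b)))  # {(2, 'bar'): -1}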
        
             | phi-go wrote:
             | > This sort of problem also occurs when you're trying to do
             | CRDTs, which can roughly be described also as "design
             | something that does Git better."
             | 
              | Aren't the goals of git and CRDTs different, though?
              | With git you want the merged result to be semantically
              | correct. With CRDTs you want to achieve convergence (so
              | no merge conflicts); as far as I know, semantically
              | correct convergence (not sure what the correct term is)
              | is not really possible with CRDTs, as it is too
              | difficult to encode. Isn't that why CRDTs are mostly
              | used for multiplayer interactive applications, where
              | these kinds of mismatches are quickly seen?
        
               | funcDropShadow wrote:
               | The technically correct term -- at least in reduction
               | systems -- would be confluence not convergence.
        
         | funcDropShadow wrote:
          | Theoretically, using fractions as list labels requires
          | unbounded amounts of memory to store the fractions. In
          | practice the available precision is extremely limited. The
          | difference really becomes a problem if you are not just
          | assigning ordered labels to a collection, but want to use
          | these labels directly as array indices to store the
          | elements. That would be a more literal modelling of the
          | library sorting problem.
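          | 
          | A quick illustration of that growth with exact arithmetic
          | (Python):
          | 
          |     from fractions import Fraction
          | 
          |     lo, hi = Fraction(0), Fraction(1)
          |     for _ in range(64):
          |         hi = (lo + hi) / 2  # always insert just above lo
          |     print(hi.denominator.bit_length())  # 65: a bit per insert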
        
         | globular-toast wrote:
         | A linked list? I guess you are forgetting to mention other
         | requirements.
        
       | AstroJetson wrote:
        | Cool. This is Lisa and Laura, my two retired teachers who are
        | now helping in the library to sort shelves.
       | 
       | Can you explain to them what you want them to now do?
       | 
        | I read the paper and I have to say, it's confusing at best.
        | But I'm hoping you can help out.
        
         | gessha wrote:
         | A lot of things are confusing at first but then we discover
         | intuitive ways to explain and teach them to others.
         | 
         | Also, don't discredit the intelligence of retirees, teachers
         | and librarians in the same post. It's bad for karma and makes
         | you sound like a douche. I'm sure they can figure it out.
        
           | AstroJetson wrote:
            | Wasn't my intent. The article was "how to do this" with
            | lots of great pictures; the linked page was full of
            | high-level math. I was mocking the "this one simple trick
            | can make your shelves full" attitude.
           | 
           | So you answered and I'll ask, "what do they do, it does not
           | seem simple" except for the Database people filling ....
           | 
           | Thanks (former teacher)
        
       | aladinn57 wrote:
       | Sounds promising! What's the key innovation behind it?
        
       | doodpants wrote:
       | Off topic, but now I want a screen saver of the animation at the
       | top of the article...
        
       | ColinDabritz wrote:
       | "Kuszmaul remembers being surprised that a tool normally used to
       | ensure privacy could confer other benefits."
       | 
       | It was surprising to me too! But reflecting on it more closely,
       | most performance isn't about "faster" in a literal sense of "more
       | instructions run per time", but about carefully choosing how to
       | do less work. The security property here being "history
       | independence" is also in a way stating "we don't need to, and
       | literally cannot, do any work that tracks history".
       | 
        | It's definitely an interesting approach to performance,
        | essentially using cryptography as a constraint to prevent
        | extra work. What properties do we need, and what properties
        | can we ignore? The question becomes: if we MUST ignore this
        | property cryptographically, how does that affect the process
        | and the related performance?
       | 
       | It certainly feels like it may be a useful perspective, a
       | rigorous approach to performance that may be a path to more
       | improvements in key cases.
        
         | zeroonetwothree wrote:
         | The actual better algorithm does use history dependence though.
         | So I found this part of the article misleading.
        
         | nickburns wrote:
         | So it's basically a matter of figuring out what problem context
         | can and should be selectively hidden from the algorithm in
         | order to make it work 'smarter' and not 'harder.' Weird.
        
         | pradn wrote:
         | That's a good insight. I had always thought the key to good
         | algorithm / data structure design was to use all the
         | information present in the data set. For example, if you know a
          | list is sorted, you can use binary search. But perhaps choosing
         | how much of it to omit is key, too. It comes up less often,
         | however. I can't think of a simple example.
        
           | koolba wrote:
           | > But perhaps choosing how much of it to omit is key, too. It
           | comes up less often, however. I can't think of a simple
           | example.
           | 
            | An example of that is a linked list with no particular sort
            | order. By not caring about maintaining an order, the
            | insertion appends or prepends a node and is O(1).
           | 
           | As soon as you have to maintain any additional invariant, the
           | insertion cost goes up. Either directly or amortized.
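            | 
            | A sketch of the unordered case (Python; illustrative):
            | 
            |     class Node:
            |         def __init__(self, value, next=None):
            |             self.value, self.next = value, next
            | 
            |     def prepend(head, value):
            |         # No order invariant to maintain: O(1).
            |         return Node(value, head)
            | 
            |     # A *sorted* list would first need an O(n) walk to
            |     # find the position that preserves the invariant.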
        
           | brookst wrote:
           | A real world analogue is detective stories that work by
           | adding irrelevant details and it's up to the reader to filter
           | down to the truly important details to solve the case.
           | 
            | To generalize: if figuring out which information is
            | important is more expensive than assuming none of it is,
            | it's better to simplify.
        
         | Thorrez wrote:
          | I don't think what you're saying is accurate. Your statement
          | would be correct if the measure of the algorithm's cost were
          | how much compute time it takes.
         | 
         | But the measurement they're actually using is how many books
         | need to be moved. They're free to use infinite compute time
         | AIUI.
        
           | brookst wrote:
           | I think the point is that having less information available
           | can improve performance because it eliminates processing of
           | extraneous data.
           | 
           | Said another way, stateless approaches are simpler than
           | stateful. There can be an instinct to optimize by using more
           | information, but in this case at least, that was a red
           | herring and adding the constraint improved the results.
        
             | delusional wrote:
              | If there's no cap on computation time, processing
              | extraneous data is free.
        
               | brookst wrote:
               | Also if your algorithm is perfect and will never be
               | misled by extraneous info.
        
       | tetris11 wrote:
       | Clean links for mobile users
       | 
       | Game: https://patorjk.com/games/subpixel-snake/ Code:
       | https://github.com/patorjk/subpixel-snake
       | 
       | Resource for subpixel geometries:
       | https://geometrian.com/resources/subpixelzoo/
        
         | TheRealNGenius wrote:
         | Wrong post
        
         | 9dev wrote:
         | This probably belongs to the post on subpixel snake, not this
         | one on the library sort algorithm :)
        
       | thaumasiotes wrote:
       | > lowering the upper bound to (log n) times (log log n)^3 --
       | equivalent to (log n)^(1.000...1)
       | 
       | This is true! One of the things I find cool about big-O
       | complexity using polynomial reference classes is that logarithms
       | give you an infinitesimal value. Take that, "infinitesimals don't
       | really exist" people!
        
         | sfelicio wrote:
         | Wait. What? Do you have a reference where I can learn about
         | this?
        
           | thaumasiotes wrote:
           | I don't really understand what you're asking for. You can
           | easily just verify the claim for yourself.
           | 
           | The limit as x goes to infinity of (ln x) / (x^k) is zero for
           | all positive k. So if you want to approximate the behavior of
           | a logarithm with a function of the form f(x) = x^k, k must be
           | greater than 0 [because x^0 = o(ln x)], but less than every
           | positive real number. This is the definition of an
           | infinitesimal.
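            | 
            | (A quick way to verify that limit is L'Hôpital's rule:
            | 
            |     \lim_{x\to\infty} \frac{\ln x}{x^k}
            |       = \lim_{x\to\infty} \frac{1/x}{k\,x^{k-1}}
            |       = \lim_{x\to\infty} \frac{1}{k\,x^k} = 0
            |       \quad \text{for every } k > 0,
            | 
            | so ln x grows more slowly than any positive power of x.)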
           | 
           | That's why it makes sense to say that x (log x)^3 is
           | equivalent to x^{1.000000...[infinite 0s]...1}. Though in
           | this analogy, it's more like x^{1.000000...3}.
        
             | da39a3ee wrote:
             | >> Do you have a reference where I can learn about this?
             | 
             | > I don't really understand what you're asking for.
             | 
             | That comes across as rude and condescending; please don't
             | communicate like that. Your English skills seem OK but in
             | case you still need help, they're asking for a "reference"
             | -- that's an online source of some sort probably, where
             | they can "learn about" it -- that means that the reference
             | would contain content helping them to understand the topic
             | better. For example, the helpful content you gave after
             | your rude and condescending initial comment would be an
             | example of such content.
        
               | User23 wrote:
                | Believing things that you could easily reason through
                | only if you are presented with some external authority
                | really is a cognitive antipattern though.
               | 
               | I'd say that prologue was more medicinal than
               | condescending. It's good to remind people in the age of
               | LLMs and ubiquitous search that "you can just think
               | things."
        
               | thaumasiotes wrote:
               | OK. What reference would you point them to? What makes
               | you think it's what they want?
        
       | msmitha wrote:
        | I was shocked to discover how the British Library (millions of
       | books, many many new ones every week) manages this. The first
       | book to arrive earlier this year went on the shelf at
       | 2025.0000001. The next went at 2025.0000002, right next to it.
       | The electronic catalogue does the rest. No need to re-shuffle
       | books around, but not a solution supportive of book-browsing.
        
         | crazygringo wrote:
         | Reminds me of how Amazon doesn't arrange its items by
         | similarity like a store does, so a model of vacuum cleaner
         | might be next to a set of kitchen plates. They actually
         | intentionally avoid similarities so pickers won't accidentally
         | grab a similar but wrong thing.
         | 
         | I lose track of so many infrequently used objects in my home --
         | which storage bin in which closet did I put the x-acto blade
         | refills in? -- and one bin will overflow while another is half-
         | empty because I try to keep similar items together. Sometimes I
         | fantasize about tracking every possession in a spreadsheet that
         | points to which bin, so I'd never lose anything and could
         | always use storage space with maximum efficiency. But then I
         | know I'd be lazy and skip updating the spreadsheet when I put
          | something new away... plus it just seems so inhumanly weird,
          | like something a robot would do rather than a person.
        
       | ryao wrote:
       | Is there any reason to think that this algorithm will actually be
       | faster than what is currently done in practice?
       | 
       | For arrays in B-tree nodes, which is the main place where I have
       | encountered this problem, I doubt it will be faster than just
        | using `memmove()`, and for truly large arrays, it would be
        | easier to use a B-tree.
       | 
        | If that is the case, this joins a number of algorithms that
        | are asymptotically faster than the algorithms used in
        | practice yet paradoxically slower than them. Examples include
        | the fast matrix multiplication algorithms, which are all
        | slower than good implementations of the textbook O(n^3)
        | algorithm (GEMM).
        
         | melvyn2 wrote:
         | These are sometimes called Galactic Algorithms:
         | https://en.wikipedia.org/wiki/Galactic_algorithm
         | 
         | The very first example on the page has a pretty good quote
         | describing their usefulness:
         | 
         | >>> An example of a galactic algorithm is the fastest known way
         | to multiply two numbers, which is based on a 1729-dimensional
         | Fourier transform. It needs O(n log n) bit operations, but as
         | the constants hidden by the big O notation are large, it is
         | never used in practice. However, it also shows why galactic
         | algorithms may still be useful. The authors state: "we are
         | hopeful that with further refinements, the algorithm might
         | become practical for numbers with merely billions or trillions
         | of digits."
        
           | dchftcs wrote:
           | One interesting thing about working on numbers so large is
           | that it becomes harder to assume that memory access is
            | constant time. That's already not quite true today (cache
            | miss vs. hit, and external memory), but when you operate
            | on data sizes that large, memory access costs grow by at
            | least the square or cube root of N. An algorithm that
            | needs largely sequential access can outperform one that
            | requires conditional random access by some function of
            | those factors, which can make some algorithms look much
            | worse or better than under the classical analysis. Of
            | course, it still does not matter whether your algorithm
            | is going to finish in a couple of trillion years or a
            | trillion trillion years.
        
       | teleforce wrote:
        | Am I the only one who was trying to find the main papers
        | described in the article, both the original problem and the
        | near-optimal algorithm paper [1], [2]?
        | 
        | Apparently both of them are linked deep inside the article;
        | perhaps Quanta can make it mandatory to list all the
        | references towards the end of the article. It would be very
        | helpful for the readers.
       | 
       | [1] Nearly Optimal List Labeling:
       | 
       | https://arxiv.org/abs/2405.00807
       | 
       | [2] A sparse table implementation of priority queues:
       | 
       | https://link.springer.com/chapter/10.1007/3-540-10843-2_34
        
         | svat wrote:
         | Both papers are linked very clearly in the article, and it was
         | easy to quickly find them just skimming through the article
         | without even reading:
         | 
         | > _This problem was introduced in a 1981 paper_
         | 
         | where "1981 paper" links to
         | https://link.springer.com/chapter/10.1007/3-540-10843-2_34
         | 
         | and in the next paragraph:
         | 
         | > _Last year, in a study that was presented at the Foundations
         | of Computer Science conference in Chicago, a team of seven
         | researchers_
         | 
         | where "a study" links to https://arxiv.org/abs/2405.00807
         | 
         | Note that both of these are in the introductory section (third
         | and fourth paragraphs), before the article goes into details,
         | history and context. If this is what counts as "Apparently both
         | of them are linked deep inside the article", then it appears
         | there exist very different ideas of what "deep inside" means.
        
       | xpe wrote:
       | I'm trying to figure out a key constraint. Does the problem
       | statement assume a fixed-length pre-allocated array?
        
         | chongli wrote:
         | No, it doesn't assume an array at all. It's a data structure
         | that maintains a totally ordered set with 3 operations:
         | 
         | insert(X), delete(X), label(X)
         | 
         | Where label extracts the label of element X (which had
         | previously been inserted but not deleted). The label is a
         | number from 0 to n-1, where n is the number of elements
         | currently stored.
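          | 
          | A naive baseline with that interface (Python; illustrative;
          | the point of the list-labeling results is to do far less
          | shifting per update than this does):
          | 
          |     import bisect
          | 
          |     class OrderedSet:
          |         def __init__(self):
          |             self._items = []  # kept in sorted order
          | 
          |         def insert(self, x):
          |             bisect.insort(self._items, x)  # O(n) shifts
          | 
          |         def delete(self, x):
          |             self._items.remove(x)          # O(n) shifts
          | 
          |         def label(self, x):
          |             # Rank of x: a number from 0 to n-1.
          |             return bisect.bisect_left(self._items, x)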
        
       ___________________________________________________________________
       (page generated 2025-01-25 23:02 UTC)