[HN Gopher] New book-sorting algorithm almost reaches perfection
___________________________________________________________________
New book-sorting algorithm almost reaches perfection
Author : isaacfrond
Score : 246 points
Date : 2025-01-24 15:50 UTC (1 day ago)
(HTM) web link (www.quantamagazine.org)
(TXT) w3m dump (www.quantamagazine.org)
| baruchel wrote:
| I recall presenting a problem to my students a few years ago
| based on the 'Library Sort' algorithm. I still clearly remember
| the title of the original paper: 'Insertion Sort is O(n log n)'.
| deathanatos wrote:
| Presumably this:
| https://www3.cs.stonybrook.edu/~bender/newpub/BenderFaMo06-l...
|
| Kind of a click-baity title.
| foota wrote:
| It's sort of click bait, but also true, right? This is just
| insertion sort using non-fixed indexing?
| zeroonetwothree wrote:
| This is a different problem though, despite the similar name.
| acbart wrote:
| I was actually just looking at this problem last week, when I
| realized I needed to keep items in a database table positioned
| arbitrarily, ideally without needing to manipulate the rest of
| the list. So if a user adds in a new element after element 5,
| that element becomes 6, without needing to update the original
| item that came after element 5. There are indeed very
| sophisticated algorithms to manage this problem and minimize
| theoretical bounds. But it also seemed like the simplest solution
| for this particular version of the problem is to just use
| fractional amounts, and occasionally pay a penalty to relayout
| the list.
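|
| A minimal sketch of that fractional scheme in Python (the helper
| names are hypothetical, not from any particular database
| library):
|
|     def position_between(prev: float, nxt: float) -> float:
|         """Position for a row inserted between two neighbours."""
|         return (prev + nxt) / 2
|
|     def needs_relayout(prev: float, nxt: float,
|                        eps: float = 1e-9) -> bool:
|         """Float precision runs out after repeated halving;
|         then you pay the occasional penalty and renumber."""
|         return (nxt - prev) / 2 <= eps
|
|     def relayout(n: int, gap: float = 1024.0) -> list[float]:
|         """Rewrite all n positions with fresh, even gaps."""
|         return [gap * (i + 1) for i in range(n)]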
| tetris11 wrote:
| Isn't that hash table chaining?
| jon_richards wrote:
| More like tree rebalancing.
| jes5199 wrote:
| like line numbers in an old BASIC program
| williamdclt wrote:
| Ha, I had this exact problem asked as an interview question.
|
| IIRC their real-life solution was to leave gaps between
| elements (eg index 0, 100, 200... instead of 0, 1, 2) and
| re-index when needed. That probably works well enough. What I
| came up with is (as you say) the idea of fractional indexing,
| but dealing with decimals is a pain, so you can instead
| represent them as vectors -- in practice, a string of digits
| that you sort lexicographically.
|
| So an element inserted between elements 1 and 2 gets an index
| 11 (anything between 11-19 is fine). Between 1 and 11 is 101.
| Between 11 and 2 is 12 (or anything between 12-19). Note that
| these indexes are not numbers; they're strings that are compared
| lexicographically.
|
| I'm sure there are downsides, eg it takes a lot more memory to
| sort these indexes (strings are much larger than numbers). It
| feels a bit too smart to not have unexpected downsides.
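|
| A sketch of a midpoint function for such digit-string indexes
| (a hypothetical helper, assuming keys never start or end with
| '0'):
|
|     def key_between(a: str, b: str) -> str:
|         """Return a digit string strictly between a and b in
|         lexicographic order, given a < b. E.g. between "1" and
|         "2" -> "15"; between "1" and "11" -> "105"."""
|         n = max(len(a), len(b)) + 1      # pad so the gap is >= 10
|         lo = int(a.ljust(n, "0"))
|         hi = int(b.ljust(n, "0"))
|         mid = str((lo + hi) // 2)        # always n digits here
|         return mid.rstrip("0")           # keep room below the key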
| Etheryte wrote:
| Very similar to my first intuition on how to approach this
| problem. An interesting question that comes to mind is how you
| should pick index sizes and how often you should rebalance the
| whole thing. Make the index too large and
| you're wasting a lot of space you'll never need, too small
| and you're forced to rebalance too often. I'm guessing an
| ideal index size is such that you can rebalance at most once
| a night with a cron job and then avoid rebalances the whole
| rest of the day.
|
| To be fair, this sounds like one of those classic problems
| that someone for sure already figured out in the 50s or 60s,
| just under a different name and/or context. Hash chaining is
| a similar idea, but not quite the same.
| zeroonetwothree wrote:
| That is how I implemented it at work around 9 years ago. Use
| strings for ordering and if you use the full byte range they
| end up fairly compact (rather than just 1-9 as in your
| example). There are some edge cases that can cause it to
| balloon in size so there is a separate reordering step but it
| almost never needs to be invoked.
| WalterBright wrote:
| > solution was to leave gaps between elements (eg index 0,
| 100, 200... instead of 0, 1, 2)
|
| Reminds me of my days writing BASIC programs.
| thaumasiotes wrote:
| > IIRC their real life solution was to leave gaps between
| elements (eg index 0, 100, 200... instead of 0, 1, 2) and re
| index when needed. Probably works well enough, what I came up
| with is (as you say) the idea of fractional indexing
|
| Those are the same thing. Leaving gaps is fractional
| indexing. It's just fixed-point rather than floating point.
|
| > an element inserted between elements 1 and 2 gets an index
| 11 (anything between 11-19 is fine). Between 1 and 11 is 101.
| Between 11 and 2 is 12 (or anything between 12-19). Note that
| these indexes are not numbers, they're string that are
| compared lexicographically.
|
| This reminds me of the most interesting method of generating
| random integers in an arbitrary range from random bits:
| interpret the bitstream as a string representing a real
| number (in binary) between 0 and 1. If, for example, you want
| to use bits to generate a number between 0 and 5, you divide
| the unit interval into sixths, and examine bits until you've
| determined conclusively that your number lies within one of
| those intervals:
|
|     +---- 0 = 0.000000000... ---+
|     |  0.000   -> 0             |
|     |  0.00100 -> 0             |
|     +-- 1/6 = 0.001010101... ---+
|     |  0.0011  -> 1             |
|     |  0.0100  -> 1             |
|     +-- 2/6 = 0.010101010... ---+
|     |  0.011   -> 2             |
|     +-- 3/6 = 0.100000000... ---+
|     |  0.100   -> 3             |
|     +-- 4/6 = 0.101010101... ---+
|     |  0.1011  -> 4             |
|     |  0.1100  -> 4             |
|     +-- 5/6 = 0.110101010... ---+
|     |  0.11011 -> 5             |
|     |  0.111   -> 5             |
|     +---------------------------+
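|
| A sketch of that method in Python (the helper name is
| hypothetical; it mirrors the diagram by tracking the interval
| pinned down by the bits read so far):
|
|     import random
|
|     def uniform_below(m: int) -> int:
|         """Sample uniformly from {0, ..., m-1} by reading random
|         bits as the binary expansion of a real in [0, 1),
|         stopping once the known interval fits inside one of the
|         m equal subintervals."""
|         num, den = 0, 1
|         while True:
|             num = 2 * num + random.getrandbits(1)
|             den *= 2
|             # the number lies somewhere in [num/den, (num+1)/den)
|             k_lo = num * m // den              # bucket of low end
|             k_hi = ((num + 1) * m - 1) // den  # last bucket touched
|             if k_lo == k_hi:
|                 return k_lo
|
| For m = 6 the bit string 0011 pins the number into [3/16, 4/16),
| which lies entirely between 1/6 and 2/6, so it returns 1 --
| matching the "0.0011 -> 1" row in the diagram above.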
| macNchz wrote:
| The "real life" solution of leaving gaps & reindexing is
| actually my earliest programming memory (as a
| teenager)/lesson in cleverness, of feeling like I _should_
| have been able to come up with a more clever /optimal/
| _~~scalable~~_ solution but settling for this awkward 100 200
| 300 thing. Nowadays I've used that approach like...dozens of
| times over the decades of real world projects since, and I'm
| still maintaining that web app I made in 2003, so I'm very
| glad I never came up with something unduly clever.
| zeroonetwothree wrote:
| Wikipedia has a section on this algorithm under exponential
| labels: https://en.m.wikipedia.org/wiki/List-labeling_problem
|
| Basically it works as long as your label space is large
| compared to the number of items. The more sophisticated methods
| are necessary when that isn't the case. For example, say you
| have 4 bytes for the label and 1 billion items.
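|
| A quick back-of-the-envelope check of that 4-byte example:
|
|     labels = 2 ** 32       # 4-byte label space
|     items = 10 ** 9        # elements to label
|     print(labels / items)  # ~4.29 labels per item: average gaps
|                            # of ~4, far too tight for gap-based
|                            # schemes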
| crdrost wrote:
| This sort of problem also occurs when you're trying to do
| CRDTs, which can roughly be described also as "design
| something that does Git better."
|
| So e.g. to frame this, one approach to a CRDT is to just
| treat the document as a list of facts, "line 1 is 'foo', line
| 2 is 'bar'", and each fact has a number of times it has been
| asserted, and to "merge" you just add together the assertion
| counts, and then you can detect conflicts when a fact has
| been asserted more than once or fewer than zero times. So if a
| patch says "change line 2 to 'baz'", this becomes "unassert
| that line 2 is 'bar', assert that line 2 is 'baz'", and it
| conflicts with a patch that says "change line 2 to 'quux'"
| because the fact "line 2 is 'bar'" ends up with an assertion
| count of -1.
|
| But anyway, in this context you might want to allow inserting
| lines, and then you have the list-labeling problem, you don't
| want the patch to unassert lines 4,5,6 just to insert a new
| line after line 3. So then an obvious thing is to just use a
| broader conception of numbers, say "line 3.5 is <X>" when you
| insert, and then we hide the line numbers from the user
| anyways, they don't need to know that internally the line
| numbers of the 7 lines go "1, 2, 3, 3.5, 4, 5, 6".
|
| So then you need a relabeling step because you eventually
| have some line at 3.198246315 and you want to be able to say
| "yeah, that's actually line 27, let's have some sanity again
| in this thing."
|
| This also maybe hints at the fun of adding randomization,
| consider that one person might add line 3.5, then add line
| 3.75, and then remove line 3.5; meanwhile someone else might
| add a different line 3.5, add a line 3.25, and then remove
| their line 3.5, and these patches would both amount to
| "assert line 3.25 is A, assert line 3.75 is B", and would
| merge without conflict. This means that _in general_ if two
| people are messing with the same part of the same document
| asynchronously, this model is _not_ able to consistently flag
| a merge failure in that case, but will sometimes instead just
| randomly order the lines that were added.
|
| We can then just make that a feature rather than a fault: you
| don't insert at 3.5, which is 3 + (4 - 3) / 2, rather you
| insert at 3 + (4 - 3) * rand(). And then when two people
| both try to insert 12 lines between line 3 and 4
| independently, when you merge them together, you get 24 lines
| from both, in their original orders but interleaved randomly,
| and like that's not the end of the world, it might be
| legitimately better than just declaring a merge failure and
| harrumphing at the user.
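|
| A toy sketch of that fact-counting merge in Python (the data
| layout is hypothetical; real CRDTs are considerably more
| involved):
|
|     def merge(a: dict, b: dict) -> dict:
|         """Merge two fact multisets by adding assertion counts.
|         A fact maps e.g. (line_label, text) -> count; patches
|         use -1 for unassert and +1 for assert."""
|         out = dict(a)
|         for fact, n in b.items():
|             out[fact] = out.get(fact, 0) + n
|         return out
|
|     def conflicts(doc: dict) -> list:
|         """Any fact asserted more than once or fewer than zero
|         times is a conflict."""
|         return [f for f, n in doc.items() if n not in (0, 1)]
|
|     base = {(2, "bar"): 1}
|     p1 = {(2, "bar"): -1, (2, "baz"): 1}   # line 2 -> 'baz'
|     p2 = {(2, "bar"): -1, (2, "quux"): 1}  # line 2 -> 'quux'
|     doc = merge(merge(base, p1), p2)
|     print(conflicts(doc))                  # [(2, 'bar')]: count -1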
| phi-go wrote:
| > This sort of problem also occurs when you're trying to do
| CRDTs, which can roughly be described also as "design
| something that does Git better."
|
| Aren't the goals of git and CRDTs different? With git you
| want the merged result to be semantically correct. With
| CRDTs you want to achieve convergence (so no merge
| conflicts). As far as I know, semantically correct
| convergence (not sure what the correct term is) is not
| really possible, as it is too difficult to encode for
| CRDTs. Isn't that why CRDTs are mostly used for
| multiplayer interactive applications where these kinds of
| mismatches are quickly seen?
| funcDropShadow wrote:
| The technically correct term -- at least in reduction
| systems -- would be confluence not convergence.
| funcDropShadow wrote:
| Theoretically, using fractionals as list labels requires
| unbounded amounts of memory to store the fractions. In practice
| that is extremely limited. And the difference becomes a real
| problem if you are not just assigning ordered labels to a
| collection but want to use these labels directly as array
| indices to store the elements. That would be a more literal
| modelling of the library sorting problem.
| globular-toast wrote:
| A linked list? I guess you are forgetting to mention other
| requirements.
| AstroJetson wrote:
| Cool. This is Lisa and Laura, my two retired teachers who are
| now helping in the library to sort shelves.
|
| Can you explain to them what you want them to now do?
|
| I read the paper and I would like to say, it's confusing at best.
| But I'm hoping you can help out.
| gessha wrote:
| A lot of things are confusing at first but then we discover
| intuitive ways to explain and teach them to others.
|
| Also, don't discredit the intelligence of retirees, teachers
| and librarians in the same post. It's bad for karma and makes
| you sound like a douche. I'm sure they can figure it out.
| AstroJetson wrote:
| Wasn't my intent. The article was "how to do this" with lots
| of great pictures. The linked page was full of high-level math. I
| was mocking the "This one simple trick can make your shelves
| full" attitude.
|
| So you answered and I'll ask, "what do they do, it does not
| seem simple" except for the Database people filling ....
|
| Thanks (former teacher)
| aladinn57 wrote:
| Sounds promising! What's the key innovation behind it?
| doodpants wrote:
| Off topic, but now I want a screen saver of the animation at the
| top of the article...
| ColinDabritz wrote:
| "Kuszmaul remembers being surprised that a tool normally used to
| ensure privacy could confer other benefits."
|
| It was surprising to me too! But reflecting on it more closely,
| most performance isn't about "faster" in a literal sense of "more
| instructions run per time", but about carefully choosing how to
| do less work. The security property here being "history
| independence" is also in a way stating "we don't need to, and
| literally cannot, do any work that tracks history".
|
| It's definitely an interesting approach to performance,
| essentially using cryptography as a constraint to prevent more
| work. What properties do we need, and what properties can we
| ignore? The question becomes: if we MUST ignore this property
| cryptographically, how does that affect the process and the
| related performance?
|
| It certainly feels like it may be a useful perspective, a
| rigorous approach to performance that may be a path to more
| improvements in key cases.
| zeroonetwothree wrote:
| The actual better algorithm does use history dependence though.
| So I found this part of the article misleading.
| nickburns wrote:
| So it's basically a matter of figuring out what problem context
| can and should be selectively hidden from the algorithm in
| order to make it work 'smarter' and not 'harder.' Weird.
| pradn wrote:
| That's a good insight. I had always thought the key to good
| algorithm / data structure design was to use all the
| information present in the data set. For example, if you know a
| list is sorted, you can use binary search. But perhaps choosing
| how much of it to omit is key, too. It comes up less often,
| however. I can't think of a simple example.
| koolba wrote:
| > But perhaps choosing how much of it to omit is key, too. It
| comes up less often, however. I can't think of a simple
| example.
|
| An example of that is a linked list with no particular sort
| order. By not caring about maintaining an order, insertion
| appends or prepends a node and is O(1).
|
| As soon as you have to maintain any additional invariant, the
| insertion cost goes up. Either directly or amortized.
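|
| For illustration, a minimal unordered singly linked list in
| Python, where insertion is O(1) precisely because there is no
| order invariant to maintain:
|
|     from dataclasses import dataclass
|     from typing import Optional
|
|     @dataclass
|     class Node:
|         value: object
|         next: "Optional[Node]" = None
|
|     class UnorderedList:
|         def __init__(self):
|             self.head: Optional[Node] = None
|
|         def insert(self, value) -> None:
|             """Prepend: one pointer swap, O(1)."""
|             self.head = Node(value, self.head)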
| brookst wrote:
| A real-world analogue is detective stories that work by
| adding irrelevant details, leaving it up to the reader to
| filter down to the truly important ones to solve the case.
|
| To generalize it, if figuring out what information is
| important is more expensive than assuming none of it is,
| better to simplify.
| Thorrez wrote:
| I don't think what you're saying is accurate. Your statement
| would be correct if the measure of the algorithm's cost were
| how much compute time it takes.
|
| But the measurement they're actually using is how many books
| need to be moved. They're free to use infinite compute time
| AIUI.
| brookst wrote:
| I think the point is that having less information available
| can improve performance because it eliminates processing of
| extraneous data.
|
| Said another way, stateless approaches are simpler than
| stateful. There can be an instinct to optimize by using more
| information, but in this case at least, that was a red
| herring and adding the constraint improved the results.
| delusional wrote:
| If there's no cap on computation time, processing extraneous
| data is free.
| brookst wrote:
| Also if your algorithm is perfect and will never be
| misled by extraneous info.
| tetris11 wrote:
| Clean links for mobile users
|
| Game: https://patorjk.com/games/subpixel-snake/
| Code: https://github.com/patorjk/subpixel-snake
|
| Resource for subpixel geometries:
| https://geometrian.com/resources/subpixelzoo/
| TheRealNGenius wrote:
| Wrong post
| 9dev wrote:
| This probably belongs to the post on subpixel snake, not this
| one on the library sort algorithm :)
| thaumasiotes wrote:
| > lowering the upper bound to (log n) times (log log n)^3 --
| equivalent to (log n)^(1.000...1)
|
| This is true! One of the things I find cool about big-O
| complexity using polynomial reference classes is that logarithms
| give you an infinitesimal value. Take that, "infinitesimals don't
| really exist" people!
| sfelicio wrote:
| Wait. What? Do you have a reference where I can learn about
| this?
| thaumasiotes wrote:
| I don't really understand what you're asking for. You can
| easily just verify the claim for yourself.
|
| The limit as x goes to infinity of (ln x) / (x^k) is zero for
| all positive k. So if you want to approximate the behavior of
| a logarithm with a function of the form f(x) = x^k, k must be
| greater than 0 [because x^0 = o(ln x)], but less than every
| positive real number. This is the definition of an
| infinitesimal.
|
| That's why it makes sense to say that x (log x)^3 is
| equivalent to x^{1.000000...[infinite 0s]...1}. Though in
| this analogy, it's more like x^{1.000000...3}.
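|
| For anyone who wants the one-line verification, a single round
| of L'Hopital's rule gives the limit claimed above:
|
|     \[
|       \lim_{x\to\infty} \frac{\ln x}{x^k}
|       = \lim_{x\to\infty} \frac{1/x}{k\,x^{k-1}}
|       = \lim_{x\to\infty} \frac{1}{k\,x^k} = 0
|       \qquad \text{for all } k > 0.
|     \]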
| da39a3ee wrote:
| >> Do you have a reference where I can learn about this?
|
| > I don't really understand what you're asking for.
|
| That comes across as rude and condescending; please don't
| communicate like that. Your English skills seem OK but in
| case you still need help, they're asking for a "reference"
| -- that's an online source of some sort probably, where
| they can "learn about" it -- that means that the reference
| would contain content helping them to understand the topic
| better. For example, the helpful content you gave after
| your rude and condescending initial comment would be an
| example of such content.
| User23 wrote:
| Only believing things that you can easily reason through
| yourself when they're presented by some external authority
| really is a cognitive antipattern though.
|
| I'd say that prologue was more medicinal than
| condescending. It's good to remind people in the age of
| LLMs and ubiquitous search that "you can just think
| things."
| thaumasiotes wrote:
| OK. What reference would you point them to? What makes
| you think it's what they want?
| msmitha wrote:
| I was shocked to discover how the British Library (millions of
| books, many many new ones every week) manages this. The first
| book to arrive earlier this year went on the shelf at
| 2025.0000001. The next went at 2025.0000002, right next to it.
| The electronic catalogue does the rest. No need to re-shuffle
| books around, but not a solution supportive of book-browsing.
| crazygringo wrote:
| Reminds me of how Amazon doesn't arrange its items by
| similarity like a store does, so a model of vacuum cleaner
| might be next to a set of kitchen plates. They actually
| intentionally avoid similarities so pickers won't accidentally
| grab a similar but wrong thing.
|
| I lose track of so many infrequently used objects in my home --
| which storage bin in which closet did I put the x-acto blade
| refills in? -- and one bin will overflow while another is half-
| empty because I try to keep similar items together. Sometimes I
| fantasize about tracking every possession in a spreadsheet that
| points to which bin, so I'd never lose anything and could
| always use storage space with maximum efficiency. But then I
| know I'd be lazy and skip updating the spreadsheet when I put
| something new away... plus it just seems so inhumanly weird,
| like something a robot would do rather than a person.
| ryao wrote:
| Is there any reason to think that this algorithm will actually be
| faster than what is currently done in practice?
|
| For arrays in B-tree nodes, which is the main place where I have
| encountered this problem, I doubt it will be faster than just
| using `memmove()`, and for truly large arrays, it would be easier
| to use a B-tree.
|
| If that is the case, this joins a number of algorithms that are
| asymptotically faster than algorithms in use and paradoxically
| slower than them. Examples of such algorithms include all of the
| fast matrix multiplication algorithms, which are slower than good
| implementations of the textbook O(n^3) algorithm (GEMM).
| melvyn2 wrote:
| These are sometimes called Galactic Algorithms:
| https://en.wikipedia.org/wiki/Galactic_algorithm
|
| The very first example on the page has a pretty good quote
| describing their usefulness:
|
| >>> An example of a galactic algorithm is the fastest known way
| to multiply two numbers, which is based on a 1729-dimensional
| Fourier transform. It needs O(n log n) bit operations, but as
| the constants hidden by the big O notation are large, it is
| never used in practice. However, it also shows why galactic
| algorithms may still be useful. The authors state: "we are
| hopeful that with further refinements, the algorithm might
| become practical for numbers with merely billions or trillions
| of digits."
| dchftcs wrote:
| One interesting thing about working on numbers so large is
| that it becomes harder to assume that memory access is
| constant time. It's already not that easy now (cache miss vs
| hit, and external memory), but when you operate on data sizes
| that large, memory accesses matter by at least the square or
| cube root of N. If you require largely sequential access it
| is possible to outperform an algorithm that requires
| conditional random access by some function of those factors,
| which can make some algorithms look much worse or better
| compared to the classical analysis. Of course, it still does
| not matter whether your algorithm is going to finish in a
| couple of trillion years or a trillion trillion years.
| teleforce wrote:
| Am I the only one who was trying to find the main papers
| described in the article, both the original problem and the
| near-optimal algorithm paper? [1][2]
|
| Apparently both of them are linked deep inside the article.
| Perhaps Quanta can make it mandatory to list all the
| references towards the end of the article; it would be very
| helpful for the readers.
|
| [1] Nearly Optimal List Labeling:
|
| https://arxiv.org/abs/2405.00807
|
| [2] A sparse table implementation of priority queues:
|
| https://link.springer.com/chapter/10.1007/3-540-10843-2_34
| svat wrote:
| Both papers are linked very clearly in the article, and it was
| easy to quickly find them just skimming through the article
| without even reading:
|
| > _This problem was introduced in a 1981 paper_
|
| where "1981 paper" links to
| https://link.springer.com/chapter/10.1007/3-540-10843-2_34
|
| and in the next paragraph:
|
| > _Last year, in a study that was presented at the Foundations
| of Computer Science conference in Chicago, a team of seven
| researchers_
|
| where "a study" links to https://arxiv.org/abs/2405.00807
|
| Note that both of these are in the introductory section (third
| and fourth paragraphs), before the article goes into details,
| history and context. If this is what counts as "Apparently both
| of them are linked deep inside the article", then it appears
| there exist very different ideas of what "deep inside" means.
| xpe wrote:
| I'm trying to figure out a key constraint. Does the problem
| statement assume a fixed-length pre-allocated array?
| chongli wrote:
| No, it doesn't assume an array at all. It's a data structure
| that maintains a totally ordered set with 3 operations:
|
| insert(X), delete(X), label(X)
|
| Where label extracts the label of element X (which had
| previously been inserted but not deleted). The label is a
| number from 0 to n-1, where n is the number of elements
| currently stored.
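|
| A naive Python reference for that interface (a sketch matching
| this description, not the paper's data structure; the whole
| point of the paper is to move far fewer elements than this
| does):
|
|     import bisect
|
|     class NaiveOrderedSet:
|         def __init__(self):
|             self._items = []    # kept sorted at all times
|
|         def insert(self, x) -> None:
|             bisect.insort(self._items, x)
|
|         def delete(self, x) -> None:
|             self._items.remove(x)
|
|         def label(self, x) -> int:
|             """Rank of x: a number from 0 to n-1."""
|             return self._items.index(x)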
___________________________________________________________________
(page generated 2025-01-25 23:02 UTC)