[HN Gopher] New Book-Sorting Algorithm Almost Reaches Perfection
___________________________________________________________________
New Book-Sorting Algorithm Almost Reaches Perfection
Author : isaacfrond
Score : 148 points
Date : 2025-01-24 15:50 UTC (7 hours ago)
(HTM) web link (www.quantamagazine.org)
(TXT) w3m dump (www.quantamagazine.org)
| baruchel wrote:
| I recall presenting a problem to my students a few years ago
| based on the 'Library Sort' algorithm. I still clearly remember
| the title of the original paper: 'Insertion Sort is O(n log n)'.
| deathanatos wrote:
| Presumably this:
| https://www3.cs.stonybrook.edu/~bender/newpub/BenderFaMo06-l...
|
| Kind of a click-baity title.
| foota wrote:
| It's sort of click bait, but also true, right? This is just
| insertion sort using non-fixed indexing?
| zeroonetwothree wrote:
| This is a different problem though, despite the similar name.
| acbart wrote:
| I was actually just looking at this problem last week, when I
| realized I needed to keep items in a database table positioned
| arbitrarily, ideally without needing to manipulate the rest of
| the list. So if a user adds in a new element after element 5,
| that element becomes 6, without needing to update the original
| item that came after element 5. There are indeed very
| sophisticated algorithms to manage this problem and minimize
| theoretical bounds. But it also seemed like the simplest solution
| for this particular version of the problem is to just use
| fractional amounts, and occasionally pay a penalty to relayout
| the list.
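|
| A minimal sketch of that fractional approach in Python (made-up
| names, not my actual schema):
|
|     # Each item stores a float position; the list is kept
|     # sorted by it (in SQL terms, ORDER BY pos).
|     def insert_after(items, i, value):
|         """items: list of (pos, value) sorted by pos."""
|         before = items[i][0]
|         after = items[i + 1][0] if i + 1 < len(items) else before + 1.0
|         pos = (before + after) / 2.0
|         if pos == before or pos == after:
|             # Floats run out of bits after ~50 splits of the
|             # same gap: pay the occasional penalty and relayout.
|             items[:] = [(float(j), v)
|                         for j, (_, v) in enumerate(items)]
|             return insert_after(items, i, value)
|         items.insert(i + 1, (pos, value))
|         return pos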
| tetris11 wrote:
| Isn't that hash table chaining?
| jon_richards wrote:
| More like tree rebalancing.
| jes5199 wrote:
| like line numbers in an old BASIC program
| williamdclt wrote:
| Ha, I had this exact problem asked as an interview question.
|
| IIRC their real-life solution was to leave gaps between
| elements (eg index 0, 100, 200... instead of 0, 1, 2) and re-
| index when needed. Probably works well enough. What I came up
| with is (as you say) the idea of fractional indexing, but
| dealing with decimals is a pain, so you can instead represent
| each index as a string of digits and sort the strings
| lexicographically.
|
| So an element inserted between elements 1 and 2 gets an index
| 11 (anything between 11 and 19 is fine). Between 1 and 11 is
| 101. Between 11 and 2 is 12 (or anything between 12 and 19).
| Note that these indexes are not numbers, they're strings that
| are compared lexicographically.
|
| I'm sure there are downsides, eg it takes a lot more memory to
| store and sort these indexes (strings are much larger than
| numbers). It feels a bit too clever not to have unexpected
| downsides.
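|
| The midpoint step would look roughly like this in Python (a
| sketch over digits 0-9, not a vetted library):
|
|     def key_between(a, b):
|         # Digit string strictly between a and b lexicographically.
|         # a == "" means no lower bound; b == None means no upper
|         # bound. Keys never end in '0', so there is always room
|         # below any existing key.
|         if b is not None:
|             # Recurse past the longest common prefix.
|             n = 0
|             while n < len(b) and (a[n] if n < len(a) else '0') == b[n]:
|                 n += 1
|             if n > 0:
|                 return b[:n] + key_between(a[n:], b[n:])
|         da = int(a[0]) if a else 0
|         db = int(b[0]) if b is not None else 10
|         if db - da > 1:
|             return str((da + db) // 2)  # a whole digit fits between
|         if b is not None and len(b) > 1:
|             return b[0]             # eg between "1" and "25" -> "2"
|         # Consecutive digits: keep a's digit, drop the upper bound.
|         return str(da) + key_between(a[1:], None)
|
|     # key_between("1", "2")  -> "15"
|     # key_between("1", "11") -> "105"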
| Etheryte wrote:
| Very similar to my first intuition on how to approach this
| problem. An interesting question that comes to mind is how
| should you pick index sizes and how often should you
| rebalance the whole thing. Make the index too large and
| you're wasting a lot of space you'll never need, too small
| and you're forced to rebalance too often. I'm guessing an
| ideal index size is such that you can rebalance at most once
| a night with a cron job and then avoid rebalances the whole
| rest of the day.
|
| To be fair, this sounds like one of those classic problems
| that someone for sure already figured out in the 50s or 60s,
| just under a different name and/or context. Hash chaining is
| a similar idea, but not quite the same.
| zeroonetwothree wrote:
| That is how I implemented it at work around 9 years ago. Use
| strings for ordering and if you use the full byte range they
| end up fairly compact (rather than just 1-9 as in your
| example). There are some edge cases that can cause it to
| balloon in size so there is a separate reordering step but it
| almost never needs to be invoked.
| WalterBright wrote:
| > solution was to leave gaps between elements (eg index 0,
| 100, 200... instead of 0, 1, 2)
|
| Reminds me of my days writing BASIC programs.
| thaumasiotes wrote:
| > IIRC their real-life solution was to leave gaps between
| elements (eg index 0, 100, 200... instead of 0, 1, 2) and re-
| index when needed. Probably works well enough. What I came up
| with is (as you say) the idea of fractional indexing
|
| Those are the same thing. Leaving gaps is fractional
| indexing. It's just fixed-point rather than floating point.
|
| > an element inserted between elements 1 and 2 gets an index
| 11 (anything between 11 and 19 is fine). Between 1 and 11 is
| 101. Between 11 and 2 is 12 (or anything between 12 and 19).
| Note that these indexes are not numbers, they're strings that
| are compared lexicographically.
|
| This reminds me of the most interesting method of generating
| random integers in an arbitrary range from random bits:
| interpret the bitstream as a string representing a real
| number (in binary) between 0 and 1. If, for example, you want
| to use bits to generate a number between 0 and 5, you divide
| the unit interval into sixths, and examine bits until you've
| determined conclusively that your number lies within one of
| those intervals:
|
|     +-- 0/6 = 0.000000000... --+
|     |   0.000   -> 0           |
|     |   0.00100 -> 0           |
|     +-- 1/6 = 0.001010101... --+
|     |   0.0011  -> 1           |
|     |   0.0100  -> 1           |
|     +-- 2/6 = 0.010101010... --+
|     |   0.011   -> 2           |
|     +-- 3/6 = 0.100000000... --+
|     |   0.100   -> 3           |
|     +-- 4/6 = 0.101010101... --+
|     |   0.1011  -> 4           |
|     |   0.1100  -> 4           |
|     +-- 5/6 = 0.110101010... --+
|     |   0.11011 -> 5           |
|     |   0.111   -> 5           |
|     +--------------------------+
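|
| In code the stopping rule is just interval arithmetic (a
| sketch; next_bit can be any source of fair random bits):
|
|     import random
|
|     def rand_below(n, next_bit=lambda: random.getrandbits(1)):
|         """Uniform integer in [0, n) from a stream of bits."""
|         # After b bits with value m, the encoded real lies in
|         # [m/2^b, (m+1)/2^b). Stop once that whole interval
|         # sits inside a single bucket [k/n, (k+1)/n).
|         m, pow2 = 0, 1
|         while True:
|             m, pow2 = 2 * m + next_bit(), 2 * pow2
|             k = m * n // pow2
|             if (m + 1) * n <= (k + 1) * pow2:
|                 return k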
| macNchz wrote:
| The "real life" solution of leaving gaps & reindexing is
| actually my earliest programming memory (as a
| teenager)/lesson in cleverness, of feeling like I _should_
| have been able to come up with a more clever /optimal/
| _~~scalable~~_ solution but settling for this awkward 100 200
| 300 thing. Nowadays I 've used that approach like...dozens of
| times over the decades of real world projects since, and I'm
| still maintaining that web app I made in 2003, so I'm very
| glad I never came up with something unduly clever.
| zeroonetwothree wrote:
| Wikipedia has a section on this algorithm under exponential
| labels: https://en.m.wikipedia.org/wiki/List-labeling_problem
|
| Basically it works as long as your label space is large
| compared to the number of items. The more sophisticated methods
| are necessary when that isn't the case. For example, say you
| have 4 bytes for the label and 1 billion items.
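|
| The naive large-label-space version is short (a sketch; the
| stride and the relabel-everything policy are exactly the crude
| choices the sophisticated algorithms improve on):
|
|     STRIDE = 100  # gap between labels after a relabel
|
|     def insert_after(labels, i):
|         """labels: sorted ints in use; allocate one after i."""
|         lo = labels[i]
|         hi = labels[i + 1] if i + 1 < len(labels) else lo + 2 * STRIDE
|         if hi - lo < 2:
|             # Gap exhausted: relabel everything and retry.
|             # Cheap on average only while the label space is
|             # large compared to the number of items.
|             labels[:] = [(j + 1) * STRIDE
|                          for j in range(len(labels))]
|             return insert_after(labels, i)
|         labels.insert(i + 1, (lo + hi) // 2)
|         return labels[i + 1]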
| crdrost wrote:
| This sort of problem also occurs when you're trying to do
| CRDTs, which can also roughly be described as "design
| something that does Git better."
|
| So e.g. to frame this, one approach to a CRDT is to just
| treat the document as a list of facts, "line 1 is 'foo', line
| 2 is 'bar'", and each fact has a number of times it has been
| asserted, and to "merge" you just add together the assertion
| counts, and then you can detect conflicts when a fact has
| been asserted more than once or fewer than zero times. So a
| patch says "change line 2 to 'baz'", this becomes "unassert
| that line 2 is 'bar', assert that line 2 is 'baz'" and it
| conflicts with a patch that says "change line 2 to 'quux'"
| because the fact "line 2 is 'bar'" has an assertion count of
| -1.
|
| But anyway, in this context you might want to allow inserting
| lines, and then you have the list-labeling problem: you don't
| want the patch to unassert lines 4, 5, 6 just to insert a new
| line after line 3. So then an obvious thing is to just use a
| broader conception of numbers, say "line 3.5 is <X>" when you
| insert, and then we hide the line numbers from the user
| anyways, they don't need to know that internally the line
| numbers of the 7 lines go "1, 2, 3, 3.5, 4, 5, 6".
|
| So then you need a relabeling step because you eventually
| have some line at 3.198246315 and you want to be able to say
| "yeah, that's actually line 27, let's have some sanity again
| in this thing."
|
| This also maybe hints at the fun of adding randomization:
| consider that one person might add line 3.5, then add line
| 3.75, and then remove line 3.5; meanwhile someone else might
| add a different line 3.5, add a line 3.25, and then remove
| their line 3.5, and these patches would both amount to
| "assert line 3.25 is A, assert line 3.75 is B", and would
| merge without conflict. This means that _in general_ if two
| people are messing with the same part of the same document
| asynchronously, this model is _not_ able to consistently flag
| a merge failure in that case, but will sometimes instead just
| randomly order the lines that were added.
|
| We can then just make that a feature rather than a fault: you
| don't insert at 3.5, which is 3 + (4 - 3) / 2; rather you
| insert at 3 + (4 - 3) * rand(). And then when two people
| both try to insert 12 lines between line 3 and 4
| independently, when you merge them together, you get 24 lines
| from both, in their original orders but interleaved randomly,
| and like that's not the end of the world, it might be
| legitimately better than just declaring a merge failure and
| harrumphing at the user.
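|
| A toy sketch of just that randomized-position idea, leaving
| out all the assertion-count machinery above:
|
|     import random
|
|     def pos_between(before, after):
|         # A random point in the gap rather than the midpoint,
|         # so replicas inserting between the same neighbors
|         # interleave instead of colliding deterministically.
|         return before + (after - before) * random.random()
|
|     def merge(doc_a, doc_b):
|         """Docs are {position: line}; merge is a plain union."""
|         merged = {**doc_a, **doc_b}
|         return [merged[p] for p in sorted(merged)]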
| AstroJetson wrote:
| Cool. This is Lisa and Laura; they are my two retired teachers
| who are now helping sort shelves in the library.
|
| Can you explain to them what you want them to do now?
|
| I read the paper and, I have to say, it's confusing at best.
| But I'm hoping you can help out.
| gessha wrote:
| A lot of things are confusing at first but then we discover
| intuitive ways to explain and teach them to others.
|
| Also, don't discredit the intelligence of retirees, teachers
| and librarians in the same post. It's bad for karma and makes
| you sound like a douche. I'm sure they can figure it out.
| AstroJetson wrote:
| Wasn't my intent. The article was "how to do this" with lots
| of great pictures; the linked page was full of high-level
| math. I was mocking the "this one simple trick can fill your
| shelves" attitude.
|
| So since you answered, I'll ask: what do they do? It does not
| seem simple, except for the database people filling ....
|
| Thanks (former teacher)
| aladinn57 wrote:
| Sounds promising! What's the key innovation behind it?
| doodpants wrote:
| Off topic, but now I want a screen saver of the animation at the
| top of the article...
| ColinDabritz wrote:
| "Kuszmaul remembers being surprised that a tool normally used to
| ensure privacy could confer other benefits."
|
| It was surprising to me too! But reflecting on it more closely,
| most performance isn't about "faster" in a literal sense of "more
| instructions run per time", but about carefully choosing how to
| do less work. The security property here being "history
| independence" is also in a way stating "we don't need to, and
| literally cannot, do any work that tracks history".
|
| It's definitely an interesting approach to performance,
| essentially using cryptography as a constraint to prevent
| extra work. What properties do we need, and what properties
| can we ignore? The question becomes: if we MUST ignore this
| property cryptographically, how does that affect the process
| and the related performance?
|
| It certainly feels like it may be a useful perspective, a
| rigorous approach to performance that may be a path to more
| improvements in key cases.
| zeroonetwothree wrote:
| The actual better algorithm does use history dependence though.
| So I found this part of the article misleading.
| nickburns wrote:
| So it's basically a matter of figuring out what problem context
| can and should be selectively hidden from the algorithm in
| order to make it work 'smarter' and not 'harder.' Weird.
| pradn wrote:
| That's a good insight. I had always thought the key to good
| algorithm / data structure design was to use all the
| information present in the data set. For example, if you know a
| list is sorted, you can use binary search. But perhaps choosing
| how much of it to omit is key, too. It comes up less often,
| however. I can't think of a simple example.
| koolba wrote:
| > But perhaps choosing how much of it to omit is key, too. It
| comes up less often, however. I can't think of a simple
| example.
|
| An example of that is a linked list with no particular sort
| order. By not caring about maintaining an order, insertion
| just appends or prepends a node and is O(1).
|
| As soon as you have to maintain any additional invariant, the
| insertion cost goes up, either directly or amortized.
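|
| Concretely (a quick sketch):
|
|     class Node:
|         def __init__(self, value, nxt=None):
|             self.value, self.nxt = value, nxt
|
|     def prepend(head, value):
|         # No invariant to maintain: O(1) at any length.
|         return Node(value, head)
|
|     def insert_sorted(head, value):
|         # Keeping the sorted invariant forces an O(n) walk.
|         if head is None or value <= head.value:
|             return Node(value, head)
|         node = head
|         while node.nxt is not None and node.nxt.value < value:
|             node = node.nxt
|         node.nxt = Node(value, node.nxt)
|         return head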
| tetris11 wrote:
| Clean links for mobile users
|
| Game: https://patorjk.com/games/subpixel-snake/ Code:
| https://github.com/patorjk/subpixel-snake
|
| Resource for subpixel geometries:
| https://geometrian.com/resources/subpixelzoo/
| TheRealNGenius wrote:
| Wrong post
| 9dev wrote:
| This probably belongs to the post on subpixel snake, not this
| one on the library sort algorithm :)
| thaumasiotes wrote:
| > lowering the upper bound to (log n) times (log log n)^3 --
| equivalent to (log n)^(1.000...1)
|
| This is true! One of the things I find cool about big-O
| complexity measured against polynomial reference classes is
| that logarithms give you an infinitesimal exponent. Take that,
| "infinitesimals don't really exist" people!
| sfelicio wrote:
| Wait. What? Do you have a reference where I can learn about
| this?
| msmitha wrote:
| I was shocked to discover how the British Library (millions of
| books, many many new ones every week) manages this. The first
| book to arrive earlier this year went on the shelf at
| 2025.0000001. The next went at 2025.0000002, right next to it.
| The electronic catalogue does the rest. No need to re-shuffle
| books around, but not a solution supportive of book-browsing.
| crazygringo wrote:
| Reminds me of how Amazon doesn't arrange its items by
| similarity like a store does, so a model of vacuum cleaner
| might be next to a set of kitchen plates. They actually
| intentionally avoid similarities so pickers won't accidentally
| grab a similar but wrong thing.
|
| I lose track of so many infrequently used objects in my home --
| which storage bin in which closet did I put the x-acto blade
| refills in? -- and one bin will overflow while another is half-
| empty because I try to keep similar items together. Sometimes I
| fantasize about tracking every possession in a spreadsheet that
| points to which bin, so I'd never lose anything and could
| always use storage space with maximum efficiency. But then I
| know I'd be lazy and skip updating the spreadsheet when I put
| something new away... plus it just seems so inhumanly weird,
| like something a robot would do rather than a person.
| ryao wrote:
| Is there any reason to think that this algorithm will actually be
| faster than what is currently done in practice?
|
| For arrays in B-tree nodes, which is the main place where I have
| encountered this problem, I doubt it will be faster than just
| using `memmove()`, and for truly large arrays, it would be easier
| to use a B-tree.
|
| If that is the case, this joins a number of algorithms that are
| asymptotically faster than algorithms in use and paradoxically
| slower than them. Examples of such algorithms include all of the
| fast matrix multiplication algorithms, which are slower than good
| implementations of the textbook O(n^3) algorithm (GEMM).
| melvyn2 wrote:
| These are sometimes called Galactic Algorithms:
| https://en.wikipedia.org/wiki/Galactic_algorithm
|
| The very first example on the page has a pretty good quote
| describing their usefulness:
|
| >>> An example of a galactic algorithm is the fastest known way
| to multiply two numbers, which is based on a 1729-dimensional
| Fourier transform. It needs O(n log n) bit operations, but as
| the constants hidden by the big O notation are large, it is
| never used in practice. However, it also shows why galactic
| algorithms may still be useful. The authors state: "we are
| hopeful that with further refinements, the algorithm might
| become practical for numbers with merely billions or trillions
| of digits."
| dchftcs wrote:
| One interesting thing about working on numbers so large is
| that it becomes harder to assume that memory access is
| constant time. It's already not quite true now (cache miss vs
| hit, and external memory), but when you operate on data sizes
| that large, memory access costs grow by at least the square
| root or cube root of N. If you require largely sequential
| access, it is possible to outperform an algorithm that
| requires data-dependent random access by some function of
| those factors, which can make some algorithms look much worse
| or better than in the classical analysis. Of course, it still
| does not matter whether your algorithm is going to finish in a
| couple of trillion years or a trillion trillion years.
___________________________________________________________________
(page generated 2025-01-24 23:00 UTC)