[HN Gopher] Beating TimSort at Merging
___________________________________________________________________
Beating TimSort at Merging
Author : andrewstetsenko
Score : 187 points
Date : 2021-07-13 16:46 UTC (6 hours ago)
(HTM) web link (earthly.dev)
(TXT) w3m dump (earthly.dev)
| danbruc wrote:
| The first Python implementation is bad - removing the first
| element in each iteration is O(n), the C implementation gets this
| right by maintaining one index into each list instead of
| modifying the lists.
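The two approaches contrasted above can be sketched side by side. This is an illustrative comparison (the function names are made up, and this is not the article's benchmarked code): popping from the front shifts every remaining element, while keeping an index into each list makes every step O(1).

```python
# Front-popping merge: each pop(0) is O(n), so the whole merge is O(n^2).
def merge_pop(a, b):
    a, b = list(a), list(b)   # copy so the inputs aren't mutated
    out = []
    while a and b:
        out.append(a.pop(0) if a[0] <= b[0] else b.pop(0))
    return out + a + b

# Index-based merge: one index per list, O(1) per step, O(n) total.
# This is the shape the C implementation uses.
def merge_index(a, b):
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]
```

Both produce the same output; only the cost per step differs.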
| XelNika wrote:
| Yes, it says so in footnote 1 in the article. It's also not the
| benchmarked code so it doesn't really matter.
| danbruc wrote:
| I missed the footnote. My point was more that one could have
| compared this with the C implementation as heapq.merge is a
| different algorithm and so there is no actual comparison of
| the same algorithm in C and Python.
| min-cut wrote:
| Not only that, but also list(a + b) is producing two different
| new lists since (a + b) produces a list and list() constructs a
| new list copy. The benchmark would be faster if the OP just did
|
| def sort_test(): m2 = a + b; m2.sort()
|
| instead of
|
| def sort_test(): m2 = list(a + b); m2.sort()
|
| EDIT: It seems like OP fixed this issue in perf.py, but left it
| in test.py
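A small sketch of the point above: `a + b` already builds a fresh list, so wrapping it in `list()` just copies that list a second time before sorting.

```python
a = [1, 3, 5]
b = [2, 4, 6]

m = a + b            # one new list
m2 = list(a + b)     # a + b already built a list; list() copies it again
assert m == m2 and m is not m2

m.sort()             # Timsort sorts in place, no further copies
assert m == [1, 2, 3, 4, 5, 6]
```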
| agbell wrote:
| You can check it, but I'm pretty sure the difference is
| negligible. But yeah, all benchmarks are using the code in
| perf and the pop code is just to demonstrate. It is not
| benched.
|
| Edit: dropping the extra list() from the blog code examples.
| min-cut wrote:
| I see about a 10% performance improvement on my local
| machine on the input in test.py when not constructing the
| unnecessary second list.
|
| I don't really buy that it reads nicer in prose since you
| can just do (a + b).sort() if you want. Plus, I feel like
| it's important for readability to not be unnecessarily
             | redundant. Having list(a + b) in the code also risks creating
| misconceptions about how lists can be constructed and used
| in Python.
| agbell wrote:
| Yeah, good point. I think you are right about it being
| incorrect. Updating...
| tim-peters wrote:
| To get more gonzo, in CPython's list.sort(), the C code that
| actually merges two lists (which happen, in context, to be two
  | contiguous array slices) is in listobject.c's `merge_at()`
| function. That brings in the world of "galloping" (exponential
| search) optimizations, much faster than one-pair-at-a-time
| compares in many real-world cases.
|
| So that's a whole lot of additional complication, but it's the
| heart of what _really_ makes timsort shine in its "jaw dropping"
| cases.
|
| Tim (of "timsort" - which was an inside joke name at the start,
| because I never expected anything outside of CPython would use it
| ;-) ).
| j-pb wrote:
| Is there a reason to use galloping over an array vs a seek in a
| search tree?
|
| Better real world factors?
|
| Galloping, Demaine set intersection, worst case optimal joins,
| they all seem to be different aspects of the same underlying
| principle.
|
| So I'm very curious about the peculiarities in the sorting
| case.
| borski wrote:
| Gosh, comments like these from the actual author are a reminder
| of why I love HN. :)
| agbell wrote:
| Wow. Thanks for writing it. I believe this is the part you are
| referring to:
| https://github.com/python/cpython/blob/main/Objects/listobje...
| tim-peters wrote:
| Right, `merge_at()`. But that in turn calls small mountains
| of other C code. It is, alas, a complicated approach. Folding
| it in would be substantially more work than you've already
| done.
|
| The point: for inputs like your:
|
  |             a = list(range(-100, 1700))
  |             b = list(range(1400, 1800))
|
| it would first use a variant of binary search to find where
| b[0] belongs in a. "Almost all of a" is consumed by that via
| a relatively tiny number of compares. It would similarly do a
| binary-ish search to find where a[-1] belongs in b. The tail
| end of b, from 1700 through 1799, then also never needs to be
| looked at again.
|
| In the lucky case that two input lists don't overlap at all,
| and a[-1] <= b[0], that's enough so that a "one-pair-at-a-
| time" compare _never_ needs to be done. In context (the two
| lists are contiguous slices of an array), no pointers at all
| need to be copied in that case either - it deduces that the
| combined array slices are already in sorted order in total
| time O(log(len(a)) + log(len(b))), and it's done.
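The galloping idea described above can be sketched in Python with the bisect module. This is a simplified illustration, not CPython's actual merge code: instead of comparing one pair at a time, it binary-searches for where the other list's next element belongs and copies the whole run in one go.

```python
from bisect import bisect_left, bisect_right

def galloping_merge(a, b):
    """Merge two sorted lists, using binary search to skip over runs.

    When many consecutive elements of one list precede the next element
    of the other (as in the overlapping-ranges example above), each run
    costs O(log n) compares instead of O(n)."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            # Find the end of a's run that precedes b[j]; copy it at once.
            k = bisect_right(a, b[j], i)
            out.extend(a[i:k])
            i = k
        else:
            k = bisect_left(b, a[i], j)
            out.extend(b[j:k])
            j = k
    out.extend(a[i:])
    out.extend(b[j:])
    return out
```

On the article's example inputs, "almost all of a" is consumed by a handful of binary searches rather than ~1800 individual compares.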
| agbell wrote:
| Very Cool!
|
| Edit: Ok, I get it now. Don't just pop, binary search into
| it, so you can fast forward through. I'm not sure I'll get
| around to implementing that, at least not in the immediate
| future, but that makes total sense.
| jhokanson wrote:
| As someone that rarely works with Python lists, as opposed to
| numpy arrays, I was pleasantly surprised to see numpy does what I
| would expect in providing a mergesort option. I'm surprised
| Python doesn't, other than via heapq and only implemented in
| Python (according to my reading of the post and a very quick
| Google search).
|
| Oops, just for fun the numpy documentation currently states: "The
| datatype determines which of 'mergesort' or 'timsort' is actually
| used, even if 'mergesort' is specified. User selection at a finer
| scale is not currently available." Awesome ...
|
| Also, apparently mergesort may also be done using radix sort for
| integers "'mergesort' and 'stable' are mapped to radix sort for
| integer data types."
| kzrdude wrote:
| mergesort mandates that it's a stable sort, so I guess they get
| away with replacing "like for like" that way. For varying
| definitions of like.
| masklinn wrote:
| Why would Python provide a mergesort option when timsort was
| the replacement for a standard mergesort?
|
| `heapq.merge` is not a mergesort, it's a merge of sorted
| sequences (aka just the merge part of a mergesort).
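For reference, a minimal example of what heapq.merge actually does: it lazily merges already-sorted iterables (not just lists) and yields a generator, which is why it suits huge inputs that don't fit in RAM.

```python
import heapq

# Any sorted iterables work, including generators that are never
# materialized as lists.
evens = iter(range(0, 10, 2))
odds = iter(range(1, 10, 2))

merged = heapq.merge(evens, odds)  # lazy: returns a generator
assert list(merged) == list(range(10))
```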
| TheRealPomax wrote:
| Turns out that at least in terms of performance, everyone using
| numpy is fine with this. It just needs to run fast enough, not
| as fast as possible ;)
| tehjoker wrote:
| This is not true. I have written many libraries to improve on
| the performance of numpy for image processing applications.
| Sort backs many operations, such as unique, that result in
| painfully slow code.
| CogitoCogito wrote:
| I too have written a lot of extension code to speed up
| numpy code. It's often not even especially difficult since
| any code at that level tends to be very algorithmic and
| looks essentially the same if written in numpy or C++.
| Having control of the memory layout and algorithms can be a
| huge win.
|
  | Of course I don't _usually_ do this, but that's just because
  | most of the code I write doesn't need it. But it's not at
| all some crazy sort of thing to do.
| toxik wrote:
| Tangential but np.bincount is typically the fast version
| of np.unique. Not entirely the same thing, but it's worth
| knowing about it.
| dang wrote:
| Past TimSort threads, for anyone interested:
|
| _Timsort, the Python sorting algorithm_ -
| https://news.ycombinator.com/item?id=21196555 - Oct 2019 (131
| comments)
|
| _On the Worst-Case Complexity of TimSort_ -
| https://news.ycombinator.com/item?id=17883461 - Aug 2018 (74
| comments)
|
| _Timsort is a sorting algorithm that is efficient for real-world
| data_ - https://news.ycombinator.com/item?id=17436591 - July 2018
| (77 comments)
|
| _Functional verification with mechanical proofs of TimSort
| [pdf]_ - https://news.ycombinator.com/item?id=9778243 - June 2015
| (1 comment)
|
| _Timsort_ - https://news.ycombinator.com/item?id=3214527 - Nov
| 2011 (27 comments)
|
| _Java has switched from Mergesort to TimSort_ -
| https://news.ycombinator.com/item?id=752677 - Aug 2009 (13
| comments)
| rkalla wrote:
| Excellent read.
| thomas536 wrote:
  | Are there similar relatively simple methods or SIMD tricks for
| multi-way merges?
| dragontamer wrote:
| I dunno about "multiway merge". But your standard 2-way merge
| is SIMD-optimized by having a binary-search across the
| "diagonals" of the so-called mergepath.
|
| Leading to O(lg(p * sqrt(2))) time for the longest comparison
| diagonal, where "p" is the number of processors. Including
| thread divergence, that's O(N/p * lg(p * sqrt(2))) total work
| done, or assuming constant-p, O(N) of work per 2-way SIMD merge
| (with a large "p" providing an arbitrarily large speedup: but
| in practice is probably limited to 1024 for modern GPU
| architectures. Since modern GPUs only really "gang up" into
| 1024-CUDA thread blocks / workgroups efficiently)
|
| https://web.cs.ucdavis.edu/~amenta/f15/GPUmp.pdf
|
  | This provides a natural way of allowing ~1024 CUDA threads in a
  | singular block to sort a list in O(n * log(n)) time efficiently.
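The merge-path partition step mentioned above can be sketched in Python. This is a simplified, sequential illustration of the diagonal binary search from the linked paper (the function name is hypothetical): each "diagonal" d is split so that the first i elements of a and the first d - i elements of b together form the d smallest elements, letting each processor merge an independent chunk.

```python
def diagonal_split(a, b, d):
    """Find (i, j) with i + j == d such that a[:i] and b[:j] together
    hold the d smallest elements of the merge (merge-path partition).
    Runs in O(log(min(d, len(a)))) compares."""
    lo = max(0, d - len(b))       # can't take fewer than d - len(b) from a
    hi = min(d, len(a))           # can't take more than len(a) from a
    while lo < hi:
        i = (lo + hi) // 2
        j = d - i
        if a[i] < b[j - 1]:       # a[i] still belongs in the first d: take more of a
            lo = i + 1
        else:                     # taken too much of a
            hi = i
    return lo, d - lo
```

In the parallel setting, processor k computes the split for its diagonal and merges only its slice; here it simply shows that each prefix pair covers the d smallest elements.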
| agbell wrote:
  | TimSort is cool. In its worst case it looks no better than merge
  | sort, but give it some partially ordered data, or small amounts of
  | data, where it uses insertion sort, and it really shines.
| lyxell wrote:
  | Interesting read. This post seems to be written by Adam Gordon
  | Bell, who is also the host of the excellent CoRecursive podcast,
  | recently featured on HN with The Untold Story of SQLite [0].
|
| [0]: https://news.ycombinator.com/item?id=27718701
| agbell wrote:
| That's me. Thanks for reading it and listening to the podcast.
| The SQLite episode is now nearly the most listened to, thanks
| to hacker news.
|
  | On merging lists: I was caught off guard by the recommendation
  | to use sort to merge sorted lists. Everyone I mentioned it to was
  | surprised as well, so I thought it was worthy of some
  | investigation. This is my first C extension in Python and I got
  | help, so someone else might be able to do even better than this.
|
  | It is interesting when our normal shorthands for thinking
  | about runtime complexity break down.
| CogitoCogito wrote:
  | > It is interesting when our normal shorthands for thinking
  | about runtime complexity break down.
|
| Actually I think more interesting would be to write a python
| version following the algorithm of your C extension (i.e.
| only allow two lists, don't allow generators, don't return
| generators, etc). You could probably do that quite quickly
| and generate the same graphs. I would expect that python
| version to lie somewhere between your C extension and the
| heapq.merge() version which is more general.
|
| edit: If you were to do this, I wouldn't recommend using the
| python version in your blog since it pops the lists from the
| front. This seems to be O(n) in general:
|
| https://wiki.python.org/moin/TimeComplexity
|
| I did a 10 minute glance at the python source code and
| couldn't totally verify this (it needs more than 10
| minutes...), but it makes sense for popping from anywhere but
| the back to cause a list copy to occur. Maybe this isn't
| technically necessary when popping from the front, but I
| couldn't verify it. Either way it's easy to avoid by just not
| mutating the input lists anyway so I don't see a good reason
| not to go that way.
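One way to keep the readable pop-style shape while avoiding the O(n) front pops and the input mutation discussed above is collections.deque, whose popleft() is O(1). A sketch, not the article's code:

```python
from collections import deque

def merge_with_deques(l1, l2):
    """Pop-style merge without the quadratic cost: deque.popleft()
    is O(1), and copying into deques leaves the input lists intact."""
    d1, d2 = deque(l1), deque(l2)
    out = []
    while d1 and d2:
        out.append(d1.popleft() if d1[0] <= d2[0] else d2.popleft())
    out.extend(d1)
    out.extend(d2)
    return out
```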
| agbell wrote:
  | So we could determine which gains I am getting are due to C
  | and the specific compares, and which are due to just using two
  | lists? That would be an interesting test.
|
| Maybe I'll do a follow-up. Thanks for reading the article.
| I suspect you know way more about C extensions in Python
| than I do. I saw you have a talk on this topic.
|
| I think you are totally right that pop is not the way to
| go, but using an offset to track the head, like the C
| example does. I actually put that in a footnote, because I
| was considering testing my python version but the SO answer
| mentioned pop being expensive.
|
  | I think the simple version is still great as pseudo-code
  | for communicating a solution though, and hopefully it makes
  | the C code a bit easier to follow once you've seen the
  | Python version.
| CogitoCogito wrote:
  | Well, the version I'm suggesting is essentially the same as
  | the pop version you have, so I don't think it's too
  | complicated. You would just do something like this
  | instead:
  |
  |     def merge_sorted_lists(l1, l2):
  |         sorted_list = []
  |         i = 0
  |         j = 0
  |         while i < len(l1) or j < len(l2):
  |             if i == len(l1):
  |                 sorted_list.append(l2[j])
  |                 j += 1
  |                 continue
  |             if j == len(l2):
  |                 sorted_list.append(l1[i])
  |                 i += 1
  |                 continue
  |             if l1[i] < l2[j]:
  |                 sorted_list.append(l1[i])
  |                 i += 1
  |             else:
  |                 sorted_list.append(l2[j])
  |                 j += 1
  |         return sorted_list
  |
  | I haven't tested that (you probably should before using
  | it), but it looks mostly right and is essentially the
  | same algorithm as yours.
| forrestthewoods wrote:
| Great post. I almost did a spit take when I read:
|
| > Timsort overcomes this disadvantage by being written in C
| rather than Python.
|
| Yeah, Python is SLOW. Thankfully the author dug into C. That was
| nice to see.
|
| Edit: lol I have no idea why this is being downvoted. Y'all
| weird.
| ww520 wrote:
| That's what I suspected when reading the first part of the post
  | where it states that heapq.merge() is slower than sort(). Hmm,
  | isn't it comparing apples to oranges, since heapq.merge() is
  | implemented in Python and the system sort() would be in C?
| CGamesPlay wrote:
| I definitely used this to my advantage in ACM ICPC contests. Most
| of the students and professors wrote the problems and benchmarked
| against Java. However, on several problems, it boiled down to
| "write this specific algorithm for this class of problem, or use
| C." I once netted my team an extra problem after we timed out in
| Java and I just rewrote the teammate's solution in C++.
| peter422 wrote:
| Outside of some very specific parsing problems where Java
| standard library was very helpful, writing solutions in C was
| the way to go.
|
| Though while it could allow certain parts of the code to be
| less efficient, if you had completely the wrong algorithm it
| wasn't going to save you.
| CogitoCogito wrote:
| It seems like the built in heapq.merge() should be as fast as the
| implementation here. I mean if it's really just merging sorted
| lists, why wouldn't it be as fast as this implementation?
|
| edit: Okay one reason is that heapq.merge is written in python:
|
| https://github.com/python/cpython/blob/6252670732c68420c2a8b...
|
| It might also be since the python version has more options (like
| arbitrarily many iterables), but I presume it's the fact that the
| extension is written in C.
| typon wrote:
| Because it's written in python
| CogitoCogito wrote:
| Re-writing it in C wouldn't automatically make much of a
| difference. Of course a C implementation wouldn't be slower
| (at minimum you could flatten out the byte-code sequence and
| use the generated C api calls), but it wouldn't necessarily
| be noticeably faster. Given the features of the python
| version, it might very well be the case that writing the same
| code in C wouldn't speed things up if the same use-cases were
| supported. Though I would be curious to know if it actually
| would speed up significantly.
| nightpool wrote:
  | TFA explains this, although they never actually link the
  | StackOverflow answer that they quote from.
|
| > CPython's list.sort is implemented in C (avoiding interpreter
| overhead), while heapq.merge is mostly implemented in Python,
| and optimizes for the "many iterables" case in a way that slows
| the "two iterables" case.
|
| Going to the original StackOverflow answer, we can see the same
| user (ShadowRanger) expanding on this in a comment:
| https://stackoverflow.com/questions/464342/combining-two-sor...
|
| > The selling point for heapq.merge is that it doesn't require
| either the inputs or the outputs to be list; it can consume
| iterators/generators and produces a generator, so huge
| inputs/outputs (not stored in RAM at once) can be combined
| without swap thrashing. It also handles merging an arbitrary
| number of input iterables with lower overhead than might be
| expected (it uses a heap to coordinate the merging, so the
| overhead scales with the log of the number of iterables, not
| linearly, but as noted, that doesn't matter for the "two
| iterable" case).
| CogitoCogito wrote:
| Yeah looking at the all the features of the heapq.merge()
| version, I kind of doubt implementing it in C would speed it
| up much at all.
|
  | edit: Yeah, I now see where the author mentioned the design
  | philosophy of the Python version. It was a bit buried, but
  | it answers my question like you say. I personally find that
  | to be the most interesting part.
| [deleted]
| kzrdude wrote:
| Could Cython be used for writing the fast module here instead?
| benatkin wrote:
| Doesn't explain why Earthly.dev is under the BSL.
| quibono wrote:
| I don't know why you would expect a post with a title like this
| to touch on that, even though being wary of the BSL licence is
| fair.
___________________________________________________________________
(page generated 2021-07-13 23:00 UTC)