[HN Gopher] Changing std::sort at Google's scale and beyond
___________________________________________________________________
Changing std::sort at Google's scale and beyond
Author : ashvardanian
Score : 352 points
Date : 2022-04-20 15:57 UTC (7 hours ago)
(HTM) web link (danlark.org)
(TXT) w3m dump (danlark.org)
| jeffbee wrote:
| Huh. If I found that Reflection::ListFieldsMayFailOnStripped was
| a frequent caller of std::sort, to an extent that it was actually
| costing measurable CPU time, first thing I would do is find and
| fix all callers of _that_.
| gipp wrote:
| "Fix" how exactly? I'm not quite sure what you're suggesting.
| jeffbee wrote:
| C++ proto reflection is a convenience. If it actually shows
| up at scale in production, that smells like a design error.
| moultano wrote:
| All things show up at scale in production at Google.
| jeffbee wrote:
| In one sense yes, but in another sense even though Google
| is large their threshold for something being worth fixing
| based on efficiency justifications is correspondingly
| large.
| foota wrote:
| Frequently though, you're dealing with a death by a
| thousand paper cuts. Perhaps that call comes from some
| common library whose implementation requires reflection.
| moultano wrote:
| It isn't though. A typical developer will never care
| about 0.1% of unnecessary CPU usage in their application,
| but at Google, depending on what component it's in, that
| can easily be a million dollar find.
| usefulcat wrote:
| If you're measuring in terms of time, electricity, or
| hardware, then the threshold would tend to be lower for
| Google, not higher. A 0.1% savings on a million machines
| is ~1 million times more valuable than a 0.1% savings on
| a single machine.
|
| OTOH if the measurement is risk of breaking something,
| then it's not nearly so clear, since larger organizations
| tend to have more software and hardware, and therefore
| more opportunities for things to get broken.
| jeffbee wrote:
| You're saying the opposite of what the other guy said,
| though. They said that at scale small things are big. I'm
| saying that at scale .001% is still .001%.
| jcelerier wrote:
| > In one sense yes, but in another sense even though
| Google is large their threshold for something being worth
| fixing based on efficiency justifications is
| correspondingly large.
|
| not everything is a linear relationship
| pierrebai wrote:
| In the first example of problematic calls to nth_element, there
| is another bug: if the container contains nothing, it will call
| nth_element with begin(), begin() - 1, ...
|
| Also, typo: "compare functions much comply" -> "compare functions
| must comply"
| timmytokyo wrote:
| "It took us around 20 years to find something really appropriate,
| thanks to Andrei Alexandrescu."
|
| The way this is written, it sounds as if Andrei Alexandrescu is
| responsible for the 20 year delay.
| lauv0x wrote:
| samueloph wrote:
| I'm starting to question my understanding of algorithmic
| complexity, the author says things like "complexity being worst
| case O(n \log n)" and seems to talk about big O notation used for
| average cases as well.
|
| I've learned that big O notation is always and exclusively about
 | the worst case, for average and best cases we use Theta (Θ) and
 | Omega (Ω), respectively.
|
| Do people not follow these definitions strictly? Has the author
| committed a mistake or am I missing something?
|
| I did experience an interviewer once saying that my complexity
| analysis was wrong as the worst case would be extremely rare to
| hit, I did try to explain nicely that big O notation is always
| about the worst case, and that they were thinking about big Theta
| instead, now I wonder if I was wrong.
| waynecochran wrote:
| You are mixing two things together. The "average/expected case"
| and the "worst case" are two different functions, each with
 | their own O, Θ, Ω.
| Corence wrote:
| O, Theta, and Omega are independent from best, worst, and
| average case. You can mathematically establish upper, lower, or
| exact bounds on any of the average, best, or worst cases.
|
| It is perfectly valid to say "the best case is O(n)" which
| means the best case scales no worse than linearly. The "no
| worse" here is not describing the best case, but rather the
| strictness of your bound. You could say a sort algorithm is
| O(n!) since, yes, it does scale no worse than n! but it's not
| particularly helpful information.
|
| Big O notation is used imprecisely frequently (see also people
| saying a dataset is O(millions) to describe scale).
| SQueeeeeL wrote:
| You should look into amortized complexity. That's the
 | formalized name for the average cost per operation over a
 | worst-case sequence of operations.
| samhw wrote:
| The other replies are great, but just to put it concisely: big
 | O is 'worst case' in terms of input _size_, while the 'best
| case big O' is 'best case' in terms of the input _contents_
| ('shape').
| bsedlm wrote:
| In my opinion, nothing about big-O and related omega, theta,
| etc, is at all strict.
|
| Maybe I value "pure symbolic clairty" too much. I do see a very
| _pragmatic_ usefullness to the complexity notation.
| tempnow987 wrote:
| I always used big O for worst case (by default). I did think
| that was common.
|
| Theta and Omega I use differently however.
|
| I sometimes say, best case if X then it's O(n) or whatever.
| larsrc wrote:
| There are variations, that's what we learned back at Uni.
| Expected O(f(n)), average O(f(n)), amortised O(f(n)), even
| expected amortised O(f(n)). "Expected" implied a randomised
| algorithm.
| vmsp wrote:
| In school, we used big O notation for all cases. I've never
| heard about theta or omega notation.
|
| Wikipedia's bubble sort page -- as an example -- also uses it
| for all cases.
|
| https://en.wikipedia.org/wiki/Bubble_sort
| rockinghigh wrote:
| You can see the list of notations here: https://en.wikipedia.
| org/wiki/Big_O_notation#Family_of_Bachm...
| dwohnitmok wrote:
| Big O, Theta, and Omega notations all ultimately denote sets of
| functions. So e.g. O(n) is the set of all functions which are
| eventually overtaken by some linear function.
|
| When we say "X is O(f(n))" what we really mean is "the function
| that characterizes X is contained within the set of functions
| denoted by O(f(n))." Anything that can be described by a
| function hence can be given Big O, Theta, or Omega notation. So
| yes worst case time complexity could be described in this
| notation, but so could average time complexity.
|
| But we could also describe things that have nothing to do with
| time or space complexity with big O notation as well. As long
| as we've turned it into a function, we can describe it with big
| O notation.
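 |
 | For reference, the standard definitions behind that view, in LaTeX
 | (one common convention):
 |
 |     O(g)      = \{ f \mid \exists c > 0, n_0 : f(n) \le c \cdot g(n) \text{ for all } n \ge n_0 \}
 |     \Omega(g) = \{ f \mid \exists c > 0, n_0 : f(n) \ge c \cdot g(n) \text{ for all } n \ge n_0 \}
 |     \Theta(g) = O(g) \cap \Omega(g)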
| londons_explore wrote:
| Computer scientists stole some notation from mathematicians,
| but then slightly changed and substantially slackened the
| definition...
| tsimionescu wrote:
| No, not really. CS and other branches of maths are using the
 | exact same definitions of O, o, Θ, and ω. It is true that Ω is
 | used with a different definition in CS, but it's _stricter_,
 | not more slack.
|
| As explained elsewhere though, they are not notations for
| algorithmic complexity, they are notations for function
| categorization. It just happens that a common use for this
| notation in CS is to categorize the function
| "best/average/worse/... case complexity of an algorithm", but
| they are also used for other purposes.
| ufo wrote:
| In my experience, a lot of the time when programmers say
| big O, they actually mean big Theta...
| tsimionescu wrote:
| Yes, I agree with that. However, if f(n)=Theta(n) =>
| f(n)=O(n), so they are usually not wrong, technically,
| just unnecessarily broad.
| adrianN wrote:
| Programmers need not be very good computer scientists.
| The actual scientists publishing papers on algorithms
| usually get it right.
| throw149102 wrote:
| I like to use the metaphor of driving somewhere, and trying to
| estimate when you're going to get there. Maybe you have a hotel
| reservation, and you want to make sure you don't get there too
| early nor too late. Then (with some times filled in for
 | example):
 |
 |                                             Best Case             Average Case        Worst Case
 |                                             (little/no traffic)   (average traffic)   (car accident, snow-in etc)
 |     Max time, upper bound (Big O)           30 minutes            45 minutes          1 hour 20 minutes
 |     Both min and max, tight bound (Theta)   25 minutes            35 minutes          1 hour
 |     Minimum time, lower bound (Omega)       15 minutes            25 minutes          50 minutes
|
| So then you can start asking questions like - if I assume I'm
| in the average case, and I can show up earliest at 20 minutes
| from now, and show up latest at 50 minutes from now, can I
| guarantee that I will make it within that window? The answer in
| this case is yes, because our range is 25-45 minutes, so the
| earliest I can show up is in 25 minutes and the latest in 45
| minutes.
|
| This also helps explain why Big O is generally more useful - we
| can almost always be early to something, but we can almost
| never be late to something. So showing up 5 minutes early
| usually is not a problem.
|
| It also shows why Theta is better than both - in this case, if
| we had this table we could completely ignore both the Big O row
| and Omega row, because Theta is a better bound for both min and
| max time.
|
| Of course, in algorithmic complexity we replace the times with
| functions in terms of some variables, usually just _n_.
| tsimionescu wrote:
| Your example is nice, but extremely informal, as these are
| all constant numbers, while O/Theta/Omega (and o, omega,
| theta) are about functions.
|
| I think the most important thing that needs to be clarified
| though is that algorithmic complexity is first and foremost a
| function, `WorseCaseComplexity(size of input) = max number of
| steps for any input of that size`. This function is often
| then classed into a "complexity class" using O/Theta/Omega
| etc, but that is only done for the sake of easier comparison
| of complexities between algorithms.
| powersnail wrote:
| I think you are mistaken here. The concept of
| worst/best/amortized case is different from O/Theta/Omega. The
| latter is about the direction with which it is bounding.
|
| Big O/Theta/Omega denotes sets of functions, and each of them
| can be applied to best/worst/amortized complexity analysis.
| tsimionescu wrote:
| Worst case, median case, and best case complexity have nothing
 | to do with O/Θ/Ω.
|
| When analyzing an algorithm, you want to compute the number of
| steps it has to execute in certain cases as a function of its
| input size(s). That function will be something like 2n^2 + 67n
| + 189 steps. Since this is too much information, you can then
 | express it instead as O(n^2) (or O(n^100)), or Ω(n^2) (or
 | Ω(25n)), or Θ(n^2).
|
| You can then also compute the best case running time of the
| algorithm as a function of the size of the input, and get 2n^2
| + 67n. Then you can say that the best case running time is also
| O(n^2).
|
| Finally, you can (sometimes) define an "average" running time.
|
| Now, what do we mean by "shortest running time"? Is that not
| always for n=0 or something? No - the best case running time
| has to do with the running time in a case where the input has a
| certain shape. For some sorting algorithms for example, if the
| input is already sorted, they will only go through the list
| once and do nothing, so they have a best case running time that
| is O(n). Other sorting algorithms will still perform steps even
| if the input is already sorted, so even their best case running
| time may be O(n log(n)).
|
| Now, you can say that, given any possible input, the running
 | time of the algorithm is O(worst case running time) and Ω(best
 | case running time). You definitely CAN'T say in general that it
 | is Θ(average case running time) though, as it will sometimes
 | run in time that is shorter than "average case running time"
 | (for best case scenarios); or sometimes it will run in time
 | that is longer than "average case running time" (for worst case
| scenarios).
|
 | Basically, the worst/best/average case running times are an
| analysis of how properties other than the size of the input
| affect its running time.
|
| Big-O notation and its friends have nothing to do with
| complexity analysis. They are a general notation for the
| asymptotic behavior of mathematical functions. Big-O is usually
 | understood as an upper bound on a function, big-Ω is a lower
 | bound, and Θ is a tight bound (a function in Θ(f) is by
 | definition always within a factor of k above or below f).
|
| Edit: let's take an example. Say we have the following
 | algorithm:
 |
 |     isAllOdd(array x):
 |         if x is empty: return true
 |         if x[0] is even: return false
 |         return isAllOdd(x[1:])
|
| In the best case, for any size array with the first element
| being even, this function finishes after executing 4
| instructions: 2 comparisons, an array access and then a return.
 | So its best case complexity is best(n) = 4, which is O(1),
 | Θ(1), Ω(1).
|
 | In the worst case, all elements of the array are odd, so we
| have to go through all n elements to find that out. It will
| have to execute 6 instructions for each element in the array (2
| comparisons, 2 array accesses, a call and a return), except for
| the last element where it will execute only 2 instructions
| (check size then return). So its worst case complexity function
 | is worst(n) = 6(n-1) + 2, which is O(n), Θ(n), Ω(n).
|
 | In the average case, for every call we are flipping a coin.
 | Since we can assume the coin is fair, we expect to see an even
 | number within about 2 calls of isAllOdd, independent of n. So,
 | our average case complexity is average(n) ≈ 6*2 + 2 steps,
 | which is O(1), Θ(1), Ω(1). (I am ignoring cases where n < 2,
 | which is anyway irrelevant for asymptotic behavior).
|
| Now, what is the overall complexity of isAllOdd? We can
| generalize the average case calculation if we want to get an
| exact value; but whatever that function, overall(n), is, we
| know for sure best(n) < overall(n) < worst(n); so overall(n) is
 | O(n) and Ω(1). However, we can't define any Θ for it - there
| is no function which it will differ from by at most a constant
| factor for any n.
|
| Why do the detailed analysis instead of saying this is O(n) and
| being done with it? Well, perhaps we know that we are in fact
| calling this function with a large number of even numbers, so
| we would actually over-estimate its complexity. If we instead
| know we are calling it with a large number of odd numbers, we
| would also find out from the average case analysis that it
 | would be better to define a reverse function, isAllEven, and
| compute isAllOdd = !isAllEven.
| karpierz wrote:
| Theta(X) is not "average case" behaviour, it means "both
| Omega(X) and O(X)".
|
| Ex: merge sort is Theta(n log(n)), but quick sort is not.
|
| The author is likely saying "worst case O(n log(n))" as
| redundancy to emphasize that this is worst case behaviour.
| tsimionescu wrote:
| It's not redundant. You can have "average case O(n log(n))"
| but "worse case O(n^2)", like quicksort does. Depending on
| the pivot function, that "worse case" can be more or less
| common.
| karpierz wrote:
| It is not redundant to say "average case O(n log(n))". It's
| not formally a correct usage of O(). But colloquially, it
| means that the average number of operations over all
 | possible inputs of length n is X.
|
| It is redundant to say "worst case O(n^2)". By definition,
| O(X) means that the maximum of the number of operations
| over all possible inputs of length n is at most X.
| tsimionescu wrote:
| > It is not redundant to say "average case O(n log(n))".
| It's not formally a correct usage of O().
|
| You are wrong. It is perfectly formally correct to say
| "average case complexity is 2n log(100n) + 78n + 90,
| which is O(n log (n))".
|
| > By definition, O(X) means that the maximum of the
| number of operations over all possible inputs of length n
| is at most X.
|
| No, that is, in fact, a colloquial use of O(X). O(X),
 | formally, is simply the set of all functions that grow no
 | faster than X, up to some constant factor. It has nothing
| to do with number of operations in itself.
|
| Instead, we formally define complexity functions for an
| algorithm, which are the minimum/average/maximum number
| of operations the algorithm will execute for any input of
| size N; call these functions Cmin(n), Cavg(n), Cmax(n).
| Then, we can say that C_(n) is in O(X(n)) (or, as a
| shorthand, C_(n) = O(X(n))) if there exists a number k
| such that C_(n) < k * X(n) for any n > n0.
|
| So, we can say that the best case complexity of an
 | algorithm, Cmin(n), is O(n^2); this means that even for
 | the best possible input of length n, it will execute at
 | most k * n^2 steps, for some k. Or, we can say that the
 | worst case complexity of some algorithm, Cmax(n), is
 | Omega(n^2); that is, that for the worst case input of
 | length n, it will execute at least k * n^2 steps, for
 | some k.
|
| In fact, the concept "number of steps that an algorithm
| will execute for any input of size n" is not a
| mathematical function at all (unlike min/max/avg number
| of steps), since there isn't, in general, a unique such
| number for any given size.
|
| So, I would argue that saying "complexity O(n^2)" is in
| fact ill-defined formally, and that, if you want to be
| formal, you _must_ specify "worst case complexity
| O(n^2)".
| stonemetal12 wrote:
| Would argue it isn't perfectly formally correct, because
| "average case complexity" isn't well defined. Is average
 | here the mean, median, or mode? It is typically used as if it
 | meant the 90th percentile, or all but a few degenerate
| cases.
|
| But that is a bit nit picky.
| Kranar wrote:
| You are incorrect about this. Worst case and Big O are
 | independent of one another. The O, Ω, Θ, etc. are
 | about describing bounds, where O(f) is the set of functions
 | for which f is an asymptotic upper bound, Ω(f) is the set of
 | functions for which f is an asymptotic lower bound, and Θ(f)
 | is the intersection of O(f) and Ω(f).
|
| Typically f is used to describe space or time complexity
| of an algorithm, but it can be used for any function that
| maps positive integers to the non-negative real numbers.
|
| You can make any combination of best case/worst
 | case/average case with O, Ω, Θ. In fact, you can even
| consider other cases as well such as amortized cases,
| 90th percentile cases, any case you want to investigate
| that can be defined as a function from positive integers
| to non-negative real numbers can be classified as having
| an asymptotic upper bound, lower bound, or tight bound.
| karpierz wrote:
| Yeah, okay. You are correct here.
|
| "Average-case of algorithm X" can be a function function
| (average number of operations for all possible inputs of
| length n). And asymptotic bounds can be applied to
| arbitrary functions from R -> R.
| dudus wrote:
| I remember from college there was big-Oh `O(x)` and little-Oh
| `o(x)`. Is Omega/Theta the same as little-Oh?
| gwern wrote:
| I'd like to hear more about the reinforcement learning part. The
| patch literally just says 'Reinforcement Learning'. Not very
| helpful!
| hbn wrote:
| Sorry if this seems pedantic in the grand scheme of the topic,
| but is the indentation in the code samples messed up, or is there
| some kind of logic to it? I've never seen anything like that, and
| I find it pretty difficult to read.
| bogwog wrote:
| Looks normal to me (except it seems like some snippets indent
| with 2 spaces while others indent with 4).
|
| The line spacing is weird though.
| danlark wrote:
| Sorry, it definitely was a little inconsistent as I stored all
 | snippets without proper formatting and at the same time tried to
 | align the text for phones, etc.
|
| Failed at everything. Will improve
| evmar wrote:
| The "use ASLR for random seed" trick in there is pretty cute, but
| I noticed it will just silently not work in non-ASLR settings,
| and in particular under Wasm.
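 |
 | Presumably the idea is something like the following sketch (not the
 | actual libc++ code):
 |
 |     #include <cstdint>
 |     // Under ASLR the address of a global varies from run to run, so
 |     // it can act as a cheap per-process "seed". Without ASLR (or
 |     // under Wasm) the address is fixed and the randomness silently
 |     // disappears.
 |     static const char seed_anchor = 0;
 |     static std::uintptr_t cheap_seed() {
 |       return reinterpret_cast<std::uintptr_t>(&seed_anchor);
 |     }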
| forrestthewoods wrote:
| An interesting takeaway here is that custom comparators are
| really hard to write correctly. There are a lot of constraints
| that aren't obvious.
|
| I suspect there should be a new std::verify_weak_ordering to help
| people write tests that perform n^3 tests on potential data to
| ensure a comparator is safe to use.
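 |
 | Roughly this kind of brute-force check is what I mean (hypothetical
 | name, not an actual standard facility):
 |
 |     // O(n^3) check that `cmp` is a strict weak ordering on [first, last).
 |     template <class It, class Cmp>
 |     bool is_strict_weak_ordering(It first, It last, Cmp cmp) {
 |       for (It a = first; a != last; ++a) {
 |         if (cmp(*a, *a)) return false;                    // irreflexivity
 |         for (It b = first; b != last; ++b) {
 |           if (cmp(*a, *b) && cmp(*b, *a)) return false;   // asymmetry
 |           for (It c = first; c != last; ++c) {
 |             // transitivity of the ordering
 |             if (cmp(*a, *b) && cmp(*b, *c) && !cmp(*a, *c)) return false;
 |             // transitivity of equivalence (incomparability)
 |             bool ab = !cmp(*a, *b) && !cmp(*b, *a);
 |             bool bc = !cmp(*b, *c) && !cmp(*c, *b);
 |             if (ab && bc && (cmp(*a, *c) || cmp(*c, *a))) return false;
 |           }
 |         }
 |       }
 |       return true;
 |     }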
| jcelerier wrote:
| MSVC and GCC's standard libraries implementations both have a
| debug mode which has been checking that for almost two decades.
| DrBazza wrote:
| Quite a long interesting read. Unless I missed it, I recall that
 | there are optimal solutions for sorting up to 7?? items. I can't
 | find the reference, but unsurprisingly I read it in a dlang forum
| post by Andrei Alexandrescu. Anyone have a link to that aspect of
| sorting?
| tialaramex wrote:
| Under the section heading "Reinforcement learning for small
| sorts", there are diagrams of the small sorts, for 3, 4 and 5
| items.
|
| They also explain that down here you're at the mercy of the
| physical hardware. Emitting one fewer x86 instruction does not
| necessarily make your program faster, still if it's not slower
| then smaller programs are good too.
| westurner wrote:
| > LLVM history: _Back then we recognized some really interesting
| benchmarks and we didn't recall anybody trying to really
| benchmark sorts on different data patterns for standard
| libraries._
|
| Timsort https://en.wikipedia.org/wiki/Timsort :
|
| > _In the worst case, Timsort takes O(n log n) comparisons to
| sort an array of n elements. In the best case, which occurs when
| the input is already sorted, it runs in linear time, meaning that
| it is an adaptive sorting algorithm. [3]_
|
| > _It is advantageous over Quicksort for sorting object
| references or pointers because these require expensive memory
 | indirection to access data and perform comparisons and
 | Quicksort's cache coherence benefits are greatly reduced._ [...]
|
 | > _Timsort has been Python's standard sorting algorithm since
| version 2.3 [~2002]. It is also used to sort arrays of non-
| primitive type in Java SE 7,[4] on the Android platform,[5] in
| GNU Octave,[6] on V8,[7] Swift,[8] and Rust.[9]_
|
| Sorting algorithm > Comparison of algorithms
| https://en.wikipedia.org/wiki/Sorting_algorithm#Comparison_o...
|
| Schwartzian transform
| https://en.wikipedia.org/wiki/Schwartzian_transform :
|
| > _In Python 2.4 and above, both the sorted() function and the
| in-place list.sort() method take a key= parameter that allows the
| user to provide a "key function" (like foo in the examples
| above). In Python 3 and above, use of the key function is the
| only way to specify a custom sort order (the previously supported
| cmp= parameter that allowed the user to provide a "comparison
| function" was removed). Before Python 2.4, developers would use
| the lisp-originated decorate-sort-undecorate (DSU) idiom,[2]
| usually by wrapping the objects in a (sortkey, object) tuple_
|
| Big-O Cheatsheet https://www.bigocheatsheet.com/
|
| Quicksort in Python 7 ways (and many other languages) on
| RosettaCode:
| https://rosettacode.org/wiki/Sorting_algorithms/Quicksort#Py...
| dralley wrote:
| IIRC Rust and Swift have a modified timsort implementation to
| remove the "galloping" strategy.
| tialaramex wrote:
| Rust has "an adaptive, iterative merge sort inspired by
| timsort" to implement the stable sort() method if you have an
| allocator, and a pattern-defeating quicksort to implement
| unstable_sort() which is provided even if you don't have an
| allocator (no_std embedded environments).
| danlark wrote:
| Thanks, I'll remove the note about "recalling"
| westurner wrote:
| xeus-cling didn't exist back then:
| https://github.com/jupyter-xeus/xeus-cling#a-c-notebook
|
| https://github.com/fffaraz/awesome-cpp#debug
|
| A standard way to benchmark and chart {sorting algorithms,
| web framework benchmarks,} would be great.
|
| The TechEmpower "framework overhead" benchmarks might have at
| least average case sorting in there somewhere: https://www.te
| chempower.com/benchmarks/#section=data-r20&hw=...
| westurner wrote:
| awesome-algorithms https://github.com/tayllan/awesome-
| algorithms#github-librari...
|
| awesome-theoretical-computer-science > Machine Learning
| Theory, Physics; Grover's; and surely something is faster
| than Timsort: https://github.com/mostafatouny/awesome-
| theoretical-computer...
| samhw wrote:
| Christ, the description on that second repo is so
| _comically terribly_ spelled that I can't figure out
| whether it's a joke:
|
| > The interdicplinary of Mathematics and Computer
| Science, Distinguisehed by its emphasis on mathemtical
| technique and rigour.
|
| (Perhaps a witty satire on the reputation of mathmos for
| being godawful at anything literary...)
| chkas wrote:
| I don't understand why you would need Timsort. If the pivot
| element is chosen randomly, Quicksort is always (or with
| probability bordering on certainty) O(n log n).
|
| Update: I confused Timsort with IntroSort. I should have
| googled before posting ...
| masklinn wrote:
| > I don't understand why you would need Timsort. If the pivot
| element is chosen randomly, Quicksort is always (or with
| probability bordering on certainty) O(n log n).
|
| For starters, it's a stable sort, which excludes quicksort
| right out the gates.
|
| Second, the primary hypothesis of timsort is that real-world
| data is non-random, timsort is biased for partially or
| almost-sorted sequences, which it can sort in as low as O(n).
| That would be the point westurner was making, timsort's
| creation is very much predicated on "really benchmark sorts
| on different data patterns".
|
| Third, for very short sequences (also a common situation)
| it's an insertion sort, which is extremely efficient due to
| the low overhead (even more so combined with the adaptivity
| which insertion sort also has).
| orlp wrote:
| > For starters, it's a stable sort, which excludes
| quicksort right out the gates.
|
| Negative, it excludes in-place quicksort but out-of-place
| quicksort can be stable just fine.
|
| > Third, for very short sequences (also a common situation)
| it's an insertion sort, which is extremely efficient due to
| the low overhead (even more so combined with the adaptivity
| which insertion sort also has).
|
| Every modern quicksort implementation also does this (or a
| similarly efficient smallsort).
| superjan wrote:
| Hi, you are ignoring the second point, and for the other
| two points you are referring to particular extensions of
| quicksort. Quite possibly you can extend quicksort
| similar to how timsort extended mergesort, but if that
| was commonplace nobody would care about timsort.
| orlp wrote:
| > and for the other two points you are referring to
| particular extensions of quicksort
|
| Maybe for the third point, but once again, every
| implementation does this. For the first point, absolutely
| not. Just because in-place partitioning is often the only
| variant described, out-of-place partitioning is
| absolutely just quicksort, not an extension.
|
| > Quite possibly you can extend quicksort similar to how
| timsort extended mergesort
|
| Please see my top comment in this thread - I have done
| exactly that. Or rather, I have extended powersort (which
| is very similar to timsort) with quicksort.
| loeg wrote:
| I ported Timsort to the Windows kernel back in college. Good
| times!
| Something1234 wrote:
| I'm confused, so the comparison function can modify the list? I
| thought most sorting algorithms assumed the contents of the list
| was constant. Why would I want my comparator to ever modify the
| list under sort?
| slavboj wrote:
| An "accessed at" field is a trivial use case.
| danlark wrote:
| You don't, it's just not disallowed :)
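 |
 | A minimal illustration of "not disallowed" (made-up types):
 |
 |     // The comparator sees const references, but nothing stops it
 |     // from touching mutable state reachable through them:
 |     struct Item {
 |       int key;
 |       mutable int times_compared = 0;
 |     };
 |     std::sort(items.begin(), items.end(),
 |               [](const Item& a, const Item& b) {
 |                 ++a.times_compared;   // mutates the element mid-sort
 |                 return a.key < b.key;
 |               });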
| infogulch wrote:
| "because it prevents the user from writing a comparison
| function that modifies the list while sorting" is not a
| reason I would expect to prefer rust, and yet here we are.
| mbrubeck wrote:
| It's possible in Rust too, but only for a subset of element
| types: those that support shared mutation (also known as
| "internal mutation"). For example, a type containing an
| `AtomicU32` or a `Mutex<T>` or a `Cell<T>` could have an
| `Ord::cmp` implementation that alters its value, and the
| compiler will not prevent this.
| orlp wrote:
| Yes, I had a nasty bug in an initial version of glidesort
| where I (naively) assumed that I could create a temporary
| copy of an element by doing a memcpy and call the
| comparison function on this temporary copy a bunch of
| times before throwing it away, since I assumed the
| comparison function would not mutate the copy.
| Unfortunately interior mutability means this would lead
| to potential double drops, and it was just not sound.
| [deleted]
| jmalicki wrote:
| If you examine the modification more closely, it is just
| finding the list that would exhibit worst case performance -
| once created, that list would always exhibit the worst case as
| a constant list.
|
| Once you have the resulting list, the indices are sorted, then
| assigning each num_solid value to that index would result in
 | the list being constant, which could then be sorted to give such
 | worst case behavior.
|
| I wish the article was more clear on this.
| orlp wrote:
| Hi all, I'm the author of pdqsort that's credited in the post (to
| be clear: the authors acknowledged pdqsort at the bottom, I am
| _not_ associated with the posted work directly). I recently held
| a talk at my institute on efficient in-memory sorting and the
 | ideas in pdqsort, in case you're interested in hearing some of
| the theory behind it all: https://www.youtube.com/watch?v=jz-
| PBiWwNjc
|
| Next week I will hold another talk in the Dutch seminar on data
| systems design (https://dsdsd.da.cwi.nl/) on glidesort, a new
 | _stable_ sorting algorithm I've been working on. It is a
| combination of adaptive quicksort (like pdqsort, fully adaptive
| for many equal elements) and an adaptive mergesort (like timsort,
| fully adaptive for long pre-sorted runs). It is the first
| practical implementation of an algorithm I'm aware of that's
| fully adaptive for both. Like pdqsort it uses modern architecture
| aware branchless sorting, and it can use an arbitrary buffer
| size, becoming faster as you give it more memory (although if
| given a constant buffer size it will degrade to O(n (log n)^2) in
| theory, in practice for realistic workloads it's just a near-
| constant factor (c ~<= 3-5) slower).
|
| The source code isn't publish-ready yet, I have to still do some
| extra correctness vetting and testing, and in particular
| exception-safety is still not yet fully implemented. This is
| important because I wrote it in Rust where we must always give
| back a valid initialized array, even if a comparison operator
| caused a panic. But I do have some performance numbers to quote,
| that won't significantly change.
|
| For sorting 2^24 randomly shuffled distinct u32s using a buffer
| of 2^23 elements (n/2), glidesort beats Rust's stdlib slice::sort
| (which is a classic timsort also using a buffer of n/2) by a
| factor of 3.5 times. When stably sorting the same numbers
| comparing only their least significant 4 bits, it beats stdlib
| slice::sort by 12.5 times using 6.5 times fewer comparisons, both
| numbers on my Apple M1 Macbook. All of this is just using single-
| threaded code with a generic comparison function. No SIMD, no
| threads, no type-specific optimizations.
|
| Finally, glidesort with a buffer size of >= n/2 is _faster_ than
| pdqsort.
| mlochbaum wrote:
| Any chance you could comment on fluxsort[0], another fast
| quicksort? It's stable and uses a buffer about the size of the
| original array, which sounds like it puts it in a similar
| category as glidesort. Benchmarks against pdqsort at the end of
| that README; I can verify that it's faster on random data by
| 30% or so, and the stable partitioning should mean it's at
| least as adaptive (but the current implementation uses an
| initial analysis pass followed by adaptive mergesort rather
| than optimistic insertion sort to deal with nearly-sorted data,
| which IMO is fragile). There's an in-place effort called
| crumsort[1] along similar lines, but it's not stable.
|
| I've been doing a lot of work on sorting[2], in particular
| working to hybridize various approaches better. Very much
| looking forward to seeing how glidesort works.
|
| [0] https://github.com/scandum/fluxsort
|
| [1] https://github.com/scandum/crumsort
|
| [2]
| https://mlochbaum.github.io/BQN/implementation/primitive/sor...
| orlp wrote:
| I have stolen a lot of ideas from scandum and extended his
| ideas in new ways. He is definitely a mad genius. Glidesort
| (on my M1 machine at least) matches fluxsort within a couple
| % for random data, but glidesort is robust in that it will
| always take advantage of pre-sorted runs and many equal
| elements (at least if it has buffer memory), no matter where
| they are in the array.
|
| In particular, I was inspired by three things from scandum:
|
| 1. fluxsort's out-of-place stable partitioning. From this I
| got reminded that not only is out-of-place stable
| partitioning a thing, it's highly competitive. I've always
| had this as an idea in the back of my mind, but never went
| through with it because I kept getting discouraged by C++'s
| distinction of moving to uninitialized memory vs. moving into
| a moved-from value (which is why I implemented glidesort in
| Rust).
|
| 2. quadsort's "ping pong merge", which reduces unnecessary
| memcpys by merging both on the way _out_ and on the way _in_
| the original array. I did have this idea before, but always
 | dismissed it because I thought keeping track of what's where
| would be a massive pain. Simply waiting until there's 4
| things to merge eliminates this problem and is just genius.
|
| 3. quadsort's "branchless parity merge", which merges from
| both ends of the array if the merge is perfectly balanced. I
| make no claim that I thought of this, it's just genius. I had
| two key takeaways from this: you can make some very fast
| small sorting algorithms with merges, and interleaving loops
| to reduce data dependencies are significantly faster.
|
| So I combined #1 & #3 into what I call bidirectional stable
| partitioning, where I partition from _both_ sides of the
| array into an out-of-place buffer through interleaved loops.
|
| I extended the adaptiveness and applicability of #2 heavily
| by replacing the 'merge' operation in powersort
| (https://arxiv.org/abs/1805.04154) with a 'virtual merge'
| operation that delays merges until necessary. This is also
| what allows me to use quicksort in a bottom-up adaptive
| mergesort, because I don't eagerly sort small runs! Instead I
| simply keep unsorted runs around, 'merging' unsorted runs
| simply by concatenating them - purely in bookkeeping.
|
| I heavily extended 3 for the mergesort part by realizing we
| don't need perfectly balanced merges, we can just take the
| `min` of the two runs and start off with a merge from both
| sides, and then look further. I also did more interleaving by
| doing a binary search to compute independent parallel merges
| where sensible, and interleaving those loops.
|
| As a quick preview, here is a visualization of glidesort
| using a buffer size of n/2, where I have artificially limited
| the concatenation of unsorted runs to n/8 so that it won't
| just look only like quicksort, and both the quicksort and
| mergesort aspects are shown: https://cdn.discordapp.com/attac
| hments/273539705595756544/96...
| mlochbaum wrote:
| Thanks for the discussion! Can't say I follow everything,
| but using parity merge for part of an unbalanced merge
| makes a lot of sense and that alone is worth it.
|
| Stepped through the video a few times at 1/4 speed. The n/8
| thing is a bit confusing, first because I didn't read it
| and second because it makes it hard to tell a partition
| result from the beginning of the next segment. I think I
| can follow what's going on, but I don't get the purpose of
| the bidirectional partition. It doesn't use less memory,
| does it? So is there something to do with fitting in with
| mergesort better? I'm not familiar with powersort; I'll
| read up on it.
| orlp wrote:
| > but I don't get the purpose of the bidirectional
| partition. It doesn't use less memory, does it? So is
| there something to do with fitting in with mergesort
| better
|
| Nope, it does the same amount of comparisons, same number
| of operations, same memory, etc. What it does do is it
| allows you to interleave two independent loops, which is
| also what makes the parity merge fast (I think scandum
| misidentifies the loop unrolling for this effect for
| large arrays - you can loop unroll merging large arrays
| either way - for small constant-size merging it is
| important however).
|
| A modern CPU has a very long pipeline, and even though we
| like to pretend all instructions are one cycle with no
| latency, in reality there are real latencies to
| instructions and memory accesses, and multiple
| instructions can be executed in the same cycle. Since
| each iteration of the partition depends on the previous
| one (which pointer did we increment? we have to know
| before we can execute the next store), you can hide these
| latencies better if you interleave two independent loops.
| In addition you can use more instruction-level
| parallelism.
| mlochbaum wrote:
| Got it. I'd considered that but something about your
| explanation threw me off. A subtlety I missed was that
| you can't just do two forward partitions, because the
| second one begins exactly halfway through the array--one
| of the results would be placed there but it's probably
| not the correct start location for the larger half-
| partition.
| orlp wrote:
| Exactly, which is why I call it bidirectional
| partitioning: one forward, one backward. It's a very
| strange case where you can use parallelism (in this case
| instruction-level parallelism), but only get two
| independent instances without the ability to recurse
| further.
|
| You can of course make partitioning embarrassingly
| parallel, look at IPS4o for that. But it is vastly more
| complicated, and involves overhead shuffling blocks after
| the partition.
| hengri wrote:
| Random question, but is this from the Google bare metal team? I
| remember interviewing with them and the work they did was
| fascinating.
| orlp wrote:
| I'm not sure if you're asking me personally or about the
| original post, but to be clear: I'm not (currently)
| associated with Google, nor is my work on pdqsort or
| glidesort. I'm also not associated with the original post,
| other than that my work on pdqsort was credited at the bottom
| of the blog post, which is what my opening sentence was
| referring to.
|
| I edited my comment to reflect this, I can see how it
| could've been misinterpreted. It was not my intention to
| deceive.
| danlark wrote:
| Hi, the author is here
|
 | I am working on MapReduce/Data-pipelining/Dataproc efficiency
 | internally and externally. In the cloud you can check out Google
| Cloud Dataflow
|
| We are working closely with general efficiency and bare metal
 | teams, yes :). They definitely do fascinating work. This
 | one was my 20% project together with the Google efficiency
 | team.
| DontGiveTwoFlux wrote:
| OP describes systems which seed tests with non-determinism to
| catch users who write tests relying on undefined behavior, such
| as order of equal elements. Writing tests against these systems
| is a game changer (once you figure out what's going on and why
| there are flakes). I'm a huge fan of this subtle change which has
| spared me at least one bug. It also allows library authors to
| gain the advantage of not accidentally producing a behavior that
| induces Hyrum's law. Kudos to the team for making this work at
| scale.
| hu3 wrote:
| offtopic: Such a great engineer and yet their blog says: Website
| Powered by WordPress.com.
|
| I find the pragmatism worthy of praise. I've been meaning to
 | start blogging since forever, but I keep planning and researching
 | blog platforms/tools and never get to the point.
| papito wrote:
| WP is a never-ending source of security vulnerabilities. 30% of
| the internet runs on it.
|
| I found Jekyll and never looked back.
| nvarsj wrote:
| That's why you use managed Wordpress, aka wordpress.com.
| sydthrowaway wrote:
| hatware wrote:
| Did one person design the SR-71...?
| sydthrowaway wrote:
| yes
| samhw wrote:
| In case you genuinely missed the point: he was saying that
| the person is a great engineer for _not_ wasting their time
| optimising irrelevant details like their blog. Time
| allocation is an optimisation problem like any other.
| [deleted]
| haolez wrote:
| Making use of existing tools and being able to manage their
| shortcomings is an important engineering skill.
| gzer0 wrote:
| The paradox of choice. I suffer from this way too often.
|
| https://en.wikipedia.org/wiki/The_Paradox_of_Choice
| a_e_k wrote:
| > _If you compare by a boolean variable, for example, partition
| data by existence of something, it's very tempting to write
| std::sort call._
|
| > _We saw a pattern of sort+resize a lot._
|
| These seem like good opportunities for clang-tidy checks.
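 |
 | Both look mechanically detectable. Illustrative rewrites (assuming a
 | vector v, k <= v.size(), and a made-up predicate has_thing):
 |
 |     // "Sorting" by a bool is really a partition:
 |     std::stable_partition(v.begin(), v.end(), has_thing);
 |
 |     // sort+resize to keep the k smallest, and cheaper equivalents:
 |     std::sort(v.begin(), v.end());
 |     v.resize(k);
 |
 |     std::partial_sort(v.begin(), v.begin() + k, v.end());  // k kept in order
 |     v.resize(k);
 |
 |     std::nth_element(v.begin(), v.begin() + k, v.end());   // order among the k irrelevant
 |     v.resize(k);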
| techwiz137 wrote:
| Do you need to have learned Computer Science to grasp the topic
| at hand here?
| [deleted]
| dekhn wrote:
| Great fun read. I love this stuff; when I first learned C++ many
| years ago I didn't really know much about complexity and what I
| have learned over the years is that complexity costs depend a lot
| on the underlying data distributions.
| seanp2k2 wrote:
| Thanks, I'm going to write all this out on a whiteboard the next
| time someone asks me to sort a linked list in a technical
| interview.
|
| Edit: just wanted to clarify that the point I'm trying to make
| here is that if you want to ask someone if they've memorized the
| answer to that question, go ahead and ask it in a technical
| interview, because the _real_ well-considered answer looks like
| the linked article, and my honest non-smartass answer is that one
| should use the library function and not try to implement it
 | themselves until they've done sufficient research into the
| problem to prove to a reasonable degree that they indeed do need
| to re-implement sort() and that it's in the best interests of the
| business to fund such a project.
|
| Would love to read this over the phone to a recruiter and have
| them tell me that it's not what they have on their sheet.
| 10000truths wrote:
| I never really understood why a standard library would insist on
| having a single "sort" function (or at best, a "sort" and "stable
| sort"). Why not just provide explicit quick sort, merge sort etc.
| functions and let users choose which they want? Sorting is one of
| the first things a budding computer scientist learns, and the
| research on sorting algorithms is well-trodden ground by now, so
| I don't see how it could be a usability issue. I'd like to be
| able to choose from these criteria without having to rely on
| external libraries:
|
| - How does it perform on average?
|
| - How does it perform on pathological inputs?
|
| - Is it stable?
|
| - Does it allocate?
|
| Not to mention constraints that an application can take advantage
| of:
|
| - What if the data is already mostly sorted? (insertion sort)
|
| - What if writes are very expensive? (cycle sort)
|
| - What if all the data are integers? (radix sort)
|
| And so on. Trying to force a one-size-fits-all solution seems
| rather silly at this point.
| Hello71 wrote:
| while i agree with you to some extent, i don't think including
| a very large set of sorting functions in the standard library
| is a good idea. especially for dynamic languages, a larger
| standard library means more disk usage and worse cache
| utilization. it's also typically relatively easy to use your
| own sort function: in performance-critical cases, the sort
| function is usually called directly, not via intermediate
| libraries (which might need to be patched to call the new sort
| function); also, sort functions are self-contained, i.e.
| typically don't require system dependencies.
| samhw wrote:
| > A larger standard library means more disk usage and worse
| cache utilization
|
| Does it? How? I suppose one has to store the standard library
| on the machine, which does notionally use up disk space, but
| I can't see how it contributes by any reasonable performance-
| relevant meaning of 'disk usage' (presumably i.e. disk I/O
| during the runtime of the program). And for cache utilization
| I simply have no clue at all what you might be driving at.
| dijit wrote:
| There are such things as sensible defaults though.
|
| Timsort being a somewhat famous example, which uses the fact
| that python datastructures are known length to figure out the
| best sorting method for the size.
|
| If you need something more exotic then you likely know more
| about what you're doing, but most people writing software who
| need a sort function usually don't need more than timsort.
| morelisp wrote:
| > uses the fact that python datastructures are known length
| to figure out the best sorting method for the size.
|
| That... does not match with my knowledge of how timsort
| works. The sorting method is chosen based on the array
| length, and every sorting algorithm in every language knows
| that. It's extremely suboptimal for some types because
| Python's typing is so weak!
|
| Timsort succeeds because it has many good heuristics to
| optimize common properties of its merge intervals.
| wongarsu wrote:
| I think only having "sort" and "stable sort" makes sense for
| the same reason there's only one hashmap or one vector in most
| standard libraries: 99% of the time you just want a high-
| quality implementation of one reasonable choice, without
| getting into the weeds of it.
|
| Most of the time I don't care if it's timsort or quicksort, or
| if the hashmap uses Cuckoo hashing or buckets made of linked
| lists. Of course there are the 1% of cases where I notice that
| my sort or hashmap is a performance bottleneck and that a
| different algorithm might perform better. But standard
| libraries aren't good at serving that need because they have to
| be broadly applicable and conservative, if I already know that
| I want top performance for specific requirements I'm almost
| certainly better served by an external library.
| sam0x17 wrote:
| I think this goes back to the whole idea of non-CS people being
| the primary users of standard libraries.
|
| I agree though, there should be the ability to specify which
| one, and every function in the standard library should have big
| O notation in the function description. Still makes sense to
| have a general-purpose good one as the default.
| yupper32 wrote:
| There are defaults for all sorts of functions. find() functions
| on an array in many languages use some default. Parameters
| often have defaults when not set. Strings often have a default
| encoding.
|
| The vast majority of the time, the default works well enough.
| Why make things more complicated than they need to be?
|
| It also has the added benefits of helping more junior engineers
| when reading the code. "Cycle sort? What is that doing?"
| "Insertion sort? I was taught that was inefficient! I should
| spend time updating this code to a better sorting algorithm! My
| tech lead would love my initiative."
|
| Or we just use sort(), and we all move on with our lives.
| tsimionescu wrote:
| You are thinking about the weeds of sorting now, but that is
| rarely relevant to the problem at hand.
|
| You can (and most people do) look at it from the other angle.
| Sort(seq) is a function which returns a permutation of seq such
| that every element is less-than-or-equal-to the next element.
| Usually, we use this function because we need this property to
| hold in our following code. A lot of the time, this function is
| called for very short sequences, where even the most naive
 | implementation would be fast enough for even the highest scale
| program. So, in the vast majority of code we only care about
| the semantics of std::sort; and all of the above have the same
| semantics.
|
| In fact, the fact that we do often get a stable sort as well
| shows that semantics are the most important characteristic, as
| stable and unstable sorts do have slightly different semantics.
| pikseladam wrote:
 | wow. that was really impressive. scale is a different kind of
| monster. it is like an end game boss. Thank you for sharing it. I
| learned a lot.
| tialaramex wrote:
| Fun safety problems.
|
| Towards the end of this post, the author mentions a bunch of
| places where std::sort() blows up if your data or your sort rule
| doesn't exhibit strict weak ordering. In some cases you get a bad
| sort, which feels fair enough since you violated the rules the
| sort algorithm is relying on, but in others it literally has
| Undefined Behaviour in C++.
|
| Notably Rust's sort() and unstable_sort() do not have this
| problem, because of course these are safe Rust, and safe Rust
| doesn't have Undefined Behaviour. But why can this work?
|
| First up, one bullet which gets dodged is that Rust's floating
| point types (f32 and f64) are not Ord. Rust acknowledges that NaN
| != NaN and therefore we can't expect to go around sorting a slice
| of floating point values. In C++ that's on the programmer, in C++
| the fact you can write a < b means that type counts as ordered,
| even though it trivially isn't for several built-in types. So if
| you simply naively try to sort an [f32] Rust says you can't
| because it isn't Ord, you will need to write a lambda or whatever
| to Order these f32s and then if some of them are NaN your lambda
| is responsible for deciding what to do about that.
|
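 | For contrast, the equivalent C++ compiles without complaint even
 | though operator< is not a strict weak ordering once NaNs are in the
 | mix (illustrative snippet):
 |
 |     #include <algorithm>
 |     #include <cmath>
 |     #include <vector>
 |
 |     int main() {
 |       std::vector<double> v = {3.0, std::nan(""), 1.0, std::nan(""), 0.5};
 |       // NaN < x and x < NaN are both false, so NaN is "equivalent" to
 |       // everything, yet 0.5 < 3.0: equivalence is not transitive and
 |       // the result of the sort is anyone's guess.
 |       std::sort(v.begin(), v.end());
 |     }
 |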
| But even if we insist on getting in harm's way, safe Rust
| promises to protect us anyway. For example if I make an array of
| a thousand misfortunate::OnewayGreater<u8>s they're only one byte
| each, yet they all insist they're Ordered greater than each other
| and even than themselves, Rust's sort() doesn't cause Undefined
| Behaviour on this input. It may never succeed, since all the
| elements to be "sorted" insist they're greater than each other,
| but it definitely won't segfault, trash unrelated memory or cause
| chaos.
| nullc wrote:
| > or cause chaos
|
| Running forever is pretty chaotic.
| ElectricalUnion wrote:
| Running forever is pretty stable. Halting is chaotic.
| tialaramex wrote:
| A trivial infinite loop is _Undefined Behaviour_ in C++ but
| it is a completely reasonable thing to write in Rust and
 | works fine, in fact Rust's underlying loop _is_ an
| infinite loop unless you explicitly provide some way to
| break out of it, which you usually will.
___________________________________________________________________
(page generated 2022-04-20 23:00 UTC)