[HN Gopher] My Favorite Algorithm: Linear Time Median Finding (2...
___________________________________________________________________
My Favorite Algorithm: Linear Time Median Finding (2018)
Author : skanderbm
Score : 209 points
Date : 2024-07-25 09:16 UTC (13 hours ago)
(HTM) web link (rcoh.me)
(TXT) w3m dump (rcoh.me)
| furstenheim wrote:
| Floyd-Rivest also does the job. A bit more efficient IIRC.
|
| However I never managed to understand how it works.
|
| https://en.m.wikipedia.org/wiki/Floyd%E2%80%93Rivest_algorit...
| anonymoushn wrote:
| One commonly sees the implication that radix sort cannot be used
| for data types other than integers, or for composite data types,
| or for large data types. For example, TFA says you could use
| radix sort if your input is 32-bit integers. But you can use it
| on anything. You can use radix sort to sort strings in O(n) time.
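|
| For byte strings, a minimal MSD sketch (recursive and copying
| for clarity; the name and bucket layout are mine):
|
|     def msd_radix_sort(strs, depth=0):
|         if len(strs) <= 1:
|             return strs
|         # bucket 0 holds strings already exhausted at this depth
|         buckets = [[] for _ in range(257)]
|         for s in strs:
|             key = s[depth] + 1 if depth < len(s) else 0
|             buckets[key].append(s)
|         out = list(buckets[0])
|         for b in buckets[1:]:
|             out.extend(msd_radix_sort(b, depth + 1))
|         return out
|
|     msd_radix_sort([b"banana", b"app", b"apple"])
|     # -> [b'app', b'apple', b'banana']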
| contravariant wrote:
| It should also be noted that radix sort is ridiculously fast
| because it just scans linearly through the list each time.
|
| It's actually hard to come up with something that cannot be
| sorted lexicographically. The best example I was able to find
| was big fractions. Though even then you could write them as
| continued fractions and sort those lexicographically (would be
| a bit trickier than strings).
| anonymoushn wrote:
| Sorting fractions by numerical value is a good example.
| Previously I've heard that there are some standard collation
| schemes for some human languages that resist radix sort, but
| when I asked about which ones in specific I didn't hear back
| :(
| contravariant wrote:
| The Unicode Collation algorithm doesn't look fun to
| implement in radix sort, but not _entirely_ impossible
| either. They do note that some characters are contextual,
| an example they give is that CH can be treated as a single
| character that sorts _after_ C (so also after CZ).
| Technically that is still lexicographical, but not byte-
| for-byte.
| exDM69 wrote:
| The problem with radix sorting strings is that it is O(k*N),
| where k is the length of the key; in this case it's the second-
| longest string's length. Additional problems arise if you are
| dealing with null-terminated strings and do not have the length
| stored.
|
| Radix sort is awesome if k is small, N is huge and/or you are
| using a GPU. On a CPU, comparison based sorting is faster in
| most cases.
| anonymoushn wrote:
| No, it's O(N+M) where N is the number of strings and M is the
| sum of the lengths of the strings. Maybe your radix sort has
| some problems?
|
| I evaluated various sorts for strings as part of my winning
| submission to
| https://easyperf.net/blog/2022/05/28/Performance-analysis-
| an... and found https://github.com/bingmann/parallel-string-
| sorting to be helpful. For a single core, the fastest
| implementation among those in parallel-string-sorting was a
| radix sort, so my submission included a radix sort based on
| that one.
|
| The only other contender was multi-key quicksort, which is
| notably not a comparison sort (i.e. a general-purpose string
| comparison function is not used as a subroutine of multi-key
| quicksort). In either case, you end up operating on something
| like an array of structs containing a pointer to the string,
| an integer offset into the string, and a few cached bytes
| from the string, and in either case I don't really know what
| problems you expect to have if you're dealing with null-
| terminated strings.
|
| A very similar radix sort is included in
| https://github.com/alichraghi/zort which includes some
| benchmarks, but I haven't done the work to have it work on
| strings or arbitrary structs.
| Aardwolf wrote:
| > No, it's O(N+M) where N is the number of strings and M is
| the sum of the lengths of the strings.
|
| That would mean it's possible to sort N random 64-bit
| integers in O(N+M) which is just O(N) with a constant
| factor of 9 (if taking the length in bytes) or 65 (if
| taking the length in bits), so sort billions of random
| integers in linear time, is that truly right?
|
| EDIT: I think it does make sense, M is length*N, and in
| scenarios where this matters this length will be in the
| order of log(N) anyway so it's still NlogN-ish.
| anonymoushn wrote:
| > That would mean it's possible to sort N random 64-bit
| integers in O(N+M) which is just O(N) with a constant
| factor of 9 (if taking the length in bytes) or 65 (if
| taking the length in bits), so sort billions of random
| integers in linear time, is that truly right?
|
| Yes. You can sort just about anything in linear time.
|
| > EDIT: I think it does make sense, M is length*N, and in
| scenarios where this matters this length will be in the
| order of log(N) anyway so it's still NlogN-ish.
|
| I mainly wrote N+M rather than M to be pedantically
| correct for degenerate inputs that consist of mostly
| empty strings. Regardless, if you consider the size of
| the whole input, it's linear in that.
| exDM69 wrote:
| Yes, radix sort can sort integers in linear O(N) time.
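|
| A minimal LSD sketch for non-negative 64-bit integers, one byte
| per pass (so 8 linear passes; the name is mine):
|
|     def lsd_radix_sort(a):
|         for byte_i in range(8):           # least significant byte first
|             shift = 8 * byte_i
|             buckets = [[] for _ in range(256)]
|             for x in a:                   # stable pass
|                 buckets[(x >> shift) & 0xFF].append(x)
|             a = [x for b in buckets for x in b]
|         return a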
| dataflow wrote:
| If you have a million 1-character strings and one string of
| length 1 million, how many steps would your LSD radix sort
| take? And (if it's indeed linear in the total input size
| like you say) how do you make it jump over the empty slots
| without losing real-world efficiency in other cases?
| anonymoushn wrote:
| It's MSB radix sort. I think LSB radix sort is not
| generally as useful because even for fixed-size inputs it
| often makes more passes over most of the input than MSB
| radix sort.
|
| Your comment makes me think it would be swell to add a
| fast path for when the input range compares equal for the
| current byte or bytes though. In general radix sorts have
| some computation to spare (on commonly used CPUs, more
| computation may be performed per element during the
| histogramming pass without spending any additional time).
| Some cutting-edge radix sorts spend this spare
| computation to look for sorted subsequences, etc.
| dataflow wrote:
| Oh I see. I've found LSD better in some cases, but maybe
| it's just my implementation. For MSD, do you get a
| performance hit from putting your stack on the heap (to
| avoid overflow on recursion)? I don't just mean the minor
| register vs. memory differences in the codegen, but also
| the cache misses & memory pressure due to potentially
| long input elements (and thus correspondingly large stack
| to manage).
|
| > Some cutting-edge radix sorts
|
| Where do you find these? :-)
| anonymoushn wrote:
| > For MSD, do you get a performance hit from putting your
| stack on the heap (to avoid overflow on recursion)?
|
| I'm not sure. My original C++ implementation which is for
| non-adversarial inputs puts the array containing
| histograms and indices on the heap but uses the stack for
| a handful of locals, so it would explode if you passed
| the wrong thing. The sort it's based on in the parallel-
| string-sorting repo works the same way. Oops!
|
| > cache misses & memory pressure due to potentially long
| input elements (and thus correspondingly large stack)
|
| I think this should be okay? The best of these sorts try
| to visit the actual strings infrequently. You'd achieve
| worst-case cache performance by passing a pretty large
| number of strings with a long, equal prefix. Ideally
| these strings would share the bottom ~16 bits of their
| address, so that they could evict each other from cache
| when you access them. See https://danluu.com/3c-conflict/
|
| > Where do you find these? :-)
|
| There's a list of cool sorts here
| https://github.com/scandum/quadsort?tab=readme-ov-
| file#varia... and they are often submitted to HN.
| dataflow wrote:
| Yeah, what I hate about MSD is the stack explosion.
|
| Otherwise - cool, thanks!
| saagarjha wrote:
| This is hinted at in the post but if you're using C++ you will
| typically have access to quickselect via std::nth_element. I've
| replaced many a sort with that in code review :) (Well, not many.
| But at least a handful.)
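|
| For Python, NumPy's np.partition gives the same kind of selection
| without a full sort -- a quick sketch:
|
|     import numpy as np
|
|     a = np.random.rand(1_000_001)
|     k = len(a) // 2
|     median = np.partition(a, k)[k]  # element k lands in sorted position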
| conradludgate wrote:
| Same with rust, there's the `select_nth_unstable` family on
| slices that will do this for you. It uses a more fancy pivot
| choosing algorithm but will fall back to median-of-medians if
| it detects it's taking too long
| mgaunard wrote:
| You could also use one of the streaming algorithms which allow
| you to compute approximations for arbitrary quantiles without
| ever needing to store the whole data in memory.
| cosmic_quanta wrote:
| That is cool if you can tolerate approximations. But the
| uncomfortable questions soon arise: Can I tolerate an
| approximate calculation? What assumptions about my data do I
| need to make to determine an error bound? How do I verify the
| validity of my assumptions on an ongoing basis?
|
| Personally I would gravitate towards the quickselect algorithm
| described in the OP until I was forced to consider a streaming
| median approximation method.
| conradludgate wrote:
| Well, I believe you could use the streaming algorithms to
| pick the likely median, to help choose the pivot for the real
| quickselect. Quickselect can be done in place too, which is
| O(1) extra memory if you can afford to rearrange the data.
| SkiFire13 wrote:
| Then you don't get a guaranteed O(n) complexity if the
| approximation algorithm happens to make bad choices
| mgaunard wrote:
| say you want the median value since the beginning of the
| day for a time series that has 1000 entries per second,
| that's potentially hundreds of gigabytes in RAM, or just a
| few variables if you use a p-square estimator.
| skrtskrt wrote:
| A use case for something like this, and not just for medians,
| is where you have a querying/investigation UI like Grafana or
| Datadog or whatever.
|
| You write the query and the UI knows you're querying metric
| xyz_inflight_requests, it runs a preflight check to get the
| cardinality of that metric, and gives you a prompt:
| "xyz_inflight_requests is a high-cardinality metric, this
| query may take some time - consider using estimated_median
| instead of median".
| mgaunard wrote:
| any approximation gives a bound on the error.
| sevensor wrote:
| I've definitely had situations where a streaming quantile
| algorithm would have been useful, do you have any references?
| fzy95 wrote:
| "Further analysis of the remedian algorithm" https://www.scie
| ncedirect.com/science/article/pii/S030439751...
|
| This one has a streaming variant.
| sevensor wrote:
| Thanks!!
| fanf2 wrote:
| There are two kinds:
|
| - quantile sketches, such as t-digest, which aim to control
| the quantile error or rank error. Apache DataSketches has
| several examples, https://datasketches.apache.org/docs/Quanti
| les/QuantilesOver...
|
| - histograms, such as my hg64, or hdr histograms, or
| ddsketch. These control the value error, and are generally
| easier to understand and faster than quantile sketches.
| https://dotat.at/@/2022-10-12-histogram.html
| throwaway_2494 wrote:
| See also the Greenwald Khanna quantile estimator, an online
| algorithm which can compute any quantile within a given
| ε.
|
| https://aakinshin.net/posts/greenwald-khanna-quantile-
| estima...
| ssfrr wrote:
| Here's a simple one I've used before. It's a variation on
| FAME (Fast Algorithm for Median Estimation) [1].
|
| You keep an estimate for the current quantile value, and then
| for each element in your stream, you either increment (if the
| element is greater than your estimate) or decrement (if the
| element is less than your estimate) by fixed "up -step" and
| "down-step" amounts. If your increment and decrement steps
| are equal, you should converge to the median. If you shift
| the ratio of increment and decrement steps you can estimate
| any quantile.
|
| For example, say that your increment step is 0.95 and your
| decrement step is 0.05. When your estimate converges to a
| steady state, then you must be hitting greater values 5% of
| the time and lesser values 95% of the time, hence you've
| estimated the 95th percentile.
|
| The tricky bit is choosing the overall scale of your steps.
| If your steps are very small relative to the scale of your
| values, it will converge very slowly and not track changes in
| the stream. You don't want your steps to be too large because
| they determine the precision. The FAME algorithm has a step
| where you shrink your step size when your data value is near
| your estimate (causing the step size to auto-scale down).
|
| [1]: http://www.eng.tau.ac.il/~shavitt/courses/LargeG/streami
| ng-m...
|
| [2]: https://stats.stackexchange.com/a/445714
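|
| A bare-bones sketch of that update rule with a fixed step size
| (so without the FAME auto-scaling; names and the default step
| are mine):
|
|     def stream_quantile(stream, q, step=0.1):
|         it = iter(stream)
|         est = next(it)                  # start at the first element
|         for x in it:
|             if x > est:
|                 est += step * q         # up-step
|             elif x < est:
|                 est -= step * (1 - q)   # down-step
|         return est
|
| With q=0.5 the two steps are equal, so it drifts toward the
| median; q=0.95 estimates the 95th percentile.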
| ignoramous wrote:
| If an approximation is enough, the p2 quantile estimator ( _O(1)_
| memory) is pretty neat:
| https://news.ycombinator.com/item?id=25201093
| Xcelerate wrote:
| I received a variant of this problem as an interview question a
| few months ago. Except the linear time approach would not have
| worked here, since the list contains trillions of numbers, you
| only have sequential read access, and the list cannot be loaded
| into memory. 30 minutes -- go.
|
| First I asked if anything could be assumed about the statistics
| on the distribution of the numbers. Nope, could be anything,
| except the numbers are 32-bit ints. After fiddling around for a
| bit I finally decided on a scheme that creates a bounding
| interval for the unknown median value (one variable contains the
| upper bound and one contains the lower bound based on 2^32
| possible values) and then adjusts this interval on each
| successive pass through the data. The last step is to average the
| upper and lower bound in case there is an even number of
| integers. Worst case, this approach requires O(log n) passes
| through the data, so even for trillions of numbers it's fairly
| quick.
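|
| A rough sketch of that bounding-interval idea for 32-bit ints,
| bisecting on the value range (this version returns the lower
| median; averaging with the upper median for an even count is
| analogous, and read_stream() is assumed to be a re-iterable
| sequential reader):
|
|     def median_by_bisection(read_stream):
|         lo, hi = -2**31, 2**31 - 1
|         total = None
|         while lo < hi:                   # at most 32 iterations
|             mid = (lo + hi) // 2
|             at_most_mid = 0
|             seen = 0
|             for x in read_stream():      # one sequential pass
|                 seen += 1
|                 if x <= mid:
|                     at_most_mid += 1
|             if total is None:
|                 total = seen
|             if at_most_mid >= (total + 1) // 2:
|                 hi = mid                 # lower median is <= mid
|             else:
|                 lo = mid + 1             # lower median is > mid
|         return lo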
|
| I wrapped up the solution right at the time limit, and my code
| ran fine on the test cases. Was decently proud of myself for
| getting a solution in the allotted time.
|
| Well, the interview feedback arrived, and it turns out my
| solution was rejected for being suboptimal. Apparently there is a
| more efficient approach that utilizes priority heaps. After
| looking up and reading about the priority heap approach, all I
| can say is that I didn't realize the interview task was to re-
| implement someone's PhD thesis in 30 minutes...
|
| I had never used leetcode before because I never had difficulty
| with prior coding interviews (my last job search was many years
| before the 2022 layoffs), but after this interview, I immediately
| signed up for a subscription. And of course the "median file
| integer" question I received is one of the most asked questions
| on the list of "hard" problems.
| anonymoushn wrote:
| Do you mean topK rather than median, for K small? You certainly
| cannot build a heap with trillions of items in it.
| Xcelerate wrote:
| No, I mean median. Here is an article describing a very
| similar problem since I can't link to the leetcode version:
| https://www.geeksforgeeks.org/median-of-stream-of-running-
| in...
| Tarean wrote:
| But that stores all elements into memory?
| raincole wrote:
| > Auxiliary Space : O(n).
|
| > The Space required to store the elements in Heap is O(n).
|
| I don't think this algorithm is suitable for trillions of
| items.
| osti wrote:
| I'm wondering what heap approach can solve that problem, as
| I can't think of any. Hopefully OP got a link to the
| thesis.
|
| The n log n approach definitely works though.
| jiggawatts wrote:
| I've heard of senior people applying for jobs like this simply
| turning the interview question around and demanding that the
| person asking it solve it in the allotted time. A surprisingly
| high percentage of the time they can't.
| Xcelerate wrote:
| This company receives so many candidates that the interviewer
| would have just ended the call and moved on to the next
| candidate.
|
| I get the notion of making the point out of principle, but
| it's sort of like arguing on the phone with someone at a call
| center--it's better to just cut your losses quickly and move
| on to the next option in the current market.
| jagged-chisel wrote:
| And we, as software engineers, should also take that
| advice: it's better to just cut your losses quickly and
| move on to the next option in the current market.
| pfdietz wrote:
| You don't need to load trillions of numbers into memory, you
| just need to count how many of each number there are. This
| requires 2^32 words of memory, not trillions of words. After
| doing that just scan down the array of counts, summing, until
| you find the midpoint.
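|
| A sketch of that counting approach (simple rather than frugal:
| the counter array alone is ~32 GiB with 64-bit counters, and
| the chunked iteration and names are mine):
|
|     import numpy as np
|
|     def median_by_counting(chunks):      # chunks: uint32 numpy arrays
|         counts = np.zeros(2**32, dtype=np.uint64)
|         total = 0
|         for chunk in chunks:
|             np.add.at(counts, chunk, 1)  # histogram update
|             total += len(chunk)
|         target = total // 2              # rank of the upper median
|         running = 0
|         for value in range(2**32):       # scan the counts, summing
|             running += int(counts[value])
|             if running > target:
|                 return value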
| Xcelerate wrote:
| Yeah, I thought of that actually, but the interviewer said
| "very little memory" at one point which gave me the
| impression that perhaps I only had some registers available
| to work with. Was this an algorithm for an embedded system?
|
| The whole problem was kind of miscommunicated, because the
| interviewer showed up 10 minutes late, picked a problem from
| a list, and the requirements for the problem were only
| revealed when I started going a direction the interviewer
| wasn't looking for ("Oh, the file is actually read-only."
| "Oh, each number in the file is an integer, not a float.")
| jagged-chisel wrote:
| That "miscommunication" you mention has been used against
| me in several interviews, because I was expected to ask
| questions (and sometimes a specific question they had in
| mind) before making assumptions. Well, then the 30min
| becomes an exercise in requirements gathering and not
| algorithmic implementation.
| NoToP wrote:
| Which in fairness, is a reasonable competency to test for
| in an interview
| jagged-chisel wrote:
| Indeed. But I need clarity on which skill they want to
| test in thirty minutes.
| klyrs wrote:
| Speaking as an interviewer: nope, you're not being tested
| on a single skill.
| creata wrote:
| With 256 counters, you could use the same approach with
| four passes: pass i bins the numbers by byte i (0 = most
| sig., 3 = least sig.) and then identifies the bin that
| contains the median.
|
| I really want to know what a one-pass, low-memory solution
| looks like, lol.
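|
| A sketch of the four-pass version (read_stream() is assumed to
| be a re-iterable sequential reader of uint32 values; it returns
| the upper median):
|
|     def median_four_passes(read_stream):
|         prefix = 0      # bytes of the median fixed so far
|         rank = None     # rank of the median within the prefix group
|         for byte_i in range(4):          # 0 = most significant byte
|             shift = 8 * (3 - byte_i)
|             counts = [0] * 256
|             total = 0
|             for x in read_stream():      # one full pass
|                 total += 1
|                 if byte_i == 0 or (x >> (shift + 8)) == prefix:
|                     counts[(x >> shift) & 0xFF] += 1
|             if rank is None:
|                 rank = total // 2        # rank of the upper median
|             running = 0
|             for b in range(256):         # bin holding that rank
|                 if running + counts[b] > rank:
|                     prefix = (prefix << 8) | b
|                     rank -= running      # rank inside the chosen bin
|                     break
|                 running += counts[b]
|         return prefix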
| pfdietz wrote:
| Perhaps the interviewer was looking for an argument that
| no such solution can exist? The counterargument would
| look like this: divide the N numbers into two halves,
| each N/2 numbers. Now, suppose you don't have enough
| memory to represent the first N/2 numbers (ignoring
| changes in ordering); in that case, two different bags of
| numbers will have the same representation in memory. One
| can now construct a second half of numbers for which the
| algorithm will get the wrong answer for at least one of
| the two colliding cases.
|
| This is assuming a deterministic algorithm; maybe a
| random algorithm could work with high probability and
| less memory?
| jagged-chisel wrote:
| > ... I didn't realize the interview task was to re-implement
| someone's PhD thesis in 30 minutes...
|
| What a bullshit task. I'm beginning to think this kind of
| interviewing should be banned. Seems to me it's just an easy
| escape hatch for the interviewer/hiring manager when they want
| to discriminate based on prejudice.
| Tarean wrote:
| Love this algorithm. It feels like magic, and then it feels
| obvious and basically like binary search.
|
| Similar to the algorithm to parallelize the merge step of merge
| sort. Split the two sorted sequences into four sequences so that
| `merge(left[0:leftSplit],
| right[0:rightSplit])+merge(left[leftSplit:], right[rightSplit:])`
| is sorted. leftSplit+rightSplit should be half the total length,
| and the elements in the left partition must be <= the elements in
| the right partition.
|
| Seems impossible, and then you think about it and it's just
| binary search.
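|
| A sketch of that split search (binary search for leftSplit so
| that the two left halves hold the k smallest elements overall;
| names are mine):
|
|     def merge_split(left, right):
|         k = (len(left) + len(right)) // 2    # size of left partition
|         lo, hi = max(0, k - len(right)), min(k, len(left))
|         while lo < hi:
|             mid = (lo + hi) // 2             # candidate leftSplit
|             if left[mid] < right[k - mid - 1]:
|                 lo = mid + 1                 # take more from left
|             else:
|                 hi = mid
|         return lo, k - lo                    # (leftSplit, rightSplit)
|
| merge(left[:i], right[:j]) + merge(left[i:], right[j:]) is then
| sorted for (i, j) = merge_split(left, right).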
| runiq wrote:
| Why is it okay to drop not-full chunks? The article doesn't
| explain that and I'm stupid.
|
| Edit: I just realized that the function where non-full chunks are
| dropped is just the one for finding the pivot, not the one for
| finding the median. I understand now.
| RcouF1uZ4gsC wrote:
| > The C++ standard library uses an algorithm called introselect
| which utilizes a combination of heapselect and quickselect and
| has an O(nlogn) bound.
|
| Introselect is a combination of Quickselect and Median of Medians
| and is O(n) worst case.
| ValleZ wrote:
| I was asked to invent this algorithm on a whiteboard in 30
| minutes. Loved it.
| nilslindemann wrote:
| "ns" instead of "l" and "n" instead of "el" would have been my
| choice (seen in Haskell code).
| robinhouston wrote:
| The trouble with using this convention (which I also like) in
| Python code is that sooner or later one wants to name a pair of
| lists 'as' and 'bs', which then causes a syntax error because
| 'as' is a keyword in Python. There is a similar problem with
| 'is' and 'js'.
| nilslindemann wrote:
| Sure, naming is hard, but avoid "l", "I", "O", "o".
|
| Very short variable names (including "ns" and "n") are always
| some kind of disturbance when reading code, especially when
| the variable lasts longer than one screen of code - one has
| to memorize the meaning. They sometimes have a point, e.g. in
| mathematical code like this one. But variables like "l" and
| "O" are bad for a further reason, as they can not easily be
| distinguished from the numbers. See also the Python style
| guide: https://peps.python.org/pep-0008/#names-to-avoid
| danlark wrote:
| Around 4 years ago I compared lots of different median algorithms
| and the article turned out to be much longer than I anticipated
| :)
|
| https://danlark.org/2020/11/11/miniselect-practical-and-gene...
| cfors wrote:
| Just wanted to say thank you for this article - I've read and
| shared this a few times over the years!
| chpatrick wrote:
| Another nice one is O(1) weighted sampling (after O(n)
| preprocessing).
|
| https://en.wikipedia.org/wiki/Alias_method
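|
| A small sketch of the table construction and the O(1) draw
| (Vose's variant; names are mine):
|
|     import random
|
|     def build_alias(weights):
|         n, total = len(weights), sum(weights)
|         prob = [w * n / total for w in weights]   # mean scaled to 1
|         alias = [0] * n
|         small = [i for i, p in enumerate(prob) if p < 1.0]
|         large = [i for i, p in enumerate(prob) if p >= 1.0]
|         while small and large:
|             s, l = small.pop(), large.pop()
|             alias[s] = l                  # s's leftover mass goes to l
|             prob[l] -= 1.0 - prob[s]
|             (small if prob[l] < 1.0 else large).append(l)
|         for i in small + large:           # leftovers are ~1.0
|             prob[i] = 1.0
|         return prob, alias
|
|     def draw(prob, alias):                # O(1) per sample
|         i = random.randrange(len(prob))
|         return i if random.random() < prob[i] else alias[i]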
| Someone wrote:
| FTA:
|
| _"Proof of Average O(n)
|
| On average, the pivot will split the list into 2 approximately
| equal-sized pieces. Therefore, each subsequent recursion operates
| on 1/2 the data of the previous step."_
|
| That "therefore" doesn't follow, so this is more an intuition
| than a proof. The problem with it is that the median is more
| likely to end up in the larger of the two pieces, so you more
| likely have to recurse on the larger part than on the smaller
| part.
|
| What saves you is that _O(n)_ doesn't say anything about
| constants.
|
| Also, I would think you can improve things a bit for real world
| data by, on subsequent iterations, using the average of the set
| as pivot (You can compute that for both pieces on the fly while
| doing the splitting. The average may not be in the set of items,
| but that doesn't matter for this algorithm). Is that true?
| meatmanek wrote:
| If I'm understanding correctly, the median is actually
| guaranteed to be in the larger of the two pieces of the array
| after partitioning. That means on average you'd only discard
| 25% of the array after each partition. Your selected pivot is
| either below the median, above the median, or exactly the
| median. If it's below the median, it could be anywhere in the
| range [p0, p50) for an average of around p25; if it's above the
| median, it could be anywhere in the range (p50, p100] for an
| average of around p75.
|
| Since these remaining fractions combine multiplicatively, we
| actually care about the geometric mean of the remaining
| fraction of the array, which is e^[(integral of ln(x) dx from
| x=0.5 to x=1) / (1 - 0.5)], or about 73.5%.
|
| Regardless, it forms a geometric series, which should converge
| to 1/(1-0.735) or about 3.77.
|
| Regarding using the average as the pivot: the question is
| really what quantile would be equal to the mean for your
| distribution. Heavily skewed distributions would perform pretty
| badly. It would perform particularly badly on 0.01**np.arange(1,
| 100) -- for each partitioning step, the mean would land between
| the first element and the second element.
| mabbo wrote:
| I learned about the median-of-medians quickselect algorithm when
| I was an undergrad and was really impressed by it. I implemented
| it, and it was terribly slow. Its runtime grew linearly, but
| that only really mattered if you had at least a few billion items
| in your list.
|
| I was chatting about this with a grad student friend who casually
| said something like "Sure, it's slow, but what really matters is
| that it proves that it's _possible_ to do selection of an
| unsorted list in O(n) time. At one point, we didn 't know whether
| that was even possible. Now that we do, we know there might an
| even faster linear algorithm." Really got into the philosophy of
| what Computer Science is about in the first place.
|
| The lesson was so simple yet so profound that I nearly applied to
| grad school because of it. I have no idea if they even recall the
| conversation, but it was a pivotal moment of my education.
| zelphirkalt wrote:
| Does the fact that any linear time algorithm exists indicate
| that a faster linear time algorithm exists? Otherwise, what is
| the gain from that bit of knowledge? You could also think: "We
| already know that some <arbitrary O(...)> algorithm exists,
| there might be an even faster <other O(...)> algorithm!" What
| makes the existence of an O(n) algorithm give more indication
| than the existence of an O(n log(n)) algorithm?
| anonymoushn wrote:
| If you had two problems, and a linear time solution was known
| to exist for only one of them, I think it would be reasonable
| to say that it's more likely that a practical linear time
| solution exists for that one than for the other one.
| blt wrote:
| I am not the original commenter, but I (and probably many CS
| students) have had similar moments of clarity. The key part
| for me isn't
|
| > there might be an even faster linear algorithm,
|
| but
|
| > it's possible to do selection of an unsorted list in O(n)
| time. At one point, we didn't know whether that was even
| possible.
|
| For me, the moment of clarity was understanding that
| theoretical CS mainly cares about _problems_, not
| _algorithms_. Algorithms are tools to prove upper bounds on
| the complexity of problems. Lower bounds are equally
| important and cannot be proved by designing algorithms. We
| even see theorems of the form "there exists an O(whatever)
| algorithm for <problem>": the algorithm's existence can
| sometimes be proven non-constructively.
|
| So if the median problem sat for a long time with a linear
| lower bound and superlinear upper bound, we might start to
| wonder if the problem has a superlinear lower bound, and
| spend our effort working on that instead. The existence of a
| linear-time algorithm immediately closes that path. The only
| remaining work is to tighten the constant factor. The
| community's effort can be focused.
|
| A famous example is the linear programming problem. Klee and
| Minty proved an exponential worst case for _the simplex
| algorithm_, but not for _linear programming itself_. Later,
| Khachiyan proved that the ellipsoid algorithm was polynomial-
| time, but it had huge constant factors and was useless in
| practice. However, a few years later, Karmarkar gave an
| efficient polynomial-time algorithm. One can imagine how
| Khachiyan's work, although inefficient, could motivate a
| more intense focus on polynomial-time LP algorithms leading
| to Karmarkar's breakthrough.
| mrguyorama wrote:
| We studied (I believe) this algorithm in my senior year of
| Computer Science. We talked about the theory side of it that
| you mention, but this algorithm was also used to demonstrate
| that "slow linear algorithm" is not faster than "Fast nlogn
| algorithm" in most real life cases.
|
| I think we got a constant factor of 22 for this algorithm so
| maybe it was a related one or something.
| zelphirkalt wrote:
| You can simply pass once over the data, and while you do that,
| count occurrences of the elements, memorizing the last maximum.
| Whenever an element is counted, you check, if that count is now
| higher than the previous maximum. If it is, you memorize the
| element and its count as the maximum, of course. Very simple
| approach and linear in time, with minimal bookkeeping on the
| (only the median element and the count (previous max)).
|
| I don't find it surprising or special at all, that finding the
| median works in linear time, since even this ad-hoc thought of
| way is in linear time.
|
| EDIT: Ah right, I mixed up mode and median. My bad.
| gcr wrote:
| This finds the mode (most common element), not the median.
|
| Wouldn't you also need to keep track of all element counts with
| your approach? You can't keep the count of only the second-
| most-common element because you don't know what that is yet.
| zelphirkalt wrote:
| Yes, you are right. I mixed up mode and median.
|
| And yes, one would need to keep track of at least a key for
| each element (not a huge element, if they are somehow huge).
| But that would be about space complexity.
| gcr wrote:
| pardon! it's fun to think about though!
| hammeiam wrote:
| The "Split the array into subarrays of length 5, now sorting all
| of the arrays is O(n) instead of O(n log n)" feels like cheating
| to me
| tptacek wrote:
| They're not sorting all the arrays?
|
| _Later_
|
| (i was going to delete this comment, but for posterity, i
| misread --- sorting _the lists_, not the contents of the list,
| sure)
| marcosdumay wrote:
| O(n log 5) is O(n). There's no cheating, sorting small arrays
| in a list is a completely different problem from sorting a
| large array.
| pfortuny wrote:
| Actually lots of algorithms "feel" like cheating until you
| understand what you were not looking at (fast matrix
| multiplication, fast fourier transforms...).
| Sharlin wrote:
| It's unambiguously O(n), there's no lg n anywhere to be seen.
| It may be O(n) with a bit larger constant factor, but the whole
| point of big-O analysis is that those don't matter.
| IncreasePosts wrote:
| It would only be cheating if you could merge the arrays in
| O(1), which you can't.
| someplaceguy wrote:
| return l[len(l) / 2]
|
| I'm not a Python expert, but doesn't the `/` operator return a
| float in Python? Why would you use a float as an array index
| instead of doing integer division (with `//`)?
|
| I know this probably won't matter until you have extremely large
| arrays, but this is still quite a code smell.
|
| Perhaps this could be forgiven if you're a Python novice and
| hadn't realized that the two different operators exist, but this
| is not the case here, as the article contains this even more
| baffling code which uses integer division in one branch but float
| division in the other:
|
|     def quickselect_median(l, pivot_fn=random.choice):
|         if len(l) % 2 == 1:
|             return quickselect(l, len(l) // 2, pivot_fn)
|         else:
|             return 0.5 * (quickselect(l, len(l) / 2 - 1, pivot_fn) +
|                           quickselect(l, len(l) / 2, pivot_fn))
|
| That we're 50 comments in and nobody seems to have noticed this
| only serves to reinforce my existing prejudice against the
| average Python code quality.
| runeblaze wrote:
| I do agree that it is a code smell. However given that this is
| an algorithms article I don't think it is exactly that fair to
| judge it based on code quality. I think of it as: instead of
| writing it in pseudocode the author chose a real pseudocode-
| like programming language, and it (presumably) runs well for
| illustrative purposes.
| jononor wrote:
| Well spotted! In Python 2, `/` did floor division on integers,
| but in Python 3 the two operators are distinct. Indexing a list
| with a float
| raises an exception, I believe.
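|
| A quick check in Python 3:
|
|     >>> 7 / 2               # true division: always a float
|     3.5
|     >>> 7 // 2              # floor division: an int
|     3
|     >>> [10, 20, 30][3 // 2]
|     20
|     >>> [10, 20, 30][3 / 2]
|     Traceback (most recent call last):
|       ...
|     TypeError: list indices must be integers or slices, not float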
| xinok wrote:
| > P.S: In 2017 a new paper came out that actually makes the
| median-of-medians approach competitive with other selection
| algorithms. Thanks to the paper's author, Andrei Alexandrescu for
| bringing it to my attention!
|
| He also gave a talk about his algorithm in 2016. He's an
| entertaining presenter, I highly recommend it!
|
| There's Treasure Everywhere - Andrei Alexandrescu
|
| https://www.youtube.com/watch?v=fd1_Miy1Clg
| pjkundert wrote:
| Andrei Alexandrescu is awesome; around 2000 he gave a talk on
| lock-free, wait-free algorithms that I immediately applied to a
| huge C++ industrial control networking project at the time.
|
| I'd recommend that anyone who writes software listen to and read
| anything of Andrei's they can find; this one is indeed a
| Treasure!
| SkiFire13 wrote:
| I wonder what the reason is for picking groups of 5 elements
| instead of 2 or 8.
| lalaland1125 wrote:
| 1. You want an odd number so the median is the middle element
| of the sublist.
|
| 2. One and three are probably too small
| danlark wrote:
| 3 and 4 elements will fail to prove the complexity is linear
|
| You still can do 3 or 4 but with slight modifications
|
| https://arxiv.org/abs/1409.3600
|
| For example, for 4 elements, it's advised to take lower median
| for the first half and upper median for the second half. Then
| the complexity will be linear
| someplaceguy wrote:
| I found this part of the code quite funny:
|
|     # If there are < 5 items, just return the median
|     if len(l) < 5:
|         # In this case, we fall back on the first median function we wrote.
|         # Since we only run this on a list of 5 or fewer items, it doesn't
|         # depend on the length of the input and can be considered constant
|         # time.
|         return nlogn_median(l)
|
| Hell, why not just use 2^140 instead of 5 as the cut-off point,
| then? This way you'd have _constant time_ median finding for all
| arrays that can be represented in any real-world computer! :) [1]
|
| [1] According to https://hbfs.wordpress.com/2009/02/10/to-boil-
| the-oceans/
| SkiFire13 wrote:
| Ultimately when an algorithm has worse complexity than another
| it might still be faster up to a certain point. In this case 5
| is likely under that point, though I doubt 2^256 will.
|
| In practice you might also want to use a O(n^2) algorithm like
| insertion sort under some threshold.
| someplaceguy wrote:
| > Ultimately when an algorithm has worse complexity than
| another it might still be faster up to a certain point.
|
| Sure, but the author didn't argue that the simpler algorithm
| would be faster for 5 items, which would indeed make sense.
|
| Instead, the author argued that it's OK to use the simpler
| algorithm for less than 5 items because 5 is a constant and
| therefore the simpler algorithm runs in constant time, hence
| my point that you could use the same argument to say that
| 2^140 (or 2^256) could just as well be used as the cut-off
| point and similarly argue that the simpler algorithm runs in
| _constant time_ for all arrays than can be represented on a
| real-world computer, therefore obviating the need for the
| more complex algorithm (which obviously makes no sense).
| thrw2486776 wrote:
| If you set n=2^140, then sure, it's constant. If instead
| you only have n<=2^140, then n varies across a large set
| and is practically indistinguishable from n<=infinity
| (since we get into the territory of the number of atoms in
| the universe), therefore you can perform limit calculations
| on it, in particular big O stuff.
|
| In the article n was set to 5. All of those arrays (except
| maybe 1) have exactly 5 elements. There is no variance (and
| even if there was, it would be tiny, there is no point in
| talking about limits of 5-element sequences).
| someplaceguy wrote:
| > In the article n was set to 5. All of those arrays
| (except maybe 1) have exactly 5 elements. There is no
| variance
|
| No, the code was:
|
|     # If there are < 5 items, just return the median
|     if len(l) < 5:
|         return nlogn_median(l)
|
| > and even if there was, it would be tiny, there is no
| point in talking about limits of 5-element sequences
|
| So your point is: not all constants are created equal.
| Which circles all the way back to my original point that
| this argument is pretty funny :)
| raincole wrote:
| Big-O notation and "real-world computer" don't belong to the
| same sentence. The whole point of big-O notation is to abstract
| the algorithm out of real-world limitations so we can talk
| about arbitrarily large input.
|
| Any halting program that runs on a real world computer is O(1),
| by definition.
| someplaceguy wrote:
| > The whole point of big-O notation is to abstract the
| algorithm out of real-world limitations so we can talk about
| arbitrarily large input.
|
| Except that there is no such thing as "arbitrarily large
| storage", as my link in the parent comment explained:
| https://hbfs.wordpress.com/2009/02/10/to-boil-the-oceans/
|
| So why would you want to talk about arbitrarily large input
| (where the input is an array that is stored in memory)?
|
| As I understood, this big-O notation is intended to have some
| real-world usefulness, is it not? Care to elaborate what that
| usefulness is, exactly? Or is it just a purely fictional
| notion in the realm of ideas with no real-world application?
|
| And if so, why bother studying it at all, except as a
| mathematical curiosity written in some mathematical pseudo-
| code rather than a programming or engineering challenge
| written in a real-world programming language?
|
| Edit: s/pretending/intended/
| raincole wrote:
| > (where the input is an array that is stored in memory)?
|
| If the input is an array that is stored in a piece of real-
| world memory, then the only possible complexity is O(1).
| It's just how big-O works. Big-O notation is an abstraction
| that is much much closer to mathematics than to
| engineering.
|
| > this big-O notation is pretending to have some real-world
| usefulness...
|
| Big-O notation is not a person so I'm not sure whether it
| can pretend something. CS professors might exaggerate big-O
| notation's real-world usefulness so their students don't
| fall asleep too fast.
|
| > fictional
|
| Theoretical. Just like the other theoretical ideas, _at
| best_ big-O notation gives some vague insights that help
| people solve real problems. It gives a _very_ rough feeling
| about whether an algorithm is fast or not.
|
| By the way, Turing machine is in this category as well.
| fenomas wrote:
| Big-O analysis is about _scaling_ behavior - its real-world
| implications lie in what it tells you about _relative_
| sizes, not absolute sizes.
|
| E.g., if you need to run a task on 10M inputs, then knowing
| that your algorithm is O(N) doesn't tell you anything at
| all about how long your task will take. It also doesn't
| tell you whether that algorithm will be faster than some
| other algorithm that's O(N^2).
|
| But it _does_ tell you that if your task size doubles to
| 20M inputs, you can expect the time required for the first
| algorithm to double, and the second to quadruple. And that
| knowledge isn't arcane or theoretical, it works on real-
| world hardware and is really useful for modeling how your
| code will run as inputs scale up.
| beyondCritics wrote:
| > It's not straightforward to prove why this is O(n).
|
| Replace T(n/5) with T(floor(n/5)) and T(7n/10) with
| T(floor(7n/10)) and show by induction that T(n) <= 10n for all n.
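|
| Sketch of the step: if the non-recursive work is at most n and
| T(m) <= 10m for all m < n, then T(n) <= n + 10*floor(n/5) +
| 10*floor(7n/10) <= n + 2n + 7n = 10n.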
| sfpotter wrote:
| A nice way to approximate the median:
| https://www.stat.berkeley.edu/~ryantibs/papers/median.pdf
| jagged-chisel wrote:
| It's quicksort with a modification to select the median during
| the process. I feel like this is a good way to approach lots of
| "find $THING in list" questions.
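|
| For reference, a sketch of quickselect in roughly the shape TFA
| uses (k is the 0-indexed rank; the median is k = len(l) // 2):
|
|     import random
|
|     def quickselect(l, k, pivot_fn=random.choice):
|         if len(l) == 1:
|             return l[0]
|         pivot = pivot_fn(l)
|         lows = [x for x in l if x < pivot]
|         highs = [x for x in l if x > pivot]
|         pivots = [x for x in l if x == pivot]
|         if k < len(lows):
|             return quickselect(lows, k, pivot_fn)
|         if k < len(lows) + len(pivots):
|             return pivot                 # the pivot is the answer
|         return quickselect(highs, k - len(lows) - len(pivots), pivot_fn)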
| throwaway294531 wrote:
| If you're selecting the n:th element, where n is very small (or
| large), using median-of-medians may not be the best choice.
|
| Instead, you can use a biased pivot as in [1] or something I call
| "j:th of k:th". Floyd-Rivest can also speed things up. I have a
| hobby project that gets 1.2-2.0x throughput when compared to a
| well implemented quickselect, see:
| https://github.com/koskinev/turboselect
|
| If anyone has pointers to fast generic & in-place selection
| algorithms, I'm interested.
|
| [1] https://doi.org/10.4230/LIPIcs.SEA.2017.24
| ncruces wrote:
| An implementation in Go, that's (hopefully) simple enough to be
| understandable, yet minimally practical:
|
| https://github.com/ncruces/sort/blob/main/quick/quick.go
| kccqzy wrote:
| > Quickselect gets us linear performance, but only in the average
| case. What if we aren't happy to be average, but instead want to
| guarantee that our algorithm is linear time, no matter what?
|
| I don't agree with the need for this guarantee. Note that the
| article already says the selection of the pivot is by random. You
| can simply choose a very good random function to avoid an
| attacker crafting an input that needs quadratic time. You'll
| never be unlucky enough for this to be a problem. This is
| basically the same kind of mindset that leads people into
| thinking, what if I use SHA256 to hash these two different
| strings to get the same hash?
| mitthrowaway2 wrote:
| It's a very important guarantee for use in real-time signal
| processing applications.
| forrestthewoods wrote:
| > I don't agree with the need for this guarantee.
|
| You don't get to agree with it or not. It depends on the
| project! Clearly there exist some projects in the world where
| it's important.
|
| But honestly it doesn't matter. Because, as the article shows
| with random data, median-of-medians is strictly better than a
| random pivot. So even if you don't need the requirement there
| is zero loss in achieving it.
| kccqzy wrote:
| The median-of-medians pivot comes at a cost in execution time.
| Chances are, sorting each five-element chunk is a lot slower
| than even running a sophisticated random number generator.
| rented_mule wrote:
| 10-15 years ago, I found myself needing to regularly find the
| median of many billions of values, each parsed out of a multi-
| kilobyte log entry. MapReduce was what we were using for
| processing large amounts of data at the time. With MapReduce over
| that much data, you don't just want linear time, but ideally
| single pass, distributed across machines. Subsequent passes over
| much smaller amounts of data are fine.
|
| It was a struggle until I figured out that knowledge of the
| precision and range of our data helped. These were timings,
| expressed in integer milliseconds. So they were non-negative, and
| I knew the 90th percentile was well under a second.
|
| As the article mentions, finding a median typically involves
| something akin to sorting. With the above knowledge, bucket sort
| becomes available, with a slight tweak in my case. Even if the
| samples were floating point, the same approach could be used as
| long as an integer (or even fixed point) approximation that is
| very close to the true median is good enough, again assuming a
| known, relatively small range.
|
| The idea is to build a dictionary where the keys are the timings
| in integer milliseconds and the values are a count of the keys'
| appearance in the data, i.e., a histogram of timings. The maximum
| timing isn't known, so to ensure the size of the dictionary
| doesn't get out of control, use the knowledge that the 90th
| percentile is well under a second and count everything over, say,
| 999ms in the 999ms bin. Then the dictionary will be limited to
| 2000 integers (keys in the range 0-999 and corresponding values)
| - this is the part that is different from an ordinary bucket
| sort. All of that is trivial to do in a single pass, even when
| distributed with MapReduce. Then it's easy to get the median from
| that dictionary / histogram.
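|
| A small sketch of that capped-histogram trick (Counter stands in
| for the per-machine dictionaries that get merged):
|
|     from collections import Counter
|
|     def timing_histogram(timings_ms, cap=999):
|         hist = Counter()
|         for t in timings_ms:             # one pass, no sorting
|             hist[min(t, cap)] += 1       # everything over cap -> cap bin
|         return hist                      # Counters sum across machines
|
|     def median_from_histogram(hist):
|         total = sum(hist.values())
|         target = total // 2              # rank of the upper median
|         running = 0
|         for value in sorted(hist):       # at most 1000 keys
|             running += hist[value]
|             if running > target:
|                 return value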
| justinpombrio wrote:
| Did you actually need to find the true median of billions of
| values? Or would finding a value between 49.9% and 50.1%
| suffice? Because the latter is much easier: sample 10,000
| elements uniformly at random and take their median.
|
| (I made the number 10,000 up, but you could do some statistics
| to figure out how many samples would be needed for a given
| level of confidence, and I don't think it would be
| prohibitively large.)
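|
| A sketch of the sampling estimate (the sample size is a made-up
| default; the statistics to size it for a given confidence are
| standard):
|
|     import random, statistics
|
|     def approx_median(data, k=10_000):
|         return statistics.median(random.sample(data, min(k, len(data))))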
| andruby wrote:
| I was thinking the same thing.
|
| In all use-cases I've seen a close estimate of the median was
| enough.
| rented_mule wrote:
| The kind of margin you indicate would have been plenty for
| our use cases. But, we were already processing all these log
| entries for multiple other purposes in a single pass (not one
| pass per thing computed). With this single pass approach, the
| median calculation could happen with the same single-pass
| parsing of the logs (they were JSON and that parsing was most
| of our cost), roughly for free.
|
| Uniform sampling also wasn't obviously simple, at least to
| me. There were thousands of log files involved, coming from
| hundreds of computers. Any single log file only had timings
| from a single computer. What kind of bias would be introduced
| by different approaches to distributing those log files to a
| cluster for the median calculation? Once the solution
| outlined in the previous comment was identified, that seemed
| simpler that trying to understand if we were talking about
| 49-51% or 40-50%. And if it was too big a margin,
| restructuring our infra to allow different log file
| distribution algorithms would have been far more complicated.
| enriquto wrote:
| > the latter is much easier: sample 10,000 elements uniformly
| at random and take their median
|
| Do you have a source for that claim?
|
| I don't see how that could possibly be true... For example,
| if your original points are sampled from two gaussians of
| centers -100 and 100, of small but slightly different
| variance, then the true median can be anywhere between the
| two centers, and you may need a humungous number of samples
| to get anywhere close to it.
|
| True, in that case any point between say -90 and 90 would be
| equally good as a median in most applications. But this does
| not mean that the median can be found accurately by your
| method.
| maronato wrote:
| the key word is "uniformly". If your data is not uniformly
| distributed, then you just have to pick random values non-
| uniformly. There are many ways to do that, and once you
| have your algo you'll be able to reliably find an
| approximation of the median much faster than you would find
| the actual median.
|
| https://en.m.wikipedia.org/wiki/Non-
| uniform_random_variate_g...
| ant6n wrote:
| I'm not sure why you use a dictionary with keys 0...999,
| instead of an array indexed 0...999.
| TacticalCoder wrote:
| I really enjoyed TFA but this:
|
| > Technically, you could get extremely unlucky: at each step, you
| could pick the largest element as your pivot. Each step would
| only remove one element from the list and you'd actually have
| O(n^2) performance instead of O(n)
|
| If adversarial input is a concern, doing an O(n) shuffle of the
| data first guarantees this cannot happen. If the data is really
| too big to shuffle, then only shuffle once a bucket is small
| enough to be shuffled.
|
| If you do shuffle, probabilities are here to guarantee that that
| worst case cannot happen. If anyone says that "technically" it
| can happen, I'll answer that "technically" an attacker could
| also correctly guess every bit of your 256-bit private key.
|
| Our world is built on probabilities: all our private keys are
| protected by the mathematical improbability that someone shall
| guess them correctly.
|
| From what I read, a shuffle followed by quickselect is O(n) for
| all practical purposes.
| bo1024 wrote:
| You're already using your own randomness to pick the pivot at
| random, so I don't see why the shuffle helps more. But yes, if
| your randomness is trustworthy, the probability of more than
| O(n) runtime is very low.
___________________________________________________________________
(page generated 2024-07-25 23:03 UTC)