[HN Gopher] My Favorite Algorithm: Linear Time Median Finding (2...
       ___________________________________________________________________
        
       My Favorite Algorithm: Linear Time Median Finding (2018)
        
       Author : skanderbm
       Score  : 209 points
       Date   : 2024-07-25 09:16 UTC (13 hours ago)
        
 (HTM) web link (rcoh.me)
 (TXT) w3m dump (rcoh.me)
        
       | furstenheim wrote:
        | Floyd-Rivest also does the job. A bit more efficient, IIRC.
       | 
       | However I never managed to understand how it works.
       | 
       | https://en.m.wikipedia.org/wiki/Floyd%E2%80%93Rivest_algorit...
        
       | anonymoushn wrote:
       | One commonly sees the implication that radix sort cannot be used
       | for data types other than integers, or for composite data types,
       | or for large data types. For example, TFA says you could use
       | radix sort if your input is 32-bit integers. But you can use it
       | on anything. You can use radix sort to sort strings in O(n) time.
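        | 
        | A minimal sketch of the idea for byte strings (illustrative only,
        | not from any of the linked repos; real implementations add a
        | small-bucket cutoff to insertion sort and avoid allocating 256
        | buckets per call):
        | 
        |     def msd_radix_sort(strings, depth=0):
        |         # Bucket by the byte at position `depth`; strings that
        |         # have already ended sort before everything else.
        |         if len(strings) <= 1:
        |             return strings
        |         done, buckets = [], [[] for _ in range(256)]
        |         for s in strings:
        |             if len(s) <= depth:
        |                 done.append(s)
        |             else:
        |                 buckets[s[depth]].append(s)
        |         out = done
        |         for b in buckets:
        |             out.extend(msd_radix_sort(b, depth + 1))
        |         return out
        | 
        |     # msd_radix_sort([b"banana", b"band", b"apple"])
        |     # -> [b'apple', b'banana', b'band']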
        
         | contravariant wrote:
         | It should also be noted that radix sort is ridiculously fast
         | because it just scans linearly through the list each time.
         | 
         | It's actually hard to come up with something that cannot be
         | sorted lexicographically. The best example I was able to find
         | was big fractions. Though even then you could write them as
         | continued fractions and sort those lexicographically (would be
         | a bit trickier than strings).
        
           | anonymoushn wrote:
           | Sorting fractions by numerical value is a good example.
           | Previously I've heard that there are some standard collation
           | schemes for some human languages that resist radix sort, but
           | when I asked about which ones in specific I didn't hear back
           | :(
        
             | contravariant wrote:
             | The Unicode Collation algorithm doesn't look fun to
             | implement in radix sort, but not _entirely_ impossible
             | either. They do note that some characters are contextual,
             | an example they give is that CH can be treated as a single
             | character that sorts _after_ C (so also after CZ).
             | Technically that is still lexicographical, but not byte-
             | for-byte.
        
         | exDM69 wrote:
          | The problem with radix sorting strings is that it is O(k*N),
          | where k is the length of the key; in this case that's the
          | second-longest string's length. Additional problems arise if
          | you are dealing with null-terminated strings and do not have
          | the length stored.
         | 
         | Radix sort is awesome if k is small, N is huge and/or you are
         | using a GPU. On a CPU, comparison based sorting is faster in
         | most cases.
        
           | anonymoushn wrote:
           | No, it's O(N+M) where N is the number of strings and M is the
           | sum of the lengths of the strings. Maybe your radix sort has
           | some problems?
           | 
           | I evaluated various sorts for strings as part of my winning
           | submission to
           | https://easyperf.net/blog/2022/05/28/Performance-analysis-
           | an... and found https://github.com/bingmann/parallel-string-
           | sorting to be helpful. For a single core, the fastest
           | implementation among those in parallel-string-sorting was a
           | radix sort, so my submission included a radix sort based on
           | that one.
           | 
           | The only other contender was multi-key quicksort, which is
           | notably not a comparison sort (i.e. a general-purpose string
           | comparison function is not used as a subroutine of multi-key
           | quicksort). In either case, you end up operating on something
           | like an array of structs containing a pointer to the string,
           | an integer offset into the string, and a few cached bytes
           | from the string, and in either case I don't really know what
           | problems you expect to have if you're dealing with null-
           | terminated strings.
           | 
            | A very similar radix sort is included in
            | https://github.com/alichraghi/zort which includes some
            | benchmarks, but I haven't done the work to make it work on
           | strings or arbitrary structs.
        
             | Aardwolf wrote:
             | > No, it's O(N+M) where N is the number of strings and M is
             | the sum of the lengths of the strings.
             | 
             | That would mean it's possible to sort N random 64-bit
             | integers in O(N+M) which is just O(N) with a constant
             | factor of 9 (if taking the length in bytes) or 65 (if
             | taking the length in bits), so sort billions of random
             | integers in linear time, is that truly right?
             | 
              | EDIT: I think it does make sense: M is length*N, and in
              | scenarios where this matters the length will be on the
              | order of log(N) anyway, so it's still N log N-ish.
        
               | anonymoushn wrote:
               | > That would mean it's possible to sort N random 64-bit
               | integers in O(N+M) which is just O(N) with a constant
               | factor of 9 (if taking the length in bytes) or 65 (if
               | taking the length in bits), so sort billions of random
               | integers in linear time, is that truly right?
               | 
               | Yes. You can sort just about anything in linear time.
               | 
               | > EDIT: I think it does make sense, M is length*N, and in
               | scenarios where this matters this length will be in the
               | order of log(N) anyway so it's still NlogN-sh.
               | 
               | I mainly wrote N+M rather than M to be pedantically
               | correct for degenerate inputs that consist of mostly
               | empty strings. Regardless, if you consider the size of
               | the whole input, it's linear in that.
        
               | exDM69 wrote:
               | Yes, radix sort can sort integers in linear O(N) time.
        
             | dataflow wrote:
             | If you have a million 1-character strings and one string of
             | length 1 million, how many steps would your LSD radix sort
             | take? And (if it's indeed linear in the total input size
             | like you say) how do you make it jump over the empty slots
             | without losing real-world efficiency in other cases?
        
               | anonymoushn wrote:
               | It's MSB radix sort. I think LSB radix sort is not
               | generally as useful because even for fixed-size inputs it
               | often makes more passes over most of the input than MSB
               | radix sort.
               | 
               | Your comment makes me think it would be swell to add a
               | fast path for when the input range compares equal for the
               | current byte or bytes though. In general radix sorts have
               | some computation to spare (on commonly used CPUs, more
               | computation may be performed per element during the
               | histogramming pass without spending any additional time).
               | Some cutting-edge radix sorts spend this spare
               | computation to look for sorted subsequences, etc.
        
               | dataflow wrote:
               | Oh I see. I've found LSD better in some cases, but maybe
               | it's just my implementation. For MSD, do you get a
               | performance hit from putting your stack on the heap (to
               | avoid overflow on recursion)? I don't just mean the minor
               | register vs. memory differences in the codegen, but also
               | the cache misses & memory pressure due to potentially
               | long input elements (and thus correspondingly large stack
               | to manage).
               | 
               | > Some cutting-edge radix sorts
               | 
               | Where do you find these? :-)
        
               | anonymoushn wrote:
               | > For MSD, do you get a performance hit from putting your
               | stack on the heap (to avoid overflow on recursion)?
               | 
               | I'm not sure. My original C++ implementation which is for
               | non-adversarial inputs puts the array containing
               | histograms and indices on the heap but uses the stack for
               | a handful of locals, so it would explode if you passed
               | the wrong thing. The sort it's based on in the parallel-
               | string-sorting repo works the same way. Oops!
               | 
               | > cache misses & memory pressure due to potentially long
               | input elements (and thus correspondingly large stack)
               | 
               | I think this should be okay? The best of these sorts try
               | to visit the actual strings infrequently. You'd achieve
               | worst-case cache performance by passing a pretty large
               | number of strings with a long, equal prefix. Ideally
               | these strings would share the bottom ~16 bits of their
               | address, so that they could evict each other from cache
               | when you access them. See https://danluu.com/3c-conflict/
               | 
               | > Where do you find these? :-)
               | 
               | There's a list of cool sorts here
               | https://github.com/scandum/quadsort?tab=readme-ov-
               | file#varia... and they are often submitted to HN.
        
               | dataflow wrote:
               | Yeah, what I hate about MSD is the stack explosion.
               | 
               | Otherwise - cool, thanks!
        
       | saagarjha wrote:
       | This is hinted at in the post but if you're using C++ you will
       | typically have access to quickselect via std::nth_element. I've
       | replaced many a sort with that in code review :) (Well, not many.
       | But at least a handful.)
        
         | conradludgate wrote:
          | Same with Rust: there's the `select_nth_unstable` family on
          | slices that will do this for you. It uses a fancier pivot-
          | choosing algorithm but will fall back to median-of-medians if
          | it detects it's taking too long.
        
       | mgaunard wrote:
       | You could also use one of the streaming algorithms which allow
       | you to compute approximations for arbitrary quantiles without
       | ever needing to store the whole data in memory.
        
         | cosmic_quanta wrote:
         | That is cool if you can tolerate approximations. But the
         | uncomfortable questions soon arise: Can I tolerate an
          | approximate calculation? What assumptions about my data do I
          | need to make to determine an error bound? How do I verify the
          | validity of my assumptions on an ongoing basis?
         | 
         | Personally I would gravitate towards the quickselect algorithm
         | described in the OP until I was forced to consider a streaming
         | median approximation method.
        
           | conradludgate wrote:
            | Well, I believe you could use the streaming algorithms to
            | pick the likely median, to help choose the pivot for the real
            | quickselect. Quickselect can be done in place too, which is
            | O(1) extra memory if you can afford to rearrange the data.
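            | 
            | A minimal in-place quickselect sketch (Lomuto partition; the
            | pivot here is random, but a streaming estimate could be
            | swapped in at the marked line):
            | 
            |     import random
            | 
            |     def quickselect_inplace(a, k, lo=0, hi=None):
            |         # Returns the k-th smallest element (0-indexed),
            |         # rearranging `a` in place: average O(n) time, O(1)
            |         # extra memory.
            |         hi = len(a) - 1 if hi is None else hi
            |         while True:
            |             if lo == hi:
            |                 return a[lo]
            |             p = random.randint(lo, hi)  # <- pivot choice
            |             a[p], a[hi] = a[hi], a[p]
            |             store = lo
            |             for i in range(lo, hi):     # Lomuto partition
            |                 if a[i] < a[hi]:
            |                     a[i], a[store] = a[store], a[i]
            |                     store += 1
            |             a[store], a[hi] = a[hi], a[store]
            |             if k == store:
            |                 return a[store]
            |             elif k < store:
            |                 hi = store - 1
            |             else:
            |                 lo = store + 1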
        
             | SkiFire13 wrote:
              | Then you don't get a guaranteed O(n) complexity if the
              | approximation algorithm happens to make bad choices.
        
             | mgaunard wrote:
              | Say you want the median value since the beginning of the
              | day for a time series that has 1000 entries per second:
              | that's potentially hundreds of gigabytes in RAM, or just a
              | few variables if you use a P-square estimator.
        
           | skrtskrt wrote:
           | A use case for something like this, and not just for medians,
           | is where you have a querying/investigation UI like Grafana or
           | Datadog or whatever.
           | 
           | You write the query and the UI knows you're querying metric
           | xyz_inflight_requests, it runs a preflight check to get the
           | cardinality of that metric, and gives you a prompt:
           | "xyz_inflight_requests is a high-cardinality metric, this
           | query may take some time - consider using estimated_median
           | instead of median".
        
           | mgaunard wrote:
           | any approximation gives a bound on the error.
        
         | sevensor wrote:
         | I've definitely had situations where a streaming quantile
         | algorithm would have been useful, do you have any references?
        
           | fzy95 wrote:
           | "Further analysis of the remedian algorithm" https://www.scie
           | ncedirect.com/science/article/pii/S030439751...
           | 
           | This one has a streaming variant.
        
             | sevensor wrote:
             | Thanks!!
        
           | fanf2 wrote:
           | There are two kinds:
           | 
           | - quantile sketches, such as t-digest, which aim to control
           | the quantile error or rank error. Apache DataSketches has
           | several examples, https://datasketches.apache.org/docs/Quanti
           | les/QuantilesOver...
           | 
           | - histograms, such as my hg64, or hdr histograms, or
           | ddsketch. These control the value error, and are generally
           | easier to understand and faster than quantile sketches.
           | https://dotat.at/@/2022-10-12-histogram.html
        
             | throwaway_2494 wrote:
              | See also the Greenwald-Khanna quantile estimator, an online
              | algorithm which can compute any quantile within a given
              | error bound ε.
             | 
             | https://aakinshin.net/posts/greenwald-khanna-quantile-
             | estima...
        
           | ssfrr wrote:
           | Here's a simple one I've used before. It's a variation on
           | FAME (Fast Algorithm for Median Estimation) [1].
           | 
           | You keep an estimate for the current quantile value, and then
           | for each element in your stream, you either increment (if the
           | element is greater than your estimate) or decrement (if the
            | element is less than your estimate) by fixed "up-step" and
           | "down-step" amounts. If your increment and decrement steps
           | are equal, you should converge to the median. If you shift
           | the ratio of increment and decrement steps you can estimate
           | any quantile.
           | 
            | For example, say that your increment step is 0.95 and your
            | decrement step is 0.05. When your estimate converges to a
            | steady state, you must be hitting greater values 5% of the
            | time and lesser values 95% of the time, hence you've
            | estimated the 95th percentile.
           | 
           | The tricky bit is choosing the overall scale of your steps.
           | If your steps are very small relative to the scale of your
           | values, it will converge very slowly and not track changes in
           | the stream. You don't want your steps to be too large because
           | they determine the precision. The FAME algorithm has a step
           | where you shrink your step size when your data value is near
           | your estimate (causing the step size to auto-scale down).
           | 
           | [1]: http://www.eng.tau.ac.il/~shavitt/courses/LargeG/streami
           | ng-m...
           | 
           | [2]: https://stats.stackexchange.com/a/445714
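            | 
            | A rough sketch of that scheme (fixed step size, no FAME-style
            | auto-scaling; the 0.01 step is an arbitrary illustration):
            | 
            |     def streaming_quantile(stream, q=0.5, step=0.01):
            |         # Increment when a sample is above the estimate,
            |         # decrement when below; equal steps (q = 0.5)
            |         # converge to the median, other q to other quantiles.
            |         est = None
            |         for x in stream:
            |             if est is None:
            |                 est = x          # seed with the first sample
            |             elif x > est:
            |                 est += step * q
            |             elif x < est:
            |                 est -= step * (1 - q)
            |         return est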
        
       | ignoramous wrote:
       | If an approximation is enough, the p2 quantile estimator ( _O(1)_
       | memory) is pretty neat:
       | https://news.ycombinator.com/item?id=25201093
        
       | Xcelerate wrote:
       | I received a variant of this problem as an interview question a
       | few months ago. Except the linear time approach would not have
       | worked here, since the list contains trillions of numbers, you
       | only have sequential read access, and the list cannot be loaded
       | into memory. 30 minutes -- go.
       | 
       | First I asked if anything could be assumed about the statistics
       | on the distribution of the numbers. Nope, could be anything,
       | except the numbers are 32-bit ints. After fiddling around for a
       | bit I finally decided on a scheme that creates a bounding
       | interval for the unknown median value (one variable contains the
       | upper bound and one contains the lower bound based on 2^32
       | possible values) and then adjusts this interval on each
        | successive pass through the data. The last step is to average the
        | upper and lower bound in case there is an even number of
        | integers. Worst case, this approach requires O(log n) passes
       | through the data, so even for trillions of numbers it's fairly
       | quick.
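        | 
        | Roughly, that bounding-interval scheme looks like the following
        | sketch (`read_pass()` is a hypothetical callable that re-reads
        | the file sequentially; it finds the k-th smallest value, and the
        | upper/lower-bound averaging for an even count sits on top):
        | 
        |     def kth_smallest_sequential(read_pass, k):
        |         # Binary search on the value range of 32-bit ints: each
        |         # pass counts how many values are <= the midpoint and
        |         # narrows [lo, hi]. At most 32 passes over the data.
        |         lo, hi = -2**31, 2**31 - 1
        |         while lo < hi:
        |             mid = (lo + hi) // 2
        |             if sum(1 for x in read_pass() if x <= mid) >= k + 1:
        |                 hi = mid
        |             else:
        |                 lo = mid + 1
        |         return lo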
       | 
       | I wrapped up the solution right at the time limit, and my code
       | ran fine on the test cases. Was decently proud of myself for
       | getting a solution in the allotted time.
       | 
       | Well, the interview feedback arrived, and it turns out my
       | solution was rejected for being suboptimal. Apparently there is a
       | more efficient approach that utilizes priority heaps. After
       | looking up and reading about the priority heap approach, all I
       | can say is that I didn't realize the interview task was to re-
       | implement someone's PhD thesis in 30 minutes...
       | 
       | I had never used leetcode before because I never had difficulty
       | with prior coding interviews (my last job search was many years
       | before the 2022 layoffs), but after this interview, I immediately
       | signed up for a subscription. And of course the "median file
       | integer" question I received is one of the most asked questions
       | on the list of "hard" problems.
        
         | anonymoushn wrote:
         | Do you mean topK rather than median, for K small? You certainly
         | cannot build a heap with trillions of items in it.
        
           | Xcelerate wrote:
           | No, I mean median. Here is an article describing a very
           | similar problem since I can't link to the leetcode version:
           | https://www.geeksforgeeks.org/median-of-stream-of-running-
           | in...
        
             | Tarean wrote:
             | But that stores all elements into memory?
        
             | raincole wrote:
             | > Auxiliary Space : O(n).
             | 
             | > The Space required to store the elements in Heap is O(n).
             | 
             | I don't think this algorithm is suitable for trillions of
             | items.
        
             | osti wrote:
             | I'm wondering what heap approach can solve that problem, as
             | I can't think of any. Hopefully OP got a link to the
             | thesis.
             | 
             | The n log n approach definitely works though.
        
         | jiggawatts wrote:
         | I've heard of senior people applying for jobs like this simply
         | turning the interview question around and demanding that the
         | person asking it solve it in the allotted time. A surprisingly
         | high percentage of the time they can't.
        
           | Xcelerate wrote:
           | This company receives so many candidates that the interviewer
           | would have just ended the call and moved on to the next
           | candidate.
           | 
           | I get the notion of making the point out of principle, but
           | it's sort of like arguing on the phone with someone at a call
           | center--it's better to just cut your losses quickly and move
           | on to the next option in the current market.
        
             | jagged-chisel wrote:
             | And we, as software engineers, should also take that
             | advice: it's better to just cut your losses quickly and
             | move on to the next option in the current market.
        
         | pfdietz wrote:
         | You don't need to load trillions of numbers into memory, you
         | just need to count how many of each number there are. This
         | requires 2^32 words of memory, not trillions of words. After
         | doing that just scan down the array of counts, summing, until
         | you find the midpoint.
        
           | Xcelerate wrote:
           | Yeah, I thought of that actually, but the interviewer said
           | "very little memory" at one point which gave me the
           | impression that perhaps I only had some registers available
           | to work with. Was this an algorithm for an embedded system?
           | 
           | The whole problem was kind of miscommunicated, because the
           | interviewer showed up 10 minutes late, picked a problem from
           | a list, and the requirements for the problem were only
           | revealed when I started going a direction the interviewer
           | wasn't looking for ("Oh, the file is actually read-only."
           | "Oh, each number in the file is an integer, not a float.")
        
             | jagged-chisel wrote:
             | That "miscommunication" you mention has been used against
             | me in several interviews, because I was expected to ask
             | questions (and sometimes a specific question they had in
             | mind) before making assumptions. Well, then the 30min
             | becomes an exercise in requirements gathering and not
             | algorithmic implementation.
        
               | NoToP wrote:
               | Which in fairness, is a reasonable competency to test for
               | in an interview
        
               | jagged-chisel wrote:
               | Indeed. But I need clarity on which skill they want to
               | test in thirty minutes.
        
               | klyrs wrote:
               | Speaking as an interviewer: nope, you're not being tested
               | on a single skill.
        
             | creata wrote:
             | With 256 counters, you could use the same approach with
             | four passes: pass i bins the numbers by byte i (0 = most
             | sig., 3 = least sig.) and then identifies the bin that
             | contains the median.
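              | 
              | A sketch of that (unsigned 32-bit values for simplicity;
              | `read_pass()` is a hypothetical sequential pass over the
              | data):
              | 
              |     def kth_smallest_four_passes(read_pass, k):
              |         # Each pass histograms one byte, restricted to
              |         # values whose higher bytes match the prefix fixed
              |         # so far, then descends into the bucket holding the
              |         # k-th smallest element.
              |         prefix, prefix_bits = 0, 0
              |         for byte_idx in range(4):   # 0 = most significant
              |             shift = 24 - 8 * byte_idx
              |             counts = [0] * 256
              |             for x in read_pass():
              |                 high = x >> (32 - prefix_bits) if prefix_bits else 0
              |                 if high == prefix:
              |                     counts[(x >> shift) & 0xFF] += 1
              |             for b in range(256):
              |                 if k < counts[b]:
              |                     prefix = (prefix << 8) | b
              |                     prefix_bits += 8
              |                     break
              |                 k -= counts[b]
              |         return prefix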
             | 
             | I really want to know what a one-pass, low-memory solution
             | looks like, lol.
        
               | pfdietz wrote:
               | Perhaps the interviewer was looking for an argument that
               | no such solution can exist? The counterargument would
               | look like this: divide the N numbers into two halves,
               | each N/2 numbers. Now, suppose you don't have enough
               | memory to represent the first N/2 numbers (ignoring
               | changes in ordering); in that case, two different bags of
               | numbers will have the same representation in memory. One
               | can now construct a second half of numbers for which the
               | algorithm will get the wrong answer for at least one of
               | the two colliding cases.
               | 
               | This is assuming a deterministic algorithm; maybe a
               | random algorithm could work with high probability and
               | less memory?
        
         | jagged-chisel wrote:
         | > ... I didn't realize the interview task was to re-implement
         | someone's PhD thesis in 30 minutes...
         | 
         | What a bullshit task. I'm beginning to think this kind of
         | interviewing should be banned. Seems to me it's just an easy
         | escape hatch for the interviewer/hiring manager when they want
         | to discriminate based on prejudice.
        
       | Tarean wrote:
       | Love this algorithm. It feels like magic, and then it feels
       | obvious and basically like binary search.
       | 
       | Similar to the algorithm to parallelize the merge step of merge
       | sort. Split the two sorted sequences into four sequences so that
       | `merge(left[0:leftSplit],
       | right[0:rightSplit])+merge(left[leftSplit:], right[rightSplit:])`
        | is sorted. leftSplit+rightSplit should be half the total length,
       | and the elements in the left partition must be <= the elements in
       | the right partition.
       | 
       | Seems impossible, and then you think about it and it's just
       | binary search.
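        | 
        | A sketch of that binary search (h is half the total length; this
        | follows the classic "partition two sorted arrays" trick, not any
        | particular library's code):
        | 
        |     def merge_split(left, right, h):
        |         # Find (ls, rs) with ls + rs == h so that every element
        |         # of left[:ls] and right[:rs] is <= every element of
        |         # left[ls:] and right[rs:]. The two halves can then be
        |         # merged in parallel.
        |         INF = float("inf")
        |         lo, hi = max(0, h - len(right)), min(h, len(left))
        |         while lo <= hi:
        |             ls = (lo + hi) // 2
        |             rs = h - ls
        |             l_lo = left[ls - 1] if ls > 0 else -INF
        |             l_hi = left[ls] if ls < len(left) else INF
        |             r_lo = right[rs - 1] if rs > 0 else -INF
        |             r_hi = right[rs] if rs < len(right) else INF
        |             if l_lo <= r_hi and r_lo <= l_hi:
        |                 return ls, rs
        |             elif l_lo > r_hi:
        |                 hi = ls - 1   # take fewer elements from `left`
        |             else:
        |                 lo = ls + 1   # take more elements from `left`
        |         raise ValueError("inputs must be sorted")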
        
       | runiq wrote:
       | Why is it okay to drop not-full chunks? The article doesn't
       | explain that and I'm stupid.
       | 
       | Edit: I just realized that the function where non-full chunks are
       | dropped is just the one for finding the pivot, not the one for
       | finding the median. I understand now.
        
       | RcouF1uZ4gsC wrote:
       | > The C++ standard library uses an algorithm called introselect
       | which utilizes a combination of heapselect and quickselect and
       | has an O(nlogn) bound.
       | 
       | Introselect is a combination of Quickselect and Median of Medians
       | and is O(n) worst case.
        
       | ValleZ wrote:
       | I was asked to invent this algorithm on a whiteboard in 30
       | minutes. Loved it.
        
       | nilslindemann wrote:
       | "ns" instead of "l" and "n" instead of "el" would have been my
       | choice (seen in Haskell code).
        
         | robinhouston wrote:
         | The trouble with using this convention (which I also like) in
         | Python code is that sooner or later one wants to name a pair of
         | lists 'as' and 'bs', which then causes a syntax error because
         | 'as' is a keyword in Python. There is a similar problem with
         | 'is' and 'js'.
        
           | nilslindemann wrote:
           | Sure, naming is hard, but avoid "l", "I", "O", "o".
           | 
           | Very short variable names (including "ns" and "n") are always
           | some kind of disturbance when reading code, especially when
           | the variable lasts longer than one screen of code - one has
           | to memorize the meaning. They sometimes have a point, e.g. in
            | mathematical code like this one. But variables like "l" and
            | "O" are bad for a further reason: they cannot easily be
            | distinguished from the digits 1 and 0. See also the Python style
           | guide: https://peps.python.org/pep-0008/#names-to-avoid
        
       | danlark wrote:
       | Around 4 years ago I compared lots of different median algorithms
       | and the article turned out to be much longer than I anticipated
       | :)
       | 
       | https://danlark.org/2020/11/11/miniselect-practical-and-gene...
        
         | cfors wrote:
         | Just wanted to say thank you for this article - I've read and
         | shared this a few times over the years!
        
       | chpatrick wrote:
       | Another nice one is O(1) weighted sampling (after O(n)
       | preprocessing).
       | 
       | https://en.wikipedia.org/wiki/Alias_method
        
       | Someone wrote:
       | FTA:
       | 
       |  _"Proof of Average O(n)
       | 
       | On average, the pivot will split the list into 2 approximately
       | equal-sized pieces. Therefore, each subsequent recursion operates
       | on 1/2 the data of the previous step."_
       | 
       | That "therefore" doesn't follow, so this is more an intuition
       | than a proof. The problem with it is that the medium is more
       | likely to end up in the larger of the two pieces, so you more
       | likely have to recurse on the larger part than on the smaller
       | part.
       | 
       | What saves you is that _O(n)_ doesn't say anything about
       | constants.
       | 
       | Also, I would think you can improve things a bit for real world
       | data by, on subsequent iterations, using the average of the set
       | as pivot (You can compute that for both pieces on the fly while
       | doing the splitting. The average may not be in the set of items,
       | but that doesn't matter for this algorithm). Is that true?
        
         | meatmanek wrote:
         | If I'm understanding correctly, the median is actually
         | guaranteed to be in the larger of the two pieces of the array
         | after partitioning. That means on average you'd only discard
         | 25% of the array after each partition. Your selected pivot is
         | either below the median, above the median, or exactly the
         | median. If it's below the median, it could be anywhere in the
         | range [p0, p50) for an average of around p25; if it's above the
         | median, it could be anywhere in the range (p50, p100] for an
         | average of around p75.
         | 
         | Since these remaining fractions combine multiplicatively, we
         | actually care about the geometric mean of the remaining
         | fraction of the array, which is e^[(integral of ln(x) dx from
         | x=0.5 to x=1) / (1 - 0.5)], or about 73.5%.
         | 
         | Regardless, it forms a geometric series, which should converge
         | to 1/(1-0.735) or about 3.77.
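          | 
          | A quick numeric check of those two figures, using the
          | antiderivative of ln(x), namely x*ln(x) - x:
          | 
          |     import math
          | 
          |     integral = (1 * math.log(1) - 1) - (0.5 * math.log(0.5) - 0.5)
          |     f = math.exp(integral / 0.5)  # geometric mean of remaining fraction
          |     print(f)                      # ~0.7358
          |     print(1 / (1 - f))            # ~3.78, total work as a multiple of n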
         | 
         | Regarding using the average as the pivot: the question is
         | really what quantile would be equal to the mean for your
         | distribution. Heavily skewed distributions would perform pretty
          | badly. It would perform particularly badly on
          | 0.01**np.arange(1, 100) -- for each partitioning step, the mean
          | would land between the first element and the second element.
        
       | mabbo wrote:
       | I learned about the median-of-medians quickselect algorithm when
       | I was an undergrad and was really impressed by it. I implemented
        | it, and it was terribly slow. Its runtime grew linearly, but
       | that only really mattered if you had at least a few billion items
       | in your list.
       | 
       | I was chatting about this with a grad student friend who casually
       | said something like "Sure, it's slow, but what really matters is
       | that it proves that it's _possible_ to do selection of an
        | unsorted list in O(n) time. At one point, we didn't know whether
        | that was even possible. Now that we do, we know there might be an
        | even faster linear algorithm." Really got into the philosophy of
       | what Computer Science is about in the first place.
       | 
       | The lesson was so simple yet so profound that I nearly applied to
       | grad school because of it. I have no idea if they even recall the
       | conversation, but it was a pivotal moment of my education.
        
         | zelphirkalt wrote:
          | Does the fact that any linear time algorithm exists indicate
          | that a faster linear time algorithm exists? Otherwise, what is
          | the gain from that bit of knowledge? You could also think: "We
          | already know that some <arbitrary O(...)> algorithm exists, so
          | there might be an even faster <other O(...)> algorithm!" What
          | makes the existence of an O(n) algorithm more indicative than
          | the existence of an O(n log n) algorithm?
        
           | anonymoushn wrote:
           | If you had two problems, and a linear time solution was known
           | to exist for only one of them, I think it would be reasonable
           | to say that it's more likely that a practical linear time
           | solution exists for that one than for the other one.
        
           | blt wrote:
           | I am not the original commenter, but I (and probably many CS
           | students) have had similar moments of clarity. The key part
           | for me isn't
           | 
           | > there might be an even faster linear algorithm,
           | 
           | but
           | 
           | > it's possible to do selection of an unsorted list in O(n)
           | time. At one point, we didn't know whether that was even
           | possible.
           | 
           | For me, the moment of clarity was understanding that
           | theoretical CS mainly cares about _problems_ , not
           | _algorithms_. Algorithms are tools to prove upper bounds on
           | the complexity of problems. Lower bounds are equally
           | important and cannot be proved by designing algorithms. We
           | even see theorems of the form  "there exists an O(whatever)
           | algorithm for <problem>": the algorithm's existence can
           | sometimes be proven non-constructively.
           | 
           | So if the median problem sat for a long time with a linear
           | lower bound and superlinear upper bound, we might start to
           | wonder if the problem has a superlinear lower bound, and
           | spend our effort working on that instead. The existence of a
           | linear-time algorithm immediately closes that path. The only
           | remaining work is to tighten the constant factor. The
           | community's effort can be focused.
           | 
           | A famous example is the linear programming problem. Klee and
           | Minty proved an exponential worst case for _the simplex
           | algorithm_ , but not for _linear programming itself_. Later,
           | Khachiyan proved that the ellipsoid algorithm was polynomial-
           | time, but it had huge constant factors and was useless in
           | practice. However, a few years later, Karmarkar gave an
           | efficient polynomial-time algorithm. One can imagine how
            | Khachiyan's work, although inefficient, could motivate a
           | more intense focus on polynomial-time LP algorithms leading
           | to Karmarkar's breakthrough.
        
         | mrguyorama wrote:
         | We studied (I believe) this algorithm in my senior year of
         | Computer Science. We talked about the theory side of it that
          | you mention, but this algorithm was also used to demonstrate
          | that a "slow linear algorithm" is not faster than a "fast
          | n log n algorithm" in most real-life cases.
         | 
         | I think we got a constant factor of 22 for this algorithm so
         | maybe it was a related one or something.
        
       | zelphirkalt wrote:
       | You can simply pass once over the data, and while you do that,
       | count occurrences of the elements, memorizing the last maximum.
       | Whenever an element is counted, you check, if that count is now
       | higher than the previous maximum. If it is, you memorize the
       | element and its count as the maximum, of course. Very simple
       | approach and linear in time, with minimal book keeping on the way
       | (only the median element and the count (previous max)).
       | 
       | I don't find it surprising or special at all, that finding the
       | median works in linear time, since even this ad-hoc thought of
       | way is in linear time.
       | 
       | EDIT: Ah right, I mixed up mode and median. My bad.
        
         | gcr wrote:
         | This finds the mode (most common element), not the median.
         | 
         | Wouldn't you also need to keep track of all element counts with
         | your approach? You can't keep the count of only the second-
         | most-common element because you don't know what that is yet.
        
           | zelphirkalt wrote:
           | Yes, you are right. I mixed up mode and median.
           | 
           | And yes, one would need to keep track of at least a key for
           | each element (not a huge element, if they are somehow huge).
           | But that would be about space complexity.
        
             | gcr wrote:
             | pardon! it's fun to think about though!
        
       | hammeiam wrote:
       | The "Split the array into subarrays of length 5, now sorting all
       | of the arrays is O(n) instead of O(n log n)" feels like cheating
       | to me
        
         | tptacek wrote:
         | They're not sorting all the arrays?
         | 
         |  _Later_
         | 
         | (i was going to delete this comment, but for posterity, i
         | misread --- sorting _the lists_ , not the contents of the list,
         | sure)
        
         | marcosdumay wrote:
         | O(n log 5) is O(n). There's no cheating, sorting small arrays
         | in a list is a completely different problem from sorting a
         | large array.
        
         | pfortuny wrote:
         | Actually lots of algorithms "feel" like cheating until you
         | understand what you were not looking at (fast matrix
         | multiplication, fast fourier transforms...).
        
         | Sharlin wrote:
         | It's unambiguously O(n), there's no lg n anywhere to be seen.
         | It may be O(n) with a bit larger constant factor, but the whole
         | point of big-O analysis is that those don't matter.
        
         | IncreasePosts wrote:
         | It would only be cheating if you could merge the arrays in
         | O(1), which you can't.
        
       | someplaceguy wrote:
        |     return l[len(l) / 2]
       | 
       | I'm not a Python expert, but doesn't the `/` operator return a
       | float in Python? Why would you use a float as an array index
       | instead of doing integer division (with `//`)?
       | 
       | I know this probably won't matter until you have extremely large
       | arrays, but this is still quite a code smell.
       | 
       | Perhaps this could be forgiven if you're a Python novice and
       | hadn't realized that the two different operators exist, but this
       | is not the case here, as the article contains this even more
       | baffling code which uses integer division in one branch but float
       | division in the other:                   def
       | quickselect_median(l, pivot_fn=random.choice):             if
       | len(l) % 2 == 1:                 return quickselect(l, len(l) //
       | 2, pivot_fn)             else:                 return 0.5 *
       | (quickselect(l, len(l) / 2 - 1, pivot_fn) +
       | quickselect(l, len(l) / 2, pivot_fn))
       | 
       | That we're 50 comments in and nobody seems to have noticed this
       | only serves to reinforce my existing prejudice against the
       | average Python code quality.
        
         | runeblaze wrote:
          | I do agree that it is a code smell. However, given that this is
          | an algorithms article, I don't think it is exactly fair to
          | judge it based on code quality. I think of it as: instead of
         | writing it in pseudocode the author chose a real pseudocode-
         | like programming language, and it (presumably) runs well for
         | illustrative purposes.
        
         | jononor wrote:
         | Well spotted! In Python 2 there was only one operator, but in
         | Python 3 they are distinct. Indexing an array with a float
         | raises an exception, I believe.
        
       | xinok wrote:
       | > P.S: In 2017 a new paper came out that actually makes the
       | median-of-medians approach competitive with other selection
       | algorithms. Thanks to the paper's author, Andrei Alexandrescu for
       | bringing it to my attention!
       | 
        | He also gave a talk about his algorithm in 2016. He's an
        | entertaining presenter; I highly recommend it!
       | 
       | There's Treasure Everywhere - Andrei Alexandrescu
       | 
       | https://www.youtube.com/watch?v=fd1_Miy1Clg
        
         | pjkundert wrote:
          | Andrei Alexandrescu is awesome; around 2000 he gave a talk on
          | lock-free, wait-free algorithms that I immediately applied to a
          | huge C++ industrial control networking project at the time.
          | 
          | I'd recommend that anyone who writes software listen to and
          | read anything of Andrei's you can find; this one is indeed a
          | treasure!
        
       | SkiFire13 wrote:
        | I wonder what the reason is for picking groups of 5 elements
        | instead of 2 or 8.
        
         | lalaland1125 wrote:
         | 1. You want an odd number so the median is the middle element
         | of the sublist.
         | 
         | 2. One and three are probably too small
        
         | danlark wrote:
          | Groups of 3 and 4 elements fail to prove that the complexity is
          | linear.
          | 
          | You can still do 3 or 4, but with slight modifications.
         | 
         | https://arxiv.org/abs/1409.3600
         | 
         | For example, for 4 elements, it's advised to take lower median
         | for the first half and upper median for the second half. Then
         | the complexity will be linear
        
       | someplaceguy wrote:
        | I found this part of the code quite funny:
        | 
        |     # If there are < 5 items, just return the median
        |     if len(l) < 5:
        |         # In this case, we fall back on the first median function
        |         # we wrote. Since we only run this on a list of 5 or fewer
        |         # items, it doesn't depend on the length of the input and
        |         # can be considered constant time.
        |         return nlogn_median(l)
       | 
       | Hell, why not just use 2^140 instead of 5 as the cut-off point,
       | then? This way you'd have _constant time_ median finding for all
       | arrays that can be represented in any real-world computer! :) [1]
       | 
       | [1] According to https://hbfs.wordpress.com/2009/02/10/to-boil-
       | the-oceans/
        
         | SkiFire13 wrote:
         | Ultimately when an algorithm has worse complexity than another
         | it might still be faster up to a certain point. In this case 5
         | is likely under that point, though I doubt 2^256 will.
         | 
         | In practice you might also want to use a O(n^2) algorithm like
         | insertion sort under some threshold.
        
           | someplaceguy wrote:
           | > Ultimately when an algorithm has worse complexity than
           | another it might still be faster up to a certain point.
           | 
           | Sure, but the author didn't argue that the simpler algorithm
           | would be faster for 5 items, which would indeed make sense.
           | 
            | Instead, the author argued that it's OK to use the simpler
            | algorithm for fewer than 5 items because 5 is a constant and
            | therefore the simpler algorithm runs in constant time. Hence
            | my point that you could use the same argument to say that
            | 2^140 (or 2^256) could just as well be used as the cut-off
            | point, and similarly argue that the simpler algorithm runs in
            | _constant time_ for all arrays that can be represented on a
            | real-world computer, therefore obviating the need for the
            | more complex algorithm (which obviously makes no sense).
        
             | thrw2486776 wrote:
             | If you set n=2^140, then sure, it's constant. If instead
             | you only have n<=2^140, then n varies across a large set
             | and is practically indistinguishable from n<=infinity
             | (since we get into the territory of the number of atoms in
             | the universe), therefore you can perform limit calculations
             | on it, in particular big O stuff.
             | 
             | In the article n was set to 5. All of those arrays (except
             | maybe 1) have exactly 5 elements. There is no variance (and
             | even if there was, it would be tiny, there is no point in
             | talking about limits of 5-element sequences).
        
               | someplaceguy wrote:
               | > In the article n was set to 5. All of those arrays
               | (except maybe 1) have exactly 5 elements. There is no
               | variance
               | 
                | No, the code was:
                | 
                |     # If there are < 5 items, just return the median
                |     if len(l) < 5:
                |         return nlogn_median(l)
               | 
               | > and even if there was, it would be tiny, there is no
               | point in talking about limits of 5-element sequences
               | 
               | So your point is: not all constants are created equal.
               | Which circles all the way back to my original point that
               | this argument is pretty funny :)
        
         | raincole wrote:
         | Big-O notation and "real-world computer" don't belong to the
         | same sentence. The whole point of big-O notation is to abstract
         | the algorithm out of real-world limitations so we can talk
         | about arbitrarily large input.
         | 
         | Any halting program that runs on a real world computer is O(1),
         | by definition.
        
           | someplaceguy wrote:
           | > The whole point of big-O notation is to abstract the
           | algorithm out of real-world limitations so we can talk about
           | arbitrarily large input.
           | 
           | Except that there is no such thing as "arbitrarily large
           | storage", as my link in the parent comment explained:
           | https://hbfs.wordpress.com/2009/02/10/to-boil-the-oceans/
           | 
           | So why would you want to talk about arbitrarily large input
           | (where the input is an array that is stored in memory)?
           | 
           | As I understood, this big-O notation is intended to have some
           | real-world usefulness, is it not? Care to elaborate what that
           | usefulness is, exactly? Or is it just a purely fictional
           | notion in the realm of ideas with no real-world application?
           | 
           | And if so, why bother studying it at all, except as a
           | mathematical curiosity written in some mathematical pseudo-
           | code rather than a programming or engineering challenge
           | written in a real-world programming language?
           | 
           | Edit: s/pretending/intended/
        
             | raincole wrote:
             | > (where the input is an array that is stored in memory)?
             | 
             | If the input is an array that is stored in a piece of real-
             | world memory, then the only possible complexity is O(1).
             | It's just how big-O works. Big-O notation is an abstraction
             | that is much much closer to mathematics than to
             | engineering.
             | 
             | > this big-O notation is pretending to have some real-world
             | usefulness...
             | 
             | Big-O notation is not a person so I'm not sure whether it
             | can pretend something. CS professors might exaggerate big-O
             | notation's real-world usefulness so their students don't
             | fall asleep too fast.
             | 
             | > fictional
             | 
             | Theoretical. Just like the other theoretical ideas, _at
             | best_ big-O notation gives some vague insights that help
             | people solve real problems. It gives a _very_ rough feeling
             | about whether an algorithm is fast or not.
             | 
             | By the way, Turing machine is in this category as well.
        
             | fenomas wrote:
             | Big-O analysis is about _scaling_ behavior - its real-world
             | implications lie in what it tells you about _relative_
             | sizes, not absolute sizes.
             | 
             | E.g., if you need to run a task on 10M inputs, then knowing
             | that your algorithm is O(N) doesn't tell you anything at
             | all about how long your task will take. It also doesn't
             | tell you whether that algorithm will be faster than some
             | other algorithm that's O(N^2).
             | 
             | But it _does_ tell you that if your task size doubles to
             | 20M inputs, you can expect the time required for the first
             | algorithm to double, and the second to quadruple. And that
             | knowledge isn 't arcane or theoretical, it works on real-
             | world hardware and is really useful for modeling how your
             | code will run as inputs scale up.
        
       | beyondCritics wrote:
        | > It's not straightforward to prove why this is O(n).
       | 
       | Replace T(n/5) with T(floor(n/5)) and T(7n/10) with
       | T(floor(7n/10)) and show by induction that T(n) <= 10n for all n.
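        | 
        | Sketch of that induction, writing the non-recursive partitioning
        | work as at most c*n (the 10n bound above is the c = 1 case).
        | Assume T(m) <= 10cm for all m < n; then
        | 
        |     T(n) <= T(floor(n/5)) + T(floor(7n/10)) + cn
        |          <= 10c*(n/5) + 10c*(7n/10) + cn
        |          =  2cn + 7cn + cn
        |          =  10cn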
        
       | sfpotter wrote:
       | A nice way to approximate the median:
       | https://www.stat.berkeley.edu/~ryantibs/papers/median.pdf
        
       | jagged-chisel wrote:
       | It's quicksort with a modification to select the median during
       | the process. I feel like this is a good way to approach lots of
       | "find $THING in list" questions.
        
       | throwaway294531 wrote:
       | If you're selecting the n:th element, where n is very small (or
       | large), using median-of-medians may not be the best choice.
       | 
       | Instead, you can use a biased pivot as in [1] or something I call
       | "j:th of k:th". Floyd-Rivest can also speed things up. I have a
       | hobby project that gets 1.2-2.0x throughput when compared to a
       | well implemented quickselect, see:
       | https://github.com/koskinev/turboselect
       | 
       | If anyone has pointers to fast generic & in-place selection
       | algorithms, I'm interested.
       | 
       | [1] https://doi.org/10.4230/LIPIcs.SEA.2017.24
        
       | ncruces wrote:
       | An implementation in Go, that's (hopefully) simple enough to be
       | understandable, yet minimally practical:
       | 
       | https://github.com/ncruces/sort/blob/main/quick/quick.go
        
       | kccqzy wrote:
       | > Quickselect gets us linear performance, but only in the average
       | case. What if we aren't happy to be average, but instead want to
       | guarantee that our algorithm is linear time, no matter what?
       | 
       | I don't agree with the need for this guarantee. Note that the
       | article already says the selection of the pivot is by random. You
       | can simply choose a very good random function to avoid an
       | attacker crafting an input that needs quadratic time. You'll
        | never be unlucky enough for this to be a problem. This is
        | basically the same kind of mindset that leads people to worry:
        | what if I use SHA256 to hash these two different strings and get
        | the same hash?
        
         | mitthrowaway2 wrote:
         | It's a very important guarantee for use in real-time signal
         | processing applications.
        
         | forrestthewoods wrote:
         | > I don't agree with the need for this guarantee.
         | 
         | You don't get to agree with it or not. It depends on the
         | project! Clearly there exist some projects in the world where
         | it's important.
         | 
          | But honestly it doesn't matter. Because, as the article shows
          | with random data, median-of-medians is strictly better than a
          | random pivot. So even if you don't need the guarantee, there is
          | zero loss in achieving it.
        
           | kccqzy wrote:
            | Median-of-medians comes at a cost in execution time. Chances
            | are, sorting each five-element chunk is a lot slower than
            | even running a sophisticated random number generator.
        
       | rented_mule wrote:
       | 10-15 years ago, I found myself needing to regularly find the
       | median of many billions of values, each parsed out of a multi-
       | kilobyte log entry. MapReduce was what we were using for
       | processing large amounts of data at the time. With MapReduce over
       | that much data, you don't just want linear time, but ideally
       | single pass, distributed across machines. Subsequent passes over
       | much smaller amounts of data are fine.
       | 
       | It was a struggle until I figured out that knowledge of the
       | precision and range of our data helped. These were timings,
       | expressed in integer milliseconds. So they were non-negative, and
       | I knew the 90th percentile was well under a second.
       | 
       | As the article mentions, finding a median typically involves
       | something akin to sorting. With the above knowledge, bucket sort
       | becomes available, with a slight tweak in my case. Even if the
       | samples were floating point, the same approach could be used as
       | long as an integer (or even fixed point) approximation that is
       | very close to the true median is good enough, again assuming a
       | known, relatively small range.
       | 
       | The idea is to build a dictionary where the keys are the timings
       | in integer milliseconds and the values are a count of the keys'
       | appearance in the data, i.e., a histogram of timings. The maximum
       | timing isn't known, so to ensure the size of the dictionary
       | doesn't get out of control, use the knowledge that the 90th
       | percentile is well under a second and count everything over, say,
       | 999ms in the 999ms bin. Then the dictionary will be limited to
       | 2000 integers (keys in the range 0-999 and corresponding values)
       | - this is the part that is different from an ordinary bucket
       | sort. All of that is trivial to do in a single pass, even when
       | distributed with MapReduce. Then it's easy to get the median from
       | that dictionary / histogram.
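        | 
        | A single-machine sketch of that histogram (illustrative; the
        | distributed version would just sum these dictionaries across
        | shards, and the 999ms cap is the example figure from above):
        | 
        |     from collections import Counter
        | 
        |     def median_ms(timings, cap=999):
        |         # Capped histogram of integer-millisecond timings, then
        |         # walk the keys in order until the middle rank is reached.
        |         hist = Counter(min(t, cap) for t in timings)
        |         target = (sum(hist.values()) - 1) // 2   # lower median
        |         seen = 0
        |         for value in sorted(hist):
        |             seen += hist[value]
        |             if seen > target:
        |                 return value
        | 
        |     # median_ms([3, 7, 7, 12, 2500]) -> 7 (2500 lands in the 999 bin)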
        
         | justinpombrio wrote:
         | Did you actually need to find the true median of billions of
         | values? Or would finding a value between 49.9% and 50.1%
         | suffice? Because the latter is much easier: sample 10,000
         | elements uniformly at random and take their median.
         | 
         | (I made the number 10,000 up, but you could do some statistics
         | to figure out how many samples would be needed for a given
         | level of confidence, and I don't think it would be
         | prohibitively large.)
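          | 
          | For instance (the sample size is made up, as above, and this
          | assumes the data sits in an indexable collection):
          | 
          |     import random
          |     import statistics
          | 
          |     def approx_median(data, sample_size=10_000):
          |         # Median of a uniform random sample: lands within a
          |         # small rank window around the true median with high
          |         # probability.
          |         sample = random.sample(data, min(sample_size, len(data)))
          |         return statistics.median(sample)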
        
           | andruby wrote:
           | I was thinking the same thing.
           | 
           | In all use-cases I've seen a close estimate of the median was
           | enough.
        
           | rented_mule wrote:
           | The kind of margin you indicate would have been plenty for
           | our use cases. But, we were already processing all these log
           | entries for multiple other purposes in a single pass (not one
           | pass per thing computed). With this single pass approach, the
           | median calculation could happen with the same single-pass
           | parsing of the logs (they were JSON and that parsing was most
           | of our cost), roughly for free.
           | 
           | Uniform sampling also wasn't obviously simple, at least to
           | me. There were thousands of log files involved, coming from
           | hundreds of computers. Any single log file only had timings
           | from a single computer. What kind of bias would be introduced
           | by different approaches to distributing those log files to a
           | cluster for the median calculation? Once the solution
           | outlined in the previous comment was identified, that seemed
            | simpler than trying to understand if we were talking about
           | 49-51% or 40-50%. And if it was too big a margin,
           | restructuring our infra to allow different log file
           | distribution algorithms would have been far more complicated.
        
           | enriquto wrote:
           | > the latter is much easier: sample 10,000 elements uniformly
           | at random and take their median
           | 
           | Do you have a source for that claim?
           | 
           | I don't see how could that possibly be true... For example,
           | if your original points are sampled from two gaussians of
           | centers -100 and 100, of small but slightly different
           | variance, then the true median can be anywhere between the
           | two centers, and you may need a humungous number of samples
           | to get anywhere close to it.
           | 
           | True, in that case any point between say -90 and 90 would be
           | equally good as a median in most applications. But this does
           | not mean that the median can be found accurately by your
           | method.
        
             | maronato wrote:
             | the key word is "uniformly". If your data is not uniformly
             | distributed, then you just have to pick random values non-
             | uniformly. There are many ways to do that, and once you
             | have your algo you'll be able to reliably find an
             | approximation of the median much faster than you would find
             | the actual median.
             | 
             | https://en.m.wikipedia.org/wiki/Non-
             | uniform_random_variate_g...
        
         | ant6n wrote:
         | I'm not sure why you use a dictionary with keys 0...999,
         | instead of an array indexed 0...999.
        
       | TacticalCoder wrote:
       | I really enjoyed TFA but this:
       | 
       | > Technically, you could get extremely unlucky: at each step, you
       | could pick the largest element as your pivot. Each step would
       | only remove one element from the list and you'd actually have
        | O(n^2) performance instead of O(n)
       | 
       | If adversarial input is a concern, doing a O(n) shuffle of the
       | data first guarantees this cannot happen. If the data is really
       | too big to shuffle, then only shuffle once a bucket is small
       | enough to be shuffled.
       | 
        | If you do shuffle, probability is there to guarantee that the
        | worst case cannot happen. If anyone says that "technically" it
        | can happen, I'll answer that "technically" an attacker could
        | also correctly guess every bit of your 256-bit private key.
       | 
        | Our world is built on probabilities: all our private keys are
        | protected by the mathematical improbability of someone guessing
        | them correctly.
       | 
       | From what I read, a shuffle followed by quickselect is O(n) for
       | all practical purposes.
        
         | bo1024 wrote:
         | You're already using your own randomness to pick the pivot at
         | random, so I don't see why the shuffle helps more. But yes, if
         | your randomness is trustworthy, the probability of more than
         | O(n) runtime is very low.
        
       ___________________________________________________________________
       (page generated 2024-07-25 23:03 UTC)