[HN Gopher] American flag sort
       ___________________________________________________________________
        
       American flag sort
        
       Author : zerojames
       Score  : 239 points
       Date   : 2024-04-23 23:26 UTC (23 hours ago)
        
 (HTM) web link (xlinux.nist.gov)
 (TXT) w3m dump (xlinux.nist.gov)
        
       | muti wrote:
       | Visualisation https://www.youtube.com/watch?v=k1XkZ5ANO64
       | 
       | Wikipedia page https://en.wikipedia.org/wiki/American_flag_sort
        
         | AnarchismIsCool wrote:
         | Looks more like anarchist flag sort...
        
           | codetrotter wrote:
           | I'll take your word for it, AnarchismIsCool ;)
        
         | aaron695 wrote:
         | > Visualisation https://www.youtube.com/watch?v=k1XkZ5ANO64
         | 
         | OT: I quite liked the follow up video Youtube suggested -
         | 
         | Sorting algorithms to relax/study to -
         | https://www.youtube.com/watch?v=vr5dCRHAgb0
        
         | peterleiser wrote:
         | The next video that came up for me was "sorting algorithms to
         | relax/study to": https://www.youtube.com/watch?v=vr5dCRHAgb0
         | 
         | I'm almost embarrassed by how long I've watched it!
        
         | knodi123 wrote:
         | No offense, but I don't think that visualization helps very
         | much.
         | 
         | Here's a visualization of a much larger data set that
         | illustrates what's actually happening with the buckets and the
         | major steps.
         | 
         | https://www.youtube.com/watch?v=21jEd1FUiV8
        
           | fourteenfour wrote:
           | Thanks, this one makes the stripe pattern much more apparent.
        
       | tomphoolery wrote:
       | Freedom Sort.
        
         | tomschlick wrote:
         | Cue the Team America theme song
        
           | mcfedr wrote:
           | That is basically what I hear in my head everytime I see that
           | flag
        
         | tantalor wrote:
         | For democracy!
        
           | seattle_spring wrote:
           | I'm doing my part!
        
             | __d wrote:
             | would you like to know more?
        
         | tom_ wrote:
         | If you disagree, see the first amendment! If you still
         | disagree, see the second amendment!
        
           | mondobe wrote:
           | And if you STILL disagree, see the third amendment! If you're
           | a soldier, in my house, I guess.
        
       | karmakaze wrote:
       | At first I didn't think I'd be interested.
       | 
       | > Using some efficiency techniques, it is twice as fast as
       | quicksort for large sets of strings.
        
         | hinkley wrote:
         | I'm not calling it flag sort.
         | 
         | It's an in-place bucket sort which is interesting.
        
           | wodenokoto wrote:
           | From Wikipedia[1]:
           | 
           | > The name American flag sort comes by analogy with the Dutch
           | national flag problem[2] in the last step: efficiently
           | partition the array into many "stripes".
           | 
           | [1] https://en.m.wikipedia.org/wiki/American_flag_sort
           | 
           | [2]
           | https://en.m.wikipedia.org/wiki/Dutch_national_flag_problem
        
             | hinkley wrote:
             | I read this rationale before I made my comment.
        
             | kleiba wrote:
             | _> From Wikipedia[1]:_
             | 
             | ...or from the very article itself...
        
           | remram wrote:
           | Or _in-place radix sort_ maybe.
           | 
           | I agree with you, the name is ridiculous, reminds me of
           | _miracle sort_ or _intelligent design sort_ : https://www.dan
           | germouse.net/esoteric/intelligentdesignsort.h...
           | 
           | Probably those two flag sort algorithms were named before the
           | joke algorithms were popularily known.
        
         | thefifthsetpin wrote:
         | Quicksort isn't great when comparing two items is expensive.
        
       | owlninja wrote:
       | The 'more information' site linked on the page seems like
       | something some HN folk love.
       | 
       | https://www.workmall.com/flags/united_states_flag.html
        
       | defrost wrote:
       | There's a Peter M. McIlroy as well!?!
       | 
       | All this time and I thought it was just Doug, I'm _guessing_
       | Peter might be the son of the father of pipe?
        
         | icsa wrote:
         | Yes. He is a software engineer in Seattle.
        
       | jdeisenberg wrote:
       | Wondering how this sort would perform with Unicode input (given
       | 256 buckets); specifically what the results would look like.
        
         | bawolff wrote:
         | Presumably it would be fine, since unicode input is still a
         | string of bytes at the end of the day.
        
           | nanidin wrote:
           | It says it sorts one byte at a time. I think this would break
           | for anything not utf-8.
        
             | ts4z wrote:
             | Seems like it should work for arbitrary byte strings (any
             | charset, any encoding)but obviously the performance
             | characteristics will differ because of non-uniform
             | distribution. But that happens even in ASCII.
        
             | brirec wrote:
             | This would even break UTF-8, since multi-byte characters
             | are a thing!
        
               | dumbo-octopus wrote:
               | How would that break anything? The strings aren't being
               | split.
        
               | EmilyHughes wrote:
               | UTF-8 is downward compatible to ASCII, so anything that
               | is just a standard character (like every character in
               | this comment) is just a byte.
        
               | ursusmaritimus wrote:
               | Lexicographic encoding of UTF-8 byte sequences matches
               | lexicographic order of the sequence of Unicode code-
               | points. So you can sort UTF-8 strings as byte strings.
               | Not that sorting by code-points has much meaning, but you
               | can use the Unicode collation algorithm first.
        
               | remram wrote:
               | TIL, thanks. That makes sense given how the length
               | prefixes look like but I never thought about it. I wonder
               | if this was by chance or if the UTF-8 creator thought
               | about it.
        
               | mzs wrote:
               | % printf "%s\n" _A_ _B_ | sort
               | 
               |  _A_
               | 
               |  _B_
               | 
               | % printf "%s" _A_ _B_ | xxd -b -c4
               | 
               | 00000000: 11110000 10011101 10010000 10110100 ....
               | 
               | 00000004: 11110000 10011101 10010000 10110101 ....
               | 
               | % printf "%s" _A_ _B_ | xxd -c1 -ps | sort | xxd -r -ps |
               | xxd -b -c4
               | 
               | 00000000: 10010000 10010000 10011101 10011101 ....
               | 
               | 00000004: 10110100 10110101 11110000 11110000 ....
               | 
               | % printf "%s" _A_ _B_ | xxd -c1 -ps | sort | xxd -r -ps
               | 
               | ????????
        
           | moonchild wrote:
           | Unicode collation is very complex. If you wanted to sort
           | according to that, then you would need to devise an
           | isomorphism between unicode strings and bitstrings such that
           | lexicographic ordering in the bitstring domain agrees with
           | unicode collation in the unicode domain. I'm not saying it's
           | impossible, but it's not at all obvious how to do it. On the
           | other hand, if all you want is _some_ total order for
           | acceleration structures, then you can indeed just use an
           | arbitrary encoding like utf8 and do the obvious thing.
        
             | dsamarin wrote:
             | The obvious solution to this is to pre-compile a list of
             | all possible Unicode strings up to the required length,
             | sort them, and create a lookup table using their indices. I
             | wonder for what kind of project this would be useful.
        
               | thfuran wrote:
               | If that works for you for anything but very small
               | strings, I want to buy your computer.
        
               | remram wrote:
               | Even for 3-codepoint strings it already wouldn't fit in
               | any computer that exists. And that is not sufficient to
               | encode all 1-glyph strings.
        
             | IncreasePosts wrote:
             | Isn't that exactly what the Unicode collation algorithm is?
             | 
             | https://unicode.org/reports/tr10
             | 
             | tl;dr from its wikipedia page:
             | 
             | > The Unicode collation algorithm (UCA) is an algorithm
             | defined in Unicode Technical Report #10, which is a
             | customizable method to produce binary keys from strings
             | representing text in any writing system and language that
             | can be represented with Unicode. These keys can then be
             | efficiently compared byte by byte in order to collate or
             | sort them according to the rules of the language, with
             | options for ignoring case, accents, etc
        
               | moonchild wrote:
               | Ah! You are right.
        
             | bawolff wrote:
             | You can just use the icu library to do that. Its a very
             | common thing to do when working with unicode collations.
        
         | dhosek wrote:
         | Ignoring the fact that if you have Unicode encoded strings, you
         | probably want some sort of lexical sort rather than a sort on
         | the raw Unicode character codes which is semantically
         | meaningless, there is no reason to assume that Unicode data
         | would be any different to sort than any other data.
        
         | seanhunter wrote:
         | I think unicode is intrinsically antithetical to freedom. This
         | would do great with ASCII.
        
       | re wrote:
       | Looks like there's a Rust implementation:
       | https://crates.io/crates/afsort
       | 
       | > When sorting English words, this implementation seems to be
       | about 40% faster than sort_unstable from the Rust standard
       | library.
       | 
       | That's better than I would have expected. It would be interesting
       | to see how it does on other types of string sort problems.
        
         | antonhag wrote:
         | Hi! Crate author here, happy to see my old project getting
         | mentioned. Let me know if you have any questions!
        
           | ewalk153 wrote:
           | Do you actively use this function in any projects? What was
           | your inspiration to write the crate?
        
             | antonhag wrote:
             | I don't actively use it, unfortunately. The main
             | inspiration was to faster sort inputs into the
             | https://github.com/BurntSushi/fst crate, which I in turn
             | used to try to build a search library.
        
       | Zelizz wrote:
       | Pretty much just a variation of counting sort with a worse name?
       | https://en.wikipedia.org/wiki/Counting_sort
        
         | carabiner wrote:
         | A superior name.
        
         | dumbo-octopus wrote:
         | No. The very article you link details the difference, and that
         | Counting Sort is often a subroutine in Radix Sort, and that
         | Counting Sort is extremely poorly suited to strings, which is
         | the entire purpose of this sort. And this name is better. So...
         | "no" in basically every way possible.
        
           | Zelizz wrote:
           | Characters in a string can be thought of as digits in a
           | base-256 number. Call counting sort recursively on each
           | bucket, looking at the next character in the string. Can you
           | not see the similarity?
        
             | chowells wrote:
             | There is no one who fails to see the similarity, because
             | they're both kinds of radix sorts. For instance, a counting
             | sort is a degenerate form of radix sort that only does one
             | pass.
             | 
             | But where the classic radix sort is bottom-up, the American
             | flag sort is top-down. That really matters for data where
             | the the most significant portions of the data is likely to
             | be all you need to consider for the sorting. While a top-
             | down sort will have similar performance to bottom-up in the
             | worst case, in the best case it can skip a lot of passes,
             | especially if the inputs are much longer than the longest
             | common prefix.
        
             | dumbo-octopus wrote:
             | They're related, but in a very specific way that does not
             | meet the bar for "variation" in my opinion. As you
             | described, counting sort is a very simple procedure that
             | works well only on a very specific input domain. If another
             | algorithm has "more logic" than counting sort itself in
             | order to transform a very different input domain into a
             | format that is suited for making many repeated calls to (a
             | procedure that may be implemented as) counting sort, in a
             | specific pattern, and appropriately coalescing the results
             | of those calls, I think it is appropriate to let it have
             | its own name. Would you prefer to refer to every possible
             | utilization of counting as "that version of counting sort
             | where...."?
        
         | sltkr wrote:
         | American flag sort is essentially an in-place version of
         | counting sort.
         | 
         | That's talking about the single-pass American flag sort. If you
         | use American flag sort to sort strings, you do multiple passes:
         | on the first pass, you sort all strings by their first
         | character, then for each starting character, you sort the
         | strings with that starting character (which are now in a
         | contiguous subarray) by their second character, and so on.
         | 
         | @chowells calls it a "top-down radix sort" below, which is a
         | great description. It also explains the strengths and
         | weaknesses of the two algorithms: radix sort works great for
         | small strings of fixed length (e.g. IPv4 addresses, which can
         | be thought of as 4-byte sequences) while American flag sort
         | works great for variable-length strings like actual textual
         | strings, especially if they don't share common prefixes (e.g.
         | dictionary words, usernames, etc.)
         | 
         | @hinkley pointed out that the recursive version is just a
         | bucket sort, but in-place. Which is also true!
         | 
         | tl;dr:
         | 
         | Single-pass American flag sort = in-place counting sort
         | 
         | Recursive American flag sort = in-place bucket sort
        
       | Someone wrote:
       | The paper
       | (https://static.usenix.org/publications/compsystems/1993/win_...
       | linked to from the page) is well-written and worth reading,
       | iterating over multiple C programs that build up to this sort
       | algorithm and giving a good rationale for each iteration.
        
         | awanderingmind wrote:
         | There is currently a typo in the link which prevents it from
         | working - the correct link is
         | https://static.usenix.org/publications/compsystems/1993/win_...
         | 
         | Interesting read, thanks!
        
       | Traubenfuchs wrote:
       | > it is twice as fast as quicksort for large sets of string
       | 
       | Define large.
        
         | TheRealPomax wrote:
         | They did. But you'll need to read the (free) paper [1] to get
         | the details omitted in the NITS catalog summary.
         | 
         | [1]
         | http://www.usenix.org/publications/compsystems/1993/win_mcil...
        
       | Oarch wrote:
       | The comments made me realise there's no national flag with lots
       | of vertical stripes. Seems like a hole in the market.
       | 
       | Someone alert CGP Grey!
        
       | Kon-Peki wrote:
       | Sorts like these work really well on GPUs. You would start with a
       | single thread for the first pass and then keep spawning one-
       | thread-per-bucket for every additional pass until done. It isn't
       | particularly taxing on the GPU; your advantage comes from being
       | able to run potentially thousands of threads simultaneously
       | without having to worry about coordinating read/write memory
       | access.
        
       | NikkiA wrote:
       | This strikes me that it should parallelize pretty well if you
       | deputise the individual stripes to multiple threads.
        
       | franze wrote:
       | today i coded franzSort, faster than mergeSort, slower than
       | quickSort
       | 
       | function franzSort(a) { function m(l, r) { let s = []; while
       | (l.length && r.length) { if (l[0] <= r[0]) { s.push(l.shift()); }
       | else { s.push(r.shift()); } } return s.concat(l).concat(r); }
       | if (a.length <= 1) {             return a;         }         let
       | h = Math.floor(a.length / 2);         return
       | m(franzSort(a.slice(0, h)), franzSort(a.slice(h)));
       | 
       | }
       | 
       | // Example usage let arr = [5, 3, 8, 6, 2, 7, 1, 4]; let
       | sortedArr = franzSort(arr); console.log(sortedArr);
        
       | jon_richards wrote:
       | The stars in the American flag aren't used. This is clearly Pride
       | flag sort.
        
         | oniony wrote:
         | And Mauritius's flag has more stripes.
        
       ___________________________________________________________________
       (page generated 2024-04-24 23:01 UTC)