[HN Gopher] American flag sort
___________________________________________________________________
American flag sort
Author : zerojames
Score : 239 points
Date : 2024-04-23 23:26 UTC (23 hours ago)
(HTM) web link (xlinux.nist.gov)
(TXT) w3m dump (xlinux.nist.gov)
| muti wrote:
| Visualisation https://www.youtube.com/watch?v=k1XkZ5ANO64
|
| Wikipedia page https://en.wikipedia.org/wiki/American_flag_sort
| AnarchismIsCool wrote:
| Looks more like anarchist flag sort...
| codetrotter wrote:
| I'll take your word for it, AnarchismIsCool ;)
| aaron695 wrote:
| > Visualisation https://www.youtube.com/watch?v=k1XkZ5ANO64
|
| OT: I quite liked the follow up video Youtube suggested -
|
| Sorting algorithms to relax/study to -
| https://www.youtube.com/watch?v=vr5dCRHAgb0
| peterleiser wrote:
| The next video that came up for me was "sorting algorithms to
| relax/study to": https://www.youtube.com/watch?v=vr5dCRHAgb0
|
| I'm almost embarrassed by how long I've watched it!
| knodi123 wrote:
| No offense, but I don't think that visualization helps very
| much.
|
| Here's a visualization of a much larger data set that
| illustrates what's actually happening with the buckets and the
| major steps.
|
| https://www.youtube.com/watch?v=21jEd1FUiV8
| fourteenfour wrote:
| Thanks, this one makes the stripe pattern much more apparent.
| tomphoolery wrote:
| Freedom Sort.
| tomschlick wrote:
| Cue the Team America theme song
| mcfedr wrote:
| That is basically what I hear in my head everytime I see that
| flag
| tantalor wrote:
| For democracy!
| seattle_spring wrote:
| I'm doing my part!
| __d wrote:
| would you like to know more?
| tom_ wrote:
| If you disagree, see the first amendment! If you still
| disagree, see the second amendment!
| mondobe wrote:
| And if you STILL disagree, see the third amendment! If you're
| a soldier, in my house, I guess.
| karmakaze wrote:
| At first I didn't think I'd be interested.
|
| > Using some efficiency techniques, it is twice as fast as
| quicksort for large sets of strings.
| hinkley wrote:
| I'm not calling it flag sort.
|
| It's an in-place bucket sort which is interesting.
| wodenokoto wrote:
| From Wikipedia[1]:
|
| > The name American flag sort comes by analogy with the Dutch
| national flag problem[2] in the last step: efficiently
| partition the array into many "stripes".
|
| [1] https://en.m.wikipedia.org/wiki/American_flag_sort
|
| [2]
| https://en.m.wikipedia.org/wiki/Dutch_national_flag_problem
| hinkley wrote:
| I read this rationale before I made my comment.
| kleiba wrote:
| _> From Wikipedia[1]:_
|
| ...or from the very article itself...
| remram wrote:
| Or _in-place radix sort_ maybe.
|
| I agree with you, the name is ridiculous, reminds me of
| _miracle sort_ or _intelligent design sort_ : https://www.dan
| germouse.net/esoteric/intelligentdesignsort.h...
|
| Probably those two flag sort algorithms were named before the
| joke algorithms were popularily known.
| thefifthsetpin wrote:
| Quicksort isn't great when comparing two items is expensive.
| owlninja wrote:
| The 'more information' site linked on the page seems like
| something some HN folk love.
|
| https://www.workmall.com/flags/united_states_flag.html
| defrost wrote:
| There's a Peter M. McIlroy as well!?!
|
| All this time and I thought it was just Doug, I'm _guessing_
| Peter might be the son of the father of pipe?
| icsa wrote:
| Yes. He is a software engineer in Seattle.
| jdeisenberg wrote:
| Wondering how this sort would perform with Unicode input (given
| 256 buckets); specifically what the results would look like.
| bawolff wrote:
| Presumably it would be fine, since unicode input is still a
| string of bytes at the end of the day.
| nanidin wrote:
| It says it sorts one byte at a time. I think this would break
| for anything not utf-8.
| ts4z wrote:
| Seems like it should work for arbitrary byte strings (any
| charset, any encoding)but obviously the performance
| characteristics will differ because of non-uniform
| distribution. But that happens even in ASCII.
| brirec wrote:
| This would even break UTF-8, since multi-byte characters
| are a thing!
| dumbo-octopus wrote:
| How would that break anything? The strings aren't being
| split.
| EmilyHughes wrote:
| UTF-8 is downward compatible to ASCII, so anything that
| is just a standard character (like every character in
| this comment) is just a byte.
| ursusmaritimus wrote:
| Lexicographic encoding of UTF-8 byte sequences matches
| lexicographic order of the sequence of Unicode code-
| points. So you can sort UTF-8 strings as byte strings.
| Not that sorting by code-points has much meaning, but you
| can use the Unicode collation algorithm first.
| remram wrote:
| TIL, thanks. That makes sense given how the length
| prefixes look like but I never thought about it. I wonder
| if this was by chance or if the UTF-8 creator thought
| about it.
| mzs wrote:
| % printf "%s\n" _A_ _B_ | sort
|
| _A_
|
| _B_
|
| % printf "%s" _A_ _B_ | xxd -b -c4
|
| 00000000: 11110000 10011101 10010000 10110100 ....
|
| 00000004: 11110000 10011101 10010000 10110101 ....
|
| % printf "%s" _A_ _B_ | xxd -c1 -ps | sort | xxd -r -ps |
| xxd -b -c4
|
| 00000000: 10010000 10010000 10011101 10011101 ....
|
| 00000004: 10110100 10110101 11110000 11110000 ....
|
| % printf "%s" _A_ _B_ | xxd -c1 -ps | sort | xxd -r -ps
|
| ????????
| moonchild wrote:
| Unicode collation is very complex. If you wanted to sort
| according to that, then you would need to devise an
| isomorphism between unicode strings and bitstrings such that
| lexicographic ordering in the bitstring domain agrees with
| unicode collation in the unicode domain. I'm not saying it's
| impossible, but it's not at all obvious how to do it. On the
| other hand, if all you want is _some_ total order for
| acceleration structures, then you can indeed just use an
| arbitrary encoding like utf8 and do the obvious thing.
| dsamarin wrote:
| The obvious solution to this is to pre-compile a list of
| all possible Unicode strings up to the required length,
| sort them, and create a lookup table using their indices. I
| wonder for what kind of project this would be useful.
| thfuran wrote:
| If that works for you for anything but very small
| strings, I want to buy your computer.
| remram wrote:
| Even for 3-codepoint strings it already wouldn't fit in
| any computer that exists. And that is not sufficient to
| encode all 1-glyph strings.
| IncreasePosts wrote:
| Isn't that exactly what the Unicode collation algorithm is?
|
| https://unicode.org/reports/tr10
|
| tl;dr from its wikipedia page:
|
| > The Unicode collation algorithm (UCA) is an algorithm
| defined in Unicode Technical Report #10, which is a
| customizable method to produce binary keys from strings
| representing text in any writing system and language that
| can be represented with Unicode. These keys can then be
| efficiently compared byte by byte in order to collate or
| sort them according to the rules of the language, with
| options for ignoring case, accents, etc
| moonchild wrote:
| Ah! You are right.
| bawolff wrote:
| You can just use the icu library to do that. Its a very
| common thing to do when working with unicode collations.
| dhosek wrote:
| Ignoring the fact that if you have Unicode encoded strings, you
| probably want some sort of lexical sort rather than a sort on
| the raw Unicode character codes which is semantically
| meaningless, there is no reason to assume that Unicode data
| would be any different to sort than any other data.
| seanhunter wrote:
| I think unicode is intrinsically antithetical to freedom. This
| would do great with ASCII.
| re wrote:
| Looks like there's a Rust implementation:
| https://crates.io/crates/afsort
|
| > When sorting English words, this implementation seems to be
| about 40% faster than sort_unstable from the Rust standard
| library.
|
| That's better than I would have expected. It would be interesting
| to see how it does on other types of string sort problems.
| antonhag wrote:
| Hi! Crate author here, happy to see my old project getting
| mentioned. Let me know if you have any questions!
| ewalk153 wrote:
| Do you actively use this function in any projects? What was
| your inspiration to write the crate?
| antonhag wrote:
| I don't actively use it, unfortunately. The main
| inspiration was to faster sort inputs into the
| https://github.com/BurntSushi/fst crate, which I in turn
| used to try to build a search library.
| Zelizz wrote:
| Pretty much just a variation of counting sort with a worse name?
| https://en.wikipedia.org/wiki/Counting_sort
| carabiner wrote:
| A superior name.
| dumbo-octopus wrote:
| No. The very article you link details the difference, and that
| Counting Sort is often a subroutine in Radix Sort, and that
| Counting Sort is extremely poorly suited to strings, which is
| the entire purpose of this sort. And this name is better. So...
| "no" in basically every way possible.
| Zelizz wrote:
| Characters in a string can be thought of as digits in a
| base-256 number. Call counting sort recursively on each
| bucket, looking at the next character in the string. Can you
| not see the similarity?
| chowells wrote:
| There is no one who fails to see the similarity, because
| they're both kinds of radix sorts. For instance, a counting
| sort is a degenerate form of radix sort that only does one
| pass.
|
| But where the classic radix sort is bottom-up, the American
| flag sort is top-down. That really matters for data where
| the the most significant portions of the data is likely to
| be all you need to consider for the sorting. While a top-
| down sort will have similar performance to bottom-up in the
| worst case, in the best case it can skip a lot of passes,
| especially if the inputs are much longer than the longest
| common prefix.
| dumbo-octopus wrote:
| They're related, but in a very specific way that does not
| meet the bar for "variation" in my opinion. As you
| described, counting sort is a very simple procedure that
| works well only on a very specific input domain. If another
| algorithm has "more logic" than counting sort itself in
| order to transform a very different input domain into a
| format that is suited for making many repeated calls to (a
| procedure that may be implemented as) counting sort, in a
| specific pattern, and appropriately coalescing the results
| of those calls, I think it is appropriate to let it have
| its own name. Would you prefer to refer to every possible
| utilization of counting as "that version of counting sort
| where...."?
| sltkr wrote:
| American flag sort is essentially an in-place version of
| counting sort.
|
| That's talking about the single-pass American flag sort. If you
| use American flag sort to sort strings, you do multiple passes:
| on the first pass, you sort all strings by their first
| character, then for each starting character, you sort the
| strings with that starting character (which are now in a
| contiguous subarray) by their second character, and so on.
|
| @chowells calls it a "top-down radix sort" below, which is a
| great description. It also explains the strengths and
| weaknesses of the two algorithms: radix sort works great for
| small strings of fixed length (e.g. IPv4 addresses, which can
| be thought of as 4-byte sequences) while American flag sort
| works great for variable-length strings like actual textual
| strings, especially if they don't share common prefixes (e.g.
| dictionary words, usernames, etc.)
|
| @hinkley pointed out that the recursive version is just a
| bucket sort, but in-place. Which is also true!
|
| tl;dr:
|
| Single-pass American flag sort = in-place counting sort
|
| Recursive American flag sort = in-place bucket sort
| Someone wrote:
| The paper
| (https://static.usenix.org/publications/compsystems/1993/win_...
| linked to from the page) is well-written and worth reading,
| iterating over multiple C programs that build up to this sort
| algorithm and giving a good rationale for each iteration.
| awanderingmind wrote:
| There is currently a typo in the link which prevents it from
| working - the correct link is
| https://static.usenix.org/publications/compsystems/1993/win_...
|
| Interesting read, thanks!
| Traubenfuchs wrote:
| > it is twice as fast as quicksort for large sets of string
|
| Define large.
| TheRealPomax wrote:
| They did. But you'll need to read the (free) paper [1] to get
| the details omitted in the NITS catalog summary.
|
| [1]
| http://www.usenix.org/publications/compsystems/1993/win_mcil...
| Oarch wrote:
| The comments made me realise there's no national flag with lots
| of vertical stripes. Seems like a hole in the market.
|
| Someone alert CGP Grey!
| Kon-Peki wrote:
| Sorts like these work really well on GPUs. You would start with a
| single thread for the first pass and then keep spawning one-
| thread-per-bucket for every additional pass until done. It isn't
| particularly taxing on the GPU; your advantage comes from being
| able to run potentially thousands of threads simultaneously
| without having to worry about coordinating read/write memory
| access.
| NikkiA wrote:
| This strikes me that it should parallelize pretty well if you
| deputise the individual stripes to multiple threads.
| franze wrote:
| today i coded franzSort, faster than mergeSort, slower than
| quickSort
|
| function franzSort(a) { function m(l, r) { let s = []; while
| (l.length && r.length) { if (l[0] <= r[0]) { s.push(l.shift()); }
| else { s.push(r.shift()); } } return s.concat(l).concat(r); }
| if (a.length <= 1) { return a; } let
| h = Math.floor(a.length / 2); return
| m(franzSort(a.slice(0, h)), franzSort(a.slice(h)));
|
| }
|
| // Example usage let arr = [5, 3, 8, 6, 2, 7, 1, 4]; let
| sortedArr = franzSort(arr); console.log(sortedArr);
| jon_richards wrote:
| The stars in the American flag aren't used. This is clearly Pride
| flag sort.
| oniony wrote:
| And Mauritius's flag has more stripes.
___________________________________________________________________
(page generated 2024-04-24 23:01 UTC)