[HN Gopher] Zero Tolerance for Bias
___________________________________________________________________
Zero Tolerance for Bias
Author : Harmohit
Score : 83 points
Date : 2024-06-06 18:58 UTC (2 days ago)
(HTM) web link (queue.acm.org)
(TXT) w3m dump (queue.acm.org)
| tbrownaw wrote:
| Trusting a broken library seems like a different _kind_ of error
| than implementing an algorithm that isn't quite the algorithm
| you meant to implement.
| zug_zug wrote:
| Kinda interesting.
|
| There are N! permutations in a shuffle (no duplicates), and
| there's an algorithm where you pick one random number between
| 1 and N! and then produce that permutation directly (though I
| don't know how to do it better than N log N with an order-
| statistic tree). I like this because it requires exactly one
| random number.
|
| A trivial solution in functional programming (for those of us who
| find this swap stuff really unreadable and error-prone) would be
| something like:
|
|       [1,2,3,4,5,6]
|         .map(x => ({value: x, order: Math.random()}))
|         .sort((a, b) => a.order - b.order)
|         .map(x => x.value)
|
| Of course this is N log N, but personally I think it's easy to
| forget how slowly log N grows. Like log10(number of atoms in the
| universe) ≈ 82, so if your dataset is smaller than the number of
| atoms in the universe you could think of it as less than the
| constant 82.
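The "one random number" idea above can be sketched via the factorial number system (a Lehmer code). This is a hedged illustration, not the commenter's code: `unrankPermutation` is a hypothetical name, and the array `splice` makes it O(N^2) rather than the O(N log N) an order-statistic tree would give.

```javascript
// Map an index k (0 <= k < n!) to the k-th permutation of [1..n]
// using factorial-base digits (Lehmer code). BigInt is used since
// n! overflows doubles quickly.
function unrankPermutation(k, n) {
  const items = Array.from({ length: n }, (_, i) => i + 1);
  const result = [];
  let f = 1n;
  for (let i = 1n; i <= BigInt(n); i++) f *= i; // f = n!
  let r = BigInt(k);
  for (let i = BigInt(n); i >= 1n; i--) {
    f /= i;                    // now f = (i-1)!
    const idx = Number(r / f); // which remaining item comes next
    r %= f;
    result.push(items.splice(idx, 1)[0]);
  }
  return result;
}
```

Feeding it `k = 0` yields the identity permutation and `k = n! - 1` the reversed one, so a single uniform draw of `k` gives an unbiased shuffle.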
| vlovich123 wrote:
| The number of permutations of a deck of cards is 2^225 according
| to the article. Permutations grow quite quickly.
| zug_zug wrote:
| Yeah. So something like a 68-digit number. I guess your point
| is this would require a BigInt?
|
| From a performance perspective it's negligible. Like when
| doing RSA, for example, the primes are usually on the order
| of 2^512 or larger.
| im3w1l wrote:
| Fisher Yates is simple, fast and correct. The only area of
| improvement I can think of is reducing cache misses.
| o11c wrote:
| That needs some conditions:
|
| * original Fisher-Yates is slow; Knuth's variation is what
| makes it fast
|
| * Fisher-Yates relies on "generate integer in range", which is
| often implemented badly (with bias, or with unnecessary
| slowness). See the PCG blog for the best-known solution.
|
| https://www.pcg-random.org/posts/bounded-rands.html
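The simplest unbiased "integer in range" construction the linked post discusses is modulo with rejection. A minimal sketch, assuming a `rand32()` source of uniform 32-bit integers (here stubbed with `Math.random`, which a real implementation would replace):

```javascript
// Placeholder 32-bit source; swap in a real PRNG or CSPRNG.
function rand32() {
  return Math.floor(Math.random() * 2 ** 32) >>> 0;
}

// Uniform integer in [0, n): reject draws in the "overhang" above
// the largest multiple of n, so every residue class mod n is
// equally likely. The PCG post covers faster variants of this.
function boundedRand(n) {
  const limit = 2 ** 32 - (2 ** 32 % n); // largest multiple of n <= 2^32
  let x;
  do {
    x = rand32();
  } while (x >= limit);
  return x % n;
}
```

A bare `rand32() % n` is the biased version this guards against: when n does not divide 2^32, the low residues occur one extra time.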
| im3w1l wrote:
| Modern languages usually provide a generate-integer-in-range
| primitive.
|
| Edit: Even C++ seems to have it nowadays in
| std::uniform_int_distribution
| orlp wrote:
| Here is a trivial shuffle algorithm that is completely unbiased
| and only requires an unbiased coin (or random number generator
| giving bits):
|
| 1. Randomly assign each element to list A or list B.
| 2. Recursively shuffle lists A and B.
| 3. Concatenate lists A and B.
|
| To prove it's correct, note that assigning a random real number
| to each element and sorting based on that number is an unbiased
| shuffle. Then we note the above does in fact do that by
| considering the fractional base-2 expansion of the random
| numbers, and noting the above is in fact a base-2 radix sort of
| these numbers. We can sort these random real numbers even though
| they have an infinite number of random bits, as we can stop
| expanding the digits once the prefix of digits is unique (which
| corresponds to the event that a list is down to a single
| element).
|
| I call the above algorithm RadixShuffle. You can do it in base-2,
| but also in other bases. For base-2 you can make it in-place
| similar to how the partition for Quicksort is implemented in-
| place, for other bases you either have to do it out-of-place or
| in two passes (the first pass only counting how many elements go
| in each bucket to compute offsets).
|
| The above can be combined with a fallback algorithm for small N
| such as Fisher-Yates. I believe even though the above is N log N
| it can be faster than Fisher-Yates for larger N because it is
| exceptionally cache-efficient as well as RNG-efficient whereas
| Fisher-Yates requires a call to the RNG and invokes an expected
| cache miss for each element.
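The base-2 recursion described above fits in a few lines. A hedged sketch (not the commenter's implementation, and out-of-place rather than the in-place partition variant they mention), assuming `randomBit()` stands in for an unbiased bit source:

```javascript
// Unbiased bit source (stubbed; a real one would use a PRNG/CSPRNG).
function randomBit() {
  return Math.random() < 0.5 ? 0 : 1;
}

// RadixShuffle, base 2: split on a coin flip, recurse, concatenate.
// Recursion stops when every element's random bit-prefix is unique,
// i.e. each sublist is down to at most one element.
function radixShuffle(arr) {
  if (arr.length <= 1) return arr;
  const a = [], b = [];
  for (const x of arr) (randomBit() === 0 ? a : b).push(x);
  return radixShuffle(a).concat(radixShuffle(b));
}
```

Note the recursion depth is unbounded in the worst case (all elements could keep landing in one list), but it terminates with probability 1, which matches the "expand digits until the prefix is unique" argument.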
|
| ---
|
| Another fun fact: you can turn any biased memoryless coin into an
| unbiased one with a simple trick. Throw the coin twice, if it
| gives HH or TT you throw away the toss, if it's HT or TH you use
| the first toss as your unbiased coin.
|
| This works because if p is the probability that heads comes up
| we have:
|
|     HH: p^2        HT: p(1-p)
|     TH: (1-p)p     TT: (1-p)^2
|
| Naturally, p(1-p) and (1-p)p are equiprobable, thus if we reject
| the other outcomes we have distilled an unbiased coin out of our
| biased coin.
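The von Neumann trick above is a one-loop function. A minimal sketch, with the biased coin simulated by `Math.random` at an assumed p = 0.7:

```javascript
// Memoryless biased coin: returns 1 (heads) with probability p.
function biasedBit(p = 0.7) {
  return Math.random() < p ? 1 : 0;
}

// Von Neumann extractor: toss twice, keep the first toss only when
// the two differ. HT and TH both occur with probability p(1-p), so
// the kept bit is exactly fair; HH and TT are discarded.
function unbiasedBit() {
  for (;;) {
    const a = biasedBit();
    const b = biasedBit();
    if (a !== b) return a;
  }
}
```

The cost is in throughput: a pair is kept with probability 2p(1-p), so a heavily biased coin needs many tosses per output bit.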
| amelius wrote:
| Regarding your second fact: if we need N tosses, then is your
| method the most efficient one?
| orlp wrote:
| Looking into it a bit, it appears that this particular fun
| fact was first thought of (or at least put to paper) by von
| Neumann in "Various techniques used in connection with random
| digits" in 1951. Yuval Peres proves in "Iterating von
| Neumann's procedure for extracting random bits" that if N
| goes to infinity it approaches the entropy limit.
|
| So for particular small choices of N there might be more
| clever schemes, but for large N you can't do a whole lot
| better.
| thaumasiotes wrote:
| > Here is a trivial shuffle algorithm that is completely
| unbiased and only requires an unbiased coin (or random number
| generator giving bits):
|
| How does this compare to a regular Knuth shuffle where, since
| you only have a 1-bit generator, when you need (for example) an
| integer between 0 and 23, you treat the bits you generate as
| the binary expansion of a real number in [0, 1), and generate
| them until floor(24*n) is unambiguous?
|
| (Obviously, the random number generation is more conceptually
| complex, but aside from that.)
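The bit-by-bit scheme described above can be made concrete: refine a binary interval around the unknown real number until floor(n·x) is the same for every x still in the interval. This is a hedged sketch under that reading of the comment; `uniformInt` and `randomBit` are illustrative names.

```javascript
// Single fair bit (stubbed source).
function randomBit() {
  return Math.random() < 0.5 ? 0 : 1;
}

// Uniform integer in [0, n): generate binary digits of a real
// x in [0, 1), tracking the interval [lo/span, (lo+1)/span) that
// contains x, and stop once floor(n * x) is unambiguous.
function uniformInt(n) {
  const N = BigInt(n);
  let lo = 0n, span = 1n;
  for (;;) {
    lo = lo * 2n + BigInt(randomBit());
    span *= 2n;
    const low = (N * lo) / span;            // floor at interval's bottom
    const high = (N * (lo + 1n) - 1n) / span; // floor just below its top
    if (low === high) return Number(low);
  }
}
```

For n = 24 this consumes about 5 bits per draw on average with no bias, at the price of a variable number of bits per call.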
| skybrian wrote:
| I recently stumbled across some other ways that random number
| generation can go wrong.
|
| Suppose you want reproducible results from a seed, along with
| parallelism. Algorithmic random number generators are usually
| mutable and generate a _sequence_ of results, limiting
| parallelism. Rather than a sequence, you want something tree-
| shaped where you can create an independent random stream for each
| child task. In higher-level APIs, a _jump_ or _split_ operator
| can be useful.
|
| Counter-based random number generators [1] seem pretty useful in
| that context. An immutable random number generator works like a
| hash algorithm that maps each input to an output that's difficult
| to predict. The catch is that you must avoid ever using the
| same input twice. You can think of it as allocating random
| numbers from a very large address space in a reproducible way.
| How do you partition the address space, predictably, so that
| every address is used at most once, and nobody runs out?
|
| Giving each child a unique ID and generating a stream from that
| is one way. If the tree is deeper, you'll want a unique seed for
| each _path_.
|
| When a mutable random number generator is copied to a child task
| (or maybe just to an iterator), the same random numbers might be
| generated in two places. Avoiding this is the sort of thing that
| Rust's borrow checker can prevent - borrowing is okay, but you
| want to prevent multiple concurrent ownership.
|
| [1] https://en.wikipedia.org/wiki/Counter-based_random_number_ge...
| IAmLiterallyAB wrote:
| Interesting, but there's one part I'm not sure I agree with.
|
| > Pseudo-random number generators are useful for many purposes,
| but unbiased shuffling isn't one of them.
|
| A properly seeded CSPRNG is perfectly fine at this. And if it's
| not, then all of our cryptography is pretty much screwed. This is
| why in modern kernels, /dev/random and /dev/urandom are the same
| (minus differences in behavior when the initialization isn't
| complete). As D.J. Bernstein put it, it's superstition to not
| trust CSPRNGs:
| https://www.mail-archive.com/cryptography@randombit.net/msg0...
| And if it's good enough for crypto, it's good enough for card
| shuffling.
|
| FYI I am not a cryptographer
___________________________________________________________________
(page generated 2024-06-08 23:00 UTC)