[HN Gopher] Zero Tolerance for Bias
       ___________________________________________________________________
        
       Zero Tolerance for Bias
        
       Author : Harmohit
       Score  : 83 points
       Date   : 2024-06-06 18:58 UTC (2 days ago)
        
 (HTM) web link (queue.acm.org)
 (TXT) w3m dump (queue.acm.org)
        
       | tbrownaw wrote:
        | Trusting a broken library seems like a different _kind_ of error
        | than implementing an algorithm that isn't quite the algorithm
        | you meant to implement.
        
       | zug_zug wrote:
       | Kinda interesting.
       | 
        | There are N! permutations in a shuffle (no duplicates), and
        | there's an algorithm where you pick one random number between
        | 1 and N! and then construct that permutation (though I don't
        | know how to do it better than N log N with an order-statistic
        | tree). I like this because it requires exactly one random
        | number.
       | 
       | A trivial solution in functional programming (for those of us who
       | find this swap stuff really unreadable and error-prone) would be
       | something like:
       | 
        | [1, 2, 3, 4, 5, 6]
        |   .map(x => ({value: x, order: Math.random()}))
        |   .sort((a, b) => a.order - b.order)
        |   .map(x => x.value)
       | 
        | Of course this is N log N, but personally I think it's easy to
        | forget how slowly log N grows. Like log10(number of atoms in
        | the universe) is about 82, so if your dataset is smaller than
        | the number of atoms in the universe you can treat log N as a
        | constant below 82.
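The "exactly one random number" scheme above can be sketched as a decode over the factorial number system (a Lehmer code). This is an illustrative sketch (the function name is mine); the O(n^2) splice is exactly the step an order-statistic tree would reduce to O(n log n):

```javascript
// Decode a single index in [0, n!) into the corresponding
// permutation via the factorial number system (Lehmer code).
// The splice makes this O(n^2); an order-statistic tree would
// bring it down to O(n log n).
function nthPermutation(items, index) {
  const pool = items.slice();
  const n = pool.length;
  // Factorials as BigInts so large n! cannot overflow.
  const fact = [1n];
  for (let i = 1; i <= n; i++) fact.push(fact[i - 1] * BigInt(i));
  let k = BigInt(index);
  const result = [];
  for (let i = n - 1; i >= 0; i--) {
    const digit = k / fact[i]; // BigInt division truncates
    k %= fact[i];
    result.push(pool.splice(Number(digit), 1)[0]);
  }
  return result;
}
```

Feeding it a uniform index in [0, n!) yields a uniform permutation; for example, `nthPermutation([0, 1, 2], 5)` gives `[2, 1, 0]`, the last permutation in Lehmer order.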
        
         | vlovich123 wrote:
          | The number of permutations of a deck of cards is 2^225
          | according to the article. Permutations grow quite quickly.
        
           | zug_zug wrote:
            | Yeah. So something like a 68-digit number. I guess your
            | point is this would require a BigInt?
            | 
            | From a performance perspective it's negligible. When doing
            | RSA, for example, the primes are usually on the order of
            | 2^512.
        
       | im3w1l wrote:
       | Fisher Yates is simple, fast and correct. The only area of
       | improvement I can think of is reducing cache misses.
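For reference, the "simple, fast and correct" variant usually meant here is Durstenfeld's in-place formulation. A minimal sketch, with the caveat that `Math.random()` is only a stand-in bit source, not a CSPRNG:

```javascript
// Durstenfeld's in-place variant of the Fisher-Yates shuffle.
// Math.random() is a placeholder here, not a CSPRNG, and
// Math.floor(Math.random() * k) has its own tiny bias; see the
// replies below on generating bounded random integers properly.
function shuffle(arr) {
  for (let i = arr.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1)); // 0 <= j <= i
    [arr[i], arr[j]] = [arr[j], arr[i]]; // swap element i into place
  }
  return arr;
}
```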
        
         | o11c wrote:
         | That needs some conditions:
         | 
         | * original Fisher-Yates is slow; Knuth's variation is what
         | makes it fast
         | 
         | * Fisher-Yates relies on "generate integer in range", which is
         | often implemented badly (with bias, or with unnecessary
         | slowness). See the PCG blog for the best-known solution.
         | 
         | https://www.pcg-random.org/posts/bounded-rands.html
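The simplest unbiased method surveyed in that post is rejection sampling: draw just enough bits to cover the range and retry out-of-range values. A sketch (function names are mine; the post's multiply-based methods are faster in practice):

```javascript
// Unbiased "integer in [0, n)" from a random-bits source via
// rejection sampling: draw ceil(log2(n)) bits, reject any value
// that lands in [n, 2^bits). Every surviving value is equally
// likely, so there is no modulo-style bias.
function boundedRand(n, randomBits) {
  const bits = 32 - Math.clz32(n - 1); // bits needed to cover 0..n-1
  while (true) {
    const v = randomBits(bits);
    if (v < n) return v; // accept; otherwise retry
  }
}

// Example bit source built on Math.random (not cryptographic).
function randomBits(k) {
  let v = 0;
  for (let i = 0; i < k; i++) v = (v << 1) | (Math.random() < 0.5 ? 1 : 0);
  return v;
}
```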
        
           | im3w1l wrote:
           | Modern languages usually provide a generate integer in range
           | primitive.
           | 
           | Edit: Even C++ seems to have it nowadays in
           | std::uniform_int_distribution
        
       | orlp wrote:
       | Here is a trivial shuffle algorithm that is completely unbiased
       | and only requires an unbiased coin (or random number generator
       | giving bits):
       | 
        | 1. Randomly assign each element to list A or list B.
        | 2. Recursively shuffle lists A and B.
        | 3. Concatenate lists A and B.
       | 
       | To prove it's correct, note that assigning a random real number
       | to each element and sorting based on that number is an unbiased
       | shuffle. Then we note the above does in fact do that by
       | considering the fractional base-2 expansion of the random
       | numbers, and noting the above is in fact a base-2 radix sort of
        | these numbers. We can sort these random real numbers even
        | though they have an infinite number of random bits, because we
        | can stop expanding the digits once each prefix of digits is
        | unique (which corresponds to the event that a list is down to
        | a single element).
       | 
       | I call the above algorithm RadixShuffle. You can do it in base-2,
       | but also in other bases. For base-2 you can make it in-place
       | similar to how the partition for Quicksort is implemented in-
       | place, for other bases you either have to do it out-of-place or
       | in two passes (the first pass only counting how many elements go
       | in each bucket to compute offsets).
       | 
        | The above can be combined with a fallback algorithm for small
        | N, such as Fisher-Yates. I believe that even though the above
        | is N log N it can be faster than Fisher-Yates for large N,
        | because it is exceptionally cache-efficient as well as
        | RNG-efficient, whereas Fisher-Yates requires a call to the RNG
        | and incurs an expected cache miss for each element.
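The three numbered steps above, in their out-of-place base-2 form, can be sketched as follows (an illustrative sketch; `Math.random()` stands in for the unbiased bit source):

```javascript
// RadixShuffle, out-of-place base-2 version: split on one random
// bit per element, recurse, concatenate. Unbiased provided
// coinFlip() is an unbiased bit source; recursion terminates with
// probability 1 (a flip can leave one side empty, but not forever).
function radixShuffle(list, coinFlip) {
  if (list.length <= 1) return list; // digit prefix is now unique
  const a = [];
  const b = [];
  for (const x of list) (coinFlip() ? a : b).push(x);
  return radixShuffle(a, coinFlip).concat(radixShuffle(b, coinFlip));
}

const coin = () => Math.random() < 0.5; // illustrative, not a CSPRNG
```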
       | 
       | ---
       | 
       | Another fun fact: you can turn any biased memoryless coin into an
       | unbiased one with a simple trick. Throw the coin twice, if it
       | gives HH or TT you throw away the toss, if it's HT or TH you use
       | the first toss as your unbiased coin.
       | 
        | This works because if p is the probability that heads comes up
        | we have:
        | 
        |     HH: p^2
        |     HT: p(1-p)
        |     TH: (1-p)p
        |     TT: (1-p)^2
       | 
       | Naturally, p(1-p) and (1-p)p are equiprobable, thus if we reject
       | the other outcomes we have distilled an unbiased coin out of our
       | biased coin.
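The trick reads directly as code (a sketch; the 80% coin is just an example bias):

```javascript
// Von Neumann's trick: distill an unbiased bit from a biased but
// memoryless coin. HH and TT are discarded; HT and TH occur with
// equal probability p(1-p), so the first toss of an unequal pair
// is a fair bit.
function unbiasedFlip(biasedCoin) {
  while (true) {
    const first = biasedCoin();
    const second = biasedCoin();
    if (first !== second) return first; // HT and TH are equiprobable
  }
}

// Example: a coin that comes up heads (true) 80% of the time.
const biased = () => Math.random() < 0.8;
```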
        
         | amelius wrote:
         | Regarding your second fact: if we need N tosses, then is your
         | method the most efficient one?
        
           | orlp wrote:
           | Looking into it a bit, it appears that this particular fun
           | fact was first thought of (or at least put to paper) by von
           | Neumann in "Various techniques used in connection with random
           | digits" in 1951. Yuval Peres proves in "Iterating von
           | Neumann's procedure for extracting random bits" that if N
           | goes to infinity it approaches the entropy limit.
           | 
           | So for particular small choices of N there might be more
           | clever schemes, but for large N you can't do a whole lot
           | better.
        
         | thaumasiotes wrote:
         | > Here is a trivial shuffle algorithm that is completely
         | unbiased and only requires an unbiased coin (or random number
         | generator giving bits):
         | 
         | How does this compare to a regular Knuth shuffle where, since
         | you only have a 1-bit generator, when you need (for example) an
         | integer between 0 and 23, you treat the bits you generate as
         | the binary expansion of a real number in [0, 1), and generate
         | them until floor(24*n) is unambiguous?
         | 
         | (Obviously, the random number generation is more conceptually
         | complex, but aside from that.)
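That bit-by-bit scheme can be made exact with integer arithmetic: after k bits with prefix value v, the real number r lies in [v/2^k, (v+1)/2^k), and we stop as soon as floor(n*r) is the same across that whole interval. A sketch (names are mine; `Math.random()` stands in for the 1-bit generator):

```javascript
// Feed random bits in as the binary expansion of a real r in
// [0, 1) and stop once floor(n * r) is pinned down. After k bits
// with prefix v, r is in [v/2^k, (v+1)/2^k); the floor over that
// half-open interval ranges from floor(n*v / 2^k) up to
// floor((n*(v+1) - 1) / 2^k), so we stop when those agree.
function boundedFromBits(n, nextBit) {
  let v = 0, pow = 1; // prefix value and 2^k
  while (true) {
    v = v * 2 + nextBit();
    pow *= 2;
    const low = Math.floor((n * v) / pow);
    const high = Math.floor((n * (v + 1) - 1) / pow);
    if (low === high) return low; // floor(n * r) is unambiguous
  }
}

const bit = () => (Math.random() < 0.5 ? 1 : 0); // illustrative source
```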
        
       | skybrian wrote:
       | I recently stumbled across some other ways that random number
       | generation can go wrong.
       | 
       | Suppose you want reproducible results from a seed, along with
       | parallelism. Algorithmic random number generators are usually
       | mutable and generate a _sequence_ of results, limiting
       | parallelism. Rather than a sequence, you want something tree-
       | shaped where you can create an independent random stream for each
        | child task. In higher-level APIs, a _jump_ or _split_ operator
        | can be useful.
       | 
       | Counter-based random number generators [1] seem pretty useful in
       | that context. An immutable random number generator works like a
       | hash algorithm that maps each input to an output that's difficult
       | to predict. The problem with this is being careful to avoid using
       | the same input twice. You can think of it as allocating random
       | numbers from a very large address space in a reproducible way.
       | How do you partition the address space, predictably, so that
       | every address is used at most once, and nobody runs out?
       | 
       | Giving each child a unique ID and generating a stream from that
       | is one way. If the tree is deeper, you'll want a unique seed for
       | each _path_.
       | 
       | When a mutable random number generator is copied to a child task
       | (or maybe just to an iterator), the same random numbers might be
       | generated in two places. Avoiding this is the sort of thing that
       | Rust's borrow checker can prevent - borrowing is okay, but you
       | want to prevent multiple concurrent ownership.
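One way to sketch the counter-based, tree-shaped idea: make every draw a pure function of (seed, path, counter), so a child stream is just a longer path. The mixer below is a SplitMix64-style finalizer used as a simplified stand-in for real counter-based generators such as Philox or Threefry; the names and structure here are mine, not from any particular library:

```javascript
// SplitMix64-style 64-bit mixer (a stand-in for a real CBRNG).
function mix64(x) {
  x = BigInt.asUintN(64, x + 0x9e3779b97f4a7c15n);
  x = BigInt.asUintN(64, (x ^ (x >> 30n)) * 0xbf58476d1ce4e5b9n);
  x = BigInt.asUintN(64, (x ^ (x >> 27n)) * 0x94d049bb133111ebn);
  return x ^ (x >> 31n);
}

// A stream is identified by its path from the root; child(i)
// extends the path, carving out a disjoint slice of the input
// space. Every draw is hash(seed, path, counter), so results are
// reproducible and parallel tasks never share an input. The seed
// is a BigInt.
function stream(seed, path = []) {
  let key = seed;
  for (const p of path) key = mix64(key ^ BigInt(p));
  let counter = 0n;
  return {
    next: () => mix64(key ^ counter++),       // one draw per counter value
    child: (i) => stream(seed, [...path, i]), // independent substream
  };
}
```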
       | 
       | [1] https://en.wikipedia.org/wiki/Counter-
       | based_random_number_ge...
        
       | IAmLiterallyAB wrote:
       | Interesting, but there's one part I'm not sure I agree with.
       | 
       | > Pseudo-random number generators are useful for many purposes,
       | but unbiased shuffling isn't one of them.
       | 
       | A properly seeded CSPRNG is perfectly fine at this. And if it's
       | not, then all of our cryptography is pretty much screwed. This is
       | why in modern kernels, /dev/random and /dev/urandom are the same
       | (minus differences in behavior when the initialization isn't
       | complete). As D.J. Bernstein put it, it's superstition to not
       | trust CSPRNGs. https://www.mail-
       | archive.com/cryptography@randombit.net/msg0... And if it's good
       | enough for crypto, it's good enough for card shuffling.
       | 
       | FYI I am not a cryptographer
        
       ___________________________________________________________________
       (page generated 2024-06-08 23:00 UTC)