[HN Gopher] Randomness extractors: making fair coins out of bias...
       ___________________________________________________________________
        
       Randomness extractors: making fair coins out of biased coins
        
       Author : Maro
       Score  : 61 points
       Date   : 2024-09-15 14:50 UTC (8 hours ago)
        
 (HTM) web link (bytepawn.com)
 (TXT) w3m dump (bytepawn.com)
        
       | leeoniya wrote:
       | made me think of, "bend entropy towards definable outcomes"
       | 
       | https://m.youtube.com/watch?v=2QJjMwBAORw
        
       | ur-whale wrote:
        | Does this work with a guaranteed random bit output bandwidth?
       | 
        | These techniques (e.g. von Neumann) all seem to suffer from the
        | same problem: they crank out uniformly distributed random bits
        | from biased sources, but with no guarantee of how long you may
        | have to wait for a bit to come out.
        
         | cedilla wrote:
          | If the coin is biased so that it lands on heads only 1 in a
          | million times, it will take a lot of coin tosses to get some
          | randomness out of it. This isn't a limit of the technique, just
          | a consequence of the low-entropy source.
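          | 
          | A minimal sketch of the trick in Python (my own illustration,
          | not the article's code): read the flips in non-overlapping
          | pairs, keep 01 as 0 and 10 as 1, and discard 00 and 11. With
          | P(heads) = p, each pair survives with probability 2p(1-p), so
          | the expected number of flips per output bit is 1/(p(1-p)).
          | 
          |     import random
          | 
          |     def von_neumann_extract(biased_bits):
          |         """01 -> 0, 10 -> 1; 00 and 11 are discarded."""
          |         it = iter(biased_bits)
          |         for a, b in zip(it, it):   # non-overlapping pairs
          |             if a != b:
          |                 yield a
          | 
          |     def biased_source(p, n, seed=0):
          |         """Simulated i.i.d. bits with P(bit == 1) = p."""
          |         rng = random.Random(seed)
          |         return [1 if rng.random() < p else 0 for _ in range(n)]
          | 
          |     flips = biased_source(p=1e-3, n=2_000_000)
          |     out = list(von_neumann_extract(flips))
          |     print(len(out), "bits,", sum(out) / len(out), "ones")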
        
           | remram wrote:
            | Does it take a fixed long time, or a random and potentially
            | unbounded time?
        
             | big-green-man wrote:
              | The latter (though it doesn't always have to be long; the
              | waiting time per bit follows a geometric distribution). If
              | it were the former, then randomness would be predictable in
              | some sense. It's similar to the halting problem.
        
               | remram wrote:
               | I'm not sure what you mean by "randomness would be
               | predictable". For example, it is possible to do this if
               | you know the bias.
        
         | wakawaka28 wrote:
         | If you know the bias, then you can surely estimate the bit
         | rate. Of course you could get very unlucky and wait a long time
         | for bits, even with an unbiased underlying source. But that's
         | the nature of these methods.
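          | 
          | Rough back-of-the-envelope sketch in Python (assuming the von
          | Neumann pairing scheme): each non-overlapping pair of flips
          | yields an output bit with probability q = 2p(1-p), so the wait
          | per output bit is geometric with mean 1/q pairs, and the chance
          | of waiting more than k pairs is (1-q)^k.
          | 
          |     p = 1e-6                    # P(heads) of the biased coin
          |     q = 2 * p * (1 - p)         # chance a pair yields a bit
          |     mean_pairs = 1 / q          # expected pairs per output bit
          |     tail = (1 - q) ** (10 * mean_pairs)  # P(wait > 10x mean)
          |     print(mean_pairs, tail)     # ~5e5 pairs, tail ~ 4.5e-5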
        
           | ur-whale wrote:
           | Or you could pump the heavily biased bit stream into a modern
           | cipher and get a - for all intents and purposes - super high
           | quality, completely unbiased random stream with the same
           | bandwidth characteristics as the input.
           | 
           | And, as a matter of fact, much faster than the input if
           | required.
        
             | Ar-Curunir wrote:
             | Things are not so simple: how do you key the cipher? with
             | the same randomness as the input string? If so, then that
             | input distribution could be a bad one for the cipher, in
             | that it might fail to generate a sufficiently scrambled
             | output.
             | 
              | This can be mitigated if you use an unkeyed hash function
             | instead, but even for these we don't have any provable
             | guarantees unless you assume the function is a random
             | oracle, which is an ideal object that cannot exist.
        
               | chongli wrote:
               | Yeah, if we're starting with the "1 in a million" unfair
               | coin then most of the input strings are going to be all 0
               | or all 1 which means the cipher is going to give the same
               | sequence every time. Not useful!
        
         | LegionMammal978 wrote:
         | For certain input distributions, a guaranteed output bandwidth
         | is impossible. For instance, suppose that you had a function
         | that, given a vector of _N_ integers chosen uniformly from
         | (0,1,2), returns a single unbiased output bit. Then, exactly
          | half of the 3^_N_ input vectors would correspond to each
          | output bit. But this is impossible, since 3^_N_ is always odd,
         | so this (0,1,2)-function cannot exist for any fixed _N_.
         | 
         | Now, suppose that you had a function that turned _N_ input bits
         | with a 1:2 bias into a single bit with no bias. Then you could
         | use this to implement an (0,1,2)-function as above, by
         | collapsing inputs 1 and 2 into input 1. Since such an
         | (0,1,2)-function is impossible, the 1:2-biased function is
         | similarly impossible.
         | 
         | This argument holds for any i.i.d. input source that can be
         | generated from a uniform-integer source with an odd number of
         | possible values.
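          | 
          | The parity argument can be checked by brute force for tiny _N_
          | (a sketch; exhaustive search is only feasible because _N_ is
          | tiny): for _N_ = 1 and 2, no function from length-_N_ vectors
          | over (0,1,2) to a single bit maps exactly half of the 3^_N_
          | equally likely inputs to each output.
          | 
          |     from itertools import product
          | 
          |     for N in (1, 2):
          |         inputs = list(product(range(3), repeat=N))  # 3^N vectors
          |         balanced = any(
          |             2 * sum(f) == len(inputs)  # half of inputs map to 1
          |             for f in product((0, 1), repeat=len(inputs))
          |         )
          |         print(N, balanced)             # False: 3^N is odd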
        
         | Animats wrote:
         | That's a good question. You cannot get a non-statistical
         | guarantee of an output rate from the Von Neumann biased coin
         | toss, because you can potentially get a long run of the same
         | toss result, which stalls output bit generation. That might be
         | a generic limitation of all similar de-biasing approaches. Has
         | that been proven or disproven?
        
           | LegionMammal978 wrote:
           | See my sibling reply [0]: for many input distributions, you
           | can never put a fixed upper bound on the number of inputs you
           | need to get a single output bit.
           | 
           | [0] https://news.ycombinator.com/item?id=41549648
        
       | bjornsing wrote:
       | > Interestingly, this is the best possible approach, whether we
       | know the bias of the input bit stream or not.
       | 
       | Would love to see a proof of that. It feels a bit unintuitive to
       | me.
       | 
       | (But on second thought that might be because I'm thinking of
       | cases of correlated bits.)
        
         | LegionMammal978 wrote:
         | In general, I think you could show that an 'arithmetic coding'
         | approach is optimal, if the distribution of each bit given the
         | previous bits is known. From there, you'd just have to show
         | that the von Neumann approach for i.i.d. bits is no worse on
         | average.
        
           | Darmani wrote:
           | This is correct, and I came to the comments to say the same
           | thing. It takes some work to implement without arbitrary-
           | precision floating point, but arithmetic coding can make use
           | of the full entropy in the input stream, whereas the approach
           | in the article discards a lot of information.
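            | 
            | A quick sketch of how much the pairing approach leaves on
            | the table (standard formulas, not from the article): a coin
            | with bias p carries H(p) bits of entropy per flip, while von
            | Neumann yields only p(1-p) output bits per flip on average.
            | 
            |     from math import log2
            | 
            |     def entropy(p):
            |         """Shannon entropy of a p-biased coin, bits/flip."""
            |         return -p * log2(p) - (1 - p) * log2(1 - p)
            | 
            |     def von_neumann_rate(p):
            |         """Expected output bits per input flip (pairing)."""
            |         return p * (1 - p)
            | 
            |     for p in (0.5, 0.25, 0.1, 0.01):
            |         print(p, round(entropy(p), 3),
            |               round(von_neumann_rate(p), 3))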
        
         | atq2119 wrote:
         | It's obviously false. If you know that the bias is 0.5, you can
         | just pass the input stream through unmodified. You can also
         | construct better codes for other biases.
        
           | ralegh wrote:
            | Depends how you define best. I assume best in the article
            | means 'output is close to 50/50 and independent', in which
            | case the 01/10 solution is already optimal given the
            | assumptions.
            | 
            | If you define best as the most efficient at converting inputs
            | to outputs, as well as the output being 50/50, then there are
            | better ways, such as your example.
        
           | contravariant wrote:
           | It's worded quite badly, but I think it's the best you can do
           | given 2 independent bits, except for the rare case where 00
           | and 11 are equally likely.
        
       | sevenoftwelve wrote:
       | The article is interesting, but it misses the most practical and
       | unambiguously safe way to generate streams of random data: Use
       | cryptography.
       | 
       | To generate a stream of random data, use a hash function with
       | arbitrary-length output (XOF) such as blake2x[^0] or
       | shake256[^1]. Make sure your key contains at least 256 bits of
       | entropy. Absolutely never use a key with less than 128 bits of
       | entropy.
       | 
       | Since it's impossible to know how much entropy there is in a key,
       | you probably want to use something like the Fortuna RNG[^2].
        | Swap out its sha2/AES-based construction for your XOF; Bruce
        | Schneier designed Fortuna back when XOFs were harder to come by.
       | 
       | If you want more performance, you can use blake2 to compress your
       | input seed into 256 bits and generate the random stream using
       | chacha20[^3].
       | 
       | All of this is usually handled by the Linux kernel[^4], so it's
       | best to just use the getrandom(2)[^5] system call or just read
       | from /dev/urandom[^6]. If you are writing a Rust program, you can
       | use the rand[^7] crate, which uses a mixed approach reading a
       | seed from the operating system and expanding it in-process using
       | chacha[^8]. This is a valid strategy.
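        | 
        | A minimal sketch of the seed-then-expand idea using Python's
        | hashlib (an illustration, not the Fortuna construction and not
        | what the rand crate does internally): compress whatever entropy
        | inputs you have into one state with an XOF, then stretch it.
        | 
        |     import hashlib, os
        | 
        |     h = hashlib.shake_256()
        |     h.update(os.urandom(32))     # 256-bit seed from the OS
        |     h.update(b"any further entropy inputs go here")
        | 
        |     stream = h.digest(1 << 20)   # 1 MiB of pseudorandom bytes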
       | 
       | I am omitting some subtleties[^10] about mathematical definitions
       | of randomness extractors as used by the author of the article.
       | When you are using a cryptographic approach, you are dealing with
       | a complexity-theory based security notion[^9], which does not
       | precisely equate to creating a stream with a specific amount of
       | entropy. Everywhere - except for a physics or a math paper
       | dealing with information theory - I would call this a
        | technicality. For most intents and purposes, cryptographic
        | security notions are the most robust real-world conceptions of
        | randomness available.
       | 
       | [^0]: https://www.blake2.net/
       | 
       | [^1]: https://csrc.nist.gov/pubs/fips/202/final (shake256 is part
       | of the SHA-3 standard)
       | 
       | [^2]: https://www.schneier.com/academic/fortuna/
       | 
       | [^3]: https://protonvpn.com/blog/chacha20
       | 
       | [^4]: https://lwn.net/Articles/884875/
       | 
       | [^5]: https://man.archlinux.org/man/getrandom.2
       | 
       | [^6]: https://man.archlinux.org/man/urandom.4
       | 
       | [^7]: https://crates.io/crates/rand
       | 
       | [^8]: https://docs.rs/rand/0.8.5/rand/rngs/struct.StdRng.html
       | 
       | [^9]:
       | https://en.wikipedia.org/wiki/Security_of_cryptographic_hash...
       | 
       | [^10]: In particular, PRFs are not guaranteed to output tokens
       | with a certain amount of entropy - if I recall correctly -
       | because they can map two inputs to the same output.
       | 
       | ---
       | 
       | I am the main author of the Rosenpass[^11] post-quantum secure
       | key exchange for WireGuard. My expertise comes from developing
       | this protocol, as well as a couple of years of engagement with
       | the real-world cryptography community and from my own scientific
       | research on cryptography and secure implementations of
       | cryptography.
       | 
       | [^11]: https://rosenpass.eu/
        
       | travisjungroth wrote:
       | It would be interesting to combine this with something that
       | detects bias on a non-deterministic stream. So in one shot, it
       | takes a stream of unknown bias and emits an unbiased stream. The
       | closing paragraph says that's impossible, but the tradeoff is you
       | are only sure the output is unbiased with some amount of
        | confidence. I think you'd also need a buffer, so you can detect
        | the bias before emitting output.
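        | 
        | A sketch of the "confidence" part (a plain normal-approximation
        | interval; the function name and z value are just illustrative):
        | buffer n bits, estimate the bias, and quantify how sure you are.
        | 
        |     from math import sqrt
        | 
        |     def estimate_bias(bits, z=2.58):   # z=2.58 ~ 99% confidence
        |         n = len(bits)
        |         p_hat = sum(bits) / n
        |         half_width = z * sqrt(p_hat * (1 - p_hat) / n)
        |         return p_hat, half_width   # bias ~ p_hat +/- half_width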
        
       | coppsilgold wrote:
        | If you know that a source of randomness contains entropy
        | (unpredictable bits) but you don't know how much (e.g. a digital
        | camera's output, unless heavily processed, will contain random
        | sensor noise), the safest thing to do is pipe it into a
        | cryptographic construct such as a hash or a sponge.
       | 
        | Once you believe you have piped in enough, you use the state of
        | the cryptographic primitive as the seed for further random bit
        | generation. The Linux kernel uses a sponge (to accumulate), a
        | hash function (to consolidate) and a stream cipher (to output)
        | to 'convert' events with some degree of randomness into an
        | 'infinite' stream of cryptographically secure random bits.
       | 
        | To build some intuition, imagine taking a raw 1MP photo with a
        | camera sensor and feeding the lossless file to sha256sum. You get
        | a 256-bit string, and the sensor noise in the photo is sufficient
        | to secure the result. To brute-force your sha256 result, an
        | attacker would need to model all the degrees of freedom involved
        | in taking photos in the world, plus the sensor's noise
        | production, and build a simulator for your camera; the input will
        | almost certainly (unless the sensor is compromised or not really
        | raw) contain far more degrees of freedom than 256 bits.
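        | 
        | That pipeline as a short Python sketch (the file name is made
        | up): hash the raw capture down to 256 bits, exactly what
        | sha256sum does, then expand that seed whenever more random
        | bytes are needed.
        | 
        |     import hashlib
        | 
        |     with open("raw_capture.dng", "rb") as f:  # hypothetical file
        |         seed = hashlib.sha256(f.read()).digest()
        | 
        |     out = hashlib.shake_256(seed).digest(4096)  # expand the seed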
        
       ___________________________________________________________________
       (page generated 2024-09-15 23:00 UTC)