[HN Gopher] Need a PRNG? Use a CSPRNG
___________________________________________________________________
Need a PRNG? Use a CSPRNG
Author : tczajka
Score : 32 points
Date : 2023-11-25 13:02 UTC (9 hours ago)
(HTM) web link (sortingsearching.com)
(TXT) w3m dump (sortingsearching.com)
| boulos wrote:
| It's been over a decade since I looked at this, but in real-time
| Monte Carlo path tracing the speed of your RNG is material. In
| simple enough scenes, probably something like 5-10% IIRC when
| using MT.
|
| I appreciate that you had a note about performance, but you were
| pretty quick to dismiss that there's a large quality difference
| between rand() and any modern PRNG. Plenty of these are fine
| enough, and paying a 4x or greater perf hit for a CSPRNG may
| be a mistake.
|
| I _do_ agree that if you don't know what you're going to use
| them for or don't know if you have a performance problem that you
| should just use a CSPRNG. The potential for someone to
| accidentally cause a security bug is not worth it. If they decide
| to do so intentionally, that's on them.
|
| tl;dr: maybe "Need a PRNG? Start with one that's
| cryptographically secure"?
| nullc wrote:
| There are generators with rand()-like performance but much
| better properties.
|
| for randomized search I'm a fan of xoshiro256++ periodically
| reseeded by rdrand.
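|
| A minimal sketch of that pattern (illustrative only; it uses the
| rand_xoshiro crate and reseeds from OsRng rather than calling
| rdrand directly, and the objective function is a stand-in):
|
|     use rand::rngs::OsRng;
|     use rand::{Rng, RngCore, SeedableRng};
|     use rand_xoshiro::Xoshiro256PlusPlus;
|
|     fn randomized_search(iters: u64) -> u64 {
|         let mut seed = [0u8; 32];
|         OsRng.fill_bytes(&mut seed);
|         let mut rng = Xoshiro256PlusPlus::from_seed(seed);
|         let mut best = u64::MAX;
|         for i in 0..iters {
|             if i % 1_000_000 == 0 {
|                 // periodic reseed from the OS entropy source
|                 OsRng.fill_bytes(&mut seed);
|                 rng = Xoshiro256PlusPlus::from_seed(seed);
|             }
|             // stand-in for scoring a real candidate solution
|             best = best.min(rng.gen::<u64>());
|         }
|         best
|     }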
|
| Obviously if performance isn't a concern, a CSPRNG should be
| the default. But it often is a concern.
|
| Traditional 'fast' PRNGs have simple algebraic structure that
| is absolutely known to cause wrong results; they're not worth
| it compared to modern fast PRNGs.
|
| (I wouldn't ever use MT today; it's not on the Pareto frontier
| of performance vs. quality and has a bad cache footprint.)
|
| > tl;dr: maybe "Need a PRNG? Start with one that's
| cryptographically secure"?
|
| I agree, but: I think if people do that, they'll often find
| that their whole process is now 20% slower than when they used
| rand(), and then they'll go back to rand(). So I think it's
| also important to make people aware of fast generators with
| better properties.
| fanf2 wrote:
| There are several PRNGs that are faster and more random than
| Mersenne Twister.
| tczajka wrote:
| I can just about believe that this kind of scenario _might_ be
| a very rare exception where the performance difference is
| relevant and you might want to do something else.
|
| Still, color me skeptical.
|
| "In simple enough scenes"? OK perhaps, but we don't want to
| only run ray tracing for very simple scenes. Simple scenes are
| those cases where you probably don't need much calculation, so
| does it really matter that 5-10% of that simple calculation is
| RNG? It's the complicated computation-heavy scenes that matter!
|
| Also, one would have to see the code to judge this. For
| instance, maybe that code was wasting random bits a lot?
| nullc wrote:
| In the sorts of searches I often do, like fitting integer
| solutions to fast function approximations, searching for
| error-correcting codes, randomized SAT-like problems, or
| simulating usage of data structures to benchmark them,
| ChaCha is slow enough to meaningfully influence (even
| dominate) the runtime.
|
| I wouldn't want to use rand() because it's extremely likely
| that it will produce actually bad results.
|
| I agree one should use a CSPRNG like chacha (or rdrand) if
| its runtime is insignificant, but then the question is: what's
| the best choice when it's not insignificant? Using rand() is a
| bad idea.
| nicole_express wrote:
| I'm going to keep using quick insecure PRNGs for my NES games
| rather than write a 6502 implementation of a CSPRNG... I guess
| this isn't something that's safe to assume for all game
| development by any means, but in my case if someone really can
| predict the future actions based off of visual observation they
| deserve the slight advantage they get.
| mcraiha wrote:
| I hope you have already watched "Dragon Warrior by
| NEScardinality", which shows exactly what that slight advantage
| can be: https://www.youtube.com/watch?v=Bgh30BiWG58&t=428s
| SuchAnonMuchWow wrote:
| This exploits the fact that the game uses a PRNG instead of a
| true RNG, and it works regardless of whether the PRNG is a
| CSPRNG or not.
| barathr wrote:
| Even simpler than implementing a cryptographic PRNG yourself,
| just use /dev/urandom, which will provide an infinite, non-
| blocking stream of cryptographically-strong pseudorandom bytes.
| All platforms currently have known-good implementations of
| /dev/urandom -- Linux uses ChaCha, and macOS and the BSDs use
| Ferguson and Schneier's Fortuna.
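|
| Reading from it is a couple of lines in most languages; a Rust
| sketch that just fills a buffer:
|
|     use std::fs::File;
|     use std::io::Read;
|
|     fn urandom_bytes(n: usize) -> std::io::Result<Vec<u8>> {
|         let mut buf = vec![0u8; n];
|         File::open("/dev/urandom")?.read_exact(&mut buf)?;
|         Ok(buf)
|     }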
| steponlego wrote:
| They weakened it recently in Linux again. For over a decade
| there was a badly bugged PRNG in the Linux kernel, it was
| discovered and replaced with a more costly one which worked
| great. Then, only a short time ago, they replaced that with one
| of... shady provenance. You're better off writing your own PRNG
| on that platform IMHO.
| fanf2 wrote:
| Jason Donenfeld (author of WireGuard) replaced Linux's SHA-1
| based PRNG (remember, SHA-1 is cryptographically broken) with
| BLAKE2. What is shady about it?
|
| You can't get cryptographically secure random numbers without
| platform support, so it's really bad to tell people to avoid
| the kernel CSPRNG.
| steponlego wrote:
| I simply don't trust NSA people and those who take their
| money. Why would you? We've seen nothing but shady moves
| from them in this space.
| tptacek wrote:
| What are you talking about? Jason Donenfeld is the author
| of WireGuard, the extraordinarily popular VPN protocol
| that cannot use NIST cryptography (it does no
| negotiation, and is built on a version of Noise that uses
| ChaPoly and 25519). The change that was just described to
| you was a shift from NIST cryptography to non-NIST
| cryptography.
| schoen wrote:
| > that cannot use NIST cryptography
|
| Do you mean as a matter of Donenfeld's engineering
| decisions (that those algorithms are unavailable in
| WireGuard)?
| fanf2 wrote:
| /dev/urandom is _really_ slow compared to a userland CSPRNG,
| though. And if you are doing simulations or fuzz testing, you
| need to be able to seed your PRNG to get reproducible results.
| isilofi wrote:
| For many kinds of Monte Carlo algorithms, CSPRNGs are stupidly
| slow. The author compares two handpicked examples of a fast
| CSPRNG and a very slow PRNG, arriving at a factor of 4. In
| practice, e.g. comparing to very simple stuff like multiply-add
| RNGs, it is more like a factor of 4000. Only to then claim that
| "But that would only be true if generating random bits was the
| hot spot, the bottleneck of your program. It never is in
| practice."
|
| E.g. for the usual example of Monte Carlo integration, you pick a
| point randomly, usually by running your RNG for both coordinates.
| Then you evaluate your characteristic function with that point as
| an input. Very often, the function will have a runtime that is in
| the range of a few hundred FPU instructions or less. Comparable
| to the evaluation of your CSPRNG. So in the end, you usually get
| a 40 to 50% faster runtime by just using a slow PRNG instead of a
| CSPRNG. Not to mention the few extra percent you will gain with a
| fast PRNG.
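|
| To make the shape of that loop concrete, here is a toy Monte
| Carlo estimate of pi (Rust sketch, generic over the generator;
| the per-sample "work" here is only a multiply-add and a compare,
| far cheaper than the few hundred FPU instructions above, which
| makes the RNG cost stand out even more):
|
|     use rand::Rng;
|
|     fn estimate_pi<R: Rng>(rng: &mut R, samples: u64) -> f64 {
|         let mut inside = 0u64;
|         for _ in 0..samples {
|             // two RNG calls per sample, one per coordinate
|             let (x, y): (f64, f64) = (rng.gen(), rng.gen());
|             if x * x + y * y < 1.0 {
|                 inside += 1;
|             }
|         }
|         4.0 * inside as f64 / samples as f64
|     }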
|
| And it doesn't stop there. CSPRNG libraries are often optimized
| to be side-channel free and cryptographically safe. Even if you
| were to use a CSPRNG, you are leaving performance on the table by
| using stuff with security properties you will never ever need,
| that can easily be optimized out for a 30% gain in the CSPRNG
| parts.
|
| And yes, for simulations a few percentage points are relevant.
| Those "few" percent can mean days of runtime, thousands of
| currency units in power and hardware cost, and tons of CO2 in
| pollution.
| tczajka wrote:
| It's not true that I cherry-picked a slow PRNG for my 7 GB/s
| number. In fact I selected the second-fastest PRNG on that page
| (because it's popular)! The fastest one is 8 GB/s. PCG32 is 3
| GB/s.
|
| Your 4000x factor speed up for a linear-congruential generator
| is just a completely false number.
|
| Yes I did pick ChaCha20 for its speed -- it's designed for
| speed!
|
| "A few hundred FPU instructions" in your Monte Carlo is not
| comparable to generating a number with ChaCha20. If you need a
| few hundred FPU instructions per number, you will be running a
| lot slower than 2 GB/s. ChaCha only requires a few cycles per
| byte.
|
| I agree that you can optimize out the side-channel free part
| for non-crypto-purposes. That's a good thing! I recommend doing
| that.
| josephg wrote:
| I was curious. The Rust rand library has implementations of a
| lot of RNGs - PRNGs and CSPRNGs - all high quality as far as I
| can tell.
|
| They agree with your numbers: their fast, high-quality PRNG
| implementations run at about 8 GB/s and ChaCha at about
| 2 GB/s:
|
| https://rust-random.github.io/book/guide-rngs.html
| espadrine wrote:
| There are ways to make much faster PRNGs; I designed one
| that reaches 53 GB/s[0] (0.06 cycles per byte) with better
| quality than those listed in the article.
|
| But I agree with tczajka: good engineering practice ought
| to be to start with a CSPRNG. First make it work, then make
| it fast. With truly random data, you know your algorithm
| works, and can then check if there is a quality loss when
| switching away. The performance of CSPRNGs is so optimized
| that they are seldom the bottleneck anyway.
|
| [0]: https://github.com/espadrine/shishua#comparison
| josephg wrote:
| Cool! I wonder if the rand crate would be interested in
| adding this. I love all their sampling functions -
| shuffle, choose, uniform, etc. But it looks like shishua
| is way faster than any of their rngs.
| pclmulqdq wrote:
| The state of the art in super-fast PRNGs is about 0.3 cycles
| per byte at the moment. I believe this is done with SIMD
| versions of the xoroshiro algorithms right now. 0.3 cycles
| per byte compared to "a few" cycles per byte is an order of
| magnitude difference in throughput.
|
| Here's the comparison from the maintainers of Julia:
| https://prng.di.unimi.it/#shootout
|
| Still, most crypto libraries are designed with extreme
| performance in mind, and very few PRNG libraries are. You can
| do a lot better than 0.3 cycles/byte and still beat the NIST
| test if you try for speed.
| hdevalence wrote:
| I haven't paid close attention recently but that doesn't
| seem that far off of performance available via (hardware
| accelerated) AES?
|
| Looking at https://eprint.iacr.org/2018/392.pdf , it seems
| like:
|
| - Intel CPUs can use AESNI to do AES at 0.64 cpb
|
| - AMD Zen cores have two AESNI cores and can achieve 0.31 cpb
|
| - Vectorized AES instructions (supposed to ship in Ice Lake
| five years ago, but maybe a casualty of Intel's AVX512
| mishaps) were expected to bring it down to 0.16 cpb
|
| In some sense this isn't a "fair" comparison in that it's
| fast because there's hardware acceleration, but that
| doesn't really matter, the hardware is there so it might as
| well be used.
| pclmulqdq wrote:
| Yeah, chacha was an odd choice because AES is a lot
| faster on CPUs with acceleration. Just doing a few AES
| rounds would be a pretty fast, good PRNG.
| dpwm wrote:
| For those interested in using AES with reduced rounds as
| a PRNG, it is covered in the paper "Parallel Random
| Numbers: As Easy as 1, 2, 3" by John Salmon et al.
|
| https://www.thesalmons.org/john/random123/papers/random12
| 3sc...
| isilofi wrote:
| Upon a closer look, do not trust the numbers on https://rust-
| random.github.io/book/guide-rngs.html in any way; they are
| clearly bogus and implausible. Their figure for their
| "StepRng", which is just a counter, is 51 GB/s. Their XorShift
| RNG at 5 GB/s, which is just a XOR and a shift, is slower than
| Xoshiro at 7 GB/s, which is a XOR, a shift and a rotate, one op
| more. Both XorShift and Xoshiro should actually be of
| comparable performance to a counter of the same width because
| with modern CPUs, all those trivial bit operations like
| shift, rotate and XOR are sub-cycle microops. And all three
| of those should either be memory-bound and therefore of the
| same performance, or quite a bit faster than memory-bound if
| they just measure the in-register performance.
|
| > Your 4000x factor speed up for a linear-congruential
| generator is just a completely false number.
|
| > Yes I did pick ChaCha20 for its speed -- it's designed for
| speed!
|
| One ChaCha20 block takes 20 rounds, each of which consists of 4
| QR (quarter round) operations. A QR is 4 additions, 4 XORs
| and 4 ROTLs, so 12 instructions on 32bit values. Multiply
| that together and you arrive at 960 operations per block
| (actually a handful more for the counter, maybe the round
| loop and stuff like that, but not a lot), each block gives
| you 16 uint32 values. So 60 instructions per uint32 or 15
| instructions per byte.
|
| A multiply-add generator takes only 2 instructions (you could
| use fused-multiply-add if available, but I'll leave that out
| as I left out sub-microop-rotate before, just to not
| overcomplicate things) per uint32 or half an instruction per
| byte. Yes, that is not yet a hyperbolic factor of 4000.
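|
| For reference, the multiply-add generator being counted here is
| essentially a 32-bit LCG step, roughly (the constants are the
| Numerical Recipes parameters, chosen only for illustration):
|
|     #[inline]
|     fn mul_add_next(state: &mut u32) -> u32 {
|         // one MUL and one ADD per output word
|         *state = state.wrapping_mul(1664525).wrapping_add(1013904223);
|         *state
|     }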
|
| But then you'll have to use your random values. Since your
| ChaCha20 random number stream only comes in blocks, on many
| CPU architectures, you will have all your registers full with
| your resulting block. Meaning that for the subsequent
| calculation, you have to store those random numbers somewhere
| or throw them away, do your other calc, then load the
| randomness again, etc. So you will always pay a penalty for
| cache and memory accesses and you will always have
| unnecessary register pressure. Even a L1 cache access will
| cost you about 4 cycles of access latency, other cache levels
| are far worse. Which means that it probably won't be 4000
| yet, but a lot more.
|
| Now we'll arrive at "yes, but somebody said ChaCha20 is
| roughly 1 cycle per byte!". Which isn't wrong, but you have
| to read carefully: 1 _cycle_, not 1 _instruction_. That
| benchmark relies on calculating multiple ChaCha20 blocks in
| parallel, because a modern CPU has multiple execution units
| and thus can execute multiple independent instructions within
| one cycle. There is also SIMD, where one instruction can
| operate on multiple pieces of data. But to be fair, we also
| need to do this with our multiply-add-RNG. And where I can
| have 16 registers of 32bits calculating one ChaCha20 block, I
| can also have 16 of the same 32bit registers calculating 16
| multiply-add random numbers in parallel.
|
| Thus giving us 60 cycles per ChaCha20 uint32 vs. 0.125 cycles
| (2/16) per multiply-add uint32. That is a factor of 480, not
| taking possible memory or cache penalties into account,
| because that really depends on the computation between the
| randomness steps. Still not 4k, I admit, that was hyperbole.
| glitchc wrote:
| An alternative to the post is to take any PRNG's output and
| calculate the SHA3 hash of it, then use the hash as the random
| number. If SHA3 is not available, use AES. In either case, all
| statistical weaknesses and side-channel attacks will be
| eliminated.
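|
| A minimal sketch of that whitening idea (using the RustCrypto
| sha3 crate; the raw input would come from whatever PRNG you
| already have):
|
|     use sha3::{Digest, Sha3_256};
|
|     // Hash a raw PRNG output; the 32-byte digest is used as the
|     // random value.
|     fn whiten(raw: u64) -> [u8; 32] {
|         let mut out = [0u8; 32];
|         out.copy_from_slice(&Sha3_256::digest(raw.to_le_bytes()));
|         out
|     }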
|
| The platform is far more likely to have support for either of
| those in the system libraries, than a bespoke CSPRNG, which would
| need to be packaged with your app/game.
| josephg wrote:
| That might be convenient so you don't have to go fishing for
| 3rd party libraries - which is especially a problem in C/C++.
| But I expect the result will be much slower than using an
| optimized CSPRNG like ChaCha.
| glitchc wrote:
| Well, ChaCha is closely related to a hash function: BLAKE, which
| is built around the ChaCha permutation, was a SHA-3 finalist but
| was beaten by Keccak in the final spec. Additionally, aes-ctr-128
| has the same performance as ChaCha (1.5 GB/s).
|
| All I was trying to illustrate was that in general, pairing a
| PRNG with a block cipher or hash function is sufficient to
| create a CSPRNG, and any developers worried about their PRNGs
| can couple rand() with a readily available cipher/hash in the
| system libraries. After all, the blog is explicit about not
| requiring a cryptographically secure RNG for most
| applications.
| alright2565 wrote:
| It would be better to simply use a counter and a random seed in
| that case. Who knows when your PRNG will loop around, but your
| counter will always loop at 2^32.
|
| At which point, if you use AES, you've just implemented the
| AES-CTR csprng that TFA suggests.
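|
| A bare-bones version of that counter construction might look like
| this (RustCrypto aes crate, 0.8-style API assumed; key handling
| and buffering omitted):
|
|     use aes::Aes128;
|     use aes::cipher::generic_array::GenericArray;
|     use aes::cipher::{BlockEncrypt, KeyInit};
|
|     // Encrypt an incrementing counter under a random key; each
|     // block yields 16 fresh pseudorandom bytes.
|     fn ctr_block(key: &[u8; 16], counter: u128) -> [u8; 16] {
|         let cipher = Aes128::new(GenericArray::from_slice(key));
|         let mut block =
|             GenericArray::clone_from_slice(&counter.to_le_bytes());
|         cipher.encrypt_block(&mut block);
|         let mut out = [0u8; 16];
|         out.copy_from_slice(&block);
|         out
|     }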
| zuzun wrote:
| The sponge design of SHA-3 makes it a CSPRNG, basically: you can
| keep squeezing output from it after you've fed the data in.
| glitchc wrote:
| That's a great point, even better.
| ribit wrote:
| The reason not to use CSPRNG is performance and efficiency. There
| are many applications where you only need randomness, and don't
| care about the resiliency provided by the CSPRNG. I mean, I'm not
| going to run a cryptographically secure generator on a GPU just
| to scatter a few rays for my raytracer.
|
| IMO, suggesting that a CSPRNG should be the default choice is
| bad engineering advice. This leads to software burning processor
| cycles for no reason at all. I agree with the author that the
| default PRNGs provided by many environments are of very poor
| quality, but that's hardly a reason to choose something you don't
| need.
| josephg wrote:
| My biggest use case for random numbers is for fuzz testing. I
| fuzz test all over the place - any time I have some clear
| invariants and some complex code to test, I'll generate random
| data in a loop and make sure the invariants always hold. This
| finds _so many bugs_.
|
| But for this kind of work, a good PRNG is better. The reason is
| simple: when I find a failing test, I can print out the seed that
| generated that test. Then I can rerun just that one seed (so the
| error shows up immediately) and debug. Every time I rerun my
| program, a fixed seed means it will use the same input and fail
| in the same way. Perfect.
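|
| The skeleton of such a loop might look like this (Rust sketch
| with rand_chacha; check_invariants is a hypothetical stand-in
| for the real property under test):
|
|     use rand::{Rng, SeedableRng};
|     use rand_chacha::ChaCha8Rng;
|
|     // stand-in for the real invariant check
|     fn check_invariants(input: &[u8]) -> bool {
|         !input.is_empty()
|     }
|
|     fn fuzz_round(seed: u64) -> bool {
|         let mut rng = ChaCha8Rng::seed_from_u64(seed);
|         let input: Vec<u8> = (0..64).map(|_| rng.gen()).collect();
|         check_invariants(&input)
|     }
|
|     fn main() {
|         for _ in 0..1_000 {
|             let seed: u64 = rand::random();
|             // a failure prints the seed, so the case can be replayed
|             assert!(fuzz_round(seed), "failing seed: {seed}");
|         }
|     }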
|
| High quality, modern PRNGs shouldn't have the kind of bugs that
| cause the problem listed at the start of this article. I think
| part of the problem is that C has so many ways to get random
| numbers that it's hard to tell which are high quality, and on
| which platforms.
|
| Rust's rand crate is the poster child for me of what this looks
| like done well. Here's a bunch of PRNGs and CSPRNGs, implemented
| and listed in a simple table for you to pick which one you want
| to use. For each RNG they show rough performance numbers, and the
| PRNGs have a rough quality guide:
|
| https://rust-random.github.io/book/guide-rngs.html
|
| And if in doubt, thread_rng() is always a good choice.
| ngneer wrote:
| I recommend you research CSPRNGs before arriving at a
| conclusion. They, too, are seeded, which is helpful for
| deterministically replaying a sequence, say for fuzzing.
| josephg wrote:
| Oh, TIL! And indeed ChaCha in Rust's rand crate can be
| seeded. Cool!
|
| https://rust-
| random.github.io/rand/rand_chacha/struct.ChaCha...
| worik wrote:
| This is wrong for two reasons
|
| Firstly
|
| Reproducibility
|
| Using "true" randomness sacrifices that
|
| Secondly
|
| Performance
|
| In simulations you often need billions of these
|
| There are cases for true randomness; it is a good thing it is
| available for those rare occasions.
| OskarS wrote:
| > Reproducibility
|
| That's a very silly objection: CSPRNGs can be seeded just like
| a PRNG. The "P" in both abbreviations is the key: it stands for
| "pseudo".
| pclmulqdq wrote:
| CSPRNGs are not true randomness. They are PRNGs (with a seed
| and all), but with cryptographic guarantees. That also means,
| as a corollary, that they are generally good RNGs.
|
| TRNG appliances can get very high throughput, too.
| sneak wrote:
| There are lots of cases where a PRNG is more appropriate than a
| CSPRNG. Wiping disks is one of them, as the CSPRNG becomes the
| bottleneck with large arrays. Testing i/o or network throughput
| is another; you don't want your algorithm tainting the results.
|
| I like the thrust of the article but if you're going to be
| particular, you should also be correct. CSPRNGs are not suitable
| replacements in 100% of cases.
|
| Engineering is about trade-offs.
| tczajka wrote:
| If a predictable PRNG is sufficient for wiping disks, why do we
| care that the data is random at all? How about writing 0, 1, 2,
| ..., 255, 0, 1, 2, ... to consecutive bytes? Or -- generate 1
| KB of good-quality random data, and write the same 1 KB
| repeatedly? It's not clear to me what we're trying to achieve
| in this scenario.
| OskarS wrote:
| The author is a bit too categorical; there are certainly use
| cases where both speed and code size matter a great deal. Others
| have mentioned Monte Carlo algorithms, but there are more: not
| very long ago I was doing an effect in a shader that needed a
| PRNG, and it would have been lunacy to use a CSPRNG. High-quality
| statistical properties are not important at all, but speed and
| instruction count are VERY important. My hunch is that this is
| often true in game-dev: Minecraft probably should not use a
| CSPRNG to generate its random world.
|
| However, in general, I basically agree: for MOST tasks where you
| need a random number generator, a CSPRNG is probably the right
| choice and it's a reasonable "default" choice if you don't have
| good reasons why it shouldn't be. It's a continual annoyance to
| me that PRNG libraries usually don't include any solid CSPRNGs
| (looking at you, C++ header <random>) when it's often the right
| choice.
| tromp wrote:
| I recommend siphash24 [1] for its excellent statistical
| properties and good performance. It's a lightweight, almost
| cryptographically secure, seeded hash, very popular in hashtable
| implementations.
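|
| A seeded counter run through it already behaves like a small
| PRNG; a sketch using the siphasher crate (key values here are
| placeholders):
|
|     use siphasher::sip::SipHasher24;
|     use std::hash::Hasher;
|
|     // Each counter value hashed under a fixed key yields 64
|     // pseudorandom bits.
|     fn sip_next(k0: u64, k1: u64, counter: u64) -> u64 {
|         let mut h = SipHasher24::new_with_keys(k0, k1);
|         h.write_u64(counter);
|         h.finish()
|     }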
|
| [1] https://en.wikipedia.org/wiki/SipHash
| pyjarrett wrote:
| > You can try running this code yourself and see if you get the
| same answers. This experiment is repeatable.
|
| OpenSUSE: 483 0.305 484 0.293
| 485 0.283 486 0.275 487 0.3 488 0.302
| 489 0.304 490 0.274 491 0.274 492 0
| 493 0.285 494 0 495 0.271 496 0
| 497 0.289 498 0 499 0.294 500 0
| 501 0.306 502 0 503 0.31 504 0
| 505 0.275 506 0 507 0.285 508 0
| 509 0.289 510 0 511 0.282 512 0
| 513 0.301 514 0 515 0.266 516 0
| 517 0.289 518 0 519 0.291 520 0
| 521 0.309 522 0 523 0.325 524 0
| 525 0.358 526 0 527 1 528 0 529 0
| 530 0 531 0 532 0 533 0 534 0
|
| That's wild.
|
| I don't see this with mt19937, but I did see these funny nuggets
| with unseeded `rand()` on Windows: 255 0.3
| 256 0 257 0.289 258 0.306 ... 511
| 0.262 512 0 513 0.275
| Vecr wrote:
| I'm surprised I'm not seeing mention of the reduced-round ChaCha8
| variant of ChaCha; that's what I'd recommend for Monte Carlo use.
| It's available in Rust at
| https://docs.rs/rand_chacha/0.3.1/rand_chacha/struct.ChaCha8... .
| I recommend seeding it from a high quality random source, but it
| could be fixed in your source code, generated with `openssl rand
| -base64 32` or similar.
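|
| Constructing it from a fixed 32-byte seed is only a couple of
| lines (the seed bytes below are placeholders, not a
| recommendation):
|
|     use rand::{Rng, SeedableRng};
|     use rand_chacha::ChaCha8Rng;
|
|     fn sample() -> f64 {
|         // placeholder; paste 32 real random bytes here
|         let seed: [u8; 32] = [0x42; 32];
|         let mut rng = ChaCha8Rng::from_seed(seed);
|         rng.gen()
|     }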
| spott wrote:
| This whole discussion seems to be summed up as "premature
| optimization is the root of all evil, and using anything but a
| CSPRNG is a premature performance optimization".
|
| There don't seem to be any good non-performance reasons to use
| a regular PRNG that I can think of.
| zeta0134 wrote:
| But if you need a PRNG in a _big hurry_, an LFSR is probably
| fine.
|
| https://en.m.wikipedia.org/wiki/Linear-feedback_shift_regist...
|
| I use one in my NES projects that can compute a new 8-bit PRNG
| value in something like 65 cycles.
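|
| In higher-level terms the step is tiny; a 16-bit Galois LFSR
| (taps 0xB400, the maximal-length example from the Wikipedia page
| above) looks something like this Rust sketch:
|
|     // state must be seeded to any nonzero value
|     fn lfsr_step(state: &mut u16) -> u16 {
|         let lsb = *state & 1;
|         *state >>= 1;
|         if lsb == 1 {
|             *state ^= 0xB400;
|         }
|         *state
|     }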
| e63f67dd-065b wrote:
| There's one big scenario where PRNGs are good: reproducibility. I
| have a lot of tests that are deterministic given a seed, and if
| the test fails then I can grab the seed and replay the test.
|
| A CSPRNG is designed to be _cryptographically secure_, and thus I
| somewhat agree that it should be the default choice. But there
| are many cases where reproducibility is a good thing, and that's
| where you need seeded RNGs.
| Klaus23 wrote:
| The same is true for a CSPRNG: same seed -> same output.
| atq2119 wrote:
| This is pretty reasonable, I think, but can it be applied to GPU
| code?
|
| The fundamental challenge of CSPRNGs is that they tend to have
| high latency and each thread produces a lot of randomness at
| once, while typically you only need a little bit at a time. So
| either you throw away the extra bits, effectively wasting ALU, or
| you keep them around in storage somewhere, which is expensive as
| well.
|
| I would love to see a CSPRNG that is designed to leverage
| subgroup/wave ops. (Hmm, now that I think about it, perhaps
| ChaCha can be spread reasonably across 4 lanes? That would give 4
| dwords per thread and potentially cut latency quite a bit, much
| better tradeoff overall for most GPU applications)
| slow_typist wrote:
| http://csundergrad.science.uoit.ca/courses/sim-and-modeling-...
___________________________________________________________________
(page generated 2023-11-25 23:00 UTC)