[HN Gopher] RDRAND on AMD Ryzen 9 5900X is flakey
___________________________________________________________________
RDRAND on AMD Ryzen 9 5900X is flakey
Author : Avamander
Score : 97 points
Date : 2021-01-11 18:21 UTC (3 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| gbrown_ wrote:
| Not to be snarky but OpenBSD's random system has always seemed
| very elegant to me and seems like such a system could avoid this
| failure mode in systemd, video[1], slides[2].
|
| That said, it's good that the faulty RDRAND was discovered. As
| pointed out, this isn't the first time processors (AMD's in
| particular) have had such issues. Do we need to just be skeptical
| of this and run RDRAND tests every time a new processor comes out?
| Or perhaps even after each microcode update?
|
| [1] https://www.youtube.com/watch?v=aWmLWx8ut20
|
| [2] https://www.openbsd.org/papers/hackfest2014-arc4random/index...
| segfaultbuserr wrote:
| > _seems like such a system could avoid this failure mode in
| systemd, video[1], slides[2]_
|
| Yes and no.
|
| Yes, OpenBSD is not vulnerable to this failure mode. But no,
| your slideshow only shows the architecture of the entropy pool
| (which is well designed and well implemented), and that is
| _irrelevant_ to our problem here. The problem is not how well
| the random system performs once it has started, but simply _what
| to do when it's not initialized yet during early boot_.
|
| As shown in the slideshow, OpenBSD uses an on-disk random seed
| from the previous boot as one of the many sources of randomness
| to initialize its entropy pool during early boot. Linux does it
| too: systemd saves and restores the seed file every time you
| boot or shut down. What's the problem then? systemd needs a
| source of randomness at an earlier time, before the entropy
| pool is seeded by the file.
|
| How did OpenBSD solve it then? Instead of loading the file after
| boot in userspace, OpenBSD loads it early, really early, via the
| bootloader! Seeding the entropy pool via a file is part of the
| boot protocol.
|
| This is the advantage you have when you're building an entire
| operating system, not just a kernel.
| dragontamer wrote:
| Hmmm...
|
| Well, the first problem is using RDRAND in this case. RDRAND isn't
| meant for seeds; it's for random numbers.
|
| If you're seeding something, you're supposed to use RDSEED, which
| has more guarantees about seeding / entropy behavior. I don't
| think RDRAND was ever intended to be used as an entropy source.
|
| (Think of it this way: RDSEED is kinda like /dev/random, while
| RDRAND is kinda-like /dev/urandom)
|
| ------
|
| RDRAND is an older instruction than RDSEED. So that then brings
| up the question: what is the proper startup behavior of systems
| without RDSEED but with RDRAND? Well, since RDRAND never had
| entropy guarantees, I guess it's up to other sources of entropy
| (see clock-entropy, such as RDTSC).
|
| RDTSC isn't a source of entropy; it's just a clock. But as a
| GHz-level precise timer, it can be used to capture entropy from
| some other source that does have time-based entropy (e.g. hard
| drive response times, or something).
|
| Will the hard drive respond in 10,000,000 clock ticks? Or
| 10,000,020 clock ticks? That's a source of entropy, a bad one,
| but maybe you have enough hard drive I/O going on that you can
| get enough entropy to boot up Systemd.
|
| EDIT: Hmm... hard drives may not be on all systems. In that case,
| DRAM-latency could possibly be the answer. DRAM refreshes on a
| regular basis, locking the CPU out temporarily. As such, the
| amount of time it takes to access data from DRAM is actually
| somewhat random (in the scope of a GHz clock anyway). Maybe that
| could be used as an entropy-source of last resort? (as long as
| RDTSC exists anyway).
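The timing idea above can be sketched in a few lines. This is only an illustration under loud assumptions: Python's `time.perf_counter_ns()` stands in for RDTSC, a strided walk over a byte buffer stands in for DRAM or disk accesses, and the helper name `timing_jitter_digest` is hypothetical. It makes no attempt to estimate how much entropy the deltas really carry, and hashing makes any input look random, so the output proves nothing about quality.

```python
import hashlib
import time


def timing_jitter_digest(samples: int = 4096) -> bytes:
    """Hash the deltas between successive high-resolution clock reads.

    Assumption: perf_counter_ns() is a stand-in for RDTSC, and the buffer
    writes are a stand-in for the memory or disk accesses being timed.
    """
    hasher = hashlib.sha256()
    buffer = bytearray(1 << 16)
    previous = time.perf_counter_ns()
    for i in range(samples):
        # Touch memory in a strided pattern so each iteration does some work.
        buffer[(i * 4099) % len(buffer)] ^= 1
        now = time.perf_counter_ns()
        # The clock is monotonic, so the delta is never negative.
        hasher.update((now - previous).to_bytes(8, "little"))
        previous = now
    return hasher.digest()


if __name__ == "__main__":
    print(timing_jitter_digest().hex())
```

A real design would feed such samples into a pool alongside other sources and credit them conservatively, rather than using the digest directly.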
| mzs wrote:
| please don't over complicate this:
|
| >you're right, after I've generated many randoms, I was able to
| get collisions.
|
| RDRAND() = 0x0081da17
| RDRAND() = 0x0081da17
| RDRAND() = 0x0178d2ea
| RDRAND() = 0x0178d2ea
| RDRAND() = 0x02a91db5
| RDRAND() = 0x02a91db5
| RDRAND() = 0x06c4385b
| RDRAND() = 0x06c4385b
| RDRAND() = 0x095d1bf8
| RDRAND() = 0x095d1bf8
| RDRAND() = 0x0990b335
| RDRAND() = 0x0990b335
| RDRAND() = 0x0ab033e4
| RDRAND() = 0x0ab033e4
| RDRAND() = 0x0ac21fae
| RDRAND() = 0x0ac21fae
| RDRAND() = 0x0d39390b
| RDRAND() = 0x0d39390b
| RDRAND() = 0x0df2f5ce
| RDRAND() = 0x0df2f5ce
| RDRAND() = 0x109e5c8a
| RDRAND() = 0x109e5c8a
|
| https://github.com/systemd/systemd/issues/18184#issuecomment...
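A listing like the one quoted above is easy to reproduce from any collection of samples. The sketch below simply counts repeated values; the `POSTED` list is copied from the comment, and the helper name `repeated_values` is hypothetical. It says nothing about how the values were collected or how many samples were drawn in total, which the quoted comment does not state.

```python
from collections import Counter

# The values quoted in the comment above, in the order they were posted.
POSTED = [
    0x0081DA17, 0x0081DA17, 0x0178D2EA, 0x0178D2EA,
    0x02A91DB5, 0x02A91DB5, 0x06C4385B, 0x06C4385B,
    0x095D1BF8, 0x095D1BF8, 0x0990B335, 0x0990B335,
    0x0AB033E4, 0x0AB033E4, 0x0AC21FAE, 0x0AC21FAE,
    0x0D39390B, 0x0D39390B, 0x0DF2F5CE, 0x0DF2F5CE,
    0x109E5C8A, 0x109E5C8A,
]


def repeated_values(samples):
    """Return {value: count} for every value that occurs more than once."""
    return {value: n for value, n in Counter(samples).items() if n > 1}


if __name__ == "__main__":
    for value, n in sorted(repeated_values(POSTED).items()):
        print(f"0x{value:08x} seen {n} times")
```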
| dragontamer wrote:
| >> https://github.com/systemd/systemd/issues/18184#issuecomment...
|
| >> If you can, please try it without using the C intrinsic. I
| have seen multiple compiler bug reports--
| https://patchwork.ozlabs.org/project/gcc/patch/2012052017042...
| is the easiest to find--that indicate that the intrinsics don't
| always return the right value.
|
| Hypothetically, the sequence discussed there could be a
| compiler-bug, unrolling the loop and failing to treat the
| results of RDRAND as volatile.
|
| So... yeah. Let's not overcomplicate things, but let's not
| undercomplicate things either. Without an actual disassembly
| of the code posted, there's no proof yet.
| rcxdude wrote:
| I doubt it, a compiler bug like failing to consider rdrand
| volatile would not occasionally fail, but would fail
| consistently (perhaps every nth iteration, but it would
| have a pattern). I struggle to see how a compiler could
| miscompile it to make it inconsistent.
| xeeeeeeeeeeenu wrote:
| Why does this keep happening? RDRAND was broken in Ryzen 3xxx[1]
| and Jaguar/Puma[2] too.
|
| [1] - https://arstechnica.com/gadgets/2019/10/how-a-months-old-amd...
|
| [2] - https://linuxreviews.org/RDRAND_stops_returning_random_value...
| stabbles wrote:
| I remember running into this Ryzen problem too. I spent a long
| time trying to figure out a bug in my code (random numbers were
| deterministic between runs), which was ultimately fixed by
| upgrading the BIOS.
| Avamander wrote:
| > Why does this keep happening?
|
| Windows doesn't use RDRAND.
| dschuetz wrote:
| systemd: we'll fork our own rdrand
| LukeShu wrote:
| systemd: It's not our job to hack around hardware bugs, it's
| the kernel's. We will not implement our own thing.
|
| HN: How dare you.[1]
|
| Also HN: _Criticizes systemd for implementing its own thing_
|
| Systemd just can't win.
|
| [1] https://news.ycombinator.com/item?id=10999335
| dschuetz wrote:
| They can. Just fork Linux and do their own thing, with
| blackjack and hookers.
| na85 wrote:
| apt: Package Gnome3-Shell depends on systemd-rdrand which is
| not installed. Package systemd-rdrand conflicts gnome3-shell.
| Uninstall 1230982 packages to resolve? [y/n]
| uncledave wrote:
| Headline in two years "Pwnie award for systemd-rdrand"
| nitrobeast wrote:
| Functions that return random results do not fit into the most
| common testing paradigms. I have seen people try to verify the
| distribution of random results, and even that kind of test can,
| in theory, be flaky. I wonder if the nature of the function
| contributed to the apparent lack of testing coverage here.
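One way to make such a test less arbitrary is to compare what is observed against what uniform output would be expected to produce. For n independent uniform values of a given bit width, the expected number of colliding pairs is n(n-1)/2 divided by 2^bits. The sketch below assumes that model; the function names and the `slack` threshold are hypothetical choices, not anything taken from the thread or from systemd.

```python
def expected_collisions(n_samples: int, bits: int) -> float:
    """Expected number of colliding pairs among n uniform `bits`-wide values."""
    pairs = n_samples * (n_samples - 1) / 2
    return pairs / 2 ** bits


def looks_suspicious(observed: int, n_samples: int, bits: int,
                     slack: float = 10.0) -> bool:
    """Flag a run whose collision count is far above the uniform expectation.

    `slack` is an arbitrary safety factor (an assumption); the expectation is
    floored at 1.0 so a single chance collision does not trip the check.
    """
    baseline = max(expected_collisions(n_samples, bits), 1.0)
    return observed > slack * baseline


if __name__ == "__main__":
    # A million 64-bit samples should essentially never collide.
    print(expected_collisions(1_000_000, 64))
    print(looks_suspicious(11, 1_000_000, 64))
```

Because the check is statistical, a healthy generator can still fail it occasionally, which is exactly the flakiness concern raised above.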
| gruez wrote:
| Given that rdrand is so flakey across systems, is there a reason
| why systemd even bothers using it? The comments suggest it's only
| used to generate UUIDs, which makes me think it's not
| performance-sensitive code. If that's the case, why increase code
| complexity and expose yourself to hardware flakiness just to save
| a few milliseconds?
| LukeShu wrote:
| The source code has a lengthy comment explaining why.
|
| https://github.com/systemd/systemd/blob/bcac754d66374782a85a...
| robocat wrote:
| int rdrand(unsigned long *ret) {
|
|     /* So, you are a "security researcher", and you wonder why we
|      * bother with using raw RDRAND here, instead of sticking to
|      * /dev/urandom or getrandom()?
|      *
|      * Here's why: early boot. On Linux, during early boot the random
|      * pool that backs /dev/urandom and getrandom() is generally not
|      * initialized yet. It is very common that initialization of the
|      * random pool takes a longer time (up to many minutes), in
|      * particular on embedded devices that have no explicit hardware
|      * random generator, as well as in virtualized environments such
|      * as major cloud installations that do not provide virtio-rng or
|      * a similar mechanism.
|      *
|      * In such an environment using getrandom() synchronously means
|      * we'd block the entire system boot-up until the pool is
|      * initialized, i.e. *very* long. Using getrandom() asynchronously
|      * (GRND_NONBLOCK) would mean acquiring randomness during early
|      * boot would simply fail. Using /dev/urandom would mean
|      * generating many kmsg log messages about our use of it before
|      * the random pool is properly initialized. Neither of these
|      * outcomes is desirable.
|      *
|      * Thus, for very specific purposes we use RDRAND instead of
|      * either of these three options. RDRAND provides us quickly and
|      * relatively reliably with random values, without having to
|      * delay boot, without triggering warning messages in kmsg.
|      *
|      * Note that we use RDRAND only under very specific
|      * circumstances, when the requirements on the quality of the
|      * returned entropy permit it. Specifically, here are some cases
|      * where we *do* use RDRAND:
|      *
|      *   * UUID generation: UUIDs are supposed to be universally
|      *     unique but are not cryptographic key material. The quality
|      *     and trust level of RDRAND should hence be OK: UUIDs should
|      *     be generated in a way that is reliably unique, but they do
|      *     not require ultimate trust into the entropy generator.
|      *     systemd generates a number of UUIDs during early boot,
|      *     including 'invocation IDs' for every unit spawned that
|      *     identify the specific invocation of the service globally,
|      *     and a number of others. Other alternatives for generating
|      *     these UUIDs have been considered, but don't really work:
|      *     for example, hashing uuids from a local system identifier
|      *     combined with a counter falls flat because during early
|      *     boot disk storage is not yet available (think: initrd) and
|      *     thus a system-specific ID cannot be stored or retrieved
|      *     yet.
|      *
|      *   * Hash table seed generation: systemd uses many hash tables
|      *     internally. Hash tables are generally assumed to have O(1)
|      *     access complexity, but can deteriorate to prohibitive O(n)
|      *     access complexity if an attacker manages to trigger a large
|      *     number of hash collisions. Thus, systemd (as any software
|      *     employing hash tables should) uses seeded hash functions
|      *     for its hash tables, with a seed generated randomly. The
|      *     hash tables systemd employs watch the fill level closely
|      *     and reseed if necessary. This allows use of a low quality
|      *     RNG initially, as long as it improves should a hash table
|      *     be under attack: the attacker after all needs to trigger
|      *     many collisions to exploit it for the purpose of DoS, but
|      *     if doing so improves the seed the attack surface is reduced
|      *     as the attack takes place.
|      *
|      * Some cases where we do NOT use RDRAND are:
|      *
|      *   * Generation of cryptographic key material
|      *
|      *   * Generation of cryptographic salt values
|      *
|      * This function returns:
|      *
|      *   -EOPNOTSUPP - RDRAND is not available on this system
|      *   -EAGAIN     - The operation failed this time, but is likely
|      *                 to work if you try again a few times
|      *   -EUCLEAN    - We got some random value, but it looked
|      *                 strange, so we refused using it. This failure
|      *                 might or might not be temporary.
|      */
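The three return codes in the quoted comment suggest a simple control flow, sketched here in Python rather than C so it runs without RDRAND hardware. Everything beyond the error names is an assumption: `read_word` is a hypothetical callable that mimics the instruction by returning a success flag and a value, and treating all-zero and all-one words as "strange" is one guess at what the comment means, not a statement of what systemd checks.

```python
import errno

WORD_MASK = (1 << 64) - 1
# errno.EUCLEAN is Linux-specific; 117 is its Linux value (fallback assumption).
EUCLEAN = getattr(errno, "EUCLEAN", 117)


def checked_rdrand(read_word, supported: bool = True):
    """Return (status, value) using the error names from the quoted comment.

    read_word() is a hypothetical stand-in for the RDRAND instruction and
    returns (ok, value), like the carry flag plus the destination register.
    """
    if not supported:
        return -errno.EOPNOTSUPP, None  # instruction not available
    ok, value = read_word()
    if not ok:
        return -errno.EAGAIN, None      # transient failure, caller may retry
    if value in (0, WORD_MASK):
        return -EUCLEAN, None           # assumed definition of "looked strange"
    return 0, value


if __name__ == "__main__":
    print(checked_rdrand(lambda: (True, 0x1234)))
    print(checked_rdrand(lambda: (True, 0)))
```

Note that a check like this cannot catch the failure reported in this thread, where individual values look plausible but repeat.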
| secondcoming wrote:
| Since they only use it for UUID generation, couldn't they use
| rdrand() to seed a PGC pseudo-RNG?
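The suggestion above amounts to drawing one seed and deriving every UUID from a software generator. A minimal sketch, assuming Python's built-in `random.Random` (a Mersenne Twister, swapped in here for PCG because it ships with the standard library) and a hypothetical helper name `uuid_stream`:

```python
import random
import uuid


def uuid_stream(seed: int):
    """Yield version-4 UUIDs derived from a single seed.

    Assumption: random.Random stands in for the PCG generator mentioned
    above; in the real scenario the seed would come from the hardware.
    """
    prng = random.Random(seed)
    while True:
        # uuid.UUID(..., version=4) overwrites the version and variant bits.
        yield uuid.UUID(int=prng.getrandbits(128), version=4)


if __name__ == "__main__":
    stream = uuid_stream(0xC0FFEE)
    print(next(stream))
    print(next(stream))
```

The sketch also shows the catch: the stream is a pure function of the seed, so if the hardware hands out the same seed twice, every UUID repeats.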
| gruez wrote:
| But what happens on systems that don't have rdrand? Surely
| there is a fallback. Why isn't the fallback used in all
| cases, given how flakey rdrand is?
| rcxdude wrote:
| It falls back to /dev/urandom, which will cause the kernel
| to complain bitterly by spamming log messages while the
| entropy pool hasn't been seeded.
| okl wrote:
| Possibly because it helps to bootstrap the entropy pool, see
| for example these ramblings on the topic:
| https://daniel-lange.com/archives/152-Openssh-taking-minutes...
| espadrine wrote:
| Isn't that an even bigger problem, though, if Linux' entropy
| pool gets polluted by an inentropic source that still
| increments its entropy counter?
|
| I recall discussions to maybe rely on something like jitter
| entropy when the pool is empty and getentropy() is called.
| garaetjjte wrote:
| >on something like jitter entropy when the pool is empty
| and getentropy() is called
|
| This ended up being implemented in the 5.4 kernel:
| https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...
| altfredd wrote:
| The Linux entropy pool is designed to accept and mix different
| low-quality entropy sources and has explicit workarounds
| for problems like these.
|
| Systemd is literally the only software that has this
| problem. I am not aware of any other software that uses
| rdrand and expects high-quality cryptography-grade
| randomness, precisely because it does not work. Intel CPUs
| used to have very similar issues with rdrand, and so did
| AMD. Furthermore, the CPU implementation of rdrand is a very
| attractive target for state backdoors, so most sensible
| developers either follow the "GnuPG way" (ask for
| randomness from the user) or simply read from /dev/random.
|
| The rdrand instruction is great for games, because it
| makes it possible to generate white noise without complex
| algorithms and system call overhead. It is also handy in a few
| situations, like interrupt handlers, where you need to create
| some semi-random value without using stack space or calling
| into outside code. Unfortunately, when it was introduced, rdrand
| was documented to generate "cryptographically strong random
| numbers" (did Intel developers ever know what that means?).
| Of course, most actual cryptography experts didn't buy into
| that. As a consequence, and because multi-platform software
| needed to have its own RNG anyway, the instruction
| remained largely unused for actual cryptography. Unused =
| untested and sometimes broken. If I were in the systemd
| developers' place, I would not hinge the bootability of my
| systems on something like that.
| Dylan16807 wrote:
| > The rdrand instruction is great for games, because it
| makes it possible to generate white noise without complex
| algorithms and system call overhead.
|
| Eh. You're looking at hundreds of cycles per use. That
| makes it slower than a _secure_ software RNG, let alone
| an _insecure_ one.
| jcelerier wrote:
| Right! For games, if it's for e.g. texture noise or things
| like that, you shouldn't need more than a couple of xors and
| shifts.
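"A couple of xors and shifts" describes generators in the xorshift family. The sketch below is Marsaglia's 32-bit xorshift with the 13/17/5 shift triple, written in Python with explicit masking to emulate 32-bit registers; it is offered as one example of the kind of generator meant, not as what any particular game uses. It is not suitable for anything security-related, and the state must never be zero.

```python
MASK32 = 0xFFFFFFFF


def xorshift32(state: int) -> int:
    """Advance a 32-bit xorshift state by one step (state must be non-zero)."""
    state &= MASK32
    state ^= (state << 13) & MASK32
    state ^= state >> 17
    state ^= (state << 5) & MASK32
    return state


def noise(seed: int, count: int):
    """Return `count` successive 32-bit values starting from a non-zero seed."""
    values = []
    state = seed
    for _ in range(count):
        state = xorshift32(state)
        values.append(state)
    return values


if __name__ == "__main__":
    print([hex(v) for v in noise(1, 4)])
```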
| ficklepickle wrote:
| It's an early boot problem. If they tried to use the
| kernel facilities it would block, making boot take much
| longer. Fixed in recent kernel versions.
|
| I know it's in vogue to rip on systemd, but let's at
| least try to be fair.
| freeone3000 wrote:
| Somehow sysv init never ran into this issue. Maybe it's
| okay to let the kernel seed its entropy pool before
| initializing your hash table.
| Foxboron wrote:
| I don't think sysvinit made use of hashtables, and if
| they did they would _probably_ hit the same snag.
| tadfisher wrote:
| > Systemd is literally the only software that has this
| problem. I am not aware of any other software that uses
| rdrand and expects high-quality cryptography-grade
| randomness.
|
| Please seek out and understand the reasons behind calling
| RDRAND in systemd before making statements like these.
| "High-quality cryptography-grade randomness" is explicitly
| _not_ required for the purposes of the PRNG at boot-time,
| which include UUID generation and seeding hash tables.
|
| https://github.com/systemd/systemd/blob/61bd7d1ed595a98e5fbf...
| recursive wrote:
| Who's spending multiple milliseconds generating uuids? I'm sure
| it happens somewhere, somehow, but that does not seem like a
| common occurrence.
| tadfisher wrote:
| It's not systemd's job to fix hardware bugs, it's the kernel's
| (or AMD's). The fix for the last time this happened was to mask
| the offending CPU IDs in the kernel's RNG from using RDRAND.
| systemd did have a temporary hack in place for their direct
| RDRAND implementation, but this current bug isn't the same (not
| just emitting 0 like last time, but eventually emitting
| duplicate values in pairs).
| ginko wrote:
| Even so, systemd should be able to deal with name collisions
| even with numbers coming from a true RNG. That causing errors
| is a design issue, IMO.
| johncolanduoni wrote:
| Most cryptographic constructs you feed CSPRNG data into
| break spectacularly if they receive colliding data so I
| guess they all have design issues too. Including the TLS
| connection you're typing your comment through.
| tadfisher wrote:
| That would be a non-deterministic failure case scaling to
| O(n) in both space and time. But it's not just UUIDs, it's
| also required for seeding systemd's internal hash tables,
| which would degrade from O(1) lookups to O(n) should an
| attacker exploit known RNG flaws.
|
| So, a PRNG with a semi-decent (not perfect) entropy pool is
| required at boot time, and systemd needs to run in
| environments where seeding that pool in software could take
| on the order of minutes. This is why RDRAND is used to seed
| the pool during boot.
|
| https://github.com/systemd/systemd/blob/61bd7d1ed595a98e5fbf...
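The hash-table point can be made concrete with a keyed hash: without knowing the seed, an attacker cannot precompute keys that all land in one bucket. The sketch below uses keyed BLAKE2b purely because it is in Python's standard library; the thread and the quoted source comment only say "seeded hash functions", so the choice of hash, the 16-byte seed, and the name `bucket_for` are all assumptions.

```python
import hashlib
import os


def bucket_for(key: bytes, seed: bytes, n_buckets: int) -> int:
    """Map a key to a bucket index using a hash keyed with a secret seed."""
    digest = hashlib.blake2b(key, key=seed, digest_size=8).digest()
    return int.from_bytes(digest, "little") % n_buckets


if __name__ == "__main__":
    seed = os.urandom(16)  # in the early-boot scenario this is the hard part
    for name in (b"a.service", b"b.service", b"c.service"):
        print(name.decode(), "->", bucket_for(name, seed, 64))
```

Reseeding, as the quoted comment describes, would mean picking a new seed and rehashing every stored key once a table looks like it is under attack.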
| klmadfejno wrote:
| That doesn't seem to answer the question of why use it
| tadfisher wrote:
| If they don't use it, the kernel does in order to
| initialize the entropy pool, which will also result in
| broken boots. To work around historically flaky kernel
| implementations, systemd calls RDRAND directly if it
| detects it needs to with a heuristic. This is why the fix
| for the current issue is to set both 'nordrand' and
| 'SYSTEMD_RDRAND=0'.
| tedunangst wrote:
| What happened to it's not systemd's job to fix kernel
| bugs?
| numpad0 wrote:
| It generally makes more sense to trust the CPU's specification
| and use the CPU fully than to assume everything is tampered
| with and only ever use `mov`.
| [deleted]
| stefan_ wrote:
| It's really ridiculous of AMD to keep fucking this up, even if
| the backdoor thing never stuck for RDRAND, the fact that one
| CPU vendor can't manage to make the thing not return _identical
| values_ many many times over sure might.
|
| Similar story for systemd, of course; nobody would expect them
| to learn their lesson. How nice of an init system to keep your
| system dead because it apparently needs CSPRNG-generated _job
| ids_ of all things.
| temac wrote:
| Given this is broken over and over again in successive AMD
| chips, this is actually a little bit likely to be a backdoor.
| People will e.g. generate keys with broken rdrand, and AMD
| will later fix it with plausible deniability, just saying
| "oops, sorry, it was just our silly usual mistake, and we
| don't even promise we won't do it again, lol!". In the
| meantime the NSA will enjoy the resulting compromises. And
| well, even if it _was_ just the same silly mistake made over
| and over again, the NSA is likely to still enjoy the results?
|
| So it does not really matter if it was intentional or not:
| regardless of the root cause of this bug, it should now be
| considered that rdrand should _never_ be trusted (unless you
| truly don't care that your systems risk randomly breaking or
| that you risk randomly generating compromised keys), and its
| mere usage (except maybe indirectly, in mitigated ways designed
| by experts and audited for years) should be considered a very
| likely security vulnerability.
| AshamedCaptain wrote:
| And why do you even need globally unique IDs here? What's wrong
| with a plain old sequence counter or the like?
| stragies wrote:
| Among other things, predictable system behavior makes analysis
| and/or attack easier. Also, "plain old sequence counters" in
| clusters of (heterogeneous) hardware make distributing
| (floating) tasks more challenging. There are a bunch of other
| reasons.
| garaetjjte wrote:
| As much as I don't like systemd, this was probably a reasonable
| decision at the time. systemd needs entropy for generating
| some random identifiers during early boot. Before kernel 5.4,
| trying to read it from the kernel's random facilities could
| block indefinitely, so pulling entropy from RDRAND might not
| have been that unreasonable a decision. Nowadays it should
| probably just call getrandom().
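For reference, the non-blocking variant discussed in this thread looks roughly like this from Python, which exposes the Linux system call as `os.getrandom()` with the `os.GRND_NONBLOCK` flag. The helper name `early_random_bytes` and the decision to return `None` when the pool is not ready are assumptions made for this sketch; on platforms without `os.getrandom` it falls back to `os.urandom()`.

```python
import os


def early_random_bytes(n: int):
    """Try to get n random bytes from the kernel without blocking.

    Returns None if the kernel pool is not initialized yet, so the caller
    can decide what to do instead of stalling the boot.
    """
    if not hasattr(os, "getrandom"):
        # Not Linux (or a very old Python): no non-blocking interface here.
        return os.urandom(n)
    try:
        return os.getrandom(n, os.GRND_NONBLOCK)
    except BlockingIOError:
        return None


if __name__ == "__main__":
    data = early_random_bytes(16)
    print("pool not ready" if data is None else data.hex())
```

For small requests such as the 16 bytes of a UUID, the call returns the full amount once the pool is initialized; what to do in the `None` case is the design question the whole thread is arguing about.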
| altfredd wrote:
| It seems that many Red Hat developers have rather unique world
| views.
|
| When PulseAudio introduces random crackles and scratches in
| sound output, they claim that I have a misbehaving sound card.
| It does not "misbehave" when I use the same applications with
| ALSA, but whatever.
|
| When rdrand fails to produce high-quality entropy -- again and
| again, over and over -- Poettering complains about buggy CPUs.
|
| I hope that this particular group never ends up working on
| network hardware.
| [deleted]
| Avamander wrote:
| > It does not "misbehave" when I use the same applications
| with ALSA, but whatever.
|
| I mean, RDRAND is a great example here: if you never use it,
| you never encounter these bugs. It's not impossible that
| PulseAudio tries to play higher quality audio or simply uses
| the device differently.
|
| > I hope that this particular group never ends up working on
| network hardware.
|
| I'd kinda love it; probably quite a few fun hardware bugs
| would be found.
| vetinari wrote:
| > It's not impossible that PulseAudio tries to play higher
| quality audio or simply uses the device differently.
|
| PulseAudio mixes sound from several applications, unlike
| ALSA. ALSA just configures your sound card for that one
| stream; PulseAudio has no such luxury.
|
| In order to mix relatively small buffers (because of latency),
| it needs to do timing. If your sound card has a problem with
| timing (and many do!), you will get crackling. The good
| news is that the timing problem may only occur at
| specific frequencies; in the past, I had a computer with
| Nvidia Ion and it did this with 44.1 kHz audio. With 48 kHz,
| everything was OK. So fixing the output at 48 kHz by forcing
| Pulse to mix at this frequency fixed the crackling.
| Dylan16807 wrote:
| I'm reminded of this story about hidden network card bugs.
| Working in a bunch of tests doesn't mean something is
| correct, especially when it's complex.
|
| https://devblogs.microsoft.com/oldnewthing/20050512-48/?p=35...
| zuzun wrote:
| > just to save a few milliseconds?
|
| RDRAND is rather slow, especially with the SRBDS microcode
| update. One instruction takes 1,200 ns on my system; that's about
| 6.6 MB/s. A software implementation of a CSPRNG will do hundreds
| of MB/s, and general-purpose PRNGs do several GB/s.
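The throughput figure follows from the latency if each instruction is assumed to return 8 bytes (the 64-bit operand form; the comment does not state the operand width, so that is an assumption). A small worked calculation, with a hypothetical helper name:

```python
def throughput_mb_per_s(bytes_per_call: int, ns_per_call: float) -> float:
    """Convert a per-call latency into throughput in MB/s (1 MB = 10**6 bytes)."""
    seconds_per_call = ns_per_call * 1e-9
    return bytes_per_call / seconds_per_call / 1e6


if __name__ == "__main__":
    # 8 bytes every 1,200 ns, the figures discussed above.
    print(f"{throughput_mb_per_s(8, 1200):.2f} MB/s")
```

Under the same assumption, a generator running at several GB/s would be roughly three orders of magnitude faster.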
| HarryHirsch wrote:
| Make that: "systemd and RDRAND on Ryzen 9 is flakey"
| Dylan16807 wrote:
| I'm sure a lot of other things are being affected by this
| defect, just less visibly. It's not like the software is
| breaking x86 instructions somehow.
| HarryHirsch wrote:
| Someone please explain why anyone would use the hardware
| random generator directly. Everyone else uses /dev/{u,}random
| because somehow HRNGs always manage to be buggy, and then
| there's the concern about NSA involvement. Using the OS-provided
| interface mitigates both concerns.
| mjg59 wrote:
| https://github.com/systemd/systemd/blob/bcac754d66374782a85a...
| HarryHirsch wrote:
| Dude is stubborn in the face of reality. It's not as if
| there hadn't been RDRAND hardware bugs before.
___________________________________________________________________
(page generated 2021-01-11 22:01 UTC)