https://github.com/systemd/systemd/issues/18184
RDRAND on AMD Ryzen 9 5900X is flakey #18184

Open. asahaf opened this issue Jan 9, 2021 * 45 comments
Labels: not-our-bug

@asahaf commented Jan 9, 2021 (edited):

systemd version the issue has been seen with: 247.2-1
Used distribution: Arch Linux
Linux kernel version used (uname -a): 5.10.5-arch1-1
CPU architecture issue was seen on: AMD Ryzen 9 5900X
Expected behaviour you didn't see: Start systemd services
Unexpected behaviour you saw: Starting services sometimes fails with message: Failed to set invocation ID for unit: File exists
Steps to reproduce the problem: Starting up the system and login (sometimes, the os doesn't even boot)

Additional Information
CPU: AMD Ryzen 9 5900X
Microcode update image (amd-ucode) version: 20201218.646f159-1
Motherboard: X570 i Aorus Pro Wifi
BIOS/Firmware Version: F31b (10/28/2020)
Arch Kernel Version: 5.10.5-arch1-1
Systemd Version: 247.2-1

Logs
    Jan 08 21:47:40 archlinux systemd[1]: dev-ttyS27.device: Failed to set invocation ID for unit: File exists
    Jan 08 21:47:40 archlinux systemd[1]: sys-devices-platform-serial8250-tty-ttyS27.device: Failed to set invocation ID for unit: File exists
    Jan 08 21:47:40 archlinux systemd[1]: dev-ttyS29.device: Failed to set invocation ID for unit: File exists
    Jan 08 21:47:40 archlinux systemd[1]: sys-devices-platform-serial8250-tty-ttyS29.device: Failed to set invocation ID for unit: File exists
    Jan 08 21:47:40 archlinux systemd[1]: dev-ttyS28.device: Failed to set invocation ID for unit: File exists
    Jan 08 21:47:40 archlinux systemd[1]: sys-devices-platform-serial8250-tty-ttyS28.device: Failed to set invocation ID for unit: File exists
    Jan 08 21:47:40 archlinux systemd[1]: dev-ttyS3.device: Failed to set invocation ID for unit: File exists
    Jan 08 21:47:40 archlinux systemd[1]: dev-ttyS31.device: Failed to set invocation ID
for unit: File exists
    Jan 08 21:49:03 office01 systemd[1]: sys-kernel-tracing.mount: Failed to set invocation ID for unit: File exists
    Jan 08 21:49:03 office01 systemd[1]: Failed to mount Kernel Trace File System.

@poettering (Member) commented Jan 11, 2021:

this smells as if your CPU has a borked RDRAND implementation. i.e. a story like this one: https://arstechnica.com/gadgets/2019/10/how-a-months-old-amd-microcode-bug-destroyed-my-weekend/

@poettering (Member) commented Jan 11, 2021:

can you reboot reliably if you add "nordrand" to the kernel cmdline? See #11810

@asahaf (Author) commented Jan 11, 2021:

That's actually what I suspected when I started googling. I downloaded the small program from the linked article and executed it. Here is the output:

    ./rdrand-test/amd-rdrand-bug
    Your RDRAND() does not have the AMD bug.
    ./rdrand-test/test-rdrand
    RDRAND() = 0x757b61ec
    RDRAND() = 0x4c0527be
    RDRAND() = 0xe4ccb42a
    RDRAND() = 0xc54d5506
    RDRAND() = 0x7c414b9e
    RDRAND() = 0x5075cb07
    RDRAND() = 0x5cd99e55
    RDRAND() = 0xc6ca8a67
    RDRAND() = 0x844013d9
    RDRAND() = 0x927b689d
    RDRAND() = 0xd344ddfc
    RDRAND() = 0x7809d123
    RDRAND() = 0x3481b240
    RDRAND() = 0xed6269b7
    RDRAND() = 0x5909acd7
    RDRAND() = 0x70c4c118
    RDRAND() = 0x240d6150
    RDRAND() = 0x7532c2b5
    RDRAND() = 0xe86bddc4
    RDRAND() = 0xf40c31fa

Now I'm confused. I'm not sure whether this issue is caused by a broken CPU RDRAND implementation. I'll try the nordrand argument and get back to you.

@asahaf (Author) commented Jan 11, 2021:

here is the output with the nordrand argument

    -- Journal begins at Wed 2021-01-06 21:15:50 +03, ends at Mon 2021-01-11 19:59:23 +03. --
    Jan 11 19:59:18 office01 systemd[1]: basic.target: Failed to set invocation ID for unit: File exists
    Jan 11 19:59:18 office01 systemd[1]: Failed to start Basic System.
    ## Subject: A start job for unit basic.target has failed
    ## Defined-By: systemd
    ## Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
    ##
    ## A start job for unit basic.target has finished with a failure.
    ##
    ## The job identifier is 116 and the job result is failed.
    Jan 11 19:59:18 office01 systemd[1]: dbus.service: Failed to set invocation ID for unit: File exists
    Jan 11 19:59:18 office01 systemd[1]: Failed to start D-Bus System Message Bus.
    ## Subject: A start job for unit dbus.service has failed
    ## Defined-By: systemd
    ## Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
    ##
    ## A start job for unit dbus.service has finished with a failure.
    ##
    ## The job identifier is 209 and the job result is failed.

@poettering (Member) commented Jan 11, 2021:

If you add SYSTEMD_RDRAND=0 to the kernel cmdline, does that change anything?

@asahaf (Author) commented Jan 11, 2021:

This argument seems to have solved the issue. The problem used to happen about 7 out of 10 times when I rebooted my machine. After applying this argument (SYSTEMD_RDRAND=0), I rebooted my machine 10 times with no issues.

Does this argument disable relying on the CPU for generating the random seed? And why do I sometimes, without the argument, get lucky and boot the system without this issue?

@poettering (Member) commented Jan 11, 2021:

Smells as if the RDRAND on your CPU is flaky, and doesn't generate properly random stuff, but frequently the same values. I'd report that to your CPU vendor. Like all RDRAND issues this is as security sensitive as it can be.... In particular as most distros now default to seeding the kernel entropy pool with RDRAND.

@poettering (Member) commented Jan 11, 2021:

> Does this argument disable relying on the CPU for generating the random seed? And why do I sometimes, without the argument, get lucky and boot the system without this issue?
I can't answer that, it seems the RNG in the CPU is unreliable and sometimes works and sometimes doesn't. systemd uses RDRAND to generate UUIDs only, nothing else, and exactly as suggested by the whitepapers. if this generates non-unique UUIDs that so easily collide then the RNG is just rubbish.

@poettering changed the title from "Systemd fails to start services with error - Failed to set invocation ID for unit: File exists" to "RDRAND on AMD Ryzen 9 5900X is flakey" Jan 11, 2021
@poettering added the not-our-bug label Jan 11, 2021

@asahaf (Author) commented Jan 11, 2021:

Thank you very much. Do I still need the nordrand argument for the kernel too, in addition to SYSTEMD_RDRAND=0, until the vendor fixes it with microcode?

@poettering (Member) commented Jan 11, 2021:

nordrand turns off RDRAND use by the kernel, i.e. means the kernel won't use it to seed the kernel pool. SYSTEMD_RDRAND=0 turns off RDRAND use by systemd, i.e. we won't generate UUIDs with it. (it won't use it for crypto keys anyway) So the two switches matter, but of course there's a bunch of other userspace sw that might use RDRAND, where you want to turn it off, but there's no common way to do that that I am aware of.

@asahaf (Author) commented Jan 11, 2021:

you're right, after I've generated many randoms, I was able to get collisions.
    RDRAND() = 0x0081da17
    RDRAND() = 0x0081da17
    RDRAND() = 0x0178d2ea
    RDRAND() = 0x0178d2ea
    RDRAND() = 0x02a91db5
    RDRAND() = 0x02a91db5
    RDRAND() = 0x06c4385b
    RDRAND() = 0x06c4385b
    RDRAND() = 0x095d1bf8
    RDRAND() = 0x095d1bf8
    RDRAND() = 0x0990b335
    RDRAND() = 0x0990b335
    RDRAND() = 0x0ab033e4
    RDRAND() = 0x0ab033e4
    RDRAND() = 0x0ac21fae
    RDRAND() = 0x0ac21fae
    RDRAND() = 0x0d39390b
    RDRAND() = 0x0d39390b
    RDRAND() = 0x0df2f5ce
    RDRAND() = 0x0df2f5ce
    RDRAND() = 0x109e5c8a
    RDRAND() = 0x109e5c8a

@poettering (Member) commented Jan 11, 2021 (edited):

@asahaf looks like systemd is not at fault. I figure:

1. someone who cares needs to report this to AMD
2. kernel should probably mask RDRAND on these CPUs too, i.e. follow-up for kernel commit c49a0a80137c7ca7d6ced4c812c9e07a949f6f24. someone who cares should ping relevant kernel folks about this
3. Someone who cares should ask for a CVE against the AMD CPUs.

And we can close this here I guess, we can't do much about this from our side.

@poettering (Member) commented Jan 11, 2021:

/cc @tlendacky (the author of c49a0a80137c7ca7d6ced4c812c9e07a949f6f24)

@bearoso commented Jan 11, 2021:

@asahaf I've got an option on my Gigabyte B550 BIOS concerning RdRand. Could you look in your BIOS settings for "Settings->AMD CBS->CPU Common Options->RdRand Speedup Disable" and try toggling that and see if things improve?

@asahaf (Author) commented Jan 11, 2021:

let me check that

@briansmith commented Jan 11, 2021:

> you're right, after I've generated many randoms, I was able to get collisions.

Are you checking the carry flag in your test?:

> The Carry Flag indicates whether a random value is available at the time the instruction is executed. CF=1 indicates that the data in the destination is valid. Otherwise CF=0 and the data in the destination operand will be returned as zeros for the specified width.
> All other flags are forced to 0 in either situation.

@asahaf (Author) commented Jan 11, 2021:

@bearoso Disabling Settings->AMD CBS->CPU Common Options->RdRand Speedup didn't help

@asahaf (Author) commented Jan 11, 2021:

@briansmith Here is the code I used to generate the numbers

    // SPDX-License-Identifier: GPL-2.0
    /*
     * Copyright (C) 2019 Jason A. Donenfeld. All Rights Reserved.
     *
     * Compile: `gcc -o test-rdrand -O3 -mrdrnd -std=gnu99 test-rdrand.c`
     */
    #include <stdio.h>
    #include <stdint.h>
    #include <immintrin.h>

    int main(int argc, char *argv[])
    {
        for (int j, i = 0; i < 2000; ++i) {
            /* Retry up to 10 times; the builtin returns nonzero only when CF=1. */
            for (j = 0; j < 10; ++j) {
                uint32_t val = 0;
                if (__builtin_ia32_rdrand32_step(&val)) {
                    printf("RDRAND() = 0x%08x\n", val);
                    break;
                }
            }
            if (j == 10) {
                puts("RDRAND() = FAIL");
                return 1;
            }
        }
        return 0;
    }

@briansmith commented Jan 11, 2021:

> __builtin_ia32_rdrand32_step

If you can, please try it without using the C intrinsic. I have seen multiple compiler bug reports (https://patchwork.ozlabs.org/project/gcc/patch/20120520170426.GA7774@intel.com/ is the easiest to find) that indicate that the intrinsics don't always return the right value.

@asahaf (Author) commented Jan 11, 2021:

sometimes rdrand works fine and reliably, and I'm not even able to reproduce the issue.

@patrickschur commented Jan 11, 2021:

The problem is that the entropy value isn't changing. You can try to disable the RDRAND speedup optimization in UEFI (if that's enabled).

@asahaf (Author) commented Jan 11, 2021:

@patrickschur I've tried that. It didn't help.

@xerz-one commented Jan 11, 2021:

Hi, I have a 5900X computer here and I'm unable to replicate this behavior using the snippet at #18184 (comment), compiled with gcc oof.c -mrdrnd -o oof, and running it repeatedly with ./oof | uniq -cd. Anything I may have done wrong or should look for?
@asahaf (Author) commented Jan 11, 2021 (edited):

@xerz-one I had to try many times to be able to catch it. Also, there is something I might be doing wrong: I look for a collision in a produced set of 200 values. And if I look for a collision among 32-bit numbers, it is easy to find one if I try often enough. That's why what I've tried may not be a good test.

@patrickschur commented Jan 11, 2021 (edited):

Can someone replicate this issue with ComboAM4v2 1.1.9.0 or newer? Because the problem should be fixed by a newer version of the PSP firmware (00.14.00.24 or later) which is included in 1.1.9.0. But not every partner has updated their UEFI yet.

Edit: Maybe ComboAM4v2 1.1.8.0 also works, because this version includes the PSP firmware version 00.14.00.23, which also fixes an entropy bug.

@andrew-d commented Jan 11, 2021:

Here's a hacked-together version that doesn't use intrinsics, per @briansmith's recommendation, if anyone wants to try it. I included a commented-out bit for testing RDSEED - would be interested if that has similar behavior. Also, would appreciate if someone else would sanity-check the code, just in case: https://gist.github.com/andrew-d/bcbe477f7de9a03c7b8285bcee531196

@xerz-one commented Jan 11, 2021:

@asahaf But how often? for i in {1..1000}; do ./oof | uniq -cd; done gives me nothing, for instance

@xerz-one commented Jan 11, 2021:

@andrew-d Did gcc ooof.c -o ooof followed by for i in {1..1000}; do ./ooof | uniq -cd; done, got no output

@asahaf (Author) commented Jan 11, 2021 (edited):

@xerz-one the way uniq works is that it only looks for consecutive duplicate values. If the same value appears in the set more than once but not consecutively, uniq won't catch it. You may need to sort the output.
    for i in {1..1000}; do ./oof | sort | uniq -cd; done

@xerz-one commented Jan 11, 2021:

Ah thanks, sorry for the mistake! Now I'm getting up to two collisions per run

[image: my shell presenting collisions]

The source code presented by @andrew-d also shows a similar behavior

@asahaf (Author) commented Jan 11, 2021:

@xerz-one what is the set size? Also, do you think the way we do the test is right?

@xerz-one commented Jan 11, 2021 (edited):

Set size for each command run should follow i < 2000, j < 10 as in the script. As for the shell loop, I probably should be taking the stdout of all of the runs together, instead of separately, as I currently am.

Disclaimer: I am not infosec, any advice is welcome.

@jamescooper-blis commented Jan 11, 2021 (edited):

> Ah thanks, sorry for the mistake! Now I'm getting up to two collisions per run
> [image: my shell presenting collisions]
> The source code presented by @andrew-d also shows a similar behavior

I can reproduce this on Ubuntu 16.04 running on VMWare Player 16 on an Intel i7-10875H CPU. Host is Windows 10.

Edit: Also same VM setup on an Intel i4790k.

Edit: Also a GCP instance using:

    cat /proc/cpuinfo
    processor  : 3
    vendor_id  : GenuineIntel
    cpu family : 6
    model      : 79
    model name : Intel(R) Xeon(R) CPU @ 2.20GHz

I did have to change the loop to 'for i in {1..100000}'

This code: https://gist.github.com/andrew-d/bcbe477f7de9a03c7b8285bcee531196 @andrew_d

@xerz-one commented Jan 11, 2021 (edited):

@asahaf Here's the output of a single run of the loop, grouping all output together instead of running a lot of small batches: https://gist.github.com/xerz-one/c422cbb7cf24432c48e56a5c58a926b2

@poettering (Member) commented Jan 11, 2021:

> > you're right, after I've generated many randoms, I was able to get collisions.
> Are you checking the carry flag in your test?:

(side note: systemd's code does, and did since day 1: https://github.com/systemd/systemd/blob/master/src/basic/random-util.c#L134)

@void-witch commented Jan 11, 2021:

running a ryzen 5 1600, i can reproduce the issue

fedora 33, 5.9.16-200.fc33.x86_64

running for i in {1..10000}; do ./oof | sort | uniq -cd; done (where oof is the test from above)

    for i in {1..10000}; do ./oof | sort | uniq -cd; done
          2 RDRAND() = 0xef32dbd3
          2 RDRAND() = 0x353d0542
          2 RDRAND() = 0xdfd1bf5a
          2 RDRAND() = 0x100dda3b
          2 RDRAND() = 0xe3ff5e70
          2 RDRAND() = 0xae5622ff
          2 RDRAND() = 0x92d5d5eb
          2 RDRAND() = 0xe18a56a9

@SantiagoTorres commented Jan 11, 2021:

I'm just a bystander here, but FWIW: the birthday paradox for this experiment (with a 2^32 set) gives ~77164 RDRAND runs before you get a 50% chance of a "collision." If we are running 2000*10000 RDRAND calls (20e6, I think), then you're very likely to get collisions, regardless of the randomness of your generator.

@poettering (Member) commented Jan 11, 2021 (edited):

well, getting collisions of course will happen, 2^32 is not that large, and there's the birthday paradox. See table at:

https://en.wikipedia.org/wiki/Birthday_problem#Probability_table

i.e. for a 32bit value the probability for a collision grows above 0.5 once you have taken ~77000 values, even if everything is in order.

hence:

> you're right, after I've generated many randoms, I was able to get collisions.

@asahaf how many "randoms" did you actually generate?

@Kotters commented Jan 11, 2021 (edited):

Running a 3950x, using Ubuntu in WSL1.
At first I didn't get any collisions:

    kota@Descartes:~$ for i in {1..1000}; do ./ooof | sort | uniq -cd; done
    kota@Descartes:~$ for i in {1..1000}; do ./ooof | sort | uniq -cd; done
    kota@Descartes:~$ for i in {1..1000}; do ./ooof | sort | uniq -cd; done
    kota@Descartes:~$ for i in {1..1000}; do ./ooof | sort | uniq -cd; done
    kota@Descartes:~$ cat /proc/cpuinfo | grep "model name"
    model name : AMD Ryzen 9 3950X 16-Core Processor

But a few minutes later, I tried again:

    kota@Descartes:~$ cat /proc/cpuinfo | grep "model name" | head -n1
    model name : AMD Ryzen 9 3950X 16-Core Processor
    kota@Descartes:~$ for i in {1..1000}; do ./ooof | sort | uniq -cd; done
    kota@Descartes:~$ for i in {1..1000}; do ./ooof | sort | uniq -cd; done
          2 RDRAND() = 0xbcac5dac
    kota@Descartes:~$ for i in {1..1000}; do ./ooof | sort | uniq -cd; done
          2 RDRAND() = 0x0ed09aee
    kota@Descartes:~$ for i in {1..1000}; do ./ooof | sort | uniq -cd; done
          2 RDRAND() = 0x71a87421
    kota@Descartes:~$

edit: To be clear, I'm running this version of the script:
double edit: Not a script. c code. Treat me as a data point, not a competent.

    int main(int argc, char *argv[])
    {
        int failures = 0;
        for (int j, i = 0; i < 2000; ++i) {
            for (j = 0; j < 10; ++j) {
                uint32_t val = 0;
                if (rdrand(&val)) {
                    printf("RDRAND() = 0x%08x\n", val);
                    break;
                } else {
                    failures++;
                }
            }
            if (j == 10) {
                puts("RDRAND() = FAIL");
                return 1;
            }
        }
        return 0;
    }

@asahaf (Author) commented Jan 11, 2021:

@poettering the set size is 2000 random values per run.

@poettering (Member) commented Jan 11, 2021:

ahem, so this isn't a "script" (it's a c program), and the "script" checks the carry flag, i.e. finds cases where RDRAND actually reports a failure itself. This is not relevant at all whatsoever. It's totally OK from systemd's perspective if RDRAND fails as long as it tells us via carry flag.
Problem is if it gives us some random value, claims all went good, but the random value is not actually that random and for some reason is returned soonishly again multiple times, so that uuids generated from that are not actually unique as they should be...

The original issue was cases where RDRAND collisions are much more likely than they are supposed to be, and RDRAND does not tell us about any failure via carry...

@asahaf (Author) commented Jan 11, 2021:

@poettering this is what I've noticed: sometimes you get collisions very often and sometimes it's hard to get any. It's true that, due to the limited size of uint32, we are for sure going to get collisions, but the question is how often.

@poettering (Member) commented Jan 11, 2021:

> @poettering the set size is 2000 random per run.

and you had collisions on every run? or how often did you have to repeat to get collisions? as mentioned, if you run RDRAND ~77000 times the chance you get a collision is > 0.5 even if your RNG is working correctly

@bearoso commented Jan 11, 2021:

You're generating 2000000 numbers out of 4294967296. What is the likelihood of two numbers matching given the birthday paradox?

@myfreeweb commented Jan 11, 2021:

> This is not relevant at al whatsoever

Seems like counting duplicates of the output with sort and uniq would actually count duplicates that RDRAND did not flag as failures via the carry flag, because the printf("RDRAND() = 0x%08x\n", val) is in another branch vs. the failures++ carry counter.