[HN Gopher] Constant-Time Code: The Pessimist Case [pdf]
___________________________________________________________________
Constant-Time Code: The Pessimist Case [pdf]
Author : yuedongze
Score : 60 points
Date : 2025-03-09 05:21 UTC (3 days ago)
(HTM) web link (eprint.iacr.org)
(TXT) w3m dump (eprint.iacr.org)
| Terr_ wrote:
| So basically compilers, and optimizations inside CPUs, will keep
| aggressively optimizing code to create the vulnerability.
|
| Perhaps this indicates a missing concept in our compilation and
| execution of code, metadata that _demands_ a section is not
| optimized.
| colonial wrote:
| At least on the compiler level, LLVM already offers a "black
| box" intrinsic that nudges the optimizer to be maximally
| pessimistic about whatever it wraps.
|
| It doesn't carry any security guarantees (Rust's documentation
| on their wrapper for it explicitly strikes out "cryptographic
| or security" purposes) but one could imagine a similar
| intrinsic with stronger abilities.
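|
| A rough C analogue of that intrinsic is the empty inline-asm
| barrier idiom (a sketch using a GCC/Clang extension; the
| function name is mine):
|
|       #include <stdint.h>
|
|       /* The empty asm claims to read and modify x, so the
|          optimizer has to treat the value as opaque; no machine
|          code is emitted for the asm itself. */
|       static inline uint32_t value_barrier(uint32_t x) {
|           __asm__ volatile("" : "+r"(x));
|           return x;
|       }
|
| Like black_box, it only discourages the optimizer; it carries
| no guarantee about what the hardware does with the value.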
| pclmulqdq wrote:
| Most constant-time code today is done in assembly, which is a
| mild downgrade from when it was done in C.
| pjc50 wrote:
| Yes. I keep arguing that C is further and further from
| pretending to be a "portable assembler", and rather than trying
| increasingly elaborate tricks to generate the series of machine
| instructions you want, you should just... emit those
| instructions yourself in the first place.
|
| The mention of CMOV is good. Additional tricks are available
| through vectorization - necessarily all elements of the vector
| have to execute at the same speed, which gives you extra
| options to put crafted data in a different column to force
| computation to proceed at a known rate.
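|
| For concreteness, the usual mask-based selection looks like
| this in C (a sketch; whether the compiler keeps it branchless,
| or turns it back into a branch, is exactly the problem at
| hand):
|
|       #include <stdint.h>
|
|       /* Select a if bit == 1, b if bit == 0, with no branch:
|          0 - bit is all-ones for 1 and all-zeros for 0. */
|       static inline uint32_t ct_select(uint32_t bit,
|                                        uint32_t a, uint32_t b) {
|           uint32_t mask = (uint32_t)0 - bit;
|           return (a & mask) | (b & ~mask);
|       }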
|
| We should not forget that some CPUs offer dedicated
| instructions - e.g. Intel AES-NI.
|
| Another solution is to just stop trying to do it on the general
| purpose processor altogether and move it to some sort of HSM or
| TPM. Put your eggs in a smaller, more defensible basket. In a
| world of GPUs and AI accelerators it's easier to ask for a
| crypto accelerator (or rather decelerator!).
|
| Is there any good work on constant-time crypto on the GPU?
|
| If anyone manages to make homomorphic encryption viable, that
| provides another solution, at a huge performance penalty.
| pclmulqdq wrote:
| I don't think it's anywhere close to viable to move the
| cryptographic parts of the data plane into HSMs/TPMs. There's
| just too much work to do there, and you have to move
| plaintext over unsecured channels to do it. That means that
| you have to put at least an ephemeral key into the CPU, and
| the rest follows.
|
| AES-NI, the SHA instructions, and constant-time subsets of
| instructions are generally good enough that you can do this
| in assembly.
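|
| For illustration, one AES round through the AES-NI intrinsics
| looks like this in C (a sketch; the instruction's latency is
| data-independent by design):
|
|       #include <wmmintrin.h>  /* AES-NI; compile with -maes */
|
|       /* One AES encryption round (ShiftRows, SubBytes,
|          MixColumns, XOR with the round key) in a single
|          instruction whose timing does not depend on the data. */
|       __m128i aes_round(__m128i state, __m128i round_key) {
|           return _mm_aesenc_si128(state, round_key);
|       }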
| GMoromisato wrote:
| I didn't read TFA, but does this imply that all encryption needs
| to be done in a special enclave with dedicated hardware (that
| isn't vulnerable to this)?
| jiehong wrote:
| So far, SIMD operations are constant time, since the whole vector
| needs to be computed, and not just a single element in it.
|
| ChaCha20 or Kyber might be using that, although I don't know if
| any cryptographic algorithm depends on vector-only calculation
| as a base for its strength.
|
| Alternatively, I suppose that specialised cryptographic CPUs
| could be made to follow strict standards ensuring constant-time
| runtime.
| viraptor wrote:
| > SIMD operations are contant time, since the whole vector
| needs to be computed
|
| Is that actually guaranteed? Having to do the whole vector
| doesn't necessarily mean the time can't vary for different
| values for complex operations (like division).
|
| I can't find the details, but
| https://gleissen.github.io/papers/iodine.pdf at least mentions
| "Dually, if timing variability is unavoidable, e.g., in SIMD or
| floating-point units, making this variability explicit can
| better inform..." so it sounds like SIMD is at least
| situation-specific here.
| manwe150 wrote:
| I wonder if you could have a compiler that intentionally pads
| every instruction to use a vector register pair of <value,
| slow-const>, so that every value gets paired with whatever
| constant or other expression will cause the slowest execution
| of that vector instruction and with the maximum latency
| chain. Otherwise, it does look like the VDIV instructions can
| have variable latencies based on the data itself (and not
| just spectre-like speculation issues on the memory access
| patterns).
|
| From https://uops.info/table.html
| Sesse__ wrote:
| You don't have any guarantee that the SIMD operations are
| actually done in parallel (which is one of the assumptions
| needed for "the latency matches that of the slowest
| element"). E.g., I believe VPGATHERDD is microcoded (and
| serial) on some CPUs, NEON (128-bit) is designed to be
| efficiently implementable by running a 64-bit ALU twice,
| and AVX2 and AVX-512 are likewise double-pumped on some (many)
| CPUs...
| hovav wrote:
| It's not guaranteed. See section 7 of
| https://www.usenix.org/system/files/conference/usenixsecurit...
| hinkley wrote:
| In MD5, there are calculations that, if they produce certain
| data ranges in certain bytes, result in a secondary calculation
| to spread that change into other bytes. Like a non-Euclidean
| carry operation.
|
| So I would think if you treat all of the bytes as one SIMD
| instruction, do a SIMD compare, and then fire another operation
| as a result... if you could guarantee that the third operation
| would have to fire for at least one of the vector values every
| time, then I think you'd have your solution.
| kibwen wrote:
| Give me hardware with a discrete cryptographic processing unit
| which is just a CPU that has no speculation, caching, or
| other nonsense. In other words, a CPU that actually works the way
| it's supposed to work. I don't care that idiots want their wrong
| software to be as fast as possible, I want correct software
| that's as fast as possible, and no faster.
| adamc wrote:
| But they want to sell as many cpus as possible, right? So that
| market wins.
| Sesse__ wrote:
| You can easily buy such CPUs in the microcontroller market,
| just go ahead. They're being sold by the billions.
| adamc wrote:
| Presumably, performance also matters.
| Sesse__ wrote:
| Well, if you say you don't want caching and speculation,
| then you've essentially given up on performance already.
| There's a reason why these small cores are slow.
| pjc50 wrote:
| A TPM, HSM, or Intel AES-NI type solution?
| hinkley wrote:
| You know given the rise of chiplets, a custom core isn't a
| stupid idea...
| hinkley wrote:
| Are we going to end up doing all our crypto on the big.LITTLE
| cores?
|
| I wonder if they're fast enough...
| layer8 wrote:
| As the paper notes (page 21 last paragraph), this isn't just
| about cryptographic algorithms though, it's about the
| processing of any "secret" data.
| jjmarr wrote:
| Can an integrated CPU-FPGA design (like the ones AMD sells) be a
| solution to this problem? Point 5 of the paper says:
|
| > Modern hardware and software stacks use a very layered
| structure, with each layer striving to hide its internal
| functioning details from upper layers. The socio-economic context
| which allowed such hardware to exist inherently relies on such
| abstraction (under the name of "industrial secrecy"). An
| effective constant-time coding process would require a model of
| computation with strong guarantees on timing-related
| characteristics, that would be maintained through all these
| layers, down to the semiconductors that implement the logic
| gates. This industry-wide vertical cooperation is unlikely to
| happen.
|
| It's not mentioned in the paper, but hardware description
| languages provide strong guarantees on timing-related
| characteristics on the semiconductor level. Hardware companies
| are willing to provide user access to FPGAs, as evidenced by the
| userspace FPGA subsystem in the Linux kernel as well.[1] You can
| buy Versal SOCs right now that combine ARM CPUs with FPGAs on one
| die, run Linux on them, and outsource your crypto tasks to the
| FPGA.
|
| I see a future where we give up on constant-time coding for CPUs
| and move crypto libraries to FPGAs, given the hundreds of
| billions of dollars in our economy reliant on strong encryption.
| That'd be a world in which every corporate laptop is required to
| have an FPGA for compliance purposes.
|
| It'd be interesting to know if anyone has _tried_ this.
|
| [1] https://www.phoronix.com/news/FPGA-User-Space-Interface-AMD
| NoahZuniga wrote:
| FPGAs are very expensive, so a TPU that implements the most
| important algorithms might be much cheaper because the design
| cost can be amortized over the huge number of required chips.
| hinkley wrote:
| My impression is that there's a lot of mental shorthand in
| the chip design community and FPGAs are used for prototyping
| and then translated into ASICs for any niche or larger
| applications. I presume there's a pretty straightforward
| translation process between the two, though no one has ever
| tried to explain it to me.
| boothby wrote:
| A very simple description of an FPGA is that it's got a
| bunch of switches on the ends of wires. Some of the
| switches can connect wires to logic elements and some can
| connect wires to other wires. In this view, programming an
| FPGA is just toggling switches so that some of them are
| "on" and the rest are "off".
|
| The easiest migration from FPGA to ASIC is to just make a
| chip with the same circuit elements and wire segments, but
| instead of making switches, you just short out connections
| in the "on" state and leave the rest open.
| jjmarr wrote:
| I believe the challenge that FPGAs can address is
| sociotechnical, because the developer of a crypto library
| will have much more control over the stack and does not need
| to trust others.
|
| Many high-frequency trading companies use FPGAs over ASICs
| for similar reasons. FPGAs are more expensive but allow them
| to have full control over the algorithms implemented and
| don't require giving secret information to a foundry.
|
| In other words, eliminate the impedance mismatch between
| _responsibility_ and _control_ for developing a secure
| system.
|
| It'll be cheaper to implement cryptography on an ASIC. But
| the author of this paper wants to control every single aspect
| of the encryption/decryption process. Said developer can
| confidently say the system is secure. You can't say you've
| delivered a secure system if you're getting your ASICs from
| another company that doesn't provide implementation details
| because it'd (justifiably) give others a competitive
| advantage.
|
| Question I'd have is whether the cost difference between
| ASICs/FPGAs is worth it for the majority of applications. $1
| or $10 on every CPU might mean a world in which every laptop
| has an FPGA, but $100? What about for server-side
| applications? Would a hyperscaler spend $1000 extra on every
| rack if it allowed guaranteed constant-time encryption?
| burch45 wrote:
| It's not about "giving secret information to a foundry".
| It's entirely the field programmable (FP) feature. It's
| also not really programmable in the sense that you would be
| sending in new instructions in realtime. Reconfigurable is
| a better word. So giving everyone an FPGA in their laptop
| isn't really going to help anyone except some enthusiast
| who wants to try out some different algorithms.
| pjc50 wrote:
| The FPGA idea raises security questions of its own - how do you
| get the bitstream over securely, then the data and results,
| without leaking the data you're trying to protect or being
| vulnerable to evil clones of the bitstream? Or the FPGA debug
| JTAG interface?
|
| Meanwhile Windows is requiring that every laptop have a small
| crypto processor for its own compliance processes (i.e.
| bitlocker).
| jjmarr wrote:
| I would assume the bitstream would only contain non-secret
| implementation details and keys would be loaded at runtime
| rather than being synthesized into the design.
|
| In terms of moving the data over to the FPGA, I have no idea.
| But if it's all on the same die, it doesn't seem like there's
| a big _physical_ attack surface that's different than just
| doing stuff on the CPU.
| hinkley wrote:
| I'm confused by the quoted text because timing attacks rely on
| at-most behavior and abstraction layers on at-least behavior.
|
| Abstractions cannot send messages before they receive them. So
| any delay you add at the top must be magnified as you go down.
| The exception to this is if the contents of the message require
| different behaviors for different payloads, in which case they
| are correct. But encrypted payloads are opaque to the layers
| they are sent through. You can observe who the message was sent
| to, and know the maximum amount of data the conversation might
| have contained. But not a very clear idea of how long it took
| to build the message, unless you've already compromised the
| machine.
| manwe150 wrote:
| Recent timing attacks rely on the observation that modern
| CPUs send some messages faster than other messages, based on
| predicting what they might contain, so some delays get
| magnified (those that deviate from expectations) while other
| delays (those that match prior data) get minimized as you go
| down. An encrypted payload leaks this same information too
| (the process is independent of what data is being
| transferred), although that leaked information is (hopefully)
| not useful since it just leaks the encrypted data, which
| (hopefully) looks like random noise. But that data has to be
| encrypted and decrypted at some point somewhere, which gives
| a point to attack.
| zokier wrote:
| editorialized title, originally "Constant-Time Code: The
| Pessimist Case"
|
| most of the points brought up in the article have been happening
| already for decade(s); it's not something that 'will soon'
| happen.
| rwmj wrote:
| RISC-V has a constant-time coding extension (Zkt,
| https://riscv-software-src.github.io/riscv-unified-db/manual...).
| I believe the
| idea of this is that your CPU will have a dedicated core that
| implements Zkt and performs a known set of operations in constant
| time. I'm not sure that anyone has actually implemented it.
|
| Do other ISAs have anything similar?
| camel-cdr wrote:
| No, Zkt is part of the application profiles, so it's expected
| to be implemented by high-perf OoO cores. It just defines that
| a list of instructions must have "data-independent timing".
| beng-nl wrote:
| Yes, x86:
|
| https://www.intel.com/content/www/us/en/developer/articles/t...
| touisteur wrote:
| Oooh has it shipped anywhere already ?
| pclmulqdq wrote:
| Everywhere. The constant-time bit defaults to "1" because
| the current x86 CPUs follow all of the promises of this
| already. The particular feature here is to allow Intel to
| make optional optimizations that are not constant-time on
| the particular subset of instructions.
|
| ARM cores also have a similar set of promises.
| IshKebab wrote:
| Zkt is probably widely implemented even if not declared. Since
| most RISC-V cores are towards the lower end of the spectrum
| (in-order, etc.), you'd have to go to extra effort not to implement
| it.
| nickpsecurity wrote:
| One old technique that can counter this is as follows:
|
| 1. Clock how long the operation takes on any type of data. By
| operation, it would be a function call maybe for a whole block or
| buffer.
|
| 2. Add a small margin of time to that.
|
| 3. Modify the function call to make sure it has waited that long
| before returning a response. That is true whether it finished
| early or not.
|
| The result prevents the function call from leaking timing
| information.
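|
| A minimal sketch of steps 1-3 in C on POSIX (BUDGET_NS is a
| made-up figure standing in for the measured worst case plus the
| margin from step 2):
|
|       #include <time.h>
|
|       #define BUDGET_NS 500000LL  /* placeholder: 0.5 ms */
|
|       /* Run op(), then spin until the fixed budget has elapsed,
|          so callers observe the same duration for any input. */
|       void padded_op(void (*op)(void)) {
|           struct timespec t0, t1;
|           clock_gettime(CLOCK_MONOTONIC, &t0);
|           op();
|           do {
|               clock_gettime(CLOCK_MONOTONIC, &t1);
|           } while ((t1.tv_sec - t0.tv_sec) * 1000000000LL
|                    + (t1.tv_nsec - t0.tv_nsec) < BUDGET_NS);
|       }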
|
| High-assurance security certification in the 1980s also
| required mitigating covert storage channels. That would require,
| at a minimum, overwriting any memory used during the call and all
| the CPU registers before returning to the caller.
| some_furry wrote:
| It might make the exploits mildly more difficult in some cases,
| but adding a synthetic delay (whether randomized or
| artificially inflated to meet a certain high water mark) isn't
| likely to help.
|
| https://security.stackexchange.com/a/96493/271113
|
| For example, consider the case of a cache-timing leak (rather
| than the classical "string compare" one). You'd have to have a
| lot of control over the behavior of your operating environment
| to guarantee it doesn't leak bits of your secret from cache
| misses.
|
| When you add a delay to your processing, unless the same tasks
| are being executed, I would expect power analysis leakage and
| random jitter from the kernel's scheduler to reveal that fact.
| It might be a more costly analysis to detect, but it's still
| there.
|
| Generally, I think this is a doomed line of reasoning.
| sneak wrote:
| Cache and power need to be exploited locally, generally.
| Introducing random delays to raise the noise floor would work
| for network services, I believe.
| some_furry wrote:
| It depends.
|
| AES cache-timing was broken over a network (but required,
| like, 2^28 samples).
|
| I wouldn't bet the farm on this line of thinking providing
| resilience. It might be just annoying enough for an
| attacker to not really bother. (Or maybe only if the target
| is, like, a software cryptocurrency wallet with enough
| value to loot if they're successful.)
| hovav wrote:
| > power need[s] to be exploited locally
|
| Not in the presence of DVFS, it turns out:
| https://www.hertzbleed.com/hertzbleed.pdf
| i2km wrote:
| Actually, the link you provide seems to support the parent
| comment's suggestion, rather than detract from it.
|
| The previous comment was suggesting making sure that every
| code path takes the same amount of time, not adding a random
| delay (which doesn't work).
|
| And while I agree that power-analysis attacks etc. are still
| going to apply, the overarching context here is just timing
| analysis.
| some_furry wrote:
| The link I provided is about random delays being inferior
| to setting a high water mark, yes.
|
| I'm not just echoing the argument made by the link, though.
| I'm adding to it.
|
| I don't think the "cap runtime to a minimum value" strategy
| will actually help, due to how much jitter your cap
| measurements will experience from the normal operation of
| the machine.
|
| If you filter it out when measuring, you'll end up capping
| too low, so some values will be above it. For a
| visualization, let's pretend that you capped the runtime at
| 80% what it actually takes in the real world:
|       function biased() { return Math.max(0.8, Math.random()) }
|       let samples = [];
|       for (let i = 0; i < 1000; i++) {
|           samples.push(biased());
|       }
|       // Now plot `samples`
|
| Alternatively, let's say you cap it sufficiently high that
| there's always some slack time at the end.
|
| Will the kernel switch away to another process on the same
| machine?
|
| If so, will the time between "the kernel has switched to
| another process since we're really idling" to "the kernel
| has swapped back to our process" be measurable?
|
| It's better to make sure your actual algorithm is actually
| constant-time, even if that means fighting with compilers
| and hardware vendors' decisions.
| tromp wrote:
| The article details how current techniques for preventing timing
| attacks are becoming ineffective for the following reasons:
|
| > 1. Compilers are applying optimization techniques that are
| heuristically good for performance on general software, but
| happen to leak information through timing-based sidechannels when
| used on secret data. Since general-purpose CPUs and compilers are
| applied to a wide variety of tasks, compilers have no reason to
| stop doing such optimizations.
|
| > 2. Constant-time coding techniques mostly aim at fooling the
| compiler, to prevent it from applying these optimizations. Since
| compilers keep getting smarter, these techniques lose their
| efficiency.
|
| > 3. Just-in-time (JIT) compilers have access to runtime data;
| they can, and will, use such information to perform extra
| optimizations that static compilers cannot do, and further
| destroy the developer's attempt at achieving constant-time code.
|
| > 4. JIT compilation is becoming pervasive and is now employed in
| various contexts, in particular inside CPUs. Such compilers can
| do the same optimizations as "external" JIT compilers, but are
| not publicly documented, and the output of their "machine" code
| translation cannot be inspected to detect deviations from
| constant-time execution.
|
| > 5. Modern hardware and software stacks use a very layered
| structure, with each layer striving to hide its internal
| functioning details from upper layers. The socio-economic context
| which allowed such hardware to exist inherently relies on such
| abstraction (under the name of "industrial secrecy"). An
| effective constant-time coding process would require a model of
| computation with strong guarantees on timing-related
| characteristics, that would be maintained through all these
| layers, down to the semiconductors that implement the logic
| gates. This industry-wide vertical cooperation is unlikely to
| happen.
| armchairhacker wrote:
| Why is cooperation unlikely? AFAIK it's not too hard to make a
| compiler support a function attribute that says "do not
| optimize this function at all"; I assume in most compilers the
| translation and optimization phases are _mostly_ separate, and
| inter-procedural optimizations can treat them as externally-
| linked (untouchable). I'm less familiar with assembly, but
| couldn't processors expose instructions to enable or disable
| their JIT?
|
| I imagine it's not _easy_ easy, because it requires every layer
| to cooperate, but if there's demand I think it would be done.
|
| IPv6 is in a similar situation where it requires widespread
| adoption, but I also suspect most people have issues with it
| and there isn't much demand because NATs and whatever support
| for cellular networks has made it unnecessary.
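|
| Per-compiler versions of that attribute do exist today, just
| without any timing guarantee. A sketch (optnone and optimize
| are real Clang/GCC attributes; the macro name is mine):
|
|       #if defined(__clang__)
|       #  define NO_OPT __attribute__((optnone))
|       #elif defined(__GNUC__)
|       #  define NO_OPT __attribute__((optimize("O0")))
|       #endif
|
|       /* Disables this compiler's optimizer for one function;
|          it says nothing about what the CPU does underneath. */
|       NO_OPT int demo(int x) { return x * 2; }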
| nayuki wrote:
| > it's not too hard to make a compiler support a function
| attribute that says "do not optimize this function at all"
|
| Note that memset_s() is an existing special case already.
| https://en.cppreference.com/w/c/string/byte/memset
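|
| For illustration (C11 Annex K, where available; glibc notably
| does not ship it):
|
|       #define __STDC_WANT_LIB_EXT1__ 1
|       #include <string.h>
|
|       /* Unlike a plain memset before free(), memset_s may not
|          be elided as a dead store, so the key really is wiped. */
|       void wipe_key(unsigned char key[32]) {
|           memset_s(key, 32, 0, 32);
|       }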
| hinkley wrote:
| > The intent is to produce the mask values without using a plain
| comparison ("x != 0") so that the compiler does not notice that
| we are really making a conditional move. The compiler is not
| fooled.
|
| So practice your martial arts on the bow of a rowboat. Balance,
| Danielsan.
|
| The whole timing thing is essentially an issue of getting off-
| balance as in martial arts and the solution is that there are
| actions you simply never do, and ones that look like they lack
| economy of motion because so much of the work is in keeping your
| center of gravity from shifting outside of your base. Tai chi is
| infinitely better at protecting you from concussing yourself on
| black ice than it is at protecting you from a mugger.
|
| So if you have a conditional move for one calculation, then move
| a to b if !c and move it to d if c. And when the compilers chase
| that down in another generation, or someone notices that the L1
| cache timing is different if d gets cold, then use both
| calculations in the output. Calculate complements where the
| runtime of a + !a = 1 to several decimal points.
|
| Will it work for eternity? Unlikely. Will it work for a decade?
| Probably.
| kmeisthax wrote:
| x86 and ARM both have options for executing certain instructions
| with data-independent timing. The problem here is that languages
| and compilers need to expose and honor data types that compile
| down to only those instructions.
|
| This would be something like a 'secret' keyword that would
| qualify integer types; i.e. in C you could have a variable of
| type 'secret unsigned int' and the compiler would reject
| optimizations that would reveal timing information about the
| variable's contents, while optimizing the rest of the program's
| non-secret variables. Values can be promoted to secret but not
| demoted. Code that intends to reveal secrets (e.g. to write them
| out to secure storage) must explicitly cast away the secretness.
|
| AFAIK Golang has crypto/subtle, but that's more special functions
| to do certain things in a data-independent way and not pervasive
| language support for it (and, presumably, they have to keep
| updating the compiler to not optimize that specific module).
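|
| For reference, the C analogue of what crypto/subtle provides is
| usually written like this (a sketch of a constant-time equality
| check):
|
|       #include <stddef.h>
|
|       /* Accumulate the XOR of every byte pair so the loop never
|          exits early, then collapse d to 1 (equal) or 0 (not)
|          without a data-dependent branch. */
|       int ct_compare(const unsigned char *a,
|                      const unsigned char *b, size_t n) {
|           unsigned char d = 0;
|           for (size_t i = 0; i < n; i++)
|               d |= a[i] ^ b[i];
|           return 1 & (((unsigned)d - 1) >> 8);
|       }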
| briansmith wrote:
| At https://rwc.iacr.org/2025/program.php you can see there is a
| talk scheduled to be given in a couple weeks titled "Testing
| Side-channel Security of Cryptographic Implementations against
| Future Microarchitectures" with the following bits in the
| abstract: "Using this framework, we conduct an empirical study of
| 18 proposed microarchitectural optimizations on 25
| implementations of eight cryptographic primitives in five popular
| libraries. We find that every implementation would contain
| secret-dependent leaks, sometimes sufficient to recover a
| victim's secret key, if these optimizations were realized.
| Ironically, some leaks are possible only because of coding idioms
| used to prevent leaks under the standard constant-time model."
| i2km wrote:
| One whole technique not mentioned in the paper or comments is
| bitslicing. For non-branching code (e.g. symmetric ciphers) it's
| guaranteed constant-time and it would be a remarkable compiler
| indeed which could introduce optimizations and timing variations
| to bit-sliced code...
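|
| For the unfamiliar, a toy sketch: each uint64_t lane holds one
| bit from 64 independent instances, so a "gate" is a single
| bitwise op applied to all of them, with nothing to mistime. A
| bitsliced full adder, for example:
|
|       #include <stdint.h>
|
|       /* 64 independent one-bit additions in parallel: sum and
|          carry-out computed with pure bitwise logic. */
|       void bitsliced_adder(uint64_t a, uint64_t b, uint64_t cin,
|                            uint64_t *sum, uint64_t *cout) {
|           *sum  = a ^ b ^ cin;
|           *cout = (a & b) | (cin & (a ^ b));
|       }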
| gavinhoward wrote:
| The author of the paper knows about bitslicing [1], so not
| mentioning it seems deliberate.
|
| My guess is that bitslicing only gets you so far.
|
| [1]: https://bearssl.org/constanttime.html#bitslicing
| dang wrote:
| (Url changed from https://eprint.iacr.org/2025/435, which points
| to this.)
___________________________________________________________________
(page generated 2025-03-12 23:00 UTC)