[HN Gopher] Constant-Time Code: The Pessimist Case [pdf]
       ___________________________________________________________________
        
       Constant-Time Code: The Pessimist Case [pdf]
        
       Author : yuedongze
       Score  : 60 points
       Date   : 2025-03-09 05:21 UTC (3 days ago)
        
 (HTM) web link (eprint.iacr.org)
 (TXT) w3m dump (eprint.iacr.org)
        
       | Terr_ wrote:
       | So basically compilers, and optimizations inside CPUs, will keep
        | aggressively optimizing code to create the vulnerability.
       | 
       | Perhaps this indicates a missing concept in our compilation and
       | execution of code, metadata that _demands_ a section is not
       | optimized.
        
         | colonial wrote:
         | At least on the compiler level, LLVM already offers a "black
         | box" intrinsic that nudges the optimizer to be maximally
         | pessimistic about whatever it wraps.
         | 
         | It doesn't carry any security guarantees (Rust's documentation
         | on their wrapper for it explicitly strikes out "cryptographic
         | or security" purposes) but one could imagine a similar
         | intrinsic with stronger abilities.
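          | 
          | For illustration, the rough GNU C analogue of that intrinsic
          | (value_barrier is an illustrative name; the empty asm makes
          | the compiler assume x may have changed, so it can't
          | constant-fold or branch on it, but there's no architectural
          | guarantee either):
          | 
          |       #include <stdint.h>
          |       
          |       /* Compiler-only optimization barrier; emits no
          |          machine code by itself. */
          |       static inline uint32_t value_barrier(uint32_t x)
          |       {
          |           __asm__ volatile("" : "+r"(x));
          |           return x;
          |       }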
        
       | pclmulqdq wrote:
       | Most constant-time code today is done in assembly, which is a
       | mild downgrade from when it was done in C.
        
         | pjc50 wrote:
          | Yes. I keep arguing that C is further and further from
          | pretending to be a "portable assembler", and rather than
          | trying increasingly elaborate tricks to generate the series
          | of machine instructions you want, you should just... emit
          | those instructions yourself in the first place.
         | 
         | The mention of CMOV is good. Additional tricks are available
         | through vectorization - necessarily all elements of the vector
         | have to execute at the same speed, which gives you extra
         | options to put crafted data in a different column to force
         | computation to proceed at a known rate.
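          | 
          | (An illustrative mask-based selection in C -- the sort of
          | thing one hopes stays branch-free or becomes a CMOV, with no
          | guarantee a given compiler won't eventually see through it:)
          | 
          |       #include <stdint.h>
          |       
          |       /* Return a if ctl == 1, b if ctl == 0, without
          |          branching; ctl must be exactly 0 or 1. */
          |       static uint32_t ct_select(uint32_t ctl, uint32_t a,
          |                                 uint32_t b)
          |       {
          |           uint32_t mask = -ctl; /* 0 or all-ones */
          |           return (a & mask) | (b & ~mask);
          |       }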
         | 
         | We should not forget that some CPUs offer dedicated
         | instructions - e.g. Intel AES-NI.
         | 
         | Another solution is to just stop trying to do it on the general
         | purpose processor altogether and move it to some sort of HSM or
         | TPM. Put your eggs in a smaller, more defensible basket. In a
         | world of GPUs and AI accelerators it's easier to ask for a
         | crypto accelerator (or rather decelerator!).
         | 
         | Is there any good work on constant-time crypto on the GPU?
         | 
         | If anyone manages to make homomorphic encryption viable, that
         | provides another solution, at a huge performance penalty.
        
           | pclmulqdq wrote:
           | I don't think it's anywhere close to viable to move the
           | cryptographic parts of the data plane into HSMs/TPMs. There's
           | just too much work to do there, and you have to move
           | plaintext over unsecured channels to do it. That means that
           | you have to put at least an ephemeral key into the CPU, and
           | the rest follows.
           | 
           | AES-NI, the SHA instructions, and constant-time subsets of
           | instructions are generally good enough that you can do this
           | in assembly.
        
       | GMoromisato wrote:
       | I didn't read TFA, but does this imply that all encryption needs
       | to be done in a special enclave with dedicated hardware (that
       | isn't vulnerable to this)?
        
       | jiehong wrote:
        | So far, SIMD operations are constant time, since the whole
        | vector needs to be computed, and not just a single element in
        | it.
        | 
        | ChaCha20 or Kyber might be using that. Although, I don't know
        | if any cryptographic algorithm depends on vector-only
        | calculation as a base for its strength.
       | 
        | Alternatively, I suppose that specialised cryptographic CPUs
        | could be made to follow strict standards ensuring constant-
        | time runtime.
        
         | viraptor wrote:
          | > SIMD operations are constant time, since the whole vector
         | needs to be computed
         | 
         | Is that actually guaranteed? Having to do the whole vector
          | doesn't necessarily mean the time can't vary for different
          | values for complex operations (like division).
         | 
         | I can't find the details, but
         | https://gleissen.github.io/papers/iodine.pdf at least mentions
         | "Dually, if timing variability is unavoidable, e.g., in SIMD or
         | floating-point units, making this variability explicit can
          | better inform..." so it sounds like SIMD is at least
          | situation-specific here.
        
           | manwe150 wrote:
           | I wonder if you could have a compiler that intentionally pads
           | every instruction to use a vector register pair of <value,
           | slow-const>, so that every value gets paired with whatever
           | constant or other expression will cause the slowest execution
           | of that vector instruction and with the maximum latency
           | chain. Otherwise, it does look like the VDIV instructions can
           | have variable latencies based on the data itself (and not
           | just spectre-like speculation issues on the memory access
           | patterns).
           | 
           | From https://uops.info/table.html
        
             | Sesse__ wrote:
             | You don't have any guarantee that the SIMD operations are
              | actually done in parallel (which is one of the assumptions
             | needed for "the latency matches that of the slowest
             | element"). E.g., I believe VPGATHERDD is microcoded (and
             | serial) on some CPUs, NEON (128-bit) is designed to be
             | efficiently implementable by running a 64-bit ALU twice,
              | and AVX2 and AVX512 are likewise double-pumped on some
              | (many) CPUs...
        
           | hovav wrote:
           | It's not guaranteed. See section 7 of https://www.usenix.org/
           | system/files/conference/usenixsecurit...
        
         | hinkley wrote:
          | In MD5, there are calculations that, if they produce certain
          | data ranges in certain bytes, trigger a secondary calculation
          | to spread that change into other bytes. Like a non-Euclidean
          | carry operation.
         | 
         | So I would think if you treat all of the bytes as one SIMD
         | instruction, do a SIMD compare, and then fire another operation
         | as a result... if you could guarantee that the third operation
         | would have to fire for at least one of the vector values every
         | time, then I think you'd have your solution.
        
       | kibwen wrote:
       | Give me hardware with a discrete cryptographic processing unit
        | which is just a CPU that has no speculation, caching, or
       | other nonsense. In other words, a CPU that actually works the way
       | it's supposed to work. I don't care that idiots want their wrong
       | software to be as fast as possible, I want correct software
       | that's as fast as possible, and no faster.
        
         | adamc wrote:
          | But they want to sell as many CPUs as possible, right? So that
         | market wins.
        
           | Sesse__ wrote:
           | You can easily buy such CPUs in the microcontroller market,
           | just go ahead. They're being sold by the billions.
        
             | adamc wrote:
             | Presumably, performance also matters.
        
               | Sesse__ wrote:
               | Well, if you say you don't want caching and speculation,
               | then you've essentially given up on performance already.
               | There's a reason why these small cores are slow.
        
         | pjc50 wrote:
         | A TPM, HSM, or Intel AES-NI type solution?
        
           | hinkley wrote:
            | You know, given the rise of chiplets, a custom core isn't a
           | stupid idea...
        
         | hinkley wrote:
         | Are we going to end up doing all our crypto on the big.LITTLE
         | cores?
         | 
         | I wonder if they're fast enough...
        
         | layer8 wrote:
         | As the paper notes (page 21 last paragraph), this isn't just
         | about cryptographic algorithms though, it's about the
         | processing of any "secret" data.
        
       | jjmarr wrote:
       | Can an integrated CPU-FPGA design (like the ones AMD sells) be a
       | solution to this problem? Point 5 of the paper says:
       | 
       | > Modern hardware and software stacks use a very layered
       | structure, with each layer striving to hide its internal
       | functioning details from upper layers. The socio-economic context
       | which allowed such hardware to exist inherently relies on such
       | abstraction (under the name of "industrial secrecy"). An
       | effective constant-time coding process would require a model of
       | computation with strong guarantees on timing-related
       | characteristics, that would be maintained through all these
       | layers, down to the semiconductors that implement the logic
       | gates. This industry-wide vertical cooperation is unlikely to
       | happen
       | 
       | It's not mentioned in the paper, but hardware description
       | languages provide strong guarantees on timing-related
       | characteristics on the semiconductor level. Hardware companies
       | are willing to provide user access to FPGAs, as evidenced by the
       | userspace FPGA subsystem in the Linux kernel as well.[1] You can
        | buy Versal SoCs right now that combine ARM CPUs with FPGAs on one
       | die, run Linux on it, and outsource your crypto tasks to the
       | FPGA.
       | 
       | I see a future where we give up on constant-time coding for CPUs
       | and move crypto libraries to FPGAs, given the hundreds of
       | billions of dollars in our economy reliant on strong encryption.
       | That'd be a world in which every corporate laptop is required to
       | have an FPGA for compliance purposes.
       | 
       | It'd be interesting to know if anyone has _tried_ this.
       | 
        | [1] https://www.phoronix.com/news/FPGA-User-Space-Interface-AMD
        
         | NoahZuniga wrote:
          | FPGAs are very expensive, so a TPU that implements the most
          | important algorithms might be much cheaper, because the design
          | cost can be amortized over the huge number of required chips.
        
           | hinkley wrote:
           | My impression is that there's a lot of mental shorthand in
           | the chip design community and FPGAs are used for prototyping
           | and then translated into ASICs for any niche or larger
           | applications. I presume there's a pretty straightforward
           | translation process between the two, though no one has ever
           | tried to explain it to me.
        
             | boothby wrote:
             | A very simple description of an FPGA is that it's got a
             | bunch of switches on the ends of wires. Some of the
             | switches can connect wires to logic elements and some can
             | connect wires to other wires. In this view, programming an
             | FPGA is just toggling switches so that some of them are
             | "on" and the rest are "off".
             | 
             | The easiest migration from FPGA to ASIC is to just make a
             | chip with the same circuit elements and wire segments, but
             | instead of making switches, you just short out connections
             | in the "on" state and leave the rest open.
        
           | jjmarr wrote:
           | I believe the challenge that FPGAs can address is
           | sociotechnical, because the developer of a crypto library
           | will have much more control over the stack and does not need
           | to trust others.
           | 
           | Many high-frequency trading companies use FPGAs over ASICs
            | for similar reasons. FPGAs are more expensive but allow them
            | to have full control over the algorithms implemented and
            | don't require giving secret information to a foundry.
           | 
           | In other words, eliminate the impedance mismatch between
           | _responsibility_ and _control_ for developing a secure
           | system.
           | 
           | It'll be cheaper to implement cryptography on an ASIC. But
           | the author of this paper wants to control every single aspect
           | of the encryption/decryption process. Said developer can
           | confidently say the system is secure. You can't say you've
           | delivered a secure system if you're getting your ASICs from
           | another company that doesn't provide implementation details
           | because it'd (justifiably) give others a competitive
           | advantage.
           | 
           | Question I'd have is whether the cost difference between
           | ASICs/FPGAs is worth it for the majority of applications. $1
           | or $10 on every CPU might mean a world in which every laptop
           | has an FPGA, but $100? What about for server-side
           | applications? Would a hyperscaler spend $1000 extra on every
           | rack if it allowed guaranteed constant-time encryption?
        
             | burch45 wrote:
             | It's not about "giving secret information to a foundry".
             | It's entirely the field programmable (FP) feature. It's
             | also not really programmable in the sense that you would be
             | sending in new instructions in realtime. Reconfigurable is
              | a better word. So giving everyone an FPGA in their laptop
              | isn't really going to help anyone except some enthusiast
              | who wants to try out some different algorithms.
        
         | pjc50 wrote:
         | The FPGA idea raises security questions of its own - how do you
         | get the bitstream over securely, then the data and results,
         | without leaking the data you're trying to protect or being
         | vulnerable to evil clones of the bitstream? Or the FPGA debug
         | JTAG interface?
         | 
         | Meanwhile Windows is requiring that every laptop have a small
            | crypto processor for its own compliance processes (i.e.
            | BitLocker).
        
           | jjmarr wrote:
           | I would assume the bitstream would only contain non-secret
           | implementation details and keys would be loaded at runtime
           | rather than being synthesized into the design.
           | 
           | In terms of moving the data over to the FPGA, I have no idea.
           | But if it's all on the same die, it doesn't seem like there's
            | a big _physical_ attack surface that's different than just
           | doing stuff on the CPU.
        
         | hinkley wrote:
         | I'm confused by the quoted text because timing attacks rely on
         | at-most behavior and abstraction layers on at-least behavior.
         | 
         | Abstractions cannot send messages before they receive them. So
         | any delay you add at the top must be magnified as you go down.
         | The exception to this is if the contents of the message require
         | different behaviors for different payloads, in which case they
         | are correct. But encrypted payloads are opaque to the layers
         | they are sent through. You can observe who the message was sent
         | to, and know the maximum amount of data the conversation might
         | have contained. But not a very clear idea of how long it took
         | to build the message, unless you've already compromised the
         | machine.
        
           | manwe150 wrote:
           | Recent timing attacks rely on the observation that modern
           | CPUs send some messages faster than other messages, based on
           | predicting what they might contain, so some delays get
           | magnified (those that deviate from expectations) while other
           | delays (those that match prior data) get minimized as you go
           | down. An encrypted payload leaks this same information too
           | (the process is independent of what data is being
           | transferred), although that leaked information is (hopefully)
           | not useful since it just leaks the encrypted data, which
           | (hopefully) looks like random noise. But that data has to be
           | encrypted and decrypted at some point somewhere, which gives
           | a point to attack.
        
       | zokier wrote:
       | editorialized title, originally "Constant-Time Code: The
       | Pessimist Case"
       | 
        | most of the points brought up in the article have been happening
        | already for decade(s); it's not something that 'will soon'
        | happen.
        
       | rwmj wrote:
        | RISC-V has a constant-time coding extension (Zkt, https://riscv-
       | software-src.github.io/riscv-unified-db/manual...). I believe the
       | idea of this is that your CPU will have a dedicated core that
       | implements Zkt and performs a known set of operations in constant
       | time. I'm not sure that anyone has actually implemented it.
       | 
       | Do other ISAs have anything similar?
        
         | camel-cdr wrote:
          | No, this Zkt is part of the application profiles, so it is
          | expected to be implemented by high-perf OoO cores. It just
          | defines that a list of instructions must have
          | "data-independent timing".
        
         | beng-nl wrote:
         | Yes, x86:
         | 
         | https://www.intel.com/content/www/us/en/developer/articles/t...
        
           | touisteur wrote:
            | Oooh, has it shipped anywhere already?
        
             | pclmulqdq wrote:
             | Everywhere. The constant-time bit defaults to "1" because
             | the current x86 CPUs follow all of the promises of this
             | already. The particular feature here is to allow Intel to
             | make optional optimizations that are not constant-time on
             | the particular subset of instructions.
             | 
             | ARM cores also have a similar set of promises.
        
         | IshKebab wrote:
         | Zkt is probably widely implemented even if not declared. Since
          | most RISC-V cores are towards the lower end of the spectrum
          | (in-order etc.), you'd have to go to extra effort not to
          | implement it.
        
       | nickpsecurity wrote:
        | One old technique that can counter this is as follows:
        | 
        | 1. Clock how long the operation takes on any type of data. By
        | operation, I mean a function call, maybe for a whole block or
        | buffer.
       | 
       | 2. Add a small margin of time to that.
       | 
       | 3. Modify the function call to make sure it has waited that long
       | before returning a response. That is true whether it finished
       | early or not.
       | 
       | The result prevents the function call from leaking timing
       | information.
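        | 
        | A minimal sketch of step 3 in C, assuming POSIX clock_gettime
        | and a caller-chosen budget_ns above the worst-case runtime
        | (run_padded is an illustrative name):
        | 
        |       #include <stdint.h>
        |       #include <time.h>
        |       
        |       /* Run op(), then busy-wait until budget_ns have
        |          elapsed, so the caller always observes the same
        |          duration whether op() finished early or not. */
        |       void run_padded(void (*op)(void), int64_t budget_ns)
        |       {
        |           struct timespec t0, t;
        |           int64_t elapsed;
        |           clock_gettime(CLOCK_MONOTONIC, &t0);
        |           op();
        |           do {
        |               clock_gettime(CLOCK_MONOTONIC, &t);
        |               elapsed = (int64_t)(t.tv_sec - t0.tv_sec)
        |                       * 1000000000 + (t.tv_nsec - t0.tv_nsec);
        |           } while (elapsed < budget_ns);
        |       }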
       | 
       | High-assurance, security certification in the 1980's also
       | required mitigating covert, storage channels. That would require,
       | at a minimum, overwriting any memory used during the call and all
       | the CPU registers before returning to the caller.
        
         | some_furry wrote:
         | It might make the exploits mildly more difficult in some cases,
         | but adding a synthetic delay (whether randomized or
         | artificially inflated to meet a certain high water mark) isn't
         | likely to help.
         | 
         | https://security.stackexchange.com/a/96493/271113
         | 
         | For example, consider the case of a cache-timing leak (rather
         | than the classical "string compare" one). You'd have to have a
         | lot of control over the behavior of your operating environment
         | to guarantee it doesn't leak bits of your secret from cache
         | misses.
         | 
          | When you add a delay to your processing, unless the same tasks
          | are being executed, I would expect power-analysis leakage and
          | random jitter from the kernel's scheduler to reveal that fact.
         | It might be a more costly analysis to detect, but it's still
         | there.
         | 
         | Generally, I think this is a doomed line of reasoning.
        
           | sneak wrote:
           | Cache and power need to be exploited locally, generally.
           | Introducing random delays to raise the noise floor would work
           | for network services, I believe.
        
             | some_furry wrote:
             | It depends.
             | 
             | AES cache-timing was broken over a network (but required,
             | like, 2^28 samples).
             | 
             | I wouldn't bet the farm on this line of thinking providing
             | resilience. It might be just annoying enough for an
             | attacker to not really bother. (Or maybe only if the target
             | is, like, a software cryptocurrency wallet with enough
             | value to loot if they're successful.)
        
             | hovav wrote:
             | > power need[s] to be exploited locally
             | 
             | Not in the presence of DVFS, it turns out:
             | https://www.hertzbleed.com/hertzbleed.pdf
        
           | i2km wrote:
           | Actually, the link you provide seems to support the parent
           | comment's suggestion, rather than detract from it.
           | 
           | The previous comment was suggesting making sure that every
           | code path takes the same amount of time, not adding a random
           | delay (which doesn't work).
           | 
            | And while I agree that power-analysis attacks etc. are still
            | going to apply, the over-arching context here is just
            | timing analysis.
        
             | some_furry wrote:
             | The link I provided is about random delays being inferior
             | to setting a high water mark, yes.
             | 
             | I'm not just echoing the argument made by the link, though.
             | I'm adding to it.
             | 
             | I don't think the "cap runtime to a minimum value" strategy
             | will actually help, due to how much jitter your cap
             | measurements will experience from the normal operation of
             | the machine.
             | 
             | If you filter it out when measuring, you'll end up capping
             | too low, so some values will be above it. For a
             | visualization, let's pretend that you capped the runtime at
              | 80% of what it actually takes in the real world:
              |       function biased() {
              |         return Math.max(0.8, Math.random());
              |       }
              |       let samples = [];
              |       for (let i = 0; i < 1000; i++) {
              |         samples.push(biased());
              |       }
              |       // Now plot `samples`
             | 
             | Alternatively, let's say you cap it sufficiently high that
             | there's always some slack time at the end.
             | 
             | Will the kernel switch away to another process on the same
             | machine?
             | 
              | If so, will the time from "the kernel has switched to
             | another process since we're really idling" to "the kernel
             | has swapped back to our process" be measurable?
             | 
             | It's better to make sure your actual algorithm is actually
             | constant-time, even if that means fighting with compilers
             | and hardware vendors' decisions.
        
       | tromp wrote:
       | The article details how current techniques for preventing timing
        | attacks are becoming ineffective for the following reasons:
       | 
       | > 1. Compilers are applying optimization techniques that are
       | heuristically good for performance on general software, but
       | happen to leak information through timing-based sidechannels when
       | used on secret data. Since general-purpose CPUs and compilers are
       | applied to a wide variety of tasks, compilers have no reason to
       | stop doing such optimizations.
       | 
       | > 2. Constant-time coding techniques mostly aim at fooling the
       | compiler, to prevent it from applying these optimizations. Since
        | compilers keep getting smarter, these techniques lose their
       | efficiency.
       | 
       | > 3. Just-in-time (JIT) compilers have access to runtime data;
       | they can, and will, use such information to perform extra
       | optimizations that static compilers cannot do, and further
       | destroy the developer's attempt at achieving constant-time code.
       | 
       | > 4. JIT compilation is becoming pervasive and is now employed in
       | various contexts, in particular inside CPUs. Such compilers can
       | do the same optimizations as "external" JIT compilers, but are
       | not publicly documented, and the output of their "machine" code
       | translation cannot be inspected to detect deviations from
       | constant-time execution.
       | 
       | > 5. Modern hardware and software stacks use a very layered
       | structure, with each layer striving to hide its internal
       | functioning details from upper layers. The socio-economic context
       | which allowed such hardware to exist inherently relies on such
       | abstraction (under the name of "industrial secrecy"). An
       | effective constant-time coding process would require a model of
       | computation with strong guarantees on timing-related
       | characteristics, that would be maintained through all these
       | layers, down to the semiconductors that implement the logic
       | gates. This industry-wide vertical cooperation is unlikely to
       | happen.
        
         | armchairhacker wrote:
         | Why is cooperation unlikely? AFAIK it's not too hard to make a
         | compiler support a function attribute that says "do not
         | optimize this function at all"; I assume in most compilers the
          | translation and optimization phases are _mostly_ separate, and
          | inter-procedural optimizations can treat them as externally-
          | linked (untouchable). I'm less familiar with assembly, but
         | couldn't processors expose instructions to enable or disable
         | their JIT?
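          | 
          | GCC already has something in this spirit -- a per-function
          | optimization override, though it's explicitly not a security
          | guarantee and says nothing about what the CPU itself does:
          | 
          |       #include <stdint.h>
          |       
          |       /* Ask GCC to compile just this function at -O0. */
          |       __attribute__((optimize("O0")))
          |       uint32_t demo(uint32_t x, uint32_t y)
          |       {
          |           return x ^ y;
          |       }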
         | 
          | I imagine it's not _easy_ easy, because it requires every layer
         | to cooperate, but if there's demand I think it would be done.
         | 
         | IPv6 is in a similar situation where it requires widespread
         | adoption, but I also suspect most people have issues with it
         | and there isn't much demand because NATs and whatever support
         | for cellular networks has made it unnecessary.
        
           | nayuki wrote:
           | > it's not too hard to make a compiler support a function
           | attribute that says "do not optimize this function at all"
           | 
           | Note that memset_s() is an existing special case already.
           | https://en.cppreference.com/w/c/string/byte/memset
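            | 
            | A short example, assuming a libc that actually implements
            | Annex K (many don't):
            | 
            |       #define __STDC_WANT_LIB_EXT1__ 1
            |       #include <string.h>
            |       
            |       void wipe_key(unsigned char key[32])
            |       {
            |           /* Unlike plain memset, this call may not be
            |              optimized away as a "dead store". */
            |           memset_s(key, 32, 0, 32);
            |       }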
        
       | hinkley wrote:
       | > The intent is to produce the mask values without using a plain
       | comparison ("x != 0") so that the compiler does not notice that
       | we are really making a conditional move. The compiler is not
       | fooled.
       | 
       | So practice your martial arts on the bow of a rowboat. Balance,
       | Danielsan.
       | 
       | The whole timing thing is essentially an issue of getting off-
       | balance as in martial arts and the solution is that there are
       | actions you simply never do, and ones that look like they lack
       | economy of motion because so much of the work is in keeping your
       | center of gravity from shifting outside of your base. Tai chi is
       | infinitely better at protecting you from concussing yourself on
       | black ice than it is at protecting you from a mugger.
       | 
       | So if you have a conditional move for one calculation, then move
       | a to b if !c and move it to d if c. And when the compilers chase
       | that down in another generation, or someone notices that the L1
       | cache timing is different if d gets cold, then use both
       | calculations in the output. Calculate complements where the
       | runtime of a + !a = 1 to several decimal points.
       | 
       | Will it work for eternity? Unlikely. Will it work for a decade?
       | Probably.
        
       | kmeisthax wrote:
       | x86 and ARM both have options for executing certain instructions
       | with data-independent timing. The problem here is that languages
       | and compilers need to expose and honor data types that compile
       | down to only those instructions.
       | 
       | This would be something like a 'secret' keyword that would
       | qualify integer types; i.e. in C you could have a variable of
       | type 'secret unsigned int' and the compiler would reject
       | optimizations that would reveal timing information about the
       | variable's contents, while optimizing the rest of the program's
       | non-secret variables. Values can be promoted to secret but not
       | demoted. Code that intends to reveal secrets (e.g. to write them
       | out to secure storage) must explicitly cast away the secretness.
       | 
       | AFAIK Golang has crypto/subtle, but that's more special functions
       | to do certain things in a data-independent way and not pervasive
       | language support for it (and, presumably, they have to keep
       | updating the compiler to not optimize that specific module).
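        | 
        | For flavor, subtle.ConstantTimeCompare boils down to roughly
        | this (a C sketch; ct_equal is an illustrative name, and the
        | compiler caveats from the paper still apply):
        | 
        |       #include <stddef.h>
        |       #include <stdint.h>
        |       
        |       /* Accumulate differences with XOR/OR so the timing
        |          doesn't depend on where (or whether) inputs differ. */
        |       int ct_equal(const uint8_t *a, const uint8_t *b,
        |                    size_t n)
        |       {
        |           uint8_t diff = 0;
        |           for (size_t i = 0; i < n; i++)
        |               diff |= a[i] ^ b[i];
        |           /* Map 0 -> 1 and nonzero -> 0 without branching. */
        |           return 1 & (((uint32_t)diff - 1) >> 8);
        |       }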
        
       | briansmith wrote:
       | At https://rwc.iacr.org/2025/program.php you can see there is a
       | talk scheduled to be given in a couple weeks titled "Testing
       | Side-channel Security of Cryptographic Implementations against
       | Future Microarchitectures" with the following bits in the
       | abstract: "Using this framework, we conduct an empirical study of
       | 18 proposed microarchitectural optimizations on 25
       | implementations of eight cryptographic primitives in five popular
       | libraries. We find that every implementation would contain
       | secret-dependent leaks, sometimes sufficient to recover a
       | victim's secret key, if these optimizations were realized.
       | Ironically, some leaks are possible only because of coding idioms
       | used to prevent leaks under the standard constant-time model."
        
       | i2km wrote:
       | One whole technique not mentioned in the paper or comments is
       | bitslicing. For non-branching code (e.g. symmetric ciphers) it's
       | guaranteed constant-time and it would be a remarkable compiler
       | indeed which could introduce optimizations and timing variations
       | to bit-sliced code...
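        | 
        | (For the unfamiliar, a toy sketch: bit lane i of each word
        | belongs to instance i, so one pass of bitwise logic computes
        | 64 independent instances at once -- here a half-adder, with
        | illustrative names:)
        | 
        |       #include <stdint.h>
        |       
        |       /* 64 one-bit additions in parallel; XOR/AND only, so
        |          no branches and no data-dependent operations. */
        |       void bs_half_add(uint64_t a, uint64_t b,
        |                        uint64_t *sum, uint64_t *carry)
        |       {
        |           *sum   = a ^ b;
        |           *carry = a & b;
        |       }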
        
         | gavinhoward wrote:
         | The author of the paper knows about bitslicing [1], so not
         | mentioning it seems deliberate.
         | 
         | My guess is that bitslicing only gets you so far.
         | 
         | [1]: https://bearssl.org/constanttime.html#bitslicing
        
       | dang wrote:
       | (Url changed from https://eprint.iacr.org/2025/435, which points
       | to this.)
        
       ___________________________________________________________________
       (page generated 2025-03-12 23:00 UTC)