[HN Gopher] Fast and Clean: Auditable high-performance assembly ...
___________________________________________________________________
Fast and Clean: Auditable high-performance assembly via constraint
solving [pdf]
Author : luu
Score : 36 points
Date : 2023-10-08 22:08 UTC (1 days ago)
(HTM) web link (eprint.iacr.org)
| moonchild wrote:
| See also: unison - http://unison-code.github.io/
| Twisol wrote:
| Not, apparently, to be confused with the identically named
| Unison language. (I was wondering what unison-lang had to do
| with the OP until I clicked through.)
| djoldman wrote:
| Abstract: Handwritten assembly is a widely used tool in the
| development of high-performance cryptography: By providing full
| control over instruction selection, instruction scheduling, and
| register allocation, highest performance can be unlocked. On the
| flip side, developing handwritten assembly is not only
| time-consuming, but the artifacts produced also tend to be difficult
| to review and maintain - threatening their suitability for use in
| practice. In this work, we present SLOTHY (Super (Lazy)
| Optimization of Tricky Handwritten assemblY), a framework for the
| automated super-optimization of assembly with respect to
| instruction scheduling, register allocation, and loop
| optimization (software pipelining): With SLOTHY, the developer
| controls and focuses on algorithm and instruction selection,
| providing a readable "base" implementation in assembly, while
| SLOTHY automatically finds optimal and traceable instruction
| scheduling and register allocation strategies with respect to a
| model of the target (micro)architecture. We demonstrate the
| flexibility of SLOTHY by instantiating it with models of the
| Cortex-M55, Cortex-M85, Cortex-A55 and Cortex-A72
| microarchitectures, implementing the Armv8.1-M+Helium and
| AArch64+Neon architectures. We use the resulting tools to
| optimize three workloads: First, for Cortex-M55 and Cortex-M85, a
| radix-4 complex Fast Fourier Transform (FFT) in fixed-point and
| floating-point arithmetic, fundamental in Digital Signal
| Processing. Second, on Cortex-M55, Cortex-M85, Cortex-A55 and
| Cortex-A72, the instances of the Number Theoretic Transform (NTT)
| underlying CRYSTALS-Kyber and CRYSTALS-Dilithium, two recently
| announced winners of the NIST Post-Quantum Cryptography
| standardization project. Third, for Cortex-A55, the scalar
| multiplication for the elliptic curve key exchange X25519. The
| SLOTHY-optimized code matches or beats the performance of prior
| art in all cases, while maintaining compactness and readability.
___________________________________________________________________
(page generated 2023-10-09 23:01 UTC)