[HN Gopher] Fast and Clean: Auditable high-performance assembly ...
       ___________________________________________________________________
        
       Fast and Clean: Auditable high-performance assembly via constraint
       solving [pdf]
        
       Author : luu
       Score  : 36 points
       Date   : 2023-10-08 22:08 UTC (1 day ago)
        
 (HTM) web link (eprint.iacr.org)
 (TXT) w3m dump (eprint.iacr.org)
        
       | moonchild wrote:
       | See also: unison - http://unison-code.github.io/
        
         | Twisol wrote:
         | Not, apparently, to be confused with the identically named
         | Unison language. (I was wondering what unison-lang had to do
         | with the OP until I clicked through.)
        
       | djoldman wrote:
       | Abstract: Handwritten assembly is a widely used tool in the
       | development of high-performance cryptography: By providing full
       | control over instruction selection, instruction scheduling, and
       | register allocation, highest performance can be unlocked. On the
       | flip side, developing handwritten assembly is not only
       | time-consuming, but the artifacts produced also tend to be difficult
       | to review and maintain - threatening their suitability for use in
       | practice. In this work, we present SLOTHY (Super (Lazy)
       | Optimization of Tricky Handwritten assemblY), a framework for the
       | automated super-optimization of assembly with respect to
       | instruction scheduling, register allocation, and loop
       | optimization (software pipelining): With SLOTHY, the developer
       | controls and focuses on algorithm and instruction selection,
       | providing a readable "base" implementation in assembly, while
       | SLOTHY automatically finds optimal and traceable instruction
       | scheduling and register allocation strategies with respect to a
       | model of the target (micro)architecture. We demonstrate the
       | flexibility of SLOTHY by instantiating it with models of the
       | Cortex-M55, Cortex-M85, Cortex-A55 and Cortex-A72
       | microarchitectures, implementing the Armv8.1-M+Helium and
       | AArch64+Neon architectures. We use the resulting tools to
       | optimize three workloads: First, for Cortex-M55 and Cortex-M85, a
       | radix-4 complex Fast Fourier Transform (FFT) in fixed-point and
       | floating-point arithmetic, fundamental in Digital Signal
       | Processing. Second, on Cortex-M55, Cortex-M85, Cortex-A55 and
       | Cortex-A72, the instances of the Number Theoretic Transform (NTT)
       | underlying CRYSTALS-Kyber and CRYSTALS-Dilithium, two recently
       | announced winners of the NIST Post-Quantum Cryptography
       | standardization project. Third, for Cortex-A55, the scalar
       | multiplication for the elliptic curve key exchange X25519. The
       | SLOTHY-optimized code matches or beats the performance of prior
       | art in all cases, while maintaining compactness and readability.
        