[HN Gopher] Parallelizing SHA256 Calculation on FPGA
       ___________________________________________________________________
        
       Parallelizing SHA256 Calculation on FPGA
        
       Author : hasheddan
       Score  : 47 points
       Date   : 2025-07-03 15:25 UTC (7 hours ago)
        
 (HTM) web link (www.controlpaths.com)
 (TXT) w3m dump (www.controlpaths.com)
        
       | 15155 wrote:
       | Now try a fully unrolled/pipelined design that emits one hash per
       | clock cycle for actual parallelization.
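A rough cycle-count sketch of the difference this comment points at. It assumes one pipeline stage per SHA-256 round (64 stages), which is an illustrative choice and not a measured design; the 68 cycles per hash and 12 instances are the figures quoted further down the thread, and both function names are hypothetical.

```python
def iterative_cycles(n_hashes: int, cycles_per_hash: int = 68, instances: int = 12) -> int:
    """Cycles needed when each instance finishes one hash before starting the next."""
    batches = -(-n_hashes // instances)  # ceiling division
    return batches * cycles_per_hash


def pipelined_cycles(n_hashes: int, stages: int = 64) -> int:
    """Cycles for a fully unrolled pipeline: fill it once, then one result per cycle."""
    if n_hashes == 0:
        return 0
    return stages + n_hashes - 1


if __name__ == "__main__":
    n = 1_000_000
    print("iterative x12:", iterative_cycles(n), "cycles")
    print("pipelined x1: ", pipelined_cycles(n), "cycles")
```

The model ignores clock-frequency differences between the two styles; it only counts cycles.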
        
         | m3kw9 wrote:
         | Or try hardcoding a few billion trillions of premade hashes
        
           | nayuki wrote:
           | https://en.wikipedia.org/wiki/Rainbow_table ?
        
         | picture wrote:
         | I know why you're downvoted, but it's true, the author is not
         | using FPGAs correctly.
        
       | Retr0id wrote:
       | So what's the overall hashrate with this approach?
       | 
       | I'll try to calculate it from the information given. 12 parallel
       | instances at a clock speed of 62.5MHz, with 68 clock cycles per
       | hash.
       | 
       | 62.5MHz * 12 / 68 = ~11MH/s
       | 
       | That seems... slow? Did I do the math right? How big of an FPGA
       | do you need before this would compete with a GPU, and how much
       | would it cost?
       | 
       | For reference, an RTX 4090 can do 21975.5 MH/s according to
       | hashcat benchmarks.
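The arithmetic in this comment can be checked with a few lines of Python. This is only a sketch using the numbers quoted above (62.5 MHz, 12 instances, 68 cycles per hash, and the hashcat figure for the RTX 4090); the helper name is made up for illustration.

```python
def hashrate_mhs(clock_mhz: float, instances: int, cycles_per_hash: int) -> float:
    """Aggregate rate in MH/s for independent cores that each take
    `cycles_per_hash` clock cycles to produce one hash."""
    return clock_mhz * instances / cycles_per_hash


fpga_mhs = hashrate_mhs(62.5, 12, 68)  # figures from the comment
rtx4090_mhs = 21975.5                  # hashcat benchmark figure from the comment

print(f"FPGA design: {fpga_mhs:.2f} MH/s")
print(f"RTX 4090 is roughly {rtx4090_mhs / fpga_mhs:.0f}x faster")
```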
        
         | picture wrote:
         | Quite slow. It's largely due to the author using FPGAs wrong.
         | Clocking down a 7-series Artix to 62.5 MHz means the design is
         | not pipelined correctly/enough. My friend got 1 SHA256 hash per
         | cycle at 300 MHz on 7 series, but slightly fewer of the design
         | fit on a chip. Throughput would easily be in the GH/s range.
         | 
         | Keep in mind RTX4090 is 5 nm process node and has a lot more
         | transistors and memory than XC7A100T, which is 28 nm. That's a
         | _huge_ difference in terms of dynamic performance. Also, the
         | two were released 10 years apart. If you compare RTX4090
         | against a similarly modern UltraScale part from Xilinx, I
         | believe the FPGA can be notably faster than RTX4090.
        
           | benlivengood wrote:
           | I'm assuming this space has already been heavily optimized by
           | the Bitcoin miners on their way to ASICs.
        
             | picture wrote:
             | Yes, hard silicon will be another order of magnitude more
             | performant than FPGAs and GPUs, but ASICs properly take on
             | negative value when they're no longer profitable to mine
             | with. (Note that efficiency won't be much better at the same
             | process node. You can just pump more power through each ASIC
             | die.)
             | 
             | Edit - I misread your comment. ASIC designers will use
             | FPGAs to test their design but it won't be optimized for
             | FPGAs which have a different logic-and-memory
             | characteristic than ASICs. There aren't many great SHA256
             | FPGA implementations, largely because there's not that much
             | demand for one
        
               | the8472 wrote:
               | > but ASICs properly take on negative value when they're
               | no longer profitable to mine with
               | 
               | No matmul coin where the hardware could be repurposed for
               | AI stuff?
        
               | 15155 wrote:
               | Modern BTC ASICs consist of 1600-3200 SHA256 cores and
               | only output nonces for sha256(sha256(btcBlockHeader)) -
               | there's no memory or ability to obtain other output.
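For readers unfamiliar with the sha256(sha256(btcBlockHeader)) search mentioned here, a minimal software sketch follows. It assumes the standard 80-byte Bitcoin header layout with a 4-byte little-endian nonce at the end; the all-zero 76-byte prefix and the very easy target are placeholders, and `find_nonce` is a hypothetical helper, not how any mining ASIC is actually driven.

```python
import hashlib
import struct
from typing import Optional


def double_sha256(data: bytes) -> bytes:
    """SHA-256 applied twice, as used for Bitcoin block hashes."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def find_nonce(header_prefix: bytes, target: int, max_nonce: int = 1 << 20) -> Optional[int]:
    """Try nonces in order; return the first one whose double hash,
    read as a little-endian integer, is below `target`."""
    for nonce in range(max_nonce):
        header = header_prefix + struct.pack("<I", nonce)  # append 4-byte LE nonce
        digest = double_sha256(header)
        if int.from_bytes(digest, "little") < target:
            return nonce
    return None


if __name__ == "__main__":
    prefix = bytes(76)      # placeholder for the first 76 header bytes
    easy_target = 1 << 248  # about 1 in 256 hashes qualifies
    print("nonce found:", find_nonce(prefix, easy_target))
```

As the comment describes, an ASIC runs this same loop in hardware and reports only the winning nonce.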
        
               | throwawaymaths wrote:
               | always thought it might be cool to repurpose fast double
               | sha engines for error detection in storage arrays
        
               | throwawaymaths wrote:
               | matmul isn't a trapdoor function
        
             | Retr0id wrote:
             | Unfortunately I think most of that innovation happened
             | behind closed doors, because everyone wanted to maintain
             | their competitive advantages.
        
               | sMarsIntruder wrote:
               | Yes, ASICs are definitely very closed source for that
               | specific reason.
        
             | 15155 wrote:
             | Yes, but a designed-for-FPGA SHA256 implementation looks
             | very different than an ASIC SHA256 implementation - the
             | ASIC has far greater routing flexibility and density, and
             | can therefore use far more combinatorial logic between
             | register stages.
             | 
             | (ASIC simulation on an FPGA will retain the combinatorial
             | stages but run at dramatically lower fMax)
        
         | 15155 wrote:
         | SHA256 is extremely FF-heavy, you need around 200k for an
         | optimized, unrolled, pipelined implementation.
         | 
         | UltraScale+ chips will run a proper design at 600MHz-800MHz,
         | big chips might be able to fit 24 cores. The Artix chip OP used
         | is extremely slow and too small to fit this style of
         | implementation.
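A back-of-the-envelope sketch of what these figures would imply, assuming each unrolled core really does emit one hash per clock cycle and that 24 cores fit, as the comment suggests might be possible. The 300 MHz single-core line uses the 7-series figure quoted earlier in the thread; the function name is hypothetical.

```python
def pipelined_ghs(fmax_mhz: float, cores: int) -> float:
    """Throughput in GH/s when every core emits one hash per clock cycle."""
    return fmax_mhz * cores / 1000.0


low = pipelined_ghs(600, 24)             # lower fMax quoted for UltraScale+
high = pipelined_ghs(800, 24)            # upper fMax quoted for UltraScale+
seven_series_core = pipelined_ghs(300, 1)  # one core at 300 MHz on 7 series

print(f"UltraScale+, 24 cores: {low:.1f} to {high:.1f} GH/s")
print(f"7-series, per core:    {seven_series_core:.1f} GH/s")
print(f"RTX 4090 (hashcat):    {21975.5 / 1000:.1f} GH/s")
```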
        
       | d00mB0t wrote:
       | More posts like this please! How about a crypto accelerator on
       | FPGA that's integrated with OpenSSL?
        
         | 15155 wrote:
         | Unless you're talking about niche algorithms (and even then),
         | the FPGA will get smoked by a CPU for most common tasks one
         | would use OpenSSL for.
        
           | d00mB0t wrote:
           | Yes--obviously modern CPUs have crypto extensions that would
           | be faster than an FPGA; this would be for educational
           | purposes.
        
             | 15155 wrote:
             | Even without the extensions, by the time you've moved the
             | workload to the FPGA and back, the CPU has already
             | completed whatever operation your FPGA was going to
             | complete with OpenSSL.
             | 
             | FPGA cryptographic acceleration is about batch task
             | bandwidth, OpenSSL has few places where this is required.
        
               | toast0 wrote:
               | If you want to do crypto acceleration for TLS, there's
               | two places to do it. Handshake/signature/key agreement,
               | which could maybe work, but hasn't been the bottleneck in
               | a long time, elliptic curve dramatically reduces the work
               | for the server and most clients can do it; but maybe
               | shipping the data around for that is fine.
               | 
               | The other part is bulk encryption. CPUs have lots of
               | acceleration for that, but clear text is still faster, so
               | the win is not to ship data to an accelerator and then
               | back to the cpu and then out to the NIC, but to ship to
               | the accelerator and from there to the NIC without
               | touching the CPU; often the accelerator is integrated
               | with the NIC.
               | 
               | It works even better if the data never has to touch the
               | CPU.
        
               | 15155 wrote:
               | Yes, this is why FPGAs are used as NICs in many
               | situations, but the folks doing this are of course not
               | using OpenSSL.
        
               | d00mB0t wrote:
               | You must be great to talk to at parties lol, I guess I
               | shouldn't build a RISC-V CPU because Intel is faster?
        
               | 15155 wrote:
               | You should definitely build a crypto accelerator - just
               | don't integrate it into OpenSSL (painful codebase to work
               | in, no speed benefit, etc.)
        
       | qdotme wrote:
       | Great job!
       | 
       | For alternative design/writeup, check out
       | http://nsa.unaligned.org
        
         | projektfu wrote:
         | That seems to be the inverse function for SHA-1 and MD5.
        
       ___________________________________________________________________
       (page generated 2025-07-03 23:01 UTC)