[HN Gopher] Parallelizing SHA256 Calculation on FPGA
___________________________________________________________________
Parallelizing SHA256 Calculation on FPGA
Author : hasheddan
Score : 47 points
Date : 2025-07-03 15:25 UTC (7 hours ago)
(HTM) web link (www.controlpaths.com)
(TXT) w3m dump (www.controlpaths.com)
| 15155 wrote:
| Now try a fully unrolled/pipelined design that emits one hash per
| clock cycle for actual parallelization.
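|
| To make "one hash per clock cycle" concrete, here is a rough
| software model of a fully unrolled, 64-stage SHA-256 pipeline. It is
| a sketch, not anyone's HDL from this thread: the names stage,
| run_pipeline and pad_single_block are hypothetical, it only handles
| single-block messages (55 bytes or fewer), and each loop iteration
| stands in for one clock edge. Constants are derived from the primes
| as the SHA-256 standard defines them.

```python
import hashlib
import math

MASK = 0xFFFFFFFF


def _primes(n):
    """First n primes by trial division (n is tiny here)."""
    found, c = [], 2
    while len(found) < n:
        if all(c % p for p in found):
            found.append(c)
        c += 1
    return found


def _icbrt(n):
    """Integer cube root (floor) via Newton's method."""
    x = 1 << ((n.bit_length() + 2) // 3)
    while True:
        y = (2 * x + n // (x * x)) // 3
        if y >= x:
            return x
        x = y


PRIMES = _primes(64)
# Round constants: first 32 fractional bits of the cube roots of the primes.
K = [_icbrt(p << 96) & MASK for p in PRIMES]
# Initial hash value: first 32 fractional bits of the square roots.
H0 = [math.isqrt(p << 64) & MASK for p in PRIMES[:8]]


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


def stage(t, reg):
    """Logic of pipeline stage t: one round plus one message-schedule step."""
    (a, b, c, d, e, f, g, h), w = reg
    t1 = (h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25))
          + ((e & f) ^ (~e & g)) + K[t] + w[0]) & MASK
    t2 = ((rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22))
          + ((a & b) ^ (a & c) ^ (b & c))) & MASK
    s0 = rotr(w[1], 7) ^ rotr(w[1], 18) ^ (w[1] >> 3)
    s1 = rotr(w[14], 17) ^ rotr(w[14], 19) ^ (w[14] >> 10)
    next_w = (w[0] + s0 + w[9] + s1) & MASK
    state = ((t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g)
    # Slide the 16-word schedule window forward by one word.
    return state, w[1:] + (next_w,)


def pad_single_block(msg):
    """Pad a short message into one 512-bit block of sixteen 32-bit words."""
    assert len(msg) <= 55, "this sketch only handles single-block messages"
    block = (msg + b"\x80" + b"\x00" * (55 - len(msg))
             + (8 * len(msg)).to_bytes(8, "big"))
    return tuple(int.from_bytes(block[i:i + 4], "big") for i in range(0, 64, 4))


def run_pipeline(messages):
    """Feed one message per simulated clock; return (digests, cycles)."""
    regs = [None] * 64          # output register of each stage
    digests, cycles = [], 0
    feed = iter(messages)
    while len(digests) < len(messages):
        done = regs[63]         # value leaving the last stage this cycle
        if done is not None:
            words = [(x + y) & MASK for x, y in zip(H0, done[0])]
            digests.append(b"".join(v.to_bytes(4, "big") for v in words))
        # Clock edge: every stage latches the result of the stage before it.
        for t in range(63, 0, -1):
            regs[t] = stage(t, regs[t - 1]) if regs[t - 1] is not None else None
        msg = next(feed, None)
        regs[0] = (stage(0, (tuple(H0), pad_single_block(msg)))
                   if msg is not None else None)
        cycles += 1
    return digests, cycles


messages = [b"", b"abc", b"hello world"]
digests, cycles = run_pipeline(messages)
print(cycles, all(d == hashlib.sha256(m).digest()
                  for d, m in zip(digests, messages)))
```

| In this model the pipeline takes 64 cycles to fill and then retires
| one digest per cycle. Each stage register holds 8 state words plus a
| 16-word schedule window, which is 768 bits per stage in the model;
| that is a count for this sketch, not an FPGA resource estimate.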
| m3kw9 wrote:
| Or try hardcoding a few billion trillions of premade hashes
| nayuki wrote:
| https://en.wikipedia.org/wiki/Rainbow_table ?
| picture wrote:
| I know why you're downvoted, but it's true, the author is not
| using FPGAs correctly.
| Retr0id wrote:
| So what's the overall hashrate with this approach?
|
| I'll try to calculate it from the information given. 12 parallel
| instances at a clock speed of 62.5MHz, with 68 clock cycles per
| hash.
|
| 62.5MHz * 12 / 68 = ~11MH/s
|
| That seems... slow? Did I do the math right? How big of an FPGA
| do you need before this would compete with a GPU, and how much
| would it cost?
|
| For reference, an RTX 4090 can do 21975.5 MH/s according to
| hashcat benchmarks.
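|
| The arithmetic above can be checked with a few lines of Python. This
| is only a sketch using the figures quoted in this comment (12
| instances, 62.5 MHz, 68 cycles per hash, and the hashcat number for
| the RTX 4090); the function name hashrate_mhs is made up for
| illustration.

```python
def hashrate_mhs(clock_mhz, instances, cycles_per_hash):
    """Aggregate hash rate in MH/s for identical iterative cores."""
    return clock_mhz * instances / cycles_per_hash


# Figures as quoted in the comment above.
fpga_mhs = hashrate_mhs(clock_mhz=62.5, instances=12, cycles_per_hash=68)
rtx_4090_mhs = 21975.5

print(f"FPGA design: {fpga_mhs:.2f} MH/s")
print(f"RTX 4090 is roughly {rtx_4090_mhs / fpga_mhs:.0f}x faster")
```

| So the math holds: about 11 MH/s, or a gap of roughly three orders
| of magnitude to the GPU figure.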
| picture wrote:
| Quite slow. It's largely due to the author using FPGAs wrong.
| Clocking down a 7-series Artix to 62.5 MHz means the design is
| not pipelined correctly/enough. My friend got 1 SHA256 hash per
| cycle at 300 MHz on 7 series, though slightly fewer instances of the
| design fit on a chip. Throughput would easily be in the GH/s range.
|
| Keep in mind RTX4090 is 5 nm process node and has a lot more
| transistors and memory than XC7A100T, which is 28 nm. That's a
| _huge_ difference in terms of dynamic performance. The two were
| also released 10 years apart. If you compare RTX4090
| against a similarly modern UltraScale part from Xilinx, I
| believe the FPGA can be notably faster than RTX4090.
| benlivengood wrote:
| I'm assuming this space has already been heavily optimized by
| the Bitcoin miners on their way to ASICs.
| picture wrote:
| Yes, hard silicon will be another order of magnitude more
| performant than FPGAs and GPUs, but ASICs properly take on negative
| value when they're no longer profitable to mine with. (Note
| that efficiency won't be much better at the same process
| node. You can just pump more power through each ASIC die)
|
| Edit - I misread your comment. ASIC designers will use
| FPGAs to test their design but it won't be optimized for
| FPGAs which have a different logic-and-memory
| characteristic than ASICs. There aren't many great SHA256
| FPGA implementations, largely because there's not that much
| demand for one
| the8472 wrote:
| > but ASICs properly take on negative value when they're
| no longer profitable to mine with
|
| No matmul coin where the hardware could be repurposed for
| AI stuff?
| 15155 wrote:
| Modern BTC ASICs consist of 1600-3200 SHA256 cores and
| only output nonces for sha256(sha256(btcBlockHeader)) -
| there's no memory or ability to obtain other output.
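|
| For readers who have not seen it, the sha256(sha256(header)) search
| looks roughly like this in software. This is a toy sketch, not how a
| mining ASIC is built: the 76-byte header prefix is all zeros rather
| than a real block, the target is made very easy (12 leading zero
| bits), and double_sha256 and find_nonce are hypothetical helper
| names. It assumes the usual 80-byte header with a 4-byte
| little-endian nonce at the end.

```python
import hashlib


def double_sha256(data: bytes) -> bytes:
    """SHA-256 applied twice, as used for Bitcoin block headers."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def find_nonce(header_prefix: bytes, target: int, max_nonce: int = 1 << 32):
    """Return the first nonce whose double hash is below target, else None."""
    assert len(header_prefix) == 76, "80-byte header minus the 4-byte nonce"
    for nonce in range(max_nonce):
        header = header_prefix + nonce.to_bytes(4, "little")
        # The digest is compared to the target as a little-endian integer.
        if int.from_bytes(double_sha256(header), "little") < target:
            return nonce
    return None


TOY_PREFIX = bytes(76)   # placeholder, not a real block header
TOY_TARGET = 1 << 244    # toy difficulty: top 12 bits must be zero

nonce = find_nonce(TOY_PREFIX, TOY_TARGET, max_nonce=1 << 20)
print("found nonce:", nonce)
```

| The only result that leaves the loop is the nonce, which matches the
| point above: the hardware has nothing else to give back.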
| throwawaymaths wrote:
| always thought it might be cool to repurpose fast double
| sha engines for error detection in storage arrays
| throwawaymaths wrote:
| matmul isn't a trapdoor function
| Retr0id wrote:
| Unfortunately I think most of that innovation happened
| behind closed doors, because everyone wanted to maintain
| their competitive advantages.
| sMarsIntruder wrote:
| Yes, ASICs are definitely very closed source for that
| specific reason.
| 15155 wrote:
| Yes, but a designed-for-FPGA SHA256 implementation looks
| very different than an ASIC SHA256 implementation - the
| ASIC has far greater routing flexibility and density, and
| can therefore use far more combinatorial logic between
| register stages.
|
| (ASIC simulation on an FPGA will retain the combinatorial
| stages but run at dramatically lower fMax)
| 15155 wrote:
| SHA256 is extremely FF-heavy, you need around 200k for an
| optimized, unrolled, pipelined implementation.
|
| UltraScale+ chips will run a proper design at 600MHz-800MHz,
| big chips might be able to fit 24 cores. The Artix chip OP used
| is extremely slow and too small to fit this style of
| implementation.
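|
| As a back-of-the-envelope check, the clock rates and core counts
| quoted in this thread can be turned into hash rates. This is a
| sketch under stated assumptions: every pipelined core retires one
| hash per cycle, the numbers are commenters' estimates rather than
| measurements, and combining figures from different comments is this
| example's choice. pipelined_hashrate_mhs is a hypothetical helper.

```python
import math


def pipelined_hashrate_mhs(clock_mhz, cores):
    """Hash rate in MH/s when each core emits one hash per clock cycle."""
    return clock_mhz * cores


# 7-series figure quoted earlier in the thread: one core at 300 MHz.
artix_core_mhs = pipelined_hashrate_mhs(300, 1)
cores_for_1_ghs = math.ceil(1000 / artix_core_mhs)

# UltraScale+ figures from the comment above: 600-800 MHz, up to 24 cores.
us_low_mhs = pipelined_hashrate_mhs(600, 24)
us_high_mhs = pipelined_hashrate_mhs(800, 24)

RTX_4090_MHS = 21975.5  # hashcat benchmark figure quoted in the thread

print(f"7-series core: {artix_core_mhs} MH/s, "
      f"{cores_for_1_ghs} cores needed for 1 GH/s")
print(f"UltraScale+ estimate: {us_low_mhs / 1000:.1f} "
      f"to {us_high_mhs / 1000:.1f} GH/s")
print(f"RTX 4090: {RTX_4090_MHS / 1000:.1f} GH/s")
```

| Under these assumptions a single large UltraScale+ part lands in
| the same range as the GPU figure, at 14.4 to 19.2 GH/s.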
| d00mB0t wrote:
| More posts like this please! How about a crypto accelerator on
| FPGA that's integrated with OpenSSL?
| 15155 wrote:
| Unless you're talking about niche algorithms (and even then),
| the FPGA will get smoked by a CPU for most common tasks one
| would use OpenSSL for.
| d00mB0t wrote:
| Yes--obviously modern CPUs have crypto extensions that would
| be faster than an FPGA; this would be for educational
| purposes.
| 15155 wrote:
| Even without the extensions, by the time you've moved the
| workload to the FPGA and back, the CPU has already
| completed whatever operation your FPGA was going to
| complete with OpenSSL.
|
| FPGA cryptographic acceleration is about batch task
| bandwidth; OpenSSL has few places where this is required.
| toast0 wrote:
| If you want to do crypto acceleration for TLS, there are
| two places to do it. The first is handshake/signature/key
| agreement, which could maybe work, but hasn't been the
| bottleneck in a long time: elliptic curve dramatically
| reduces the work for the server, and most clients can do
| it. Maybe shipping the data around for that is fine.
|
| The other part is bulk encryption. CPUs have lots of
| acceleration for that, but clear text is still faster, so
| the win is not to ship data to an accelerator and then
| back to the CPU and then out to the NIC, but to ship to
| the accelerator and from there to the NIC without
| touching the CPU; often the accelerator is integrated
| with the NIC.
|
| It works even better if the data never has to touch the
| CPU.
| 15155 wrote:
| Yes, this is why FPGAs are used as NICs in many
| situations, but the folks doing this are of course not
| using OpenSSL.
| d00mB0t wrote:
| You must be great to talk to at parties lol, I guess I
| shouldn't build a RISC-V CPU because Intel is faster?
| 15155 wrote:
| You should definitely build a crypto accelerator - just
| don't integrate it into OpenSSL (painful codebase to work
| in, no speed benefit, etc.)
| qdotme wrote:
| Great job!
|
| For alternative design/writeup, check out
| http://nsa.unaligned.org
| projektfu wrote:
| That seems to be the inverse function for SHA-1 and MD5.
___________________________________________________________________
(page generated 2025-07-03 23:01 UTC)