[HN Gopher] eGPU: A 750 MHz Class Soft GPGPU for FPGA
___________________________________________________________________
eGPU: A 750 MHz Class Soft GPGPU for FPGA
Author : matt_d
Score : 39 points
Date : 2023-08-01 20:11 UTC (2 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| stefanpie wrote:
| A group at Georgia Tech in our building has also been working on
| open-source GPU designs that can target FPGAs and interoperate
| with RISC-V. They have several publications on that work. Thought
| I might share since it's not referenced in the submitted paper.
|
| https://vortex.cc.gatech.edu/
| mepian wrote:
| They still haven't published the source code for their Skybox
| project, I wonder why. Unless I missed it in their repository?
| https://github.com/vortexgpgpu
| gsmecher wrote:
| Also discussed here:
| https://old.reddit.com/r/FPGA/comments/15fnb6u/egpu_a_750_mh...
| dragontamer wrote:
| For a GPU circuit, it basically comes down to the number of
| hardware multipliers on the FPGA, does it not?
|
| I remember synthesizing a 16-bit Wallace tree in a lab exercise
| back in college. I think that single multiplier used up 70% of my
| LUTs.
|
| You'll only get massive amounts of parallel multipliers if the
| underlying silicon has a ton of hard multipliers (like Xilinx's
| VLIW SIMD AI chips).
|
| -------
|
| At all computer sizes, a GPU will probably have more multiply
| circuits than an equivalent-cost FPGA, with the possible exception
| of those AI chips from Xilinx (where the individual cores are
| basically pre-synthesized with a hardcoded ISA).
|
| Ex: at under 500 mW of power you'd probably prefer an ARM NEON
| SIMD core or a TI DSP/VLIW part. At cell-phone power levels you'd
| prefer a cell-phone GPU, and at desktop/server levels you'd prefer
| a desktop GPU.
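|
| As a rough back-of-envelope (my own assumed numbers, not from the
| paper) of why a single soft 16x16 multiplier chews through LUTs
| when it can't use hard DSP blocks:
|
|   # Very rough sketch; the LUT-per-compressor figure is an assumption.
|   def wallace_stages(rows):
|       # 3:2 carry-save levels needed to reduce `rows` partial-product
|       # rows down to 2 (a final carry-propagate adder sums those two).
|       stages = 0
|       while rows > 2:
|           rows -= rows // 3   # each group of 3 rows becomes 2
|           stages += 1
|       return stages
|
|   n = 16
|   pp_bits = n * n               # 256 partial-product bits (AND gates)
|   compressors = pp_bits - 4 * n # ~192 3:2 compressors to reach 2 rows
|   stages = wallace_stages(n)    # 6 carry-save levels for 16 rows
|   # Assume ~2 LUTs per compressor, plus the AND array and a 2n-bit adder:
|   approx_luts = pp_bits + 2 * compressors + 2 * n
|   print(stages, approx_luts)    # 6 levels, on the order of ~670 LUTs
|
| Several hundred LUTs for one multiplier is most of a small teaching
| FPGA, which lines up with that 70% memory from the lab exercise.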
| danhor wrote:
| > At all computer sizes, a GPU will probably have more multiply
| > circuits than an equivalent-cost FPGA
|
| Very likely yes, but FPGAs often have hundreds to thousands of
| hardware multipliers as part of their DSP blocks. See for example
| newer AMD FPGAs:
| https://eu.mouser.com/datasheet/2/903/ds890_ultrascale_overv...
| mathisfun123 wrote:
| I wish people would stop quoting marketing material as some kind
| of representation of what they know.
|
| You're giving completely the wrong impression about DSP slices -
| it is absolutely not 1 DSP slice per FP operator at any precision
| you would actually want for floating-point arithmetic. It's at
| least 2, plus a whole bunch of LUTs (~500), for FP16 with 4
| pipeline stages or something like that. And if you want it faster
| (fewer stages) then you need more slices. On the Alveo U280, which
| is an UltraScale+ part, I have never been able to effectively
| utilize more than ~4000 DSP slices (out of 9024) for 5,4 mults,
| and that cost basically 99% of the CLBs in SLR1 and SLR2.
|
| And even then, disconnected FPUs are completely meaningless
| without a datapath implementing e.g. matmul, and boy oh boy do you
| have no clue what you're in for there.
|
| Takeaway: it's pointless to compare raw specsheet numbers
| when _everything_ comes down to datapath.
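|
| To put those per-operator figures in context, a quick sketch using
| the numbers above plus approximate U280 totals (ballpark only):
|
|   # Ballpark only: per-operator costs are the rough figures from this
|   # comment; device totals are approximate Alveo U280 numbers.
|   dsp_total, lut_total = 9024, 1_300_000     # ~U280 fabric resources
|   dsp_per_mult, lut_per_mult = 2, 500        # assumed pipelined FP16 multiplier
|
|   limit_by_dsp = dsp_total // dsp_per_mult   # 4512 if DSPs were the only limit
|   limit_by_lut = lut_total // lut_per_mult   # 2600 if LUTs were the only limit
|   print(min(limit_by_dsp, limit_by_lut))     # LUTs (and routing) give out first
|
| Which is roughly what I saw in practice: the CLBs and the routing
| run out long before the DSP column on the spec sheet does.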
| pkaye wrote:
| How much would that FPGA cost?
| UncleOxidant wrote:
| The FPGAs with enough multipliers to be competitive against an
| actual GPU are going to be quite a bit more expensive than a GPU,
| aren't they?
| Lramseyer wrote:
| Full Disclosure, I work for an FPGA company.
|
| The mind-blowing part of all of this is the fact that they were
| able to close timing at 771 MHz. That is insanely fast for an
| FPGA. For perspective, most modern FPGA designs run at around
| 300 MHz.* While most of the heavy lifting in this design uses
| hardened components like DSPs and FPUs, it's still very impressive
| to see!
|
| What I didn't see discussed much was how data is moved into and
| out of the processor. I'm curious what the memory bandwidth
| numbers look like, as well as the resource utilization of the
| higher-level routing.
|
| *For most hardware designs that aren't things like CPUs and GPUs,
| you don't always need a super-high clock speed. You have a lot
| more flexibility to compute in space rather than in time (think
| more threads, each running slower). The pros and cons of that
| tradeoff are a complicated topic, but worth noting.
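|
| A toy illustration of the space-vs-time point (made-up numbers,
| just to show the arithmetic):
|
|   # Toy numbers only: throughput ~ lanes x clock, so a wide, slower
|   # design can match a narrow, faster one if the problem parallelizes.
|   def macs_per_sec(lanes, fmax_hz):
|       return lanes * fmax_hz          # one MAC per lane per cycle (idealized)
|
|   print(macs_per_sec(lanes=8,  fmax_hz=750e6))  # 6e9 MAC/s: few lanes, fast clock
|   print(macs_per_sec(lanes=20, fmax_hz=300e6))  # 6e9 MAC/s: more lanes, typical clock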
| mathisfun123 wrote:
| > The mind-blowing part of all of this is the fact that they
| were able to close timing at 771 MHz
|
| It's true, but I mean, this is Intel in-house research, right? If
| they can't get absolute peak fmax on their own parts, that would
| be a really bad look, right? Plus these Stratix parts have hard FP
| blocks (not just DSPs), so they're basically mostly scheduling
| stuff rather than building the whole datapath. But admittedly I
| haven't read the paper...
|
| >Full Disclosure, I work for an FPGA company
|
| I currently do too (as an intern, maybe even at the same one as
| you), and I haven't looked very hard, but I'm sure we have
| projects achieving similar fmax (maybe even GPUs, since we're
| fighting hard to compete with Nvidia...).
| unwind wrote:
| Uh, non-native question: what is the word "class" doing in the
| title?
|
| Is a hyphen missing, so it should be "750 MHz-class"? I searched
| the linked page but the word only appears in the title, sans
| hyphen.
| avmich wrote:
| Wonder if this could help to alleviate the momentary shortage of
| GPUs on the market.
| ZiiS wrote:
| 10-year-old entry-level GPUs have a hundred 750 MHz cores.
| monocasa wrote:
| 'Cores' are really overstated in GPUs. CUDA cores are really SIMD
| lanes, and if you counted them the same way a CPU does, you'd get
| somewhere in the dozens-of-cores range even for modern GPUs.
| codedokode wrote:
| A proper method is counting ALUs instead of vague "cores".
| xigency wrote:
| That seems backwards to me. Sure, a GPU core is less
| general, but in terms of concurrent execution, memory
| bandwidth, and FLOPS I would expect hundreds to thousands
| of cores for all new GPU offerings. Apple's double-digit
| GPU core counts for instance sound extremely understated.
| monocasa wrote:
| It's not. The best comparison is the SM count for Nvidia
| hardware, or the compute unit count for AMD hardware. So a
| 4070 has 46 cores as you'd count them on a CPU.
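|
| For reference, the arithmetic behind that (public RTX 4070 specs):
|
|   # RTX 4070: 46 SMs, each with 128 FP32 lanes ("CUDA cores").
|   sms = 46
|   fp32_lanes_per_sm = 128
|   cuda_cores = sms * fp32_lanes_per_sm
|   print(cuda_cores)   # 5888 marketed "CUDA cores", vs 46 CPU-style cores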
| latchkey wrote:
| I don't think this will be momentary. The reality is that there
| have been GPU shortages for a long time now, and demand isn't
| going down. People are signing 3-year contracts with Lambda now.
| [deleted]
| monocasa wrote:
| If it's on an FPGA, then it doesn't really compete with GPUs you
| can buy, from just about any perspective other than openness.
___________________________________________________________________
(page generated 2023-08-01 23:00 UTC)