[HN Gopher] Tiny GPU: A minimal GPU implementation in Verilog
___________________________________________________________________
Tiny GPU: A minimal GPU implementation in Verilog
Author : fgblanch
Score : 241 points
Date : 2024-04-25 05:36 UTC (17 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| ginko wrote:
| Really cool project - I love seeing HW projects like this in
| the open. But I'd argue that this is a SIMD coprocessor. For
| something to be a GPU it should at least have some sort of
| display output.
|
| I know the terminology has gotten quite loose in recent years
| with Nvidia & Co. selling server-only variants of their graphics
| architectures as GPUs, but the "graphics" part of GPU designs
| make up a significant part of the complexity, to this day.
| jdiff wrote:
| If it processes graphics, I think it counts, even if it has no
| output. There's still use for GPUs even if they're not
| outputting anything. My place of work has around 75
| workstations with mid-tier Quadros, but they only have mini-
| DisplayPort and my employer only springs for HDMI cables, so
| they're all hooked into the onboard graphics. The cards still
| accelerate our software, they still process graphics, they just
| don't output them.
| Narishma wrote:
| > If it processes graphics, I think it counts, even if it has
| no output.
|
| That's not a good definition, since a CPU or a DSP would
| count as a GPU. Both have been used for such purpose in the
| past.
|
| > There's still use for GPUs even if they're not outputting
| anything.
|
| The issue is not their existence, it's about calling them
| GPUs when they have no graphics functionality.
| vineyardlabs wrote:
| Graphics functionality != display output. What about laptop
| GPUs, which don't necessarily output to the screen at all
| times? Sometimes they don't even have the capability to do
| so. If it's a coprocessor working alongside the general
| processor for the primary purpose of accelerating graphics
| computing workloads, it seems appropriate to call it a GPU.
|
| Edit: perhaps your point is that it doesn't make sense to
| call a device designed primarily to accelerate ML workloads
| or general-purpose vector calculations a GPU. In that case
| I'd agree that GPU isn't the right name.
| omikun wrote:
| >> Graphics functionality != display output
|
| Exactly. Graphics functionality also includes graphics-
| specific hardware like vertex and fragment processing, which
| this does not have. It has no graphics-specific hardware,
| ergo not a GPU.
| jdiff wrote:
| If it looks like a duck and it walks like a duck, why is it
| not a duck? If you are using a DSP to process graphics,
| then at least in the context of your system it has become
| your graphics processor.
|
| Plenty of GPUs don't have (or aren't used for their)
| display output. It's a GPU because of what it does:
| graphics processing. Not because of what connectivity it
| has.
| Narishma wrote:
| But it doesn't do graphics, so it shouldn't be called a
| GPU. That's the whole point of this thread.
| samus wrote:
| But it does - it just needs an application to retrieve
| the buffer and do something with it. For example pushing
| it to storage.
| jdiff wrote:
| It does do graphics. Calculating graphics is different
| from handling display output. You can separate the two.
|
| Like someone else mentioned, laptops often have discrete
| graphics cards that are not wired to display hardware at
| all, needing to shuffle framebuffers through the onboard
| graphics when something needs to make its way to a
| screen.
| Narishma wrote:
| > Like someone else mentioned, laptops often have
| discrete graphics cards that are not wired to display
| hardware at all, needing to shuffle framebuffers through
| the onboard graphics when something needs to make its way
| to a screen.
|
| Those are GPUs even if they aren't connected to a display
| because they still have graphics components like ROPs,
| TMUs and whatnot.
| jdiff wrote:
| You're free to define it that way, but that's
| substantially different from GP's "if it's not a display
| adapter, it's not a GPU" that I was pushing against. It
| does seem pretty fragile to define a GPU in terms of the
| particular architecture of the day, though. There's
| plenty of things called GPUs that don't/didn't have TMUs,
| for example.
| samus wrote:
| CPUs and DSPs are not primarily designed for graphics work,
| therefore they don't count as GPUs. CPUs are general-
| purpose; DSPs might be abused for graphics work.
|
| The "G" in GPU doesn't imply that they have to render
| directly to a screen. In fact, professional graphics cards
| are commonly used for bulk rendering for animating videos.
|
| Datacenter GPUs are mostly used for AI these days, but they
| can nevertheless do graphics work very well, and if they
| are used for generative AI or if their built-in super
| sampling capability is used, the distinction becomes rather
| blurry.
| omikun wrote:
| It's the shader core of a GPU. There are no graphics-specific
| pipeline stages, e.g. vertex processing, culling,
| rasterization, color buffer, depth buffer, etc. That's like
| saying a CPU is also a GPU if it runs graphics in software.
| Narishma wrote:
| Yet another "GPU" providing no graphics functionality. IMO
| these should be called something else.
| tossandthrow wrote:
| I think the term being established is AIA, AI Accelerator.
| n4r9 wrote:
| That would ignore applications like crypto mining, which I'm
| guessing is still a biggie.
|
| What is it exactly that sets these units apart from CPUs?
| Something to do with the parallel nature of the hardware?
| Narishma wrote:
| CPUs are also pretty parallel. They have multiple cores,
| each of which can execute multiple instructions working on
| multiple data items all in a single clock cycle.
| tossandthrow wrote:
| The fact that they are not central; they work as
| coprocessors.
|
| However, a CPU could easily embed an AIA, and certainly,
| they do.
| pjc50 wrote:
| The distinction that seems to be important is the warp-
| thread architecture: multiple compute units sharing a
| single program counter, but instead of the SIMD abstraction
| they are presented as conceptually separate threads.
|
| Also they tend to lack interrupt mechanisms and
| virtualization, at least at the programmer API level
| (usually NVIDIA systems have these but managed by the
| proprietary top level scheduler).
| fancyfredbot wrote:
| I have seen the term NPU used in reference to neural network
| accelerators a lot. I think AMD, Intel and Qualcomm all use
| this term for their AI accelerators. I think Apple call their
| AI accelerators neural engines, but I've definitely heard
| others refer to these as NPUs even though that's not their
| official name.
|
| I'll be honest, I've never heard the AIA acronym used in this
| way. It seems all acronyms for processors need to end in PU,
| for better or for worse.
| andersa wrote:
| Easy, it's now a General Processing Unit. Or perhaps a Great
| Processing Unit?
| Narishma wrote:
| But how is that different from a CPU?
| andersa wrote:
| It starts with a G.
| ginko wrote:
| It's not the central processor.
| trollied wrote:
| It's a coprocessor. They have existed for a very long
| time.
| shrubbery wrote:
| It's a brocessor.
| barkingcat wrote:
| Is that pronounced gee pee you, or Gip Pee You?
| Lichtso wrote:
| The first question is why there is a divide between CPUs and
| GPUs in the first place. Yes, the gap is closing and both
| categories are adding features of the other, but there still
| is a significant divide. IMO it has to do with Amdahl's law
| [0]. In that sense CPUs should be called Latency-Optimizing
| Processors (LOPs) and GPUs should be called Throughput-
| Optimizing Processors (TOPs).
|
| More specifically [1] we could also call CPUs long / deep data
| dependency processors (LDDPs) and GPUs wide / flat data
| dependency processors (WDDPs).
|
| [0]: https://en.wikipedia.org/wiki/Amdahl%27s_law
| [1]: https://en.wikipedia.org/wiki/Data_dependency
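|
| (For reference, not part of the comment above: Amdahl's law
| bounds the speedup of a workload whose parallelizable
| fraction is p when spread over N units,
|
|     S(N) = 1 / ((1 - p) + p / N)
|
| so S(N) approaches 1 / (1 - p) as N grows. That is roughly
| why throughput-oriented designs only pay off when p is close
| to 1, while a latency-oriented core still wins on the serial
| remainder.)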
| gpderetta wrote:
| The observation that graphics hardware and general-purpose
| CPUs converge and diverge is not new:
| http://cva.stanford.edu/classes/cs99s/papers/myer-
| sutherland... .
|
| But as you observe, we are stuck in a local optimum where
| GPUs are optimized for throughput and CPUs for latency
| sensitive work.
| checker659 wrote:
| GPGPU
| cbm-vic-20 wrote:
| MPU- Matrix Processing Unit.
| 127 wrote:
| TPU, a Tensor Processing Unit
|
| Tensors are just n-dimensional arrays
|
| Then you can run software (firmware) on top of the TPU to make
| it behave like a GPU.
| deivid wrote:
| I've been thinking about starting a project to build a
| 'display adapter', but I've gotten stuck before starting, as
| I wasn't able to figure out the communication protocol
| between UEFI's GOP driver and the display adapter. I've been
| trying to piece it together from EDK2's source, but it's
| unclear how much of it is QEMU-specific.
| how2dothis wrote:
| ...Not meaning to offend... but a graphics card has ports (to
| attach a keyboard to), RAM (very fast), CPUs (many), and may
| be used like a full computer, even without being driven by
| someone else like ...you - I suspect, no?
|
| ...I for my part want to say thanks for the findings! :-)
|
| [Setting: weekend mode]
| djmips wrote:
| Haha I love this project but it's just PU
| piotrrojek wrote:
| Really awesome project. I want to get into FPGAs, but
| honestly it's hard even to grasp where to start, and the
| whole field feels very intimidating. My eventual goal would
| be to create an acceleration card for LLMs (completely
| arbitrary), so a lot of the same bits and pieces as in this
| project, probably except for the memory offloading part to
| load bigger models.
| IshKebab wrote:
| You might want to pick a more realistic goal! An FPGA capable
| of accelerating LLMs is going to cost at least tens of
| thousands, probably hundreds of thousands.
| JoachimS wrote:
| Depends heavily on what system it is supposed to provide
| acceleration for.
|
| If it is an MCU based on a simple ARM Cortex M0, M0+, M3 or
| RISC-V RV32I core, then you could use an iCE40 or similar
| FPGA to provide a big acceleration just by using the DSPs and
| the big SPRAM.
|
| Basically, add the custom compute operations and space that
| don't exist in the MCU - operations that would otherwise take
| many instructions in SW. Also, offloading to the FPGA AI
| 'co-processor' frees up the MCU to do other things.
|
| The kernel operations in the Tiny GPU project are actually
| really good examples of things you could efficiently
| implement in an iCE40UP FPGA device, resulting in substantial
| acceleration. And using EBRs (block RAM) and/or the SPRAM for
| block queues would make a nice interface to the MCU.
|
| One could also implement a RISC-V core in the FPGA, thus
| having a single chip with a low-latency interface to the AI
| accelerator. You could even implement the AI accelerator as a
| set of custom instructions. There are so many possible
| solutions!
|
| An iCE40UP5K FPGA will set you back 9 EUR in single-unit
| quantities.
|
| This concept of course scales up to the performance and cost
| levels you're talking about, with many possible steps in
| between.
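|
| As a purely illustrative sketch (module name and widths made
| up, not taken from tiny-gpu), the kind of kernel an MCU could
| offload to an iCE40-class FPGA is a streaming multiply-
| accumulate along these lines:
|
|     // Hypothetical MAC kernel; the 20-bit accumulator holds
|     // 16 products of 8x8-bit operands before it can overflow.
|     module mac8 (
|       input  wire        clk,
|       input  wire        rst,   // synchronous clear
|       input  wire        en,    // accumulate when high
|       input  wire [7:0]  a,     // operand streamed from the MCU
|       input  wire [7:0]  b,     // weight streamed from the MCU
|       output reg  [19:0] acc    // running sum of products
|     );
|       always @(posedge clk) begin
|         if (rst)
|           acc <= 20'd0;
|         else if (en)
|           acc <= acc + a * b;   // should map onto a DSP block
|       end
|     endmodule
|
| Put an EBR/SPRAM-backed queue in front of something like this
| and you roughly have the block-queue interface to the MCU
| described above.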
| rjsw wrote:
| Or use one of the combined CPU+FPGA chips like the
| AMD/Xilinx Zynq, there are plenty of low cost dev boards
| for them.
| JoachimS wrote:
| Sure, a good example of a step between a really tiny
| system and 100k+ systems.
| imtringued wrote:
| Something that appears to be hardly known is that the
| transformer architecture needs to become more compute bound.
| Inventing a machine learning architecture which is FLOPs
| heavy instead of bandwidth heavy would be a good start.
|
| It could be as simple as using a CNN instead of a V matrix.
| Yes, this makes the architecture less efficient, but it also
| makes it easier for an accelerator to speed it up, since CNNs
| tend to be compute bound.
| checker659 wrote:
| I'm in the same boat. Here's my plan.
|
| 1. Read Harris & Harris, Digital Design and Computer
| Architecture (2022), Elsevier.
| https://doi.org/10.1016/c2019-0-00213-0
|
| 2. Follow the author's RVFpga course to build an actual RISC-V
| CPU on an FPGA - https://www.youtube.com/watch?v=ePv3xD3ZmnY
| dailykoder wrote:
| Love the Harris and Harris book!
|
| I might add these:
|
| - Computer Architecture, Fifth Edition: A Quantitative
| Approach - https://dl.acm.org/doi/book/10.5555/1999263
|
| - Computer Organization and Design RISC-V Edition: The
| Hardware Software Interface -
| https://dl.acm.org/doi/10.5555/3153875
|
| both by Patterson and Hennessy
|
| Edit: And if you want to get into CPU design and can get a
| grip on "Advanced Computer Architecture: Parallelism,
| Scalability, Programmability" by Kai Hwang, then I'd
| recommend that too. It's super old and some things are
| probably done differently in newer CPUs, but it's
| exceptionally good for learning the fundamentals. Very well
| written. But I think it's hard to find a good (physical)
| copy.
| Aromasin wrote:
| Reframe it in your mind. "Getting into FPGAs" needs to be
| broken down. There are so many subsets of skills within the
| field that you need to level-set expectations. No one expects a
| software engineer to jump into things by building a full
| computer from first principles, writing an instruction set
| architecture, understanding machine code, converting that to
| assembly, and then developing a programming language so that
| they can write a bit of Python code to build an application.
| You start from the top and work your way down the stack.
|
| If you abstract away the complexities and focus on building a
| system using some pre-built IP, FPGA design is pretty easy. I
| always point people to something like MATLAB, so they can
| create some initial applications using HDL Coder on a DevKit
| with a Reference design. Otherwise, there's the massive
| overhead of learning digital computing architecture, Verilog,
| timing, transceivers/IO, pin planning, Quartus/Vivado,
| simulation/verification, embedded systems, etc.
|
| In short, start with some system-level design. Take some plug-
| and-play IP, learn how to hook it together at the top level, and
| insert that module into a prebuilt reference design.
| Eventually, peel back the layers to reveal the complexity
| underneath.
| samvher wrote:
| I don't know where you are in your journey, but I found these
| resources very helpful to better understand digital logic and
| CPU/GPU architecture:
|
| 1. https://learn.saylor.org/course/CS301
|
| 2. https://www.coursera.org/learn/comparch
|
| 3. https://hdlbits.01xz.net/wiki/Main_Page
| imtringued wrote:
| If you want to accelerate LLMs, you will need to know the
| architecture first. Start from that. The hardware is actually
| both the easy part (design) and the hard part (manufacturing).
| jgarzik wrote:
| Nice! I warmly encourage open-core GPU work.
|
| Here's another: https://github.com/jbush001/NyuziProcessor
| joe_the_user wrote:
| What would be nice would be a bare-bones CUDA implementation
| for one of these open-core processors.
|
| What size run would be needed for TSMC or some other fab to
| produce such a processor economically?
| novaRom wrote:
| I did something similar many years ago in VHDL. There was a
| site called OpenCores for different open-source HDL projects.
| I wonder if any good HPC-level, large-scale distributed HDL
| simulator exists today? It would make sense to utilize modern
| GPUs for RTL-level simulation.
| Someone wrote:
| > There was a site called opencores for different open source
| HDL projects
|
| Was? https://opencores.org/projects?language=VHDL. Or is that
| not the same but similar?
| Jasper_ wrote:
| > Since threads are processed in parallel, tiny-gpu assumes that
| all threads "converge" to the same program counter after each
| instruction - which is a naive assumption for the sake of
| simplicity.
|
| > In real GPUs, individual threads can branch to different
| PCs, causing branch divergence where a group of threads
| initially being processed together has to split out into
| separate execution.
|
| Whoops. Maybe this person should try programming for a GPU before
| attempting to build one out of silicon.
|
| Not to mention the whole SIMD that... isn't.
|
| (This is the same person who stapled together other people's
| circuits to blink an LED and claimed to have built a CPU)
| bootsmann wrote:
| Isn't the first just equivalent to calling __syncthreads() on
| every launch?
| hyperbovine wrote:
| Which experienced CUDA programmers do anyways!
| stanleykm wrote:
| syncthreads synchronizes threads within a threadgroup and not
| across all threads.
| Jasper_ wrote:
| No, that effectively syncs all warps in a thread group. This
| implementation isn't doing any synchronization, it's
| independently doing PC/decode for different instructions, and
| just assuming they won't diverge. That's... a baffling
| combination of decisions; why do independent PC/decode if
| they're never going to diverge? It reads as a very basic
| failure to understand the fundamental value of a GPU.
| And this isn't a secret GPU architecture thing. Here's a
| slide deck from 2009 going over the actual high-level
| architecture of a GPU. Notice how fetch/decode are shared
| between threads.
|
| https://engineering.purdue.edu/~smidkiff/ece563/slides/GPU.p.
| ..
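|
| To make the shared-frontend point concrete, a rough sketch
| (module and signal names invented, not tiny-gpu's code): the
| PC is kept per warp rather than per lane, and divergence is
| handled with a per-lane execution mask instead of per-lane
| PCs. The reconvergence stack a real GPU keeps is omitted.
|
|     module simt_frontend #(parameter LANES = 4) (
|       input  wire             clk,
|       input  wire             rst,
|       input  wire [LANES-1:0] branch_taken, // per-lane outcome
|       input  wire [7:0]       branch_target,
|       output reg  [7:0]       pc,           // one PC per warp
|       output reg  [LANES-1:0] exec_mask     // active lanes
|     );
|       always @(posedge clk) begin
|         if (rst) begin
|           pc        <= 8'd0;
|           exec_mask <= {LANES{1'b1}};
|         end else if (&branch_taken) begin
|           pc <= branch_target;    // all lanes agree: just jump
|         end else if (|branch_taken) begin
|           // Divergence: keep running the not-taken lanes and
|           // mask off the taken ones (reconvergence omitted).
|           exec_mask <= exec_mask & ~branch_taken;
|           pc        <= pc + 8'd1;
|         end else begin
|           pc <= pc + 8'd1;        // lock-step: one fetch/warp
|         end
|       end
|     endmodule
|
| One fetch and one decode per cycle then serve all LANES
| execution units, which is where the area and energy win of a
| GPU comes from.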
| mk_stjames wrote:
| Uh, the ALU implements a DIV instruction straight up at the
| hardware level? Is this normal to have as a real instruction
| in something like a modern CUDA core, or is DIV usually a
| software emulation instead? Because actual hardware divide
| circuits take up a ton of space and I wouldn't have expected
| them in a GPU ALU.
|
| It's so easy to write "DIV: begin alu_out_reg <= rs / rt;
| end" in your Verilog, but that one line takes a lot of
| silicon. And the person simulating this might never see that
| if all they do is simulate the Verilog.
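|
| For comparison, the usual synthesizable alternative to a
| combinational "/" is an iterative shift-subtract divider that
| produces one quotient bit per cycle. A minimal sketch (names
| and widths invented, divide-by-zero left unguarded):
|
|     module div8_serial (
|       input  wire       clk,
|       input  wire       start,     // pulse to begin a divide
|       input  wire [7:0] dividend,
|       input  wire [7:0] divisor,
|       output reg  [7:0] quotient,
|       output reg  [7:0] remainder,
|       output reg        busy
|     );
|       reg [3:0] count;
|
|       always @(posedge clk) begin
|         if (start) begin
|           quotient  <= dividend;   // quotient reg doubles as
|           remainder <= 8'd0;       // the dividend shifter
|           count     <= 4'd8;
|           busy      <= 1'b1;
|         end else if (busy) begin
|           // Shift the next dividend bit into the partial
|           // remainder; subtract the divisor if it fits.
|           if ({remainder[6:0], quotient[7]} >= divisor) begin
|             remainder <= {remainder[6:0], quotient[7]} - divisor;
|             quotient  <= {quotient[6:0], 1'b1};
|           end else begin
|             remainder <= {remainder[6:0], quotient[7]};
|             quotient  <= {quotient[6:0], 1'b0};
|           end
|           count <= count - 4'd1;
|           if (count == 4'd1)
|             busy <= 1'b0;
|         end
|       end
|     endmodule
|
| Eight cycles for an 8-bit divide instead of a huge
| combinational cloud, which is roughly why real hardware tends
| to iterate or emulate division rather than spend the area.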
| daghamm wrote:
| This is just someone learning Verilog.
|
| The project stops at simulation; making real hardware out of
| this requires much more work.
| userbinator wrote:
| _Because the GPU market is so competitive, low-level technical
| details for all modern architectures remain proprietary._
|
| Except for Intel, which publishes lots of technical documentation
| on their GPUs: https://kiwitree.net/~lina/intel-gfx-docs/prm/
|
| You can also find the i810/815 manuals elsewhere online, but
| except for an odd gap between that and the 965 (i.e. missing the
| 855/910/915/945) for some reason, they've been pretty consistent
| with the documentation.
| matheusmoreira wrote:
| The Linux drivers are also high quality and mainlined. Wish
| every company followed their lead.
| kimixa wrote:
| AMD also publish a fair bit of documentation -
| https://www.amd.com/en/developer/browse-by-resource-type/doc...
|
| It includes full ISA documentation for their current and past
| offerings, though it looks like they tend to be aimed at
| implementors rather than at giving a "high level" description
| for interested enthusiasts.
| vineyardlabs wrote:
| Is there a reason they're mixing non-blocking and blocking
| assignment operators in sequential always blocks here?
| urmish wrote:
| looks like those are local variables
| CamperBob2 wrote:
| You can feel free to do that, if you're not too hung up on
| simulation-synthesis matching.
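|
| To make the distinction concrete, a tiny sketch (module and
| register names made up, not from tiny-gpu):
|
|     module blocking_vs_nonblocking (
|       input wire clk
|     );
|       reg a, b, x, y;
|       initial begin
|         a = 1'b0;  b = 1'b1;   // example starting values
|         x = 1'b0;  y = 1'b1;
|       end
|
|       // Non-blocking: both right-hand sides are sampled
|       // before any update, so a and b genuinely swap.
|       always @(posedge clk) begin
|         a <= b;
|         b <= a;
|       end
|
|       // Blocking: the second statement sees the already
|       // updated x, so after one clock x == y == old y.
|       always @(posedge clk) begin
|         x = y;
|         y = x;
|       end
|     endmodule
|
| The usual guideline is non-blocking for clocked state and
| blocking only for local intermediates inside a clocked block;
| mixing them is legal Verilog, but it makes it easy to write
| code whose simulated ordering differs from what you intended.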
___________________________________________________________________
(page generated 2024-04-25 23:01 UTC)