[HN Gopher] Tiny GPU: A minimal GPU implementation in Verilog
       ___________________________________________________________________
        
       Tiny GPU: A minimal GPU implementation in Verilog
        
       Author : fgblanch
       Score  : 241 points
       Date   : 2024-04-25 05:36 UTC (17 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | ginko wrote:
        | Really cool project. I love seeing HW projects like this in the
       | open. But I'd argue that this is a SIMD coprocessor. For
       | something to be a GPU it should at least have some sort of
       | display output.
       | 
       | I know the terminology has gotten quite loose in recent years
       | with Nvidia & Co. selling server-only variants of their graphics
       | architectures as GPUs, but the "graphics" part of GPU designs
        | makes up a significant part of the complexity, to this day.
        
         | jdiff wrote:
         | If it processes graphics, I think it counts, even if it has no
         | output. There's still use for GPUs even if they're not
         | outputting anything. My place of work has around 75
         | workstations with mid-tier Quadros, but they only have mini-
         | DisplayPort and my employer only springs for HDMI cables, so
         | they're all hooked into the onboard graphics. The cards still
         | accelerate our software, they still process graphics, they just
         | don't output them.
        
           | Narishma wrote:
           | > If it processes graphics, I think it counts, even if it has
           | no output.
           | 
           | That's not a good definition, since a CPU or a DSP would
            | count as a GPU. Both have been used for that purpose in the
           | past.
           | 
           | > There's still use for GPUs even if they're not outputting
           | anything.
           | 
            | The issue is not their existence, it's about calling them GPUs
           | when they have no graphics functionality.
        
             | vineyardlabs wrote:
              | Graphics functionality != display output. What about
              | laptop GPUs, which don't necessarily output to the screen
              | at all times? Sometimes they don't even have the
              | capability to do so. If it's a coprocessor working
              | alongside the general processor for the primary purpose of
              | accelerating graphics computing workloads, it seems
              | appropriate to call it a GPU.
              | 
              | Edit: perhaps your point is that it doesn't make sense to
              | call a device designed primarily to accelerate ML
              | workloads or just general-purpose vector calculations a
              | GPU. In that case I'd agree that GPU isn't the right name.
        
               | omikun wrote:
                | >> Graphics functionality != display output
                | 
                | Exactly. Graphics functionality also includes graphics-
                | specific hardware like vertex and fragment processing,
                | which this does not have. It has no graphics-specific
                | hardware, ergo not a GPU.
        
             | jdiff wrote:
             | If it looks like a duck and it walks like a duck, why is it
             | not a duck? If you are using a DSP to process graphics,
             | then at least in the context of your system it has become
             | your graphics processor.
             | 
             | Plenty of GPUs don't have (or aren't used for their)
             | display output. It's a GPU because of what it does:
             | graphics processing. Not because of what connectivity it
             | has.
        
               | Narishma wrote:
                | But it doesn't do graphics, so it shouldn't be called a
                | GPU. That's the whole point of this thread.
        
               | samus wrote:
               | But it does - it just needs an application to retrieve
               | the buffer and do something with it. For example pushing
               | it to storage.
        
               | jdiff wrote:
               | It does do graphics. Calculating graphics is different
               | from handling display output. You can separate the two.
               | 
               | Like someone else mentioned, laptops often have discrete
               | graphics cards that are not wired to display hardware at
               | all, needing to shuffle framebuffers through the onboard
               | graphics when something needs to make its way to a
               | screen.
        
               | Narishma wrote:
               | > Like someone else mentioned, laptops often have
               | discrete graphics cards that are not wired to display
               | hardware at all, needing to shuffle framebuffers through
               | the onboard graphics when something needs to make its way
               | to a screen.
               | 
               | Those are GPUs even if they aren't connected to a display
               | because they still have graphics components like ROPs,
               | TMUs and whatnot.
        
               | jdiff wrote:
               | You're free to define it that way, but that's
               | substantially different from GP's "if it's not a display
               | adapter, it's not a GPU" that I was pushing against. It
               | does seem pretty fragile to define a GPU in terms of the
               | particular architecture of the day, though. There's
               | plenty of things called GPUs that don't/didn't have TMUs,
               | for example.
        
             | samus wrote:
              | CPUs and DSPs are not primarily designed for graphics
              | work, therefore they don't count as GPUs. CPUs are
              | general-purpose, and DSPs might be abused for graphics
              | work.
              | 
              | The "G" in GPU doesn't imply that they have to render
              | directly to a screen. In fact, professional graphics cards
              | are commonly used for bulk rendering of animated videos.
             | 
             | Datacenter GPUs are mostly used for AI these days, but they
             | can nevertheless do graphics work very well, and if they
             | are used for generative AI or if their built-in super
             | sampling capability is used, the distinction becomes rather
             | blurry.
        
           | omikun wrote:
            | It's the shader core of a GPU. There are no graphics-
            | specific pipeline stages, e.g. vertex processing, culling,
            | rasterization, color buffer, depth buffer, etc. That's like
            | saying a CPU is also a GPU if it runs graphics in software.
        
       | Narishma wrote:
        | Yet another "GPU" providing no graphics functionality. IMO these
        | should be called something else.
        
         | tossandthrow wrote:
          | I think the term that's becoming established is AIA, AI
          | Accelerator.
        
           | n4r9 wrote:
           | That would ignore applications like crypto mining, which I'm
           | guessing is still a biggie.
           | 
           | What is it exactly that sets these units apart from CPUs?
           | Something to do with the parallel nature of the hardware?
        
             | Narishma wrote:
             | CPUs are also pretty parallel. They have multiple cores,
             | each of which can execute multiple instructions working on
             | multiple data items all in a single clock cycle.
        
             | tossandthrow wrote:
              | The fact that they are not central; they work as
              | coprocessors.
              | 
              | However, a CPU could easily embed an AIA, and certainly
              | some do.
        
             | pjc50 wrote:
             | The distinction that seems to be important is the warp-
             | thread architecture: multiple compute units sharing a
             | single program counter, but instead of the SIMD abstraction
             | they are presented as conceptually separate threads.
             | 
             | Also they tend to lack interrupt mechanisms and
             | virtualization, at least at the programmer API level
             | (usually NVIDIA systems have these but managed by the
             | proprietary top level scheduler).
        
           | fancyfredbot wrote:
           | I have seen the term NPU used in reference to neural network
           | accelerators a lot. I think AMD, Intel and Qualcomm all use
           | this term for their AI accelerators. I think Apple call their
           | AI accelerators neural engines, but I've definitely heard
           | others refer to these as NPUs even though that's not their
           | official name.
           | 
            | I'll be honest, I've never heard the AIA acronym used in this
           | way. It seems all acronyms for all processors need to end in
           | PU, for better or for worse.
        
         | andersa wrote:
         | Easy, it's now a General Processing Unit. Or perhaps a Great
         | Processing Unit?
        
           | Narishma wrote:
           | But how is that different from a CPU?
        
             | andersa wrote:
             | It starts with a G.
        
             | ginko wrote:
             | It's not the central processor.
        
               | trollied wrote:
               | It's a coprocessor. They have existed for a very long
               | time.
        
               | shrubbery wrote:
               | It's a brocessor.
        
           | barkingcat wrote:
           | Is that pronounced gee pee you, or Gip Pee You?
        
         | Lichtso wrote:
          | The first question is why there is a divide between CPUs and
          | GPUs in the first place. Yes, the gap is closing and both
          | categories are adopting features from one another, but there
          | is still a significant divide. IMO it has to do with Amdahl's
          | law [0]. In that sense CPUs should be called Latency-
          | Optimizing Processors (LOPs) and GPUs should be called
          | Throughput-Optimizing Processors (TOPs).
         | 
         | More specifically [1] we could also call CPUs long / deep data
         | dependency processors (LDDPs) and GPUs wide / flat data
         | dependency processors (WDDPs).
         | 
         | [0]: https://en.wikipedia.org/wiki/Amdahl%27s_law [1]:
         | https://en.wikipedia.org/wiki/Data_dependency
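          | 
          | For reference, Amdahl's law bounds the speedup from
          | parallelizing a fraction p of the work across s units:
          | $S(s) = \frac{1}{(1 - p) + p/s}$. Deep sequential dependency
          | chains favor latency-optimized cores, while wide, independent
          | work favors throughput-optimized ones.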
        
           | gpderetta wrote:
           | The observation that graphic hardware and general purpose
           | CPUs converge and diverge is not new:
           | http://cva.stanford.edu/classes/cs99s/papers/myer-
           | sutherland... .
           | 
           | But as you observe, we are stuck in a local optimum where
           | GPUs are optimized for throughput and CPUs for latency
           | sensitive work.
        
         | checker659 wrote:
         | GPGPU
        
         | cbm-vic-20 wrote:
         | MPU- Matrix Processing Unit.
        
         | 127 wrote:
         | TPU, a Tensor Processing Unit
         | 
         | Tensors are just n-dimensional arrays
         | 
         | Then you can run software (firmware) on top of the TPU to make
         | it behave like a GPU.
        
         | deivid wrote:
          | I've been thinking about starting a project to build a
          | 'display adapter', but I've gotten stuck before starting, as I
          | wasn't able to figure out what the communication protocol is
          | between UEFI's GOP driver and the display adapter. I've been
          | trying to piece it together from EDK2's source, but it's
          | unclear how much of this is QEMU-specific.
        
         | how2dothis wrote:
          | ...Not meant to sound offensive... but a graphics card has
          | ports (to attach a keyboard to), RAM (very fast), CPUs (many)
          | and may be used like a full computer, even without being
          | driven by someone else like ...you - I suspect, no?
          | 
          | ...I for my part want to say thanks for the findings! :-)
          | 
          | [Setting: weekend mode]
        
         | djmips wrote:
         | Haha I love this project but it's just PU
        
       | piotrrojek wrote:
        | Really awesome project. I want to get into FPGAs, but honestly
        | it's hard to even grasp where to start, and the whole field
        | feels very intimidating. My eventual goal would be to create an
        | acceleration card for LLMs (completely arbitrary), so a lot of
        | the same bits and pieces as in this project, probably except
        | for the memory-offloading part needed to load bigger models.
        
         | IshKebab wrote:
         | You might want to pick a more realistic goal! An FPGA capable
         | of accelerating LLMs is going to cost at least tens of
            | thousands, probably hundreds of thousands.
        
           | JoachimS wrote:
           | Depends heavily on what system it is supposed to provide
           | acceleration for.
           | 
            | If it is an MCU based on a simple ARM Cortex M0, M0+, M3 or
            | RISC-V RV32I, then you could use an iCE40 or similar FPGA to
            | provide a big acceleration by just using the DSPs and the
            | big SPRAM.
            | 
            | Basically, add the custom compute operations and the space
            | that doesn't exist in the MCU, operations that would take
            | many instructions to do in SW. Also, just offloading to the
            | FPGA AI 'co-processor' frees up the MCU to do other things.
            | 
            | The kernel operations in the Tiny GPU project are actually
            | really good examples of things you could efficiently
            | implement in an iCE40UP FPGA device, resulting in
            | substantial acceleration. And using EBRs (block RAM) and/or
            | the SPRAM for block queues would make a nice interface to
            | the MCU.
            | 
            | One could also implement a RISC-V core in the FPGA, thus
            | having a single chip with a low-latency interface to the AI
            | accelerator. You could even implement the AI accelerator as
            | a set of custom instructions. There are so many possible
            | solutions!
            | 
            | An iCE40 UP5K FPGA will set you back 9 EUR in single
            | quantity.
            | 
            | This concept of course scales up to the performance and cost
            | levels you talk about, with many possible steps in between.
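            | 
            | As a minimal sketch of the kind of offload block this could
            | be (illustrative only, not from the tiny-gpu source): an
            | 8-bit multiply-accumulate unit that maps nicely onto an
            | iCE40UP DSP slice, with the MCU streaming operands in and
            | reading the accumulator back out.
            | 
            |     // Hypothetical 8-bit MAC offload block. All names are
            |     // illustrative; the multiplier infers a DSP slice and
            |     // the accumulator is just a register.
            |     module mac8 (
            |         input  wire        clk,
            |         input  wire        rst_n,
            |         input  wire        valid,  // a and b are valid
            |         input  wire        clear,  // reset the accumulator
            |         input  wire [7:0]  a,
            |         input  wire [7:0]  b,
            |         output reg  [19:0] acc     // running sum of products
            |     );
            |         always @(posedge clk or negedge rst_n) begin
            |             if (!rst_n)     acc <= 20'd0;
            |             else if (clear) acc <= 20'd0;
            |             else if (valid) acc <= acc + a * b;
            |         end
            |     endmodule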
        
             | rjsw wrote:
             | Or use one of the combined CPU+FPGA chips like the
             | AMD/Xilinx Zynq, there are plenty of low cost dev boards
             | for them.
        
               | JoachimS wrote:
               | Sure, a good example of a step between a really tiny
               | system and 100k+ systems.
        
           | imtringued wrote:
           | Something that appears to be hardly known is that the
            | transformer architecture needs to become more compute-bound.
            | Inventing a machine learning architecture that is FLOPs-
            | heavy instead of bandwidth-heavy would be a good start.
           | 
           | It could be as simple as using a CNN instead of a V matrix.
           | Yes, this makes the architecture less efficient, but it also
           | makes it easier for an accelerator to speed it up, since CNNs
           | tend to be compute bound.
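            | 
            | As a rough rule of thumb (the roofline model), a kernel is
            | compute bound once its arithmetic intensity exceeds the
            | machine balance:
            | $\frac{\text{FLOPs}}{\text{bytes moved}} >
            | \frac{\text{peak FLOP/s}}{\text{peak bytes/s}}$.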
        
         | checker659 wrote:
         | I'm in the same boat. Here's my plan.
         | 
         | 1. Read Harris, Harris - Digital Design and Computer
         | Architecture. (2022). Elsevier.
         | https://doi.org/10.1016/c2019-0-00213-0
         | 
         | 2. Follow the author's RVFpga course to build an actual RISC-V
         | CPU on an FPGA - https://www.youtube.com/watch?v=ePv3xD3ZmnY
        
           | dailykoder wrote:
           | Love the Harris and Harris book!
           | 
           | I might add these:
           | 
           | - Computer Architecture, Fifth Edition: A Quantitative
           | Approach - https://dl.acm.org/doi/book/10.5555/1999263
           | 
           | - Computer Organization and Design RISC-V Edition: The
           | Hardware Software Interface -
           | https://dl.acm.org/doi/10.5555/3153875
           | 
           | both by Patterson and Hennessy
           | 
           | Edit: And if you want to get into CPU design and can get a
           | grip on "Advanced Computer Architecture: Parallelism,
            | Scalability, Programmability" by Kai Hwang, then I'd
            | recommend that too. It's super old and probably some things
            | are done differently in newer CPUs, but it's exceptionally
           | good to learn the fundamentals. Very well written. But I
           | think it's hard to find a good (physical) copy.
        
         | Aromasin wrote:
         | Reframe it in your mind. "Getting into FPGAs" needs to be
         | broken down. There are so many subsets of skills within the
         | field that you need to level expectations. No one expects a
         | software engineer to jump into things by building a full
         | computer from first principles, writing an instruction set
         | architecture, understanding machine code, converting that to
         | assembly, and then developing a programming language so that
         | they can write a bit of Python code to build an application.
         | You start from the top and work your way down the stack.
         | 
         | If you abstract away the complexities and focus on building a
         | system using some pre-built IP, FPGA design is pretty easy. I
         | always point people to something like MATLAB, so they can
         | create some initial applications using HDL Coder on a DevKit
         | with a Reference design. Otherwise, there's the massive
         | overhead of learning digital computing architecture, Verilog,
         | timing, transceivers/IO, pin planning, Quartus/Vivado,
         | simulation/verification, embedded systems, etc.
         | 
         | In short, start with some system-level design. Take some plug-
          | and-play IP, learn how to hook it together at the top level, and
         | insert that module into a prebuilt reference design.
         | Eventually, peel back the layers to reveal the complexity
         | underneath.
        
         | samvher wrote:
         | I don't know where you are in your journey, but I found these
         | resources very helpful to better understand digital logic and
         | CPU/GPU architecture:
         | 
         | 1. https://learn.saylor.org/course/CS301
         | 
         | 2. https://www.coursera.org/learn/comparch
         | 
         | 3. https://hdlbits.01xz.net/wiki/Main_Page
        
         | imtringued wrote:
         | If you want to accelerate LLMs, you will need to know the
         | architecture first. Start from that. The hardware is actually
          | both the easy part (design) and the hard part (manufacturing).
        
       | jgarzik wrote:
       | Nice! I warmly encourage open-core GPU work.
       | 
       | Here's another: https://github.com/jbush001/NyuziProcessor
        
         | joe_the_user wrote:
         | What would be nice would be a bare-bones CUDA implementation
          | for one of these open-core processors.
         | 
         | What size run would be needed for TSMC or some other fab to
         | produce such a processor economically?
        
       | novaRom wrote:
        | I did something similar many years ago in VHDL. There was a site
        | called opencores for different open source HDL projects. I
        | wonder if any good HPC-level, large-scale distributed HDL
        | simulator exists today. It would make sense to utilize modern
        | GPUs for RTL-level simulation.
        
         | Someone wrote:
         | > There was a site called opencores for different open source
         | HDL projects
         | 
         | Was? https://opencores.org/projects?language=VHDL. Or is that
         | not the same but similar?
        
       | Jasper_ wrote:
       | > Since threads are processed in parallel, tiny-gpu assumes that
       | all threads "converge" to the same program counter after each
       | instruction - which is a naive assumption for the sake of
       | simplicity.
       | 
       | > In real GPUs, individual threads can branch to different PCs,
        | causing branch divergence where a group of threads
       | initially being processed together has to split out into separate
       | execution.
       | 
       | Whoops. Maybe this person should try programming for a GPU before
       | attempting to build one out of silicon.
       | 
       | Not to mention the whole SIMD that... isn't.
       | 
       | (This is the same person who stapled together other people's
       | circuits to blink an LED and claimed to have built a CPU)
        
         | bootsmann wrote:
         | Isn't the first just equivalent to calling __syncthreads() on
         | every launch?
        
           | hyperbovine wrote:
           | Which experienced CUDA programmers do anyways!
        
           | stanleykm wrote:
           | syncthreads synchronizes threads within a threadgroup and not
           | across all threads.
        
           | Jasper_ wrote:
           | No, that effectively syncs all warps in a thread group. This
           | implementation isn't doing any synchronization, it's
           | independently doing PC/decode for different instructions, and
           | just assuming they won't diverge. That's... a baffling
           | combination of decisions; why do independent PC/decode if
           | they're not to diverge? It reads as a very basic lack of
           | ability to understand the core fundamental value of a GPU.
           | And this isn't a secret GPU architecture thing. Here's a
           | slide deck from 2009 going over the actual high-level
           | architecture of a GPU. Notice how fetch/decode are shared
           | between threads.
           | 
           | https://engineering.purdue.edu/~smidkiff/ece563/slides/GPU.p.
           | ..
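            | 
            | For readers who haven't seen it drawn out, a minimal sketch
            | of what "shared fetch/decode" means, assuming a 4-lane group
            | with one program counter (illustrative only, not the
            | tiny-gpu code or any real GPU's design):
            | 
            |     // One PC and one decoder shared by all lanes; only the
            |     // register state and ALU are per-lane. Names are made up.
            |     module simt_group #(parameter LANES = 4) (
            |         input  wire        clk,
            |         input  wire        rst_n,
            |         input  wire [31:0] instr,  // fetched at the shared PC
            |         output reg  [31:0] pc
            |     );
            |         // Shared decode for the whole group.
            |         wire is_inc = (instr[31:28] == 4'h1);
            | 
            |         // Per-lane state, all executing the same operation.
            |         reg [31:0] regs [0:LANES-1];
            |         integer i;
            |         always @(posedge clk or negedge rst_n) begin
            |             if (!rst_n) begin
            |                 pc <= 32'd0;
            |                 for (i = 0; i < LANES; i = i + 1)
            |                     regs[i] <= 32'd0;
            |             end else begin
            |                 pc <= pc + 32'd4;  // all lanes advance together
            |                 for (i = 0; i < LANES; i = i + 1)
            |                     if (is_inc) regs[i] <= regs[i] + 32'd1;
            |             end
            |         end
            |     endmodule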
        
       | mk_stjames wrote:
        | Uh, the ALU implements a DIV instruction straight up at the
        | hardware level? Is this normal to have as a real instruction in
        | something like a modern CUDA core, or is DIV usually a software
        | emulation instead? Because actual hardware divide circuits take
        | up a ton of space and I wouldn't have expected them in a GPU ALU.
        | 
        | It's so easy to write "DIV: begin alu_out_reg <= rs / rt; end" in
        | your Verilog, but that one line takes a lot of silicon. But the
        | person simulating this might never see that if all they do is
        | simulate the Verilog.
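        | 
        | For contrast, a minimal sketch of what a divider usually looks
        | like when area matters: a multi-cycle shift-and-subtract unit
        | that produces one quotient bit per clock, so it needs only a
        | subtractor and some shift registers instead of the big
        | combinational array a single-cycle '/' implies. (Illustrative
        | only, not from the tiny-gpu source.)
        | 
        |     // Hypothetical 8-bit sequential divider: 8 cycles latency,
        |     // tiny area. Divide-by-zero is left undefined.
        |     module div8_seq (
        |         input  wire       clk,
        |         input  wire       start,
        |         input  wire [7:0] dividend,
        |         input  wire [7:0] divisor,
        |         output reg  [7:0] quotient,
        |         output reg  [7:0] remainder,
        |         output reg        done
        |     );
        |         reg [3:0] count;
        |         reg [8:0] rem;   // extra bit for the trial subtraction
        |         reg [7:0] quo;   // dividend shifts out, quotient shifts in
        | 
        |         always @(posedge clk) begin
        |             if (start) begin
        |                 rem   <= 9'd0;
        |                 quo   <= dividend;
        |                 count <= 4'd8;
        |                 done  <= 1'b0;
        |             end else if (count != 4'd0) begin
        |                 // Bring in the next dividend bit, trial-subtract.
        |                 if ({rem[7:0], quo[7]} >= {1'b0, divisor}) begin
        |                     rem <= {rem[7:0], quo[7]} - {1'b0, divisor};
        |                     quo <= {quo[6:0], 1'b1};
        |                 end else begin
        |                     rem <= {rem[7:0], quo[7]};
        |                     quo <= {quo[6:0], 1'b0};
        |                 end
        |                 count <= count - 4'd1;
        |             end else if (!done) begin
        |                 quotient  <= quo;
        |                 remainder <= rem[7:0];
        |                 done      <= 1'b1;
        |             end
        |         end
        |     endmodule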
        
         | daghamm wrote:
         | This is just someone learning Verilog.
         | 
          | The project stops at simulation; making real hardware out of
         | this requires much more work.
        
       | userbinator wrote:
       | _Because the GPU market is so competitive, low-level technical
       | details for all modern architectures remain proprietary._
       | 
       | Except for Intel, which publishes lots of technical documentation
       | on their GPUs: https://kiwitree.net/~lina/intel-gfx-docs/prm/
       | 
       | You can also find the i810/815 manuals elsewhere online, but
       | except for an odd gap between that and the 965 (i.e. missing the
       | 855/910/915/945) for some reason, they've been pretty consistent
       | with the documentation.
        
         | matheusmoreira wrote:
         | The Linux drivers are also high quality and mainlined. Wish
         | every company followed their lead.
        
         | kimixa wrote:
         | AMD also publish a fair bit of documentation -
         | https://www.amd.com/en/developer/browse-by-resource-type/doc...
         | 
         | Includes full ISA documentation of their current and past
          | offerings, though they look like they tend to be aimed at
          | implementors rather than a "high level" description for
         | interested enthusiasts.
        
       | vineyardlabs wrote:
       | Is there a reason they're mixing non-blocking and blocking
       | assignment operators in sequential always blocks here?
        
         | urmish wrote:
         | looks like those are local variables
        
         | CamperBob2 wrote:
         | You can feel free to do that, if you're not too hung up on
         | simulation-synthesis matching.
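          | 
          | For anyone unfamiliar with the idiom being discussed, a
          | minimal sketch (not from the tiny-gpu source): blocking '='
          | for a purely local intermediate value, non-blocking '<=' for
          | the registered output. The usual rule of thumb is that this
          | is safe only when the blocking-assigned variable is never
          | read outside the block; otherwise simulation and synthesis
          | can disagree.
          | 
          |     // 'sum' acts like a local wire computed inside the
          |     // clocked block; only 'out_reg' is the real register.
          |     module avg2 (
          |         input  wire       clk,
          |         input  wire [7:0] a,
          |         input  wire [7:0] b,
          |         output reg  [7:0] out_reg
          |     );
          |         reg [8:0] sum;  // only read within this always block
          |         always @(posedge clk) begin
          |             sum     = a + b;      // blocking: used right below
          |             out_reg <= sum >> 1;  // non-blocking: registered
          |         end
          |     endmodule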
        
       ___________________________________________________________________
       (page generated 2024-04-25 23:01 UTC)