[HN Gopher] Ask HN: How does a CPU communicate with a GPU?
       ___________________________________________________________________
        
       Ask HN: How does a CPU communicate with a GPU?
        
       I've been learning about computer architecture [1] and I've become
       comfortable with my understanding of how a processor communicates
       with main memory - be it directly, with the presence of caches or
       even virtual memory - and I/O peripherals.  But something that
       seems weirdly absent from the courses I took and what I have found
       online is how the CPU communicates with other processing units,
       such as GPUs - not only that, but an in-depth description of
       interconnecting different systems with buses (by in-depth I mean an
       RTL example/description).  I understand that as you add more
       hardware to a machine, complexity increases and software must
        intervene - so a general answer won't exist and the answer
        will depend on the implementation being discussed. That's
        fine by me.  What I'm looking for is a description of how a
        CPU tells a GPU to start executing a program. Through what
        means do they communicate - a bus? What does such a
        communication instance look like?  I'd love to get pointers
        to resources such as books and lectures that are more hands-
        on/implementation-aware.  [1] Just so that my background
        knowledge is clear: I've completed NAND2TETRIS, watched all
        of Berkeley's 2020 CS61C, and have read a good chunk of H&P
        (both Computer Architecture: A Quantitative Approach and
        Computer Organization and Design: RISC-V edition), and am now
        moving on to Onur Mutlu's lectures on advanced computer
        architecture.
        
       Author : pedrolins
       Score  : 58 points
       Date   : 2022-03-30 20:17 UTC (2 hours ago)
        
       | simne wrote:
        | A lot of things happen there.
        | 
        | But most importantly, the PCIe bus is a serial bus with a
        | virtualized interface, so there is no dedicated physical wire
        | per signal; what happens is more similar to an Ethernet
        | network: each device exposes a few endpoints, each with its
        | own controller, its own address, a few registers to store
        | state and transitions, and memory buffer(s).
        | 
        | Video cards usually support several modes of operation. In
        | the simplest modes, they behave just like RAM mapped into a
        | large chunk of the system address space, plus video registers
        | to control video output, the address mapping of video RAM,
        | and mode switching.
        | 
        | In more complex modes, video cards generate interrupts (just
        | a special type of message on PCIe).
        | 
        | In 3D modes, which are the most complex, the video controller
        | takes data from its own memory (which is mapped into the
        | system address space), where a tree of graphics primitives is
        | stored. Some of it is drawn directly from video RAM, but for
        | the rest the bus-master capability of PCIe is used, in which
        | the video controller reads additional data (textures) from
        | predefined chunks of system RAM.
        | 
        | As for GPGPU operation: usually the CPU copies data into
        | video RAM directly, then asks the video controller to run a
        | program from video RAM, and when it completes, the GPU issues
        | an interrupt and the CPU copies the result back from video
        | RAM.
        | 
        | Recent additions give the GPU the ability to read data from
        | system disks directly, using the bus mastering mentioned
        | before, but those additions are not yet widely implemented.
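        | 
        | Very roughly, and with made-up addresses, that sequence looks
        | something like this from the CPU side (a sketch, not any real
        | driver; a real one gets the mappings from the PCI BARs):
        | 
        |   #include <stdint.h>
        |   #include <string.h>
        |   
        |   #define VRAM_WINDOW ((uint8_t *)0x90000000u)           /* mapped VRAM   */
        |   #define REG_START   ((volatile uint32_t *)0x9f000000u) /* "run program" */
        |   #define REG_STATUS  ((volatile uint32_t *)0x9f000004u) /* completion    */
        |   
        |   void run_gpu_program(const uint8_t *code, size_t code_len,
        |                        const uint8_t *in, size_t in_len,
        |                        uint8_t *out, size_t out_len)
        |   {
        |       /* 1. Copy the program and input data into video RAM
        |        *    through the mapped window. */
        |       memcpy(VRAM_WINDOW, code, code_len);
        |       memcpy(VRAM_WINDOW + 0x1000, in, in_len);
        |   
        |       /* 2. Ask the video controller to run the program. */
        |       *REG_START = 1;
        |   
        |       /* 3. Wait for completion. A real driver sleeps until the
        |        *    GPU's interrupt arrives; polling a status register
        |        *    stands in for that here. */
        |       while ((*REG_STATUS & 1) == 0)
        |           ;
        |   
        |       /* 4. Copy the result back out of video RAM. */
        |       memcpy(out, VRAM_WINDOW + 0x2000, out_len);
        |   }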
        
         | simne wrote:
          | For a beginner, I think the best place to start is reading
          | about the Atari consoles, the Atari 65/130, and the NES, as
          | their ideas were later implemented in all commodity video
          | cards, just slightly extended.
          | 
          | BTW, all modern video cards use bank switching.
        
       | melenaboija wrote:
        | It is old and I am not sure everything still applies, but I
        | found this course useful for understanding how GPUs work:
       | 
       | Intro to Parallel Programming:
       | 
       | https://classroom.udacity.com/courses/cs344
       | 
       | https://developer.nvidia.com/udacity-cs344-intro-parallel-pr...
        
       | aliasaria wrote:
       | There is some good information on how PCI-Express works here:
       | https://blog.ovhcloud.com/how-pci-express-works-and-why-you-...
        
       | dragontamer wrote:
        | I'm no expert on PCIe, but it's been described to me as a network.
       | 
       | PCIe has switches, addresses, and so forth. Very much like IP-
       | addresses, except PCIe operates on a significantly faster level.
       | 
       | At its lowest-level, PCIe x1 is a single "lane", a singular
       | stream of zeros-and-ones (with various framing / error correction
        | on top). PCIe x2, x4, x8, and x16 are simply 2, 4, 8, or 16
        | lanes running in parallel and independently.
       | 
       | -------
       | 
        | PCIe is a very large and complex protocol, however. This
        | "serial" communication can be abstracted into memory-mapped
        | I/O. Instead of programming at the "packet" level, most PCIe
        | operations are seen as just RAM.
       | 
       | > even virtual memory
       | 
       | So you understand virtual memory? PCIe abstractions go up to and
        | include the virtual memory system. When your OS sets aside some
        | virtual memory for PCIe devices and programs read/write to
        | those memory addresses, the OS (and PCIe bridge) will translate
        | those RAM reads/writes into PCIe messages.
       | 
       | --------
       | 
       | I now handwave a few details and note: GPUs do the same thing on
       | their end. GPUs can also have a "virtual memory" that they
        | read/write to, which gets translated into PCIe messages.
       | 
       | This leads to a system called "Shared Virtual Memory" which has
       | become very popular in a lot of GPGPU programming circles. When
        | the CPU (or GPU) reads or writes a memory address, the data is
        | automatically copied over to the other device as needed. Caching
       | layers are layered on top to improve the efficiency (Some SVM may
       | exist on the CPU-side, so the GPU will fetch the data and store
       | it in its own local memory / caches, but always rely upon the CPU
       | as the "main owner" of the data. The reverse, GPU-side shared
       | memory, also exists, where the CPU will communicate with the
       | GPU).
       | 
       | To coordinate access to RAM properly, the entire set of atomic
       | operations + memory barriers have been added to PCIe 3.0+. So you
       | can perform "compare-and-swap" to shared virtual memory, and
       | read/write to these virtual memory locations in a standardized
       | way across all PCIe devices.
       | 
       | PCIe 4.0 and PCIe 5.0 are adding more and more features, making
       | PCIe feel more-and-more like a "shared memory system", akin to
       | cache-coherence strategies that multi-CPU / multi-socket CPUs use
        | to share RAM with each other. In the long term, I expect future
        | PCIe standards to push the interface even further toward this
        | "like a dual-CPU-socket" memory-sharing paradigm.
       | 
       | This is great because you can have 2-CPUs + 4 GPUs on one system,
       | and when GPU#2 writes to Address#0xF1235122, the shared-virtual-
       | memory system automatically translates that to its "physical"
       | location (wherever it is), and the lower-level protocols pass the
       | data to the correct location without any assistance from the
       | programmer.
       | 
       | This means that a GPU can do things like perform a linked-list
       | traversal (or tree traversal), even if all of the nodes of the
       | tree/list are in CPU#1, CPU#2, GPU#4, and GPU#1. The shared-
       | virtual-memory paradigm just handwaves the details and lets PCIe
       | 3.0 / 4.0 / 5.0 protocols handle the details automatically.
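        | 
        | In (hypothetical) code the appeal is that both sides touch
        | the same pointer. Here gpu_svm_alloc() is a made-up stand-in
        | for whatever the runtime provides (clSVMAlloc in OpenCL 2.x,
        | cudaMallocManaged in CUDA); only the shape of the idea is
        | the point:
        | 
        |   #include <stdatomic.h>
        |   #include <stddef.h>
        |   #include <stdint.h>
        |   
        |   /* Hypothetical allocator standing in for a real
        |    * shared-virtual-memory API. */
        |   extern void *gpu_svm_alloc(size_t bytes);
        |   
        |   /* One counter, visible to CPU and GPU at the same address. */
        |   _Atomic uint32_t *make_shared_counter(void)
        |   {
        |       _Atomic uint32_t *ctr = gpu_svm_alloc(sizeof *ctr);
        |       atomic_init(ctr, 0);
        |       return ctr;
        |   }
        |   
        |   /* The CPU and (in its own dialect) the GPU can both run this
        |    * claim-a-ticket loop; the PCIe 3.0+ atomics mentioned above
        |    * are what make the compare-and-swap well-defined across the
        |    * link. */
        |   uint32_t claim_next_ticket(_Atomic uint32_t *ctr)
        |   {
        |       uint32_t old = atomic_load(ctr);
        |       while (!atomic_compare_exchange_weak(ctr, &old, old + 1))
        |           ;   /* 'old' is refreshed on each failed attempt */
        |       return old;
        |   }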
        
         | simne wrote:
          | I agree that PCIe is mostly a shared-memory system.
          | 
          | But for video cards this sharing is unequal, because their
          | RAM sizes exceed the 32-bit address space, and a lot of
          | mainboards still in use have a 32-bit PCIe controller, so
          | all PCIe addresses must fit inside a 4 GB address space.
          | This shows up on Windows machines as the total installed
          | memory being reported as less than what is physically
          | present - minus approximately 0.5 GB, of which 256 MB is
          | the video RAM access window.
          | 
          | So in most cases the old rule remains in force: the video
          | card shares all of its memory through a 256 MB window
          | using bank switching.
          | 
          | As for the GPU reading main system memory, usually this is
          | of limited use, because VRAM is orders of magnitude
          | faster, even before considering the bus bandwidth used by
          | other devices such as HDDs/SSDs.
          | 
          | And in most cases the only use the GPU makes of main
          | system memory is the traditional reading of textures (for
          | the 3D accelerator) from system RAM - for example, 3D
          | software that does GPU rendering renders out of video RAM
          | only; none of it uses system RAM for that.
        
       | roschdal wrote:
       | Through the electrical wires in the PCI express port.
        
         | danielmarkbruce wrote:
         | I could be misunderstanding the context of the question, but I
         | think OP is imagining some sophisticated communication logic
         | involved at the chip level. The CPU doesn't know anything much
         | about the GPU other than it's there and data can be sent back
         | and forth to it. It doesn't know what any of the data means.
         | 
         | I think the logic OP imagines does exist, but it's actually in
          | the compiler (e.g. the CUDA compiler), figuring out exactly
          | what bytes to send to start a program, etc.
        
           | coolspot wrote:
            | Not in the compiler but in the GPU driver. A graphics
            | program (or compute program) just calls a driver's APIs
            | (DirectX/Vulkan/CUDA), and the driver then knows how to
            | do that at a low level by writing to particular regions
            | of RAM mapped to GPU registers.
        
             | danielmarkbruce wrote:
             | Yes! This is correct. My bad, it's been too long. I guess
             | either way the point is that it's done in software, not
             | hardware.
        
               | lxgr wrote:
                | There are also odd/interesting architectures, like one of
               | the earlier Raspberry Pis, where the GPU was actually
               | running its own operating system that would take care of
               | things like shader compilation.
               | 
               | In that case, what's actually being written to
               | shared/mapped memory is very high level instructions that
               | are then compiled or interpreted on the GPU (which is
               | really an entire computer, CPU and all) itself.
        
         | alberth wrote:
         | Nit pick...
         | 
         | Technically it's not "through" the electrical wires, it's
         | actually through the electrical field created _around_ the
         | electrical wires.
         | 
         | Veritasium explains https://youtu.be/bHIhgxav9LY
        
           | tux3 wrote:
           | Nitpicking the nitpick: the energy is what's in the fields,
           | but the electrical wires aren't just for show, the electrons
           | do need to be able to move in the wire for there to be a
           | current, and the physical properties of the wire have a big
           | impact on the signal.
           | 
           | So things get very complicated and unintuitive, especially at
           | high frequencies, but it's okay to say through the wire!
        
             | a9h74j wrote:
              | And as you might be alluding to, particularly at high
              | frequencies: in the skin (via the skin effect) of the wire!
             | 
             | I'll confess I have never seen a plot of actual rms current
             | density vs radius related to skin effect.
        
       | rayiner wrote:
       | Typically CPU and GPU communicate over the PCI Express bus. (It's
       | not technically a bus but a point to point connection.) From the
       | perspective of software running on the CPU, these days, that
       | communication is typically in the form of memory-mapped IO. The
       | GPU has registers and memory mapped into the CPU address space
       | using PCIE. A write to a particular address generates a message
       | on the PCIE bus that's received by the GPU and produces a write
       | to a GPU register or GPU memory.
       | 
       | The GPU also has access to system memory through the PCIE bus.
       | Typically, the CPU will construct buffers in memory with data
       | (textures, vertices), commands, and GPU code. It will then store
       | the buffer address in a GPU register and ring some sort of
       | "doorbell" by writing to another GPU register. The GPU
       | (specifically, the GPU command processor) will then read the
       | buffers from system memory, and start executing the commands.
        | Those commands can include, for example, loading GPU shader
        | programs into shader memory and triggering the shader cores to
        | execute those programs.
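        | 
        | In rough C, with made-up register offsets (every GPU family
        | defines its own), the "ring the doorbell" step looks
        | something like this:
        | 
        |   #include <stdint.h>
        |   
        |   /* Assumed register offsets inside the GPU's BAR0 mapping. */
        |   #define REG_CMDBUF_ADDR_LO  0x100
        |   #define REG_CMDBUF_ADDR_HI  0x104
        |   #define REG_CMDBUF_SIZE     0x108
        |   #define REG_DOORBELL        0x10c
        |   
        |   static void mmio_write32(volatile uint8_t *bar, uint32_t off,
        |                            uint32_t val)
        |   {
        |       /* This store is what becomes a PCIe memory-write message. */
        |       *(volatile uint32_t *)(bar + off) = val;
        |   }
        |   
        |   void submit(volatile uint8_t *bar0, uint64_t cmdbuf_bus_addr,
        |               uint32_t cmdbuf_bytes)
        |   {
        |       /* The command buffer (commands plus pointers to vertices,
        |        * textures and shader code) already sits in system memory
        |        * at cmdbuf_bus_addr. Tell the GPU where it is... */
        |       mmio_write32(bar0, REG_CMDBUF_ADDR_LO,
        |                    (uint32_t)cmdbuf_bus_addr);
        |       mmio_write32(bar0, REG_CMDBUF_ADDR_HI,
        |                    (uint32_t)(cmdbuf_bus_addr >> 32));
        |       mmio_write32(bar0, REG_CMDBUF_SIZE, cmdbuf_bytes);
        |   
        |       /* ...and ring the doorbell. The command processor then
        |        * fetches the buffer over PCIe (bus mastering) and starts
        |        * executing the commands in it. */
        |       mmio_write32(bar0, REG_DOORBELL, 1);
        |   }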
        
         | Keyframe wrote:
         | If OP or anyone else wants to see this firsthand.. well shit, I
         | feel old now, but.. try an exercise into assembly programming
         | of commodore 64. Get a VICE emulator and dig into it for a few
         | weeks. It's real easy to get into, CPU (6502 based), video chip
          | (VIC II), sound chip (famous SID), ROM chips.. they all live in
          | this address space (yeah, not mentioning pages), the CPU has three
         | registers.. it's also real fun to get into, even to this day.
        
           | vletal wrote:
           | Nice exercise. Similarly I learned most about basic computer
            | architecture by programming the 8050 in ASM as well as C.
           | 
           | And I'm 32. Am I old yet? I'm not right? Right?
        
             | silisili wrote:
             | Sorry pal!
             | 
             | I remember playing Halo in my early 20's, and chatting with
             | a guy from LA who was 34. Wow, he's so old, why was he
             | still playing video games.
             | 
             | Here I sit in my late 30's...still playing games when I
             | have time, denying that I'm old, despite the noises I make
             | getting up and random aches and pains.
        
             | Keyframe wrote:
             | 40s are new thirties, my friend. Also, painkillers help.
        
           | jeroenhd wrote:
            | There's a nice guide by Ben Eater on Youtube about breadboard
            | computers: https://www.youtube.com/playlist?list=P
           | LowKtXNTBypFbtuVMUVXN...
           | 
           | It doesn't sport any modern features like DMA, but builds up
           | from the core basics: a 6502 chip, a clock, and a blinking
           | LED, all hooked up on a breadboard. He also built a basic VGA
           | card and explains protocols like PS/2, USB, and SPI. It's a
           | great introduction or refresher into the low level hardware
           | concepts behind computers. You can even buy kits to play
           | along at home!
        
           | zokier wrote:
           | Is my understanding correct that compared to those historical
           | architectures, modern GPUs are a lot more asynchronous?
           | 
            | What I mean is that these days you'd issue a data transfer or
            | program execution on the GPU, and it will complete at its own
            | pace while the CPU in the meantime continues executing other
           | code; in contrast in those 8 bitters you'd poke a video
           | register or whatev and expect that to have more immediate
           | effect allowing those famous race the beam effects etc?
        
             | Keyframe wrote:
              | There were interrupts telling you when certain things
              | happened. If anything, it was asynchronous. The big thing
              | is also that you had to tally the cost of what you were
              | doing. There was a budget of how many cycles you got per
              | line and per screen, and you had to fit whatever you were
              | doing into that. When playing sound it was common to draw
              | a color when you fed the music into the SID, so you could
              | tell, like a crude debug/ad hoc printf, how many cycles
              | your music routines ate.
        
         | divbzero wrote:
         | Going one deeper, how does the communication work on a physical
         | level? I'm guessing the wires of the PCI Express bus passively
         | propagate the voltage and the CPU and GPU do "something" with
         | that voltage?
        
           | throw82473751 wrote:
            | Voltages, yes.. usually it's all binary digital signals,
            | running serially/in parallel and following some
            | communication protocol. Maybe you should have a look at
            | something really simple/old like UART communication to
            | get some idea of how this works, and then study how this
            | is scaled up over PCIe to understand the chat between
            | CPU/GPU?
            | 
            | Or maybe not - one does not need all the details, often
            | just the scaled-up concepts :)
           | 
           | https://en.m.wikipedia.org/wiki/Universal_asynchronous_recei.
           | ..
           | 
            | Edit: Wait, is it really already QAM over PCIe? Then UART
            | is a gross simplification, but maybe still a good one to
            | start with, depending on knowledge level?
        
             | _3u10 wrote:
             | https://pcisig.com/sites/default/files/files/PCI_Express_El
              | e... It doesn't say QAM explicitly, but it has all the QAM
              | terminology - 128 codes, inter-symbol interference, etc.
              | I'm not an RF guy by any stretch, but it sounds like QAM
              | to me.
             | 
             | This is an old spec. I think it's like equivalent to
             | QAM-512 for PCIe 6
        
             | rayiner wrote:
              | PCI-E isn't QAM. It's NRZ over a differential link, with
              | 128b/130b encoding, and then scrambled to reduce long
              | runs of 0s or 1s.
        
           | wyldfire wrote:
           | It might be easier to start with older or simpler/slower
           | buses. ISA, SPI, I2C. In some ways ISA is very different -
           | latching multiple parallel channels together instead of
           | ganging independent serial lanes. But it makes sense to start
           | off simple and consider the evolution. Modern PCIe layers
           | several awesome technologies together, especially FEC.
           | Originally they used 8b10b but I see now they're using
           | 242b256b.
        
           | rayiner wrote:
           | Before you get that deep, you need to step back for a bit.
           | The CPU is itself several different processors and
           | controllers. Look at a modern Intel CPU:
           | https://www.anandtech.com/show/3922/intels-sandy-bridge-
           | arch.... The individual x86 cores are connected via a ring
           | bus to a system agent. The ring bus is a kind of parallel
           | bus. In general, a parallel bus works by having every device
           | on the bus operating on a clock. At each clock tick (or after
           | some number of clock ticks), data can be transferred by
           | pulling address lines high or low to signify an address, and
           | pulling data lines high or low to signify the data value to
           | be written to that address.
           | 
           | The system agent then receives the memory operation and looks
           | at the system address map. If the target address is PCI-E
           | memory, it generates a PCI-E transaction using its built-in
           | PCI-E controller. The PCI-E bus is actually a multi-lane
           | serial bus. Each lane is a pair of wires using differential
           | signaling
           | (https://en.wikipedia.org/wiki/Differential_signalling). Bits
           | are sent on each lane according to a clock by manipulating
           | the voltages on the differential pairs. The voltage swings
           | don't correspond directly to 0s and 1s. Because of the data
           | rates involved and the potential for interference, cross-
           | talk, etc., an extremely complex mechanism is used to turn
           | bits into voltage swings on the differential pairs: https://p
           | cisig.com/sites/default/files/files/PCI_Express_Ele...
           | 
           | From the perspective of software, however, it's just bits
           | sent over a wire. The bits encode a PCI-E message packet:
           | https://www.semisaga.com/2019/07/pcie-tlp-header-packet-
           | form.... The packet has headers, address information, and
           | data information. But basically the packet can encode
           | transactions such as a memory write or read or register write
           | or read.
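            | 
            | As a rough sketch - plain struct fields rather than the
            | exact packed bit layout, with several attribute and
            | reserved fields omitted (see the TLP header link above
            | for the real bit positions) - a 32-bit memory-write
            | request carries roughly this information:
            | 
            |   #include <stdint.h>
            |   
            |   struct mem_write_tlp {
            |       uint8_t  fmt;          /* header format: 3 or 4 DWs, with data  */
            |       uint8_t  type;         /* request type: memory write            */
            |       uint16_t length_dw;    /* payload length in 32-bit words        */
            |       uint16_t requester_id; /* bus/device/function of the sender     */
            |       uint8_t  tag;          /* matches completions back to requests  */
            |       uint8_t  first_dw_be;  /* byte enables for the first data word  */
            |       uint8_t  last_dw_be;   /* byte enables for the last data word   */
            |       uint32_t address;      /* target address (bits [31:2])          */
            |       /* ...followed by length_dw data words: the value being written */
            |   };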
        
           | tenebrisalietum wrote:
           | Older CPUs - the CPU had a bunch of A pins (address), a bunch
           | of D pins (data).
           | 
           | The A pins would be a binary representation of an address,
           | and the D pins would be the binary representation of data.
           | 
           | A couple of other pins would select behavior (read or write)
           | and allow handshaking.
           | 
           | Those pins were connected to everything else that needed to
           | talk with the CPU on a physical level, such as RAM, I/O
           | devices, and connectors for expansion. Think 10-base-T
           | networking where multiple nodes are physically modulating one
           | common wire on an electrical level. Same concept, but you
           | have many more wires (and they're way shorter).
           | 
           | Arbitration logic was needed so things didn't step on each
           | other. Sometimes things did anyway and you couldn't talk to
           | certain devices in certain ways or your system would lock up
           | or misbehave.
           | 
           | Were there "switches" to isolate and select among various
           | banks of components? Sure, they are known as "gate arrays" -
           | those could be ASICs or implemented with simple 74xxx ICs.
           | 
            | Things like NuBus and PCI came about - the bus controller is
            | directly connected to the CPU and addressable as a device,
            | but everything else is connected to the bus controller, so
            | the new-style bus isn't tied to the CPU and can operate at
            | a different speed: CPU and bus speed are now decoupled.
           | (This was done on video controllers in the old 8-bit days as
           | well - to get to video RAM you had to talk to the video chip,
           | and couldn't talk to video RAM directly on some 8-bit
           | systems).
           | 
           | PCIE is no longer a bus, it's more like switched Ethernet -
           | there's packets and switching and data goes over what's
           | basically one wire - this ends up being faster and more
           | reliable if you use advanced modulation schemes than keeping
           | multiple wires in sync at high speeds. The controllers facing
           | the CPU still implement the same interface, though.
        
           | _3u10 wrote:
           | It's signaled similar to QAM. Far more complicated than GPIO
           | type stuff. Think FM radio / spread spectrum rather than
           | bitbanging / old school serial / parallel ports.
           | 
           | Similar to old school modems if the line is noisy it can drop
           | to lower "baud" rates. You can manually try to recover higher
           | rates if the noise is gone but it's simpler to just reboot.
        
           | tux3 wrote:
           | Oh, that is _several_ levels deeper! PCIe is a big standard
            | with several layers of abstraction, and it's far from
           | passive.
           | 
           | The different versions of PCIe use a different encoding, so
           | it's hard to sum it all up in a couple sentences in terms of
           | what the voltage does.
        
         | monkeybutton wrote:
         | IMO memory-mapped IO is the coolest thing since sliced bread.
         | It's a great example in computing where many different kinds of
         | hardware can all be brought together under a relatively simple
         | abstraction.
        
           | the__alchemist wrote:
           | It was a glorious "click" when learning embedded programming.
           | Even when writing Rust in typical desktop uses, it all
           | feels... abstract. Computer program logic. Where does the
           | magic happen? Where do you go from abstract logic to making
            | things happen? The answer is in volatile memory reads and
           | writes to memory-mapped IO. You write a word to a memory
           | address, and a voltage changes. Etc.
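            | 
            | A tiny embedded-style example of that "click" (the
            | peripheral and its address are made up; on a real part
            | they come from the datasheet):
            | 
            |   #include <stdint.h>
            |   
            |   /* Assumed address of a memory-mapped LED control register. */
            |   #define LED_CTRL_REG ((volatile uint32_t *)0x40020000u)
            |   
            |   void led_on(void)
            |   {
            |       /* The volatile store forces a real bus write, and that
            |        * write is what ends up changing a voltage on a pin. */
            |       *LED_CTRL_REG |= 1u;
            |   }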
        
       | justsomehnguy wrote:
       | TL;DR: bi-directional memory access with some means to notify the
       | other part about "something has changed".
       | 
        | It's not that different from any other PCI/E device, be it a
        | network card or a disk/HBA/RAID controller.
       | 
       | If you want to understand how it came to this - look at the
       | history of ISA, PCI/PCI-X, a short stint for AGP and finally
       | PCI-E.
       | 
        | Other comments provide a good ELI15 for the topic.
       | 
        | A minor note about "bus" - for PCIe it is mostly a historic
        | term, because it's a serial, P2P connection, though the
        | process of enumerating and querying the devices is still very
        | akin to what you would do on a bus-based system. E.g. SAS is a
        | serial "bus" compared to SCSI, but you still operate with it
        | as a "logical" bus, because it is easier for humans to grok it
        | this way.
        
       | dyingkneepad wrote:
       | On my system, the CPU sees the GPU as a PCI device. The "PCI
       | config space" [0] is a standard thing and so the CPU can read it
       | and figure out its device ID, vendor ID, revision, class, etc.
       | From that, the OS looks at its PCI drivers and tries to find
       | which one claims to drive that specific PCI device_id/vendor_id
       | combination (or class in case there's some kind of generic
       | universal driver for a certain class).
       | 
       | From there, the driver pretty much knows what to do. But
       | primarily the driver will map the registers to memory addresses,
        | so accessing offset 0xF0 from that map is equivalent to accessing
       | register 0xF0. The definition of what each register does is
       | something that the HW developers provide to the SW developers
       | [1].
       | 
       | Setting modes (screen resolution) and a lot of other stuff is
       | done directly by reading and writing to these registers. At some
       | point they also have to talk about memory (and virtual addresses)
       | and there's quite a complicated dance to map GPU virtual memory
       | to CPU virtual memory. On discrete GPUs the data is actually
       | "sent" to the memory somehow through the PCI bus (I suppose the
       | GPU can read directly from the memory without going through the
       | CPU?), but in the driver this is usually abstracted to "this is
       | another memory map". On integrated systems both the CPU and GPU
       | read directly from the system memory, but they may not share all
        | caches, so extra care is required here. In fact, caches may
        | also mess up the communication on discrete graphics, so extra
        | care is always required. The work in this paragraph is mostly
        | done by the kernel driver in Linux.
       | 
       | At some point the CPU will tell the GPU that a certain region of
       | memory is the framebuffer to be displayed. And then the CPU will
       | formulate binary programs that are written in the GPU's machine
       | code, and the CPU will submit those programs (batches) and the
       | GPU will execute them. These programs are generally in the form
       | of "I'm using textures from these addresses, this memory holds
       | the fragment shader, this other holds the geometry shader, the
       | configuration of threading and execution units is described in
       | this structure as you specified, SSBO index 0 is at this address,
       | now go and run everything". After everything is done the CPU may
       | even get an interrupt from the GPU saying things are done, so
       | they can notify user space. This paragraph describes mostly the
       | work done by the user space driver (in Linux, this is Mesa),
       | which implements OpenGL/Vulkan/etc abstractions.
       | 
       | [0]: https://en.wikipedia.org/wiki/PCI_configuration_space [1]:
       | https://01.org/linuxgraphics/documentation/hardware-specific...
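        | 
        | A userspace illustration of the "registers become a memory
        | map" part on Linux: mmap a BAR through sysfs and read the
        | register at offset 0xF0. The PCI address below is just an
        | example, it needs root, and real drivers do this inside the
        | kernel rather than like this:
        | 
        |   #include <fcntl.h>
        |   #include <stdint.h>
        |   #include <stdio.h>
        |   #include <sys/mman.h>
        |   #include <unistd.h>
        |   
        |   int main(void)
        |   {
        |       /* Example device; see lspci for the real address. */
        |       const char *bar = "/sys/bus/pci/devices/0000:01:00.0/resource0";
        |       int fd = open(bar, O_RDWR);
        |       if (fd < 0) { perror("open"); return 1; }
        |   
        |       /* Map the first page of BAR0 into our address space. */
        |       volatile uint32_t *regs = mmap(NULL, 4096,
        |                                      PROT_READ | PROT_WRITE,
        |                                      MAP_SHARED, fd, 0);
        |       if (regs == MAP_FAILED) { perror("mmap"); return 1; }
        |   
        |       /* The load below is turned into a PCIe read to the GPU. */
        |       printf("reg 0xF0 = 0x%08x\n", regs[0xF0 / 4]);
        |   
        |       munmap((void *)regs, 4096);
        |       close(fd);
        |       return 0;
        |   }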
        
       | derekzhouzhen wrote:
        | Others have mentioned MMIO. MMIO comes in several kinds:
        | 
        | 1. The CPU accessing GPU hardware with uncacheable MMIO, such
        | as low-level register access
        | 
        | 2. The GPU accessing CPU memory with cacheable MMIO, or DMA,
        | such as command and data streams
        | 
        | 3. The CPU accessing GPU memory with cacheable MMIO, such as
        | textures
        | 
        | They all happen over the bus, with different latency and
        | bandwidth.
        
       | ar_te wrote:
        | And if you are looking for some strange architecture forgotten
        | by time :)  https://www.copetti.org/writings/consoles/sega-saturn/
        
       | brooksbp wrote:
       | Woah there, my dude. Let's try to understand a simple model
       | first.
       | 
       | A CPU can access memory. When a CPU performs loads & stores it
       | initiates transactions containing the address of the memory.
       | Therefore, it is a bus master--it initiates transactions. A slave
       | accepts transactions and services them. The interconnect routes
       | those transactions to the appropriate hardware, e.g. the DDR
       | controller, based on the system address map.
       | 
       | Let's add a CPU, interconnect, and 2GB of DRAM memory:
        | 
        |       +-------+
        |       |  CPU  |
        |       +---m---+
        |           |
        |       +---s--------------------+
        |       |      Interconnect      |
        |       +-------m----------------+
        |               |
        |          +----s-----------+
        |          | DDR controller |
        |          +----------------+
        | 
        |       System Address Map:
        |       0x8000_0000 - 0x0000_0000  DDR controller
       | 
       | So, a memory access to 0x0004_0000 is going to DRAM memory
       | storage.
       | 
        | Let's add a GPU.
        | 
        |       +-------+    +-------+
        |       |  CPU  |    |  GPU  |
        |       +---m---+    +---s---+
        |           |            |
        |       +---s------------m-------+
        |       |      Interconnect      |
        |       +-------m----------------+
        |               |
        |          +----s-----------+
        |          | DDR controller |
        |          +----------------+
        | 
        |       System Address Map:
        |       0x9000_0000 - 0x8000_0000  GPU
        |       0x8000_0000 - 0x0000_0000  DDR controller
       | 
       | Now the CPU can perform loads & stores from/to the GPU. The CPU
       | can read/write registers in the GPU. But that's only one-way
       | communication. Let's make the GPU a bus master as well:
        | 
        |       +-------+    +-------+
        |       |  CPU  |    |  GPU  |
        |       +---m---+    +--s-m--+
        |           |           | |
        |       +---s-----------m-s------+
        |       |      Interconnect      |
        |       +-------m----------------+
        |               |
        |          +----s-----------+
        |          | DDR controller |
        |          +----------------+
        | 
        |       System Address Map:
        |       0x9000_0000 - 0x8000_0000  GPU
        |       0x8000_0000 - 0x0000_0000  DDR controller
       | 
       | Now, the GPU can not only receive transactions, but it can also
       | initiate transactions. Which also means it has access to DRAM
       | memory too.
       | 
       | But this is still only one-way communication (CPU->GPU). How can
       | the GPU communicate to the CPU? Well, both have access to DRAM
       | memory. The CPU can store information in DRAM memory (0x8000_0000
       | - 0x0000_0000) and then write to a register in the GPU
       | (0x9000_0000 - 0x8000_0000) to inform the GPU that the
       | information is ready. The GPU then reads that information from
       | DRAM memory. In the other direction, the GPU can store
       | information in DRAM memory, and then send an interrupt to the CPU
       | to inform the CPU that the information is ready. The CPU then
       | reads that information from DRAM memory. An alternative to using
       | interrupts is to have the CPU poll. The GPU stores information in
       | DRAM memory and then sets some bit in DRAM memory. The CPU polls
       | on this bit in DRAM memory, and when it changes, the CPU knows
       | that it can read the information in DRAM memory that was
       | previously written by the GPU.
       | 
       | Hope this helps. It's very fun stuff!
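        | 
        | In CPU-side C, using the address map above (the register
        | offset and the mailbox location in DRAM are made up for
        | illustration):
        | 
        |   #include <stdint.h>
        |   
        |   #define GPU_REG_KICK ((volatile uint32_t *)0x80000000u) /* a GPU register */
        |   #define MBOX_READY   ((volatile uint32_t *)0x00100000u) /* flag in DRAM   */
        |   #define MBOX_DATA    ((volatile uint32_t *)0x00100004u) /* data in DRAM   */
        |   
        |   uint32_t ask_gpu(uint32_t request)
        |   {
        |       *MBOX_DATA = request;   /* 1. put the request in DRAM             */
        |       *GPU_REG_KICK = 1;      /* 2. write a GPU register to announce it */
        |   
        |       while (*MBOX_READY == 0)
        |           ;                   /* 3. poll the bit the GPU will set       */
        |   
        |       return *MBOX_DATA;      /* 4. read the reply back from DRAM       */
        |   }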
        
       | pizza234 wrote:
       | You'll find a very good introduction in the comparch book "Write
       | Great Code, Volume 1", chapter 12 ("Input and Output"), which
       | also explains the history of system buses (therefore, you'll find
       | an explanation of how ISA works).
       | 
       | Interestingly, there is a footnote explaining that "Computer
       | Architecture: A Quantitative Approach provided a good chapter on
       | I/O devices and buses; sadly, as it covered very old peripheral
       | devices, the authors dropped the chapter rather than updating it
       | in subsequent revisions."
        
       | throwmeariver1 wrote:
       | Everyone in tech should read the book "Understanding the Digital
       | World" by Brian W. Kernighan.
        
         | arduinomancer wrote:
         | Is it very in-depth or more for layman readers?
        
           | throwmeariver1 wrote:
           | Most normal people would get a red head when reading it and
           | techies would nod along and sometimes say "uh... so that's
           | how it really works". It's in between but a good primer on
           | the essentials.
        
         | dyingkneepad wrote:
         | Is this before or after they read Knuth?
        
       | zoenolan wrote:
        | Others are not wrong in saying memory-mapped IO. Taking a look
        | at the Amiga Hardware Reference Manual [1], a simple example
        | [2], or a NES programming guide [3] would be a good way to see
        | this in operation.
       | 
       | A more modern CPU/GPU setup is likely to use a ring buffer. The
       | buffer will be in CPU memory. That memory is also mapped into the
        | GPU address space. The driver on the CPU will write commands
        | into the buffer, which the GPU will execute. These commands
        | are different from the shader units' instruction set.
       | 
        | Commands would be things like setting some internal GPU
        | register to a value: setting the output resolution, the
        | framebuffer base pointer, or the mouse pointer position;
        | referencing a texture from system memory; loading a shader;
        | executing a shader; or setting a fence value (useful for
        | seeing when a resource - a texture or shader - is no longer in
        | use).
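        | 
        | A cut-down sketch of that ring (the opcode and the
        | write-pointer register are made up; the shape - commands in
        | CPU memory that the GPU also has mapped, plus a pointer the
        | GPU chases - is the common part):
        | 
        |   #include <stdint.h>
        |   
        |   #define RING_ENTRIES 256
        |   
        |   struct ring {
        |       uint32_t cmds[RING_ENTRIES];     /* in CPU memory, mapped by the GPU */
        |       uint32_t wptr;                   /* driver's write position          */
        |       volatile uint32_t *gpu_wptr_reg; /* GPU's write-pointer register     */
        |   };
        |   
        |   static void ring_emit(struct ring *r, uint32_t word)
        |   {
        |       r->cmds[r->wptr % RING_ENTRIES] = word;
        |       r->wptr++;
        |   }
        |   
        |   void cmd_set_register(struct ring *r, uint32_t reg, uint32_t value)
        |   {
        |       ring_emit(r, 0x1);          /* hypothetical "set register" opcode */
        |       ring_emit(r, reg);
        |       ring_emit(r, value);
        |       *r->gpu_wptr_reg = r->wptr; /* GPU consumes up to the new pointer */
        |   }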
       | 
       | Hierarchical DMA buffers are a useful feature of some DMA
        | engines. You can think of them as similar to subroutines. The
        | command buffer can contain an instruction to switch execution
        | to another chunk of memory. This allows the driver to reuse
        | operations or expensive-to-generate sequences. OpenGL display
        | lists were commonly compiled down to a separate buffer.
       | 
       | [1] https://archive.org/details/amiga-hardware-reference-
       | manual-...
       | 
       | [2] https://www.reaktor.com/blog/crash-course-to-amiga-
       | assembly-...
       | 
       | [3] https://www.nesdev.org/wiki/Programming_guide
        
       | chubot wrote:
       | BTW I believe memory maps are set up by the ioctl() system call
       | on Unix (including OS X), which is kind of a "catch all" hole
       | poked through the kernel. Not sure about Windows.
       | 
       | I didn't understand that for a long time ...
       | 
       | I would like to see a "hello world GPU" example. I think you
        | open() the device and then ioctl() it ... But what happens when
       | things go wrong?
       | 
       | Similar to this "Hello JIT", where it shows you have to call
       | mmap() to change permissions on the memory to execute dynamically
       | generated code.
       | 
       | https://blog.reverberate.org/2012/12/hello-jit-world-joy-of-...
       | 
       | I guess one problem is that this may be typically done in vendor
       | code and they don't necessarily commit to an interface? They make
       | you link their huge SDK
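        | 
        | For what it's worth, on Linux the kernel side does commit to
        | an interface (the DRM ioctls), so a minimal "hello world" can
        | be just open() plus one ioctl; errors come back as plain
        | -1/errno. A sketch (device path may differ - card1,
        | renderD128, etc.; the header is the kernel's uapi drm.h, also
        | shipped by libdrm):
        | 
        |   #include <drm/drm.h>     /* struct drm_version, DRM_IOCTL_VERSION */
        |   #include <fcntl.h>
        |   #include <stdio.h>
        |   #include <string.h>
        |   #include <sys/ioctl.h>
        |   #include <unistd.h>
        |   
        |   int main(void)
        |   {
        |       int fd = open("/dev/dri/card0", O_RDWR);
        |       if (fd < 0) { perror("open /dev/dri/card0"); return 1; }
        |   
        |       char name[64] = {0};
        |       struct drm_version v;
        |       memset(&v, 0, sizeof v);
        |       v.name = name;
        |       v.name_len = sizeof name - 1;  /* kernel fills in the driver name */
        |   
        |       if (ioctl(fd, DRM_IOCTL_VERSION, &v) < 0) {
        |           perror("DRM_IOCTL_VERSION");  /* the "when things go wrong" path */
        |           close(fd);
        |           return 1;
        |       }
        |   
        |       printf("driver: %s %d.%d.%d\n", name, v.version_major,
        |              v.version_minor, v.version_patchlevel);
        |       close(fd);
        |       return 0;
        |   }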
        
       ___________________________________________________________________
       (page generated 2022-03-30 23:01 UTC)