[HN Gopher] Hardware Acceleration of LLMs: A comprehensive surve...
       ___________________________________________________________________
        
       Hardware Acceleration of LLMs: A comprehensive survey and
       comparison
        
       Author : matt_d
       Score  : 248 points
       Date   : 2024-09-06 22:09 UTC (1 day ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | moffkalast wrote:
       | In-memory sounds like the way to go, not just in terms of
       | performance, but because it makes no sense to build an ASIC
       | or program an FPGA for a model that will most likely be
       | obsolete in a few months, if you're lucky.
        
         | limit499karma wrote:
         | https://arxiv.org/pdf/2402.09709
        
         | throwawaymaths wrote:
         | Yeah, it's not like foundational models ever share compute
         | kernels, or anything.
        
           | moffkalast wrote:
           | Eh, there's so much shenanigans these days even in fine
           | tuning, people adding empty layers and pruning and whatnot,
           | it's unlikely that even models based on the same one will
           | have the same architecture.
           | 
           | For new foundation models it's even worse, because there's
           | some fancy experiment every time and the llama.cpp team needs
           | two weeks to figure out how to implement it so the model can
           | even run.
        
             | throwawaymaths wrote:
             | Yeah, so you might need to implement a weird activation
             | or positional encoding in software or something, but I
             | suspect 90% will probably be the same... If it's just
             | layer count or skipped matrices, I assume it should be
             | possible to write an orchestrator that could run most of
             | those models... Unless we move to Mamba or something.
        
       | yjftsjthsd-h wrote:
       | I'm unfamiliar; in this context is "in-memory" specialized
       | hardware that combines CPU+RAM?
        
         | limit499karma wrote:
         | In-mem (generally) means no (re)loading of data from a storage
         | device.
        
           | yjftsjthsd-h wrote:
           | Sure, but I don't think that makes sense here; when I run an
           | LLM on CPU, I load to memory and run it, when I run on GPU I
           | load the model into the GPU's memory and run it, and I don't
           | have anything like that much money to burn but I imagine if I
           | used an FPGA then I would load the model into its memory and
           | then run it from there. So the fact that they're saying "in-
           | memory" in _contrast_ to ex. GPU makes me think that they 're
           | talking about something different here.
        
             | mmoskal wrote:
             | It's a different kind of memory chip that also does some
             | computation. See
             | https://en.m.wikipedia.org/wiki/In-memory_processing
        
               | adrian_b wrote:
               | While this has been proposed repeatedly for many decades,
               | I doubt that it will ever become useful.
               | 
               | Combining memory with computation seems good in theory,
               | but it is difficult to do in practice.
               | 
               | The fabrication technologies for DRAM and for
               | computational devices are very different. If you
               | implement computational units on a DRAM chip, they will
               | have a much worse performance than those implemented with
               | a dedicated fabrication process, so for instance their
               | performance per watt and per occupied area will be worse,
               | leading to higher costs than for using separate memories
               | and computational devices.
               | 
               | The higher cost might be acceptable in certain cases
               | if much higher performance is obtained. However,
               | unlike a CPU/GPU/FPGA, which you can easily reprogram
               | to implement a completely different algorithm, a
               | device with in-memory computation would inevitably be
               | much less flexible. Either it would implement
               | extremely simple operations, like adding to memory or
               | multiplying the memory, which would not increase
               | performance much because of communication overheads,
               | or it would implement more complex operations tailored
               | to whatever ML/AI algorithm is popular at the moment,
               | which would be hard to repurpose once better
               | algorithms are discovered.
        
               | vlovich123 wrote:
               | I suspect that the attempts to remove the DRAM
               | controller and embed it into the chips directly will
               | succeed in meaningfully reducing the power per
               | retrieval and increasing the bandwidth by enough that
               | it'll postpone these more esoteric architectures, even
               | though it's pretty clear that bulk data processing
               | like LLMs (and maybe even graphics) is better suited
               | to this architecture, since it's cheaper to fan out
               | the code than it is to shuffle all these bits back and
               | forth.
        
               | p1esk wrote:
               | In-memory doesn't mean in-DRAM.
               | 
               | https://arxiv.org/pdf/2406.08413
        
               | adrian_b wrote:
               | SRAM does not have enough capacity to be useful for in-
               | memory computation.
               | 
               | The existing CPUs, GPUs and FPGAs are full of SRAM that
               | is intimately mixed with the computational parts of the
               | chips and you could not find any structure improving on
               | that.
               | 
               | All the talk about in-memory computing is strictly
               | about DRAM, because only DRAM could take the amount of
               | memory from the hundreds of MB (at most) currently
               | contained inside the biggest CPUs or GPUs to the
               | hundreds of GB that might be needed by the biggest
               | ML/AI applications.
               | 
               | All the other memory technologies mentioned in the paper
               | linked by you are many years or even decades away from
               | being usable as simple memory devices. In order to be
               | used for in-memory computing, one must first solve the
               | problem of making them work as memories. For now, it is
               | not even clear if this simpler problem can be solved.
        
               | p1esk wrote:
               | Let's see: Mythic uses flash, d-Matrix uses SRAM.
               | Encharge is the only one who uses capacitor based
               | crossbars, but those are custom built from scratch and
               | very different from any existing DRAM technology.
               | 
               | Which companies are using DRAM for in-memory computing?
        
               | adrian_b wrote:
               | Mythic does not do in-memory computing, despite their
               | claims.
               | 
               | Flash cannot be used for in-memory computing, because
               | writing it is too slow.
               | 
               | According to what they say, they have an inference
               | device that uses analog computing. They have a flash
               | memory, but it stores only the weights of the model,
               | which are constant during the computation, so the
               | flash is not a working memory; it is used only to
               | reconfigure the device when a new model is loaded.
               | 
               | Analog computing for inference is actually something that
               | is much more promising than in-memory computing, so
               | Mythic might be able to develop useful devices.
               | 
               | d-Matrix appears to do true in-memory computing, but the
               | price of their devices for an amount of memory matching a
               | current GPU will be astronomical.
               | 
               | Perhaps there will be organizations willing to pay
               | huge amounts of money for very high performance, like
               | those buying Cerebras nowadays, but such an expensive
               | technology will always be a niche too small to be
               | relevant for most users.
        
               | p1esk wrote:
               | You don't need to write anything back to flash to use it
               | to compute something: the output of a floating gate
               | transistor is written to some digital buffer nearby
               | (usually SRAM). Yes, it's only used for inference, not
               | sure how that disqualifies it from being in-memory
               | computing? In-memory computing simply means there's a
               | memory device/circuit (transistor, capacitor, memristor,
               | etc) that holds a value and is used to compute another
               | value based on some input received by the cell. As
               | opposed to a traditional ALU which receives two inputs
               | from a separate memory circuit (registers) to compute the
               | output.
        
               | adrian_b wrote:
               | This is not in-memory computing, because from the point
               | of view of the inference algorithm the flash memory is
               | not a memory.
               | 
               | You can remove all the flash memory and replace all its
               | bits with suitable connections to ground or the supply
               | voltage, corresponding to the weights of the model.
               | 
               | Then the device without any flash memory will continue to
               | function exactly like before, computing the inference
               | algorithm without changes. Therefore it should be obvious
               | that this is not in-memory computing, if you can remove
               | the memory without affecting the computing.
               | 
               | The memory is needed only if you want to be able to
               | change the model, by loading another set of weights.
               | 
               | The flash memory is a configuration memory, exactly like
               | the configuration memories of logic devices like FPGAs or
               | CPLDs. In FPGAs or CPLDs you do the same thing, you load
               | the configuration memory with a new set of values, then
               | the FPGA/CPLD will implement a new logic device, until
               | the next reloading of the configuration memory.
               | 
               | Exactly like in this device, the configuration memory of
               | the FPGAs/CPLDs, which may be a flash memory too, is not
               | counted as a working memory. The FPGAs/CPLDs contain
               | memories and registers, but those are distinct from the
               | configuration memory and they cannot be implemented with
               | flash memory, like the configuration memory.
               | 
               | In this inference device with analog computing there must
               | also be a working memory, which contains mutable state,
               | but that must be implemented with capacitors that store
               | analog voltages.
               | 
               | You might talk about in-memory computing only with
               | reference to the analog memory with capacitors, but
               | even that description is likely to be misleading. From
               | the point of view of the analog memory, the inference
               | device is more probably some kind of dataflow
               | structure, where the memory capacitors implement
               | something like analog shift registers rather than
               | memory cells in which information is stored for later
               | retrieval.
        
               | vlovich123 wrote:
               | Am I misreading something?
               | 
               | > At their core, NVM arrays are arranged in two
               | dimensions and programmed to discrete conductances (Fig.
               | 5). Each crosspoint in the array has two terminals
               | connected to a word line and a bit line. Digital inputs
               | are converted to voltages, which then activate the word
               | lines. The multiplication operation is performed between
               | the conductance gij and the voltage Vi by applying Ohm's
               | law at each cell, while currents Ij accumulate along each
               | column according to Kirchhoff's current law
               | 
               | Sounds like the compute element is embedded within the
               | DRAM, but instead of doing a digital computation it's
               | done in analog space (which feels a bit wrong, since
               | the DAC+ADC combo would eat quite a bit of power, but
               | maybe it's easier to manufacture or there are other
               | reasons to do it in analog space).
               | 
               | Or you're saying it would be better with flash storage
               | because it could be used for even larger models. I
               | think that's right, but my overall point holds:
               | removal of the DRAM controller could free up a
               | significant amount of DRAM bandwidth (like 20x IIRC)
               | and reduce power (by 100x IIRC). There's value in that
               | regardless; it would be a free speedup and would
               | significantly benefit existing LLMs that rely on RAM.
               | An analog compute circuit embedded within flash would
               | be usable basically only for today's LLM
               | architectures, would not be very flexible, and would
               | require a huge change in how this stuff works to take
               | advantage of it. It might still make sense if the
               | architecture remains largely unchanged and other
               | approaches can't be as competitive, but it does lock
               | you into a design more than something more digitally
               | programmable that can also do other things.
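               | 
               | For intuition, here is a minimal digital sketch of the
               | crossbar MAC described in the quote above, assuming
               | ideal devices and ignoring DAC/ADC quantization (the
               | numbers are made up for illustration):
               | 
               |     import numpy as np
               | 
               |     # Idealized crossbar: each cell holds a
               |     # conductance g_ij, which acts as a weight.
               |     G = np.array([[0.2, 0.5],
               |                   [0.1, 0.3],
               |                   [0.4, 0.6]])  # 3 word x 2 bit lines
               | 
               |     # Digital inputs become word-line voltages V_i.
               |     V = np.array([1.0, 0.5, 0.25])
               | 
               |     # Ohm's law per cell: i_ij = g_ij * V_i.
               |     # Kirchhoff: column current I_j = sum_i g_ij * V_i,
               |     # i.e. an analog matrix-vector multiply in one shot.
               |     I = G.T @ V
               | 
               |     print(I)  # per-column currents an ADC would read
               | 
               | In the real device that sum happens in the analog
               | domain in a single step; the ADC only digitizes the
               | final column currents.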
        
               | sroussey wrote:
               | Using analog means it will be faster (digital is slow,
               | waiting for the carry on each bit), but I am curious
               | how they do the ADC. RAM fabrication is generally so
               | different that not introducing logic gates into the
               | memory makes sense.
        
               | vlovich123 wrote:
               | Digital is slow, but I would think converting the signal
               | to/from digital might be slow too. Maybe it's taking the
               | analog signal from the RAM itself & storing back the
               | analog signal with a little bit of cleanup without ever
               | going into the digital domain?
        
               | sroussey wrote:
               | Oh, absolutely. Never switching to digital would be the
               | way. And not hard for low bit counts like 4. I am very
               | interested in the methodology if they do this with 64bit.
        
               | p1esk wrote:
               | _Am I misreading something?_
               | 
               | Yes, you are. NVM stands for "non-volatile memory",
               | which is literally the opposite of DRAM.
               | 
               | Analog computation can be done using any memory cell
               | technology (transistor, capacitor, memristor, etc), but
               | the result will always go through ADC to be stored in a
               | digital buffer.
               | 
               | Flash does not provide any advantage as far as model
               | size goes; the size of a crossbar is constrained by
               | other factors (e.g. current leakage), and typically
               | it's in the ballpark of 1k x 1k matmuls. You simply
               | put more of them on a chip and try to parallelize as
               | much as possible.
               | 
               | But I largely agree with your conclusion.
        
               | janwas wrote:
               | +1. Personal opinion: accelerators are useful today
               | but have kept us in a local minimum which is certainly
               | not ideal. There are interesting approaches such as
               | near-linear low-rank approximation of attention
               | gradients [1]. Would we rather have that, or somewhat
               | better constant factors?
               | 
               | [1] https://arxiv.org/html/2408.13233v1
        
           | fulafel wrote:
           | Not in the context of discussing hardware architectures.
           | 
           | (Context in the abstract is "First, we present the
           | accelerators based on FPGAs, then we present the accelerators
           | targeting GPUs and finally accelerators ported on ASICs and
           | In-memory architectures" and the section title in the paper
           | body is "V. In-Memory Hardware Accelerators")
        
         | kurthr wrote:
         | I'd expect it to be MAC hardware embedded on the DRAM die (or
         | in the case of stacked HBM, possibly on the substrate die).
         | 
         | To quote from an old article about such acceleration, which
         | sees 19x improvements over DRAM + GPU:
         | 
         |     Since MAC operations consume the dominant part of most
         |     ML workload runtime, we propose in-subarray
         |     multiplication coupled with intra-bank accumulation.
         |     The multiplication operation is performed by performing
         |     AND operations and addition in column-based fashion
         |     while only adding less than 1% area overhead.
         | 
         | https://arxiv.org/pdf/2105.03736
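         | 
         | For intuition on how AND plus column-wise addition adds up
         | to a multiply, here is a plain-software sketch of the
         | shift-and-add idea (not the paper's actual circuit; the
         | function name is just illustrative):
         | 
         |     def and_accumulate_multiply(a: int, b: int,
         |                                 bits: int = 8) -> int:
         |         """Multiply two unsigned ints using only AND and
         |         addition, mirroring bit-serial in-subarray MACs."""
         |         acc = 0
         |         for i in range(bits):
         |             bit = (b >> i) & 1              # one bit of b
         |             mask = -bit & ((1 << bits) - 1) # ones iff bit==1
         |             partial = a & mask   # AND forms partial product
         |             acc += partial << i  # column-aligned addition
         |         return acc
         | 
         |     assert and_accumulate_multiply(13, 11) == 13 * 11
         | 
         | The in-DRAM version does the same arithmetic with row
         | activations and bit-line logic rather than a CPU loop, so
         | the operands never leave the memory array.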
        
       | jumploops wrote:
       | Curious if anyone is making AccelTran ASICs?
        
       | refibrillator wrote:
       | This paper is light on background so I'll offer some additional
       | context:
       | 
       | As early as the 90s it was observed that CPU speed (FLOPs) was
       | improving faster than memory bandwidth. In 1995 William Wulf and
       | Sally McKee predicted this divergence would lead to a "memory
       | wall", where most computations would be bottlenecked by data
       | access rather than arithmetic operations.
       | 
       | Over the past 20 years peak server hardware FLOPS has been
       | scaling at 3x every 2 years, outpacing the growth of DRAM and
       | interconnect bandwidth, which have only scaled at 1.6 and 1.4
       | times every 2 years, respectively.
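       | 
       | As a back-of-envelope sketch of how those rates compound over
       | 20 years (10 two-year periods, using the figures above):
       | 
       |     periods = 20 / 2                    # ten two-year periods
       |     flops = 3.0 ** periods              # ~59,000x
       |     dram_bw = 1.6 ** periods            # ~110x
       |     interconnect_bw = 1.4 ** periods    # ~29x
       | 
       |     # FLOPS per byte of DRAM bandwidth rose roughly
       |     # flops / dram_bw, i.e. about 540x over the same window.
       |     print(flops, dram_bw, interconnect_bw)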
       | 
       | Thus for training and inference of LLMs, the performance
       | bottleneck is increasingly shifting toward memory bandwidth.
       | Particularly for autoregressive Transformer decoder models, it
       | can be the _dominant_ bottleneck.
       | 
       | This is driving the need for new tech like compute-in-memory
       | (CIM), also known as processing-in-memory (PIM): hardware in
       | which operations are performed directly on the data in
       | memory, rather than transferring it to CPU registers first,
       | thereby improving latency and power consumption, and possibly
       | sidestepping the great "memory wall".
       | 
       | Notably, to compare ASIC and FPGA hardware across varying
       | semiconductor process sizes, the paper uses a fitted polynomial
       | to extrapolate to a common denominator of 16nm:
       | 
       |  _> Based on the article by Aaron Stillmaker and B.Baas titled
       | "Scaling equations for the accurate prediction of CMOS device
       | performance from 180 nm to 7nm," we extrapolated the performance
       | and the energy efficiency on a 16nm technology to make a fair
       | comparison_
       | 
       | But extrapolation for CIM/PIM is not done because they claim:
       | 
       |  _> As the in-memory accelerators the performance is not based
       | only on the process technology, the extrapolation is performed
       | only on the FPGA and ASIC accelerators where the process
       | technology affects significantly the performance of the systems._
       | 
       | Which strikes me as an odd claim at face value, but perhaps
       | others here could offer further insight on that decision.
       | 
       | Links below for further reading.
       | 
       | https://arxiv.org/abs/2403.14123
       | 
       | https://en.m.wikipedia.org/wiki/In-memory_processing
       | 
       | http://vcl.ece.ucdavis.edu/pubs/2017.02.VLSIintegration.Tech...
        
         | bilsbie wrote:
         | Thanks for the background. Whatever happened to memristors and
         | the promise of memory living alongside cpu?
        
           | dewarrn1 wrote:
           | That's funny, I had thought that memristors were a solved
           | problem based on this talk from a while back (2010!):
           | https://www.youtube.com/watch?v=bKGhvKyjgLY, but I gather HP
           | never really commercialized the technology. More recently,
           | there does seem to be interest in and research on the topic
           | for the reasons you and the GP post noted (e.g.,
           | https://www.nature.com/articles/s41586-023-05759-5).
        
           | tonetegeatinst wrote:
           | I believe asianometry did a YouTube video on
           | memristors....might be worth watching.
        
           | Lerc wrote:
           | Or even an architecture akin to an atonishingly large number
           | of RP2050's. It does seem like it would work well for certain
           | types of nnet architectures.
           | 
           | I've always been partial to the idea of two parallel surfaces
           | with optical links, Make a connection machine style hypercube
           | where the bit of the ID of every processor indicates its
           | location in the hypercube. Place all of the even parity CPUs
           | on one surface and the odd parity CPUs on the other surface,
           | every CPU would have line of sight on its neighbour in the
           | hypercube (as well as the diametrically opposed CPU with all
           | the ID bits flipped)
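           | 
           | A quick sketch of why the parity split gives every CPU
           | line of sight to all of its hypercube neighbours
           | (hypothetical 4-bit IDs, purely for illustration):
           | 
           |     def parity(x: int) -> int:
           |         return bin(x).count("1") & 1
           | 
           |     DIM = 4                        # 16-node hypercube
           |     for node in range(1 << DIM):
           |         # A neighbour differs in exactly one ID bit, so
           |         # its parity is always opposite: it sits on the
           |         # facing surface, reachable by an optical link.
           |         nbrs = [node ^ (1 << b) for b in range(DIM)]
           |         assert all(parity(n) != parity(node) for n in nbrs)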
        
             | phh wrote:
             | > Or even an architecture akin to an astonishingly large
             | number of RP2050's.
             | 
             | Groq and Cerebras are probably that kind of
             | architecture.
        
           | iml7 wrote:
           | It came and went in the form of Optane.
        
           | chatmasta wrote:
           | I'm a layman on this topic, so I'm definitely about to say
           | something wrong. But I recall an intriguing idea about a sort
           | of "reversion to analog," whereby we use the full range of
           | voltage crossing a resistor. Instead of cutting it in half to
           | produce binary (high voltage is 1, low voltage is 0), we
           | could treat the voltage as a scalar weight within a network
           | of resistors.
           | 
           | Has anyone else heard of this idea or have any insight on it?
        
             | gugagore wrote:
             | The question then emerges: how do we program these things?
             | 
             | Sara Achour has some answers, e.g.
             | https://scholar.google.com/citations?view_op=view_citation&h...
        
         | nickpsecurity wrote:
         | They mostly failed in the market. I have a list of them here:
         | 
         | https://news.ycombinator.com/item?id=41069685
         | 
         | I like the one that's in RAM sticks with an affordable price. I
         | could imagine cramming a bunch of them into a 1U board with
         | high-speed interconnects. Or just PCI cards full of them.
        
         | _zoltan_ wrote:
         | While this might have been true for a while before 2018,
         | since then 400GbE Ethernet has become the fastest adopted
         | interconnect, and _today_ 1.6Tbit interconnects exist. PCIe
         | 4.0 came and went so fast that it lived maybe 2 years.
         | 
         | NVMeOF has been scaling with fabric performance and it's been
         | great.
         | 
         | 400GB/s interconnect on the H100 DGX today.
        
         | fsndz wrote:
         | True. Samsung's Dr. Jung Bae Lee was also talking about that
         | recently.
         | 
         | "rapid growth of AI models is being constrained by a growing
         | disparity between compute performance and memory bandwidth.
         | While next-generation models like GPT-5 are expected to reach
         | an unprecedented scale of 3-5 trillion parameters, the
         | technical bottleneck of memory bandwidth is becoming a critical
         | obstacle to realizing their full potential."
         | 
         | https://www.lycee.ai/blog/2024-09-04-samsung-memory-bottlene...
        
       | next_xibalba wrote:
       | Could an FPGA + ASIC + in-mem hybrid architecture have any
       | role to play in scaling/flexibility? Each one has its own
       | benefits (e.g., FPGAs for flexibility, ASICs for performance,
       | in-memory for energy efficiency), so could a hybrid approach
       | integrating all three juice LLM perf even further?
        
         | synergy20 wrote:
         | Normally it's FPGA + memory first; when it hits a sweet spot
         | in the market with volume, you then turn the FPGA into an
         | ASIC for performance and cost savings. Big companies will go
         | ASIC directly.
        
       | synergy20 wrote:
       | Memory movement is the bottleneck these days, hence the
       | expensive HBM. Nvidia's designs are also memory-optimized,
       | since memory is the true bottleneck both chip-wise and
       | system-wise.
        
         | DrNosferatu wrote:
         | Why haven't all GPUs migrated to HBMx?
         | 
         | You seldom see it.
        
           | iml7 wrote:
           | expensive
        
           | deepnotderp wrote:
           | It's expensive, not needed for most consumer workloads,
           | and, ironically, often actually _worse for latency_ for
           | many access patterns, even though it's much higher
           | bandwidth.
        
             | ska wrote:
             | why is that ironic? Many ways to increase bandwidth come at
             | the cost of latency...
        
       | koolala wrote:
       | I'd love to watch an LLM run in WebGL where everything is
       | textures. It would be neat to visually see the difference in
       | architectures.
        
         | vanviegen wrote:
         | Wouldn't that be just like watching static noise?
        
           | archerx wrote:
           | I think some patterns would appear.
        
             | Twirrim wrote:
             | Why would you expect patterns to appear?
        
         | iml7 wrote:
         | Doesn't Google have a tool that allows you to check the
         | activation status of the matrix? Gemma Scope
        
       | smcleod wrote:
       | Is there a "nice" way to read content on Arxiv?
       | 
       | Every time I land on that site I'm so confused / lost in its
       | interface (or lack thereof) that I usually end up leaving
       | without getting to the content.
        
         | buildbot wrote:
         | It's a paper preprint website, so everything is formatted
         | as PDFs by default. They recently added HTML versions
         | (https://arxiv.org/html/2409.03384v1), which is the best way
         | to read it, per paper. There are a few arxiv frontends, like
         | https://arxiv-sanity-lite.com/
        
         | Noumenon72 wrote:
         | Same here -- I visited this link earlier today and thought "Oh,
         | it's just an abstract, I'm out". I've read Arxiv papers before
         | but the UI just doesn't look like it offers any content.
        
         | johndough wrote:
         | Click on "View PDF" or "HTML (experimental)" on the top right
         | to get to the content.
        
           | TheMysteryTrain wrote:
           | I've always had the same issue as OP - it's never bothered me
           | much because I'm rarely in the mood to read something so
           | dense.
           | 
           | But I find it quite interesting that I've managed to
           | completely miss the big obvious blue buttons every time, I
           | just immediately scan down to the first paragraph.
           | 
           | The cynic in me guesses it's because I'm so used to
           | extraneous content taking up space that I instinctively skim
           | past, but maybe that's too pessimistic & there's another
           | UX/psychological reason for it.
        
             | beefnugs wrote:
             | It's the idea of the gateway: we click on a link and
             | expect it to be what we are looking for. Having some
             | "summary" or paywall or anything between us and what we
             | thought we were getting triggers a QUICK, CLOSE THE
             | STUPID WEBSITE AS FAST AS POSSIBLE reaction.
        
       | fulafel wrote:
       | Related:
       | 
       | https://arxiv.org/pdf/2406.08413 Memory Is All You Need: An
       | Overview of Compute-in-Memory Architectures for Accelerating
       | Large Language Model Inference
        
       | smusamashah wrote:
       | There was a paper about an LLM running on the same power as a
       | light bulb.
       | 
       | https://arxiv.org/abs/2406.02528
       | 
       | https://news.ucsc.edu/2024/06/matmul-free-llm.html
        
         | transpute wrote:
         | Claims 90% memory reduction with OSS code for replication on
         | standard GPUs, https://github.com/ridgerchu/matmulfreellm
         | 
         |  _> ..avoid using matrix multiplication using two main
         | techniques. The first is a method to force all the numbers
         | within the matrices to be ternary, meaning they can take one of
         | three values: negative one, zero, or positive one. This allows
         | the computation to be reduced to summing numbers rather than
         | multiplying.. Instead of multiplying every single number in one
         | matrix with every single number in the other matrix.. the
         | matrices are overlaid and only the most important operations
         | are performed.. researchers were able to maintain the
         | performance of the neural network by introducing time-based
         | computation in the training of the model. This enables the
         | network to have a "memory" of the important information it
         | processes, enhancing performance.
         | 
         | > ... On standard GPUs.. network achieved about 10 times less
         | memory consumption and operated about 25 percent faster.. could
         | provide a path forward to enabling the algorithms to run at
         | full capacity on devices with smaller memory like smartphones..
         | Over three weeks, the team created a [FPGA] prototype of their
         | hardware.. surpasses human-readable throughput.. on just 13
         | watts of power. Using GPUs would require about 700 watts of
         | power, meaning that the custom hardware achieved more than 50
         | times the efficiency of GPUs._
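         | 
         | A toy sketch of why ternary weights remove the
         | multiplications (not the paper's code; the shapes and
         | values are made up):
         | 
         |     import numpy as np
         | 
         |     rng = np.random.default_rng(0)
         |     W = rng.integers(-1, 2, size=(4, 8))  # in {-1, 0, +1}
         |     x = rng.standard_normal(8)
         | 
         |     # Each dot product needs no multiplies: add x_j where
         |     # w_ij == +1, subtract it where w_ij == -1, skip zeros.
         |     y = np.array([x[W[i] == 1].sum() - x[W[i] == -1].sum()
         |                   for i in range(W.shape[0])])
         | 
         |     assert np.allclose(y, W @ x)  # matches a normal matmul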
        
       | mikewarot wrote:
       | I've always been partial to systolic arrays. I iterated through a
       | bunch of options over the past few decades, and settled upon what
       | I think is the optimal solution: a Cartesian grid of cells.
       | 
       | Each cell would have 4 input bits, 1 each from the neighbors, and
       | 4 output bits, again, one to each neighbor. In the middle would
       | be 64 bits of shift register from a long scan chain, the output
       | of which goes to 4 16:1 multiplexers, and 4 bits of latch.
       | 
       | Through the magic of graph coloring, a checkerboard pattern would
       | be used to clock all of the cells to allow data to flow in any
       | direction without preference, and _without race conditions_. All
       | of the inputs to any given cell would be stable.
       | 
       | This allows the flexibility of an FPGA, without the need to worry
       | about timing issues or race conditions, glitches, etc. This also
       | keeps all the lines short, so everything is local and fast/low
       | power.
       | 
       | What it doesn't do is be efficient with gates, nor give the
       | fastest path for logic. Every single operation happens
       | effectively in parallel. All computation is pipelined.
       | 
       | I've had this idea since about 1982... I really wish someone
       | would pick it up and run with it. I call it the BitGrid.
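       | 
       | A minimal sketch of one such cell as I read the description
       | above (the direction ordering and bit layout are my own
       | assumptions):
       | 
       |     def bitgrid_cell(config: int, n: int, e: int,
       |                      s: int, w: int):
       |         """64 config bits = four 16-entry truth tables, one
       |         per output direction; the 4 neighbour input bits
       |         drive a 16:1 mux selecting an entry from each."""
       |         index = (n << 3) | (e << 2) | (s << 1) | w  # 0..15
       |         outs = []
       |         for d in range(4):                # N, E, S, W outputs
       |             lut = (config >> (16 * d)) & 0xFFFF
       |             outs.append((lut >> index) & 1)   # the 16:1 mux
       |         return tuple(outs)
       | 
       |     # Example: north output = XOR of east and west inputs.
       |     xor_tbl = sum(1 << i for i in range(16) if ((i >> 2) ^ i) & 1)
       |     print(bitgrid_cell(xor_tbl, n=0, e=1, s=0, w=0))  # (1, 0, 0, 0)
       | 
       | A full grid would then evaluate all the cells of one
       | checkerboard colour, latch, and evaluate the other colour, so
       | every cell's inputs are stable when it is clocked.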
        
         | covoeus wrote:
         | Sounds similar to the GA144 chip from the inventor of Forth
        
           | mikewarot wrote:
           | The GA144 chip is a 12x12 grid of RISC Stack oriented CPUs
           | with networking hardware between them. Each can be doing
           | something different.
           | 
           | The BitGrid is bit-oriented rather than byte- or
           | word-oriented, which makes it flexible for doing FP8,
           | FP16, or whatever other type of calculation is required.
           | Since everything is clocked at the same rate/step, data
           | flows continuously, without need for queue/dequeue, etc.
           | 
           | Ideally, a chip could be made that's a billion cells. The
           | structure is simple and uniform. An emulator exists, and
           | it returns the exact same answer as a billion-cell chip
           | would (albeit much more slowly). You could divide up the
           | simulation among a network of CPUs to speed it up.
        
       | DrNosferatu wrote:
       | The values (namely for the FPGAs) should also have been
       | normalized by price.
        
       | fsndz wrote:
       | This explains the success of Groq's ASIC-powered LPUs. LLM
       | inference on Groq Cloud is blazingly fast. Also, the reduction in
       | energy consumption is nice.
        
       ___________________________________________________________________
       (page generated 2024-09-07 23:01 UTC)