[HN Gopher] Ask HN: What is an A.I. chip and how does it work?
       ___________________________________________________________________
        
       Ask HN: What is an A.I. chip and how does it work?
        
        With all the current news about NVIDIA AI/ML chips, can anybody
        give an overview of AI/ML/NPU/TPU/etc. chips and pointers to
        detailed technical papers/books/videos about them? All I am able
        to find are marketing/sales/general overviews which really don't
        explain anything. I am looking for a technical deep dive.
        
       Author : rramadass
       Score  : 126 points
       Date   : 2023-05-27 07:46 UTC (15 hours ago)
        
       | sharph wrote:
        | Great video from Asianometry explaining AI chips' (GPGPUs,
        | general-purpose GPUs) roots in GPUs (graphics processing units) --
        | how did we get here and what do these chips do?
       | 
       | https://www.youtube.com/watch?v=GuV-HyslPxk
        
       | arroz wrote:
        | AI chips are just regular chips that do AI stuff faster
       | 
       | So dedicated hardware to do math stuff
        
       | psychphysic wrote:
        | Google's TPU, which they sell via Coral, is just a systolic array
        | of multiply-accumulate units arranged in a grid.
       | 
       | Here's a decent overview from the horse's mouth.
       | https://cloud.google.com/blog/products/ai-machine-learning/a...
       | 
       | It's called a systolic array because the data moves through it in
       | waves similar to what an engineer imagines the heart looks like
       | :)
        
       | m3kw9 wrote:
        | An AI chip is basically a chip that does matrix math better than
        | general-purpose CPUs
        
       | joyeuse6701 wrote:
       | Interestingly, this might be well answered by the LLMs built on
       | the technology you're interested in.
        
       | mongol wrote:
        | What would be an affordable/cheap way to get hands-on with this
        | type of hardware? Right now I have zero knowledge.
        
         | ajb117 wrote:
          | I'm pretty sure you can't buy TPUs, but people usually buy GPUs
          | instead. If you're building a personal rig, these days you can
          | get a used Nvidia RTX 3090 for about $720 USD on eBay, which is
          | pretty cheap for 24GB of VRAM. There's also the A6000 with 48GB
          | of VRAM, but that'll cost about $5000 on Amazon. Of course,
          | there are newer cards that are faster with more VRAM, like the
          | 4090 and RTX 6000, but they're also more expensive.
          | 
          | Of course, this is all still pretty expensive. If your models
          | are small enough, you can get away with even older GPUs with
          | less VRAM, like a GTX 1080 Ti. And then there are services like
          | Google Colab and vast.ai where you can rent a TPU or GPU in the
          | cloud.
         | 
         | I'd check out Tim Dettmers' guide for buying GPUs:
         | https://timdettmers.com/2023/01/30/which-gpu-for-deep-learni...
        
           | jmrm wrote:
           | AFAIK Google Coral is an inexpensive TPU you can buy right
           | now: https://coral.ai/products/accelerator/
        
             | kmeisthax wrote:
              | The problem is that this is an "inferencing" accelerator -
              | i.e. it can only execute pretrained models. You cannot
              | train a model on one of these; you need a training
              | accelerator. And pretty much all of those are either NVIDIA
              | GPUs or cloud-only offerings.
        
             | worldsayshi wrote:
              | Very cool! Although it seems to have no memory to speak of,
              | so many use cases like LLMs go away because of that, I guess?
        
               | muxamilian wrote:
               | It has 8 MB of memory. It also supports live streaming
               | the neural network to the chip, although that is slower
               | than when it is cached in the memory.
        
         | wyldfire wrote:
         | Qualcomm has an SDK [1] where you can run software on a DSP/NSP
         | simulator.
         | 
         | [1] https://developer.qualcomm.com/software/hexagon-dsp-sdk
        
         | fulafel wrote:
         | Apple and Google consumer hardware have specialized ML compute
         | features.
         | 
         | https://www.tomsguide.com/news/google-pixel-7s-most-critical...
         | 
         | https://www.macobserver.com/tips/deep-dive/what-is-apple-neu...
        
         | lexicality wrote:
          | Depending on your definition of affordable, the Windows Dev Kit
          | 2023 makes a big deal out of its NPU, but you'll have to deal
          | with Windows 11 to access it, unfortunately
        
       | ttul wrote:
       | On top of what others have said here about TPUs and their kin,
       | you can make things really scream by taping out an ASIC for a
       | specific frozen neural network (i.e. including the weights and
       | parameters).
       | 
       | If you never have to change the network - for instance to do
       | image segmentation or object recognition - then you can't get any
       | more efficient than a custom silicon design that bakes in the
       | weights as transistors.
        
       | binarymax wrote:
       | I'd start with CUDA, because knowing what a chip does won't click
       | until you see how it can be programmed to do massive parallel
       | computation and matmul.
       | 
       | I read the first book in this list about 10 years ago, and though
       | it's pretty old the concepts are solid.
       | 
       | https://developer.nvidia.com/cuda-books-archive
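        | 
        | To make the "massively parallel" part concrete, here is a minimal
        | sketch of the CUDA programming model using Numba's CUDA bindings
        | (assuming an NVIDIA GPU and the numba package; it's an
        | illustration of "one thread per output element", not tuned code):
        | 
        |     import numpy as np
        |     from numba import cuda
        | 
        |     @cuda.jit
        |     def matmul_kernel(A, B, C):
        |         # Each GPU thread computes one element of C independently.
        |         i, j = cuda.grid(2)
        |         if i < C.shape[0] and j < C.shape[1]:
        |             acc = 0.0
        |             for k in range(A.shape[1]):
        |                 acc += A[i, k] * B[k, j]
        |             C[i, j] = acc
        | 
        |     A = np.random.rand(256, 256).astype(np.float32)
        |     B = np.random.rand(256, 256).astype(np.float32)
        |     C = np.zeros((256, 256), dtype=np.float32)
        | 
        |     threads = (16, 16)                          # 256 threads per block
        |     blocks = (C.shape[0] // 16, C.shape[1] // 16)
        |     matmul_kernel[blocks, threads](A, B, C)
        |     print(np.allclose(C, A @ B, atol=1e-3))     # True
        | 
        | The real libraries (cuBLAS, cuDNN) are far more sophisticated, but
        | the mental model of launching thousands of tiny threads over a
        | grid is what makes the hardware make sense.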
        
         | dogma1138 wrote:
          | CUDA abstracts most of the parallelism. The magic of CUDA is
          | that it gave developers a C/C++ API (or language, if you will)
          | that doesn't really require them to think about it; they can
          | continue writing their programs as they did back in the day
          | when programming for mostly single-core, single-threaded CPUs,
          | and CUDA takes care of the rest.
          | 
          | Even "manual" CUDA optimizations deal more with concurrency and
          | data residency than with parallelism, and even those are
          | usually limited to following the compute guide for your
          | specific hardware and feature set; the driver does the majority
          | of the heavy lifting.
        
       | nl wrote:
       | There's a lot of information here about chips which are mostly
       | built for training neural networks.
       | 
       | It's worth noting there are very widely deployed chips primarily
       | built for inference (running the network) especially on mobile
       | phones.
       | 
        | Depending on the device and manufacturer, sometimes this is
        | implemented as part of the CPU itself, but functionally it's the
        | same idea.
        | 
        | The Apple Neural Engine is a good example of this. It is separate
        | from the GPU, which is also on the same chip as the CPU.
       | 
       | Further information is here:
       | https://machinelearning.apple.com/research/neural-engine-tra...
       | 
        | The Google Tensor SoC used in the Pixel has a similar coprocessor
        | called the Edge TPU.
        
       | sremani wrote:
        | I want to latch onto this question a bit -- which company out
        | there is primed to bring us a CUDA competitor? AMD has failed, so
        | any wise words from the people in the industry?
        
       | nottorp wrote:
       | An "AI" chip is marketing. But as other posts say, "linear
       | algebra coprocessor" doesn't roll of the tongue as well.
       | 
       | Incidentally there used to be a proper "AI" chip. The original
       | perceptron was intended to be implemented in hardware. But
       | general purpose chips evolved much faster.
       | 
       | https://en.wikipedia.org/wiki/Perceptron
        
       | zoogeny wrote:
        | There is a YouTube channel, TechTechPotato [1], that has a podcast
        | on AI hardware called "The AI Hardware Show". It's pretty small,
        | which gives you a sense of how niche this market is - but if you
        | want the 10,000-foot view from young, budding tech journalists,
        | then I think it fits the bill.
       | 
       | Some random examples of video titles from the last 6 months of
       | the channel:
       | 
       | * A Deep Dive into IBM's New Machine Learning Chip
       | 
       | * Does my PC actually use Machine Learning?
       | 
       | * Intel's Next-Gen 2023 Max CPU and Max GPU
       | 
       | * A Deep Dive into Avant, the new chip from Lattice Semiconductor
       | (White Paper Video)
       | 
       | * The AI Hardware Show 2023, Episode 1: TPU, A100, AIU, BR100,
       | MI250X
       | 
        | I think the podcaster's background is actually in HPC (High
        | Performance Computing), i.e. supercomputers. But that overlaps
        | just enough with AI hardware that he saw an opportunity to
        | capitalize on the new AI hype.
       | 
       | 1. https://www.youtube.com/c/TechTechPotato
        
         | rramadass wrote:
         | Nice, looks like a good starting point to survey the field.
        
       | anon291 wrote:
       | I've worked in this space for the past five years. The chips are
       | essentially highly parallel processors. There's no unifying
       | architecture. You have the graph-based / hpc-simulator chips like
       | Cerebras, Graphcore, etc which are basically a stick-as-many-
       | cores-as-possible situation with a high-speed networking fabric.
       | You have the 'tensor' cores like Groq where the chip operates as
       | a whole and is just well suited for tensor processing
       | (parallelizable, high-speed memory, etc).
       | 
       | At the end of the day, it's matrix multiplication acceleration
       | mostly, and then IO optimization. Literally most of the
       | optimization has nothing to do with compute. We can compute
       | faster than we can ingest.
        
         | rramadass wrote:
         | >The chips are essentially highly parallel processors
         | 
         | Right. AFAIK we already were doing SIMD, Vector Processing,
         | VLIW etc. to speed up parallel processing for numerical
         | calculations in AI/ML. What then is the reason for the
         | explosion of these _different categories of chips_? Are they
         | just glorified ASICs designed for their specific domains or
         | FPGAs programmed accordingly? If so what is their architecture
         | i.e. what are their functional units and how do they work
         | together?
        
       | neom wrote:
       | Veritasium did a pretty good video on some of them:
       | https://www.youtube.com/watch?v=GVsUOuSjvcg
        
         | atgctg wrote:
         | That video is about analog computers
        
           | neom wrote:
           | The video is about analog chips for ML/NNs. His profile of
           | this company was particularly interesting: https://mythic.ai/
           | 
           | OP asked about chips for AI.
        
             | formerly_proven wrote:
             | Do companies pay him or does he do these ads for free?
        
               | cinntaile wrote:
               | He gets paid. It says so at the start of the video, but I
               | guess that could depend on where you live.
        
       | nologic01 wrote:
       | It may help your digging and search if you have in mind what
       | those chips really try to do: Accelerate numerical linear algebra
       | calculations.
       | 
        | If you are familiar with linear algebra: these specialized chips
        | are literally silicon etched to perform vector (and more general
        | multi-array or tensor) computations faster than a general-purpose
        | CPU. They do that by loading and operating on a whole set of
        | numbers (a chunk of a vector or a matrix) _simultaneously_
        | (whereas the CPU would operate mostly serially - one at a time).
       | 
        | The advantage is (in a nutshell) a significant speedup. How much
        | depends on the problem and on how big a chunk you can process
        | simultaneously, but it can be a large factor.
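        | 
        | A toy way to feel the chunk-at-a-time idea from Python (the gap
        | below is exaggerated by interpreter overhead, but the principle -
        | operate on a whole array at once instead of element by element -
        | is the same one these chips push to the extreme):
        | 
        |     import time
        |     import numpy as np
        | 
        |     a = np.random.rand(1_000_000)
        |     b = np.random.rand(1_000_000)
        | 
        |     t0 = time.perf_counter()
        |     c_loop = np.empty_like(a)
        |     for i in range(a.size):      # "serial": one element at a time
        |         c_loop[i] = a[i] * b[i] + 1.0
        |     t1 = time.perf_counter()
        |     c_vec = a * b + 1.0          # whole chunk at once (vectorized)
        |     t2 = time.perf_counter()
        | 
        |     print(np.allclose(c_loop, c_vec))
        |     print(f"loop: {t1 - t0:.3f}s  vectorized: {t2 - t1:.4f}s")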
       | 
       | There are disadvantages that people ignore in the current AI
       | hype:
       | 
        | * The speedup is a one-off gain; Moore's law is equally dead for
        | "AI chips" and CPUs.
       | 
        | * The software you need to develop and run is extremely
        | specialized and fine-tuned, and it only applies to the above
        | linear algebra problems.
       | 
        | * In the past, such specialized numerical algebra hardware was the
        | domain of HPC (high performance computing). Many a supercomputer
        | vendor went bankrupt because the cost versus market size was not
        | there.
        
         | benreesman wrote:
         | While a given generation of accelerator can only target model
         | architectures that are comparatively proven out, and there's a
         | lag time, it's measured in years not decades.
         | 
         | I remember when NVIDIA didn't have hardware for ReLU.
         | 
          | The fact of the matter on Moore's law is that we've got
          | transistors to burn, but _not_ TDP, and have for years. These
          | stupid big L3 caches are just: "fuck it, I've got die to burn".
         | 
         | This is an old story, things migrate in and out of the "CPU",
         | but the current outlook is that we'll be targeting specialized
         | hardware more rather than less for the foreseeable future.
        
           | nologic01 wrote:
           | > the current outlook is that we'll be targeting specialized
           | hardware more rather than less for the foreseeable future.
           | 
           | I think there are some important question marks still
           | unresolved that bear on how things will play out. E.g. how
           | the training versus inference balance will land in terms of
           | usage and economics.
           | 
           | Inference is inherently more "mass market". You need it
           | locally without lags from moving data around. But inference
           | is just numerical linear algebra. Ultimately augmenting the
           | CPU to provide inference natively might be the optimal
           | arrangement.
        
           | JoeAltmaier wrote:
           | There have been some really strange instruction sets
           | conceived. As a student at Stanford we had a time-share
           | system that was home-grown (as I remember). It had opcodes to
           | reverse bits in a bitstring! And odder things. Somebody
           | needed that for some research project I guess. And then
           | repurposed the damn thing for timeshare.
           | 
           | It was pretty sad timeshare as I recall. The only machine(s?)
           | available to a population of what? 20,000? And about 40
           | terminals total. You had to sign up for 15-minute measured
           | timeslots.
           | 
           | I came from a state school that had over 1000 terminals on a
           | dozen machines, all unlimited time to students. It was a big
           | shock to find the star of Silicon Valley had such crappy
           | student services.
        
             | anthomtb wrote:
             | > It had opcodes to reverse bits in a bitstring
             | 
              | Bit reversal is used in Fast Fourier Transforms. It's not
              | entirely surprising to me that you'd have specialized
              | hardware for that operation.
              | 
              | Ref:
              | https://en.wikipedia.org/wiki/Bit-reversal_permutation#Appli...
        
               | JoeAltmaier wrote:
               | Also to count ones in a bitstring. And so on.
        
           | danieldk wrote:
           | _I remember when NVIDIA didn't have hardware for ReLU._
           | 
           | Could you elaborate? ReLU is _max(x,0)_ and CUDA had _fmaxf_
           | since CUDA 1.0 (2007).
        
             | vgatherps wrote:
              | Not sure if or what the change was, but fmax(x, 0) only
              | requires checking the sign bit instead of doing a full
              | floating-point comparison (putting aside NaN handling).
              | 
              | A hypothetical ReLU instruction could probably get away
              | with much less power and die space?
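              | 
              | For intuition, the two give identical results (NaNs aside)
              | - a quick numpy check:
              | 
              |     import numpy as np
              |     x = np.array([-2.5, -0.0, 1e-30, 3.7], dtype=np.float32)
              |     relu_max = np.maximum(x, 0.0)               # full compare
              |     relu_sgn = np.where(np.signbit(x), 0.0, x)  # sign bit only
              |     print(np.array_equal(relu_max, relu_sgn))   # True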
        
             | benreesman wrote:
             | I could easily be wrong, but IIUC the software/hardware
             | stack was, as of 2014 or so, fusing down to a specific set
             | of circuits for broadcast activation functions under the
             | broader banner of "Tensor Cores" (along with a bunch of
             | devilish FMA hot-pathing under the sheets).
             | 
             | This is a fairly random press piece off Google but there
             | are a ton of them. Hard to tell whether it's the hardware,
             | the blob, or the process node unless you work there.
             | 
             | https://developer.nvidia.com/blog/accelerating-relu-and-
             | gelu...
        
         | juujian wrote:
          | Good explanation. That also gives you an idea of why GPUs are
          | decent at accelerating computations for neural networks (think
          | CUDA) -- they are already optimized for doing many small
          | computations in parallel, albeit with slower processors, and
          | they have a lot of dedicated RAM (VRAM).
        
         | VeninVidiaVicii wrote:
         | Is this why my M1 MacBook Air can run the same R code at least
         | 10x faster than my giant Linux tower?
        
           | danieldk wrote:
           | Apple Silicon Macs have special matrix multiplication units
           | (AMX) that can do matrix multiplication fast and with low
           | energy requirements [1]. These AMX units can often beat
           | matrix multiplication on AMD/Intel CPUs (especially those
           | without a very large number of cores). Since a lot of linear
           | algebra code uses matrix multiplication and using the AMX
           | units is only a matter of linking against Accelerate (for its
            | BLAS interface), a lot of software that uses BLAS is faster on
            | Apple Silicon Macs.
           | 
           | That said, the GPUs in your M1 Mac are faster than the AMX
           | units and any reasonably modern NVIDIA GPU will wipe the
           | floor with the AMX units or Apple Silicon GPUs in raw
           | compute. However, a lot of software does not use CUDA by
           | default and for small problem sets AMX units or CPUs with
           | just AVX can be faster because they don't incur the cost of
           | data transfers from main memory to GPU memory and vice versa.
           | 
           | [1] Benchmarks:
           | 
           | https://github.com/danieldk/gemm-benchmark#example-results
           | 
           | https://explosion.ai/blog/metal-performance-shaders (scroll
           | down a bit for AMX and MPS numbers)
        
             | stephencanon wrote:
             | > That said, the GPUs in your M1 Mac are faster than the
             | AMX units
             | 
             | Not for double, which is what R mostly uses IIRC.
        
               | danieldk wrote:
               | Ah, thanks for the correction! I never use R, so I
               | assumed that it uses/supports single-precision floating
               | point.
        
         | amelius wrote:
         | > Accelerate numerical linear algebra calculations.
         | 
         | Like compute eigenvalues/eigenvectors of large matrices,
         | compute SVDs, solve large sparse systems of equations, etc?
        
           | Q6T46nT668w6i3m wrote:
           | Nothing that fancy. Usually matrix-matrix and matrix-scalar
           | multiplication.
        
             | nologic01 wrote:
             | This is indeed the bread-and-butter, but there is use of
             | all sorts of standard linear algebra algorithms. You can
             | check various xla-related (accelerated linear algebra)
             | folders in tensorflow or torch folders in pytorch to see
             | the list of what is used [1],[2]
             | 
              | [1] https://github.com/tensorflow/tensorflow/tree/8d9b35f442045b...
              | 
              | [2] https://github.com/pytorch/pytorch/blob/6e3e3dd477e0fb9768ee...
        
         | [deleted]
        
         | sebzim4500 wrote:
         | >* The speedup in a one-off gain, the death of Moore's law is
         | equally dead for "AI chips" and CPU's
         | 
          | This does not seem to be true in reality. The H100 is about
          | 2.5x 'better' for AI than the A100 (obviously it depends on
          | exactly what you are doing), and they were released about 2
          | years apart. That is roughly in line with Moore's law.
        
           | _nalply wrote:
            | The difference seems to be: more units for parallel
            | calculation, but the speed of an individual calculation
            | doesn't double anymore. In other words, Moore's law has
            | stopped for raw speed but is still alive in other areas. This
            | has some weird consequences: some models can't be processed
            | on smaller chips (because swapping in parts is too slow to be
            | useful), but after a threshold is crossed, suddenly the large
            | models run efficiently.
           | 
            | Probably we will see usage of large AI models in smaller
            | devices in the next years, because there's another way to
            | optimize: use more efficient representations of the model
            | weights. I'm thinking of posits, a different floating-point
            | system where even 6 bits are perhaps usable. When models can
            | switch to 6-bit posits from f16 (half floats), hardware can
            | load more than three times larger models. We will see whether
            | hardware for this will be mass-produced.
        
             | daveguy wrote:
             | Moore's law was never about raw speed. It is about
             | transistors per unit area and it is still very much active.
             | That transistor count is going into cache and core count,
             | but it can just as easily go into specialized linear
             | algebra units, which is what AI chips do.
        
             | raincole wrote:
             | Why from 16 to 6 is "more than three times larger"? Why not
             | 16/6=2.667 times?
        
         | rerdavies wrote:
         | > Accelerate numerical linear algebra calculations.
         | 
         | Technically, for AI, you need to accelerate numerical non-
         | linear algebra calculations, which take the general form of
         | matrix multiplication, but interpose non-linear functions at
         | key points in the calculation.
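          | 
          | In code terms, a tiny two-layer network is just matrix
          | multiplications with a non-linearity wedged between them - a
          | rough numpy sketch:
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(0)
          |     X  = rng.standard_normal((32, 128))   # batch of 32 inputs
          |     W1 = rng.standard_normal((128, 256))  # layer 1 weights
          |     W2 = rng.standard_normal((256, 10))   # layer 2 weights
          | 
          |     h = np.maximum(X @ W1, 0.0)  # matmul, then non-linearity (ReLU)
          |     y = h @ W2                   # nearly all FLOPs are in the matmuls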
        
           | ra1231963 wrote:
           | the calculations within a neural network involve both linear
           | and non-linear operations, but the fundamental mathematical
           | framework underlying AI remains rooted in linear algebra
        
         | bhu1st wrote:
         | > If you are familiar with linear algebra
         | 
          | Could you please describe some common day-to-day applications
          | of linear algebra in computing?
        
           | juujian wrote:
           | I have really lost touch with common day to day life at this
           | point, but Excel would be an excellent example. If you have a
           | column with a million data points and do some basic
           | calculation -- subtract or multiply another column -- whether
           | or not the software you use can translate that into a
           | vectorized operation under the hood using linear algebra can
           | significantly speed up your operation.
        
           | dreamcompiler wrote:
           | Nvidia exists because linear algebra is essential for 3D
           | graphics. They got a second boost from crypto-currency, which
           | isn't strictly about linear algebra but it can put those
           | computation units to good use. Now Nvidia is riding high
           | again because neural nets are all about linear algebra, and
           | LLMs are big neural nets.
        
             | visarga wrote:
             | I would also add scientific simulations to the list of
             | tasks that GPUs are used for. They parallelise well on many
             | cores.
        
           | JamesLeonis wrote:
           | Linear Algebra is like a lot of math in CS; you don't
           | necessarily see it initially, but once you get some
           | familiarity you start seeing it everywhere.
           | 
           | Others have commented on computer graphics, but (as it turns
           | out) the exact same algorithms apply for Collision Detection
           | in 3D space. And since games already are manipulating
           | graphics, you add on _another_ set of Linear Algebra
            | transforms that change the position/rotation/shear of those
            | vertices. In a similar way, the sciences (especially physics)
            | use linear algebra to build simulations of all kinds of
            | systems.
           | 
           | One surprising use is in Advertising (and other user
           | preference aggregators). Turns out a _preference_ acts like a
           | magnitude of a one-dimensional vector. String _N_ preference
           | vectors together and you get an _N-Dimension Vector_ that you
           | can perform Linear Algebra operations upon. One common
            | application is the _Dot Product_, which is a fancy way of
           | taking two N-Dimension vectors and measuring how close those
           | vectors point in the same direction in a [1, -1] range.
           | 
           | Yet another common place to find Linear Algebra is in
           | computer science papers. _Most of the time_ this is simply
           | notation; a lot of common programming forms can be
           | represented by MxN matrices. However some of those algorithms
           | will use LA as a way of parsing and breaking down a problem.
            | You will see this in compiler papers often, but it's
            | transferable to many other domains.
           | 
           | As a final and personal observation, I found that Linear
           | Algebra helped me grasp Functional Programming. In both cases
           | I am applying a transform to some input, and often stringing
           | together many such transforms. Also in both cases, the
           | transformations are sensitive to their order, and a bad
           | ordering produces nonsense just like garbage data.
        
             | visarga wrote:
              | > One common application is the Dot Product, which is a
              | fancy way of taking two N-Dimension vectors and measuring
              | how close those vectors point in the same direction in a
              | [1, -1] range.
             | 
             | That's cosine similarity, or normalised dot product. The
             | dot product can take any value when the vectors are not
             | unit norm.
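              | 
              | In numpy terms the distinction is just:
              | 
              |     import numpy as np
              |     a = np.array([3.0, 4.0, 0.0])
              |     b = np.array([1.0, 2.0, 2.0])
              |     dot = a @ b                 # unbounded magnitude: 11.0
              |     cos = dot / (np.linalg.norm(a) * np.linalg.norm(b))
              |     print(dot, round(cos, 3))   # 11.0 0.733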
        
           | wongarsu wrote:
           | The obvious fields are computer graphics (all kinds: 2d, 3d
           | rasterized and 3d raytraced are all heavy on linear algebra,
           | though 3d rasterized is the easiest to speed up with the
           | lockstep SIMD architectures we call GPUs) and neural
           | networks.
           | 
           | Computer graphics mostly because you can view the real world
           | as 3d space and the screen as 2d space, and linear algebra
           | gives you all the tools to manipulate something in 3d space
           | and project it into 2d space. Neural networks because you can
           | treat them as matrix multiplications.
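            | 
            | As a toy example (simplified - real pipelines use 4x4
            | homogeneous matrices), a pinhole-style projection of 3D
            | points onto a 2D screen is one matrix applied to every
            | vertex, plus a divide by depth:
            | 
            |     import numpy as np
            | 
            |     f = 2.0                            # toy focal length
            |     P = np.array([[f, 0, 0],
            |                   [0, f, 0],
            |                   [0, 0, 1]])
            |     pts = np.array([[ 1.0, 0.5, 4.0],  # (x, y, z) vertices
            |                     [-2.0, 1.0, 8.0]])
            | 
            |     proj = pts @ P.T                   # same matrix, every vertex
            |     screen = proj[:, :2] / proj[:, 2:] # perspective divide
            |     print(screen)                      # (0.5, 0.25), (-0.5, 0.25)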
        
       | tomek32 wrote:
       | You might find this talk interesting, The AI Chip Revolution with
       | Andrew Feldman of Cerebras, https://youtu.be/JjQkNzzPm_Q
       | 
        | It's the founder of a new AI chip company, and they talk a bit
        | about the differences.
        
       | fragmede wrote:
        | Starting from
        | https://cloud.google.com/tpu/docs/system-architecture-tpu-vm
        | what are you looking for?
        
         | rramadass wrote:
         | Yes, this is what i am looking for.
         | 
          |  _System Architecture_ of these chips with detailed Functional
          | Units and how they are used by the _AI algorithm
          | Instruction/Data streams_.
        
       | ftxbro wrote:
       | This isn't a technical deep dive, but here's a simplified
       | explanation.
       | 
       | It's a matrix multiplication
       | (https://en.wikipedia.org/wiki/Matrix_multiplication) accelerator
       | chip. Matrix multiplication is important for a few reasons. 1)
       | it's the slow part of a lot of AI algorithms, 2) it's a 'high
       | intensity' algorithm (naively n^3 computation vs. n^2 data), 3)
       | it's easily parallelizable, and 4) it's conceptually and
       | computationally simple.
       | 
        | The situation is kind of analogous to FPUs (floating point math
        | co-processor units) when they were first introduced, before they
        | were integrated into CPUs.
       | 
       | There's more to it than this, but that's the basic idea.
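        | 
        | Point 2 is the key one: the compute grows as n^3 while the data
        | only grows as n^2, so big matrices give the chip a lot of work
        | per byte fetched. Rough numbers for two 4096x4096 float32
        | matrices:
        | 
        |     n = 4096
        |     flops = 2 * n**3         # one multiply + one add per term
        |     data = 3 * n * n * 4     # read A and B, write C, in bytes
        |     print(flops / 1e9)       # ~137 GFLOP of work
        |     print(data / 1e6)        # ~201 MB of data
        |     print(flops / data)      # ~683 FLOPs per byte moved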
        
       | neximo64 wrote:
        | Short story: CPUs can do calculations, but they do them one at a
        | time. Think of something like 1 + 1 = 2. If you had 1 million
        | equations like these, a CPU will generally do them one at a time,
        | i.e. the first one, then the second, etc.
       | 
        | GPUs were optimised to draw, so they were able to do dozens of
        | these at a go. So they can be used for AI/ML in both gradient
        | descent (training) and inference (forward passes). Because you can
        | do many at a go, in parallel, they speed things up dramatically.
        | Geoff Hinton experimented with GPUs, exploiting their ability to
        | do this, but they aren't actually optimised for it. It just turned
        | out to be the best way available to do it at the time, and it
        | still is currently.
       | 
        | AI chips are optimised to do either inference or gradient
        | descent. They are not good at drawing like GPUs are. They are
        | optimised for machine learning and for joining other AI chips
        | together, so you can have massive networks of chips that compute
        | in parallel.
       | 
        | One other class of chips that has not yet shown up is ASICs that
        | mimic the transformer architecture for even more speed - though
        | the architecture changes too much at the moment for that to be
        | useful.
       | 
        | Also, because of the mechanics of scale manufacturing, GPUs are
        | currently cheaper per FLOP of compute, as the economies of scale
        | are shared with graphical uses. Though with time, if there is
        | enough scale, AI chips should end up cheaper.
        
         | strohwueste wrote:
         | Do you have any sources on those informations? I find it really
         | hard to find stuff for what you describe. Also do you know
         | about the detail of producing those Asics? Are they CMOS or
         | flash (in-memory-compute?)
        
           | thrtythreeforty wrote:
           | All current AI accelerators (that aren't a research project)
           | are ordinary CMOS. Google published some papers about TPUv3.
           | You should read them if you want to know more about the
           | architecture of these kinds of chips.
        
       | imakwana wrote:
       | Relevant: Stanford Online course - CS217 Hardware acceleration
       | for machine learning
       | 
       | https://online.stanford.edu/courses/cs217-hardware-accelerat...
       | 
       | Course website with lecture notes: https://cs217.stanford.edu/
       | 
       | Reading list: https://cs217.stanford.edu/readings
        
         | rramadass wrote:
         | Excellent!
        
       | pulkas wrote:
       | It is what ASIC for bitcoin. A new era for AI models.
        
       | visarga wrote:
       | Trying to make a list of AI accelerator chip families, anything
       | missing?
       | 
       | - GPU (Graphics Processing Unit)
       | 
       | - TPU (Tensor Processing Unit): ASIC designed for TensorFlow
       | 
       | - IPU (Intelligence Processing Unit): Graphcore
       | 
       | - HPU (Habana Processing Unit): Intel Habana Labs' Gaudi and Goya
       | AI
       | 
       | - NPU (Neural Processing Unit): Huawei, Samsung, Microsoft
       | Brainwave
       | 
       | - VPU (Vision Processing Unit): Intel Movidius
       | 
       | - DPU (Data Processing Unit): NVIDIA data center infrastructure
       | processing unit
       | 
       | - Amazon's Inferentia: Amazon's accelerator chip focused on low
       | cost
       | 
       | - Cerebras Wafer Scale Engine (WSE)
       | 
       | - SambaNova Systems DataScale
       | 
       | - Groq Tensor Streaming Processor (TSP)
        
         | rramadass wrote:
          | Nice and very much in line with what I am looking for!
         | 
         | If you don't mind, could you add links to authoritative wiki
         | pages/whitepapers/articles for each of the above? I think it
         | will give us a good starting point to start our study/research
         | from.
        
       | HarHarVeryFunny wrote:
       | Modern AI/ML is increasingly about neural nets (deep learning),
       | whose performance is based on floating point math - mostly matrix
       | multiplication and multiply-and-add operations. These neural nets
       | are increasingly massive, e.g. GPT-3 has 175 billion parameters,
       | meaning that each pass thru the net (each word generated) is
       | going to involve in excess of 175B floating point
       | multiplications!
       | 
       | When you're multiplying two large matrices together (or other
       | similar operations) there are thousands of individual multiply
       | operations that need to be performed, and they can be done in
       | parallel since these are all independent (one result doesn't
       | depend on the other).
       | 
       | So, to train/run these ML/AI models as fast as possible requires
       | the ability to perform massive numbers of floating point
       | operations in parallel, but a desktop CPU only has a limited
       | capacity to do that, since they are designed as general purpose
        | devices, not just for math. A modern CPU has multiple "cores"
        | (individual processors that can run in parallel), but only a
        | small number, ~10, and not all of these can do floating point,
        | since it has specialized FPU units for that, typically fewer in
        | number than the cores.
       | 
       | This is where GPU/TPU/etc "AI/ML" chips come in, and what makes
       | them special. They are designed specifically for this job - to do
       | massive numbers of floating point multiplications in parallel. A
       | GPU of course can run games too, but it turns out the
       | requirements for real-time graphics are very similar - a massive
        | amount of parallelism. In contrast to the CPU's ~10 cores, GPUs
        | have thousands of cores (e.g. the NVIDIA GTX 4070 has 5,888)
        | running in parallel, and these are all floating-point capable.
        | This results in the ability to do huge numbers of floating point
        | operations per second (FLOPS), e.g. the GTX 4070 can do 30 TFLOPS
        | (Tera-FLOPS) - i.e. 30,000,000,000,000 floating point
        | multiplications per second!!
       | 
       | This brings us to the second specialization of these GPU/TPU
       | chips - since they can do these ridiculous number of FLOPS, they
       | need to be fed data at an equally ridiculous rate to keep them
       | busy, so they need massive memory bandwidth - way more than the
       | CPU needs to be kept busy. The normal RAM in a desktop computer
       | is too slow for this, and is in any case in the wrong place - on
        | the motherboard, where it can only be accessed across the PCIe
        | bus, which is again way too slow to keep up. GPUs solve this
        | memory speed problem by having a specially designed memory
        | architecture and lots of very fast RAM co-located very close to
        | the GPU chip. For example, that GTX 4070 has 12GB of RAM and can
        | move data from it into its processing cores at a speed (memory
        | bandwidth) of 1TB/sec!!
       | 
       | The exact designs of the various chips differ a bit (and a lot is
       | proprietary), but they are all designed to provided these two
       | capabilities - massive floating point parallelism, and massive
       | memory bandwidth to feed it.
       | 
       | If you want to get into this in detail, best place to start would
       | be to look into low level CUDA programming for NVIDIAs cards.
        | CUDA is the lowest-level API that NVIDIA provides to program
        | their GPUs.
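        | 
        | A rough back-of-the-envelope (rounded numbers, and naively
        | assuming you simply stream the weights once per token) shows why
        | the memory side dominates for big LLMs:
        | 
        |     params = 175e9                   # GPT-3-sized model
        |     weight_bytes = params * 2        # fp16/bf16 weights: ~350 GB
        |     flops_per_token = 2 * params     # ~1 multiply + 1 add per weight
        | 
        |     tflops = 30e12                   # the ~30 TFLOPS figure above
        |     bandwidth = 1e12                 # ~1 TB/s memory bandwidth
        | 
        |     print(flops_per_token / tflops)  # ~0.012 s of pure compute/token
        |     print(weight_bytes / bandwidth)  # ~0.35 s just to read the weights
        | 
        | So a single consumer card would be bandwidth-bound long before it
        | was compute-bound - and of course 350GB of weights don't fit in
        | 12GB of VRAM anyway, which is why models this size get sharded
        | across many accelerators with very fast interconnects.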
        
         | unnouinceput wrote:
         | A few finer points:
         | 
         | 1 - It's RTX 4070, not GTX 4070
         | 
          | 2 - The 30 TFLOPS you mention is the very top, when overclocked;
          | it's around 22 normally.
         | 
          | 3 - Also, those are single-precision TFLOPS, as in 32-bit. What
          | really matters nowadays is double precision. And in double
          | precision a 4070 does 0.35 TFLOPS (or 350 GFLOPS) - two orders
          | of magnitude lower. Still impressive though.
        
           | HarHarVeryFunny wrote:
           | For neural nets it's actually the opposite - half-precision
           | bfloat16 is enough. You need large range, but not much
           | accuracy.
           | 
           | Yes, the exact numbers are going to vary, but just giving a
           | data point to indicate the magnitude of the numbers. If you
           | want to quibble there's CPU SIMD too.
        
             | unnouinceput wrote:
                | For gaming, it's double precision that matters. And we
                | were talking about a certain GPU, which is used for
                | gaming, not AI. Hence why AI chips exist in the first
                | place - dedicated hardware for dedicated tasks (or ASICs
                | for short).
        
               | HarHarVeryFunny wrote:
               | The NVIDIA cards are all dual-use for gaming and
               | compute/ML. Some features like the RTX 4070's Tensor
               | Cores (incl. bfloat16) are there primarily for ML, and
               | other features like ray tracing are there for gaming.
        
       | fulafel wrote:
       | Here's one from Google (paper link at the end):
       | https://cloud.google.com/blog/topics/systems/tpu-v4-enables-...
        
       | phh wrote:
       | There are very different architectures in the wild. Some are
       | simply standard GPUs (maybe with additional support for
       | bf16/float16) (Rockchip RK1808 has one like that). You give it a
       | list of instructions pretty much like a CPU (except massively
        | parallel), and it'll execute it. BTW, when I say standard GPU,
        | I'm not saying "kinda like a GPU" - it's really, literally a GPU
        | architecture. Linux mainline support for the Amlogic A311D2's NPU
        | is 10 lines, and it's declaring a no-output Vivante GPU.
       | 
       | Some are just hardware pipeline to compute 2d/3d convolutions +
       | activation function (Rockchip RK3588 has one like that). You give
       | it the memory address + dimensions of the input matrix, the
       | memory address + dimensions of the weights, memory address +
       | dimensions of the output, which activation function you want
       | (there are only like 4 supported), then you tell it RUN, you wait
       | for a bit, and you have the result at the output memory address.
       | 
       | (I took Rockchip example to show that even in one microcosm it
       | can change a lot)
       | 
       | And then you can imagine any architecture in-between.
       | 
        | AFAIK they all work with some RAM as input and some RAM as
        | output, but some may have their own RAM, some may share RAM with
        | the system, and some might have mixed usage (the RK3588 has some
        | SRAM, and when you ask it to compute a convolution, you can tell
        | it to write either to SRAM or to system RAM).
       | 
        | It's possible that there are some components that are borderline
        | between an ISP (Image Signal Processing) and an NPU, where the
        | input is the direct camera stream, but my guess is that they do
        | some very small processing on the camera stream, then dump it to
        | RAM, then do all the heavy work from RAM to RAM. I think the
        | Pixel 4-5 had something like that.
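        | 
        | To make the second kind concrete: driving such a fixed-function
        | block mostly means filling in a job descriptor and kicking it
        | off. A purely hypothetical Python sketch (invented field names,
        | not Rockchip's actual register layout):
        | 
        |     from dataclasses import dataclass
        | 
        |     @dataclass
        |     class ConvJob:            # hypothetical descriptor, not a real API
        |         input_addr: int       # DMA address of the input tensor
        |         input_dims: tuple     # (channels, height, width)
        |         weight_addr: int
        |         weight_dims: tuple    # (out_ch, in_ch, kh, kw)
        |         output_addr: int
        |         activation: int       # e.g. 0=none, 1=ReLU, 2=sigmoid, 3=tanh
        | 
        |     def run(npu, job: ConvJob):
        |         # In a real driver these are memory-mapped register writes,
        |         # then you poll (or take an interrupt) until the done bit sets.
        |         npu.write_descriptor(job)
        |         npu.set_run_bit()
        |         npu.wait_for_done()
        |         # Result now sits at job.output_addr; no instruction stream.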
        
         | grozzle wrote:
         | The RK3588 seems like a beast for performance per dollar, I
         | have one running desktop Linux for emulators and games, but I
         | haven't used the NPU yet, because I haven't taken the time to
         | figure out how to get OpenCV to talk to it.
         | 
         | Is Rockchip's stuff "good", as NPUs go? I'm thinking of buying
         | another 3588 SoC for my robotics hobby - you seem like you'd
         | know if that's a decent idea or not.
        
           | reesul wrote:
            | Yeah, its performance vs. cost is honestly nuts. 6 TOPS is a
            | pretty solid NPU, but I don't know what their software is
            | like. Programming those accelerators is often difficult,
            | especially if you're a small-time customer.
           | 
           | Curious if anyone can weigh in on their SW usability. A quick
           | search for their user level tools showed
           | examples/documentation in Chinese(?)
        
             | grozzle wrote:
             | Yeah it's the first under $100 system I've tried, (and I'm
             | a fan of the genre) that's truly a desktop replacement in
             | terms of being a snappy responsive desktop even with lots
             | of browser tabs, etc.
             | 
             | I do know they have their own special sauce to talk to the
             | NPU. I was discouraged from making the effort myself
             | because their special sauce to talk to the VPU has barely
             | any ffmpeg support, it p much only uses gstreamer, and I'm
             | neither a masochist nor French so that's a non-starter.
        
               | reesul wrote:
               | Yeah I'm very surprised to see a single board computer at
               | less than $100 for a processor like that. Hard to tell
               | what their actual 1ku price is, but if the random Alibaba
               | I found for $20 is right, then that price for the overall
               | board is absurd.
               | 
               | By VPU are you talking about stuff like ISP, video
               | encoder/decoder, or something else?
               | 
               | Among embedded processors I've seen touting vision
               | acceleration, gstreamer support is fairly widespread. I
               | bit the bullet to learn it because my role requires it.
               | Maybe it's Stockholm Syndrome talking, but I've somehow
               | grown to like gstreamer. The learning curve was awkward.
               | I struggled with documentation and learned more by
               | analyzing some examples and trial-and-error.
        
               | grozzle wrote:
               | By VPU, I meant the square on the board that eats h.265
               | hw_enc and hw_dec and things like that, yes. I've been
               | told it's a separate square from the GPU, by someone who
               | is full-time waist-deep on getting Arch running on that
               | hardware, so I take it as fact.
               | 
               | Oh no. The prospect of gstreamer being the only way... Oh
               | no.
               | 
               | Maybe there's a zsh plugin or smth that autocompletes
               | sane defaults? AAAAAAAAHHHHHH there surely isn't, anyone
               | merciful enough to make one would just use ffmpeg
               | instead...
        
       | elseless wrote:
       | H.T. Kung's 1982 paper on systolic arrays is the genesis of what
       | are now called TPUs:
       | http://www.eecs.harvard.edu/~htk/publication/1982-kung-why-s...
        
       | sethgoodluck wrote:
        | Remember crypto miners (ASICs)? Exact same thing, but built for
        | the math around AI work instead of blockchain work.
        
       ___________________________________________________________________
       (page generated 2023-05-27 23:01 UTC)