[HN Gopher] Ask HN: What is an A.I. chip and how does it work?
___________________________________________________________________
Ask HN: What is an A.I. chip and how does it work?
With all the current news about NVIDIA AI/ML chips, can anybody
give an overview of AI/ML/NPU/TPU/etc. chips and pointers to
detailed technical papers/books/videos about them? All I am able
to find are marketing/sales/general overviews which really don't
explain anything. I am looking for a technical deep dive.
Author : rramadass
Score : 126 points
Date : 2023-05-27 07:46 UTC (15 hours ago)
| sharph wrote:
| Great video from Asianometry explaining AI chips' (GPGPUs,
| general purpose GPUs) roots in GPUs (graphics processing units)
| -- how did we get here and what do these chips do?
|
| https://www.youtube.com/watch?v=GuV-HyslPxk
| arroz wrote:
| AI chips are just regular chips that do AI stuff faster.
|
| So, dedicated hardware to do math stuff.
| psychphysic wrote:
| Google's TPU which they sell via Coral is just a systolic array
| of multiply-accumulates arranged in a grid.
|
| Here's a decent overview from the horse's mouth.
| https://cloud.google.com/blog/products/ai-machine-learning/a...
|
| It's called a systolic array because the data moves through it in
| waves, similar to how an engineer imagines the heart pumping
| :)
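|
| To make that concrete, here is a tiny Python sketch (my own toy
| model, not Google's actual design) of an output-stationary grid
| of multiply-accumulate cells computing C = A x B. A real TPU
| pipelines operands through neighbouring cells each clock cycle;
| this only mimics the per-cell MAC behaviour:
|
|     import numpy as np
|
|     def systolic_matmul(A, B):
|         n, k = A.shape
|         k2, m = B.shape
|         assert k == k2
|         C = np.zeros((n, m))
|         # One "wavefront" per step: every cell (i, j) does a
|         # single multiply-accumulate on the operands flowing by.
|         for step in range(k):
|             for i in range(n):
|                 for j in range(m):
|                     C[i, j] += A[i, step] * B[step, j]
|         return C
|
|     A = np.arange(6, dtype=float).reshape(2, 3)
|     B = np.arange(12, dtype=float).reshape(3, 4)
|     assert np.allclose(systolic_matmul(A, B), A @ B)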
| m3kw9 wrote:
| An AI chip is basically a chip that calculates matrices better
| than general purpose CPUs.
| joyeuse6701 wrote:
| Interestingly, this might be well answered by the LLMs built on
| the technology you're interested in.
| mongol wrote:
| What would be an affordable/ cheap way to get hands on with this
| type of hardware? Right now I have zero knowledge.
| ajb117 wrote:
| I'm pretty sure you can't buy TPUs, but people usually buy GPUs
| instead. If you're building a personal rig these days, you can
| get an Nvidia RTX 3090 for about $720 USD used on eBay, which
| is pretty cheap for 24GB VRAM. There's also the A6000 with 48GB
| VRAM, but that'll cost about $5000 on Amazon. Of course, there
| are newer cards that are faster with more VRAM, like the 4090
| and RTX 6000, but they're also more expensive.
|
| Of course, this is all pretty expensive still. If your models
| are small enough you can get away with even older GPUs with
| less VRAM, like a GTX 1080 Ti. And then of course there are
| services like Google Colab and vast.ai where you can rent a
| TPU or GPU in the cloud.
|
| I'd check out Tim Dettmers' guide for buying GPUs:
| https://timdettmers.com/2023/01/30/which-gpu-for-deep-learni...
| jmrm wrote:
| AFAIK Google Coral is an inexpensive TPU you can buy right
| now: https://coral.ai/products/accelerator/
| kmeisthax wrote:
| The problem is that this is an "inferencing" accelerator -
| i.e. it can only execute pretrained models. You cannot
| train a model on one of these, you need a training
| accelerator. And pretty much all of those are either NVidia
| GPUs or cloud-only offerings.
| worldsayshi wrote:
| Very cool! Although it seems to have no memory to speak of, so
| many use cases like LLMs go away because of that, I guess?
| muxamilian wrote:
| It has 8 MB of memory. It also supports live streaming
| the neural network to the chip, although that is slower
| than when it is cached in the memory.
| wyldfire wrote:
| Qualcomm has an SDK [1] where you can run software on a DSP/NSP
| simulator.
|
| [1] https://developer.qualcomm.com/software/hexagon-dsp-sdk
| fulafel wrote:
| Apple and Google consumer hardware have specialized ML compute
| features.
|
| https://www.tomsguide.com/news/google-pixel-7s-most-critical...
|
| https://www.macobserver.com/tips/deep-dive/what-is-apple-neu...
| lexicality wrote:
| Depending on your definition of affordable, the Windows Dev Kit
| 2023 makes a big deal out of its NPU, but you'll have to deal
| with Windows 11 to access it, unfortunately.
| ttul wrote:
| On top of what others have said here about TPUs and their kin,
| you can make things really scream by taping out an ASIC for a
| specific frozen neural network (i.e. including the weights and
| parameters).
|
| If you never have to change the network - for instance to do
| image segmentation or object recognition - then you can't get any
| more efficient than a custom silicon design that bakes in the
| weights as transistors.
| binarymax wrote:
| I'd start with CUDA, because knowing what a chip does won't click
| until you see how it can be programmed to do massive parallel
| computation and matmul.
|
| I read the first book in this list about 10 years ago, and though
| it's pretty old the concepts are solid.
|
| https://developer.nvidia.com/cuda-books-archive
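|
| To see why matmul maps so well onto thousands of parallel
| threads, note that every output element can be computed
| independently. Here is a rough Python sketch of that
| decomposition (not CUDA itself; the real thing would be a CUDA
| kernel launched over a grid of thread blocks, roughly one
| thread per output element):
|
|     import numpy as np
|
|     def matmul_one_element(A, B, i, j):
|         # What a single GPU thread would compute: one dot
|         # product, independent of every other output element.
|         return float(np.dot(A[i, :], B[:, j]))
|
|     A = np.random.rand(64, 32)
|     B = np.random.rand(32, 48)
|     C = np.empty((64, 48))
|     # On a GPU, all of these iterations run concurrently.
|     for i in range(64):
|         for j in range(48):
|             C[i, j] = matmul_one_element(A, B, i, j)
|     assert np.allclose(C, A @ B)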
| dogma1138 wrote:
| CUDA abstracts most of the parallelism. The magic of CUDA is
| that it gave developers a C/C++ API (or language, if you will)
| that doesn't really require them to think about it: they can
| continue writing their programs the way they did when
| programming for mostly single-core, single-threaded CPUs back
| in the day, and CUDA takes care of the rest.
|
| Even "manual" CUDA optimizations deal more with concurrency and
| data residency than with parallelism, and even those are
| usually limited to following the compute guide for your
| specific hardware and feature set; the driver does the majority
| of the heavy lifting.
| nl wrote:
| There's a lot of information here about chips which are mostly
| built for training neural networks.
|
| It's worth noting there are very widely deployed chips primarily
| built for inference (running the network) especially on mobile
| phones.
|
| Depending on the device and manufacturer sometimes this is
| implemented as part of the CPU itself, but functionally it's the
| same idea.
|
| The Apple Neural Engine is a good example of this. It is
| separate from the GPU, which is also on the same chip as the
| CPU.
|
| Further information is here:
| https://machinelearning.apple.com/research/neural-engine-tra...
|
| The Google Tensor CPU used in the pixel has a similar coprocessor
| called the EdgeTPU.
| sremani wrote:
| I want to latch onto this question a bit -- which company out
| there is primed to bring us a CUDA competitor? AMD has failed,
| so any wise words from the people in the industry?
| nottorp wrote:
| An "AI" chip is marketing. But as other posts say, "linear
| algebra coprocessor" doesn't roll of the tongue as well.
|
| Incidentally there used to be a proper "AI" chip. The original
| perceptron was intended to be implemented in hardware. But
| general purpose chips evolved much faster.
|
| https://en.wikipedia.org/wiki/Perceptron
| zoogeny wrote:
| There is a YouTube channel TechTechPotato [1] that has a podcast
| on AI hardware called "The AI Hardware Show". It's pretty small,
| and it gives you a view of how niche this market is - but if you
| want the 10,000-foot view from young, budding tech journalists,
| then I think this fits the bill.
|
| Some random examples of video titles from the last 6 months of
| the channel:
|
| * A Deep Dive into IBM's New Machine Learning Chip
|
| * Does my PC actually use Machine Learning?
|
| * Intel's Next-Gen 2023 Max CPU and Max GPU
|
| * A Deep Dive into Avant, the new chip from Lattice Semiconductor
| (White Paper Video)
|
| * The AI Hardware Show 2023, Episode 1: TPU, A100, AIU, BR100,
| MI250X
|
| I think the podcaster's background is actually in HPC (High
| Performance Computing), i.e. supercomputers. But that overlaps
| just enough with AI hardware that he saw an opportunity to
| capitalize on the new AI hype.
|
| 1. https://www.youtube.com/c/TechTechPotato
| rramadass wrote:
| Nice, looks like a good starting point to survey the field.
| anon291 wrote:
| I've worked in this space for the past five years. The chips are
| essentially highly parallel processors. There's no unifying
| architecture. You have the graph-based / hpc-simulator chips like
| Cerebras, Graphcore, etc., which are basically a stick-as-many-
| cores-as-possible situation with a high-speed networking fabric.
| You have the 'tensor' cores like Groq where the chip operates as
| a whole and is just well suited for tensor processing
| (parallelizable, high-speed memory, etc).
|
| At the end of the day, it's matrix multiplication acceleration
| mostly, and then IO optimization. Literally most of the
| optimization has nothing to do with compute. We can compute
| faster than we can ingest.
| rramadass wrote:
| >The chips are essentially highly parallel processors
|
| Right. AFAIK we already were doing SIMD, Vector Processing,
| VLIW etc. to speed up parallel processing for numerical
| calculations in AI/ML. What then is the reason for the
| explosion of these _different categories of chips_? Are they
| just glorified ASICs designed for their specific domains or
| FPGAs programmed accordingly? If so what is their architecture
| i.e. what are their functional units and how do they work
| together?
| neom wrote:
| Veritasium did a pretty good video on some of them:
| https://www.youtube.com/watch?v=GVsUOuSjvcg
| atgctg wrote:
| That video is about analog computers
| neom wrote:
| The video is about analog chips for ML/NNs. His profile of
| this company was particularly interesting: https://mythic.ai/
|
| OP asked about chips for AI.
| formerly_proven wrote:
| Do companies pay him or does he do these ads for free?
| cinntaile wrote:
| He gets paid. It says so at the start of the video, but I
| guess that could depend on where you live.
| nologic01 wrote:
| It may help your digging and search if you have in mind what
| those chips really try to do: Accelerate numerical linear algebra
| calculations.
|
| If you are familiar with linear algebra these specialized chips
| literally etch silicon so as to perform vector (and more general
| multi-array or tensor) computations faster than a general purpose
| CPU. They do that by loading and operating a whole set of numbers
| (a chunk of a vector or a matrix) _simultaneously_ (whereas the
| CPU would operate mostly serially - one at a time).
|
| The advantage is (in a nutshell) that you can get a significant
| speedup. How much depends on the problem and how big a chunk you
| can process simultaneously but it can be a significant factor.
|
| There are disadvantages that people ignore in the current AI
| hype:
|
| * The speedup is a one-off gain; Moore's law is equally dead
| for "AI chips" and CPUs.
|
| * The software you need to develop and run is extremely
| specialized and fine-tuned, and it only applies to the above
| linear algebra problems.
|
| * In the past such specialized numerical algebra hardware was the
| domain of HPC (high performance computing). Many a supercomputer
| vendor went bankrupt in the past because the cost versus market
| size was not there.
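|
| To make the "load and operate on a whole chunk at once" point
| concrete, here is a toy Python comparison (illustrative only;
| the actual speedup depends on your hardware and BLAS library):
|
|     import numpy as np, time
|
|     x = np.random.rand(1_000_000)
|     y = np.random.rand(1_000_000)
|
|     # Serial: one element at a time, like a naive CPU loop
|     t0 = time.perf_counter()
|     s = 0.0
|     for i in range(len(x)):
|         s += x[i] * y[i]
|     t1 = time.perf_counter()
|
|     # Vectorized: the whole chunk is handed to optimized code
|     # that operates on many elements simultaneously
|     t2 = time.perf_counter()
|     s_vec = np.dot(x, y)
|     t3 = time.perf_counter()
|
|     print(f"loop: {t1 - t0:.4f}s  vectorized: {t3 - t2:.4f}s")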
| benreesman wrote:
| While a given generation of accelerator can only target model
| architectures that are comparatively proven out, and there's a
| lag time, it's measured in years not decades.
|
| I remember when NVIDIA didn't have hardware for ReLU.
|
| The fact of the matter on Moore's law is that we've got
| transistors to burn, but _not_ TDP, and have for years. These
| stupid big L3 caches are just: "fuck it, I've got die to burn".
|
| This is an old story, things migrate in and out of the "CPU",
| but the current outlook is that we'll be targeting specialized
| hardware more rather than less for the foreseeable future.
| nologic01 wrote:
| > the current outlook is that we'll be targeting specialized
| hardware more rather than less for the foreseeable future.
|
| I think there are some important question marks still
| unresolved that bear on how things will play out. E.g. how
| the training versus inference balance will land in terms of
| usage and economics.
|
| Inference is inherently more "mass market". You need it
| locally without lags from moving data around. But inference
| is just numerical linear algebra. Ultimately augmenting the
| CPU to provide inference natively might be the optimal
| arrangement.
| JoeAltmaier wrote:
| There have been some really strange instruction sets
| conceived. As a student at Stanford we had a time-share
| system that was home-grown (as I remember). It had opcodes to
| reverse bits in a bitstring! And odder things. Somebody
| needed that for some research project I guess. And then
| repurposed the damn thing for timeshare.
|
| It was pretty sad timeshare as I recall. The only machine(s?)
| available to a population of what? 20,000? And about 40
| terminals total. You had to sign up for 15-minute measured
| timeslots.
|
| I came from a state school that had over 1000 terminals on a
| dozen machines, all unlimited time to students. It was a big
| shock to find the star of Silicon Valley had such crappy
| student services.
| anthomtb wrote:
| > It had opcodes to reverse bits in a bitstring
|
| Bit reversal is used in Fast Fourier Transforms. It's not
| entirely surprising to me that you'd have specialized
| hardware for that operation.
|
| Ref: https://en.wikipedia.org/wiki/Bit-
| reversal_permutation#Appli...
| JoeAltmaier wrote:
| Also to count ones in a bitstring. And so on.
| danieldk wrote:
| _I remember when NVIDIA didn't have hardware for ReLU._
|
| Could you elaborate? ReLU is _max(x,0)_ and CUDA had _fmaxf_
| since CUDA 1.0 (2007).
| vgatherps wrote:
| Not sure if or what the change was, but fmax(x, 0) only
| requires checking the sign bit instead of doing a full
| floating point comparison (putting aside NaN handling).
|
| A hypothetical relu instruction could probably get away
| with much less power and die space?
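|
| In Python terms, the two formulations look like this (just an
| illustration of the idea, not of any actual hardware path):
|
|     import numpy as np
|
|     x = np.random.randn(8).astype(np.float32)
|
|     # ReLU as a max against zero (what fmaxf(x, 0) expresses)
|     relu_max = np.maximum(x, 0.0)
|
|     # ReLU from the sign bit alone: keep the value if the sign
|     # bit is clear, output zero otherwise (NaNs aside)
|     relu_sign = np.where(np.signbit(x), 0.0, x)
|
|     assert np.array_equal(relu_max, relu_sign)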
| benreesman wrote:
| I could easily be wrong, but IIUC the software/hardware
| stack was, as of 2014 or so, fusing down to a specific set
| of circuits for broadcast activation functions under the
| broader banner of "Tensor Cores" (along with a bunch of
| devilish FMA hot-pathing under the sheets).
|
| This is a fairly random press piece off Google but there
| are a ton of them. Hard to tell whether it's the hardware,
| the blob, or the process node unless you work there.
|
| https://developer.nvidia.com/blog/accelerating-relu-and-
| gelu...
| juujian wrote:
| Good explanation. That also gives you an idea why GPUs are
| decent at accelerating computations for neural networks (think
| CUDA) -- they are already optimized for doing many small
| computations in parallel, albeit with slower processors, and
| they have a lot of dedicated RAM (VRAM).
| VeninVidiaVicii wrote:
| Is this why my M1 MacBook Air can run the same R code at least
| 10x faster than my giant Linux tower?
| danieldk wrote:
| Apple Silicon Macs have special matrix multiplication units
| (AMX) that can do matrix multiplication fast and with low
| energy requirements [1]. These AMX units can often beat
| matrix multiplication on AMD/Intel CPUs (especially those
| without a very large number of cores). Since a lot of linear
| algebra code uses matrix multiplication, and using the AMX
| units is only a matter of linking against Accelerate (for its
| BLAS interface), a lot of software that uses BLAS is faster on
| Apple Silicon Macs.
|
| That said, the GPUs in your M1 Mac are faster than the AMX
| units and any reasonably modern NVIDIA GPU will wipe the
| floor with the AMX units or Apple Silicon GPUs in raw
| compute. However, a lot of software does not use CUDA by
| default and for small problem sets AMX units or CPUs with
| just AVX can be faster because they don't incur the cost of
| data transfers from main memory to GPU memory and vice versa.
|
| [1] Benchmarks:
|
| https://github.com/danieldk/gemm-benchmark#example-results
|
| https://explosion.ai/blog/metal-performance-shaders (scroll
| down a bit for AMX and MPS numbers)
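|
| A quick way to see this from Python: NumPy's matmul dispatches
| to whatever BLAS it was built against (Accelerate, and thus the
| AMX units, if it was linked against it on a Mac; OpenBLAS/MKL
| otherwise), which is why the same one-liner can differ so much
| across machines. A rough sketch:
|
|     import numpy as np, time
|
|     n = 2048
|     A = np.random.rand(n, n).astype(np.float32)
|     B = np.random.rand(n, n).astype(np.float32)
|
|     t0 = time.perf_counter()
|     C = A @ B                      # runs on the linked BLAS
|     dt = time.perf_counter() - t0
|
|     # ~2*n^3 floating point operations for an n x n matmul
|     print(f"{2 * n**3 / dt / 1e9:.1f} GFLOP/s")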
| stephencanon wrote:
| > That said, the GPUs in your M1 Mac are faster than the
| AMX units
|
| Not for double, which is what R mostly uses IIRC.
| danieldk wrote:
| Ah, thanks for the correction! I never use R, so I
| assumed that it uses/supports single-precision floating
| point.
| amelius wrote:
| > Accelerate numerical linear algebra calculations.
|
| Like compute eigenvalues/eigenvectors of large matrices,
| compute SVDs, solve large sparse systems of equations, etc?
| Q6T46nT668w6i3m wrote:
| Nothing that fancy. Usually matrix-matrix and matrix-scalar
| multiplication.
| nologic01 wrote:
| This is indeed the bread-and-butter, but there is use of
| all sorts of standard linear algebra algorithms. You can
| check various xla-related (accelerated linear algebra)
| folders in tensorflow or torch folders in pytorch to see
| the list of what is used [1],[2]
|
| [1] https://github.com/tensorflow/tensorflow/tree/8d9b35f44
| 2045b...
|
| [2] https://github.com/pytorch/pytorch/blob/6e3e3dd477e0fb9
| 768ee...
| [deleted]
| sebzim4500 wrote:
| >* The speedup in a one-off gain, the death of Moore's law is
| equally dead for "AI chips" and CPU's
|
| This does not seem to be true in reality. The H100 is about
| 2.5x 'better' for AI than the A100 (obviously depends exactly
| what you are doing), and they released about 2 years apart.
| That is roughly in line with Moore's law.
| _nalply wrote:
| The difference seems to be: more units for parallel
| calculation, but the speed of calculation in itself doesn't
| double anymore. In other words: Moore's law has stopped for
| raw speed and perhaps some other areas, but is still alive in
| others. This has some weird consequences: some models
| can't be processed in smaller chips (because swapping in
| parts is too slow to be useful), but after a threshold is
| crossed, suddenly the large models run efficiently.
|
| Probably we will see usage of large AI models in smaller
| devices the next years because there's another way to
| optimize: use more efficient representations of the model
| weights. I think about posits, a different floating point
| system where even 6 bits are perhaps usable. When models can
| switch to 6 bit posits from f16 (half floats), hardware can
| load more than three times larger models. We will see whether
| hardware for this will be mass-produced.
| daveguy wrote:
| Moore's law was never about raw speed. It is about
| transistors per unit area and it is still very much active.
| That transistor count is going into cache and core count,
| but it can just as easily go into specialized linear
| algebra units, which is what AI chips do.
| raincole wrote:
| Why from 16 to 6 is "more than three times larger"? Why not
| 16/6=2.667 times?
| rerdavies wrote:
| > Accelerate numerical linear algebra calculations.
|
| Technically, for AI, you need to accelerate numerical non-
| linear algebra calculations, which take the general form of
| matrix multiplication, but interpose non-linear functions at
| key points in the calculation.
| ra1231963 wrote:
| the calculations within a neural network involve both linear
| and non-linear operations, but the fundamental mathematical
| framework underlying AI remains rooted in linear algebra
| bhu1st wrote:
| > If you are familiar with linear algebra
|
| Could you please describe some common day-to-day applications
| of linear algebra in computing?
| juujian wrote:
| I have really lost touch with common day-to-day life at this
| point, but Excel would be an excellent example. If you have a
| column with a million data points and do some basic
| calculation -- subtract or multiply another column -- whether
| or not the software you use can translate that into a
| vectorized operation under the hood (using linear algebra)
| makes a significant difference to how fast it runs.
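|
| In code, the column version of that looks like this (a toy
| sketch; a spreadsheet engine would do something equivalent
| internally if it vectorizes the formula):
|
|     import numpy as np
|
|     col_a = np.random.rand(1_000_000)   # a column of a million values
|     col_b = np.random.rand(1_000_000)
|
|     # One vectorized operation over the whole column, instead
|     # of a million scalar subtractions done one by one
|     result = col_a - col_b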
| dreamcompiler wrote:
| Nvidia exists because linear algebra is essential for 3D
| graphics. They got a second boost from crypto-currency, which
| isn't strictly about linear algebra but it can put those
| computation units to good use. Now Nvidia is riding high
| again because neural nets are all about linear algebra, and
| LLMs are big neural nets.
| visarga wrote:
| I would also add scientific simulations to the list of
| tasks that GPUs are used for. They parallelise well on many
| cores.
| JamesLeonis wrote:
| Linear Algebra is like a lot of math in CS; you don't
| necessarily see it initially, but once you get some
| familiarity you start seeing it everywhere.
|
| Others have commented on computer graphics, but (as it turns
| out) the exact same algorithms apply for Collision Detection
| in 3D space. And since games already are manipulating
| graphics, you add on _another_ set of Linear Algebra
| transforms that change the position/rotation/shear of those
| vertices. In a similar way, the sciences (especially physics) use
| linear algebra to build simulations of all kinds of systems.
|
| One surprising use is in Advertising (and other user
| preference aggregators). Turns out a _preference_ acts like a
| magnitude of a one-dimensional vector. String _N_ preference
| vectors together and you get an _N-Dimension Vector_ that you
| can perform Linear Algebra operations upon. One common
| application is the _Dot Product_, which is a fancy way of
| taking two N-Dimension vectors and measuring how close those
| vectors point in the same direction in a [1, -1] range.
|
| Yet another common place to find Linear Algebra is in
| computer science papers. _Most of the time_ this is simply
| notation; a lot of common programming forms can be
| represented by MxN matrices. However some of those algorithms
| will use LA as a way of parsing and breaking down a problem.
| You will see this in compiler papers often, but its
| transferable to many other domains.
|
| As a final and personal observation, I found that Linear
| Algebra helped me grasp Functional Programming. In both cases
| I am applying a transform to some input, and often stringing
| together many such transforms. Also in both cases, the
| transformations are sensitive to their order, and a bad
| ordering produces nonsense just like garbage data.
| visarga wrote:
| > One common application is the Dot Product, which is a
| fancy way of taking two N-Dimension vectors and measuring
| how close those vectors point in the same direction in a
| [1, -1] range.
|
| That's cosine similarity, or normalised dot product. The
| dot product can take any value when the vectors are not
| unit norm.
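|
| For the record, the two quantities differ like this (a toy
| Python example):
|
|     import numpy as np
|
|     a = np.array([3.0, 4.0])
|     b = np.array([6.0, 8.0])
|
|     dot = np.dot(a, b)          # 50.0, unbounded in general
|     cos = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # 1.0
|     print(dot, cos)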
| wongarsu wrote:
| The obvious fields are computer graphics (all kinds: 2d, 3d
| rasterized and 3d raytraced are all heavy on linear algebra,
| though 3d rasterized is the easiest to speed up with the
| lockstep SIMD architectures we call GPUs) and neural
| networks.
|
| Computer graphics mostly because you can view the real world
| as 3d space and the screen as 2d space, and linear algebra
| gives you all the tools to manipulate something in 3d space
| and project it into 2d space. Neural networks because you can
| treat them as matrix multiplications.
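|
| A minimal example of the "project 3d into 2d" part (a simple
| perspective divide after a matrix transform; real graphics
| pipelines use 4x4 homogeneous matrices, so this is only a
| sketch of the idea):
|
|     import numpy as np
|
|     point = np.array([2.0, 1.0, 5.0])   # x, y, z in camera space
|     focal = 1.0
|     proj = np.array([[focal, 0.0, 0.0],
|                      [0.0, focal, 0.0]])
|
|     # Linear part: a 2x3 matrix maps 3d onto the image plane
|     # axes, then dividing by depth gives the perspective effect.
|     xy = proj @ point / point[2]
|     print(xy)                            # [0.4 0.2]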
| tomek32 wrote:
| You might find this talk interesting, The AI Chip Revolution with
| Andrew Feldman of Cerebras, https://youtu.be/JjQkNzzPm_Q
|
| It's the founder of a new AI chip company, and they talk a bit
| about the differences.
| fragmede wrote:
| Starting from https://cloud.google.com/tpu/docs/system-
| architecture-tpu-vm what are you looking for?
| rramadass wrote:
| Yes, this is what I am looking for.
|
| _System Architecture_ of these chips with detailed Functional
| Units and how they are used by the _AI algorithm
| Instruction/Data streams_.
| ftxbro wrote:
| This isn't a technical deep dive, but here's a simplified
| explanation.
|
| It's a matrix multiplication
| (https://en.wikipedia.org/wiki/Matrix_multiplication) accelerator
| chip. Matrix multiplication is important for a few reasons. 1)
| it's the slow part of a lot of AI algorithms, 2) it's a 'high
| intensity' algorithm (naively n^3 computation vs. n^2 data), 3)
| it's easily parallelizable, and 4) it's conceptually and
| computationally simple.
|
| The situation is kind of analogous to FPUs (floating point math
| co-processor units) when they were first introduced before they
| were integrated into computers.
|
| There's more to it than this, but that's the basic idea.
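|
| The "high intensity" point in concrete numbers (my own back-of-
| the-envelope, square matrices, counting a multiply-add as 2
| ops):
|
|     n = 4096
|     flops = 2 * n**3             # ~137 GFLOP of compute
|     bytes_moved = 3 * n**2 * 4   # A, B, C in float32: ~201 MB
|     print(flops / bytes_moved)   # ~683 FLOPs per byte fetched
|
| So the bigger the matrices, the more arithmetic you get per
| byte moved, which is exactly what a wide parallel chip with
| limited memory bandwidth wants.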
| neximo64 wrote:
| Short story: CPUs can do calculations, but they do them one at
| a time. Think of something like 1+1 = 2. If you had 1 million
| equations like these, a CPU will generally do them one at a
| go, i.e. the first one, then the second, etc.
|
| GPUs were optimised to draw, so they are able to do dozens of
| these at a go. So they can be used for AI/ML in both gradient
| descent and inference (forward passes). Because you can do
| many at a go, in parallel, they speed things up dramatically.
| Geoff Hinton experimented with GPUs, exploiting their ability
| to do this, but they aren't actually optimised for it. It just
| turned out to be the best way available at the time, and still
| is.
|
| AI chips are optimised to do either inference or gradient
| descent. They are not good at drawing like GPUs are. They are
| optimised for machine learning and for joining other AI chips
| together, so you can have massive networks of chips that
| compute in parallel.
|
| One other class of chips that has not yet shown up is ASICs
| that mimic the transformer architecture for even more speed -
| though the architecture changes too much at the moment for
| that to be useful.
|
| Also, because of the mechanics of scale manufacturing, GPUs
| are currently cheaper per FLOP of compute, as the economies of
| scale are shared with graphical uses. Though with time, if
| there is enough scale, AI chips should end up cheaper.
| strohwueste wrote:
| Do you have any sources for that information? I find it really
| hard to find material on what you describe. Also, do you know
| about the details of producing those ASICs? Are they CMOS or
| flash (in-memory compute)?
| thrtythreeforty wrote:
| All current AI accelerators (that aren't a research project)
| are ordinary CMOS. Google published some papers about TPUv3.
| You should read them if you want to know more about the
| architecture of these kinds of chips.
| imakwana wrote:
| Relevant: Stanford Online course - CS217 Hardware acceleration
| for machine learning
|
| https://online.stanford.edu/courses/cs217-hardware-accelerat...
|
| Course website with lecture notes: https://cs217.stanford.edu/
|
| Reading list: https://cs217.stanford.edu/readings
| rramadass wrote:
| Excellent!
| pulkas wrote:
| It's like what ASICs are for Bitcoin. A new era for AI models.
| visarga wrote:
| Trying to make a list of AI accelerator chip families, anything
| missing?
|
| - GPU (Graphics Processing Unit)
|
| - TPU (Tensor Processing Unit): ASIC designed for TensorFlow
|
| - IPU (Intelligence Processing Unit): Graphcore
|
| - HPU (Habana Processing Unit): Intel Habana Labs' Gaudi and Goya
| AI
|
| - NPU (Neural Processing Unit): Huawei, Samsung, Microsoft
| Brainwave
|
| - VPU (Vision Processing Unit): Intel Movidius
|
| - DPU (Data Processing Unit): NVIDIA data center infrastructure
| processing unit
|
| - Amazon's Inferentia: Amazon's accelerator chip focused on low
| cost
|
| - Cerebras Wafer Scale Engine (WSE)
|
| - SambaNova Systems DataScale
|
| - Groq Tensor Streaming Processor (TSP)
| rramadass wrote:
| Nice, and very much in line with what I am looking for!
|
| If you don't mind, could you add links to authoritative wiki
| pages/whitepapers/articles for each of the above? I think it
| will give us a good starting point to start our study/research
| from.
| HarHarVeryFunny wrote:
| Modern AI/ML is increasingly about neural nets (deep learning),
| whose performance is based on floating point math - mostly matrix
| multiplication and multiply-and-add operations. These neural nets
| are increasingly massive, e.g. GPT-3 has 175 billion parameters,
| meaning that each pass thru the net (each word generated) is
| going to involve in excess of 175B floating point
| multiplications!
|
| When you're multiplying two large matrices together (or other
| similar operations) there are thousands of individual multiply
| operations that need to be performed, and they can be done in
| parallel since these are all independent (one result doesn't
| depend on the other).
|
| So, to train/run these ML/AI models as fast as possible requires
| the ability to perform massive numbers of floating point
| operations in parallel, but a desktop CPU only has a limited
| capacity to do that, since they are designed as general purpose
| devices, not just for math. A modern CPU has multiple "cores"
| (individual processors than can run in parallel), but only a
| small number ~10, and not all of these can do floating point
| since it has specialized FPU units to do that, typically less in
| number than the number of cores.
|
| This is where GPU/TPU/etc "AI/ML" chips come in, and what makes
| them special. They are designed specifically for this job - to do
| massive numbers of floating point multiplications in parallel. A
| GPU of course can run games too, but it turns out the
| requirements for real-time graphics are very similar - a massive
| amount of parallelism. In contrast to the CPU's ~10 cores, GPUs
| have thousands of cores (e.g. NVIDIA GTX 4070 has 5,888) running
| in parallel, and these are all floating-point capable. This
| results in the ability to do huge numbers of floating point
| operations per second (FLOPS), e.g. the GTX 4070 can do 30 TFLOPS
| (Tera-FLOPS) - i.e. 30,000,000,000,000 floating point
| multiplications per second !!
|
| This brings us to the second specialization of these GPU/TPU
| chips - since they can do these ridiculous number of FLOPS, they
| need to be fed data at an equally ridiculous rate to keep them
| busy, so they need massive memory bandwidth - way more than the
| CPU needs to be kept busy. The normal RAM in a desktop computer
| is too slow for this, and is in any case in the wrong place - on
| the motherboard, where it can only be accessed across the PCI bus
| which is again way too slow to keep up. GPU's solve this memory
| speed problem by having a specially designed memory architecture
| and lots of very fast RAM co-located very close to the GPU chip.
| For example, that GTX 4070 has 12GB of RAM and can move data from
| it into its processing cores at a speed (memory bandwidth) of
| 1TB/sec !!
|
| The exact designs of the various chips differ a bit (and a lot is
| proprietary), but they are all designed to provide these two
| capabilities - massive floating point parallelism, and massive
| memory bandwidth to feed it.
|
| If you want to get into this in detail, best place to start would
| be to look into low level CUDA programming for NVIDIAs cards.
| CUDA is the lowest level API that NVIDIA provide to program their
| GPUs.
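|
| A rough back-of-the-envelope using the numbers above (my own
| arithmetic, assuming 2 bytes per parameter as with bfloat16,
| and counting only the weight reads and the multiply-adds):
|
|     params = 175e9                  # GPT-3 parameters
|     flops_per_token = 2 * params    # one multiply-add per parameter
|
|     peak_flops = 30e12              # the quoted 30 TFLOPS
|     bandwidth = 1e12                # the quoted 1 TB/s
|
|     compute_time = flops_per_token / peak_flops   # ~0.012 s
|     memory_time = params * 2 / bandwidth          # ~0.35 s
|     print(compute_time, memory_time)
|
| Even setting aside that 175B parameters won't fit in 12GB, the
| memory side dominates, which is why these chips are designed
| around bandwidth as much as raw FLOPS.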
| unnouinceput wrote:
| A few finer points:
|
| 1 - It's RTX 4070, not GTX 4070
|
| 2 - the 30 TFLOPS you mention are at the very top when
| overclocked, they go for 22 normally.
|
| 3 - Also, those are single precision TFLOPS, as in 32 bit. What
| really matters nowadays is double precision. And in double
| precision a 4070 is 0.35 TFLOPS (or 350 GFLOPS). 2 orders of
| magnitude lower, still impressive though.
| HarHarVeryFunny wrote:
| For neural nets it's actually the opposite - half-precision
| bfloat16 is enough. You need large range, but not much
| accuracy.
|
| Yes, the exact numbers are going to vary, but just giving a
| data point to indicate the magnitude of the numbers. If you
| want to quibble there's CPU SIMD too.
| unnouinceput wrote:
| For gaming, double precision does matter. And we were
| talking about a certain GPU, which is used for gaming, not
| AI. Hence why AI chips exist in the first place -
| dedicated hardware for dedicated tasks (or ASIC for short).
| HarHarVeryFunny wrote:
| The NVIDIA cards are all dual-use for gaming and
| compute/ML. Some features like the RTX 4070's Tensor
| Cores (incl. bfloat16) are there primarily for ML, and
| other features like ray tracing are there for gaming.
| fulafel wrote:
| Here's one from Google (paper link at the end):
| https://cloud.google.com/blog/topics/systems/tpu-v4-enables-...
| phh wrote:
| There are very different architectures in the wild. Some are
| simply standard GPUs (maybe with additional support for
| bf16/float16) (Rockchip RK1808 has one like that). You give it a
| list of instructions pretty much like a CPU (except massively
| parallel), and it'll execute it. BTW, when I say standard GPU,
| I'm not saying "kinda like a GPU" - it really is literally a
| GPU architecture. Linux mainline support for Amlogic A311D2's
| NPU is 10 lines, and it's declaring a no-output Vivante GPU.
|
| Some are just a hardware pipeline to compute 2d/3d convolutions +
| activation function (Rockchip RK3588 has one like that). You give
| it the memory address + dimensions of the input matrix, the
| memory address + dimensions of the weights, memory address +
| dimensions of the output, which activation function you want
| (there are only like 4 supported), then you tell it RUN, you wait
| for a bit, and you have the result at the output memory address.
|
| (I took Rockchip example to show that even in one microcosm it
| can change a lot)
|
| And then you can imagine any architecture in-between.
|
| AFAIK they all work with some RAM as input and some RAM as
| output, but some may have their own RAM, some may share RAM with
| the system, some might have mixed usage (RK3588 has some SRAM,
| and when you ask it to compute the convolution, you can tell it
| either to write to SRAM or system RAM)
|
| It's possible that there are some components that are borderline
| between ISP (Image Signal Processing) and an NPU, where the input
| is the direct camera stream, but my guess is that they do some
| very small processing on the camera stream, then dump it to RAM,
| then do all the heavy work from RAM to RAM. I think that Pixel
| 4-5 had something like that.
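|
| For intuition, the second kind of block (the fixed-function
| conv + activation pipeline) can be thought of roughly like this
| Python mock-up - a made-up descriptor interface as a mental
| model, not Rockchip's actual one:
|
|     import numpy as np
|
|     ACTIVATIONS = {                      # only a handful supported
|         "none": lambda x: x,
|         "relu": lambda x: np.maximum(x, 0.0),
|     }
|
|     def npu_run(descriptor, memory):
|         # "memory" stands in for RAM/SRAM; the descriptor only
|         # carries addresses (keys), shapes and an activation id.
|         x = memory[descriptor["input_addr"]]
|         w = memory[descriptor["weight_addr"]]
|         kh, kw = w.shape
|         oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
|         out = np.zeros((oh, ow))
|         for i in range(oh):              # fixed 2d convolution
|             for j in range(ow):
|                 out[i, j] = np.sum(x[i:i+kh, j:j+kw] * w)
|         memory[descriptor["output_addr"]] = \
|             ACTIVATIONS[descriptor["activation"]](out)
|
|     mem = {"in": np.random.rand(8, 8), "w": np.random.rand(3, 3)}
|     npu_run({"input_addr": "in", "weight_addr": "w",
|              "output_addr": "out", "activation": "relu"}, mem)
|     print(mem["out"].shape)              # (6, 6)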
| grozzle wrote:
| The RK3588 seems like a beast for performance per dollar, I
| have one running desktop Linux for emulators and games, but I
| haven't used the NPU yet, because I haven't taken the time to
| figure out how to get OpenCV to talk to it.
|
| Is Rockchip's stuff "good", as NPUs go? I'm thinking of buying
| another 3588 SoC for my robotics hobby - you seem like you'd
| know if that's a decent idea or not.
| reesul wrote:
| Yeah, its performance vs cost is honestly nuts. 6 TOPS is a
| pretty solid NPU, but I don't know what their software is
| like. Programming those accelerators is often difficult,
| especially if you're a small time customer.
|
| Curious if anyone can weigh in on their SW usability. A quick
| search for their user level tools showed
| examples/documentation in Chinese(?)
| grozzle wrote:
| Yeah, it's the first under-$100 system I've tried (and I'm
| a fan of the genre) that's truly a desktop replacement in
| terms of being a snappy responsive desktop even with lots
| of browser tabs, etc.
|
| I do know they have their own special sauce to talk to the
| NPU. I was discouraged from making the effort myself
| because their special sauce to talk to the VPU has barely
| any ffmpeg support, it p much only uses gstreamer, and I'm
| neither a masochist nor French so that's a non-starter.
| reesul wrote:
| Yeah I'm very surprised to see a single board computer at
| less than $100 for a processor like that. Hard to tell
| what their actual 1ku price is, but if the random Alibaba
| I found for $20 is right, then that price for the overall
| board is absurd.
|
| By VPU are you talking about stuff like ISP, video
| encoder/decoder, or something else?
|
| Among embedded processors I've seen touting vision
| acceleration, gstreamer support is fairly widespread. I
| bit the bullet to learn it because my role requires it.
| Maybe it's Stockholm Syndrome talking, but I've somehow
| grown to like gstreamer. The learning curve was awkward.
| I struggled with documentation and learned more by
| analyzing some examples and trial-and-error.
| grozzle wrote:
| By VPU, I meant the square on the board that eats h.265
| hw_enc and hw_dec and things like that, yes. I've been
| told it's a separate square from the GPU, by someone who
| is full-time waist-deep on getting Arch running on that
| hardware, so I take it as fact.
|
| Oh no. The prospect of gstreamer being the only way... Oh
| no.
|
| Maybe there's a zsh plugin or smth that autocompletes
| sane defaults? AAAAAAAAHHHHHH there surely isn't, anyone
| merciful enough to make one would just use ffmpeg
| instead...
| elseless wrote:
| H.T. Kung's 1982 paper on systolic arrays is the genesis of what
| are now called TPUs:
| http://www.eecs.harvard.edu/~htk/publication/1982-kung-why-s...
| sethgoodluck wrote:
| Remember crypto miners (ASICs)? Exact same thing, but built for
| the math around AI work instead of blockchain work.
___________________________________________________________________
(page generated 2023-05-27 23:01 UTC)