[HN Gopher] Tenstorrent unveils Grayskull, its RISC-V answer to ...
___________________________________________________________________
Tenstorrent unveils Grayskull, its RISC-V answer to GPUs
Author : Brajeshwar
Score : 212 points
Date : 2024-03-10 13:15 UTC (9 hours ago)
(HTM) web link (www.techradar.com)
(TXT) w3m dump (www.techradar.com)
| curious_cat_163 wrote:
| > "...including BERT for natural language processing tasks,
| ResNet for image recognition, Whisper for speech recognition and
| translation, YOLOv5 for real-time object detection, and U-Net for
| image segmentation."
|
| I wonder why they are starting with these models. One could
| speculate that they are going for power efficiency, but that
| does not quite add up.
| camel-cdr wrote:
| From what I've gathered from the Tenstorrent discord:
|
| The Grayskull cards only have 8 GB of RAM and don't have
| enough memory-to-NoC bandwidth to make working with multiple
| cards practical. The next generations, that is Wormhole and
| newer, don't have this limitation and are specifically designed
| to work in server racks, see the galaxy system. [0]
|
| Grayskull is really only a devboard. It has some other
| quirks that will be improved by Wormhole, like native 19-bit
| floats in its SIMT engine, where Wormhole uses 32-bit.
|
| [0] https://tenstorrent.com/systems/galaxy/
| hhh wrote:
| YOLOv5 is pretty widely used, and it can be useful to get
| efficient compute at the edge.
|
| Disclosure: I work for a Tenstorrent customer.
| VHRanger wrote:
| > I wonder why they are starting with these models.
|
| They were building Grayskull in 2020/2021 initially. BERT_Large
| and similar were the SOTA at that time.
| transitionnel wrote:
| Like Intel SSE for media optimization probably?
|
| Those models do seem to be among the best currently available
| (IMO), so yeah, power efficiency for guaranteed common use
| cases should be a day-0 integration.
| transitionnel wrote:
| Which reminds me... Hope those libraries are open and
| vetted...
| binarymax wrote:
| Next-gen BERT models are still very popular for embeddings.
| usrusr wrote:
| Highly speculative, and I suspect that I don't do these models
| justice:
|
| The purpose of these boards seems to be to get people
| acquainted with their _programming model_. Not as in using the
| board and a model as a
| turnkey solution (if that happens, fine, but this is not the
| goal), but as in getting potential customers for future boards
| to learn how to make their own models or third party models run
| on the board. The more models they supported out of the box,
| the less well the goal of building buyer-side expertise in the
| programming model would be served.
| PhilippGille wrote:
| Dev kits:
|
| - Grayskull e75 | drawing 75W | 96 Tensix cores | 1 GHz clock |
| 96 MB SRAM | 8GB LPDDR4 @ 102.4 GB/s | $599
|
| - Grayskull e150 | drawing 200W | 120 Tensix cores | 1.2 GHz
| clock | 120 MB SRAM | 8GB LPDDR4 @ 118.4 GB/s | $799
|
| It will be interesting to see their inference performance,
| compared to graphics cards. Will they be interesting for home
| labs?
|
| I found one interview with unboxing a preview version (if I
| understood correctly), with some background info, but no
| performance numbers:
| https://morethanmoore.substack.com/p/unboxing-the-tenstorren...
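|
| As a rough sanity check on home-lab inference (my own back-of-
| the-envelope numbers, nothing measured): batch-1 LLM decoding is
| usually memory-bandwidth bound, so tokens/s is roughly bandwidth
| divided by the bytes of weights read per token.
|
|   # Bandwidth-bound decode estimate; model size, quantization
|   # and the 4090 bandwidth figure are illustrative assumptions.
|   def tokens_per_second(params_billions, bytes_per_weight, bw_gb_s):
|       """Upper bound for batch-1 decode; ignores KV cache/compute."""
|       weight_bytes = params_billions * 1e9 * bytes_per_weight
|       return bw_gb_s * 1e9 / weight_bytes
|
|   # Hypothetical 7B model with 4-bit weights (~3.5 GB of weights):
|   print(tokens_per_second(7, 0.5, 118.4))   # e150: ~34 tok/s best case
|   print(tokens_per_second(7, 0.5, 1008))    # 4090-class: ~288 tok/s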
| christkv wrote:
| Memory seems very low for inference on bigger models? Memory
| bandwidth is also massively lower than something like a 4090 or
| 7900 XTX? Am I missing some fairy-dust magic here?
| ChainOfFools wrote:
| What's the memory interface like? Is it an extreme NUMA thing
| with several orders of magnitude greater potential for
| parallelization?
|
| I'm probably dreaming; new CPU core designs are one thing, a
| completely new memory design is likely high fantasy.
| christkv wrote:
| I mean it says LPDDR4, nothing about how many channels, but
| I would be surprised if it's anything more than dual-channel.
| Big cache, but nothing not already seen in the AMD X3D chips.
| ac29 wrote:
| The quoted ~100-120 GB/s is going to be quad-channel LPDDR4.
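|
| The arithmetic behind that guess (my reconstruction, not from a
| spec sheet): peak bandwidth = data rate x bus width, so 102.4
| GB/s at LPDDR4-3200 implies a 256-bit bus in total, however it
| is sliced into channels.
|
|   # Bus-width check; assumes LPDDR4-3200, i.e. 3.2 GT/s per pin.
|   print(102.4 / 3.2 * 8)    # 256.0 -> a 256-bit bus on the e75
|   print(118.4 / (256 / 8))  # 3.7 -> e150 likely clocks DRAM higher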
| onlyrealcuzzo wrote:
| Will graphics cards eventually get renamed?
|
| Are the optimizations for inference basically identical to
| graphics?
|
| I imagine the workloads are different, and that inference has
| been the main market for years now...
| yencabulator wrote:
| The new things are already getting renamed, but the dust
| hasn't settled yet.
|
| NPU: https://en.wikipedia.org/wiki/AI_accelerator
|
| TPU: https://en.wikipedia.org/wiki/Tensor_Processing_Unit
| adgjlsfhk1 wrote:
| why bother renaming? just re-acronym to "greatly parallel
| unit"
| algo_trader wrote:
| The website indicates each Tensix core is ~1 TOPS theoretical.
| So not amazing.
|
| It will have to compete on price/memory in the crowded
| inference space.
| christkv wrote:
| Maybe power usage ?
| andy99 wrote:
| This is purely a developer kit, letting people get used to
| the hardware configuration and the software stacks before the
| bigger hardware comes in future generations.
|
| I'd like to know what different precisions/quantizations, if
| any, are supported. For LLMs, 8GB is fine for playing around if
| weights are quantized. And as the article mentions, it's more
| than enough for lots of computer vision models.
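|
| For a sense of scale (illustrative numbers, not vendor figures):
| weight memory is roughly parameter count times bytes per weight,
| so 8GB fits a 7B model at 4-8 bits but not at fp16.
|
|   # Rough weight footprint for a hypothetical 7B-parameter model.
|   params = 7e9
|   for name, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
|       print(name, round(params * bits / 8 / 2**30, 1), "GiB")
|   # fp16 ~13 GiB (too big), int8 ~6.5 GiB, 4-bit ~3.3 GiB,
|   # plus KV cache and activations on top.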
| camel-cdr wrote:
| Grayskull supports FP8, FP16, BFP16, VFP2, BFP4, BFP8 [0],
| and some sort of 19-bit floating point in the SFPU. [1]
|
| I don't know how and with what performance those are
| supported.
|
| FP4 has 221 TFLOPS on the Grayskull e75, and 332 TFLOPS on
| the Grayskull e150. [0]
|
| Wormhole, meanwhile, has 82 INT8 TOPS per card. [2]
|
| I haven't found any other numbers.
|
| Edit: Some more info about data formats:
| https://docs.tenstorrent.com/tenstorrent/v/tt-
| buda/dataforma...
|
| [0] https://docs.tenstorrent.com/tenstorrent/add-in-boards-
| and-c...
|
| [1] https://tenstorrent-metal.github.io/tt-
| metal/latest/tt_metal...
|
| [2] https://tenstorrent.com/systems/galaxy/
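|
| For anyone unfamiliar, the BFP formats are block floating point:
| a group of mantissas shares a single exponent. A toy NumPy
| sketch of the idea (my simplification; the real tiling, block
| size and rounding will differ):
|
|   import numpy as np
|
|   def bfp_quantize(x, mantissa_bits=7, block=16):
|       """Toy block float: one shared exponent per block."""
|       x = x.reshape(-1, block)
|       m = np.abs(x).max(axis=1, keepdims=True) + 1e-30
|       exp = np.ceil(np.log2(m))
|       scale = 2.0 ** (exp - mantissa_bits)   # LSB size per block
|       mant = np.round(x / scale)
|       mant = np.clip(mant, -2**mantissa_bits, 2**mantissa_bits - 1)
|       return (mant * scale).reshape(-1)      # dequantized values
|
|   w = np.random.randn(1024).astype(np.float32)
|   print(np.abs(w - bfp_quantize(w)).max())   # quantization error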
| andy99 wrote:
| Cool, thank you!
| dannyw wrote:
| Why are they so skimpy on memory capacity? I guess this is a
| dev kit for anything but LLMs?
| aleph_minus_one wrote:
| These specifications look more like those of an AI-acceleration
| IC than of a GPU (recall that "GPU" stands for _Graphics_
| Processing Unit).
| KeplerBoy wrote:
| Sure, no one actually cares about the graphics part anymore
| in these contexts. Many Nvidia Datacenter GPUs are also no
| longer GPUs in that sense.
| mise_en_place wrote:
| Yep just need to load the model. It can and probably should
| be headless.
| bee_rider wrote:
| I know it isn't what you mean, but as an aside (rather
| than a contradiction), the gaming market is still much
| bigger than the ML model training one, right? So presumably
| people care about the graphics part in the sense that
| development there funds the development of the cards, haha.
|
| I always assumed that was what killed Xeon Phi. Hard to
| justify a GEMM card in and of itself. However clever you
| get, NVIDIA will be back next year with twice as much
| bandwidth.
| KeplerBoy wrote:
| No, datacenter GPU revenue surpassed gaming revenue
| some time ago, at least for Nvidia, and I fully expect
| this trend to continue.
| SonOfLilit wrote:
| According to my googling, data center has been bigger
| than gaming for Nvidia since 2022, and almost 2x in Q2
| 2023 following a crazy nosedive in gaming sales.
| bee_rider wrote:
| Woah, that's really something
| gleenn wrote:
| Many of them don't even have video out ports, they are
| purely for the compute power.
| fyrn_ wrote:
| They are "inferrence only" according to the article
| camel-cdr wrote:
| grayskull is "inference only", but wormhole will also
| "support" inference.
|
| I put those in quotes, because theoretically both can do
| training, it's just that grayskull isn't well suited for
| it, because of the internal fp19 format, and no good
| support for scaling to multiple cards. wormhole will have
| internal fp32, and is designed to scale out with across
| multiple cards, and servers.
|
| At least that's how I understood it.
| buildbot wrote:
| FP19 is perfectly fine for training: Llama is trained in
| fp16, many train in bf16, and with microscaling exponents
| you can do 6-bit training:
| https://arxiv.org/abs/2310.10537
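|
| For context, bf16 training is a one-liner in mainstream
| frameworks. A minimal PyTorch sketch (generic, nothing
| Tenstorrent-specific; the microscaling formats from the linked
| paper would need custom kernels):
|
|   import torch
|
|   model = torch.nn.Linear(4096, 4096, device="cuda")
|   opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
|   x = torch.randn(8, 4096, device="cuda")
|
|   # bf16 keeps fp32's exponent range, so unlike fp16 no loss
|   # scaling is needed; master weights/grads stay in fp32.
|   with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
|       loss = model(x).pow(2).mean()
|   loss.backward()
|   opt.step()
|   opt.zero_grad()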
| smallnamespace wrote:
| I think the broader context here is that it's "fine for
| training" in the sense that you can successfully train a
| model, but it's not "fine for training" in the sense that
| it can only train small models due to the lack of scaling
| across cards, which directly cuts against where ML has
| been trending over the last several years.
|
| In LLM-land we've rapidly gone from training bespoke
| models to doing fine tuning to RLHF to zero-shot
| prompting. The better the underlying model the more you
| can do without additional training, so hardware that
| fails to scale up to the largest training runs will have
| limited practical utility despite technically supporting
| training.
| buildbot wrote:
| Wormhole does support scale-out though, via its built-in
| networking? And there seem to be links for something on
| top of the board, NVLINK style.
|
| And yes, I know, I've been working on LLM pretraining for
| about 4 years now, since 2020. The number formats
| themselves so far are mostly scale invariant or improve
| with larger scale - you can quantize a larger model and
| see less performance drop than a smaller model.
| RobotToaster wrote:
| Obviously they are graphic _less_ processing units.
| karmakaze wrote:
| GPGPU is a thing.
| aleph_minus_one wrote:
| > GPGPU is a thing.
|
| Indeed.
|
| GPGPU means "using a GPU for things that are General
| Purpose ( _GP_ GPU)". The aforementioned IC is not a GPU
| that is used for general-purpose computations, but a
| specialized chip for AI computations.
| Almondsetat wrote:
| Isn't a GPGPU just a PPU?
| threecheese wrote:
| I wonder, is the dual-use of GPUs a significant engineering
| bottleneck for Nvidia et al? My understanding is that for
| libraries like Tensorflow and PyTorch, much of the complexity
| is supporting the many-to-many mapping of model structures to
| GPU silicon in a way that reduces "compile time", with all
| the tradeoff landmines you might expect there. I would
| imagine that abstracting this at the hardware layer is really
| valuable. Are these non-GPU architectures the future of
| machine learning, and can we expect the big vendors to start
| competing here or is this a business risk for them? (robbing
| Peter to pay Paul, like Google investing in AI that
| cannibalizes their search revenue)
| dannyw wrote:
| I doubt it. Toolkits like TensorRT are specifically
| optimised, and NVIDIA already segregated their HPC vs
| gaming dies (and corresponding die space).
|
| The H100 features a minuscule number of ROPs and TMUs
| (mainly for graphics) and doesn't come with NVENC/DEC,
| etc. DirectX and Vulkan are not supported.
|
| You'll just see more and more tensor cores, optimised CUDA
| cores for bfloat, etc.
| paulmd wrote:
| H100 does have NVDEC and jpeg ingest accelerators
|
| https://www.servethehome.com/wp-
| content/uploads/2023/10/NVID...
|
| Tbh it's mildly surprising they even removed NVENC
| considering the overall size of the chip (in absolute terms
| we are only talking about low-single-digit mm2 savings),
| and H100 is still advertised with features targeting VM
| graphics/visualization... remember they also still put a
| full graphics pipeline with ROPs/TMUs on the chip, just no
| actual display hardware.
| smoldesu wrote:
| The dual-use of GPUs is what's driving Nvidia's demand, for
| the most part. They bet big on more outlandish designs (eg.
| CUDA/Tensor cores) and mixed-precision computation, and did
| a lot of the hardware/software work to get it shipped out
| years ago. There wasn't much interest in CUDA outside niche
| research/industry applications for a while, it's a small
| miracle they kept it around long enough to see the crypto
| and AI booms. Now their bet is paying big dividends, and
| other companies are trying to figure out fast ways to
| leverage raster compute for general-purpose acceleration.
|
| Apple initially developed OpenCL with Khronos as a similar-
| ish analog for GPU acceleration a while ago. There were a
| few partners that invested in it, but it suffered the same
| lack of demand as CUDA and languished for a while. Now
| Apple doesn't support OpenCL, so the purpose of a multi-
| platform acceleration library has kinda been scuttled.
| Nvidia played their cards well, and their competitors are
| going to feel the pain for a while unless they work
| together again.
| tiahura wrote:
| Matrix-Math Coprocessors.
| camel-cdr wrote:
| Since people seem to be wondering how the architecture works, and
| the software stack is open (although I'm not sure to what
| extent), I'll share my understanding of it.
|
| Each card basically consists of a bunch of Tensix cores and
| shared memory:
|
| > Each Tensix core contains a high-density tensor math unit (FPU)
| which performs most of the heavy lifting, a SIMD engine (SFPU),
| five Risc-V CPU cores, and a large local memory storage. [0]
|
| > The cores are connected with two torus-shaped networks, going
| in opposite directions. [0]
|
| The RISC-V cores in the Tensix cores are tiny rv32i cores that
| can control the FPU and SFPU, and are also used to prepare/move
| data.
|
| The FPU does "dense tensor math", so I think it's probably a
| matmul engine, but I don't know any more specifics. [1]
|
| The SFPU is a more general-purpose SIMT engine that can be
| driven from the RISC-V cores.
|
| There is an SFPU simulator you can play around with on their
| GitHub. [2] See the low-level kernels example for how the
| programming model works. [3]
|
| The grayskull SFPU has 4 general purpose LRegs, which each hold
| 64 19-bit values. Wormhole has 8 general purpose LRegs, which
| each hold 32 32-bit values.
|
| I've been told that wormhole SFPU has a ~3x IPC increase over
| grayskull, and a few new SFPU instructions.
|
| You can probably find out more by browsing the docs and rummaging
| through the github repos. [4]
|
| [0] https://docs.tenstorrent.com/tenstorrent/v/tt-buda/hardware
|
| [1] https://docs.tenstorrent.com/tenstorrent/v/tt-
| buda/terminolo...
|
| [2] https://github.com/tenstorrent-
| metal/sfpi/blob/master/tests/...
|
| [3] https://tenstorrent-metal.github.io/tt-
| metal/latest/tt_metal...
|
| [4] https://github.com/tenstorrent-metal/
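|
| To make that division of labor concrete, here is how I picture
| the pipeline as a toy Python sketch (purely illustrative; the
| real low-level kernels in [3] are C++, and every name and the
| 32x32 tile size below are made up):
|
|   from queue import Queue
|   from threading import Thread
|   import numpy as np
|
|   # One "Tensix core": a reader core streams operand tiles into
|   # local memory, the "FPU" multiplies tiles, and a writer core
|   # drains the result.
|   TILE = 32
|   in_q, out_q = Queue(maxsize=4), Queue()
|
|   def reader(a, b, k_tiles):
|       for k in range(k_tiles):
|           sl = slice(k * TILE, (k + 1) * TILE)
|           in_q.put((a[:, sl], b[sl, :]))   # move operand tiles in
|       in_q.put(None)
|
|   def compute():
|       acc = np.zeros((TILE, TILE))
|       while (tiles := in_q.get()) is not None:
|           a_t, b_t = tiles
|           acc += a_t @ b_t                 # the matmul-engine step
|       out_q.put(acc)
|
|   def writer(dst):
|       dst[:] = out_q.get()                 # move the result out
|
|   A, B = np.random.randn(TILE, 128), np.random.randn(128, TILE)
|   C = np.empty((TILE, TILE))
|   parts = [Thread(target=reader, args=(A, B, 128 // TILE)),
|            Thread(target=compute), Thread(target=writer, args=(C,))]
|   for t in parts: t.start()
|   for t in parts: t.join()
|   print(np.allclose(C, A @ B))             # True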
| BirAdam wrote:
| So, modern transputer but RISC-V. Seems neat.
| cmrdporcupine wrote:
| Oh, so am I to read this right that the actual vector work
| being done here is not done via the RISC-V [V]ector extension,
| but by a purely custom core?
| camel-cdr wrote:
| Yes, the RISC-V Vector extension would be absolute overkill
| for an ML accelerator, and would waste a lot of die space.
| Currently the cards need an x86_64 host, but they plan to
| replace that with their Ascalon RISC-V CPU, which supports the
| RISC-V Vector extension (RVV). So most compute is done by the
| accelerator cards, but for some things the host system steps
| in. There RVV can come in handy, because it's a lot more
| flexible.
| margorczynski wrote:
| I wonder, with the recent research coming to light about the
| effectiveness of using 3-state weights, whether they plan to
| release something along those lines?
|
| Can they really compete in FLOP/$ with the likes of Nvidia, even
| if it is a more bespoke architecture than a modified GPU?
| mastax wrote:
| Nvidia has insane margins so it isn't hard* for a well-funded
| player to beat them in FLOPS/$. The hard part is the compiler
| toolchains, the libraries, the documentation, getting models to
| run fast.
|
| * Everything is relative. It's hard to make even a simple
| microcontroller - but also it isn't hard, you know?
| transitionnel wrote:
| "Life in 1 comment"
| dannyw wrote:
| I'd wait till the dust settles. Maybe a 2-bit encoding (-1, 0,
| 0.5, 1) would be easier to design hardware for, as multiplying
| by 0.5 can be a bitwise shift.
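|
| A toy version of what that packing looks like in software (my
| own illustration of the {-1, 0, 0.5, 1} codebook idea, not
| anything a vendor ships):
|
|   import numpy as np
|
|   CODEBOOK = np.array([-1.0, 0.0, 0.5, 1.0])   # 2 bits per weight
|
|   def pack(codes):              # codes: ints in 0..3, len % 4 == 0
|       c = codes.reshape(-1, 4)
|       b = c[:, 0] | (c[:, 1] << 2) | (c[:, 2] << 4) | (c[:, 3] << 6)
|       return b.astype(np.uint8)
|
|   def unpack(packed):           # dequantize via table lookup
|       shifts = [(packed >> s) & 0b11 for s in (0, 2, 4, 6)]
|       return CODEBOOK[np.stack(shifts, axis=1).reshape(-1)]
|
|   codes = np.random.randint(0, 4, size=16)
|   assert np.array_equal(unpack(pack(codes)), CODEBOOK[codes])
|   # In hardware, x0.5 is a right shift of the activation and
|   # -1/0/+1 are negate/skip/add, so no real multiplier is needed.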
| imtringued wrote:
| Nobody cares about FLOPs anymore when it comes to transformer-
| based architectures, and considering the transformer
| architecture is applicable to almost any use case beyond LLMs,
| memory bandwidth and memory capacity are the most important
| metrics. Grayskull has enough 16-bit float performance to outrun
| its memory bandwidth even with 1-bit ternary quantization.
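|
| A rough arithmetic-intensity check makes the point (the 221
| TFLOPS low-precision figure quoted upthread plugged into the
| standard roofline ratio; my estimate, not a measurement):
|
|   peak_flops = 221e12    # e75 quoted low-precision peak, FLOP/s
|   bandwidth = 102.4e9    # e75 DRAM bandwidth, bytes/s
|   print(peak_flops / bandwidth)   # ~2160 FLOPs/byte to be compute-bound
|   # Batch-1 decoding is ~2 FLOPs per weight: ~1 FLOP/byte at fp16,
|   # ~16 FLOPs/byte even at 1-bit weights -- memory-bound either way.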
| ribit wrote:
| I was unable to find any reference to performance or
| architectural details. Memory bandwidth is very low for an ML-
| focused device. Price is extremely high.
|
| What am I missing?
| nabla9 wrote:
| Those are DevKits.
|
| I assume they are sold to hardware manufacturers and maybe
| software developers so that they can check out and test their
| systems against a real processor. The processors are probably
| manufactured in some relatively old process technology.
|
| The real thing will be manufactured using a different 2nm
| process according to the article, and performance will be
| different.
| physPop wrote:
| Outsider to the field and just curious: Does anyone know how
| these kinds of processors compare to the custom silicon by
| aws/google/tesla?
| VHRanger wrote:
| Much more efficient per watt for some specialized types of
| operations.
|
| Obviously immature ecosystem, expect a couple of years at
| minimum before adoption if it's the real deal.
| bschne wrote:
| Been following this for a while b/c Jim Keller, but every time I
| look at the arch [1; as linked by other commenter] as someone who
| doesn't know the first thing about CPU/ASIC design it just looks
| sort of... "wacky"? Does anyone who understands this have a good
| explainer for the rationale behind having a grid of cores with
| memory and IFs interspersed between and then something akin to a
| network interconnecting them with that topology? What is it about
| the target workloads here that makes this a good approach?
|
| 1. https://docs.tenstorrent.com/tenstorrent/v/tt-buda/hardware
| tormeh wrote:
| You want memory to be close to where it's used because at the
| speeds of high-performance ICs, the latency caused by distance
| is actually significant.
| bschne wrote:
| but isn't that aspect common among this, CPUs, GPUs, ...? And
| it feels like the whole NOC thing would add quite some
| overhead to moving things around.
|
| Are you saying proximity here more than offsets this vs. e.g.
| each core having its own cache as I think they do in a
| "normal" CPU? And if so, is this more true of ML inference
| workloads than other workloads, for some reason?
| loeg wrote:
| I think the NOC approach is fundamentally similar to
| Intel's rings that they used for core interconnect back in
| the mid-2010s. It works.
|
| https://www.realworldtech.com/includes/images/articles/snbe
| p...
|
| https://en.wikichip.org/wiki/intel/microarchitectures/sandy
| _...
| paulmd wrote:
| They still use this today, and the ring interconnect is
| also the topology inside zen3/zen4 CCXs (with 8 cores).
| Ring is one of the simplest and best systems for >4
| cores, until you get to about 8-10 cores (at which point
| you generally split it into multiple "tiers" like
| multiple CCXs etc).
| adgjlsfhk1 wrote:
| I think the distinction with ML inference workloads is that
| you often have very little control flow, so this type of
| architecture lets you match layers to adjacent cores so
| that each operation gets its data directly from the step
| before rather than from RAM.
| imtringued wrote:
| I honestly don't understand this latency obsession for LLMs.
| You are loading millions of parameters sequentially for each
| matrix. The access pattern is perfectly predictable. I just
| ran llama.cpp with profiling and 99.9% of the time is
| spent in matrix multiplication. This shocked me, honestly,
| because I genuinely thought that there was going to be much
| more variety.
| transitionnel wrote:
| Definitely looks wacky! It has a nice concept though; I like
| the "Network on Chip" reversed toruses.
|
| Hopefully some of y'all tinkerers with time and dough can bring
| some of these ideas to fruition, keep Nvidia on their toes ;)
|
| 64GB is a good RAM amount IMO, cheap yet still vastly
| underutilized since we play to the LCD of users... guessing
| Linux will be able to make that pivot much faster/so little
| baggage.
|
| Plus..."Grayskull"
| bschne wrote:
| > I like the "Network On Chip" reversed toruses
|
| What about them do you like as a design decision? (genuinely
| curious, as again, I don't understand it)
| binarymax wrote:
| > 64GB is a good RAM amount IMO
|
| It doesn't have 64GB of RAM, it has 8. The system
| requirements separately call for 64GB of host RAM for model
| compilation.
| BirAdam wrote:
| This reminds me very much of transputers. The idea here is that
| each cpu can context switch extremely quickly with minimal
| latency to any resource and you have a topology that is great
| for matrix maths as a result.
| bschne wrote:
| What makes the resulting topology great for matrix math, vs.
| non-matrix math workloads? Naively, if you know you're "only"
| going to multiply matrices, what do you need the flexibility
| and fast context-switching for? Is the end-game here that you
| can lay out the workload s.t. you have a series of closely
| colocated cores carrying out the operations of some linalg
| expression one after the other and the memory for
| intermediate results right in between, or something like
| that?
| cavisne wrote:
| I suspect that's basically it (operations one by one and
| then pipelining to saturate). That's basically what Groq
| does also, AFAIK. From their website it seems the chips are
| designed to be connected together into one big topology,
| the "Galaxy" system. Also similar to TPUs, although those
| use HBM with only a few very powerful "cores" vs DRAM with
| low-powered cores.
| imtringued wrote:
| Is this some kind of trick question?
|
| Your core needs to be fully programmable so you can do
| things like kernel fusion. The simplest form is to load
| quantized weights and dequantize them to bfloat16 as you
| go. Llama.cpp and its GGUF files support various types of
| quantization, and most of them require programmability to
| be supported efficiently.
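|
| A toy illustration of that kind of fusion (NumPy, my own
| simplification of what llama.cpp-style kernels do): dequantize
| each block of weights right before using it, instead of
| materializing the whole bfloat16 matrix.
|
|   import numpy as np
|
|   def fused_dequant_matvec(q, scales, x, block=32):
|       """Int8-block matvec; never builds the full float matrix."""
|       rows, cols = q.shape
|       y = np.zeros(rows, dtype=np.float32)
|       for b in range(cols // block):
|           sl = slice(b * block, (b + 1) * block)
|           w_blk = q[:, sl].astype(np.float32) * scales[:, b:b+1]
|           y += w_blk @ x[sl]       # fused: dequantize, use, discard
|       return y
|
|   rows, cols = 64, 256
|   w = np.random.randn(rows, cols).astype(np.float32)
|   scales = np.abs(w).reshape(rows, -1, 32).max(axis=2) / 127
|   q = np.round(w / np.repeat(scales, 32, axis=1)).astype(np.int8)
|   x = np.random.randn(cols).astype(np.float32)
|   print(np.abs(fused_dequant_matvec(q, scales, x) - w @ x).max())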
| Pet_Ant wrote:
| Sounds like a manycore architecture. If you have played TIS-100
| it is that exact same idea. If you haven't, but have played
| Factorio, think of it as instead of having a central area where
| all the work happens, you build a series of interconnected
| stations, each doing its own part of the calculation before
| passing it on to the next.
|
| Upside is each core has its own code and is fully Turing
| complete and independent of the others. You can handle
| conditionals much better. And you lose the latency of having
| network hops between workers.
|
| Downside is you need to break down your process to map onto
| specific nodes and flows.
|
| (Assuming it is in fact manycore - which is not the same as
| multicore)
| bschne wrote:
| > You can handle conditionals much better
|
| Because you can have a core set up for each branch and just
| pass to that vs. "context-switching" your core to execute the
| branch that ends up being taken?
| vrighter wrote:
| because each core has its own instruction counter
| pavlov wrote:
| Sounds kind of like the IBM/Sony/Toshiba Cell? It made an
| appearance in the PlayStation 3, but was supposed to be a
| more general high-performance architecture. At some point IBM
| sold blade servers with Cell processors.
| amelius wrote:
| Is that a dataflow architecture?
|
| https://en.wikipedia.org/wiki/Dataflow_architecture
| crq-yml wrote:
| It's another iteration of the unit record machine [0] - batch
| processes done with a physical arrangement of the steps in the
| process.
|
| CPU design moved away from this analogy a long while ago
| because the tasks being done with CPUs involved more dynamic
| control flow structures and arbitrary workloads. But workloads
| that are linear batches of brute force math don't need that
| kind of dynamism, so gridded designs become fashionable as a
| way of expressing a configurable pipeline with clear semantics
| - everything on the grid is at a linear, integer-scalable
| distance, buffers will be of the same size, etc.
|
| [0] https://en.m.wikipedia.org/wiki/Unit_record_equipment
| imtringued wrote:
| The real question is how they plan to compete with, say, a
| Ryzen 8700G with 32 GB of overclocked RAM and the Ryzen AI NPU.
| 2x DDR5-6600 gives you more memory bandwidth than Grayskull.
| Their primary advantage appears to be the large SRAM and not
| much else.
| bgnn wrote:
| In general putting memory physically close to compute is good.
| If two cores need to share that memory doesn't it make sense to
| place the memory at the interface?
| adinb wrote:
| Seeing the topology I had a flashback to college and the MasPar
| [1] we were using in '92!
|
| [1] https://en.wikipedia.org/wiki/MasPar
| binarymax wrote:
| The System Requirements on this page say 64GB RAM is a
| prerequisite on the host system. Why? Wouldn't an inference
| server be barebones aside from the Inference hardware?
|
| https://tenstorrent.com/cards/
| camel-cdr wrote:
| From the tenstorrent discord:
|
| > It has more to do with the memory resources required during
| model compilation. The requirement will vary from model to
| model, so 64GB is a safe limit for all.
| VHRanger wrote:
| And why do they specify PCIe 4.0?
|
| I imagine it's just to leverage bandwidth correctly? Or is
| there a necessary feature in there?
| loeg wrote:
| This article is extremely low detail. Does anyone have a source
| with more information?
| pella wrote:
| Grayskull(tm) e150 = Tensix cores: 120 * 5 RISC-V => 600 RISC-V
| CPU cores
|
| ( 1 Tensix core = 5 RISC-V cores :
| https://docs.tenstorrent.com/tenstorrent/v/tt-buda/hardware )
| hollerith wrote:
| I don't think that is as clear as you think it is.
|
| If you had to communicate it orally, what words would you add?
|
| ADDED. I think I figured out what it means: a Grayskull e150
| contains 120 'Tensix cores', each of which contains 5 RISC-V
| cores.
| loeg wrote:
| Do you think this thing will run Linux (re: now removed comment
| about Linux default core limit)?
| pella wrote:
| As I see it, this is: TT-Metalium [1] = a low-level API [1]
|
| "TT-Metalium is a platform for programming heterogeneous
| collection of CPUs (host processors) and Tenstorrent
| acceleration devices, composed of many RISC-V processors.
| Target users of TT-Metal are expert parallel programmers
| wanting to write parallel and efficient code with full access
| to the Tensix hardware via low-level kernel APIs."
|
| https://tenstorrent-metal.github.io/tt-
| metal/latest/tt_metal...
|
| [1]
|
| _" The software stacks come in two varieties - a high level
| and a low level. The high-level is called TT-Buda, using
| higher-level APIs to get things up and running, along with
| interfaces into modern machine learning frameworks. The lower
| level is TT-Metalium, which provides fine-grained control
| over the hardware for custom operators, custom control, and
| even non-machine learning code. Tenstorrent states that there
| are no black boxes, no encrypted APIs, and no hidden
| functions."_ https://morethanmoore.substack.com/p/unboxing-
| the-tenstorren...
| audiofish wrote:
| This looks very similar to the Picochip designs used in a lot of
| small cellular base stations for SDR. I hope it is similar,
| because those were fantastic chips to program for. That influence
| could have come via Intel's acquisition of Picochip.
| bragr wrote:
| The stated architecture reminds me of how the Intel Project
| Larrabee GPU was supposed to work, except with RISC-V instead
| of stripped-down x86 cores.
| usrusr wrote:
| Did Larrabee contain anything that wasn't an x86 core?
| According to other comments, this seems to consist of massive,
| highly specialized processing units with just a few tiny CPUs
| sprinkled in to keep the former fed. Quite the opposite of how
| I remember Larrabee.
| VHRanger wrote:
| Not really, Xeon Phis were basically x86 CPUs abusing AVX
| extensions to their limit.
| shrubble wrote:
| Do you see any similarities with the Xeon Phi boards?
| multiphonics wrote:
| How is the networked multi CPU core Grayskull a different
| approach compared to, say, Ampere's 100+ ARM core chip?
| spintin wrote:
| I bought a 3050 low profile / half length with DDR6 at 14GHz for
| half this price.
|
| And it can play games too.
| wtallis wrote:
| _G_ DDR6, 14 GT/s if we want to be accurate. So about double
| the DRAM bandwidth, but less than 1/20th the SRAM capacity, and
| fairly different architectures on-chip. I'm not sure what ML
| workloads would benefit from the extra SRAM enough to overcome
| the DRAM bandwidth deficit, but they probably exist or
| Tenstorrent wouldn't be making such a SRAM-heavy chip.
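|
| The arithmetic behind "about double" (standard peak-bandwidth
| formula; the 128-bit bus width for the 3050 is my assumption):
|
|   # Peak DRAM bandwidth = data rate (GT/s) x bus width (bytes).
|   rtx3050 = 14 * (128 // 8)   # 14 GT/s GDDR6, 128-bit -> 224 GB/s
|   e150 = 3.7 * (256 // 8)     # matches the quoted 118.4 GB/s
|   print(rtx3050, e150, round(rtx3050 / e150, 1))   # 224 118.4 1.9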
| _0ffh wrote:
| I'm getting sick of this: Groq, Tenstorrent, all the promising
| startups are offering inference-only solutions. I have it from
| official Groq channels that they do not plan to invest
| development time into enabling training, because inference is
| where the big money is. Which I understand, insofar as
| inference demand probably outstrips training demand by
| ~millions, but I still can't help finding all of this so
| egregiously disappointing!
___________________________________________________________________
(page generated 2024-03-10 23:00 UTC)