[HN Gopher] Tenstorrent unveils Grayskull, its RISC-V answer to ...
       ___________________________________________________________________
        
       Tenstorrent unveils Grayskull, its RISC-V answer to GPUs
        
       Author : Brajeshwar
       Score  : 212 points
       Date   : 2024-03-10 13:15 UTC (9 hours ago)
        
 (HTM) web link (www.techradar.com)
 (TXT) w3m dump (www.techradar.com)
        
       | curious_cat_163 wrote:
       | > "...including BERT for natural language processing tasks,
       | ResNet for image recognition, Whisper for speech recognition and
       | translation, YOLOv5 for real-time object detection, and U-Net for
       | image segmentation."
       | 
       | I wonder why they are starting with these models. One could
       | speculate that they are going for power efficiencies but that
       | does not quite add up entirely.
        
         | camel-cdr wrote:
         | From what I've gathered from the Tenstorrent discord:
         | 
          | The grayskull cards only have 8 GB of RAM and don't have
          | enough memory-to-NoC bandwidth to make working with multiple
          | cards practical. The next generations, that is wormhole and
          | newer, don't have these limitations and are specifically
          | designed to work in server racks, see the galaxy system. [0]
          | 
          | Grayskull is really only a devboard. It has some other quirks
          | that will be improved in wormhole, like the native 19-bit
          | floats in its SIMT engines, which become 32-bit in wormhole.
         | 
         | [0] https://tenstorrent.com/systems/galaxy/
        
         | hhh wrote:
         | YOLOv5 is pretty widely used, and it can be useful to get
         | efficient compute at the edge.
         | 
         | Disclosure: I work for a Tenstorrent customer.
        
         | VHRanger wrote:
         | > I wonder why they are starting with these models.
         | 
          | They were initially building Grayskull in 2020/2021.
          | BERT_Large and similar were the SOTA at that time.
        
         | transitionnel wrote:
         | Like Intel SSE for media optimization probably?
         | 
          | Those models do seem to be the best currently available (IMO),
          | so yeah, power efficiency for guaranteed common use cases
          | should be a day-0 integration.
        
           | transitionnel wrote:
           | Which reminds me... Hope those libraries are open and
           | vetted...
        
         | binarymax wrote:
         | Next-gen BERT models are still very popular for embeddings.
        
         | usrusr wrote:
         | Highly speculative, and I suspect that I don't do these models
         | justice:
         | 
          | The purpose of these boards seems to be to get people
          | acquainted with their _programming model_. Not as in using
          | board and model as a
         | turnkey solution (if that happens, fine, but this is not the
         | goal), but as in getting potential customers for future boards
         | to learn how to make their own models or third party models run
         | on the board. The more models they supported out of the box,
         | the less well the goal of building buyer-side expertise in the
         | programming model would be served.
        
       | PhilippGille wrote:
       | Dev kits:
       | 
       | - Grayskull e75 | drawing 75W | 96 Tensix cores | 1 GHz clock |
       | 96 MB SRAM | 8GB LPDDR4 @ 102.4 GB/s | $599
       | 
       | - Grayskull e150 | drawing 200W | 120 Tensix cores | 1.2 GHz
       | clock | 120 MB SRAM | 8GB LPDDR4 @ 118.4 GB/s | $799
       | 
       | It will be interesting to see their inference performance,
       | compared to graphics cards. Will they be interesting for home
       | labs?
       | 
        | I found one interview with an unboxing of a preview version (if
        | I understood correctly), with some background info but no
        | performance numbers:
       | https://morethanmoore.substack.com/p/unboxing-the-tenstorren...
        
         | christkv wrote:
          | Memory seems very low for inference on bigger models? Memory
          | bandwidth is also massively lower than something like a 4090
          | or 7900xtx? Am I missing some fairy dust magic here?
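          | 
          | For a rough sense of scale (taking the e75/e150 numbers from
          | the sibling comment and commonly published figures for the
          | 4090 / 7900 XTX; approximations, not benchmarks):
          | 
          |     cards = {"Grayskull e75": 102.4,   # GB/s
          |              "Grayskull e150": 118.4,
          |              "RTX 4090": 1008.0,       # published figure
          |              "RX 7900 XTX": 960.0}     # published figure
          |     for name, bw in cards.items():
          |         print(name, bw / cards["Grayskull e150"])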
        
           | ChainOfFools wrote:
            | What's the memory interface like? Is it an extreme NUMA
            | thing with several orders of magnitude greater potential
            | for parallelization?
            | 
            | I'm probably dreaming; new CPU core designs are one thing,
            | a completely new memory design is likely high fantasy...
        
             | christkv wrote:
              | I mean it says LPDDR4, nothing about how many channels,
              | but I would be surprised if it's anything more than dual-
              | channel. Big cache, but nothing not already seen in the
              | AMD X3D chips.
        
               | ac29 wrote:
               | The quoted ~100-120GBps is going to be quad-channel
               | LPDDR4
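                | 
                | A quick sanity check of that guess, assuming a 256-bit
                | aggregate bus (the channel layout is my assumption; the
                | data rates are implied by the quoted figures):
                | 
                |     def dram_bw(mt_per_s, bus_bits):
                |         # GB/s = transfers/s * bytes per transfer
                |         return mt_per_s * 1e6 * bus_bits / 8 / 1e9
                | 
                |     print(dram_bw(3200, 256))  # 102.4 -> e75
                |     print(dram_bw(3700, 256))  # 118.4 -> e150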
        
           | onlyrealcuzzo wrote:
           | Will graphics cards eventually get renamed?
           | 
           | Are the optimizations for inference basically identical to
           | graphics?
           | 
           | I imagine the workloads are different, and that inference has
           | been the main market for years now...
        
             | yencabulator wrote:
             | The new things are already getting renamed, but the dust
             | hasn't settled yet.
             | 
             | NPU: https://en.wikipedia.org/wiki/AI_accelerator
             | 
             | TPU: https://en.wikipedia.org/wiki/Tensor_Processing_Unit
        
             | adgjlsfhk1 wrote:
             | why bother renaming? just re-acronym to "greatly parallel
             | unit"
        
           | algo_trader wrote:
            | The website indicates each Tensix core is ~1 TOP
            | theoretical. So not amazing.
           | 
           | It will have to compete on price/memory in the crowded
           | inference space.
        
             | christkv wrote:
             | Maybe power usage ?
        
           | andy99 wrote:
           | This is purely a developer kit, letting people get used to
           | the hardware configuration and the software stacks before the
           | big hardware comes later in future generations.
           | 
           | I'd like to know what different precisions/ quantizations if
           | any are supported. For LLMs 8GB is fine for playing around if
           | weights are quantized. And as the article mentions it's more
           | than enough for lots of computer vision models.
        
             | camel-cdr wrote:
              | Grayskull supports FP8, FP16, BFP16, BFP2, BFP4, BFP8 [0],
             | and some sort of 19-bit floating point in the SFPU. [1]
             | 
             | I don't know how and with what performance those are
             | supported.
             | 
             | FP4 has 221 TFLOPS on the Grayskull e75, and 332 TFLOPS on
             | the Grayskull e150. [0]
             | 
             | While wormhole has 82 INT8 TOPS per card. [2]
             | 
             | I haven't found any other numbers.
             | 
             | Edit: Some more info about data formats:
             | https://docs.tenstorrent.com/tenstorrent/v/tt-
             | buda/dataforma...
             | 
             | [0] https://docs.tenstorrent.com/tenstorrent/add-in-boards-
             | and-c...
             | 
             | [1] https://tenstorrent-metal.github.io/tt-
             | metal/latest/tt_metal...
             | 
             | [2] https://tenstorrent.com/systems/galaxy/
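              | 
              | As I understand the docs, the BFP formats are block
              | floating point: a group of values shares one exponent and
              | each value keeps only a sign+mantissa. A rough storage
              | estimate, assuming 16-value blocks and an 8-bit shared
              | exponent (my reading, treat as a sketch):
              | 
              |     def bfp_bits(payload, block=16, exp=8):
              |         # per-value payload + amortized shared exponent
              |         return payload + exp / block
              | 
              |     print(bfp_bits(8))  # BFP8 -> 8.5 bits/value
              |     print(bfp_bits(4))  # BFP4 -> 4.5
              |     print(bfp_bits(2))  # BFP2 -> 2.5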
        
               | andy99 wrote:
               | Cool, thank you!
        
           | dannyw wrote:
           | Why are they so skimpy on memory capacity? I guess this is a
           | dev kit for anything but LLMs?
        
         | aleph_minus_one wrote:
          | These specifications look more like those of an AI
          | acceleration IC than of a GPU (recall that "GPU" stands for
          | _Graphics_ Processing Unit).
        
           | KeplerBoy wrote:
           | Sure, no one actually cares about the graphics part anymore
           | in these contexts. Many Nvidia Datacenter GPUs are also no
           | longer GPUs in that sense.
        
             | mise_en_place wrote:
             | Yep just need to load the model. It can and probably should
             | be headless.
        
             | bee_rider wrote:
              | I know it isn't what you mean, but as an aside (rather
              | than a contradiction), the gaming market is still much
              | bigger than the ML model training one, right? So
              | presumably people care about the graphics part in the
              | sense that development there funds the development of the
              | cards, haha.
             | 
             | I always assumed that was what killed Xeon Phi. Hard to
             | justify a GEMM card in and of itself. However clever you
             | get, NVIDIA will be back next year with twice as much
             | bandwidth.
        
               | KeplerBoy wrote:
                | No, datacenter GPU revenue surpassed gaming revenue
                | some time ago, at least for Nvidia, and I fully expect
                | this trend to continue.
        
               | SonOfLilit wrote:
               | According to my googling, data center has been bigger
               | than gaming for nvidia since 2022, and almost x2 in Q2
               | 2023 following a crazy nosedive in gaming sales.
        
               | bee_rider wrote:
               | Woah, that's really something
        
             | gleenn wrote:
             | Many of them don't even have video out ports, they are
             | purely for the compute power.
        
           | fyrn_ wrote:
           | They are "inferrence only" according to the article
        
             | camel-cdr wrote:
             | grayskull is "inference only", but wormhole will also
             | "support" inference.
             | 
             | I put those in quotes, because theoretically both can do
             | training, it's just that grayskull isn't well suited for
             | it, because of the internal fp19 format, and no good
             | support for scaling to multiple cards. wormhole will have
             | internal fp32, and is designed to scale out with across
             | multiple cards, and servers.
             | 
             | At least that's how I understood it.
        
               | buildbot wrote:
               | FP19 is perfectly fine for training - llama is trained in
               | fp16, many train in bf16, with microscaling exponents,
               | you can do 6 bit training :
               | https://arxiv.org/abs/2310.10537
        
               | smallnamespace wrote:
               | I think the broader context here is that it's "fine for
               | training" in the sense that you can successfully train a
               | model, but it's not "fine for training" in the sense that
               | it can only train small models due to the lack of scaling
               | across cards, which directly cuts against where ML has
               | been trending over the last several years.
               | 
               | In LLM-land we've rapidly gone from training bespoke
               | models to doing fine tuning to RLHF to zero-shot
               | prompting. The better the underlying model the more you
               | can do without additional training, so hardware that
               | fails to scale up to the largest training runs will have
               | limited practical utility despite technically supporting
               | training.
        
               | buildbot wrote:
                | Wormhole does support scale-out though, via its built-
                | in networking? And there seem to be links for something
                | on top of the board, NVLINK style.
               | 
               | And yes, I know, I've been working on LLM pretraining for
               | about 4 years now, since 2020. The number formats
               | themselves so far are mostly scale invariant or improve
               | with larger scale - you can quantize a larger model and
               | see less performance drop than a smaller model.
        
           | RobotToaster wrote:
           | Obviously they are graphic _less_ processing units.
        
           | karmakaze wrote:
           | GPGPU is a thing.
        
             | aleph_minus_one wrote:
             | > GPGPU is a thing.
             | 
             | Indeed.
             | 
             | GPGPU means "using a GPU for things that are General
             | Purpose ( _GP_ GPU)". The aforementioned IC is not a GPU
             | that is used for general-purpose computations, but a
             | specialized chip for AI computations.
        
             | Almondsetat wrote:
             | Isn't a GPGPU just a PPU?
        
           | threecheese wrote:
           | I wonder, is the dual-use of GPUs a significant engineering
           | bottleneck for Nvidia et al? My understanding is that for
           | libraries like Tensorflow and PyTorch, much of the complexity
           | is supporting the many-to-many mapping of model structures to
           | GPU silicon in a way that reduces "compile time", with all
           | the tradeoff landmines you might expect there. I would
           | imagine that abstracting this at the hardware layer is really
           | valuable. Are these non-GPU architectures the future of
           | machine learning, and can we expect the big vendors to start
           | competing here or is this a business risk for them? (robbing
           | Peter to pay Paul, like Google investing in AI that
           | cannibalizes their search revenue)
        
             | dannyw wrote:
              | I doubt it. Toolkits like TensorRT are specifically
              | optimised, and NVIDIA already segregated their HPC vs
              | gaming dies (and the corresponding die space).
              | 
              | The H100s feature a minuscule number of ROPs and TMUs
              | (mainly for graphics), and don't come with NVENC/DEC,
              | etc. DirectX and Vulkan are not supported.
             | 
             | You'll just see more and more tensor cores, optimised CUDA
             | cores for bfloat, etc.
        
               | paulmd wrote:
               | H100 does have NVDEC and jpeg ingest accelerators
               | 
               | https://www.servethehome.com/wp-
               | content/uploads/2023/10/NVID...
               | 
                | Tbh it's mildly surprising they even removed NVENC,
                | considering the overall size of the chip (in absolute
                | terms we are only talking about low-single-digit mm2
                | savings), and H100 is still advertised with features
                | targeting VM graphics/visualization... remember they
                | also still put a full graphics pipeline with ROPs/TMUs
                | on the chip, just no actual display hardware.
        
             | smoldesu wrote:
             | The dual-use of GPUs is what's driving Nvidia's demand, for
             | the most part. They bet big on more outlandish designs (eg.
             | CUDA/Tensor cores) and mixed-precision computation, and did
             | a lot of the hardware/software work to get it shipped out
             | years ago. There wasn't much interest in CUDA outside niche
             | research/industry applications for a while, it's a small
             | miracle they kept it around long enough to see the crypto
             | and AI booms. Now their bet is paying big dividends, and
             | other companies are trying to figure out fast ways to
             | leverage raster compute for general-purpose acceleration.
             | 
             | Apple initially developed OpenCL with Khronos as a similar-
             | ish analog for GPU acceleration a while ago. There were a
             | few partners that invested in it, but it suffered the same
             | lack of demand as CUDA and languished for a while. Now
             | Apple doesn't support OpenCL, so the purpose of a multi-
             | platform acceleration library has kinda been scuttled.
             | Nvidia played their cards well, and their competitors are
             | going to feel the pain for a while unless they work
             | together again.
        
           | tiahura wrote:
           | Matrix-Math Coprocessors.
        
       | camel-cdr wrote:
        | Since people seem to be wondering how the architecture works,
        | and the software stack is open (although I'm not sure to what
        | extent), I'll share my understanding of it.
        | 
        | The basic system on the cards consists of a bunch of Tensix
        | cores and shared memory:
       | 
       | > Each Tensix core contains a high-density tensor math unit (FPU)
       | which performs most of the heavy lifting, a SIMD engine (SFPU),
       | five Risc-V CPU cores, and a large local memory storage. [0]
       | 
        | > The cores are connected with two torus-shaped NoCs, going in
        | opposite directions. [0]
       | 
        | The RISC-V cores in the Tensix cores are tiny rv32i cores that
        | control the FPU and SFPU, and are also used to prepare/move
        | data.
        | 
        | The FPU does "dense tensor math", so I think it's probably a
        | matmul engine, but I don't know any more specifics. [1]
        | 
        | The SFPU is a more general-purpose SIMT engine that can be
        | driven from the RISC-V cores.
       | 
        | There is an SFPU simulator you can play around with on their
        | github. [2] See the low-level kernels example for how the
        | programming model works. [3]
       | 
       | The grayskull SFPU has 4 general purpose LRegs, which each hold
       | 64 19-bit values. Wormhole has 8 general purpose LRegs, which
       | each hold 32 32-bit values.
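        | 
        | To put those register files in perspective, straight arithmetic
        | on the numbers above:
        | 
        |     # SFPU general-purpose register file capacity
        |     grayskull = 4 * 64 * 19  # 4 LRegs x 64 lanes x 19 bits
        |     wormhole = 8 * 32 * 32   # 8 LRegs x 32 lanes x 32 bits
        |     print(grayskull, wormhole)  # 4864 vs 8192 bits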
       | 
       | I've been told that wormhole SFPU has a ~3x IPC increase over
       | grayskull, and a few new SFPU instructions.
       | 
       | You can probably find out more by browsing the docs and rummaging
       | through the github repos. [4]
       | 
       | [0] https://docs.tenstorrent.com/tenstorrent/v/tt-buda/hardware
       | 
       | [1] https://docs.tenstorrent.com/tenstorrent/v/tt-
       | buda/terminolo...
       | 
       | [2] https://github.com/tenstorrent-
       | metal/sfpi/blob/master/tests/...
       | 
       | [3] https://tenstorrent-metal.github.io/tt-
       | metal/latest/tt_metal...
       | 
       | [4] https://github.com/tenstorrent-metal/
        
         | BirAdam wrote:
         | So, modern transputer but RISC-V. Seems neat.
        
         | cmrdporcupine wrote:
          | Oh, so am I to read this right that the actual vector work
          | being done here is not done via the RISC-V [V]ector extension,
          | but by a purely custom core?
        
           | camel-cdr wrote:
            | Yes, the RISC-V Vector extension would be absolute overkill
            | for an ML accelerator, and would waste a lot of die space.
            | Currently the cards need an x86_64 host, but they plan to
            | replace that with their Ascalon RISC-V CPU, which supports
            | the RISC-V vector extension (RVV). So most compute is done
            | by the accelerator cards, but for some things the host
            | system steps in. There RVV can come in handy, because it's
            | a lot more flexible.
        
       | margorczynski wrote:
        | I wonder, with the recent research coming to light about the
        | effectiveness of using 3-state weights, whether they plan to
        | release something along those lines?
        | 
        | Can they really compete in FLOP/$ with the likes of Nvidia,
        | even if it is a more bespoke architecture than a modified GPU?
        
         | mastax wrote:
         | Nvidia has insane margins so it isn't hard* for a well-funded
         | player to beat them in FLOPS/$. The hard part is the compiler
         | toolchains, the libraries, the documentation, getting models to
         | run fast.
         | 
         | * Everything is relative. It's hard to make even a simple
         | microcontroller - but also it isn't hard, you know?
        
           | transitionnel wrote:
           | "Life in 1 comment"
        
         | dannyw wrote:
          | I'd wait till the dust settles. Maybe a 2-bit encoding (-1, 0,
          | 0.5, 1) would be easier to design hardware for, as 0.5 can be
          | a bitwise shift.
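          | 
          | Something like this in fixed point, where the four codes map
          | to negate / skip / shift / add (just a sketch of the idea,
          | not anyone's actual hardware):
          | 
          |     def apply_weight(acc, x, code):
          |         # x: fixed-point activation, code: 2-bit weight
          |         if code == 0b00:           # -1
          |             return acc - x
          |         if code == 0b01:           # 0
          |             return acc
          |         if code == 0b10:           # 0.5 -> arithmetic shift
          |             return acc + (x >> 1)
          |         return acc + x             # 0b11 -> +1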
        
         | imtringued wrote:
          | Nobody cares about FLOPs anymore when it comes to
          | transformer-based architectures, and considering the
          | transformer architecture is applicable to almost any use case
          | beyond LLMs, memory bandwidth and memory capacity are the most
          | important metrics. Grayskull has enough 16-bit float
          | performance to outrun its memory bandwidth even with ternary
          | weight quantization.
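          | 
          | A quick roofline-style check of that claim, taking the ~332
          | TFLOPS figure quoted elsewhere in the thread and the 118.4
          | GB/s spec (batch-1 decode does roughly 2 FLOPs per parameter
          | read):
          | 
          |     peak_flops = 332e12  # e150 figure quoted above
          |     mem_bw = 118.4e9     # bytes/s
          |     print(peak_flops / mem_bw)  # ~2800 FLOPs/byte to saturate
          |     # vs ~1-4 FLOPs/byte for single-stream LLM decode, so
          |     # decode hits the bandwidth wall long before compute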
        
       | ribit wrote:
       | I was unable to find any reference to performance or
       | architectural details. Memory bandwidth is very low for an ML-
       | focused device. Price is extremely high.
       | 
       | What am I missing?
        
         | nabla9 wrote:
         | Those are DevKits.
         | 
          | I assume they are sold to hardware manufacturers and maybe
          | software developers so that they can check out and test their
          | systems against a real processor. The processors are probably
          | manufactured in some relatively old process technology.
          | 
          | The real thing will be manufactured using a different, 2nm
          | process according to the article, and performance will be
          | different.
        
       | physPop wrote:
       | Outsider to the field and just curious: Does anyone know how
       | these kinds of processors compare to the custom silicon by
       | aws/google/tesla?
        
         | VHRanger wrote:
         | Much more efficient per watt for some specialized types of
         | operations.
         | 
         | Obviously immature ecosystem, expect a couple of years at
         | minimum before adoption if it's the real deal.
        
       | bschne wrote:
       | Been following this for a while b/c Jim Keller, but every time I
       | look at the arch [1; as linked by other commenter] as someone who
       | doesn't know the first thing about CPU/ASIC design it just looks
       | sort of... "wacky"? Does anyone who understands this have a good
       | explainer for the rationale behind having a grid of cores with
       | memory and IFs interspersed between and then something akin to a
       | network interconnecting them with that topology? What is it about
       | the target workloads here that makes this a good approach?
       | 
       | 1. https://docs.tenstorrent.com/tenstorrent/v/tt-buda/hardware
        
         | tormeh wrote:
         | You want memory to be close to where it's used because at the
         | speeds of high-performance ICs, the latency caused by distance
         | is actually significant.
        
           | bschne wrote:
           | but isn't that aspect common among this, CPUs, GPUs, ...? And
           | it feels like the whole NOC thing would add quite some
           | overhead to moving things around.
           | 
           | Are you saying proximity here more than offsets this vs. e.g.
           | each core having its own cache as I think they do in a
           | "normal" CPU? And if so, is this more true of ML inference
           | workloads than other workloads, for some reason?
        
             | loeg wrote:
             | I think the NOC approach is fundamentally similar to
             | Intel's rings that they used for core interconnect back in
             | the mid-2010s. It works.
             | 
             | https://www.realworldtech.com/includes/images/articles/snbe
             | p...
             | 
             | https://en.wikichip.org/wiki/intel/microarchitectures/sandy
             | _...
        
               | paulmd wrote:
               | They still use this today, and the ring interconnect is
               | also the topology inside zen3/zen4 CCXs (with 8 cores).
               | Ring is one of the simplest and best systems for >4
               | cores, until you get to about 8-10 cores (at which point
               | you generally split it into multiple "tiers" like
               | multiple CCXs etc).
        
             | adgjlsfhk1 wrote:
              | I think the distinction with ML inference workloads is
              | that you often have very little control flow, so this type
              | of architecture lets you match layers to adjacent cores so
              | that each operation gets its data directly from the step
              | before rather than from RAM.
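              | 
              | Conceptually something like this, where each "core" is a
              | stage consuming its neighbour's output instead of round-
              | tripping through DRAM (a toy sketch, not the actual
              | programming model):
              | 
              |     def core(op, upstream):
              |         # each core applies its op to whatever the
              |         # previous core hands it, tile by tile
              |         for tile in upstream:
              |             yield op(tile)
              | 
              |     layers = [lambda t: t * 2,      # toy stand-ins
              |               lambda t: max(t, 0),  # for real ops
              |               lambda t: t + 1]
              |     stream = iter([1, -3, 5])       # toy input tiles
              |     for op in layers:
              |         stream = core(op, stream)
              |     print(list(stream))             # [3, 1, 11]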
        
           | imtringued wrote:
            | I honestly don't understand this latency obsession for LLMs.
            | You are loading millions of parameters sequentially for each
            | matrix. The access pattern is perfectly predictable. I just
            | ran llama.cpp with profiling and 99.9% of the time is spent
            | in matrix multiplication. This shocked me, honestly, because
            | I genuinely thought that there was going to be much more
            | variety.
        
         | transitionnel wrote:
          | Definitely looks wacky! Has a nice concept though, I like the
          | "Network On Chip" reversed toruses.
          | 
          | Hopefully some of y'all tinkerers with time and dough can
          | bring some of these ideas to fruition, keep Nvidia on their
          | toes ;)
         | 
         | 64gb is a good RAM amount IMO, cheap yet still vastly
         | underutilized since we play to the LCD of users... guessing
         | Linux will be able to make that pivot much faster/so little
         | baggage.
         | 
         | Plus..."Grayskull"
        
           | bschne wrote:
           | > I like the "Network On Chip" reversed toruses
           | 
           | What about them do you like as a design decision? (genuinely
           | curious, as again, I don't understand it)
        
           | binarymax wrote:
           | > 64gb is a good RAM amount IMO
           | 
            | It doesn't have 64GB of RAM, it has 8. The system
            | requirements separately call for 64GB of host RAM for model
            | compilation.
        
         | BirAdam wrote:
         | This reminds me very much of transputers. The idea here is that
         | each cpu can context switch extremely quickly with minimal
         | latency to any resource and you have a topology that is great
         | for matrix maths as a result.
        
           | bschne wrote:
           | What makes the resulting topology great for matrix math, vs.
            | non-matrix math workloads? Naively if you know you're "only"
           | going to multiply matrices, what do you need the flexibility
           | and fast context-switching for? Is the end-game here that you
           | can lay out the workload s.t. you have a series of closely
           | colocated cores carrying out the operations of some linalg
           | expression one after the other and the memory for
           | intermediate results right in between, or something like
           | that?
        
             | cavisne wrote:
              | I suspect that's basically it (operations one by one and
              | then pipelining to saturate). That's basically what Groq
              | does also, AFAIK. From their website it seems the chips
              | are designed to be connected together into one big
              | topology, the "Galaxy" system. Also similar to TPUs,
              | although they use HBM with only a few very powerful
              | "cores" vs DRAM with low-powered cores.
        
             | imtringued wrote:
             | Is this some kind of trick question?
             | 
             | Your core needs to be fully programmable so you can do
             | things like kernel fusion. The simplest form is to load
             | quantized weights and dequantize them to bfloat16 as you
              | go. Llama.cpp and its GGUF files support various types of
              | quantization, and most of them require programmability to
              | be supported efficiently.
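              | 
              | The simplest version of that fusion looks roughly like
              | this (a numpy sketch of the idea, not llama.cpp's actual
              | kernels):
              | 
              |     import numpy as np
              | 
              |     def fused_matvec(rows, scales, x):
              |         # dequantize one quantized row at a time while
              |         # multiplying, instead of materializing the full
              |         # fp16/fp32 weight matrix up front
              |         y = np.empty(len(rows), dtype=np.float32)
              |         for i, (q, s) in enumerate(zip(rows, scales)):
              |             y[i] = (q.astype(np.float32) * s) @ x
              |         return y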
        
         | Pet_Ant wrote:
          | Sounds like a manycore architecture. If you have played
          | TIS-100 it is that exact same idea. If you haven't, but have
          | played Factorio: instead of having a central area where all
          | the work happens, you build a series of interconnected
          | stations, each doing its own part of the calculation before
          | passing it on to the next.
         | 
          | Upside is each core has its own code and is fully Turing
          | complete and independent of the others. You can handle
          | conditionals much better. And you lose the latency of network
          | hops between workers.
         | 
         | Downside is you need to break down your process to map onto
         | specific nodes and flows.
         | 
         | (Assuming it is in fact manycore - which is not the same as
         | multicore)
        
           | bschne wrote:
           | > You can handle conditionals much better
           | 
           | Because you can have a core set up for each branch and just
           | pass to that vs. "context-switching" your core to execute the
           | branch that ends up being taken?
        
             | vrighter wrote:
             | because each core has its own instruction counter
        
           | pavlov wrote:
           | Sounds kind of like the IBM/Sony/Toshiba Cell? It made an
           | appearance in the PlayStation 3, but was supposed to be a
           | more general high-performance architecture. At some point IBM
           | sold blade servers with Cell processors.
        
             | amelius wrote:
             | Is that a dataflow architecture?
             | 
             | https://en.wikipedia.org/wiki/Dataflow_architecture
        
         | crq-yml wrote:
         | It's another iteration of the unit record machine [0] - batch
         | processes done with a physical arrangement of the steps in the
         | process.
         | 
         | CPU design moved away from this analogy a long while ago
         | because the tasks being done with CPUs involved more dynamic
         | control flow structures and arbitrary workloads. But workloads
         | that are linear batches of brute force math don't need that
         | kind of dynamism, so gridded designs become fashionable as a
         | way of expressing a configurable pipeline with clear semantics
         | - everything on the grid is at a linear, integer-scalable
         | distance, buffers will be of the same size, etc.
         | 
         | [0] https://en.m.wikipedia.org/wiki/Unit_record_equipment
        
         | imtringued wrote:
         | The real question is how do they plan to compete with say a
         | Ryzen 8700G with 32 GB of overclocked RAM and the Ryzen AI NPU.
         | 2x DDR5-6600 gives you more memory bandwidth than grayskull.
         | Their primary advantage appears to be the large SRAM and not
         | much else.
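          | 
          | For reference, the DDR5 arithmetic behind that claim (dual
          | channel = 128-bit bus):
          | 
          |     ddr5_dual = 6600e6 * 128 / 8 / 1e9
          |     print(ddr5_dual)  # ~105.6 GB/s vs 102.4 (e75),
          |                       # 118.4 (e150) GB/s for Grayskull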
        
         | bgnn wrote:
         | In general putting memory physically close to compute is good.
         | If two cores need to share that memory doesn't it make sense to
         | place the memory at the interface?
        
         | adinb wrote:
         | Seeing the topology I had a flashback to college and the MasPar
         | [1] we were using in '92!
         | 
         | [1] https://en.wikipedia.org/wiki/MasPar
        
       | binarymax wrote:
       | The System Requirements on this page say 64GB RAM is a
       | prerequisite on the host system. Why? Wouldn't an inference
       | server be barebones aside from the Inference hardware?
       | 
       | https://tenstorrent.com/cards/
        
         | camel-cdr wrote:
         | From the tenstorrent discord:
         | 
         | > It has more to do with the memory resources required during
         | model compilation. The requirement will vary from model to
         | model, so 64GB is a safe limit for all.
        
           | VHRanger wrote:
           | And why do they specify PCIe 4.0?
           | 
           | I imagine it's just to leverage bandwidth correctly? Or is
           | there a necessary feature in there?
        
       | loeg wrote:
       | This article is extremely low detail. Does anyone have a source
       | with more information?
        
       | pella wrote:
        | Grayskull(tm) e150 = Tensix Cores: 120 * 5 RISC-V => 600 RISC-V
        | CPU cores
       | 
       | ( 1 Tensix core = 5 RISC-V core :
       | https://docs.tenstorrent.com/tenstorrent/v/tt-buda/hardware )
        
         | hollerith wrote:
         | I don't think that is as clear as you think it is.
         | 
         | If you had to communicate it orally, what words would you add?
         | 
         | ADDED. I think I figured out what it means: a Grayskull e150
         | contains 120 'Tensix cores', each of which contains 5 RISC-V
         | cores.
        
         | loeg wrote:
         | Do you think this thing will run Linux (re: now removed comment
         | about Linux default core limit)?
        
           | pella wrote:
            | As I see it, this is: TT-Metalium [1] = a low-level API [1]
           | 
           | "TT-Metalium is a platform for programming heterogeneous
           | collection of CPUs (host processors) and Tenstorrent
           | acceleration devices, composed of many RISC-V processors.
           | Target users of TT-Metal are expert parallel programmers
           | wanting to write parallel and efficient code with full access
           | to the Tensix hardware via low-level kernel APIs."
           | 
           | https://tenstorrent-metal.github.io/tt-
           | metal/latest/tt_metal...
           | 
           | [1]
           | 
           |  _" The software stacks come in two varieties - a high level
           | and a low level. The high-level is called TT-Buda, using
           | higher-level APIs to get things up and running, along with
           | interfaces into modern machine learning frameworks. The lower
           | level is TT-Metalium, which provides fine-grained control
           | over the hardware for custom operators, custom control, and
           | even non-machine learning code. Tenstorrent states that there
           | are no black boxes, no encrypted APIs, and no hidden
           | functions."_ https://morethanmoore.substack.com/p/unboxing-
           | the-tenstorren...
        
       | audiofish wrote:
       | This looks very similar to the Picochip designs used in a lot of
       | small cellular base stations for SDR. I hope it is similar,
       | because those were fantastic chips to program for. That influence
       | could have come via Intel's acquisition of Picochip.
        
       | bragr wrote:
       | The stated architecture reminds me of how the Intel Project
        | Larrabee GPU was supposed to work, except with RISC-V instead of
        | stripped-down x86 cores.
        
         | usrusr wrote:
         | Did Larrabee contain anything that wasn't an x86 core?
          | According to other comments, this seems to consist of
          | massive, highly specialized processing units with just a few
          | tiny CPUs sprinkled in to keep the former fed. Quite the
          | opposite of how I remember Larrabee.
        
           | VHRanger wrote:
            | Not really, the Xeon Phis were basically x86 CPUs abusing
            | AVX extensions to their limit
        
         | shrubble wrote:
         | Do you see any similarities with the Xeon Phi boards?
        
       | multiphonics wrote:
       | How is the networked multi CPU core Grayskull a different
       | approach compared to, say, Ampere's 100+ ARM core chip?
        
       | spintin wrote:
       | I bought a 3050 low profile / half length with DDR6 at 14GHz for
       | half this price.
       | 
       | And it can play games too.
        
         | wtallis wrote:
         | _G_ DDR6, 14 GT/s if we want to be accurate. So about double
         | the DRAM bandwidth, but less than 1/20th the SRAM capacity, and
         | fairly different architectures on-chip. I'm not sure what ML
         | workloads would benefit from the extra SRAM enough to overcome
         | the DRAM bandwidth deficit, but they probably exist or
         | Tenstorrent wouldn't be making such a SRAM-heavy chip.
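          | 
          | The "about double" being, roughly (assuming the 3050's
          | 128-bit bus):
          | 
          |     gddr6 = 14e9 * 128 / 8 / 1e9           # 224 GB/s
          |     print(gddr6 / 102.4, gddr6 / 118.4)    # ~2.2x / ~1.9x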
        
       | _0ffh wrote:
        | I'm getting sick of this: Groq, Tenstorrent, all the promising
        | startups are offering inference-only solutions. I have it from
        | Groq's official channels that they do not plan to invest
        | development time into enabling training, because inference is
        | where the big money is. Which I understand, insofar as
        | inference demand probably outstrips training demand by a factor
        | of millions, but I still can't help but find all of this so
        | egregiously disappointing!
        
       ___________________________________________________________________
       (page generated 2024-03-10 23:00 UTC)