[HN Gopher] Launch HN: Deepsilicon (YC S24) - Software and hardw...
___________________________________________________________________
Launch HN: Deepsilicon (YC S24) - Software and hardware for ternary
transformers
Hey Hacker News! We're Abhi and Alex from deepsilicon
(https://deepsilicon.com). We are building software and hardware
for training and inferencing ternary transformer models. Here's a
video of the software: https://www.youtube.com/watch?v=VqBn-I5D6pk.
Transformer-based models are getting bigger every generation,
making the inference hardware requirements more and more expensive.
Running large transformer models on device is even more
challenging. Usually, they require trillions of FLOPs to run at
decent speeds and use too much energy and space. Our solution is
to train ternary transformer models. There are two advantages to
using ternary values. The first is that the weights can now be
stored in two bits (or even fewer), down from 16 bits. This
represents an almost 8x compression ratio for every weight matrix in
the transformer model (slightly less because of the float16 scaling
value and extra norm, but that's negligible). The second advantage
is cheaper arithmetic. In a dot product between ternary weights and
INT8 activations, we either add the INT8 value if the ternary weight
is 1, subtract it if the ternary weight is -1, or do nothing if the
ternary weight is 0, so no multiplications are needed. There are
numerous ways to take advantage of this change in arithmetic, from
lookup tables to bit-mask reductions. As for why ternary and not
quaternary/binary, ternary hits a sweet spot of compression and
(symmetric) representational value for weights in our experiments.
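To make the arithmetic concrete, here is a toy sketch of that
multiply-free dot product (illustrative NumPy only, nothing like our
actual kernels):

    import numpy as np

    def ternary_dot(w_ternary, x_int8):
        # Add where the weight is +1, subtract where it is -1, and
        # skip where it is 0. Real kernels work on packed bits,
        # lookup tables, etc.
        x = x_int8.astype(np.int32)  # widen the accumulator
        return int(x[w_ternary == 1].sum() - x[w_ternary == -1].sum())

    w = np.array([1, 0, -1, 1, 0], dtype=np.int8)  # ternary weights
    x = np.array([3, 7, -2, 5, 9], dtype=np.int8)  # INT8 activations
    print(ternary_dot(w, x))  # 3 - (-2) + 5 = 10
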
Currently, hardware is not really optimized for extremely low
bit-width matrix operations (multiplication or otherwise). We've
tried various kernel implementations on both CPUs and GPUs (really
only NVIDIA GPUs). We don't even come close to the theoretical
maximum speed for our kernels, and a large part of that failure is
that the architecture of existing hardware isn't optimized for the
operations we want it to do. Custom silicon for ternary LLMs can
accelerate inference by implementing algorithms and circuits
designed specifically for ternary LLMs.
Unlike most hardware companies, which need silicon to show
improvements, we can already show improvements to active VRAM usage
and throughput with our custom kernels on existing hardware. This
sets pretty impressive lower bounds for custom silicon. We
originally started working on this after reading the BitNet paper
from Microsoft, and were disenchanted that we couldn't run SOTA
models on our consumer hardware (3090 and 3070M). Both Alex and I
did research at Dartmouth: I worked more on the ML/model
architecture side, while Alex worked on randNLA CUDA kernels to
accelerate training. The research experience, and the opportunity
to talk to professors, made us realize that if we could pull off
ternary transformers, it could solve the large-scale inference
problem on the edge and in the cloud. First, we must either retrain
or pretrain a model with our custom linear layers based on the
BitNet 1.58 layers (we're working on open-sourcing our framework for
training, data labelling, and synthetic data generation here:
https://github.com/deepsilicon/Sila). The model is trained with FP16
weights; during the forward pass the weights are quantized, the
quantization function is detached from the computational graph so
gradients can still flow, and the loss is measured with respect to
the quantized weights.
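Roughly, that straight-through trick looks like the sketch below
(an illustrative PyTorch sketch of the idea, not our production
layer):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TernaryLinear(nn.Module):
        # BitNet-1.58-style idea: full-precision master weights,
        # ternary weights in the forward pass, straight-through
        # gradients back to the master weights.
        def __init__(self, in_features, out_features):
            super().__init__()
            self.weight = nn.Parameter(torch.empty(out_features, in_features))
            nn.init.kaiming_uniform_(self.weight)

        def forward(self, x):
            w = self.weight
            scale = w.abs().mean().clamp(min=1e-5)          # absmean scale
            w_q = (w / scale).round().clamp(-1, 1) * scale  # {-1, 0, +1} * scale
            w_q = w + (w_q - w).detach()                    # gradients bypass round/clamp
            return F.linear(x, w_q)
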
Once the model converges, we can run inference with our custom
kernels for CPUs or GPUs (we are working on Inferentia and TPU
support). The end goal is purpose-built custom silicon for the
ternary weights, with better compression, throughput, latency, and
energy efficiency than our kernels can achieve on existing
hardware. We know
this is a highly challenging problem due to technical and market
difficulties. Plenty of hardware companies have tried to accelerate
inference, but most are not profitable. The biggest problem in the
ML hardware market, perplexingly, is software. It's challenging to
convince companies to switch to some new hardware when their entire
infrastructure and software stack has been configured for some
other hardware. On the technical side, we must support various
deployment options and model architectures to make large-scale
custom silicon production worthwhile. This is compounded by the
fact that we want a single line of code to handle everything,
abstracting what we're doing away from the ML engineers. So, we
need to handle everything on the technical side: compiling the
right kernels for your platform, generating the right bindings for
ONNX/TensorRT, tuning the kernels, setting the mode to training or
inference, etc. We'd love to hear your opinions about ASICs for
transformer inference - and if you know anyone who might be
interested in deploying these models, my email is
abhi@deepsilicon.com. We can't wait to hear what you all think!
Author : areddyyt
Score : 131 points
Date : 2024-09-09 16:30 UTC (6 hours ago)
| felarof wrote:
| Very interesting!
| sidcool wrote:
| Congrats on launching. This is inspiring.
| anirudhrahul wrote:
| Can this run crysis?
| 0xDA7A wrote:
| Can this run Doom?
| danjl wrote:
| In my experience, trying to switch VFX companies from CPU-based
| rendering to GPU-based rendering 10+ years ago, a 2-5x
| performance improvement wasn't enough. We even provided a
| compatible renderer that accepted Renderman files and generated
| matching images. Given the rate of improvement of standard
| hardware (CPUs in our case, and GPU-based inference in yours), a
| 2-5x improvement will only last a few years, and the effort to
| get there is large (even larger in your case). Plus, I doubt
| you'll be able to get your HW everywhere (i.e. mobile) where
| inference is important, which means they'll need to support their
| existing and your new SW stack. The other issue is entirely non-
| technical, and may be an even bigger blocker -- switching the
| infrastructure of a major LLM provider to a new upstart is just
| plain risky. If you do a fantastic job, though, you should get
| acqui-hired, probably with a small individual bonus, not enough to
| pay off your investors.
| areddyyt wrote:
| We're targeting the edge market first, such as NVIDIA's Jetson
| line, because it's far less supported/focussed on. In our
| experience, whenever we did training runs on H100 clusters with
| x86, any pip package would be easily installable, and a wide
| array of software just worked. This is not the case in Jetson,
| where we constantly have to rebuild packages from source, and
| in general, NVIDIA will only release a better board every five
| years. As for the second part of your question, we agree. Much
| of our work has been trying to make switching to our software
| layer straightforward (a single line of code). The ideal
| endgame is that, given an ONNX file, we can parse the generated
| node tree and determine if our hardware supports all the nodes.
| Of course, this is assuming we have a large enough share of the
| market using our software, so we know what operations we need
| to support on the hardware side of things.
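|
| Something in that spirit (a toy sketch; the checker and the
| supported-op set below are hypothetical, not our real tooling):
|
|     import onnx
|
|     SUPPORTED_OPS = {"MatMul", "Gemm", "Add", "Softmax",
|                      "LayerNormalization"}  # hypothetical set
|
|     def unsupported_ops(onnx_path):
|         model = onnx.load(onnx_path)
|         return sorted({n.op_type for n in model.graph.node}
|                       - SUPPORTED_OPS)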
| danjl wrote:
| I cannot see any way of building HW profitably for the Jetson
| market. You are really competing with Raspberry PI, not
| Jetson, IMO. I mean, I'm no expert, but I would suggest doing
| a deep dive on your business plan if you intend to target the
| small hardware world rather than spending any time designing
| HW or SW. Then reduce your estimate by at least half since
| doing anything in that embedded/edge world has many more
| technical issues.
| areddyyt wrote:
| In general, Jetson has quite a large market. Vehicle
| companies use automotive-rated Jetson Orins, and defense
| companies also use Jetson Orins to power ML applications on
| the edge (Anduril). Many of the companies we currently talk
| to are robotics companies that are forced to use Jetsons
| because they are both the least bad of the options and the
| only edge compute platform with enough juice to run larger
| transformer models.
| danjl wrote:
| And the auto and Defense markets are so easy to enter! /s
|
| Both of these markets have long lead times, tight HW
| build times, and move incredibly slowly. They are not the
| kind of markets that like using stuff from new companies
| with no history. Again, I'm no expert, but I'd say you
| need to be concentrating on sales and market research
| now.
| areddyyt wrote:
| We are not under the illusion these markets are easy to
| enter. Still, we believe providing an effortless and
| compatible experience for edge ML computing is a strong
| competitive advantage. We have not met anyone who likes
| using Jetsons yet, unlike A100/H100s in the server
| market.
|
| Edit: I should note that if it weren't for Dusty and his
| docker image generating GitHub repo for Jetson, we would
| have spent weeks trying to get our kernels and optimized
| models shipped to customers.
| hedgehog wrote:
| With respect, it doesn't sound like you know much about
| any of these businesses. This startup is extremely early,
| the road to silicon is long, and there is a lot of
| external change and learning by doing that will happen
| between here and there. This is them getting started, and
| based on my related work experience, I think it's pretty
| interesting.
| autoconfig wrote:
| What's your point? Is it that one shouldn't attempt to
| enter a market just because it's difficult? Or are you
| trying to educate the founders about something obvious
| that they likely have already spent 1000x more time
| thinking about than you?
| motoxpro wrote:
| This 1000%. Just because a business in a tangential area
| didn't work doesn't mean innovation shouldn't happen.
| danjl wrote:
| I think the only way this could work is if you had the
| backing of one of the major LLM providers who decided that
| your ideas are worth doing a PoC. That way you actually have
| a client on board before you spend all the money. I know you
| guys probably like the designing of the HW and SW, and maybe
| the implementation of both, but really, what you need now is
| to do sales.
| areddyyt wrote:
| Agreed. We don't plan on making hardware until there is
| enough demand from customers to make it economically
| viable.
| lumost wrote:
| There are multiple ways to run a business like this.
|
| 1. Go deep on the tech. There are funders who will want
| equity stakes in risky startups because they operate in
| adjacent markets. It's often cheaper to invest $1MM in a
| startup than in internal R&D activities. If it has promising
| results, those same investors may ramp up their spend or
| pivot to an acquisition strategy.
|
| 2. Get early customers. If you have 1-10 large enterprises
| with a committed spend, then you are likely golden.
| However as nice as this option sounds, there are few
| avenues to get this type of commitment. If you are in the
| fortunate position of knowing the exec/founding/investor
| team of a large LLM provider - it's possible. But easier
| said than done.
|
| 3. Build it and they will come. Business strategies take
| time to develop - maybe that time is poorly spent. Build
| the best version of your product and someone might take it
| up. There are a few investors who will take a flyer on this
| type of founder mentality. Benefit to the investor is that
| they can get a much larger equity stake/board position in
| exchange for the early creative freedom. If it works out,
| the investor can get a lot of alpha. A card which handled
| LLM inference at 1/100th the cost of an H100 could produce
| quite a bit of value for the right buyer.
| gchadwick wrote:
| Is GPU rendering used today for VFX? From a quick google it
| seems that yes, GPU-based rendering is definitely an option,
| even if there are various reasons to still prefer CPU. So in your
| case, was what you were aiming to do really pointless, or did
| your particular solution simply fail to succeed?
|
| You're right that as a small player it's very hard to gain
| traction, even if the tech is fantastic because it's risky to
| switch your tech stack over. Though if you do do a good job
| with the tech I'd say you have a decent chance of an
| acquisition from a bigger player who wants a ready-made (or 90%
| of the way there) solution they can make their own. Perhaps you
| can call this an acqui-hire, but I think you're significantly
| underplaying the potential upside of this exit. Imagine this
| startup is seen as having a great ternary transformer solution
| and ternary transformers turn out to be the way to go: you could
| get multiple large players eyeing up an acquisition to get ahead,
| pushing the price up.
|
| My feeling is custom ASICs for ternary transformers are a great
| area to look at. There is a genuine chance of providing a
| significant step up from GPUs in terms of power efficiency and
| potentially performance. Plenty of risk of course: ternary
| models might just not perform as well as the full-fat
| equivalents, and building custom silicon, especially as a start-
| up, comes with all kinds of issues.
| jsheard wrote:
| > Is GPU rendering used today for VFX?
|
| Yes by small studios with the agility to change their
| workflow without too much friction, and whose projects are
| small enough to fit into the constraints of GPU renderers,
| but largely not by huge studios who already have in-house CPU
| farms and whose projects need hundreds of gigs of RAM to
| render anyway.
| kridsdale3 wrote:
| The Unreal Engine I hear is getting a lot of work these
| days.
| jroesch wrote:
| I've been working in DL inference for 7+ years now (5 of
| them at a startup), which makes me comparatively ancient in the
| AI world at this point. The performance rat race/treadmill is
| never-ending, and to your point, a large (i.e., 2x+) performance
| improvement is not enough of a "painkiller" for customers
| unless there is something that is impossible for them to
| achieve without your technology.
|
| The second problem is distribution: it is already hard enough
| to obtain good enough distribution with software, let alone
| software + hardware combinations. Even large silicon companies
| have struggled to get their HW into products across the world.
| Part of this is due to the actual purchase dynamics and cycles
| of the people who buy chips: many design products and commit to
| N-year production cycles built on certain hardware SKUs, meaning
| you have to both land large deals and have opportune timing to
| catch them when they are even shopping for a new platform.
| Furthermore, the people with existing distribution, i.e. the
| Apples, Googles, Nvidias, Intels, AMDs, and Qualcomms of the
| world, already have their own offerings in this space and will
| not partner with or buy from you.
|
| My framing (which has remained unchanged since 2018) is that
| for a silicon platform to win you have to beat the incumbents
| (i.e., Nvidia) on the 3Ps: Price (really TCO), Performance, and
| Programmability.
|
| Most hardware accelerators may win on one, but even then it is
| often theoretical performance because it assumes their existing
| software can/will work on your chip, which it often doesn't
| (see AMD and friends).
|
| There are many other threats that come in this form. For
| example, if you have a fixed-function accelerator and some part
| of the model code has to run on the CPU, the memory
| traffic/synchronization can completely negate any performance
| improvements you might offer.
|
| Even many of the existing silicon startups have been struggling
| with this since the middle of the last decade; the only thing
| that saved them is the consolidation around Transformers, but it
| is very easy for a new model architecture to come out and require
| everyone to rework what they have built. This need for
| flexibility is what has given rise to the design ethos around
| GPGPU, as flexibility in a changing world is a requirement, not
| just a nice-to-have.
|
| Best of luck, but these things are worth thinking deeply about.
| When we started in this market we were already aware of many of
| these things, but their importance and gravity in the AI market
| have only grown since then, not lessened :)
| areddyyt wrote:
| We've spent a lot of time thinking about these things, in
| particular, the 3Ps.
|
| Part of making the one line of code work is addressing
| programmability. If you're on Jetson, we should load the CUDA
| kernels for Jetson. If you're using a CPU, we should load
| the CPU kernels. CPU with AVX-512? Load the appropriate
| kernels built with AVX-512 instructions, and so on.
|
| The end goal is that when we introduce our custom silicon,
| one line of code should make it far easier to bring customers
| over from Jetson/any other platform because we handle loading
| the correct backend for them.
|
| We know this will border on impossible, but it's critical
| to ensure we take on that burden rather than shifting it to
| the ML engineer.
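|
| Conceptually, the dispatch is something like this (a rough
| sketch; the backend names and probes here are made up):
|
|     import platform
|     import torch
|
|     def pick_backend():
|         if torch.cuda.is_available():
|             # Jetson's Tegra kernel shows up in the release string
|             if "tegra" in platform.release().lower():
|                 return "cuda-jetson"
|             return "cuda"
|         try:
|             cpuinfo = open("/proc/cpuinfo").read()  # Linux-only probe
|         except OSError:
|             cpuinfo = ""
|         return "cpu-avx512" if "avx512" in cpuinfo else "cpu-generic"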
| henning wrote:
| I applaud the chutzpah of doing a company where you develop both
| hardware and software for the hardware. If you execute well, you
| could build yourself a moat that is very difficult for would-be
| competitors to breach.
| 0xDA7A wrote:
| I think the part I find most interesting about this is the
| potential power implications. Ternary models may perform better
| in terms of RAM and that's great, but if you manage to build a
| multiplication-free accelerator in silicon, you can start
| thinking about running things like vision models in < 0.1W of
| power.
|
| This could have insane implications for edge capabilities, robots
| with massively better swarm dynamics, smart glasses with super
| low latency speech to text, etc.
|
| I think the biggest technical hurdle would be simulating the
| non-linear layers in an efficient way, but you can also solve
| that since you already re-train your models and could use custom
| activation functions that better approximate a HW-efficient
| non-linear layer.
| areddyyt wrote:
| The non-linear layers, particularly the softmax(QK^T), will be
| crucial to getting ultra-low latency and high throughput. We're
| considering some custom silicon just for that portion of every
| transformer block.
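|
| For reference, the op in question written out in PyTorch (the
| standard attention-score computation, nothing custom):
|
|     import math
|     import torch
|
|     def attention_scores(Q, K):
|         # softmax(QK^T / sqrt(d)): the non-linear, non-ternary
|         # part of every transformer block
|         d = Q.shape[-1]
|         return torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(d),
|                              dim=-1)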
| cs702 wrote:
| Watching the video demo was key for me. I highly recommend
| everyone else here watches it.[a]
|
| From a software development standpoint, usability looks _great_,
| requiring only one import,
|
|     import deepsilicon as ds
|
| and then, later on, a single line of Python,
|
|     model = ds.convert(model)
|
| which takes care of converting all possible layers (e.g.,
| nn.Linear layers) in the model to use ternary values. Very nice!
|
| The question for which I don't have a good answer is whether the
| improvement in real-world performance, using your hardware, will
| be sufficient to entice developers to leave the comfortable
| garden of CUDA and Nvidia, given that the latter is continually
| improving the performance of its hardware.
|
| I, for one, hope you guys are hugely successful.
|
| ---
|
| [a] At the moment, the YouTube video demo has some cropping
| issues, but that can be easily fixed.
| areddyyt wrote:
| Thank you!
|
| CUDA and Nvidia are practically impenetrable on the server
| side. To be very concrete, we did training for our models on
| AWS with ParallelCluster. We used P5 instances (8xH100) that
| were scheduled with SLURM. A problem we ran into, however, was
| that our training jobs were containerized. Thankfully, pyxis
| and enroot exist to run containerized jobs on SLURM. And who
| else but Nvidia develops and maintains those plugins? For
| practically any weird niche use case, Nvidia seems to have some
| software solution - but only on x86.
|
| Jetson is a whole other beast. There is no guarantee any pip
| package you install has an aarch64/arm64 wheel. For example, we
| could not use torch_tensorrt to compile to TensorRT via Torch
| Inductor. Why? Because the Bazel build system was only
| configured to build for Jetpack 4.6 or Jetpack 5.1, and we were
| using Jetpack 6. While Nvidia provides docker images for x86
| systems that come with torch_tensorrt installed, their L4T
| (Linux for Tegra) images do not. Instead, we had to manually
| write out a new workspace file and compile for Jetpack 6 to
| provide TensorRT compilation support.
|
| tl;dr: Nvidia and CUDA have a great walled garden on x86, not
| so much on their edge computing devices.
| cs702 wrote:
| My understanding is that, so far, most deployments of AI on
| edge devices are on mass-market mobile and entertainment
| devices relying on software and hardware tightly controlled
| by a handful of mega-corporations, such as Apple (iOS),
| Google (Android), Samsung (phones, TVs, etc.), and Tesla
| (proprietary in-car chips for FSD). Aren't those
| mega-corporations, not Nvidia, the ones who have the actual
| walled gardens on AI edge computing?
|
| Do you think otherwise?
| areddyyt wrote:
| You're absolutely right about mobile devices (Apple,
| Google, etc.). However, most companies, with the exception
| of Tesla, do use Nvidia for edge computing capabilities. We
| know for a fact that most of the automotive industry uses
| automotive-rated Orins (the 32GB unified RAM SKU) [1], and
| Anduril also uses Orins. Our primary GTM is with robotics
| companies, and we have not met a single robotics company
| not using Jetson, I'm not exaggerating.
|
| [1] Particularly vehicles with advanced self-driving
| capabilities. Qualcomm is another large vendor of hardware
| for vehicles (though they have even worse support).
| cs702 wrote:
| > Our primary GTM is with robotics companies, and we have
| not met a single robotics company not using Jetson, I'm
| not exaggerating.
|
| Huh. That's a really good sign. I'm rooting for you!
| areddyyt wrote:
| Video cropping issues should be fixed!
| _zoltan_ wrote:
| you might want to redo the video as it's cropped too much, and
| maybe it's only me but it's _really_ annoying to watch like this.
| areddyyt wrote:
| Oops, good catch. Will re-upload shortly.
| dang wrote:
| Thanks! We've updated the youtube link at the top to the fixed
| version.
| nostrebored wrote:
| What do you think about the tension between inference accuracy
| and the types of edge applications used today?
|
| For instance, if you wanted to train a multimodal transformer to
| do inference on CCTV footage, I think that this will have a big
| advantage over Jetson. And I think there are a lot of potentially
| novel use cases for a technology like that (e.g., if I'm looking
| for a suspect wearing a red hoodie, I'm not training a new
| classifier to identify all possible candidates)
|
| But for sectors like automotive and defense, is the accuracy loss
| from quantization tolerable? If you're investing so much money in
| putting together a model, even considering procuring custom
| hardware and software, is the loss in precision worth it?
| areddyyt wrote:
| Great question. So a little bit of background about
| quantization (apologies if you are already familiar).
|
| There are two types of quantization (generally): post-training
| quantization (PTQ) and quantization-aware training (QAT).
|
| PTQ almost always suffers from some kind of accuracy
| degradation. This is because usually the loss is measured with
| respect to the FP16/BF16 parameters, and so the weights and
| distribution are selected to minimize the loss with those
| weights. Once the quantization function is applied, the weights
| and distribution change in some way (even if it's by a tiny
| amount), resulting in your model no longer sitting at a minimum.
|
| We do QAT to get around the problem of PTQ. We actually
| quantize the weights during the forward pass of training, and
| measure the loss with respect to the quantized weights. As a
| result, once we converge the model, we have converged the
| ternary weights as well, and the accuracy it achieved at the
| end of training is the accuracy of the quantized model. At ~3B
| parameters, the downstream-task accuracy of the FP16 and ternary
| models is identical.
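|
| In code, the difference boils down to roughly this (illustrative
| only; absmean ternarization is one common choice, not necessarily
| exactly what we ship):
|
|     import torch
|
|     def ternarize(w):
|         scale = w.abs().mean().clamp(min=1e-5)
|         return (w / scale).round().clamp(-1, 1) * scale
|
|     # PTQ: quantize after training, so the deployed weights are no
|     # longer the ones the optimizer converged to.
|     #     w_deployed = ternarize(w_trained)
|
|     # QAT: quantize inside the forward pass with a straight-through
|     # estimator, so the loss is minimized w.r.t. the ternary weights.
|     def qat_weight(w):
|         return w + (ternarize(w) - w).detach()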
| jacobgorm wrote:
| I was part of a startup called Grazper that did the same thing
| for CNNs in 2016, using FPGAs. I left to found my own thing after
| realizing that new, better architectures, SqueezeNet followed by
| MobileNets, could run even faster than our ternary nets on off-
| the-shelf hardware. I'd worry that a similar development might
| happen in the LLM space.
| areddyyt wrote:
| It's always possible, but transformers have been around since
| 2017 and don't seem to be going anywhere. I was bullish on
| Mamba and researched extended context for structured state-
| space models at Dartmouth. However, no one cared. The bet we're
| taking is that Transformers will dominate for at least a few
| more years, but our bet could be wrong.
| lappa wrote:
| Great project, looking forward to seeing more as this develops.
|
| Also FYI, your mail server seems to be down.
| areddyyt wrote:
| Thank you, and good catch.
|
| We recently acquired deepsilicon.com, and it looks like the
| forwarding hasn't been registered yet. abhi@deepsilicon.net
| should work.
| mikewarot wrote:
| Since you're flexible on the silicon side, perhaps consider
| designing things so that the ternary weights are loaded from an
| external configuration ROM into a shift-register chain instead
| of being fixed. This would allow updating the weights without
| having to
| go through the whole production chain again.
| areddyyt wrote:
| We were actually thinking about this as a way to flush the
| weights in at initialization.
| mikewarot wrote:
| Cool.... if you want to make a general purpose compute engine
| out of it, you could go full BitGrid[1]. ;-)
|
| [1] https://bitgrid.blogspot.com/2005/03/bitgrid-story.html
| areddyyt wrote:
| This seems super cool. I'll have my cofounder look into it
| :)
| hy3na wrote:
| Ternary transformers have existed for a long time before you guys
| (TerDiT, vision ones, etc.). Competing in the edge inference space
| is likely going to require a lot of capex and opex, plus breaking
| into markets like defense that're hard asf without connections and
| a strong team. Neither of you guys is a chip architect either, and
| taping out silicon requires a lot of foresight into changing market
| demands. Good luck, hopefully it works out.
| nicoty wrote:
| Could the compression efficiency you're seeing somehow be related
| to 3 being the closest natural number to the number e, which also
| happens to be the optimal radix choice
| (https://en.wikipedia.org/wiki/Optimal_radix_choice) for storage
| efficiency?
| areddyyt wrote:
| We don't achieve peak compression efficiency because more
| complex weight unpacking mechanisms kill throughput.
|
| To be more explicit, the weight matrix's values belong to the
| set of -1, 0, and 1. When using two bits to encode these
| weights, we are not effectively utilizing one possible state:
|
| 10 => 1, 01 => 0, 00 => -1, 11 => ?
|
| I think selecting the optimal radix economy will matter more on
| custom silicon, where we can implement circuitry and
| instructions to rapidly decompress weights or work with the
| compressed weights directly.
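|
| For example, packing four ternary weights per byte with that
| encoding could look like this (toy NumPy sketch, not our kernel
| code):
|
|     import numpy as np
|
|     ENC = {-1: 0b00, 0: 0b01, 1: 0b10}            # 0b11 is the wasted state
|     DEC = np.array([-1, 0, 1, 0], dtype=np.int8)  # 0b11 decoded arbitrarily
|
|     def pack(w):  # len(w) must be a multiple of 4
|         c = np.vectorize(ENC.get)(np.asarray(w).reshape(-1, 4))
|         c = c.astype(np.uint8)
|         return (c[:, 0] << 6) | (c[:, 1] << 4) | (c[:, 2] << 2) | c[:, 3]
|
|     def unpack(packed):
|         b = np.asarray(packed, dtype=np.uint8)
|         c = np.stack([(b >> 6) & 3, (b >> 4) & 3, (b >> 2) & 3, b & 3],
|                      axis=1)
|         return DEC[c].reshape(-1)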
| luke-stanley wrote:
| The most popular interfaces (human, API and network) I can
| imagine are ChatGPT, OpenAI compatible HTTP API, Transformers
| HuggingFace API and models, Llama.cpp / Ollama / Llamafile,
| Pytorch. USB C, USB A, RJ45, HDMI/video(?) If you can run a
| frontier model or a comparable model with the ChatGPT clone like
| Open UI, with a USB or LAN interface, that can work on private
| data quickly, securely and competitively to a used 3090 it would
| be super badass. It should be easy to plug in and be used for
| running chat or API use or fine-tune or use with raw primitives
| via Pytorch or a very similar compatible API. I've thought about
| this a bit. There's more I could say but I've got to sleep
| soon... Good luck, it's an awesome opportunity.
| areddyyt wrote:
| Have you sat in on my conversations with my cofounder?
|
| The end plan is to have a single chip and flush all weights
| onto the chip at initialization. Because we are a single line
| of code that is Torch compatible (hence HF compatible), every
| other part of the codebase shouldn't change.
| luke-stanley wrote:
| I've not but that sounds cool! I would point out though, in
| terms of mind share, how memorable, and how relatable and
| useful the products are: it might help to have ways that
| directly show the application for the kinds of people buying
| GPUs for inference and training or using cloud for this that
| would love to not have to fight their ATX case in a hot
| sweaty corner while repeatedly dropping screwdrivers and
| calculating how much RAM they need to buy for the 405B while
| llama.cpp is recompiling again... I think people would throw
| money at that. I'd be happy to listen in or have a chat some
| time!
| transfire wrote:
| Combine it with TOC, and then you'd really be off to the races!
|
| https://intapi.sciendo.com/pdf/10.2478/ijanmc-2022-0036#:~:t...
___________________________________________________________________
(page generated 2024-09-09 23:00 UTC)