[HN Gopher] Who uses Google TPUs for inference in production?
       ___________________________________________________________________
        
       Who uses Google TPUs for inference in production?
        
       I am really puzzled by TPUs. I've been reading everywhere that TPUs
       are powerful and a great alternative to NVIDIA. I have been playing
       with TPUs for a couple of months now, and to be honest I don't
       understand how people can use them in production for inference:
        
       - almost no resources online showing how to run modern generative
         models like Mistral, Yi 34B, etc. on TPUs
       - poor compatibility between JAX and PyTorch
       - very hard to understand the memory consumption of the TPU chips
         (no nvidia-smi equivalent)
       - rotating IP addresses on TPU VMs
       - almost impossible to get my hands on a TPU v5
        
       Is it just me? Or did I miss something? I totally understand that
       TPUs can be useful for training, though.
        
       Author : arthurdelerue
       Score  : 81 points
       Date   : 2024-03-11 16:22 UTC (6 hours ago)
        
       | hiddencost wrote:
       | Google is using them in prod. I think they're so hungry for chips
       | internally that cloud isn't getting much support in selling them.
        
         | amelius wrote:
          | Or maybe they are just using Nvidia. Who knows ...
        
           | eklitzke wrote:
           | Lots of people know.
        
             | nivekney wrote:
             | _marks as solved_
        
           | vineyardmike wrote:
            | Beyond the fact that this is hardly a secret, there are lots
            | of other signs.
           | 
            | 1. They have bought far less from Nvidia than other
            | hyperscalers, and they literally can't vomit without saying
            | "AI". They have to be running those models on something.
            | They have purchased huge amounts of chips from fabs, and
            | what else would those be?
           | 
           | 2. They have said they use them. Should be pretty obvious
           | here.
           | 
            | 3. They maintain a whole software stack for them, they
            | design the chips, etc., yet they don't really try to sell
            | the TPU externally. Why else would they do all this?
        
             | jeffbee wrote:
              | They have publicly announced using TPUs for inference as
              | far back as 2016. They did not offer TPUs to Cloud
              | customers until 2017. The development is clearly driven by
              | internal use cases. One of the features they publicly
              | disclosed as TPU-based was Smart Reply, which launched in
              | 2015. So their internal use of TPUs for inference goes
              | back nearly a decade.
        
         | danjl wrote:
          | I would guess that Google's Vertex AI managed solution uses
          | TPUs. Also, Google uses them internally to train and run
          | inference for all their research products.
        
           | sciencesama wrote:
            | 80 to 90% are consumed internally!! Only from version 5 is
            | it planned to be customer-focused!!
        
           | michaelt wrote:
            | While you _can_ use TPUs with Vertex AI, it's just virtual
            | machines - you can have one with an Nvidia card if you like.
        
         | VirusNewbie wrote:
         | They're getting swallowed up by Anthropic and the other huge
         | spenders:
         | 
         | https://www.prnewswire.com/news-releases/google-announces-ex...
         | 
         | "Partnership includes important new collaborations on AI safety
         | standards, committing to the highest standards of AI security,
         | and use of TPU v5e accelerators for AI inference "
        
         | _b wrote:
          | I think this is right, in part because I've been told exactly
          | this by people who work for Google and whose job is to sell me
          | cloud stuff - i.e., they say they have so much internal demand
          | that they aren't pushing TPUs for external use. Hence external
          | pricing and support just aren't that great right now.
         | But presumably when capacity catches up they'll start pushing
         | TPUs again.
        
           | natbobc wrote:
            | Feels like a bad point in the curve to try and sell them.
            | "Oh, our internal hype cycle is done... we'll put them on
            | the market now that they're all worn out."
        
             | rj45jackattack wrote:
             | Sounds like ButterflyLabs.
        
       | emadm wrote:
       | https://pytorch.org/blog/high-performance-llama-2/
        
         | htrp wrote:
         | >Cheers,
         | 
         | > The PyTorch/XLA Team at Google
         | 
         | Meanwhile you have an issue from 5 years ago with 0 support
         | 
         | https://github.com/pytorch/xla/issues/202
        
       | htrp wrote:
        | We've previously tried and almost always regretted the decision.
        | I think the tech stack needs another 12-18 months to mature (it
        | doesn't help that almost all work outside of Google is being
        | done in torch).
        
         | mike_d wrote:
         | > I think the tech stack needs another 12-18 months to mature
         | 
          | Google was doing AI before any other company even thought
          | about it. They are on the 6th generation of TPU hardware.
         | 
         | I don't think there is any maturity issue, just an availability
         | issue because they are all being used internally.
        
           | htrp wrote:
            | 100% agree. If I had access to the TPU team internally, it
            | would be very easy to use in production.
            | 
            | If you aren't internal, the documentation, support, and even
            | just general bug fixing are impossible to get.
        
             | ethbr1 wrote:
             | (Has an expert team dedicated solely to optimizing for
             | exotic hardware) = an option
             | 
             | (Doesn't have a team like that) = stick to mass-use,
             | commodity hardware
             | 
             | That's generally been the trade-off since ~1970. And
             | usually, the performance isn't worth the people-salaries.
             | 
              | How many examples are there of successful hardware that
              | isn't well documented and doesn't have drop-in 1:1 SDK
              | coverage vs (the more popular solution)?
              | 
              | It seems like a heavy lift to even get something that does
              | have parity in those ways adopted, given you're fighting
              | market inertia.
        
         | danielcampos93 wrote:
          | I feel like I have been hearing that since the v1 TPU. I think
          | it's the perfect solution for Google because there are teams
          | whose job is to take a model and TPUify it. Elsewhere there is
          | no such team, so it's no fun.
        
       | kccqzy wrote:
        | Apparently Midjourney uses them. GCP put out a press release a
       | while ago: https://www.prnewswire.com/news-releases/midjourney-
       | selects-...
        
         | trsohmers wrote:
          | The quote from the linked press release is that they do
          | training on TPU v4, while inference runs on GPUs. I have also
          | heard this separately from people associated with Midjourney
          | recently - that they use TPUs solely for training.
        
       | derdrdirk wrote:
        | TPUs are tightly coupled to JAX and the XLA compiler. If your
        | model is based on PyTorch, you can use the PyTorch/XLA bridge to
        | export your model to StableHLO and then compile it for a TPU. In
        | theory the XLA compiler should be more performant than PyTorch's
        | Inductor.
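        | 
        | Roughly, the lazy-tensor flow looks like this - an untested
        | sketch, assuming a TPU VM with torch and torch_xla installed and
        | a toy Linear layer standing in for a real model:
        | 
        |     import torch
        |     import torch_xla.core.xla_model as xm
        | 
        |     device = xm.xla_device()  # the attached TPU core
        |     model = torch.nn.Linear(128, 16).eval().to(device)
        |     x = torch.randn(1, 128).to(device)
        | 
        |     with torch.no_grad():
        |         y = model(x)   # ops are traced lazily, nothing runs yet
        |     xm.mark_step()     # cut the graph; XLA compiles and runs it
        |     print(y.cpu())     # pull the result back to the host
        | 
        | The StableHLO export route is the ahead-of-time variant of the
        | same idea, but the lazy-tensor flow above is the quickest way to
        | try a PyTorch model on a TPU VM.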
        
       | ooterness wrote:
       | There's a cubesat using a Coral TPU for pose estimation.
       | 
       | https://aerospace.org/article/aerospaces-slingshot-1-demonst...
        
       | ChrisArchitect wrote:
       | Ask HN:
        
       | pogue wrote:
        | I've seen people connecting Coral TPUs to Raspberry Pis to run
        | local LLMs, but I'm not sure how effective it is. Check YouTube
        | for some videos about it.
        | 
        | Speaking of SBCs, prior to the Raspberry Pi, I was looking at
        | the Orange Pi 5, which has a Rockchip RK3588S with an NPU
        | (Neural Processing Unit). This was the first I had heard of such
        | a thing, and I was curious what exactly it does. Unfortunately,
        | there's very little support for the Orange Pi and not a large
        | community around it, so I couldn't find any feedback on how well
        | it worked or what it did.
       | 
       | http://www.orangepi.org/html/hardWare/computerAndMicrocontro...
        
       | mrwilliamchang wrote:
        | To see memory consumption on the TPU while running on GKE, you
        | can look at the kubernetes.io/node/accelerator/memory_used
        | metric:
       | 
       | https://cloud.google.com/kubernetes-engine/docs/how-to/tpus#...
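        | 
        | If you are on the TPU VM itself rather than on GKE, a rough
        | poor-man's nvidia-smi is to ask the runtime directly - for
        | example from JAX. An untested sketch; memory_stats() may return
        | nothing on some runtime versions:
        | 
        |     import jax
        | 
        |     # Print per-chip HBM usage for the TPU devices JAX can see.
        |     for d in jax.local_devices():
        |         stats = d.memory_stats()  # may be None if unsupported
        |         if stats:
        |             used = stats.get("bytes_in_use", 0) / 1e9
        |             limit = stats.get("bytes_limit", 0) / 1e9
        |             print(f"{d.device_kind} {d.id}: "
        |                   f"{used:.2f} / {limit:.2f} GB")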
        
       ___________________________________________________________________
       (page generated 2024-03-11 23:01 UTC)