[HN Gopher] 'I paid for the whole GPU, I am going to use the whole GPU'
       ___________________________________________________________________
        
       'I paid for the whole GPU, I am going to use the whole GPU'
        
       Author : mooreds
       Score  : 63 points
       Date   : 2025-05-07 21:04 UTC (1 hour ago)
        
 (HTM) web link (modal.com)
 (TXT) w3m dump (modal.com)
        
       | cubefox wrote:
       | For anyone thinking this is about video games:
       | 
       | > We'll specifically focus on neural network inference workloads
        
         | keybored wrote:
         | It's hard to forget the neural network application these days.
        
       | mooreds wrote:
       | The subtitle (which is important but was too long for the HN
       | submission) is "A high-level guide to GPU utilization".
        
       | mwilcox wrote:
       | Understandable
        
       | Mockapapella wrote:
       | This is a good article on the "fog of war" for GPU inference.
       | Modal has been doing a great job of aggregating and disseminating
       | info on how to think about high quality AI inference. Learned
       | some fun stuff -- thanks for posting it.
       | 
       | > the majority of organizations achieve less than 70% GPU
       | Allocation Utilization when running at peak demand -- to say
       | nothing of aggregate utilization. This is true even of
       | sophisticated players, like the former Banana serverless GPU
       | platform, which operated at an aggregate utilization of around
       | 20%.
       | 
       | Saw this sort of thing at my last job. Was very frustrating
       | pointing this out to people only for them to respond with
       | ¯\_(ツ)_/¯. I posted a much less tactful article (read: rant)
       | than the one by Modal, but I think it still touches on a lot of
       | the little things you need to consider when deploying AI models:
       | https://thelisowe.substack.com/p/you-suck-at-deploying-ai-mo...
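       | 
       | As a rough back-of-the-envelope on that 20% figure (toy numbers
       | and an assumed ~$2/GPU-hour rental price, purely for
       | illustration):
       | 
       |     gpus = 10
       |     hours_rented = 24 * 30       # a month of always-on rentals
       |     price_per_gpu_hour = 2.00    # assumed price, not from the article
       |     utilization = 0.20           # share of GPU-hours doing real work
       | 
       |     gpu_hours_paid = gpus * hours_rented
       |     gpu_hours_used = gpu_hours_paid * utilization
       |     idle_cost = (gpu_hours_paid - gpu_hours_used) * price_per_gpu_hour
       |     print(f"paid for {gpu_hours_paid} GPU-hours, used "
       |           f"{gpu_hours_used:.0f}, ~${idle_cost:,.0f} on idle GPUs")
       | 
       | Even a small always-on fleet at 20% utilization burns most of its
       | budget on idle time.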
        
         | charles_irl wrote:
         | Nice article! I had to restrain myself from ranting on our blog
         | :)
        
       | awesome_dude wrote:
       | I'm old enough to remember when people would be concerned if
       | their CPU usage went to 100%
        
         | bdangubic wrote:
         | back in those days you weren't renting them :)
        
           | XorNot wrote:
            | Well, people are also pretty bad at logistical reasoning.
           | 
           | From a capital expenditure perspective, you _are_ renting the
           | CPU you bought in terms of opportunity cost.
           | 
            | What people do have is some sense that there's an ascribable
            | value to keeping capability in reserve, versus discovering
            | you don't have it when you need it.
        
         | twoodfin wrote:
         | You'd worry about 100% CPU because even if the OS was
         | successfully optimizing for throughput (as Linux is very good
         | at), latency/p99 is certain to suffer as spare cycles
         | disappear.
         | 
         | That's not a concern with typical GPU workloads, which are
         | batch/throughput-oriented.
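          | 
          | A toy way to see the first point (an M/M/1 queue sketch --
          | Poisson arrivals and exponential service are assumptions, so
          | this is an illustration rather than a model of any real
          | system):
          | 
          |     import math
          | 
          |     service_ms = 10.0  # mean time to handle one request
          |     for rho in (0.5, 0.7, 0.9, 0.95, 0.99):
          |         # M/M/1: response time is exponential with rate (1-rho)/S,
          |         # so mean = S/(1-rho) and p99 = -ln(0.01) * S/(1-rho)
          |         mean_ms = service_ms / (1 - rho)
          |         p99_ms = -math.log(0.01) * service_ms / (1 - rho)
          |         print(f"util={rho:.2f} mean={mean_ms:.0f}ms p99={p99_ms:.0f}ms")
          | 
          | Both the mean and the tail blow up as utilization approaches 1,
          | which is why 100% CPU was a red flag for latency-sensitive
          | services.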
        
           | calaphos wrote:
            | There's still a throughput/latency tradeoff curve, at least
            | for any sort of interactive model.
           | 
           | One of the reasons why inference providers sell batch
           | discounts.
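            | 
            | Roughly, with a toy cost model (a fixed kernel overhead plus
            | a small per-item cost -- numbers assumed for illustration):
            | 
            |     overhead_ms = 20.0   # assumed fixed cost per forward pass
            |     per_item_ms = 1.0    # assumed marginal cost per request
            |     for batch in (1, 4, 16, 64):
            |         step_ms = overhead_ms + per_item_ms * batch
            |         throughput = batch / (step_ms / 1000)  # requests/sec
            |         print(f"batch={batch:3d} latency~{step_ms:.0f}ms "
            |               f"throughput~{throughput:.0f}/s")
            | 
            | Larger batches buy far more throughput per GPU at the cost of
            | higher per-request latency, which is the curve batch APIs let
            | providers exploit.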
        
       | charles_irl wrote:
       | Oh, I wrote this! Thanks for sharing it.
        
       ___________________________________________________________________
       (page generated 2025-05-07 23:00 UTC)