https://www.theregister.com/2024/08/23/3090_ai_benchmark/

Benchmarks show even an old Nvidia RTX 3090 is enough to serve LLMs to thousands

For 100 concurrent users, the card delivered 12.88 tokens per second, just slightly faster than average human reading speed

Tobias Mann // Fri 23 Aug 2024 // 21:00 UTC

If you want to scale a large language model (LLM) to a few thousand users, you might think a beefy enterprise GPU is a hard requirement. However, at least according to Backprop, all you actually need is a four-year-old graphics card.

In a recent post, the Estonian GPU cloud startup demonstrated that a single Nvidia RTX 3090, which debuted in late 2020, could serve a modest LLM like Llama 3.1 8B at FP16 to upwards of 100 concurrent requests while maintaining acceptable throughput. Since only a small fraction of users are likely to be making requests at any given moment, Backprop contends that a single 3090 could actually support thousands of end users. The startup has been renting out GPU resources for the past three years and recently transitioned into a self-service cloud offering.

While powering a cloud with consumer hardware might seem like an odd choice, Backprop is hardly the first to do it. German infrastructure-as-a-service provider Hetzner has long offered bare-metal servers based on AMD's Ryzen processor family.

The RTX 3090 isn't a bad card for running LLMs, either. It boasts up to 142 teraFLOPS of FP16 tensor performance (a figure that assumes sparsity) and 936GB/s of memory bandwidth, the latter being a key determinant of performance in LLM inference workloads.

"3090s are actually very capable cards. If you want to get the datacenter equivalent of a 3090 in terms of teraFLOPS power, then you would need to go for something that is significantly more expensive," Backprop co-founder Kristo Ojasaar told The Register.

Where the card does fall short of pricier workstation and enterprise parts from the Ampere generation is memory capacity. With 24GB of GDDR6X, you aren't going to be running models like Llama 3 70B or Mistral Large, even quantized to eight- or four-bit precision. So it's not surprising that Backprop opted for a smaller model like Llama 3.1 8B, which fits comfortably in the card's memory and leaves plenty of room for key-value caching.

The testing was done with the popular vLLM framework, which is widely used to serve LLMs across multiple GPUs or nodes at scale. But before you get too excited, these results aren't without a few caveats.
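For a sense of what that setup looks like in practice, here is a minimal sketch using vLLM's Python API to load Llama 3.1 8B at FP16 on a single 24GB card. The model identifier, memory fraction, context length, and prompt are illustrative assumptions, not Backprop's published configuration.

```python
# Minimal sketch (not Backprop's exact setup): Llama 3.1 8B at FP16 on one 24GB GPU.
# Model ID and the flag values below are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed model ID
    dtype="float16",
    gpu_memory_utilization=0.90,  # fraction of VRAM vLLM may claim for weights + KV cache
    max_model_len=8192,           # cap context length to keep KV-cache memory bounded
)

params = SamplingParams(max_tokens=100, temperature=0.7)
outputs = llm.generate(["Explain continuous batching in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

A serving deployment would typically front the model with vLLM's OpenAI-compatible HTTP server rather than the offline API shown here, but the basic memory trade-off between weights and KV cache is the same.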
[Chart: With 100 concurrent users, the per-user throughput falls to just 12.88 tokens a second. Source: Backprop]

In a benchmark simulating 100 concurrent users, Backprop found the card was able to serve the model to each user at 12.88 tokens per second. While that's faster than the average person can read, generally said to be about five words per second, it's not exactly fast. With that said, it's still above the ten tokens per second generally considered the minimum acceptable generation rate for AI chatbots and similar services.

It's also worth noting that Backprop's testing used relatively short prompts and a maximum output of just 100 tokens, meaning the results are more indicative of the kind of performance you might expect from a customer service chatbot than from a summarization app. However, in further testing with the --use_long_context flag in the vLLM benchmark suite set to true, and prompts ranging from 200 to 300 tokens in length, Ojasaar found the 3090 could still achieve acceptable generation rates of about 11 tokens per second while serving 50 concurrent requests.

These figures were also measured while running Llama 3.1 8B at FP16. Quantizing the model to eight or even four bits would theoretically double or quadruple throughput, allowing the card to serve a larger number of concurrent requests, or the same number at a higher generation rate. But, as we discussed in our recent quantization guide, compressing models to lower precision can come at the cost of accuracy, which may or may not be acceptable for a given use case.

If anything, Backprop's testing demonstrates the importance of performance analysis and right-sizing workloads to a given task.

"I guess what the excellent marketing of bigger clouds is doing is saying that you really need some managed offering if you want to scale... or you really need to invest in this specific technology if you want to serve a bunch of users, but clearly this shows that's not necessarily true," Ojasaar said.

For users who need to scale to larger models, higher throughputs, or bigger batch sizes, Ojasaar told us Backprop is in the process of deploying A100 PCIe cards with 40GB of HBM2e. While also an older part, he says the availability of multi-instance GPU, which can dice a single accelerator into several smaller ones, presents an opportunity to lower costs further for enthusiasts and tinkerers.

If you're curious how your old gaming card might fare in a similar test, you can find Backprop's vLLM benchmark here.
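To get a feel for what such a test measures, here is a rough sketch of a concurrency benchmark against an OpenAI-compatible endpoint like the one vLLM exposes. It is not Backprop's benchmark script: the endpoint address, model name, prompt, and concurrency level are all illustrative assumptions, and unlike a dedicated serving benchmark it folds queueing and prefill time into each user's rate.

```python
# Rough sketch, not Backprop's benchmark: fire N concurrent requests at an
# OpenAI-compatible endpoint (such as vLLM's) and report per-user throughput.
# Endpoint URL, model name, prompt, and concurrency are illustrative assumptions.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")

CONCURRENCY = 100   # simulated simultaneous users
MAX_TOKENS = 100    # short outputs, as in the chatbot-style test described above
PROMPT = "Briefly explain why memory bandwidth matters for LLM inference."


async def one_user() -> float:
    """Send one completion request and return its tokens-per-second rate."""
    start = time.perf_counter()
    resp = await client.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # assumed model ID
        prompt=PROMPT,
        max_tokens=MAX_TOKENS,
    )
    elapsed = time.perf_counter() - start
    return resp.usage.completion_tokens / elapsed


async def main() -> None:
    rates = await asyncio.gather(*(one_user() for _ in range(CONCURRENCY)))
    print(f"Mean per-user throughput: {sum(rates) / len(rates):.2f} tokens/sec")


asyncio.run(main())
```

Because each request's timer includes time spent waiting in the queue and prefilling the prompt, a script like this will report somewhat lower numbers than a harness that isolates decode speed, which is one reason figures such as Backprop's 12.88 tokens per second aren't directly comparable across benchmarking tools.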