[HN Gopher] Without benchmarking LLMs, you're likely overpaying
       ___________________________________________________________________
        
       Without benchmarking LLMs, you're likely overpaying
        
       Author : lorey
       Score  : 123 points
       Date   : 2026-01-20 19:03 UTC (1 day ago)
        
 (HTM) web link (karllorey.com)
 (TXT) w3m dump (karllorey.com)
        
       | petcat wrote:
       | > He's a non-technical founder building an AI-powered business.
       | 
        | It sounds like he's building some kind of AI support chatbot.
       | 
       | I despise these things.
        
         | r_lee wrote:
         | And the whole article is about promoting his benchmarking
         | service, of course.
        
         | montroser wrote:
         | The whole post is just an advert for this person's startup.
         | Their "friend" doesn't exist...
        
         | lorey wrote:
          | Totally agree with your point. While I can't get specific,
          | it's a traditional (German) business that he's vertically
          | integrating with AI. Customer support is really bad in this
         | traditional niche and by leveraging AI on top of doing the
         | support himself 24/7, he was able to make it his competitive
         | edge.
        
       | verdverm wrote:
       | I'd second this wholeheartedly
       | 
       | Since building a custom agent setup to replace copilot,
       | adopting/adjusting Claude Code prompts, and giving it basic
       | tools, gemini-3-flash is my go-to model unless I know it's a big
       | and involved task. The model is really good at 1/10 the cost of
       | pro, super fast by comparison, and some basic a/b testing shows
        | little to no difference in output on the majority of tasks I tried
       | 
       | Cut all my subs, spend less money, don't get rate limited
        
         | r_lee wrote:
          | Plus I've found that overall, the "thinking" in these models
          | acts more like memory than an actual perf boost. It might
          | even be worse, because if it goes even slightly wrong in the
          | "thinking" part, it'll then commit to that for the actual
          | response
        
           | verdverm wrote:
           | for sure, the difference in the most recent model generations
           | makes them far more useful for many daily tasks. This is the
           | first gen with thinking as a significant mid-training focus
           | and it shows
           | 
           | gemini-3-flash stands well above gemini-2.5-pro
        
         | dpoloncsak wrote:
          | Yeah, on one of my first projects one of my buddies asked
          | "Why aren't you using [ChatGPT 4.0] nano? It's 99% the
          | effectiveness with 10% the price."
         | 
         | I've been using the smaller models ever since. Nano/mini,
         | flash, etc.
        
           | phainopepla2 wrote:
            | I have been benchmarking many of my use cases, and the GPT
            | Nano models have fallen completely flat on every single
            | one except for very short summaries. I would call them 25%
            | effectiveness at best.
        
             | verdverm wrote:
             | Flash is not a small model, it's still over 1T parameters.
             | It's a hyper MoE aiui
             | 
              | I have yet to go back to small models. I'm waiting on an
              | upstream feature, and the GPU provider has been seeing
              | capacity issues, so I am sticking with the gemini family
              | for now
        
           | walthamstow wrote:
           | Flash Lite 2.5 is an unbelievably good model for the price
        
           | sixtyj wrote:
           | Yup.
           | 
           | I have found out recently that Grok-4.1-fast has similar
           | pricing (in cents) but 10x larger context window (2M tokens
            | instead of ~128-200k of gpt-4-1-nano). And a ~4%
            | hallucination rate, the lowest in blind tests in LLM arena.
        
         | PunchyHamster wrote:
          | LLM bubble will burst the second investors figure out how
          | much a well-managed local model can do
        
       | andy99 wrote:
       | Depends on what you're doing. Using the smaller / cheaper LLMs
       | will generally make it way more fragile. The article appears to
       | focus on creating a benchmark dataset with real examples. For
       | lots of applications, especially if you're worried about people
       | messing with it, about weird behavior on edge cases, about
       | stability, you'd have to do a bunch of robustness testing as
       | well, and bigger models will be better.
       | 
        | Another big problem is that it's hard to set objectives in many
       | and for example maybe your customer service chat still passes but
       | comes across worse for a smaller model.
       | 
        | I'd be careful is all.
        
         | candiddevmike wrote:
         | One point in favor of smaller/self-hosted LLMs: more consistent
         | performance, and you control your upgrade cadence, not the
         | model providers.
         | 
         | I'd push everyone to self-host models (even if it's on a shared
         | compute arrangement), as no enterprise I've worked with is
         | prepared for the churn of keeping up with the hosted model
         | release/deprecation cadence.
        
           | andy99 wrote:
           | How much you value control is one part of the optimization
           | problem. Obviously self hosting gives you more but it costs
           | more, and re evals, I trust GPT, Gemini, and Claude a lot
           | more than some smaller thing I self host, and would end up
           | wanting to do way more evals if I self hosted a smaller
           | model.
           | 
           | (Potentially interesting aside: I'd say I trust new GLM
           | models similarly to the big 3, but they're too big for most
           | people to self host)
        
           | blharr wrote:
           | Where can I find information on self-hosting models success
           | stories? All of it seems like throwing tens of thousands away
           | on compute for it to work worse than the standard providers.
           | The self-hosted models seem to get out of date, too. Or there
           | ends up being good reasons (improved performance) to replace
           | them
        
         | jmathai wrote:
         | You may also be getting a worse result for higher cost.
         | 
         | For a medical use case, we tested multiple Anthropic and OpenAI
          | models as well as MedGemma. We were pleasantly surprised
          | when the LLM-as-judge scored gpt5-mini as the clear winner.
          | I don't think I would have considered it for these specific
          | use cases, assuming higher reasoning was necessary.
         | 
         | Still waiting on human evaluation to confirm the LLM Judge was
         | correct.
        
           | andy99 wrote:
           | You obviously know what you're looking for better than me,
           | but personally I'd want to see a narrative that made sense
           | before accepting that a smaller model somehow just performs
           | better, even if the benchmarks say so. There may be such an
           | explanation, it feels very dicey without one.
        
             | jmathai wrote:
             | Volume and statistical significance? I'm not sure what kind
             | of narrative I would trust beyond the actual data.
             | 
             | It's the hard part of using LLMs and a mistake I think many
             | people make. The only way to really understand or know is
             | to have repeatable and consistent frameworks to validate
             | your hypothesis (or in my case, have my hypothesis be
             | proved wrong).
             | 
             | You can't get to 100% confidence with LLMs.
        
           | lorey wrote:
           | That's interesting. Similarly, we found out that for very
           | simple tasks the older Haiku models are interesting as
           | they're cheaper than the latest Haiku models and often
           | perform equally well.
        
         | lorey wrote:
          | You're right. We did a few use cases and I have to admit that
          | while customer service is the easiest to explain, it's where
          | I'd also not choose the cheapest model, for said reasons.
        
       | epolanski wrote:
       | The author of this post should benchmark his own blog for
       | accessibility metrics, text contrast is dreadful..
       | 
       | On the other hand, this would be interesting for measuring agents
       | in coding tasks, but there's quite a lot of context to provide
       | here, both input and output would be massive.
        
         | lorey wrote:
         | Appreciate the feedback, will work on that.
        
           | faeyanpiraat wrote:
           | One more vote on fixing contrast from me.
        
             | lorey wrote:
             | Will fix, thanks :)
        
               | faeyanpiraat wrote:
                | Tried Evalry, it's a really nice concept, thanks for
               | sharing it!
        
           | epolanski wrote:
           | Do you have any insights on the platform evaluation for
           | coding tasks?
        
         | lorey wrote:
         | Pushed a fix. Could you check, please?
         | 
         | Any resources you can recommend to properly tackle this going
         | forward?
        
       | gridspy wrote:
       | Wow, this was some slick long form sales work. I hope your SaaS
       | goes well. Nice one!
        
       | hamiltont wrote:
       | Anecdotal tip on LLM-as-judge scoring - Skip the 1-10 scale, use
       | boolean criteria instead, then weight manually e.g.
       | 
        | - Did it cite the 30-day return policy? Y/N
        | - Tone professional and empathetic? Y/N
        | - Offered clear next steps? Y/N
        | 
        | Then: 0.5 * accuracy + 0.3 * tone + 0.2 * next_steps
       | 
       | Why: Reduces volatility of responses while still maintaining
       | creativeness (temperature) needed for good intuition
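        | 
        | A minimal sketch of that rubric in code. The judge step is a
        | hypothetical stand-in, faked here with keyword checks so the
        | example runs; in practice it would be a Y/N prompt to a judge
        | model:

```python
# Sketch of LLM-as-judge scoring with boolean criteria and manual
# weights, as described in the comment above. judge() is a
# hypothetical placeholder: a real system would prompt a judge model
# for each Y/N question instead of checking keywords.

WEIGHTS = {
    "cites_return_policy": 0.5,  # "accuracy" in the formula above
    "professional_tone": 0.3,
    "clear_next_steps": 0.2,
}

def judge(criterion: str, response: str) -> bool:
    # Placeholder for a Y/N judge-model call, faked with keywords
    # so the sketch is runnable.
    keyword = {"cites_return_policy": "30-day",
               "professional_tone": "thank",
               "clear_next_steps": "next"}[criterion]
    return keyword in response.lower()

def score(response: str) -> float:
    # Weighted sum of boolean checks -> a stable score in [0, 1].
    return sum(w for crit, w in WEIGHTS.items() if judge(crit, response))

reply = ("Thank you for reaching out! Per our 30-day return policy "
         "you can send the item back; your next step is the portal.")
print(score(reply))  # -> 1.0 when all three checks pass
```

        | The weighted sum stays comparatively stable run-to-run because
        | each criterion is a yes/no call rather than a 1-10 guess.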
        
         | pocketarc wrote:
         | I use this approach for a ticket based customer support agent.
         | There are a bunch of boolean checks that the LLM must pass
         | before its response is allowed through. Some are hard fails,
         | others, like you brought up, are just a weighted ding to the
         | response's final score.
         | 
         | Failures are fed back to the LLM so it can regenerate taking
         | that feedback into account. People are much happier with it
         | than I could have imagined, though it's definitely not cheap
         | (but the cost difference is very OK for the tradeoff).
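          | 
          | A rough sketch of that gate-and-regenerate loop, with
          | generate() and the specific checks as hypothetical stand-ins
          | rather than the actual system:

```python
# Sketch of a gated response loop: hard-fail checks block the reply
# outright (and feed the failure back for regeneration), soft-fail
# checks only lower the final score. generate() and the checks are
# hypothetical stand-ins, not a real support agent.

HARD_CHECKS = {
    "no_refund_promise": lambda r: "guaranteed refund" not in r.lower(),
}
SOFT_CHECKS = {
    # name -> (weight ding, check)
    "mentions_ticket_id": (0.2, lambda r: "ticket" in r.lower()),
}

def generate(prompt: str) -> str:
    # Placeholder for the real LLM call, canned so the sketch runs.
    if "ticket" in prompt:
        return "We have opened a ticket for you."
    return "We will look into it."

def answer(prompt: str, max_attempts: int = 3) -> tuple[str, float]:
    for _ in range(max_attempts):
        reply = generate(prompt)
        failed_hard = [n for n, ok in HARD_CHECKS.items() if not ok(reply)]
        if failed_hard:
            # Feed the failed checks back and regenerate.
            prompt += "\nAvoid: " + ", ".join(failed_hard)
            continue
        # Soft failures only ding the score.
        score = 1.0 - sum(w for n, (w, ok) in SOFT_CHECKS.items()
                          if not ok(reply))
        return reply, score
    raise RuntimeError("no acceptable reply after retries")

reply, score = answer("Customer asks about their ticket")
print(score)  # 1.0: the hard check passes and the soft check is met
```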
        
         | Imustaskforhelp wrote:
          | This actually seems like really good advice. I am interested
          | in how you might tweak this for things like programming
          | language benchmarks.
          | 
          | By having independent tests and then seeing if the model
          | passes them (yes or no), and then weighting some (more
          | complicated) tasks higher than others? Or how exactly?
        
           | hamiltont wrote:
           | Not sure I'm fully following your question, but maybe this
           | helps:
           | 
            | IME deep thinking has moved from upfront architecture to
           | post-prototype analysis.
           | 
           | Pre-LLM: Think hard - design carefully - write deterministic
           | code - minor debugging
           | 
           | With LLMs: Prototype fast - evaluate failures - think hard
           | about prompts/task decomposition - iterate
           | 
            | When your system logic is probabilistic, you can't fully
            | architect in advance; you need empirical feedback. So I spend
           | most time analyzing failure cases: "this prompt generated X
           | which failed because Y, how do I clarify requirements?" Often
           | I use an LLM to help debug the LLM.
           | 
           | The shift: from "design away problems" to "evaluate into
           | solutions."
        
         | lorey wrote:
         | Yes, absolutely. This aligns with what we found. It seems to be
         | necessary to be very clear on scoring (at least for Opus 4.5).
        
         | 46493168 wrote:
         | Isn't this just rubrics?
        
           | 8note wrote:
            | It's a weighted decision matrix.
        
         | piskov wrote:
         | How come accuracy has only 50% weight?
         | 
         | "You're absolutely right! Nice catch how I absolutely fooled
         | you"
        
         | tomjakubowski wrote:
         | Funny, this move is exactly what YouTube did to their system of
         | human-as-judge video scoring, which was a 1-5 scale before they
         | made it thumbs up/thumbs down in 2010.
        
           | jorvi wrote:
           | I hate thumbs up/down. 2 values is too little. I understand
           | that 5 was maybe too much, but thumbs up/down systems need an
           | explicit third "eh, it's okay" value for things I don't hate,
           | don't want to save to my library, but I would like the system
           | to know I have an opinion on.
           | 
           | I know that consuming something and not thumbing it up/down
            | sort-of does that, but it's a vague enough signal (one that
            | could also mean "not close enough to the keyboard / remote
            | to thumbs up/down") that recommendation systems can't count
            | it as an
           | explicit choice.
        
             | steveklabnik wrote:
             | Here's the discussion from back in the day when this
             | changed: https://news.ycombinator.com/item?id=837698
             | 
             | In practice, people generally didn't even vote with two
             | options, they voted with one!
             | 
             | IIRC youtube did even get rid of downvotes for a while, as
             | they were mostly used for brigading.
        
               | PunchyHamster wrote:
               | > IIRC youtube did even get rid of downvotes for a while,
               | as they were mostly used for brigading.
               | 
               | No, they got rid of them most likely because advertisers
               | complained that when they dropped some flop they got
               | negative press from media going "lmao 90% dislike rate on
               | new trailer of <X>".
               | 
                | Stuff disliked to oblivion was either just straight-out
                | bad or wrong (in the case of bad tutorials/info), and
                | brigading was a very tiny percentage of it.
        
       | deepsquirrelnet wrote:
        | This is just evaluation, not "benchmarking". If you haven't
        | set up evaluation on something you're putting into production,
        | then what are you even doing?
       | 
       | Stop prompt engineering, put down the crayons. Statistical model
       | outputs need to be evaluated.
        
         | andy99 wrote:
         | What does that look like in your opinion, what do you use?
        
         | lorey wrote:
          | This went straight to prod, even earlier than I'd have opted
         | What do you mean?
        
           | deepsquirrelnet wrote:
           | I'm totally in alignment with your blog post (other than
           | terminology). I meant it more as a plea to all these projects
           | that are trying to go into production without any measures of
           | performance behind them.
           | 
           | It's shocking to me how often it happens. Aside from just the
           | necessity to be able to prove something works, there are so
           | many other benefits.
           | 
           | Cost and model commoditization are part of it like you point
            | out. There's also the potential for degraded performance
            | because off-the-shelf benchmarks aren't generalizing how
            | you
           | expect. Add to that an inability to migrate to newer models
           | as they come out, potentially leaving performance on the
           | table. There's like 95 serverless models in bedrock now, and
           | as soon as you can evaluate them on your task they
           | immediately become a commodity.
           | 
           | But fundamentally you can't even justify any time spent on
           | prompt engineering if you don't have a framework to evaluate
           | changes.
           | 
            | Evaluation has been a critical practice in machine learning
            | for years. IMO it is no less imperative when building with
            | LLMs.
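            | 
            | A minimal sketch of such an evaluation harness, with
            | run_model() as a hypothetical stand-in for a real API call
            | (canned outputs so the example runs):

```python
# Sketch of a tiny evaluation harness: run each prompt variant (or
# each model) over the same labeled cases and report a pass rate,
# so prompt or model changes are justified by data rather than
# intuition. run_model() is a hypothetical placeholder.

CASES = [
    ("What is 2+2?", "4"),
    ("Capital of France?", "paris"),
]

def run_model(prompt_prefix: str, question: str) -> str:
    # Placeholder for a real LLM API call; canned answers here.
    return {"What is 2+2?": "4",
            "Capital of France?": "Paris"}[question]

def pass_rate(prompt_prefix: str) -> float:
    # Fraction of cases where the expected answer appears in the output.
    hits = sum(expected.lower() in run_model(prompt_prefix, q).lower()
               for q, expected in CASES)
    return hits / len(CASES)

print(pass_rate("Answer concisely: "))  # -> 1.0 with the canned outputs
```

            | The same loop lets you swap in a cheaper model and see
            | directly whether the pass rate actually drops.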
        
       | nickphx wrote:
       | ah yes... nothing like using another nondeterministic black box
       | of nonsense to judge / rate the output of another.. then charge
       | others for it.. lol
        
         | coredog64 wrote:
         | Amazon Bedrock Guardrails uses a purpose-built model to look
         | for safety issues in the model inputs/outputs. While you won't
         | get any specific guarantees from AWS, they will point you at
         | datasets that you can use to evaluate the product and then
         | determine if it's fit for purpose according to your risk
         | tolerance.
        
       | OutOfHere wrote:
       | You don't need a fancy UI to try the mini model first.
        
       | empiko wrote:
       | I do not disagree with the post, but I am surprised that a post
       | that is basically explaining very basic dataset construction is
       | so high up here. But I guess most people just read the headline?
        
       | ebla wrote:
       | Aren't you supposed to customize the prompts to the specific
       | models?
        
         | lorey wrote:
         | I've skipped that in the article, but absolutely!
        
       | tantalor wrote:
       | > it's the default: You have the API already
       | 
       | Sorry, this just makes no sense to start off with. What do you
       | mean?
        
         | lorey wrote:
         | Fixed, thanks. Not a native speaker.
        
       | iFire wrote:
        | I love the user experience for your product. You're giving a
        | free demo with results within 5 minutes and then encouraging
        | the customer to "sign in" for more than 10 prompts.
       | 
       | Presumably that'll be some sort of funnel for a paid upload of
       | prompts.
        
         | iFire wrote:
         | https://evalry.com/question-benchmarks/game-engine-assistant...
         | 
          | Here's a bug report: switching the model group makes the API
          | hang in private mode.
        
           | lorey wrote:
           | Thanks. Will take a look.
        
           | iFire wrote:
            | Heads up, I think I broke the site.
        
             | lorey wrote:
             | It's not you, it's the HN hug of death. There's so much
             | load on the server, I'm barely able to download the redis
             | image I need for caching...
        
       | dizhn wrote:
       | I paid a total of 13 US Dollars for all my llm usage in about 3
       | years. Should I analyze my providers and see if there's room for
       | improvement?
        
         | lorey wrote:
         | Depends on your remaining budget ;)
        
           | dizhn wrote:
           | That is absolutely right. :)
        
         | regenschutz wrote:
          | How? All LLM-as-a-Service offerings are prohibitively
          | expensive for me. $13 over 3 years sounds too good to be
          | true.
        
           | dizhn wrote:
           | All local CLIs with free to use models. CLIs are opencode,
           | iflow, qwen, gemini.
           | 
           | What I did splurge on was brief openai access for some
           | subtitle translator program and when I used the deepseek api.
           | Actually I think that $13 includes some as yet unused
           | credits. :D
           | 
            | I'd be happy to provide details if CLIs are an option and
            | you don't mind some sweatshop agent. :)
           | 
           | (I am just now noticing I meant to type 2 years not 3 above.
           | Sorry about that.)
        
       | Havoc wrote:
        | I'm also collecting the data on my side with the hope of later
        | using it to fine-tune a tiny model. Unsure whether it'll work,
        | but if I'm using APIs anyway I may as well gather it and try
        | to bottle some of that magic of the bigger models
        
       | wolttam wrote:
       | I'm consistently amazed at how much some individuals spend on
       | LLMs.
       | 
       | I get a good amount of non-agentic use out of them, and pay
       | literally less than $1/month for GLM-4.7 on deepinfra.
       | 
       | I can imagine my costs might rise to $20-ish/month if I used that
       | model for agentic tasks... still a very far cry from the
       | $1000-$1500 some spend.
        
         | lorey wrote:
          | Doesn't this depend a lot on private vs company usage?
          | There's no way I could spend more than a few hundred on my
          | own, but when you run prompts on 1M entities in some
          | corporate use case, this will incur costs no matter how
          | cheap the model usage is.
        
       ___________________________________________________________________
       (page generated 2026-01-21 23:00 UTC)