[HN Gopher] Without benchmarking LLMs, you're likely overpaying
___________________________________________________________________
Without benchmarking LLMs, you're likely overpaying
Author : lorey
Score : 123 points
Date : 2026-01-20 19:03 UTC (1 days ago)
(HTM) web link (karllorey.com)
(TXT) w3m dump (karllorey.com)
| petcat wrote:
| > He's a non-technical founder building an AI-powered business.
|
| It sounds like he's building some kind of ai support chat bot.
|
| I despise these things.
| r_lee wrote:
| And the whole article is about promoting his benchmarking
| service, of course.
| montroser wrote:
| The whole post is just an advert for this person's startup.
| Their "friend" doesn't exist...
| lorey wrote:
| Totally agree with your point. While I can't say specifically,
| it's a traditional (German) business he's doing vertically
| integrated with AI. Customer support is really bad in this
| traditional niche and by leveraging AI on top of doing the
| support himself 24/7, he was able to make it his competitive
| edge.
| verdverm wrote:
| I'd second this wholeheartedly
|
| Since building a custom agent setup to replace copilot,
| adopting/adjusting Claude Code prompts, and giving it basic
| tools, gemini-3-flash is my go-to model unless I know it's a big
| and involved task. The model is really good at 1/10 the cost of
| pro, super fast by comparison, and some basic a/b testing shows
| little to no difference in output on the majority of tasks I used
|
| Cut all my subs, spend less money, don't get rate limited
| r_lee wrote:
| Plus I've found that overall with "thinking" models, it's more
| like for memory, not even actual perf boost, it might even be
| worse because if it goes even slightly wrong on the "thinking"
| part, it'll then commit to that for the actual response
| verdverm wrote:
| for sure, the difference in the most recent model generations
| makes them far more useful for many daily tasks. This is the
| first gen with thinking as a significant mid-training focus
| and it shows
|
| gemini-3-flash stands well above gemini-2.5-pro
| dpoloncsak wrote:
| Yeah, one of my first projects one of my buddies asked "Why
| aren't you using [ChatGPT 4.0] nano? It's 99% the effectiveness
| with 10% the price."
|
| I've been using the smaller models ever since. Nano/mini,
| flash, etc.
| phainopepla2 wrote:
| I have been benchmarking many of my use cases, and the GPT
| Nano models have fallen completely flat on every single one
| except for very short summaries. I would call them 25%
| effectiveness at best.
| verdverm wrote:
| Flash is not a small model, it's still over 1T parameters.
| It's a hyper MoE aiui
|
| I have yet to go back to small models; I'm waiting on an
| upstream feature, and my GPU provider has been seeing capacity
| issues, so I am sticking with the gemini family for now
| walthamstow wrote:
| Flash Lite 2.5 is an unbelievably good model for the price
| sixtyj wrote:
| Yup.
|
| I have found out recently that Grok-4.1-fast has similar
| pricing (in cents) but 10x larger context window (2M tokens
| instead of ~128-200k of gpt-4-1-nano). And ~4% hallucination,
| lowest in blind tests in LLM arena.
| PunchyHamster wrote:
| LLM bubble will burst the second investors figure out how much
| a well-managed local model can do
| andy99 wrote:
| Depends on what you're doing. Using the smaller / cheaper LLMs
| will generally make it way more fragile. The article appears to
| focus on creating a benchmark dataset with real examples. For
| lots of applications, especially if you're worried about people
| messing with it, about weird behavior on edge cases, about
| stability, you'd have to do a bunch of robustness testing as
| well, and bigger models will be better.
|
| Another big problem is it's hard to set objectives in many cases,
| and for example maybe your customer service chat still passes but
| comes across worse for a smaller model.
|
| I'd be careful is all.
| candiddevmike wrote:
| One point in favor of smaller/self-hosted LLMs: more consistent
| performance, and you control your upgrade cadence, not the
| model providers.
|
| I'd push everyone to self-host models (even if it's on a shared
| compute arrangement), as no enterprise I've worked with is
| prepared for the churn of keeping up with the hosted model
| release/deprecation cadence.
| andy99 wrote:
| How much you value control is one part of the optimization
| problem. Obviously self hosting gives you more but it costs
| more, and re evals, I trust GPT, Gemini, and Claude a lot
| more than some smaller thing I self host, and would end up
| wanting to do way more evals if I self hosted a smaller
| model.
|
| (Potentially interesting aside: I'd say I trust new GLM
| models similarly to the big 3, but they're too big for most
| people to self host)
| blharr wrote:
| Where can I find information on self-hosting models success
| stories? All of it seems like throwing tens of thousands away
| on compute for it to work worse than the standard providers.
| The self-hosted models seem to get out of date, too. Or there
| ends up being good reasons (improved performance) to replace
| them
| jmathai wrote:
| You may also be getting a worse result for higher cost.
|
| For a medical use case, we tested multiple Anthropic and OpenAI
| models as well as MedGemma. Pleasantly surprised when the LLM
| as Judge scored gpt5-mini as the clear winner. I don't think I
| would have considered using it for the specific use cases -
| assuming higher reasoning was necessary.
|
| Still waiting on human evaluation to confirm the LLM Judge was
| correct.
| andy99 wrote:
| You obviously know what you're looking for better than me,
| but personally I'd want to see a narrative that made sense
| before accepting that a smaller model somehow just performs
| better, even if the benchmarks say so. There may be such an
| explanation, it feels very dicey without one.
| jmathai wrote:
| Volume and statistical significance? I'm not sure what kind
| of narrative I would trust beyond the actual data.
|
| It's the hard part of using LLMs and a mistake I think many
| people make. The only way to really understand or know is
| to have repeatable and consistent frameworks to validate
| your hypothesis (or in my case, have my hypothesis be
| proved wrong).
|
| You can't get to 100% confidence with LLMs.
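|
| A minimal sketch of what "volume and statistical significance" can
| look like when comparing two models' pass rates on the same kind of
| eval set. The function name and the counts are made up for
| illustration; it uses a plain two-proportion z-test, which assumes
| the two samples are independent. If both models are scored on the
| exact same prompts, a paired test is probably the better fit.

```python
import math


def two_proportion_z(pass_a: int, n_a: int, pass_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in pass rates.

    Hypothetical helper: pass_* are counts of passed eval cases,
    n_* are the number of cases each model was run on.
    """
    p_a = pass_a / n_a
    p_b = pass_b / n_b
    # Pooled pass rate under the null hypothesis that both models are equal.
    pooled = (pass_a + pass_b) / (n_a + n_b)
    variance = pooled * (1 - pooled) * (1 / n_a + 1 / n_b)
    if variance == 0:
        raise ValueError("pass rates are all 0% or all 100%; the test is undefined")
    z = (p_a - p_b) / math.sqrt(variance)
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: model A passes 180 of 200 cases, model B passes 160 of 200.
z, p = two_proportion_z(180, 200, 160, 200)
print(f"z = {z:.2f}, p = {p:.4f}")
```

| With only a few dozen cases per model, the same ten-point gap would
| not clear a conventional 0.05 threshold, which is the "volume" part.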
| lorey wrote:
| That's interesting. Similarly, we found out that for very
| simple tasks the older Haiku models are interesting as
| they're cheaper than the latest Haiku models and often
| perform equally well.
| lorey wrote:
| You're right. We did a few use cases and I have to admit that
| while customer service is easiest to explain, it's where I'd
| also not choose the cheapest model for said reasons.
| epolanski wrote:
| The author of this post should benchmark his own blog for
| accessibility metrics, text contrast is dreadful..
|
| On the other hand, this would be interesting for measuring agents
| in coding tasks, but there's quite a lot of context to provide
| here, both input and output would be massive.
| lorey wrote:
| Appreciate the feedback, will work on that.
| faeyanpiraat wrote:
| One more vote on fixing contrast from me.
| lorey wrote:
| Will fix, thanks :)
| faeyanpiraat wrote:
| Tried Evalry, it's a really nice concept, thanks for
| sharing it!
| epolanski wrote:
| Do you have any insights on the platform evaluation for
| coding tasks?
| lorey wrote:
| Pushed a fix. Could you check, please?
|
| Any resources you can recommend to properly tackle this going
| forward?
| gridspy wrote:
| Wow, this was some slick long form sales work. I hope your SaaS
| goes well. Nice one!
| hamiltont wrote:
| Anecdotal tip on LLM-as-judge scoring - Skip the 1-10 scale, use
| boolean criteria instead, then weight manually e.g.
|
| - Did it cite the 30-day return policy? Y/N
| - Tone professional and empathetic? Y/N
| - Offered clear next steps? Y/N
|
| Then: 0.5 * accuracy + 0.3 * tone + 0.2 * next_steps
|
| Why: Reduces volatility of responses while still maintaining
| creativeness (temperature) needed for good intuition
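|
| A rough sketch of the weighting step above, assuming the judge has
| already returned one Y/N verdict per criterion. The dictionary keys
| and the function name are illustrative, not from any particular
| library; the weights are the ones from the formula above.

```python
# Weights from the example formula; criterion names are illustrative.
WEIGHTS = {"accuracy": 0.5, "tone": 0.3, "next_steps": 0.2}


def weighted_score(verdicts: dict[str, bool]) -> float:
    """Combine boolean judge verdicts into a single score in [0, 1]."""
    missing = WEIGHTS.keys() - verdicts.keys()
    if missing:
        # Fail loudly rather than silently treating a missing verdict as "N".
        raise ValueError(f"missing verdicts: {sorted(missing)}")
    # Each passed criterion contributes its full weight, each failed one nothing.
    return sum(weight for name, weight in WEIGHTS.items() if verdicts[name])


# Cited the policy and offered next steps, but the tone check failed.
score = weighted_score({"accuracy": True, "tone": False, "next_steps": True})
print(round(score, 2))
```

| Keeping the weights in code rather than in the judge prompt means
| they can be changed without re-running the judge.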
| pocketarc wrote:
| I use this approach for a ticket based customer support agent.
| There are a bunch of boolean checks that the LLM must pass
| before its response is allowed through. Some are hard fails,
| others, like you brought up, are just a weighted ding to the
| response's final score.
|
| Failures are fed back to the LLM so it can regenerate taking
| that feedback into account. People are much happier with it
| than I could have imagined, though it's definitely not cheap
| (but the cost difference is very OK for the tradeoff).
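|
| One way the gate-and-regenerate loop described above could be
| sketched. Everything here is hypothetical: generate() stands in for
| the LLM call, the checks are plain functions instead of judge
| calls, and the threshold and attempt limit are arbitrary.

```python
from typing import Callable, Optional

Check = Callable[[str], bool]


def gated_reply(
    generate: Callable[[list[str]], str],
    hard_checks: dict[str, Check],
    soft_checks: dict[str, tuple[Check, float]],
    min_score: float = 0.7,
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the first draft that passes the gate, or None if none does."""
    feedback: list[str] = []
    for _ in range(max_attempts):
        draft = generate(feedback)
        # Hard checks block the response outright when they fail.
        failed_hard = [name for name, check in hard_checks.items() if not check(draft)]
        # Soft checks only subtract their weight from the final score.
        failed_soft = [name for name, (check, _) in soft_checks.items() if not check(draft)]
        score = 1.0 - sum(soft_checks[name][1] for name in failed_soft)
        if not failed_hard and score >= min_score:
            return draft
        # Feed the names of the failed checks into the next attempt.
        feedback = failed_hard + failed_soft
    return None


# Stand-in generator: gives a bad draft first, a better one once it has feedback.
def fake_generate(feedback: list[str]) -> str:
    return "Sorry. You have 30 days to return it." if feedback else "No."


hard = {"mentions_policy": lambda d: "30 days" in d}
soft = {"apologizes": (lambda d: "sorry" in d.lower(), 0.4)}
print(gated_reply(fake_generate, hard, soft))
```

| Returning None after the last attempt leaves it to the caller to
| decide what happens next, for example handing off to a human.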
| Imustaskforhelp wrote:
| This actually seems really good advice. I am interested how you
| might tweak this to things like programming languages
| benchmarks?
|
| By having independent tests and then seeing if it passes them
| (yes or no) and then evaluating and having some (more
| complicated tasks) be valued more than not or how exactly.
| hamiltont wrote:
| Not sure I'm fully following your question, but maybe this
| helps:
|
| IME deep thinking has moved from upfront architecture to
| post-prototype analysis.
|
| Pre-LLM: Think hard - design carefully - write deterministic
| code - minor debugging
|
| With LLMs: Prototype fast - evaluate failures - think hard
| about prompts/task decomposition - iterate
|
| When your system logic is probabilistic, you can't fully
| architect in advance--you need empirical feedback. So I spend
| most time analyzing failure cases: "this prompt generated X
| which failed because Y, how do I clarify requirements?" Often
| I use an LLM to help debug the LLM.
|
| The shift: from "design away problems" to "evaluate into
| solutions."
| lorey wrote:
| Yes, absolutely. This aligns with what we found. It seems to be
| necessary to be very clear on scoring (at least for Opus 4.5).
| 46493168 wrote:
| Isn't this just rubrics?
| 8note wrote:
| it's a weighted decision matrix.
| piskov wrote:
| How come accuracy has only 50% weight?
|
| "You're absolutely right! Nice catch how I absolutely fooled
| you"
| tomjakubowski wrote:
| Funny, this move is exactly what YouTube did to their system of
| human-as-judge video scoring, which was a 1-5 scale before they
| made it thumbs up/thumbs down in 2010.
| jorvi wrote:
| I hate thumbs up/down. 2 values is too little. I understand
| that 5 was maybe too much, but thumbs up/down systems need an
| explicit third "eh, it's okay" value for things I don't hate,
| don't want to save to my library, but I would like the system
| to know I have an opinion on.
|
| I know that consuming something and not thumbing it up/down
| sort-of does that, but it's a vague enough signal (that could
| also mean "not close enough to keyboard / remote to thumbs
| up/down") that recommendation systems can't count it as an
| explicit choice.
| steveklabnik wrote:
| Here's the discussion from back in the day when this
| changed: https://news.ycombinator.com/item?id=837698
|
| In practice, people generally didn't even vote with two
| options, they voted with one!
|
| IIRC youtube did even get rid of downvotes for a while, as
| they were mostly used for brigading.
| PunchyHamster wrote:
| > IIRC youtube did even get rid of downvotes for a while,
| as they were mostly used for brigading.
|
| No, they got rid of them most likely because advertisers
| complained that when they dropped some flop they got
| negative press from media going "lmao 90% dislike rate on
| new trailer of <X>".
|
| Stuff disliked to oblivion was usually just straight-out bad or
| wrong (in the case of bad tutorials/info); brigading was a very
| tiny percentage of it.
| deepsquirrelnet wrote:
| This is just evaluation, not "benchmarking". If you haven't set up
| evaluation on something you're putting into production then what
| are you even doing.
|
| Stop prompt engineering, put down the crayons. Statistical model
| outputs need to be evaluated.
| andy99 wrote:
| What does that look like in your opinion, what do you use?
| lorey wrote:
| This went straight to prod, even earlier than I'd opted for.
| What do you mean?
| deepsquirrelnet wrote:
| I'm totally in alignment with your blog post (other than
| terminology). I meant it more as a plea to all these projects
| that are trying to go into production without any measures of
| performance behind them.
|
| It's shocking to me how often it happens. Aside from just the
| necessity to be able to prove something works, there are so
| many other benefits.
|
| Cost and model commoditization are part of it like you point
| out. There's also the potential for degraded performance
| because off-the-shelf benchmarks aren't generalizing how you
| expect. Add to that an inability to migrate to newer models
| as they come out, potentially leaving performance on the
| table. There's like 95 serverless models in bedrock now, and
| as soon as you can evaluate them on your task they
| immediately become a commodity.
|
| But fundamentally you can't even justify any time spent on
| prompt engineering if you don't have a framework to evaluate
| changes.
|
| Evaluation has been a critical practice in machine learning
| for years. IMO it is no less imperative when building with llms.
| nickphx wrote:
| ah yes... nothing like using another nondeterministic black box
| of nonsense to judge / rate the output of another.. then charge
| others for it.. lol
| coredog64 wrote:
| Amazon Bedrock Guardrails uses a purpose-built model to look
| for safety issues in the model inputs/outputs. While you won't
| get any specific guarantees from AWS, they will point you at
| datasets that you can use to evaluate the product and then
| determine if it's fit for purpose according to your risk
| tolerance.
| OutOfHere wrote:
| You don't need a fancy UI to try the mini model first.
| empiko wrote:
| I do not disagree with the post, but I am surprised that a post
| that is basically explaining very basic dataset construction is
| so high up here. But I guess most people just read the headline?
| ebla wrote:
| Aren't you supposed to customize the prompts to the specific
| models?
| lorey wrote:
| I've skipped that in the article, but absolutely!
| tantalor wrote:
| > it's the default: You have the API already
|
| Sorry, this just makes no sense to start off with. What do you
| mean?
| lorey wrote:
| Fixed, thanks. Not a native speaker.
| iFire wrote:
| I love the user experience for your product. You're giving a free
| demo with results within 5 minutes and then encourage the
| customer to "sign in" for more than 10 prompts.
|
| Presumably that'll be some sort of funnel for a paid upload of
| prompts.
| iFire wrote:
| https://evalry.com/question-benchmarks/game-engine-assistant...
|
| Here's a bug report, by switching the model group the api hangs
| in private mode.
| lorey wrote:
| Thanks. Will take a look.
| iFire wrote:
| Headsup I think I broke the site.
| lorey wrote:
| It's not you, it's the HN hug of death. There's so much
| load on the server, I'm barely able to download the redis
| image I need for caching...
| dizhn wrote:
| I paid a total of 13 US Dollars for all my llm usage in about 3
| years. Should I analyze my providers and see if there's room for
| improvement?
| lorey wrote:
| Depends on your remaining budget ;)
| dizhn wrote:
| That is absolutely right. :)
| regenschutz wrote:
| How? All LLM-as-a-Service offerings are prohibitively expensive for me.
| $13 over 3 years sounds too-good-to-be-true.
| dizhn wrote:
| All local CLIs with free to use models. CLIs are opencode,
| iflow, qwen, gemini.
|
| What I did splurge on was brief openai access for some
| subtitle translator program and when I used the deepseek api.
| Actually I think that $13 includes some as yet unused
| credits. :D
|
| I'd be happy to provide details if CLIs are an option and you
| don't mind some sweatshop agent. :)
|
| (I am just now noticing I meant to type 2 years not 3 above.
| Sorry about that.)
| Havoc wrote:
| I'm also collecting the data on my side with the hopes of later
| using it to fine-tune a tiny model. Unsure whether it'll
| work but if I'm using APIs anyway may as well gather it and try
| to bottle some of that magic of using bigger models
| wolttam wrote:
| I'm consistently amazed at how much some individuals spend on
| LLMs.
|
| I get a good amount of non-agentic use out of them, and pay
| literally less than $1/month for GLM-4.7 on deepinfra.
|
| I can imagine my costs might rise to $20-ish/month if I used that
| model for agentic tasks... still a very far cry from the
| $1000-$1500 some spend.
| lorey wrote:
| Doesn't this depend a lot on private vs company usage? There's
| no way I could spend more than a few hundred alone, but when
| you run prompts on 1M entities in some corporate use case, this
| will incur costs, no matter how cheap the model usage.
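|
| A back-of-the-envelope sketch of why volume dominates here. The
| token counts and per-million-token prices below are placeholders
| chosen for illustration, not quotes from any provider.

```python
def batch_cost_usd(
    n_items: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Estimate the total cost of running one prompt per item.

    Prices are in USD per one million tokens; token counts are per item.
    """
    per_item = (
        input_tokens * input_price_per_m + output_tokens * output_price_per_m
    ) / 1_000_000
    return n_items * per_item


# Placeholder workload: 1M entities, 1,500 input and 300 output tokens each.
cheap = batch_cost_usd(1_000_000, 1_500, 300, 0.10, 0.40)
pricey = batch_cost_usd(1_000_000, 1_500, 300, 3.00, 15.00)
print(f"cheap model:  ${cheap:,.0f}")
print(f"pricey model: ${pricey:,.0f}")
```

| Under these assumed numbers the same job differs by more than an
| order of magnitude in cost, which is the case for checking whether
| the cheaper model passes the eval first.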
___________________________________________________________________
(page generated 2026-01-21 23:00 UTC)