[HN Gopher] Coping with dumb LLMs using classic ML
___________________________________________________________________
Coping with dumb LLMs using classic ML
Author : fzliu
Score : 155 points
Date : 2025-01-22 09:25 UTC (2 days ago)
(HTM) web link (softwaredoug.com)
(TXT) w3m dump (softwaredoug.com)
| Vampiero wrote:
| Wake me up when LLMs are good at Problog, because that's the day
| we can finally rest
| kvgr wrote:
| The amount of hallucination I get when trying to write code is
| amazing. I mean, it can get the core concepts of the language
| and can create structure/algorithms. But it often makes up
| objects/values when I ask questions. Example: it suggested
| TextLayoutResult.size, which is an Int value. I asked if it is
| width and height, and it wrote that it has size.height and also
| size.width. Which it does not. I am now writing production code
| and also evaluating the LLMs that our management thinks will
| save us a shitload of time. We will get there sometime, but the
| push from management is not compatible with the state of the
| LLMs. (I use Claude 3.5 Sonnet now, as it is also built into
| some of the "AI IDEs".)
| antihipocrat wrote:
| You're not alone. In my experience the senior executives are
| enamoured by the possibility of halving headcount. The
| engineers reporting honestly about the limitations of
| connecting it to core systems (or using it to generate
| complex code running on core systems) are at risk of being
| perceived as blocking progress. So everyone keeps quiet,
| tries to find a quick and safe use case for the tech to
| present to management, and makes sure that they aren't
| involved in any project that will be the big one to fail
| spectacularly and bring it all crashing down.
| ZaoLahma wrote:
| What irks me is how LLMs won't just say "no, it won't work"
| or "it's beyond my capabilities" and instead just give you
| "solutions" that are wrong.
|
| Codeium for example will absolutely bend over backwards to
| provide you with solutions to requests that can't be
| satisfied, producing more and more garbage for every attempt.
| I don't think I've ever seen it just say no.
|
| ChatGPT is marginally better and will sometimes tell you
| straight up that an algorithm can't be rewritten as you
| suggest, because of ... But sometimes it too will produce
| garbage in its attempts at doing something impossible that
| you ask it to do.
| epcoa wrote:
| > ChatGPT is marginally better and will sometimes tell you
| straight up that an algorithm can't be rewritten as you
| suggest
|
| Unfortunately it very often gets this wrong, especially if
| it involves some multistep process.
| dingnuts wrote:
| >What irks me is how LLMs won't just say "no, it won't
| work" or "it's beyond my capabilities" and instead just
| give you "solutions" that are wrong.
|
| This is one of the clearest ways to demonstrate that an LLM
| doesn't "know" anything, and isn't "intelligence." Until an
| LLM can determine whether its own output is based on
| something or completely made up, it's not intelligent. I
| find them downright infuriating to use because of this
| property.
|
| I'm glad to see other people are waking up
| genewitch wrote:
| Intelligence doesn't imply knowing when you're wrong
| though.
|
| Hackernews has Intelligent people...
|
| Q. E. D.
|
| % LLMs can RAG incorrect PDF citations too
| scarface_74 wrote:
| That's an easily solvable problem for programming. Today
| ChatGPT has an embedded Python runtime that it can use to
| verify its own code, and I have seen times when it will
| try different techniques if the code doesn't give the
| expected answer. The one time I can remember is with
| generating regex.
|
| I don't see any reason that an IDE, especially with a
| statically typed language, can't have an AI integrated
| that at least will never hallucinate classes/functions
| that don't exist.
|
| Modern IDEs can already give you real time errors across
| large solutions for code that won't compile.
|
| Tools need to mature.
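|
| Something like this generate-and-verify loop, as a minimal
| sketch (the candidate list stands in for successive LLM
| attempts; in practice each failure would be fed back into the
| next prompt):
|
|     import re
|
|     # Cases the generated regex must satisfy (loose ISO dates).
|     SHOULD_MATCH = ["2024-01-31", "1999-12-01"]
|     SHOULD_NOT_MATCH = ["2024-1-31", "not a date"]
|
|     def is_valid(pattern: str) -> bool:
|         """Does the candidate compile and classify every
|         example correctly?"""
|         try:
|             rx = re.compile(pattern)
|         except re.error:
|             return False
|         return (all(rx.fullmatch(s) for s in SHOULD_MATCH) and
|                 not any(rx.fullmatch(s) for s in SHOULD_NOT_MATCH))
|
|     # Stand-ins for successive LLM attempts.
|     candidates = [r"\d+-\d+-\d+", r"\d{4}-\d{2}-\d{2}"]
|     accepted = next((p for p in candidates if is_valid(p)), None)
|     print(accepted)  # -> \d{4}-\d{2}-\d{2}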
| genewitch wrote:
| Two notes: I've never had one say no for code-related
| stuff, but I have it disagree that something exists _all
| the time_. In fact, I just had one deny that a Subaru Brat
| exists, twice.
|
| Secondly, if an LLM is giving you the runaround, it does not
| have a solution for the prompt you asked, and you need
| either another prompt, another model, or another approach
| to using the model (for vendor lock-in like OpenAI).
| swells34 wrote:
| This is a good representation of my experience as well.
|
| At the end of the day, this is because it isn't "writing
| code" in the sense that you or I do. It is a fancy
| regurgitation engine that will output bits of stuff it's
| seen before that seem related to your question. LLMs are
| incredibly good at this, but that is also why you can never
| trust their output.
| Vampiero wrote:
| ... I just realized that I would be waking up just to go back
| to resting.
| AJRF wrote:
| My takeaway is that he didn't solve anything, he just changed the
| shape of the problem into one that was familiar to him.
| ebiester wrote:
| That's how we all solve problems. If this was novel, it would
| be a paper rather than a blog post.
|
| The meta-strategy of combining LLM and non-LLM techniques is
| going to be key for getting good results for some time.
| AJRF wrote:
| No, I don't think I agree. There is a lot of effort wasted
| shuffling problems around laterally without solving for the
| actual goal; that's what I am saying.
| ccortes wrote:
| > he just changed the shape of the problem into one that was
| familiar to him
|
| that's a classic strategy to solve problems
| devvvvvvv wrote:
| Using AI as a way to flag things for humans to look at and make
| final decisions on seems like the way to go
| internet_points wrote:
| I've worked on some projects that used ML and such to half-
| automate things, thinking that we'd get the computer to do most
| of the work and people would check over things and it would be
| quality controlled.
|
| Three problems with this:
|
| * salespeople constantly try to sell the automation as more
| complete than it is
|
| * product owners try to push us developers into making it more
| fully automated
|
| * users get lulled into thinking it's more complete than it is
| (and accepting suggestions instead of deeply thinking through
| the issues like they would if they had to think things through
| from scratch)
| liontwist wrote:
| And all of these are management problems.
| GardenLetter27 wrote:
| Almost all deployed ML systems work like this.
|
| E.g. for classification you can judge "certainty" by the soft-
| max outputs of the classifier, then in the less certain cases
| refuse to classify and send the item to humans.
|
| And also do random sampling of outputs by humans to verify
| accuracy over time.
|
| It's just that humans are really expensive and slow though, so
| it can be hard to maintain.
|
| But if humans have to review everything anyway (like with the
| EU's AI act for many applications) then you don't really gain
| much - even though the humans would likely just do a cursory
| rubber-stamp review anyway, as anyone who has seen Pull Request
| reviews can attest to.
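|
| A minimal sketch of that confidence-gated routing (the
| threshold here is arbitrary, for illustration):
|
|     import numpy as np
|
|     def route(logits: np.ndarray, threshold: float = 0.9):
|         """Auto-classify only when the soft-max is confident
|         enough; otherwise defer to a human reviewer."""
|         probs = np.exp(logits - logits.max())
|         probs /= probs.sum()
|         label = int(probs.argmax())
|         if probs[label] >= threshold:
|             return ("auto", label)
|         return ("human_review", None)
|
|     print(route(np.array([4.0, 0.1, 0.2])))  # ('auto', 0)
|     print(route(np.array([1.0, 0.8, 0.9])))  # ('human_review', None)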
| frankc wrote:
| I have the same experience, but I am still 5 to 10 times more
| productive using Claude. I'll have it write a class, have it
| write tests for the class, and give it the output of the
| tests, from which it usually figures out problems like "oops,
| those methods don't exist". Along the way I am guiding it on
| the approach and architecture. Sometimes it does get stuck
| and needs very specific intervention; you need to be a senior
| engineer to do this well. Importantly, since it then has the
| context loaded, I can have it write nicely formatted
| documentation and add bells and whistles like a pretty CLI
| with minimal effort. In the end I usually get what I want,
| with more tests than I would have the patience to write, plus
| docs and polish, in a fraction of the time, especially with
| Cursor, which makes the iteration process so much faster.
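|
| Roughly this loop, as a sketch (ask_llm and write_files are
| hypothetical helpers; the point is feeding the test output
| back into the next prompt):
|
|     import subprocess
|
|     def iterate(prompt: str, max_rounds: int = 3) -> bool:
|         """Generate code, run the tests, and feed failures back
|         to the model until the suite passes or we give up."""
|         for _ in range(max_rounds):
|             code = ask_llm(prompt)    # hypothetical LLM call
|             write_files(code)         # hypothetical file writer
|             result = subprocess.run(["pytest", "-x", "-q"],
|                                     capture_output=True, text=True)
|             if result.returncode == 0:
|                 return True
|             # "oops, those methods don't exist" surfaces here
|             prompt = ("Tests failed:\n" + result.stdout +
|                       "\nPlease fix the code.")
|         return False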
| Joker_vD wrote:
| But... that can possibly only make things more expensive than
| they are now, with dubious improvements in quality?
| Terr_ wrote:
| One of the big subtle problems is designing the broader
| interaction so that the humans in the loop are both capable
| _and_ motivated to do a proper review of every item that will
| occur.
|
| LLMs are able to counterfeit a truly impressive number of
| indirect signals which humans currently use to make snap-
| judgements and mental-shortcuts, and somehow reviewers need to
| be shielded from that.
| GardenLetter27 wrote:
| The example here isn't great, but the idea of using an
| ensemble of LLMs when compute is cheaper is cool.
|
| The foundational models can parse super complex stuff like
| dense human language, music, etc. with context, like a really
| good pre-built auto-encoder; doing that with classic machine
| learning feature selection would be a nightmare (remember bag
| of words? and word2vec?).
|
| I wonder how such an approach would compare to just fine-
| tuning one model, though? And how does the cost of fine-tuning
| compare to the greater inference cost of an ensemble?
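|
| The simplest version of the ensemble idea is majority voting
| over label outputs; a sketch, with a hypothetical complete()
| call standing in for whatever client API you use:
|
|     from collections import Counter
|
|     def ensemble_label(prompt: str, models: list[str]) -> str:
|         """Ask several models the same classification question
|         and take the majority vote; ties become 'unsure'."""
|         votes = [complete(m, prompt).strip()  # hypothetical
|                  for m in models]
|         (label, count), *_ = Counter(votes).most_common()
|         return label if count > len(votes) / 2 else "unsure"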
| dailykoder wrote:
| This is interesting. I think I did not entirely understand the
| OP's problem, but we are going more and more in a direction
| where we try to come up with ways to "program" LLMs, because
| human language is not sufficient. (At least I thought) the
| goal was to make things simple and "just" ask your question to
| an LLM and get the answer, but normal language does not work
| for complex tasks.
|
| In programming especially it is funny. People spend hours upon
| hours coming up with a prompt that can (kind of) reliably
| produce code. So they try to hack/program some weird black box
| so that they can do their actual programming tasks. In some
| areas there might be a speedup, but I still don't know if it's
| worth it. It feels like we are creating more problems than
| solutions.
| flessner wrote:
| I feel the same way about programming, but there are plenty of
| people that don't enjoy it.
|
| I was recently chatting with a friend who wanted to automate
| one of his tasks by writing a Python script with AI, because
| all the influencers said it was "so easy" and "no programming
| knowledge required".
|
| That might have been the single funniest piece of code I have
| seen in a long time. It didn't install the dependencies,
| didn't fill in the Twitter API key, and instead of searching
| for a keyword on Twitter it just looked up 3 random accounts;
| 25 functions in about 120 lines of code?
|
| Also, the line numbers in the errors weren't helpful because
| the whole thing lived in Windows Notepad. That was a flagship
| AI and (in my opinion) a capable human not being able to
| assemble a simple script.
| PaulHoule wrote:
| If you have some idea of what good code looks like you can
| sometimes give feedback to something like Cursor or Windsurf.
| For small greenfield projects (that kind of downloader
| script) they succeed maybe 50% of the time.
|
| If you had no idea of what code looks like and poor critical
| thinking abilities God help you.
| cyanydeez wrote:
| Possible bug in the uber query?
|
| ---
|
| Which of these product descriptions (if either) is more
| relevant to the furniture e-commerce search query:
|
| Query: entrance table
|
| Product LHS name: aleah coffee table
| Product LHS description: You'll love this table from lazy boy.
| It goes in your living room. And you'll find ... ...
|
| Or
|
| Product LHS name: marta coffee table
| Product RHS description: This coffee table is great for your
| entrance, use it to put in your doorway... ...
|
| Or Neither / Need more product attributes
|
| Only respond 'LHS' or 'RHS' if you are confident in your
| decision
|
| RESPONSE: RHS
|
| ---
|
| LHS is included twice. Hopefully this is a bug in the blog and
| not the code.
| outofpaper wrote:
| With or without the bug, it's a horrid prompt. Prompts work
| best when they resemble content LLMs have in their training
| data. People use "first" and "second" far more often than LHS
| and RHS when talking about options. First or second, 1 or 2, a
| or b or neither.
|
| LLMs are narrative machines. They make up stories which often
| make sense.
| cyanydeez wrote:
| LHS might trigger a better parsing window and that window
| would be model dependent.
| softwaredoug wrote:
| This is a copy/paste typo; the real prompt begins
|
| > Which of these furniture products is more relevant to the
| furniture e-commerce search query:
|
| Fixed in the post. Thanks
| MichaelMoser123 wrote:
| Classical ML always runs into the knowledge representation
| problem - the task is to find some general representation of
| knowledge suitable for computer reasoning. That's something of
| a philosopher's stone - people have been searching for it for
| seventy years already.
|
| I think agents will run into the same problem if they try
| to find a classical ML solution to verify what comes out of
| the LLM.
| blueflow wrote:
| And like the philosopher's stone, it does not exist. Remember
| the "Map vs Territory" discussion: you cannot have generic
| maps, only maps specialized for a purpose.
| outofpaper wrote:
| Yes. All too easily we forget that the maps are not the
| territories.
|
| LLMs are amazing; we are creating better and better
| hyperdimensional maps of language. But until we have systems
| that are not just crystallized maps of the language they were
| trained on, we will never have something that can really
| think, let alone AGI or whatever new term we come up with.
| Matthyze wrote:
| That's essentially the No Free Lunch (NFL) theorem, right?
|
| The thing about the NFL theorem is that it assumes an equal
| weight or probability over each problem/task. It's impossible
| to find a search/learning algorithm that performs superiorly
| over another, 'averaged' over all tasks. But--and this is
| purely my intuition--the problems that humans want to solve,
| are a very small subset of all possible search/learning
| problems. And this imbalance allows us to find algorithms
| that work particularly well on the subset of problems we want
| to solve.
|
| Coming back to representation and maps. Human
| understanding/worldview is a good example. Human
| understanding and worldview is itself a map of reality. This
| map models certain facts of the world well and other facts
| poorly. It is optimized for human cognition. But it's still
| broad enough to be useful for a variety of problems. If this
| map wasn't useful, we probably wouldn't have evolved it.
|
| The point is, I do think there's a philosopher's pebble, and
| I do think there's a few free bites of lunch. These can be
| found in the discrepancy between all theoretically possible
| tasks and the tasks that we actually want to do.
| MichaelMoser123 wrote:
| I don't know. Maps can vary in quality and expressiveness.
|
| Language itself is a kind of map, and it has pretty
| universal reach.
|
| "No Free Lunch (NFL) theorem" isn't quite mathematics, it
| is more in the domain of philosophy.
| Matthyze wrote:
| The NFL theorem (for optimization) has a mathematical
| proof, FYI. But I agree that there's a lot of room for
| interpretation.
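|
| If I'm recalling Wolpert & Macready (1997) correctly, the
| statement is that for any two algorithms a_1 and a_2, summed
| over all objective functions f,
|
|     \sum_f P(d_m^y \mid f, m, a_1) = \sum_f P(d_m^y \mid f, m, a_2)
|
| where d_m^y is the sequence of m cost values visited, so no
| algorithm beats another when averaged over all f.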
| hackerwr wrote:
| hello do you have a place for me im haking the school now
| sgt101 wrote:
| Options:
|
| Finetune the models to be better
|
| Optimise the prompts to be better
|
| Train better models
| Matthyze wrote:
| So, if I understand the approach correctly: we're essentially
| doing very advanced feature engineering with LLMs. We find that
| direct classification by LLMs performs worse than LLM feature
| engineering followed by decision trees. Am I right?
|
| The finding surprises me. I would expect modern LLMs to be
| powerful enough to do well at the task. Given how much the data
| is processed before the decision trees, I wouldn't expect
| decision trees to add much. I can see value in this approach if
| you're unable to optimize the LLM. But, if you can, I think end-
| to-end training with a pre-trained LLM is likely to work better.
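|
| As I read the post, the pipeline has roughly this shape (the
| llm_feature helper, question list, and training data below are
| hypothetical placeholders; the tree is plain scikit-learn):
|
|     from sklearn.tree import DecisionTreeClassifier
|
|     # Each feature is an LLM answer to one narrow yes/no
|     # question about a (query, product) pair.
|     QUESTIONS = ["same_furniture_type", "mentions_intended_room"]
|
|     def featurize(query: str, product: str) -> list[int]:
|         # hypothetical: one small LLM call per question
|         return [llm_feature(q, query, product) for q in QUESTIONS]
|
|     X = [featurize(q, p) for q, p in labeled_pairs]  # assumed
|     y = relevance_labels                             # assumed
|     tree = DecisionTreeClassifier(max_depth=3).fit(X, y)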
| softwaredoug wrote:
| TBH I'm not sure it's better, but the decision tree structure is
| pretty handy for problem exploration
|
| (However 'better' might be defined, I care more about the
| precision / recall tradeoff)
| ellisv wrote:
| This resonates with my experience. Use LLMs for feature
| engineering, then use traditional ML for your inference models.
| Matthyze wrote:
| Perhaps the reason that this approach works well is that,
| while the LLM gives you good general-purpose language
| processing, the decision tree learns about the specific
| dataset. And that combination is more powerful than either
| component.
| ellisv wrote:
| It's the same reason LLMs don't perform well on tabular
| data. (They can do fine, but usually not as well as other
| models.)
|
| Performing feature engineering with LLMs and then storing
| the embeddings in a vector database also allows you to
| reuse the embeddings for multiple tasks (eg clustering,
| nearest neighbor).
|
| Generally no one uses plain decision trees since random
| forest or gradient boosted trees perform better and are
| more robust.
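|
| A sketch of that reuse pattern: embed once, then point the
| same vectors at a boosted-tree classifier and a nearest-
| neighbor index (the model name and data here are examples):
|
|     from sentence_transformers import SentenceTransformer
|     from sklearn.ensemble import GradientBoostingClassifier
|     from sklearn.neighbors import NearestNeighbors
|
|     encoder = SentenceTransformer("all-MiniLM-L6-v2")
|     X = encoder.encode(texts)  # assumed corpus; computed once
|
|     clf = GradientBoostingClassifier().fit(X, labels)  # task 1
|     nn = NearestNeighbors(n_neighbors=5).fit(X)        # task 2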
| gerad wrote:
| It seems like a really easy way to overfit your model to your
| data, even while using LLMs.
| jncfhnb wrote:
| If you're going to use classic ML why not just train a model
| based on the vector representations of the product descriptions?
| softwaredoug wrote:
| Yes that's a great idea, and maybe something I would try next
| in this series.
| lewisl9029 wrote:
| I had a somewhat similar experience trying to use LLMs to do OCR.
|
| All the models I've tried (Sonnet 3.5, GPT 4o, Llama 3.2, Qwen2
| VL) have been pretty good at extracting text, but they failed
| miserably at finding bounding boxes, usually just making up
| random coordinates. I thought this might have been due to
| internal resizing of images, so I tried to get them to use relative
| % based coordinates, but no luck there either.
|
| Eventually gave up and went back to good old PP-OCR models (are
| these still state of the art? would love to try out some better
| ones). The actual extraction feels a bit less accurate than the
| best LLMs, but bounding box detection is pretty much spot on all
| the time, and it's literally several orders of magnitude more
| efficient in terms of memory and overall energy use.
|
| My conclusion was that current gen models still just aren't
| capable enough yet, but I can't help but feel like I might be
| missing something. How the heck did Anthropic and OpenAI manage
| to build computer use if their models can't give them accurate
| coordinates of objects in screenshots?
| HanClinto wrote:
| Maybe still worth it to separate the tasks, and use a
| traditional text detection model to find bounding boxes, then
| crop the images. In a second stage, send those cropped samples
| to the higher-power LLMs to do the actual text extraction, and
| don't worry about them for bounding boxes at all.
|
| There are some VLLMs that seem to be specifically trained to do
| bounding box detection (Moondream comes to mind as one that
| advertises this?), but in general I wouldn't be surprised if
| none of them work as well as traditional methods.
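|
| A rough sketch of that two-stage split (PaddleOCR's detection
| API varies by version, so treat the rec=False call as an
| assumption; llm_transcribe is a hypothetical VLM call):
|
|     from paddleocr import PaddleOCR
|     from PIL import Image
|
|     det = PaddleOCR(lang="en")
|     image = Image.open("page.png")
|
|     # Stage 1: a traditional detector proposes boxes
|     # (rec=False skips the recognition pass).
|     boxes = det.ocr("page.png", rec=False)[0]
|
|     # Stage 2: crop each box, let the stronger model transcribe.
|     for box in boxes:
|         xs = [p[0] for p in box]
|         ys = [p[1] for p in box]
|         crop = image.crop((int(min(xs)), int(min(ys)),
|                            int(max(xs)), int(max(ys))))
|         text = llm_transcribe(crop)  # hypothetical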
| owkman wrote:
| I think people have had success using PaliGemma for this. The
| computer-use use cases probably rely on fine-tuned versions of
| LLMs rather than the base ones.
| ahzhou wrote:
| LLMs are inherently bad at this due to tokenization, scaling,
| and lack of training on the task. Anthropic's computer use
| feature has a specialized model for pixel-counting:
|
| > Training Claude to count pixels accurately was critical.
| Without this skill, the model finds it difficult to give
| mouse commands. [1]
|
| For a VLM trained on identifying bounding boxes, check out
| PaliGemma [2].
|
| You may also be able to get the computer use API to draw
| bounding boxes if the costs make sense.
|
| That said, I think the correct solution is likely to use a
| non-VLM to draw bounding boxes. Depends on the dataset and
| problem.
|
| 1. https://www.anthropic.com/news/developing-computer-use
| 2. https://huggingface.co/blog/paligemma
| nostrebored wrote:
| PaliGemma on computer use data is absolutely not good. The
| difference between a FT YOLO model and a FT PaliGemma model
| is huge if generic bboxes are what you need. Microsoft's
| OmniParser also winds up using a YOLO backbone [1]. All of
| the browser use tools (like our friends at browser-use [2])
| wind up trying to get a generic set of bboxes using the DOM
| and then applying generative models.
|
| PaliGemma seems to fit into a completely different niche
| right now (VQA and Segmentation) that I don't really see
| having practical applications for computer use.
|
| [1]
| https://huggingface.co/microsoft/OmniParser?language=python
| [2] https://github.com/browser-use/browser-use
| bob1029 wrote:
| > they failed miserably at finding bounding boxes, usually just
| making up random coordinates.
|
| This makes sense to me. These LLMs likely have no statistics
| about the spatial relationships of tokens in a 2D raster space.
| nostrebored wrote:
| The spatial awareness is what grounding models try to
| achieve, e.g. UGround [1]
|
| [1]
| https://huggingface.co/osunlp/UGround-V1-7B?language=python
| aaronharnly wrote:
| Relatedly, we find LLM vision models absolutely atrocious at
| _counting things_. We build school curricula, and one basic
| task for our activities is counting - blocks, pictures of
| ducks, segments in a chart, whatever. Current LLM models can't
| reliably count four or five squares in an image.
| nyrikki wrote:
| IMHO, that is expected, at least for the general case.
|
| That is one of the implications of transformers being
| DLOGTIME-uniform TC0, they don't have access to counter
| analogs.
|
| You would need to move to log depth circuits, add mod-p_n
| gates etc... unless someone finds some new mathematics.
|
| Proposition 6.14 in Immerman is where this is lost if you
| want a cite.
|
| It will be counterintuitive that division is in TC0, but
| (general) counting is not.
| whiplash451 wrote:
| 1. You need to look into the OCR-specific literature of DL
| (e.g. udop) or segmentation-based (e.g. segment-anything)
|
| 2. BigTech and SmallTech train their fancy bounding box /
| detection models on large datasets that have been built using
| classical detectors and a ton of manual curation
| KTibow wrote:
| Gemini 2 can purportedly do this, you can test it with the
| Spatial Understanding Starter App inside AI Studio. Only caveat
| is that it's not production ready yet.
| DougBTX wrote:
| AFAIK none of those models have been trained to produce
| bounding boxes. On the other hand Gemini Pro has, so it may be
| worth looking at for your use case:
|
| https://simonwillison.net/2024/Aug/26/gemini-bounding-box-vi...
| vonneumannstan wrote:
| Yeah I really struggle when I use my hammer to screw pieces of
| wood together too.
| prettyblocks wrote:
| Have you played with Moondream? It's a pretty cool small
| vision model that did a good job with bounding boxes when I
| played with it.
| jonnycoder wrote:
| I am doing OCR on hundreds of PDFs using AWS Textract. It
| requires me to convert each page of the PDF to an image and
| then analyze the image, and it works well for converting to
| markdown format (which requires custom code). I want to try
| some vision models and compare how they do, for example
| Phi-3.5-vision-instruct.
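|
| For reference, a bare-bones version of that pipeline
| (pdf2image to rasterize, then Textract per page; the markdown
| conversion is where the custom code goes):
|
|     import io
|     import boto3
|     from pdf2image import convert_from_path
|
|     textract = boto3.client("textract")
|
|     for page in convert_from_path("document.pdf", dpi=300):
|         buf = io.BytesIO()
|         page.save(buf, format="PNG")
|         resp = textract.detect_document_text(
|             Document={"Bytes": buf.getvalue()})
|         lines = [b["Text"] for b in resp["Blocks"]
|                  if b["BlockType"] == "LINE"]
|         # ...custom markdown conversion goes here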
| ausbah wrote:
| this is sort of what tooling is supposed to be like, right?
| the llm isn't great at X task, so it delegates it to a proven,
| capable tool like a calendar
| raghavbali wrote:
| Maybe I missed something, but this is a roundabout way of
| doing things where an embedding + ML classifier would have
| done the job. We don't have to use an LLM just because it can
| be used, IMO.
| napsternxg wrote:
| We often ignore the importance of using good baseline systems
| and jump to the latest shiny thing.
|
| I had a similar experience a few years back when participating
| in ML competitions [1,2] for detecting and typing phrases in a
| text. I submitted an approach based on Named Entity
| Recognition using Conditional Random Fields (CRF), which have
| been quite robust and well known in the community, and my
| solution beat most of the tuned deep learning solutions by
| quite a large margin [1].
|
| I think a lot of folks underestimate the complexity of using
| some of these models (DL, LLM) and just throw them at the
| problem, or don't compare them well against well-established
| baselines.
|
| [1]
| https://scholar.google.com/citations?view_op=view_citation&h...
| [2]
| https://scholar.google.com/citations?view_op=view_citation&h...
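|
| For anyone curious what such a baseline looks like, a minimal
| sklearn-crfsuite setup (features deliberately simple here;
| train_sents and train_tags are assumed BIO-tagged data):
|
|     import sklearn_crfsuite
|
|     def word_features(sent, i):
|         """Simple per-token features; CRF baselines live or
|         die by these."""
|         w = sent[i]
|         return {
|             "lower": w.lower(),
|             "is_title": w.istitle(),
|             "is_digit": w.isdigit(),
|             "prev": sent[i - 1].lower() if i > 0 else "<s>",
|             "next": sent[i + 1].lower() if i + 1 < len(sent)
|                     else "</s>",
|         }
|
|     X = [[word_features(s, i) for i in range(len(s))]
|          for s in train_sents]
|     crf = sklearn_crfsuite.CRF(algorithm="lbfgs",
|                                max_iterations=100)
|     crf.fit(X, train_tags)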
| PaulHoule wrote:
| As I see it, you need a model you can train quickly so you can
| do tuning, model selection, and all that.
|
| I have a BERT + SVM + Logistic Regression (for calibration)
| model that can train 20 models for automatic model selection
| and calibration in about 3 minutes. I feel like I understand
| the behavior of it really well.
|
| I've tried fine-tuning a BERT for the same task: the shortest
| model builds take 30 minutes, the training curves make no
| sense (back in the day I used to be able to train networks
| with early stopping and get a good one _every time_), and if I
| look at arXiv papers it is rare for anyone to have a model
| selection process with any discipline at all; mainly people
| use a recipe that sorta-kinda seemed to work in some other
| paper. People scoff at you if you ask the engineering-oriented
| question "What training procedure can I use to get a good
| model consistently?"
|
| Because of that I like classical ML.
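|
| Roughly that shape of pipeline, as a sketch: frozen
| BERT-family embeddings, a linear SVM, and Platt-style logistic
| calibration on top (the model name and data are placeholder
| assumptions, not my exact setup):
|
|     from sentence_transformers import SentenceTransformer
|     from sklearn.calibration import CalibratedClassifierCV
|     from sklearn.svm import LinearSVC
|
|     encoder = SentenceTransformer("all-MiniLM-L6-v2")
|     X = encoder.encode(train_texts)  # assumed training data
|
|     # method="sigmoid" fits a logistic regression on the SVM
|     # margins - the classic Platt-scaling calibration step.
|     clf = CalibratedClassifierCV(LinearSVC(), method="sigmoid",
|                                  cv=5)
|     clf.fit(X, train_labels)
|     probs = clf.predict_proba(encoder.encode(["a new document"]))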
| korkybuchek wrote:
| There's a reason xgboost is still king in large companies.
___________________________________________________________________
(page generated 2025-01-24 23:01 UTC)