[HN Gopher] Coping with dumb LLMs using classic ML
       ___________________________________________________________________
        
       Coping with dumb LLMs using classic ML
        
       Author : fzliu
       Score  : 155 points
       Date   : 2025-01-22 09:25 UTC (2 days ago)
        
 (HTM) web link (softwaredoug.com)
 (TXT) w3m dump (softwaredoug.com)
        
       | Vampiero wrote:
       | Wake me up when LLMs are good at Problog because it's the day we
       | can finally rest
        
         | kvgr wrote:
          | The amount of hallucination I get when trying to write code
          | is amazing. I mean, it can get the core concepts of the
          | language and can create structure/algorithms. But it often
          | makes up objects/values when I ask questions. Example: it
          | suggested TextLayoutResult.size, which is an Int value. I
          | asked if it has width and height, and it wrote that it has
          | size.height and also size.width. Which it does not. I am now
          | writing production code and also evaluating the LLMs that
          | our management thinks will save us a shitload of time. We
          | will get there sometime, but the push from management is not
          | compatible with the state of the LLMs. (I use Claude 3.5
          | Sonnet now, as it is also built into some of the "AI IDEs".)
        
           | antihipocrat wrote:
            | You're not alone. In my experience the senior executives
            | are enamoured with the possibility of halving headcount. The
           | engineers reporting honestly about the limitations of
           | connecting it to core systems (or using it to generate
           | complex code running on core systems) are at risk of being
           | perceived as blocking progress. So everyone keeps quiet,
           | tries to find a quick and safe use case for the tech to
            | present to management, and makes sure that they aren't
           | involved in any project that will be the big one to fail
           | spectacularly and bring it all crashing down.
        
           | ZaoLahma wrote:
           | What irks me is how LLMs won't just say "no, it won't work"
           | or "it's beyond my capabilities" and instead just give you
           | "solutions" that are wrong.
           | 
           | Codeium for example will absolutely bend over backwards to
           | provide you with solutions to requests that can't be
           | satisfied, producing more and more garbage for every attempt.
           | I don't think I've ever seen it just say no.
           | 
           | ChatGPT is marginally better and will sometimes tell you
           | straight up that an algorithm can't be rewritten as you
           | suggest, because of ... But sometimes it too will produce
           | garbage in its attempts at doing something impossible that
           | you ask it to do.
        
             | epcoa wrote:
             | > ChatGPT is marginally better and will sometimes tell you
             | straight up that an algorithm can't be rewritten as you
             | suggest
             | 
              | Unfortunately it gets this wrong very often, especially if
             | it involves some multistep process.
        
             | dingnuts wrote:
             | >What irks me is how LLMs won't just say "no, it won't
             | work" or "it's beyond my capabilities" and instead just
             | give you "solutions" that are wrong.
             | 
             | This is one of the clearest ways to demonstrate that an LLM
             | doesn't "know" anything, and isn't "intelligence." Until an
             | LLM can determine whether its own output is based on
             | something or completely made up, it's not intelligent. I
             | find them downright infuriating to use because of this
             | property.
             | 
             | I'm glad to see other people are waking up
        
               | genewitch wrote:
               | Intelligence doesn't imply knowing when you're wrong
               | though.
               | 
               | Hackernews has Intelligent people...
               | 
               | Q. E. D.
               | 
               | % LLMs can RAG incorrect PDF citations too
        
               | scarface_74 wrote:
                | That's an easily solvable problem for programming.
                | Today ChatGPT has an embedded Python runtime that it
                | can use to verify its own code, and I have seen it try
                | different techniques when the code doesn't give the
                | expected answer. The one time I can remember is with
                | generating regex.
                | 
                | I don't see any reason that an IDE, especially with a
                | statically typed language, can't have an AI integrated
                | that at least will never hallucinate classes/functions
                | that don't exist.
               | 
               | Modern IDEs can already give you real time errors across
               | large solutions for code that won't compile.
               | 
               | Tools need to mature.
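                | 
                | Roughly the shape of that check, as a toy sketch (not
                | the actual ChatGPT tooling; the names and examples
                | here are made up):
                | 
                |   import re
                | 
                |   def regex_ok(pattern, good, bad):
                |       # accept the model's regex only if it behaves
                |       # as expected on known examples
                |       try:
                |           rx = re.compile(pattern)
                |       except re.error:
                |           return False
                |       return (all(rx.search(s) for s in good)
                |               and not any(rx.search(s) for s in bad))
                | 
                |   # candidate comes from the model; re-prompt on failure
                |   candidate = r"\d{4}-\d{2}-\d{2}"
                |   print(regex_ok(candidate,
                |                  ["2025-01-22"], ["01/22/2025"]))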
        
             | genewitch wrote:
              | Two notes: I've never had one say no for code-related
              | stuff, but I have it disagree that something exists _all
              | the time_. In fact I just had one deny that a Subaru Brat
              | exists, twice.
              | 
              | Secondly, if an LLM is giving you the runaround, it does
              | not have a solution for the prompt you asked, and you
              | need either another prompt, another model, or another
              | approach to using the model (for vendor lock-in like
              | OpenAI).
        
           | swells34 wrote:
           | This is a good representation of my experience as well.
           | 
           | At the end of the day, this is because it isn't "writing
           | code" in the sense that you or I do. It is a fancy
           | regurgitation engine, that will output bits of stuff it's
           | seen before that seem related to your question. LLMs are
            | incredibly good at this, but that is also why you can never
           | trust their output.
        
         | Vampiero wrote:
         | ... I just realized that I would be waking up just to go back
         | to resting.
        
       | AJRF wrote:
       | My takeaway is that he didn't solve anything, he just changed the
       | shape of the problem into one that was familiar to him.
        
         | ebiester wrote:
         | That's how we all solve problems. If this was novel, it would
         | be a paper rather than a blog post.
         | 
         | The meta-strategy of combining LLM and non-LLM techniques is
         | going to be key for getting good results for some time.
        
           | AJRF wrote:
            | No, I don't think I agree. There is a lot of effort wasted
            | shuffling problems around laterally rather than solving for
            | the actual goal - that's what I am saying.
        
         | ccortes wrote:
         | > he just changed the shape of the problem into one that was
         | familiar to him
         | 
         | that's a classic strategy to solve problems
        
       | devvvvvvv wrote:
       | Using AI as a way to flag things for humans to look at and make
       | final decisions on seems like the way to go
        
         | internet_points wrote:
         | I've worked on some projects that used ML and such to half-
         | automate things, thinking that we'd get the computer to do most
         | of the work and people would check over things and it would be
         | quality controlled.
         | 
         | Three problems with this:
         | 
         | * salespeople constantly try to sell the automation as more
         | complete than it is
         | 
         | * product owners try to push us developers into making it more
         | fully automated
         | 
         | * users get lulled into thinking it's more complete than it is
         | (and accepting suggestions instead of deeply thinking through
          | the issues like they would if they had to think things
          | through from scratch)
        
           | liontwist wrote:
           | And all of these are management problems.
        
         | GardenLetter27 wrote:
          | Almost all deployed ML systems work like this.
          | 
          | I.e. for classification you can judge "certainty" by the
          | softmax outputs of the classifier, then in the less certain
          | cases refuse to classify and send those to humans.
          | 
          | And also do random sampling of outputs by humans to verify
          | accuracy over time.
          | 
          | It's just that humans are really expensive and slow, so it
          | can be hard to maintain.
          | 
          | But if humans have to review everything anyway (like with the
          | EU's AI Act for many applications) then you don't really gain
          | much - even though the humans would likely just do a cursory
          | rubber-stamp review anyway, as anyone who has seen pull
          | request reviews can attest.
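          | 
          | A minimal sketch of that "defer when unsure" routing (the
          | threshold and names here are invented for illustration):
          | 
          |   import numpy as np
          | 
          |   def route(probs, threshold=0.9):
          |       # probs: softmax output of the classifier for one item
          |       probs = np.asarray(probs)
          |       label = int(np.argmax(probs))
          |       if probs[label] >= threshold:
          |           return ("auto", label)
          |       return ("human_review", None)  # low confidence: defer
          | 
          |   print(route([0.02, 0.95, 0.03]))  # ('auto', 1)
          |   print(route([0.40, 0.35, 0.25]))  # defers to a human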
        
           | frankc wrote:
            | I have the same experience, but I am still 5 to 10 times
            | more productive using Claude. I'll have it write a class,
            | have it write tests for the class, and give it the output
            | of the tests, from which it usually figures out problems
            | like "oops, those methods don't exist". Along the way I am
            | guiding it on the approach and architecture. Sometimes it
            | does get stuck and needs very specific intervention; you
            | need to be a senior engineer to do this well. Importantly,
            | since it then has the context loaded, I can have it write
            | nicely formatted documentation and add bells and whistles
            | like a pretty CLI with minimal effort. In the end I usually
            | get what I want, with way more tests, docs and polish than
            | I would have the patience to write, in a fraction of the
            | time - especially with Cursor, which makes the iteration
            | process so much faster.
        
         | Joker_vD wrote:
         | But... that can possibly only make things more expensive than
         | they are now, with dubious improvements in quality?
        
         | Terr_ wrote:
         | One of the big subtle problems is designing the broader
         | interaction so that the humans in the loop are both capable
         | _and_ motivated to do a proper review of every item that will
         | occur.
         | 
         | LLMs are able to counterfeit a truly impressive number of
         | indirect signals which humans currently use to make snap-
         | judgements and mental-shortcuts, and somehow reviewers need to
         | be shielded from that.
        
       | GardenLetter27 wrote:
       | The example here isn't great, but the idea of using an ensemble
       | of LLMs when compute is cheaper is cool.
       | 
        | The foundational models can parse super complex stuff like
        | dense human language, music, etc. with context - like a
        | really good pre-built auto-encoder - which would be a
        | nightmare with classic machine learning feature selection
        | (remember bag of words? And word2vec?).
        | 
        | I wonder how such an approach would compare to just
        | fine-tuning one model, though? And how does the cost of
        | fine-tuning compare to the greater inference cost of an
        | ensemble?
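        | 
        | A toy sketch of the ensemble idea - majority voting over
        | several models' judgments (ask_model is a stand-in for
        | whatever API you would actually call):
        | 
        |   from collections import Counter
        | 
        |   def ask_model(model, prompt):
        |       # stand-in for a real API call; canned answers
        |       # just so the sketch runs
        |       canned = {"m1": "RHS", "m2": "RHS", "m3": "LHS"}
        |       return canned[model]
        | 
        |   def ensemble_judgment(models, prompt):
        |       votes = [ask_model(m, prompt) for m in models]
        |       winner, count = Counter(votes).most_common(1)[0]
        |       # only trust the answer if a clear majority agrees
        |       return winner if count > len(models) // 2 else "Neither"
        | 
        |   print(ensemble_judgment(["m1", "m2", "m3"], "query..."))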
        
       | dailykoder wrote:
        | This is interesting. I think I did not entirely understand
        | OP's problem, but we are going more and more in a direction
        | where we try to come up with ways to "program" LLMs, because
        | human language is not sufficient. (At least I thought) the
        | goal was to make things simple and "just" ask your question
        | to an LLM and get the answer, but normal language does not
        | work for complex tasks.
        | 
        | Especially in programming it is funny. People spend hours
        | upon hours coming up with a prompt that can (kind of)
        | reliably produce code. So they try to hack/program some weird
        | black box so that they can do their actual programming tasks.
        | In some areas there might be a speed-up, but I still don't
        | know if it's worth it. It feels like we are creating more
        | problems than solutions.
        
         | flessner wrote:
          | I feel the same way about programming, but there are plenty
          | of people who don't enjoy it.
          | 
          | I was recently chatting with a friend who wanted to automate
          | one of his tasks by writing a Python script with AI, because
          | all the influencers said it was "so easy" and "no programming
          | knowledge required".
          | 
          | That might have been the single funniest piece of code I have
          | seen in a long time. It didn't install the dependencies,
          | didn't fill in the Twitter API key, and instead of searching
          | for a keyword on Twitter it just looked up 3 random accounts
          | - 25 functions in like 120 lines of code?
          | 
          | Also, the line numbers in the errors weren't helpful because
          | the whole thing lived in Windows Notepad. That was a flagship
          | AI and a (in my opinion) capable human not being able to
          | assemble a simple script.
        
           | PaulHoule wrote:
           | If you have some idea of what good code looks like you can
           | sometimes give feedback to something like Cursor or Windsurf.
           | For small greenfield projects (that kind of downloader
           | script) they succeed maybe 50% of the time.
           | 
           | If you had no idea of what code looks like and poor critical
            | thinking abilities, God help you.
        
       | cyanydeez wrote:
        | Possible bug in the uber query?
        | 
        | --- Which of these product descriptions (if either) is more
        | relevant to the furniture e-commerce search query:
        | 
        | Query: entrance table
        | Product LHS name: aleah coffee table
        | Product LHS description: You'll love this table from lazy boy.
        | It goes in your living room. And you'll find ... ...
        | Or
        | Product LHS name: marta coffee table
        | Product RHS description: This coffee table is great for your
        | entrance, use it to put in your doorway... ...
        | Or
        | Neither / Need more product attributes
        | 
        | Only respond 'LHS' or 'RHS' if you are confident in your
        | decision
        | 
        | RESPONSE: RHS ---
        | 
        | LHS is included twice. Hopefully this is a bug in the blog and
        | not the code.
        
         | outofpaper wrote:
          | With or without the bug it's a horrid prompt. Prompts work
          | best when they resemble content LLMs have in their training
          | data. People use "first" and "second" far more often than
          | LHS and RHS when talking about options. First or second, 1
          | or 2, a or b, or neither.
         | 
         | LLMs are narrative machines. They make up stories which often
         | make sense.
        
           | cyanydeez wrote:
           | LHS might trigger a better parsing window and that window
           | would be model dependent.
        
         | softwaredoug wrote:
         | This is a copy/pasted typo, the real prompt begins
         | 
         | > Which of these furniture products is more relevant to the
         | furniture e-commerce search query:
         | 
         | Fixed in the post. Thanks
        
       | MichaelMoser123 wrote:
        | Classical ML always runs into the knowledge representation
        | problem - the task is to find some general representation of
        | knowledge suitable for computer reasoning. That's something of
        | a philosopher's stone - people have been searching for it for
        | seventy years already.
        | 
        | I think agents will run into the same problem if they try to
        | find a classical ML solution to verify what comes out of the
        | LLM.
        
         | blueflow wrote:
          | And like the philosopher's stone it does not exist. Remember the
         | "Map vs Territory" discussion: you cannot have generic maps,
         | only maps specialized for a purpose.
        
           | outofpaper wrote:
           | Yes. All too easily we forget that the maps are not the
           | territories.
           | 
            | LLMs are amazing - we are creating better and better
            | hyperdimensional maps of language - but until we have
            | systems that are not just crystallized maps of the language
            | they were trained on, we will never have something that can
            | really think, let alone AGI or whatever new term we come up
            | with.
        
           | Matthyze wrote:
           | That's essentially the No Free Lunch (NFL) theorem, right?
           | 
           | The thing about the NFL theorem is that it assumes an equal
           | weight or probability over each problem/task. It's impossible
           | to find a search/learning algorithm that performs superiorly
           | over another, 'averaged' over all tasks. But--and this is
            | purely my intuition--the problems that humans want to solve
            | are a very small subset of all possible search/learning
           | problems. And this imbalance allows us to find algorithms
           | that work particularly well on the subset of problems we want
           | to solve.
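            | 
            | For reference, Wolpert and Macready's formulation says
            | roughly that, for any pair of algorithms a_1 and a_2,
            | summed over all objective functions f,
            | 
            |   \sum_f P(d_m^y \mid f, m, a_1)
            |       = \sum_f P(d_m^y \mid f, m, a_2)
            | 
            | where d_m^y is the sequence of cost values observed after
            | m evaluations - i.e. averaged over every possible problem,
            | no algorithm outperforms any other.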
           | 
           | Coming back to representation and maps. Human
           | understanding/worldview is a good example. Human
           | understanding and worldview is itself a map of reality. This
           | map models certain facts of the world well and other facts
           | poorly. It is optimized for human cognition. But it's still
           | broad enough to be useful for a variety of problems. If this
           | map wasn't useful, we probably wouldn't have evolved it.
           | 
           | The point is, I do think there's a philosopher's pebble, and
           | I do think there's a few free bites of lunch. These can be
           | found in the discrepancy between all theoretically possible
           | tasks and the tasks that we actually want to do.
        
             | MichaelMoser123 wrote:
             | I don't know. Maps can vary in quality and expressiveness.
             | 
             | Language itself is a kind of map, and it has pretty
             | universal reach.
             | 
             | "No Free Lunch (NFL) theorem" isn't quite mathematics, it
             | is more in the domain of philosophy.
        
               | Matthyze wrote:
               | The NFL theorem (for optimization) has a mathematical
               | proof, FYI. But I agree that there's a lot of room for
               | interpretation.
        
       | hackerwr wrote:
       | hello do you have a place for me im haking the school now
        
       | sgt101 wrote:
       | Options:
       | 
       | Finetune the models to be better
       | 
       | Optimise the prompts to be better
       | 
       | Train better models
        
       | Matthyze wrote:
       | So, if I understand the approach correctly: we're essentially
       | doing very advanced feature engineering with LLMs. We find that
       | direct classification by LLMs performs worse than LLM feature
       | engineering followed by decision trees. Am I right?
       | 
       | The finding surprises me. I would expect modern LLMs to be
       | powerful enough to do well at the task. Given how much the data
       | is processed before the decision trees, I wouldn't expect
       | decision trees to add much. I can see value in this approach if
       | you're unable to optimize the LLM. But, if you can, I think end-
       | to-end training with a pre-trained LLM is likely to work better.
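        | 
        | A minimal sketch of that split - LLM answers become features
        | and a small tree makes the final call (llm_judgment here is a
        | keyword stand-in so the sketch runs; the real thing would be
        | an LLM prompt):
        | 
        |   from sklearn.tree import DecisionTreeClassifier
        | 
        |   def llm_judgment(query, description):
        |       # stand-in for a pairwise LLM relevance prompt
        |       return int(any(w in description for w in query.split()))
        | 
        |   def features(query, description):
        |       return [llm_judgment(query, description),  # LLM signal
        |               len(description.split())]          # cheap extra
        | 
        |   rows = [
        |       ("entrance table", "table for your entrance", 1),
        |       ("entrance table", "cozy recliner for the den", 0),
        |       ("office chair", "chair for your home office", 1),
        |       ("office chair", "oak dining table, seats six", 0),
        |   ]
        |   X = [features(q, d) for q, d, _ in rows]
        |   y = [label for _, _, label in rows]
        |   clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
        |   print(clf.predict([features("entrance table",
        |                               "slim entrance table")]))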
        
         | softwaredoug wrote:
          | TBH I'm not sure it's better, but the decision tree structure is
         | pretty handy for problem exploration
         | 
         | (However 'better' might be defined, I care more about the
         | precision / recall tradeoff)
        
         | ellisv wrote:
         | This resonates with my experience. Use LLMs for feature
         | engineering, then use traditional ML for your inference models.
        
           | Matthyze wrote:
           | Perhaps the reason that this approach works well is that,
           | while the LLM gives you good general-purpose language
           | processing, the decision tree learns about the specific
           | dataset. And that combination is more powerful than either
           | component.
        
             | ellisv wrote:
              | It's the same reason LLMs don't perform well on tabular
              | data. (They can do fine, but usually not as well as other
              | models.)
              | 
              | Performing feature engineering with LLMs and then storing
              | the embeddings in a vector database also allows you to
              | reuse the embeddings for multiple tasks (e.g. clustering,
              | nearest neighbors).
              | 
              | Generally no one uses plain decision trees, since random
              | forests or gradient-boosted trees perform better and are
              | more robust.
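              | 
              | A rough sketch of that split (embed is a stand-in for
              | whatever embedding model you use; in practice you would
              | persist its output in a vector store and reuse it):
              | 
              |   import numpy as np
              |   from sklearn.ensemble import GradientBoostingClassifier
              | 
              |   def embed(text, dim=64):
              |       # stand-in for real embeddings; toy
              |       # vectors seeded from the text
              |       seed = sum(map(ord, text))
              |       rng = np.random.default_rng(seed)
              |       return rng.standard_normal(dim)
              | 
              |   texts = ["entrance table", "coffee table",
              |            "office chair", "desk lamp"]
              |   y = [1, 1, 0, 0]      # e.g. table vs not
              |   X = np.stack([embed(t) for t in texts])
              | 
              |   clf = GradientBoostingClassifier().fit(X, y)
              |   print(clf.predict([embed("hall table")]))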
        
         | gerad wrote:
         | It seems like a really easy way to overfit your model to your
         | data, even while using LLMs.
        
       | jncfhnb wrote:
       | If you're going to use classic ML why not just train a model
       | based on the vector representations of the product descriptions?
        
         | softwaredoug wrote:
         | Yes that's a great idea, and maybe something I would try next
         | in this series.
        
       | lewisl9029 wrote:
       | I had a somewhat similar experience trying to use LLMs to do OCR.
       | 
       | All the models I've tried (Sonnet 3.5, GPT 4o, Llama 3.2, Qwen2
       | VL) have been pretty good at extracting text, but they failed
       | miserably at finding bounding boxes, usually just making up
       | random coordinates. I thought this might have been due to
        | internal resizing of images, so I tried to get them to use
        | relative %-based coordinates, but no luck there either.
       | 
       | Eventually gave up and went back to good old PP-OCR models (are
       | these still state of the art? would love to try out some better
       | ones). The actual extraction feels a bit less accurate than the
       | best LLMs, but bounding box detection is pretty much spot on all
       | the time, and it's literally several orders of magnitude more
       | efficient in terms of memory and overall energy use.
       | 
       | My conclusion was that current gen models still just aren't
       | capable enough yet, but I can't help but feel like I might be
       | missing something. How the heck did Anthropic and OpenAI manage
       | to build computer use if their models can't give them accurate
       | coordinates of objects in screenshots?
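        | 
        | For reference, the PP-OCR route looks roughly like this
        | (PaddleOCR 2.x-style interface, from memory - check the
        | current docs):
        | 
        |   # pip install paddlepaddle paddleocr
        |   from paddleocr import PaddleOCR
        | 
        |   ocr = PaddleOCR(use_angle_cls=True, lang="en")
        |   result = ocr.ocr("screenshot.png", cls=True)
        | 
        |   # each detected line comes back with a quadrilateral
        |   # box plus (text, confidence), so the boxes are real
        |   # rather than made-up coordinates
        |   for box, (text, score) in result[0]:
        |       print(box, text, round(score, 2))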
        
         | HanClinto wrote:
         | Maybe still worth it to separate the tasks, and use a
         | traditional text detection model to find bounding boxes, then
         | crop the images. In a second stage, send those cropped samples
         | to the higher-power LLMs to do the actual text extraction, and
         | don't worry about them for bounding boxes at all.
         | 
         | There are some VLLMs that seem to be specifically trained to do
         | bounding box detection (Moondream comes to mind as one that
         | advertises this?), but in general I wouldn't be surprised if
         | none of them work as well as traditional methods.
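          | 
          | Something like this shape - with both stages stubbed out,
          | since the detector and the vision-LLM call are whatever you
          | have on hand:
          | 
          |   from PIL import Image
          | 
          |   def detect_text_boxes(image):
          |       # stage 1: stand-in for a traditional detector
          |       # (e.g. a PP-OCR det model); (l, t, r, b) boxes
          |       return [(10, 10, 110, 40)]
          | 
          |   def llm_read_text(crop):
          |       # stage 2: stand-in for a vision-LLM call
          |       return "<text from LLM>"
          | 
          |   image = Image.new("RGB", (200, 60), "white")  # toy input
          |   for box in detect_text_boxes(image):
          |       crop = image.crop(box)   # PIL crop takes (l, t, r, b)
          |       print(box, llm_read_text(crop))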
        
         | owkman wrote:
          | I think people have had success with using PaliGemma for
          | this. The computer-use products probably rely on fine-tuned
          | versions of LLMs rather than the base models.
        
         | ahzhou wrote:
          | LLMs are inherently bad at this due to tokenization, scaling,
          | and lack of training on the task. Anthropic's computer use
          | feature has a specialized model for pixel-counting:
          | 
          | > Training Claude to count pixels accurately was critical.
          | > Without this skill, the model finds it difficult to give
          | > mouse commands. [1]
          | 
          | For a VLM trained on identifying bounding boxes, check out
          | PaliGemma [2].
         | 
         | You may also be able to get the computer use API to draw
         | bounding boxes if the costs make sense.
         | 
         | That said, I think the correct solution is likely to use a non-
         | VLM to draw bounding boxes. Depends on the dataset and problem.
         | 
         | 1. https://www.anthropic.com/news/developing-computer-use 2.
         | https://huggingface.co/blog/paligemma
        
           | nostrebored wrote:
           | PaliGemma on computer use data is absolutely not good. The
           | difference between a FT YOLO model and a FT PaliGemma model
           | is huge if generic bboxes are what you need. Microsoft's
           | OmniParser also winds up using a YOLO backbone [1]. All of
           | the browser use tools (like our friends at browser-use [2])
           | wind up trying to get a generic set of bboxes using the DOM
           | and then applying generative models.
           | 
           | PaliGemma seems to fit into a completely different niche
           | right now (VQA and Segmentation) that I don't really see
           | having practical applications for computer use.
           | 
           | [1]
           | https://huggingface.co/microsoft/OmniParser?language=python
           | [2] https://github.com/browser-use/browser-use
        
         | bob1029 wrote:
         | > they failed miserably at finding bounding boxes, usually just
         | making up random coordinates.
         | 
         | This makes sense to me. These LLMs likely have no statistics
         | about the spatial relationships of tokens in a 2D raster space.
        
           | nostrebored wrote:
           | The spatial awareness is what grounding models try to
           | achieve, e.g. UGround [1]
           | 
           | [1]
           | https://huggingface.co/osunlp/UGround-V1-7B?language=python
        
         | aaronharnly wrote:
         | Relatedly, we find LLM vision models absolutely atrocious at
         | _counting things_. We build school curricula, and one basic
         | task for our activities is counting - blocks, pictures of
          | ducks, segments in a chart, whatever. Current LLM models can't
         | reliably count four or five squares in an image.
        
           | nyrikki wrote:
           | IMHO, that is expected, at least for the general case.
           | 
           | That is one of the implications of transformers being
            | DLOGTIME-uniform TC0: they don't have access to counter
           | analogs.
           | 
           | You would need to move to log depth circuits, add mod-p_n
           | gates etc... unless someone finds some new mathematics.
           | 
           | Proposition 6.14 in Immerman is where this is lost if you
           | want a cite.
           | 
           | It will be counterintuitive that division is in TC0, but
           | (general) counting is not.
        
         | whiplash451 wrote:
          | 1. You need to look into the OCR-specific DL literature
          | (e.g. udop) or segmentation-based approaches (e.g.
          | segment-anything)
         | 
         | 2. BigTech and SmallTech train their fancy bounding box /
         | detection models on large datasets that have been built using
         | classical detectors and a ton of manual curation
        
         | KTibow wrote:
          | Gemini 2 can purportedly do this; you can test it with the
         | Spatial Understanding Starter App inside AI Studio. Only caveat
         | is that it's not production ready yet.
        
         | DougBTX wrote:
         | AFAIK none of those models have been trained to produce
         | bounding boxes. On the other hand Gemini Pro has, so it may be
         | worth looking at for your use case:
         | 
         | https://simonwillison.net/2024/Aug/26/gemini-bounding-box-vi...
        
         | vonneumannstan wrote:
         | Yeah I really struggle when I use my hammer to screw pieces of
         | wood together too.
        
         | prettyblocks wrote:
         | Have you played with moondream? Pretty cool small vision model
          | that did a good job with bounding boxes when I played with it.
        
         | jonnycoder wrote:
         | I am doing OCR on hundreds of PDFs using AWS Textract. It
          | requires me to convert each page of the PDF to an image and
          | then analyze the image, and it works well for converting to
          | markdown format (which requires custom code). I want to try
         | using some vision models and compare how they do, for example
         | Phi-3.5-vision-instruct.
        
       | ausbah wrote:
        | This is sort of what tooling is supposed to be like, right?
        | The LLM isn't great at X task, so it delegates it to a proven,
        | capable tool like a calendar.
        
       | raghavbali wrote:
        | Maybe I missed something, but this is a roundabout way of
        | doing things where an embedding + ML classifier would have
        | done the job. We don't have to use an LLM just because it can
        | be used, IMO.
        
       | napsternxg wrote:
       | We often ignore the importance of using good baseline systems and
       | jump to the latest shiny thing.
       | 
        | I had a similar experience a few years back when participating
        | in ML competitions [1,2] for detecting and typing phrases in a
        | text. I submitted an approach based on Named Entity Recognition
        | using Conditional Random Fields (CRF), which have been quite
        | robust and well known in the community, and my solution beat
        | most of the tuned deep learning solutions by quite a large
        | margin [1].
        | 
        | I think a lot of folks underestimate the complexity of using
        | some of these models (DL, LLM) and just throw them at the
        | problem, or don't compare them well against well-established
        | baselines.
       | 
       | [1]
       | https://scholar.google.com/citations?view_op=view_citation&h...
       | [2]
       | https://scholar.google.com/citations?view_op=view_citation&h...
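        | 
        | For anyone curious, a generic CRF tagger in that style looks
        | roughly like this (sklearn-crfsuite; the features and data
        | here are invented, not the competition setup):
        | 
        |   # pip install sklearn-crfsuite
        |   import sklearn_crfsuite
        | 
        |   def token_features(sent, i):
        |       w = sent[i]
        |       return {"lower": w.lower(),
        |               "is_title": w.istitle(),
        |               "suffix3": w[-3:],
        |               "prev": sent[i - 1].lower() if i else "<s>"}
        | 
        |   sents = [["Apple", "hired", "Sam"]]
        |   tags = [["B-ORG", "O", "B-PER"]]
        |   X = [[token_features(s, i) for i in range(len(s))]
        |        for s in sents]
        | 
        |   crf = sklearn_crfsuite.CRF(algorithm="lbfgs",
        |                              max_iterations=50)
        |   crf.fit(X, tags)
        |   print(crf.predict(X))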
        
         | PaulHoule wrote:
         | As I see it, you need a model you can train quickly so you can
         | do tuning, model selection, and all that.
         | 
         | I have a BERT + SVM + Logistic Regression (for calibration)
         | model that can train 20 models for automatic model selection
         | and calibration in about 3 minutes. I feel like I understand
         | the behavior of it really well.
         | 
          | I've tried fine-tuning a BERT for the same task and the
          | shortest model builds take 30 minutes, the training curves
          | make no sense (back in the day I used to be able to train
          | networks with early stopping and get a good one _every
          | time_), and if I
         | look at arXiv papers it is rare for anyone to have a model
         | selection process with any discipline at all, mainly people use
         | a recipe that sorta-kinda seemed to work in some other paper.
         | People scoff at you if you ask the engineering-oriented
         | question "What training procedure can I use to get a good model
         | consistently?"
         | 
         | Because of that I like classical ML.
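          | 
          | The sklearn side of that stack is roughly this (the BERT
          | encoder is swapped for a toy stand-in to keep the sketch
          | short; calibration is Platt scaling, i.e. a logistic fit on
          | the SVM scores):
          | 
          |   import numpy as np
          |   from sklearn.svm import LinearSVC
          |   from sklearn.calibration import CalibratedClassifierCV
          | 
          |   def bert_embed(text, dim=32):
          |       # stand-in for frozen BERT sentence embeddings
          |       rng = np.random.default_rng(sum(map(ord, text)))
          |       return rng.standard_normal(dim)
          | 
          |   texts = ["great product", "awful quality", "love it",
          |            "broke in a day", "works perfectly", "junk",
          |            "five stars", "would not buy again"]
          |   y = [1, 0, 1, 0, 1, 0, 1, 0]
          |   X = np.stack([bert_embed(t) for t in texts])
          | 
          |   svm = LinearSVC()
          |   clf = CalibratedClassifierCV(svm, method="sigmoid", cv=2)
          |   clf.fit(X, y)   # logistic calibration on top of the SVM
          |   print(clf.predict_proba([bert_embed("pretty great")]))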
        
           | korkybuchek wrote:
           | There's a reason xgboost is still king in large companies.
        
       ___________________________________________________________________
       (page generated 2025-01-24 23:01 UTC)