[HN Gopher] Low responsiveness of ML models to critical or deter...
       ___________________________________________________________________
        
       Low responsiveness of ML models to critical or deteriorating health
       conditions
        
       Author : PaulHoule
       Score  : 81 points
       Date   : 2025-03-26 14:43 UTC (3 days ago)
        
 (HTM) web link (www.nature.com)
 (TXT) w3m dump (www.nature.com)
        
       | wswope wrote:
       | I work in the ICU monitoring field, on the R&D team of a company
       | with live systems at dozens of hospitals and multiple FDA
       | approvals. We use extended Kalman filters (i.e. non-blackbox
       | "ML") to estimate certain lab values of patients that are highly
       | indicative of them crashing, based on live data from whatever set
       | of monitors they're hooked up to - and it's highly robust.
       | 
       | What the authors of this paper are doing is throwing stuff at the
       | wall to see if it works, and publishing results. That's not
       | necessarily a bad thing at all, but I say this to underline that
       | their results are not at all reflective of SOTA capabilities, and
       | they're not doing much exploration of prior art.
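        | 
        | For anyone unfamiliar, the core of an EKF is just a
        | predict/update loop. A minimal numpy sketch with toy f/h
        | models (purely illustrative - not our production filter):
        | 
        |     import numpy as np
        |     
        |     def f(x):          # state transition (toy model)
        |         return x       # e.g. lab value drifts slowly
        |     
        |     def F_jac(x):      # Jacobian of f
        |         return np.eye(len(x))
        |     
        |     def h(x):          # hidden lab value -> observed vitals
        |         return np.array([x[0], x[0] ** 2])
        |     
        |     def H_jac(x):      # Jacobian of h
        |         return np.array([[1.0], [2.0 * x[0]]])
        |     
        |     def ekf_step(x, P, z, Q, R):
        |         # Predict
        |         x_pred = f(x)
        |         F = F_jac(x)
        |         P_pred = F @ P @ F.T + Q
        |         # Update with new monitor reading z
        |         H = H_jac(x_pred)
        |         y = z - h(x_pred)                    # innovation
        |         S = H @ P_pred @ H.T + R
        |         K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
        |         x_new = x_pred + K @ y
        |         P_new = (np.eye(len(x)) - K @ H) @ P_pred
        |         return x_new, P_new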
        
         | jvanderbot wrote:
         | Parameter estimation is ML now?
        
           | thenobsta wrote:
           | Why not? LLMs, vision models, and kalman filters all learn
           | parameters based on data.
        
             | PaulHoule wrote:
              | A linear regression model can be written and trained
              | as a neural net, has a loss function, all of that.
              | Most, if not all, ML problems can be formulated as
              | modelling a probability distribution.
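              | 
              | For example, in PyTorch it's literally a one-layer net
              | trained with an MSE loss (toy sketch on synthetic
              | data):
              | 
              |     import torch
              |     
              |     X = torch.randn(100, 3)
              |     w_true = torch.tensor([[2.0], [-1.0], [0.5]])
              |     y = X @ w_true + 0.1 * torch.randn(100, 1)
              |     
              |     model = torch.nn.Linear(3, 1)   # y = Xw + b
              |     loss_fn = torch.nn.MSELoss()    # least squares
              |     opt = torch.optim.SGD(model.parameters(), lr=0.1)
              |     
              |     for _ in range(200):
              |         opt.zero_grad()
              |         loss_fn(model(X), y).backward()
              |         opt.step()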
        
               | klodolph wrote:
               | That's too reductive--ML models are statistical models.
               | Statistical models have parameters, and in general cases,
               | you choose the parameters with some kind of optimization
               | algorithm.
               | 
               | If you play fast and loose with your definition of "ML",
               | you'll end up defining it so that any statistical model
               | is an ML model... in which case, why even bother using
               | two different terms?
               | 
               | ML models are, broadly speaking, the more complicated
               | ones with more parameters, where the behavior of the
               | models is not really known without training. I'm sure you
               | could nitpick to death any definition I give, but that's
               | fine.
        
               | PaulHoule wrote:
                | I am sure there are people who teach data science
                | classes and look at it in that "reductive" way.
               | 
               | From the viewpoint of engineering, scikit-learn provides
               | the same interface to linear regression that it supplies
               | to many other models. Huggingface provides an interface
               | to models that is similar in a lot of ways but I think a
               | 'regression' in that it doesn't provide the bare minimum
               | of model selection facilities needed to reliably make
               | calibrated models. There are many problems where you
               | could use either linear regression or a much more complex
                | model. When it comes to "not known without training"
                | I'm not sure how much of that is the limit of what
                | we know right now and how much is fundamental, as in
                | the problem of "we can't really know what a computer
                | program with free loops will do" (Halting problem)
                | or "we can't predict what side of the sun Pluto is
                | going to be on in 30 million years" (Deterministic
                | chaos).
               | 
               | (The first industrial model trainer I built was a
               | simultaneously over and under engineered mess like most
               | things in this industry... I didn't appreciate scikit-
               | learn's model selection facilities and even though the
               | data sci's I worked with had a book understanding of
               | them, they didn't really put them to work.)
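                | 
                | Concretely, the interface is the same whether you
                | reach for the "statistics" model or the "ML" one,
                | and the model selection machinery doesn't care
                | either (toy sketch):
                | 
                |     from sklearn.datasets import make_regression
                |     from sklearn.ensemble import (
                |         GradientBoostingRegressor)
                |     from sklearn.linear_model import LinearRegression
                |     from sklearn.model_selection import (
                |         cross_val_score)
                |     
                |     X, y = make_regression(n_samples=200,
                |                            n_features=5, noise=0.3)
                |     
                |     for model in (LinearRegression(),
                |                   GradientBoostingRegressor()):
                |         scores = cross_val_score(model, X, y, cv=5)
                |         print(type(model).__name__, scores.mean())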
        
               | klodolph wrote:
               | There's a pedagogical reason to teach things with a kind
               | of reductive definition. It makes a lot of sense.
               | 
               | I remember getting cornered by somebody in a statistics
               | class and interrogated about whether I thought neural
               | networks were statistical techniques. In that situation
               | I'll only answer yes, they are statistical techniques. As
               | far as I can tell, a big chunk of what we do with machine
               | learning is create complicated models with a large number
               | of parameters. We're not creating something _other_ than
               | statistical models. We just draw a kind of cultural line
               | between traditional statistics and machine learning
               | techniques.
               | 
               | Now that I think about it, maybe if you asked me about
               | the line between traditional statistics and machine
               | learning, I would say that in traditional statistics, you
               | can understand the parameters.
        
               | blackbear_ wrote:
               | > in traditional statistics, you can understand the
               | parameters.
               | 
               | I also think that this is the key differentiator between
               | ML and stats.
               | 
                | Statistical models can be understood _formally_,
                | which means that not only do we know how each
                | parameter affects the predictions, we also know what
                | their estimation uncertainties are, under which
                | assumptions, and how to check that those assumptions
                | are satisfied. Usually, we value these models not
                | only because they're predictive but also because
                | they're interpretable.
               | 
                | In ML there is neither the luxury of nor the
                | interest in doing this; all we want is something
                | that predicts as well as possible.
               | 
               | So the difference is not the model itself but what you
               | want to get out of it.
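                | 
                | For example, a classical OLS fit gives you
                | per-parameter standard errors, confidence intervals
                | and diagnostics out of the box (toy statsmodels
                | sketch):
                | 
                |     import numpy as np
                |     import statsmodels.api as sm
                |     
                |     rng = np.random.default_rng(0)
                |     X = rng.normal(size=(200, 2))
                |     y = (1.5 * X[:, 0] - 2.0 * X[:, 1]
                |          + rng.normal(scale=0.5, size=200))
                |     
                |     fit = sm.OLS(y, sm.add_constant(X)).fit()
                |     # coefficients, std errors, t-stats, CIs, ...
                |     print(fit.summary())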
        
           | klodolph wrote:
            | I think "ML" is in quotes for a reason--namely, that the
            | usage is not typical.
        
           | windsignaling wrote:
           | Neural networks are not ML now?
        
         | AlotOfReading wrote:
         | Calling EKFs "ML" is certainly a choice.
        
           | Epa095 wrote:
           | Is it less ML than linear regression?
        
             | klodolph wrote:
             | If you want to draw the line between ML and not ML, I think
             | you'll have to put Kalman filters and linear regression on
             | the non-ML side. You can put support vector machines and
             | neural networks on the ML side.
             | 
             | In some sense the exact place you draw the distinction is
             | arbitrary. You could try to characterize where the
             | distinction is by saying that models with fewer parameters
             | and lower complexity tend to be called "not ML", and models
             | with more parameters and higher complexity tend to be
             | called "ML".
        
               | wrsh07 wrote:
               | Linear regression is literally the second lecture of the
               | Stanford ML class. https://cs229.stanford.edu/
               | 
               | If you want to say "not neural networks" or not dnn or
               | not llm, sure. But it's obviously machine learning
        
               | klodolph wrote:
               | When you say it's "obviously machine learning", how could
               | that statement possibly be correct? There's not even
               | broad consensus here... so you don't get to say that your
               | definition is obviously correct.
               | 
               | There are pedagogical reasons why you'd include linear
               | regression in a machine learning course. This is pretty
               | clear to me--they have properties which are extremely
                | important to the field of machine learning, such as
                | differentiability.
        
               | computerex wrote:
               | Linear regression is ML. You are off base.
        
               | genewitch wrote:
               | That's cool. Can you explain what machines they were
               | using in the 1800s to do learning on?
        
               | wrsh07 wrote:
                | Does this mean that if I simulate a neural network
                | with pen and paper, it stops being machine learning?
                | 
                | All of these are machine learning techniques. Doing
                | it by hand doesn't change anything. Today we use
                | machines, so it's machine learning.
        
               | windsignaling wrote:
               | You're using the word to define the concept rather than
               | the concept to define the word. Wrong order.
               | 
               | See: "Program".
        
               | Retric wrote:
               | If you want to draw a line you could say anything that's
               | not full AGI isn't machine learning because of
               | philosophical reasons.
               | 
                | But other than that, the only clear line is when the
                | programmer isn't hard-coding results, which puts
                | linear regression over the ML line. I guess you could
               | argue about supervised vs unsupervised algorithms, but
               | that's going to exclude a lot of what is generally
               | described as ML.
        
               | wrsh07 wrote:
               | It is obviously machine learning. It is a machine
               | learning algorithm. It is taught in machine learning
               | classes. It is described as a machine learning algorithm
               | literally everywhere. Here's wiki:
               | 
               | > Linear regression is also a type of machine learning
               | algorithm, more specifically a supervised algorithm, that
               | learns from the labelled datasets and maps the data
               | points to the most optimized linear functions that can be
               | used for prediction on new datasets
               | 
               | You can pretend it's not because it's not a sophisticated
               | machine learning algorithm, but you are wrong.
        
               | windsignaling wrote:
               | After spending over a decade in both statistics and
               | machine learning I'd say the only reason there isn't a
               | "broad consensus" is because statisticians like to gate-
               | keep, whether that's linear regression, Monte Carlo
               | methods, or Kalman Filters.
               | 
               | Linear regression appears in pretty much every ML
               | textbook. Can you confidently say, "this model that
               | appears in every ML textbook is the only model in the ML
               | textbook that isn't an ML model"?
               | 
                | Kalman Filters are like continuous-state HMMs. So why
               | are HMMs considered ML and Kalman Filters not considered
               | ML?
               | 
               | IMO it's an ego thing. They spent decades rigorously
               | analyzing everything about linear models and here come
               | these CS cowboys producing amazing results without any of
               | the careful rigor that statisticians normally apply. It's
               | difficult to argue against real results so the
               | inflexible, hard-nosed statisticians just hang on to
               | whatever they can.
        
           | idiotsecant wrote:
           | It's machine learning until you understand how it works, then
           | it's just control theory and filters again.
        
             | whatshisface wrote:
             | Diffusion models are a happy middle ground. :-)
        
           | wswope wrote:
           | Hence the quotes ;).
        
           | getnormality wrote:
           | It is a reasonable choice, and especially with the quotes
           | around it, completely understandable.
           | 
           | The distinction between statistical inference and machine
           | learning is too blurry to police Kalman filters onto one
           | side.
        
           | CamperBob2 wrote:
           | EKFs work by 'learning' the covariance matrix on the fly, so
           | I don't see why not?
        
         | nyrikki wrote:
         | As an intuition on why many people see this as different.
         | 
         | PAC Learning is about compression, KF/EKF is more like Taylor
         | expansion.
         | 
          | The specific types of PAC Learning that this paper covers
          | have problems with a simplicity bias and fairly low
          | sensitivity.
         | 
         | While based on UHATs, this paper may provide some insights.
         | 
         | https://arxiv.org/abs/2502.02393
         | 
          | Obviously LLMs and LRMs are the most studied, but even the
          | recent posts on here from Anthropic show that without a few
          | high-probability entries in the top-k results,
          | confabulations are difficult for transformers.
          | 
          | Obviously there are PAC Learning methods that target anomaly
          | detection, but they are very different from even EKF + Mc.
         | 
         | You will note in this paper that even highly weighted features
         | exhibited low sensitivity.
         | 
          | While the industry may find some pathological cases that
          | make the approach usable, autograd and the need for
          | parallelism make the application of this paper's methods to
          | tiny variations in multivariate problems ambitious.
          | 
          | They also only trained on medical data. Part of the reason
          | the foundation models do so well is that they encode
          | verifiers from a huge corpus that invalidates the
          | traditional bias-variance tradeoffs from the early '90s
          | papers.
         | 
         | But they are still selecting from the needles and don't have
         | access to the hay in the haystack.
         | 
         | The following paper is really not related except it shows how
         | compression exacerbates that problem.
         | 
         | https://arxiv.org/abs/2205.06977
         | 
          | Chaitin's constant, which encodes the Halting problem and is
          | normal and uncomputable, is the extreme top end of
          | computability, but it relates to the compression idea.
          | 
          | EKFs have access to the computable reals, and while
          | non-linear, KFs and EKFs can be thought of as using
          | linearization of the approximations as a lens.
         | 
         | If the diagnostic indicators were both ergodic and Markovian,
         | this paper's approach would probably be fairly reliable.
         | 
          | But these efforts are really about finding a many-to-one
          | reduction that works.
         | 
         | I am skeptical about it in this case for PAC ML, but perhaps
         | they will find a pathological case.
         | 
         | But the tradeoffs between statistical learning and expansive
         | methods are quite different.
         | 
          | Obviously hype cycles drive efforts; I encourage you to look
          | at this year's AAAI conference report and see that you are
          | not alone in your frustration with the single-minded
          | approach.
         | 
         | IMHO this paper is a net positive, showing that we are moving
         | from a broad exploration to targeted applications.
         | 
         | But that is just my opinion.
        
       | anthk wrote:
        | LLM models are organic? They somehow obey the laws of
        | thermodynamics by some parallel in algorithms and underlying
        | math? It would be amazing if there were some parallel between
        | biology (especially fungi with emergent properties) and neural
        | networks...
        
       | magicalhippo wrote:
       | _For IHM prediction, LSTM models and transformer models were
       | trained for 100 epochs using the MIMIC-III and eICU datasets
       | separately._
       | 
       | I might be blind, but I don't see any mention of loss. Did they
       | stop at 100 because it was a nice round number or because it was
       | a good place to stop?
       | 
       | The LSTM model they used had 7k trainable parameters, the CW-LSTM
       | model 153k while the transformer model had 800k parameters (300k
       | trainable parameters and 600k optimizer parameters as they say).
       | 
        | I don't follow the field closely enough, but is it reasonable
        | that these models all converged at the same time, given the
        | large difference in size?
       | 
       | They mention the transformer model outperforming the LSTMs, but I
       | wonder if it could have done a lot better.
        
         | PaulHoule wrote:
         | What ever happened to early stopping?
         | 
          | I see so many papers where people train neural networks with
          | half-baked recipes. I think I first saw early stopping
          | around 1990, but it is still so common for people to pick
          | some arbitrary number of epochs to run. I have to admit I
          | never liked the term "early stopping"; I think people should
          | have called it just "stopping", because the "early" makes it
          | seem optional.
         | 
         | Back when I was training LSTM networks it was straightforward
         | to train nets reliably with early stopping...
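          | 
          | For anyone who hasn't seen it, the whole recipe is a few
          | lines (toy PyTorch sketch, full-batch for brevity; the
          | patience and threshold are illustrative):
          | 
          |     import torch
          |     
          |     X = torch.randn(512, 1)
          |     y = X ** 2 + 0.1 * torch.randn(512, 1)
          |     X_tr, y_tr = X[:400], y[:400]
          |     X_va, y_va = X[400:], y[400:]
          |     
          |     model = torch.nn.Sequential(
          |         torch.nn.Linear(1, 32), torch.nn.ReLU(),
          |         torch.nn.Linear(32, 1))
          |     opt = torch.optim.Adam(model.parameters(), lr=1e-2)
          |     loss_fn = torch.nn.MSELoss()
          |     
          |     best, state, patience, bad = float("inf"), None, 10, 0
          |     for epoch in range(1000):   # ceiling, not a tuned count
          |         opt.zero_grad()
          |         loss_fn(model(X_tr), y_tr).backward()
          |         opt.step()
          |         with torch.no_grad():
          |             val = loss_fn(model(X_va), y_va).item()
          |         if val < best - 1e-4:   # still improving
          |             best, bad = val, 0
          |             state = {k: v.clone() for k, v in
          |                      model.state_dict().items()}
          |         else:
          |             bad += 1
          |             if bad >= patience: # validation has stalled
          |                 break
          |     model.load_state_dict(state)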
        
           | Al-Khwarizmi wrote:
            | I'm also annoyed about this. I suppose the main reason is
            | that 20 years ago, if you didn't use early stopping, your
            | accuracy would typically plummet. In earlier, smaller
           | neural networks, overfitting was a huge issue; and lack of
           | dropout, batch normalization, etc. made learning much more
           | brittle.
           | 
           | Now the young'uns don't bother because you can just set 100
           | epochs, or whatever, and the result might not be optimal but
           | it will generally be fine. Still, it's a pity because you're
           | often wasting computational resources that could be spent in
           | trying alternative architectures, exploring hyperparameters
           | or whatever.
           | 
           | BTW, I also think "early stopping" is a terrible name. If you
           | don't know the term, it suggests that you're going to
           | undertrain the network, sacrificing some accuracy for
           | efficiency. No one wants undertrained networks. I think it's
           | not an overstatement to say that if it were called "adaptive
           | stopping", "validation guided stopping", or even something
           | more catchy like "smart stopping", probably more people would
           | use it.
        
             | PaulHoule wrote:
              | I have a smart RSS reader, YOShInOn, which uses BERT +
              | a probability-calibrated SVM as its main model. I want
              | to make a general-purpose model trainer for text
              | classification that is able to do harder problems.
             | 
              | People who hold court on ML forums will tell you fine-
              | tuned BERT is the way to go, but BERT fine-tuning
              | doesn't seem to be compatible with early stopping under
              | anything like the training recipes I see in the
              | literature. Compared to the old days these networks
              | soak up knowledge like a sponge; my hunch is that with
              | N=10,000 samples or so you don't benefit from running
              | more than one epoch because the network doesn't have
              | the capacity to learn from that many samples.
             | 
             | I find it depressing to find arXiv papers where people copy
              | a training recipe from other papers for BERT and
              | compare it on 5-15 different text classification
              | problems with maybe
             | N=500 samples. My BERT experiments take about 30 minutes so
             | it's no small thing to do parametric scans on them,
             | particularly when the epoch count is one of the parameters.
             | With "smart stopping" I'm not afraid of undertraining
             | models so I could run trainings all night and believe I'm
             | seeing representative performance as I vary parameters.
             | 
              | My plan is to couple ModernBERT to an LSTM or Bi-LSTM
              | model, as the literature seems to show that this
              | frequently ties or beats fine-tuned BERT, and my
              | experience so far is that I can build reliable trainers
              | for LSTMs, whereas team fine-tuned BERT is indifferent
              | to the very idea of "reliable".
             | 
              | Another pet peeve is all the papers with N=500 samples,
              | when I regularly get N=10,000+ in systems that I use
              | every day; on a rainy weekend I can lie in bed with my
              | iPad, switch to an Android tablet when the battery runs
              | out, and get N=5000 samples. [1] When I wrote my first
              | text classification paper we found we needed N=10,000
              | to get really good models. Sure, the world knowledge in
              | BERT helps models learn fast and that's great (a
              | problem I worried about in 2005 and still worry about
              | because I think the average person wants good results
              | at N<10!), but I need calibrated, usable accuracy and
              | look at AUC-ROC as my metric, not "accuracy", F1 or
              | anything like that.
             | 
              | Then there's the effort people waste on things that
              | can't possibly work, like Word2Vec; it seems like
              | people can read a lot of papers and not see what's
              | right in front of them: that Word2Vec is useless! I
              | want to write a meta-analysis, but instead I'm writing
              | a diatribe, and I'm not going to be happy until I
              | repeat the paradigm with methods that are...
              | _repeatable,_ not for the science but for the
              | engineering.
             | 
             | [1] with hallucinations as a side effect if it is a visual
             | task but so what
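              | 
              | The general shape of that setup, for anyone curious
              | (sketch only; the encoder name and data here are
              | placeholders, not what YOShInOn actually uses):
              | 
              |     from sentence_transformers import SentenceTransformer
              |     from sklearn.calibration import CalibratedClassifierCV
              |     from sklearn.svm import LinearSVC
              |     
              |     texts = ["great paper", "terrible results",
              |              "love this", "waste of time"] * 50
              |     labels = [1, 0, 1, 0] * 50
              |     
              |     # frozen encoder + calibrated linear head
              |     enc = SentenceTransformer("all-MiniLM-L6-v2")
              |     X = enc.encode(texts)
              |     
              |     clf = CalibratedClassifierCV(LinearSVC(),
              |                                  method="sigmoid", cv=5)
              |     clf.fit(X, labels)
              |     # calibrated probabilities, usable for AUC-ROC
              |     probs = clf.predict_proba(X)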
        
           | rakejake wrote:
            | Nowadays, even what counts as an "epoch" is not well
            | defined. Traditionally it meant a pass over the entire
           | training set, but datasets are so massive today that many now
           | define an epoch as X steps - where a step is a minibatch (of
           | whatever size) from the training set. So 1 epoch is a random
           | sample of X minibatches from the training set. I'd guess the
           | logic is that datasets are so massive that you pick as much
           | data as you can fit in VRAM.
           | 
           | Karpathy's Zero To Hero series also uses this.
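            | 
            | I.e., something along these lines (sketch; the numbers
            | are arbitrary):
            | 
            |     import torch
            |     from torch.utils.data import DataLoader, TensorDataset
            |     
            |     data = TensorDataset(torch.randn(10_000, 8),
            |                          torch.randint(0, 2, (10_000,)))
            |     loader = DataLoader(data, batch_size=256, shuffle=True)
            |     
            |     STEPS_PER_EPOCH = 1000   # the redefined "epoch"
            |     batches = iter(loader)
            |     for step in range(STEPS_PER_EPOCH):
            |         try:
            |             xb, yb = next(batches)
            |         except StopIteration:  # reshuffle and continue
            |             batches = iter(loader)
            |             xb, yb = next(batches)
            |         # ...one optimizer step on (xb, yb)...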
        
         | rakejake wrote:
         | A 7k param LSTM is very tiny. Not sure if LSTMs would even work
         | at that scale although someone with more theoretical knowledge
         | can correct me on this.
         | 
         | As an aside, I'm trying to train transformers for some
         | classification tasks on audio data. The models are "small"
         | (like 1M-15M params at most) and I find they are very finicky
         | to train. Below 1M parameters I find them hard to train at all.
         | I have thrown all sorts of learning rate schedules at them and
          | the best I can get is that the network learns for a bit and
          | then plateaus, after which I can't do anything to get it out
          | of that minimum. Training an LSTM/GRU on the same data gives
          | me a
         | much better loss value.
         | 
         | I couldn't find many papers on training transformers at that
         | scale. The only one I was able to find was MS's TinyStories
         | [0], but that paper didn't delve much into how they trained the
         | models and whether they trained from scratch or distilled from
         | a larger model.
         | 
         | At those scales, I find LSTMs and CNNs are a lot more stable.
          | The few online threads I've found comparing LSTMs and
          | Transformers had the same thing to say - Transformers need a
          | lot more data and model size to achieve parity with and
          | exceed LSTMs/GRUs/CNNs, maybe because the inductive bias the
          | latter provide is hard to beat at those scales. Others can
          | comment on what
         | they've seen.
         | 
         | [0] - https://arxiv.org/abs/2305.07759
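          | 
          | (For reference, the schedule most transformer recipes seem
          | to converge on is linear warmup followed by cosine decay -
          | a few lines in PyTorch; the step counts below are
          | placeholders, not a recipe I'm claiming works:)
          | 
          |     import math
          |     import torch
          |     
          |     model = torch.nn.Linear(10, 2)   # placeholder model
          |     opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
          |     
          |     warmup, total = 500, 20_000      # steps
          |     
          |     def lr_lambda(step):
          |         if step < warmup:            # linear warmup
          |             return step / max(1, warmup)
          |         t = (step - warmup) / max(1, total - warmup)
          |         return 0.5 * (1 + math.cos(math.pi * t))
          |     
          |     sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
          |     # call sched.step() after each opt.step() in the loop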
        
           | Al-Khwarizmi wrote:
           | I don't have much help to offer, but just to echo your
            | experience... at my group we have tried to train
            | Transformers from scratch for various NLP tasks, and we
            | have always been hit with them being extremely brittle and
            | with BiLSTMs working
           | better. We only succeeded by following a pre-established
           | recipe (e.g. training a BERT model from scratch for a new
           | language, where the architecture, parameters and tasks are as
           | in BERT), or of course by fine-tuning existing models, but
           | just throwing some layers at a problem and training them from
           | scratch... nope, won't work without arcane knowledge that
           | doesn't seem to be written anywhere accessible. This is one
           | of the reasons why I dislike Transformers and I root for the
           | likes of RWKV to take the throne.
        
             | rakejake wrote:
             | I think the "arcane knowledge" is true for LLMs (billions).
             | But there are lots of people who train models in the open
             | in the hundreds of millions realm, but never below. Maybe
             | transformers simply don't work as well below a size and
             | data threshold.
        
       | amelius wrote:
       | Maybe start a Kaggle competition?
        
       | MeteorMarc wrote:
       | Maybe the training set had too many zeroshot patients.
        
       | bbstats wrote:
       | Am I missing something or is this just "We built models that are
       | bad"?
        
         | ohgr wrote:
         | Bad model, bad method or bad paper?
        
       | timewizard wrote:
       | All of this seems designed to make the hospital more labor
       | efficient. None of this seems designed to improve long term
       | outcomes for patients.
       | 
        | My suspicion continues to grow that this technology gets the
        | hype it does as part of an effort to reduce wages for all
        | workers.
        
       ___________________________________________________________________
       (page generated 2025-03-29 23:01 UTC)