[HN Gopher] Low responsiveness of ML models to critical or deter...
___________________________________________________________________
Low responsiveness of ML models to critical or deteriorating health
conditions
Author : PaulHoule
Score : 81 points
Date : 2025-03-26 14:43 UTC (3 days ago)
(HTM) web link (www.nature.com)
(TXT) w3m dump (www.nature.com)
| wswope wrote:
| I work in the ICU monitoring field, on the R&D team of a company
| with live systems at dozens of hospitals and multiple FDA
| approvals. We use extended Kalman filters (i.e. non-blackbox
| "ML") to estimate certain lab values of patients that are highly
| indicative of them crashing, based on live data from whatever set
| of monitors they're hooked up to - and it's highly robust.
|
| What the authors of this paper are doing is throwing stuff at the
| wall to see if it works, and publishing results. That's not
| necessarily a bad thing at all, but I say this to underline that
| their results are not at all reflective of SOTA capabilities, and
| they're not doing much exploration of prior art.
| jvanderbot wrote:
| Parameter estimation is ML now?
| thenobsta wrote:
| Why not? LLMs, vision models, and Kalman filters all learn
| parameters based on data.
| PaulHoule wrote:
| A linear regression model can be written and trained as a
| neural net; it has a loss function and all of that. Most if
| not all ML problems can be formulated as modelling a
| probability distribution.
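|
| For instance, a minimal sketch with made-up data (plain
| numpy, nothing from any particular library): linear
| regression trained as a one-neuron net by gradient descent
| on a squared-error loss:
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     X = rng.normal(size=(100, 3))       # made-up features
|     true_w = np.array([1.0, -2.0, 0.5])
|     y = X @ true_w + 0.1 * rng.normal(size=100)
|
|     w, b = np.zeros(3), 0.0             # one linear "neuron"
|     lr = 0.1
|     for _ in range(500):                # gradient descent on MSE
|         err = X @ w + b - y
|         w -= lr * 2 * X.T @ err / len(y)   # dL/dw
|         b -= lr * 2 * err.mean()           # dL/db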
| klodolph wrote:
| That's too reductive--ML models are statistical models.
| Statistical models have parameters, and in general cases,
| you choose the parameters with some kind of optimization
| algorithm.
|
| If you play fast and loose with your definition of "ML",
| you'll end up defining it so that any statistical model
| is an ML model... in which case, why even bother using
| two different terms?
|
| ML models are, broadly speaking, the more complicated
| ones with more parameters, where the behavior of the
| models is not really known without training. I'm sure you
| could nitpick to death any definition I give, but that's
| fine.
| PaulHoule wrote:
| I am sure there are people teaching data science classes
| who look at it in that "reductive" way.
|
| From the viewpoint of engineering, scikit-learn provides
| the same interface to linear regression that it supplies
| to many other models. Huggingface provides an interface
| to models that is similar in a lot of ways, but is, I
| think, a 'regression' in that it doesn't provide the bare
| minimum of model selection facilities needed to reliably
| make calibrated models. There are many problems where you
| could use either linear regression or a much more complex
| model. As for "not known without training", I'm not sure
| how much of that is the limit of what we know right now
| and how much is fundamental, as in "we can't really know
| what a computer program with free loops will do" (the
| Halting problem) or "we can't predict what side of the Sun
| Pluto is going to be on in 30 million years" (deterministic
| chaos).
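|
| For instance, a rough sketch with synthetic data (nothing
| from this thread or the paper): the same fit/predict
| interface and the same model selection tooling apply
| whether the estimator is a linear regression or something
| far more complex:
|
|     from sklearn.datasets import make_regression
|     from sklearn.linear_model import LinearRegression
|     from sklearn.ensemble import GradientBoostingRegressor
|     from sklearn.model_selection import cross_val_score
|
|     X, y = make_regression(n_samples=200, n_features=5,
|                            noise=0.5, random_state=0)
|
|     # identical interface and selection machinery for both
|     for model in (LinearRegression(),
|                   GradientBoostingRegressor()):
|         scores = cross_val_score(model, X, y, cv=5)
|         print(type(model).__name__, scores.mean())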
|
| (The first industrial model trainer I built was a
| simultaneously over- and under-engineered mess, like most
| things in this industry... I didn't appreciate scikit-
| learn's model selection facilities, and even though the
| data scientists I worked with had a book understanding of
| them, they didn't really put them to work.)
| klodolph wrote:
| There's a pedagogical reason to teach things with a kind
| of reductive definition. It makes a lot of sense.
|
| I remember getting cornered by somebody in a statistics
| class and interrogated about whether I thought neural
| networks were statistical techniques. In that situation
| I'll only answer yes, they are statistical techniques. As
| far as I can tell, a big chunk of what we do with machine
| learning is create complicated models with a large number
| of parameters. We're not creating something _other_ than
| statistical models. We just draw a kind of cultural line
| between traditional statistics and machine learning
| techniques.
|
| Now that I think about it, maybe if you asked me about
| the line between traditional statistics and machine
| learning, I would say that in traditional statistics, you
| can understand the parameters.
| blackbear_ wrote:
| > in traditional statistics, you can understand the
| parameters.
|
| I also think that this is the key differentiator between
| ML and stats.
|
| Statistical models can be understood _formally_, which
| means that we not only know how each parameter affects
| the predictions, but also what their estimation
| uncertainties are, under which assumptions, and how to
| check that those assumptions are satisfied. Usually, we
| value these models not only because they're predictive
| but also because they're interpretable.
|
| In ML there is neither the luxury nor the interest in
| doing this; all we want is something that predicts as
| well as possible.
|
| So the difference is not the model itself but what you
| want to get out of it.
| klodolph wrote:
| I think "ML" is in quotes for a reason: the usage is not
| typical.
| windsignaling wrote:
| Neural networks are not ML now?
| AlotOfReading wrote:
| Calling EKFs "ML" is certainly a choice.
| Epa095 wrote:
| Is it less ML than linear regression?
| klodolph wrote:
| If you want to draw the line between ML and not ML, I think
| you'll have to put Kalman filters and linear regression on
| the non-ML side. You can put support vector machines and
| neural networks on the ML side.
|
| In some sense the exact place you draw the distinction is
| arbitrary. You could try to characterize where the
| distinction is by saying that models with fewer parameters
| and lower complexity tend to be called "not ML", and models
| with more parameters and higher complexity tend to be
| called "ML".
| wrsh07 wrote:
| Linear regression is literally the second lecture of the
| Stanford ML class. https://cs229.stanford.edu/
|
| If you want to say "not neural networks", "not DNNs", or
| "not LLMs", sure. But it's obviously machine learning.
| klodolph wrote:
| When you say it's "obviously machine learning", how could
| that statement possibly be correct? There's not even
| broad consensus here... so you don't get to say that your
| definition is obviously correct.
|
| There are pedagogical reasons why you'd include linear
| regression in a machine learning course. This is pretty
| clear to me--it has properties which are extremely
| important to the field of machine learning, such as
| differentiability.
| computerex wrote:
| Linear regression is ML. You are off base.
| genewitch wrote:
| That's cool. Can you explain what machines they were
| using in the 1800s to do learning on?
| wrsh07 wrote:
| Does this mean that if I simulate a neural network with
| pen and paper, it stops being machine learning?
|
| All of these are machine learning techniques. Doing it by
| hand doesn't change anything. Today we use machines, so
| it's machine learning.
| windsignaling wrote:
| You're using the word to define the concept rather than
| the concept to define the word. Wrong order.
|
| See: "Program".
| Retric wrote:
| If you want to draw a line, you could say that anything
| that's not full AGI isn't machine learning, for
| philosophical reasons.
|
| But other than that, the only clear line is whether the
| programmer is hard-coding results, which puts linear
| regression over the ML line. I guess you could argue
| about supervised vs. unsupervised algorithms, but that's
| going to exclude a lot of what is generally described as
| ML.
| wrsh07 wrote:
| It is obviously machine learning. It is a machine
| learning algorithm. It is taught in machine learning
| classes. It is described as a machine learning algorithm
| literally everywhere. Here's wiki:
|
| > Linear regression is also a type of machine learning
| algorithm, more specifically a supervised algorithm, that
| learns from the labelled datasets and maps the data
| points to the most optimized linear functions that can be
| used for prediction on new datasets
|
| You can pretend it's not because it's not a sophisticated
| machine learning algorithm, but you are wrong.
| windsignaling wrote:
| After spending over a decade in both statistics and
| machine learning I'd say the only reason there isn't a
| "broad consensus" is because statisticians like to gate-
| keep, whether that's linear regression, Monte Carlo
| methods, or Kalman Filters.
|
| Linear regression appears in pretty much every ML
| textbook. Can you confidently say, "this model that
| appears in every ML textbook is the only model in the ML
| textbook that isn't an ML model"?
|
| A Kalman filter is like a continuous-state HMM. So why
| are HMMs considered ML and Kalman filters not?
|
| IMO it's an ego thing. They spent decades rigorously
| analyzing everything about linear models and here come
| these CS cowboys producing amazing results without any of
| the careful rigor that statisticians normally apply. It's
| difficult to argue against real results so the
| inflexible, hard-nosed statisticians just hang on to
| whatever they can.
| idiotsecant wrote:
| It's machine learning until you understand how it works, then
| it's just control theory and filters again.
| whatshisface wrote:
| Diffusion models are a happy middle ground. :-)
| wswope wrote:
| Hence the quotes ;).
| getnormality wrote:
| It is a reasonable choice, and especially with the quotes
| around it, completely understandable.
|
| The distinction between statistical inference and machine
| learning is too blurry to police Kalman filters onto one
| side.
| CamperBob2 wrote:
| EKFs work by 'learning' the covariance matrix on the fly, so
| I don't see why not?
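|
| For the unfamiliar, a rough sketch of what tracking the
| covariance "on the fly" looks like for a plain linear Kalman
| filter; the matrices are made up for a toy 1D constant-
| velocity model, and an EKF would additionally linearize a
| nonlinear model at each step:
|
|     import numpy as np
|
|     F = np.array([[1.0, 1.0], [0.0, 1.0]])  # state transition
|     H = np.array([[1.0, 0.0]])              # observe position only
|     Q = 0.01 * np.eye(2)                    # process noise
|     R = np.array([[0.5]])                   # measurement noise
|
|     x = np.zeros(2)                         # state estimate
|     P = np.eye(2)                           # running covariance
|
|     def kf_step(x, P, z):                   # z: 1-element measurement
|         # predict
|         x = F @ x
|         P = F @ P @ F.T + Q
|         # update: the gain is driven by the running covariance P
|         S = H @ P @ H.T + R
|         K = P @ H.T @ np.linalg.inv(S)
|         x = x + K @ (z - H @ x)
|         P = (np.eye(2) - K @ H) @ P
|         return x, P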
| nyrikki wrote:
| As an intuition for why many people see these as different:
|
| PAC learning is about compression; a KF/EKF is more like a
| Taylor expansion.
|
| The specific types of PAC learning that this paper covers
| have problems with a simplicity bias and fairly low
| sensitivity.
|
| While based on UHATs, this paper may provide some insights.
|
| https://arxiv.org/abs/2502.02393
|
| Obviously LLMs and LRMs are the most studied, but even the
| recent posts on here from Anthropic show that without a few
| high-probability entries in the top-k results, confabulations
| are difficult for transformers.
|
| Obviously there are PAC learning methods that target anomaly
| detection, but they are very different from even EKF + Mc.
|
| You will note in this paper that even highly weighted features
| exhibited low sensitivity.
|
| While the industry may find some pathological cases that make
| the approach usable, autograd and the need for parallelism
| make applying this paper's methods to tiny variations in
| multivariate problems ambitious.
|
| They also only trained on medical data. Part of the reason the
| foundation models do so well is that they encode verifiers
| from a huge corpus, which invalidates the traditional bias-
| variance tradeoffs from the early-'90s papers.
|
| But they are still selecting from the needles and don't have
| access to the hay in the haystack.
|
| The following paper is really not related, except that it
| shows how compression exacerbates that problem.
|
| https://arxiv.org/abs/2205.06977
|
| Chaitin's constant, which encodes the Halting problem and is
| both normal and uncomputable, is the extreme top end of
| computability, but it relates to the compression idea.
|
| EKFs have access to the computable reals, and while the EKF
| is non-linear, KFs and EKFs can be thought of through the
| lens of linearized approximations.
|
| If the diagnostic indicators were both ergodic and Markovian,
| this paper's approach would probably be fairly reliable.
|
| But these efforts are really about finding a many-to-one
| reduction that works.
|
| I am skeptical about it in this case for PAC ML, but perhaps
| they will find a pathological case.
|
| But the tradeoffs between statistical learning and expansive
| methods are quite different.
|
| Obviously hype cycles drive efforts. I encourage you to look
| at this year's AAAI conference report and see that you are
| not alone in your frustration with the single-minded
| approach.
|
| IMHO this paper is a net positive, showing that we are moving
| from a broad exploration to targeted applications.
|
| But that is just my opinion.
| anthk wrote:
| LLM models are organic? They somehow obey the laws of
| thermodynamics by some parallel in algorithms and underlying
| math? It would be amazing if there were some parallel between
| biology (especially fungi, with their emergent properties) and
| neural networks...
| magicalhippo wrote:
| _For IHM prediction, LSTM models and transformer models were
| trained for 100 epochs using the MIMIC-III and eICU datasets
| separately._
|
| I might be blind, but I don't see any mention of loss. Did they
| stop at 100 because it was a nice round number or because it was
| a good place to stop?
|
| The LSTM model they used had 7k trainable parameters, the
| CW-LSTM model 153k, while the transformer model had 800k
| parameters (300k trainable parameters and 600k optimizer
| parameters, as they say).
|
| I don't follow the field closely enough, but is it reasonable
| that these models all converged at the same time, given the
| large difference in size?
|
| They mention the transformer model outperforming the LSTMs, but I
| wonder if it could have done a lot better.
| PaulHoule wrote:
| Whatever happened to early stopping?
|
| I see so many papers where people train neural networks with
| half-baked recipes. I think I first saw early stopping around
| 1990, but it is so common for people to pick some arbitrary
| number of epochs to run. I have to admit I never liked the
| term "early stopping"; I think people should have called it
| just "stopping", because the "early" makes it seem optional.
|
| Back when I was training LSTM networks it was straightforward
| to train nets reliably with early stopping...
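|
| The recipe is roughly this (a generic sketch, not from any
| particular paper; model, the loaders, train_one_epoch and
| validate are hypothetical PyTorch-style placeholders):
|
|     import copy
|
|     best_loss, wait, patience = float("inf"), 0, 5
|     for epoch in range(1000):            # generous upper bound
|         train_one_epoch(model, train_loader)
|         val_loss = validate(model, val_loader)
|         if val_loss < best_loss:         # improved: remember it
|             best_loss, wait = val_loss, 0
|             best_state = copy.deepcopy(model.state_dict())
|         else:
|             wait += 1
|             if wait >= patience:         # stalled: stop here
|                 break
|     model.load_state_dict(best_state)    # restore best weights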
| Al-Khwarizmi wrote:
| I'm also annoyed about this. I suppose the main reason is
| that 20 years ago, if you didn't use early stopping, your
| accuracy would typically plummet. In earlier, smaller
| neural networks, overfitting was a huge issue; and lack of
| dropout, batch normalization, etc. made learning much more
| brittle.
|
| Now the young'uns don't bother because you can just set 100
| epochs, or whatever, and the result might not be optimal but
| it will generally be fine. Still, it's a pity because you're
| often wasting computational resources that could be spent
| trying alternative architectures, exploring hyperparameters,
| or whatever.
|
| BTW, I also think "early stopping" is a terrible name. If you
| don't know the term, it suggests that you're going to
| undertrain the network, sacrificing some accuracy for
| efficiency. No one wants undertrained networks. I think it's
| not an overstatement to say that if it were called "adaptive
| stopping", "validation guided stopping", or even something
| more catchy like "smart stopping", probably more people would
| use it.
| PaulHoule wrote:
| I have a smart RSS reader, YOShInOn, which uses BERT + a
| probability-calibrated SVM as its main model; I want to
| make a general-purpose model trainer for text
| classification that is able to do harder problems.
|
| People who hold court on ML forums will tell you fine-tuned
| BERT is the way to go, but BERT fine-tuning doesn't seem to
| be compatible with early stopping with anything like the
| training recipes I see in the literature. Compared to the
| old days these networks soak up knowledge like a sponge; my
| hunch is that with N=10,000 samples or so you don't benefit
| from running more than one epoch because the network
| doesn't have the capacity to learn from that many samples.
|
| I find it depressing to find arXiv papers where people copy
| a BERT training recipe from other papers and compare it on
| 5-15 different text classification problems with maybe
| N=500 samples. My BERT experiments take about 30 minutes, so
| it's no small thing to do parametric scans on them,
| particularly when the epoch count is one of the parameters.
| With "smart stopping" I'm not afraid of undertraining
| models so I could run trainings all night and believe I'm
| seeing representative performance as I vary parameters.
|
| My plan is to couple ModernBERT to an LSTM or Bi-LSTM model,
| as the literature seems to show that this frequently ties
| or beats fine-tuned BERT, and my experience so far is that
| I can build reliable trainers for LSTMs, whereas team fine-
| tuned BERT is indifferent to the very idea of "reliable".
|
| Another pet peeve is all the papers with N=500 samples,
| when I regularly get N=10,000+ in systems that I use
| every day, and on a rainy weekend I can lie in bed with my
| iPad, switch to an Android tablet when the battery runs
| out, and get N=5000 samples. [1] When I wrote my first text
| classification paper we found we needed N=10,000 to get
| really good models. Sure, the world knowledge in BERT helps
| models learn fast and that's great (a problem I worried
| about in 2005 and still worry about, because I think the
| average person wants good results at N<10!), but I need
| calibrated, usable accuracy and look at AUC-ROC as my
| metric, not "accuracy", F1, or anything like that.
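|
| (For concreteness, the calibrated-SVM-plus-AUC-ROC part is
| roughly this shape in scikit-learn; the random features
| below are just a stand-in for BERT embeddings:)
|
|     from sklearn.datasets import make_classification
|     from sklearn.svm import LinearSVC
|     from sklearn.calibration import CalibratedClassifierCV
|     from sklearn.metrics import roc_auc_score
|     from sklearn.model_selection import train_test_split
|
|     # stand-in for document embeddings and labels
|     X, y = make_classification(n_samples=1000, n_features=768)
|     X_tr, X_te, y_tr, y_te = train_test_split(X, y,
|                                               test_size=0.2)
|
|     clf = CalibratedClassifierCV(LinearSVC(),
|                                  method="sigmoid", cv=5)
|     clf.fit(X_tr, y_tr)
|     probs = clf.predict_proba(X_te)[:, 1]
|     print("AUC-ROC:", roc_auc_score(y_te, probs))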
|
| Then there's the effort people waste on things that can't
| possibly work, like Word2Vec; it seems like people can read
| a lot of papers and not see what's right in front of them:
| that Word2Vec is useless! I want to write a meta-analysis,
| but instead I'm
| writing a diatribe and I'm not going to be happy until I
| repeat the paradigm with methods that are... _repeatable,_
| not for the science but for the engineering.
|
| [1] with hallucinations as a side effect if it is a visual
| task, but so what.
| rakejake wrote:
| Nowadays, even the term "epoch" is not well defined.
| Traditionally it meant a pass over the entire training set,
| but datasets are so massive today that many now define an
| epoch as X steps, where a step is a minibatch (of whatever
| size) from the training set. So 1 epoch is a random sample of
| X minibatches from the training set. I'd guess the logic is
| that you pick as much data as you can fit in VRAM.
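|
| In code, that convention looks roughly like this sketch,
| where an "epoch" is just a fixed number of sampled
| minibatches rather than a full pass (train_step is a
| hypothetical placeholder):
|
|     import numpy as np
|
|     def run_epoch(X, y, steps=1000, batch_size=64):
|         # one "epoch" = `steps` random minibatches
|         n = len(X)
|         for _ in range(steps):
|             idx = np.random.randint(0, n, size=batch_size)
|             train_step(X[idx], y[idx])   # hypothetical step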
|
| Karpathy's Zero To Hero series also uses this.
| rakejake wrote:
| A 7k param LSTM is very tiny. Not sure if LSTMs would even work
| at that scale although someone with more theoretical knowledge
| can correct me on this.
|
| As an aside, I'm trying to train transformers for some
| classification tasks on audio data. The models are "small"
| (like 1M-15M params at most) and I find they are very finicky
| to train. Below 1M parameters I find them hard to train at all.
| I have thrown all sorts of learning rate schedules at them,
| and the best I can get is that the network learns for a bit
| and then plateaus, after which I can't do anything to get it
| out of that minimum. Training an LSTM/GRU on the same data
| gives me a
| much better loss value.
|
| I couldn't find many papers on training transformers at that
| scale. The only one I was able to find was MS's TinyStories
| [0], but that paper didn't delve much into how they trained the
| models and whether they trained from scratch or distilled from
| a larger model.
|
| At those scales, I find LSTMs and CNNs are a lot more stable.
| The few online threads I've found comparing LSTMs and
| Transformers had the same thing to say - Transformers need
| a lot more data and model size to achieve parity with and
| exceed LSTMs/GRUs/CNNs, maybe because the inductive bias the
| latter provide is hard to beat at those scales. Others can
| comment on what they've seen.
|
| [0] - https://arxiv.org/abs/2305.07759
| Al-Khwarizmi wrote:
| I don't have much help to offer, but just to echo your
| experience... in my group we have tried to train Transformers
| from scratch for various NLP tasks, and we have always found
| them extremely brittle, with BiLSTMs working better. We only
| succeeded by following a pre-established
| recipe (e.g. training a BERT model from scratch for a new
| language, where the architecture, parameters and tasks are as
| in BERT), or of course by fine-tuning existing models, but
| just throwing some layers at a problem and training them from
| scratch... nope, won't work without arcane knowledge that
| doesn't seem to be written anywhere accessible. This is one
| of the reasons why I dislike Transformers and I root for the
| likes of RWKV to take the throne.
| rakejake wrote:
| I think the "arcane knowledge" is true for LLMs (billions of
| parameters). There are, though, lots of people who train
| models in the open in the hundreds-of-millions realm, but
| never below. Maybe transformers simply don't work as well
| below a certain size and data threshold.
| amelius wrote:
| Maybe start a Kaggle competition?
| MeteorMarc wrote:
| Maybe the training set had too many zeroshot patients.
| bbstats wrote:
| Am I missing something or is this just "We built models that are
| bad"?
| ohgr wrote:
| Bad model, bad method or bad paper?
| timewizard wrote:
| All of this seems designed to make the hospital more labor-
| efficient. None of it seems designed to improve long-term
| outcomes for patients.
|
| My suspicion continues to grow that this technology gets the
| hype it does as part of an effort to reduce wages for all
| workers.
___________________________________________________________________
(page generated 2025-03-29 23:01 UTC)