[HN Gopher] The Truth About Linear Regression (2015)
___________________________________________________________________
The Truth About Linear Regression (2015)
Author : sebg
Score : 176 points
Date : 2024-07-30 16:38 UTC (6 hours ago)
(HTM) web link (www.stat.cmu.edu)
(TXT) w3m dump (www.stat.cmu.edu)
| eachro wrote:
| I'd love to see linear regression taught by, say, a quant
| researcher from Citadel. How do these guys use it? What do they
| particularly care about? Any theoretical results that
| meaningfully change the way they view problems? And so on.
| mikaeluman wrote:
| I have some experience. Variants of regularization are a must.
| There are just too few samples and too much noise per sample.
|
| In a related problem, covariance matrix estimation, variants of
| shrinkage are popular. The most straightforward is linear
| shrinkage (Ledoit-Wolf).
|
| Excepting neural nets, I think most people doing regression
| simply use linear regression with touches like the above, based
| on the domain.
|
| Particularly in finance you fool yourself too much with more
| complex models.
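|
| A minimal R sketch of the linear-shrinkage idea (the shrinkage
| intensity is fixed by hand here for illustration; Ledoit-Wolf
| estimate the optimal intensity from the data):
|
|   # toy returns: 250 days x 50 assets, n not much larger than p
|   set.seed(1)
|   R <- matrix(rnorm(250 * 50, sd = 0.01), nrow = 250)
|   S <- cov(R)              # noisy sample covariance
|   mu <- mean(diag(S))      # scaled-identity shrinkage target
|   lambda <- 0.3            # intensity, chosen by hand here
|   S_shrunk <- (1 - lambda) * S + lambda * mu * diag(ncol(R))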
| fasttriggerfish wrote:
| Yes, these are good points and probably the most important
| ones as far as the maths is concerned, though I would say
| regularisation methods are really standard things one learns
| in any ML / stats course. Ledoit-Wolf shrinkage is indeed
| more exotic and very useful.
| Ntrails wrote:
| > There are just too few samples and too much noise per
| sample.
|
| Call it 2000 liquid products on the US exchanges, with many
| years of data. Even if you downsample from per-tick to one-
| minute bars, it doesn't feel like you're struggling for a
| large in-sample period?
| bormaj wrote:
| They may have been referring to (for example) reported
| financial results or news events, which are infrequent but
| may have an outsized impact on market prices.
| aquafox wrote:
| Most people don't appreciate linear regression enough.
|
| 1) All common statistical tests are linear models:
| https://lindeloev.github.io/tests-as-linear/
|
| 2) Linear models are linear in the parameters, not the
| response! E.g. y = a*sin(x) + b*x^2 is a linear model (see the
| sketch below).
|
| 3) By choosing an appropriate spline basis, many non-linear
| relationships between the predictors and the response can be
| modelled by linear models.
|
| 4) And if that flexibility isn't enough, by virtue of Taylor's
| theorem, linear relations are often a good approximation of
| non-linear ones.
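|
| A quick R illustration of point 2 on made-up data: the fit is
| nonlinear in x but linear in the coefficients, so plain lm()
| handles it.
|
|   set.seed(1)
|   x <- seq(0, 10, length.out = 200)
|   y <- 2 * sin(x) + 0.5 * x^2 + rnorm(200)  # a = 2, b = 0.5
|   fit <- lm(y ~ sin(x) + I(x^2))   # linear in a and b, not in x
|   coef(fit)                        # recovers roughly 2 and 0.5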
| crystal_revenge wrote:
| These are all fantastic points, and I strongly agree that most
| people don't appreciate linear models nearly enough.
|
| Another one I would add that is _very_ important: Human beings,
| especially in groups, can only reasonably make _linear_
| decisions.
|
| That is, when we are in a meeting making decisions for the
| direction of the company we can only say things like "we need
| to increase ad spend, while reducing the other costs of
| acquisition such as discount vouchers". If you want to find the
| balance between "increasing ad spend" while "decreasing other
| costs" that's a simple linear model.
|
| Even if you have a great non-linear model, it's not even a
| matter of "interpretability" so much as "actionability". You
| can bring the results of a regression analysis to a meeting and
| very quickly model different strategies with reasonable
| directional confidence.
|
| I struggled communicating actionable insights upward until I
| started to really understand regression analysis. After that it
| became amazingly simple to quickly crack open and understand
| fairly complex business processes.
| nextos wrote:
| If you add a multilevel structure to shrink your
| (generalized) linear predictors, this framework becomes
| incredibly powerful.
|
| There are entire statistics textbooks devoted to multilevel
| linear models, you can get really far with these.
|
| Shrinking through information sharing is really important to
| avoid overly optimistic predictions in the case of little
| data.
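|
| A minimal sketch with lme4 on made-up data: group-level
| intercepts are partially pooled, i.e. shrunk toward the
| overall intercept, most strongly for groups with little data.
|
|   library(lme4)
|   set.seed(1)
|   g <- factor(rep(1:20, each = 5))   # 20 groups, 5 obs each
|   x <- rnorm(100)
|   y <- 1 + 0.5 * x + rnorm(20)[as.integer(g)] + rnorm(100)
|   fit <- lmer(y ~ x + (1 | g))       # random intercept per group
|   coef(fit)$g[1:3, ]                 # per-group intercepts, shrunk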
| aquafox wrote:
| If you like shrinkage (I do), I highly recommend the work
| of Matthew Stephens, e.g. ashr [1] and vash [2] for
| shrinkage based on an empirically derived prior.
|
| [1] https://cran.r-project.org/web/packages/ashr/index.html
| [2] https://github.com/mengyin/vashr
| nextos wrote:
| Yes, the article linked to ashr is quite famous.
| madrox wrote:
| I have a degree in statistics yet I've never thought about
| the relationship between linear models and business decisions
| in this way. You're absolutely right. This is the best
| comment I've read all month.
| addaon wrote:
| > Human beings, especially in groups, can only reasonably
| make linear decisions.
|
| There are absolutely decisions that need to get made, and do
| get made, that are not linear. Step functions are a great
| example. "We need to decide if we are going to accept this
| acquisition offer" is an example of a decision with step
| function utility. You can try to "linearize" it and then
| apply a threshold -- "let's agree on a model for the value at
| which we would accept an acquisition offer" -- but in many
| ways that obscures that the utility function can be
| arbitrarily non-linear.
| mturmon wrote:
| A single decision could still be easily modeled by a 0/1
| variable (as an input) and a real variable (as an output,
| like revenue for example).
|
| That 0/1 input variable could also have arbitrary
| interactions with other variables, which would also amount
| to "step function" input effects.
|
| See for example the autism/age setup down thread.
| borroka wrote:
| For point (3), in most of my academic research and work in
| industry, I have used Generalized Additive Models with great
| technical success (i.e., they fit the data well). Still, I have
| noticed that they are rarely understood or given proper
| appreciation by stakeholders (a broad category), mostly out of
| laziness and habit.
| SubiculumCode wrote:
| I've looked at additive models, but I have so far shied away
| because I've read that they are not super equipped to deal
| with non-additive interactions.
| SubiculumCode wrote:
| Do you have a useful reference for "3)"?
|
| A common problem I encounter in the literature is authors over-
| interpreting the slopes of a model with quadratic terms (e.g. Y
| = age + age^2) at the lowest and highest ages. Invariably, the
| plot (not the confidence intervals) will seem to indicate
| declines (for example) at the oldest ages (random example off
| the internet [1]), when really the apparent negative slope is
| due to the limitation that quadratic models cannot model an
| asymptote.
|
| The approach I've used, when I do not have a theoretically
| driven choice to work with, is fractional polynomials [2],
| e.g. x^s where s = {-2, -1, -0.5, 0, 0.5, 1, 2, 3}, and then
| picking a strategy to choose the best-fitting polynomial while
| avoiding overfitting. (A small sketch follows the references
| below.)
|
| It's not a bad technique; I've tried others like piecewise
| polynomial regression, knots, etc. [3], but I could not figure
| out how to test (for example) for a group interaction between
| two knotted splines. I've also tried additive models.
|
| [1] https://www.researchgate.net/figure/Scatter-plot-of-the-
| quad...) [2]
| https://journal.r-project.org/articles/RN-2005-017/RN-2005-0...
| [3] https://bookdown.org/ssjackson300/Machine-Learning-
| Lecture-N...
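|
| A hand-rolled R sketch of the fractional-polynomial idea on
| made-up data (the mfp package automates this; here I just
| compare single-power fits by AIC):
|
|   set.seed(1)
|   age <- runif(300, 7, 16)
|   score <- 20 + 8 * log(age) + rnorm(300)  # flattens, no decline
|   powers <- c(-2, -1, -0.5, 0, 0.5, 1, 2, 3)  # s = 0 -> log(x)
|   fits <- lapply(powers, function(s) {
|     xs <- if (s == 0) log(age) else age^s
|     lm(score ~ xs)
|   })
|   sapply(fits, AIC)   # pick best power; guard against overfitting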
| aquafox wrote:
| For my applications, using natural cubic splines provided by
| the 'ns' function in R, combined with trying out where knots
| should be positioned, is sufficient. Maybe have a look at the
| gratia package [1] for plotting lots of diagnostics around
| spline fits.
|
| [1] https://cran.r-project.org/web/packages/gratia/vignettes/
| gra...
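|
| For instance, a minimal version of that (knot locations here
| are arbitrary, just for illustration):
|
|   library(splines)
|   set.seed(1)
|   x <- runif(300, 0, 10)
|   y <- sin(x) + rnorm(300, sd = 0.3)
|   # natural cubic spline basis; still an ordinary linear model
|   fit <- lm(y ~ ns(x, knots = c(3, 5, 7)))
|   summary(fit)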
| esafak wrote:
| Re. 2) Then you end up doing feature engineering. For
| applications where you don't know the data-generating process,
| it is often better to just throw everything at the model and
| let it extract the features.
| waveBidder wrote:
| An SVM is purely a linear model from the right perspective, and
| if you're being really reductive, ReLU neural networks are
| piecewise linear. I think this framing may obscure more than it
| helps; picking the right transformation for your particular
| case is a highly nontrivial problem: why sin(x) and x^2 rather
| than, say, tanh(x) and x^(1/2)?
| parpfish wrote:
| If you want to convert people into loving linear models (and
| you should), we need to make sure that they learn the
| difference between 'linear models' and 'linear models fit
| using OLS'.
|
| I've met smart people that can't wrap their head around how
| it's possible to create a linear model where the number of
| parameters exceeds the number of data points (that's an OLS
| restriction).
|
| Or they're worried about how they can apply their formula for
| calculating the std error on the parameters. Bruh, it's the
| future and we have big computers. Just bootstrap 'em and don't
| make any assumptions.
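|
| For example, something like this: a ridge fit via glmnet with
| p > n, and bootstrapped coefficient SEs instead of the OLS
| formula (the lambda and all the numbers are made up):
|
|   library(glmnet)
|   set.seed(1)
|   n <- 50; p <- 200                 # more parameters than points
|   X <- matrix(rnorm(n * p), n, p)
|   y <- X[, 1] - 2 * X[, 2] + rnorm(n)
|   fit <- glmnet(X, y, alpha = 0)    # ridge is fine with p > n
|
|   # bootstrap the coefficient SEs -- no closed-form formula needed
|   boot_coef <- replicate(200, {
|     i <- sample(n, replace = TRUE)
|     coef(glmnet(X[i, ], y[i], alpha = 0, lambda = 0.5))[2:3, 1]
|   })
|   apply(boot_coef, 1, sd)           # SEs for the first two coefs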
| yu3zhou4 wrote:
| This looks very interesting. Do you know a way to transform
| this PDF into a mobile-optimized form?
| SubiculumCode wrote:
| The most important skill in regression is to RECOGNIZE the
| intercept. It sounds trivial, and it is, until you start
| including interactions between terms. I've lost count of the
| times I've seen a young graduate student screw this up...
|
| Take a simple linear model involving a test score, age in
| years (range 7-16), and a binary categorical variable for
| autism diagnosis (0 = control, 1 = autism):
|
|   score ~ age + diagnosis + age:diagnosis
|   score = X0 + (X1)age + (X2)diagnosis + (X3)(age:diagnosis)
|
| where X0 is the intercept.
|
| If X2 is significant, the naive student would say, "look, a
| group difference!", not realizing this is the predicted group
| difference at the intercept, which is when participants were 0
| years old. [[ You should center age at the mean, the median,
| or better yet, the age you are most interested in. Once
| interactions are in the equation, all "lower order" parameter
| estimates are in reference to the intercept. ]]
|
| They might also note a significant effect of age and then
| assume it applies to both groups, but the parameter X1 only
| tells you the predicted slope for the reference group
| (controls), while the interaction tests whether the age slopes
| differ between groups. Moreover, even if the interaction isn't
| significant, the age effect in the autism group might not
| significantly differ from zero; the data are in a wishy-washy
| zone, and you have to be careful in how you interpret them.
|
| To some here all this will seem obvious, but for many, getting
| their head firmly into the conditional meaning of parameters
| when there are interaction terms takes work. (Note: for now I
| am ignoring other ways of coding groups, e.g. grand-mean vs.
| one group as the reference, but the lesson remains.)
| Understand what the intercept means and to whom/what it refers.
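|
| A runnable R sketch of the point on made-up data: the
| diagnosis coefficient changes meaning entirely depending on
| where age = 0 sits.
|
|   set.seed(1)
|   n <- 200
|   age <- runif(n, 7, 16)
|   diagnosis <- rbinom(n, 1, 0.5)    # 0 = control, 1 = autism
|   score <- 50 + 2 * age - age * diagnosis + rnorm(n, sd = 3)
|
|   m1 <- lm(score ~ age * diagnosis)
|   coef(m1)["diagnosis"]   # group difference extrapolated to age 0
|
|   age_c <- age - 12                 # center at an age of interest
|   m2 <- lm(score ~ age_c * diagnosis)
|   coef(m2)["diagnosis"]   # predicted group difference at age 12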
| SubiculumCode wrote:
| If I said something stupid above, please let me know. I'm
| always learning. If you are a strong Bayesian who doesn't like
| p-values, that is also fine. I get it. I just wanted to provide
| my observations about a great number of bright students I've
| worked with who have nevertheless struggled to fluidly
| interpret models with interaction terms, and point them in the
| right direction.
| mturmon wrote:
| I think this is accurate.
|
| A significant loading on diagnosis (X2) does not tell you
| anything about the effect of diagnosis at any particular age
| (except age 0).
|
| You'd have to recenter the model about the age of interest.
| aquafox wrote:
| I always struggle to get good intuition for models with
| interaction terms. I usually try to write down, for every
| class of responses, which terms of the model go into it, and
| that often helps with interpretation. There's also the
| ExploreModelMatrix package [1], which helps with that task.
|
| [1]
| https://www.bioconductor.org/packages/release/bioc/html/Expl...
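|
| A low-tech version of that in plain R is to look at the design
| matrix for one representative row per cell (reusing the
| age/diagnosis example from upthread):
|
|   d <- expand.grid(age = c(8, 12), diagnosis = c(0, 1))
|   # shows which model terms contribute to each combination
|   model.matrix(~ age * diagnosis, data = d)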
| minimaxir wrote:
| When I was at CMU a decade ago I took 36-401 and 36-402 (then
| taught by Shalizi) and they were both very good statistical
| classes and they forced me to learn base R, for better or for
| worse.
|
| A big weakness of linear regression that I had to learn the
| hard way is that the textbook assumptions needed for valid
| interpretation of the coefficients are easy to satisfy in
| small educational datasets but rarely hold for messy real-
| world data.
| aquafox wrote:
| It depends. The most important assumption is independence of
| the observations. If that is not given, you have to either
| account for correlated responses using a mixed-effects model or
| mean-aggregate those responses (computing the mean decreases
| the variance but also reduces the number of data points and
| those two cancel each other out in calculating the t-statistic
| of the Wald test).
|
| With regard to other assumptions, e.g. normality of the
| residuals, linear models can often tolerate some degree of
| violation. But I agree that it's always good to understand the
| influence of those violations, e.g. by running simulations and
| making p-value histograms on null data.
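|
| A minimal sketch of that kind of null check (everything
| simulated): with the null true and assumptions met, the
| p-values should come out roughly uniform.
|
|   set.seed(1)
|   pvals <- replicate(2000, {
|     x <- rnorm(30)
|     y <- rnorm(30)   # null: y unrelated to x, normal residuals
|     summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
|   })
|   hist(pvals, breaks = 20)   # flat is good; skew signals trouble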
| bdjsiqoocwk wrote:
| The flip side is that with messy real-world data you just need
| a model that's OK enough, rather than being concerned whether
| the p-value is this or that.
| minimaxir wrote:
| At that point, if you don't care about interpretable
| coefficients, you might as well use gradient-boosted trees or
| a full neural network instead.
| borroka wrote:
| It depends on the "severity" of the violation of
| assumptions--you can also use GAMs to add flexible
| nonlinear relationships--and the amount of data you are
| working with. Statistical modeling is a nuanced job.
| g42gregory wrote:
| It looks like this article does not mention it, but linear
| regression can also exhibit the double descent phenomenon
| commonly seen in deep learning. You would need to introduce
| some regularization in order to see this. It would be nice to
| add that discussion.
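|
| One common way to see it in plain linear regression is with
| the minimum-norm least-squares solution, i.e. the ridge
| solution in the limit lambda -> 0; a rough sketch on simulated
| data:
|
|   library(MASS)   # for ginv (pseudoinverse)
|   set.seed(1)
|   n <- 40; p_max <- 120; n_test <- 500
|   Z <- matrix(rnorm((n + n_test) * p_max), n + n_test, p_max)
|   beta <- rnorm(p_max) / sqrt(p_max)
|   y <- drop(Z %*% beta) + rnorm(n + n_test)
|   train <- 1:n
|   ps <- seq(5, p_max, by = 5)
|   test_mse <- sapply(ps, function(p) {
|     X <- Z[train, 1:p, drop = FALSE]
|     b <- ginv(X) %*% y[train]   # min-norm fit (ridge, lambda -> 0)
|     mean((Z[-train, 1:p, drop = FALSE] %*% b - y[-train])^2)
|   })
|   plot(ps, test_mse, type = "b")  # peak near p = n, then descent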
| gotoeleven wrote:
| Are there some papers in particular that you're referring to?
| Does the second descent happen after the model becomes
| overparameterized, like with neural nets? What kind of
| regularization?
| brcmthrowaway wrote:
| pfft, couldn't build an LLM with it
| rkp8000 wrote:
| I love that ridge regression is introduced in the context of
| multicollinearity. It seems almost everyone these days learns
| about it as a regularization technique to prevent overfitting,
| but one of its fundamental use cases (and indeed its origin, I
| believe) is balancing weights among highly correlated (or
| nearly linearly dependent) predictors, which can cause huge
| problems even if you have plenty of data.
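|
| A small R sketch of that use case on made-up data with two
| nearly collinear predictors: OLS tends to produce unstable,
| offsetting coefficients, while ridge balances them.
|
|   set.seed(1)
|   n <- 200
|   x1 <- rnorm(n)
|   x2 <- x1 + rnorm(n, sd = 0.01)   # nearly identical to x1
|   y <- x1 + x2 + rnorm(n)          # truth: equal contributions
|   coef(lm(y ~ x1 + x2))            # unstable, offsetting weights
|   X <- cbind(x1, x2)
|   lambda <- 1
|   # closed-form ridge (no intercept; x and y are roughly centered)
|   solve(t(X) %*% X + lambda * diag(2)) %*% t(X) %*% y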
___________________________________________________________________
(page generated 2024-07-30 23:00 UTC)