[HN Gopher] The Truth About Linear Regression (2015)
       ___________________________________________________________________
        
       The Truth About Linear Regression (2015)
        
       Author : sebg
       Score  : 176 points
       Date   : 2024-07-30 16:38 UTC (6 hours ago)
        
 (HTM) web link (www.stat.cmu.edu)
 (TXT) w3m dump (www.stat.cmu.edu)
        
       | eachro wrote:
        | I'd love to see linear regression taught by, say, a quant
       | researcher from Citadel. How do these guys use it? What do they
       | particularly care about? Any theoretical results that
       | meaningfully change the way they view problems? And so on.
        
         | mikaeluman wrote:
          | I have some experience. Variants of regularization are a
          | must: there are just too few samples and too much noise per
          | sample.
          | 
          | In the related problem of covariance matrix estimation,
          | variants of shrinkage are popular, the most straightforward
          | being linear shrinkage (Ledoit & Wolf).
          | 
          | Excepting neural nets, I think most people doing regression
          | simply use linear regression with touches of the above kind,
          | adapted to the domain.
          | 
          | In finance especially, more complex models make it too easy
          | to fool yourself.
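          | 
          | A minimal sketch of the linear shrinkage idea in R (toy data
          | and a fixed shrinkage intensity for illustration; Ledoit &
          | Wolf derive an optimal data-driven intensity):
          | 
          |     # Fewer samples than assets: sample covariance is noisy
          |     # and singular.
          |     set.seed(1)
          |     n <- 50; p <- 100
          |     X <- matrix(rnorm(n * p), n, p)
          |     S <- cov(X)            # rank-deficient when n < p
          |     
          |     # Pull S toward a scaled identity target.
          |     lambda <- 0.3          # illustrative fixed intensity
          |     mu <- mean(diag(S))    # target scale
          |     S_shrunk <- (1 - lambda) * S + lambda * mu * diag(p)
          |     
          |     # Shrunk estimate is well-conditioned even though n < p.
          |     min(eigen(S_shrunk, symmetric = TRUE,
          |               only.values = TRUE)$values)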
        
           | fasttriggerfish wrote:
            | Yes, these are good points and probably the most important
            | ones as far as the maths is concerned, though I would say
            | regularisation methods are really standard things one
            | learns in any ML / stats course. Ledoit-Wolf shrinkage is
            | indeed more exotic and very useful.
        
           | Ntrails wrote:
           | > There are just too few samples and too much noise per
           | sample.
           | 
            | Call it 2000 liquid products on the US exchanges, with
            | many years of data. Even if you aggregate it down from
            | per-tick to one-minute bars, it doesn't feel like you're
            | struggling for a large in-sample period?
        
             | bormaj wrote:
              | They may have been referring to (for example) reported
              | financial results or news events, which are infrequent
              | but may have an outsized impact on market prices.
        
       | aquafox wrote:
        | Most people don't appreciate linear regression.
        | 
        | 1) All common statistical tests are linear models:
        | https://lindeloev.github.io/tests-as-linear/
        | 
        | 2) Linear models are linear in the parameters, not the
        | response! E.g. y = a*sin(x) + b*x^2 is a linear model.
        | 
        | 3) By choosing an appropriate spline basis, many non-linear
        | relationships between the predictors and the response can be
        | modelled by linear models.
        | 
        | 4) And if that flexibility isn't enough, then by Taylor's
        | theorem, linear relations are often a good approximation of
        | non-linear ones.
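        | 
        | Point 2 in a minimal R sketch (toy data; the true values a = 3
        | and b = 0.5 are arbitrary):
        | 
        |     # y = a*sin(x) + b*x^2 is linear in (a, b), so plain OLS
        |     # fits it; I() protects x^2, -1 drops the intercept.
        |     set.seed(1)
        |     x <- runif(200, 0, 10)
        |     y <- 3 * sin(x) + 0.5 * x^2 + rnorm(200)
        |     fit <- lm(y ~ sin(x) + I(x^2) - 1)
        |     coef(fit)    # recovers roughly a = 3, b = 0.5
        | 
        | The same mechanism underlies point 3: a spline basis just adds
        | more transformed columns to the design matrix.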
        
         | crystal_revenge wrote:
         | These are all fantastic points, and I strongly agree that most
         | people don't appreciate linear models nearly enough.
         | 
         | Another one I would add that is _very_ important: Human beings,
         | especially in groups, can only reasonably make _linear_
         | decisions.
         | 
         | That is, when we are in a meeting making decisions for the
         | direction of the company we can only say things like "we need
         | to increase ad spend, while reducing the other costs of
         | acquisition such as discount vouchers". If you want to find the
         | balance between "increasing ad spend" while "decreasing other
         | costs" that's a simple linear model.
         | 
         | Even if you have a great non-linear model, it's not even a
         | matter of "interpretability" so much as "actionability". You
         | can bring the results of a regression analysis to a meeting and
         | very quickly model different strategies with reasonable
         | directional confidence.
         | 
          | I struggled to communicate actionable insights upward until
          | I started to really understand regression analysis. After
          | that it became amazingly simple to quickly crack open and
          | understand fairly complex business processes.
        
           | nextos wrote:
            | If you add a multilevel structure to shrink your
            | (generalized) linear predictors, this framework becomes
            | incredibly powerful.
            | 
            | There are entire statistics textbooks devoted to
            | multilevel linear models; you can get really far with
            | these.
            | 
            | Shrinking through information sharing is really important
            | for avoiding overly optimistic predictions when you have
            | little data.
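            | 
            | A minimal random-intercept sketch with the lme4 package
            | (toy data): group intercepts are partially pooled, i.e.
            | shrunk toward the grand mean.
            | 
            |     library(lme4)
            |     set.seed(1)
            |     g <- factor(rep(1:20, each = 5))  # 20 small groups
            |     x <- rnorm(100)
            |     y <- 1 + 0.5 * x + rnorm(20)[g] + rnorm(100)
            |     fit <- lmer(y ~ x + (1 | g))  # random intercept
            |     head(coef(fit)$g)             # shrunken intercepts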
        
             | aquafox wrote:
             | If you like shrinkage (I do), I highly recommend the work
             | of Matthew Stephens, e.g. ashr [1] and vash [2] for
             | shrinkage based on an empirically derived prior.
             | 
             | [1] https://cran.r-project.org/web/packages/ashr/index.html
             | [2] https://github.com/mengyin/vashr
        
               | nextos wrote:
                | Yes, the paper associated with ashr is quite famous.
        
           | madrox wrote:
           | I have a degree in statistics yet I've never thought about
           | the relationship between linear models and business decisions
           | in this way. You're absolutely right. This is the best
           | comment I've read all month.
        
           | addaon wrote:
           | > Human beings, especially in groups, can only reasonably
           | make linear decisions.
           | 
           | There are absolutely decisions that need to get made, and do
           | get made, that are not linear. Step functions are a great
           | example. "We need to decide if we are going to accept this
           | acquisition offer" is an example of a decision with step
           | function utility. You can try to "linearize" it and then
           | apply a threshold -- "let's agree on a model for the value at
           | which we would accept an acquisition offer" -- but in many
           | ways that obscures that the utility function can be
           | arbitrarily non-linear.
        
             | mturmon wrote:
              | A single decision could still be easily modeled by a
              | 0/1 variable (as an input) and a real variable (as an
              | output, like revenue, for example).
              | 
              | That 0/1 input variable could also have arbitrary
              | interactions with other variables, which would also
              | amount to "step function" input effects.
              | 
              | See, for example, the autism/age setup downthread.
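              | 
              | A toy sketch of that setup (all data hypothetical):
              | a 0/1 decision enters a linear model with an
              | interaction, contributing both a step and a changed
              | slope.
              | 
              |     set.seed(1)
              |     d <- rbinom(200, 1, 0.5)  # the yes/no decision
              |     x <- rnorm(200)           # continuous covariate
              |     y <- 1 + 2*d + 0.5*x + 1.5*d*x + rnorm(200)
              |     fit <- lm(y ~ d * x)     # expands to d + x + d:x
              |     coef(fit)  # step effect of d + d-dependent slope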
        
         | borroka wrote:
          | For point (3), in most of my academic research and industry
          | work I have used Generalized Additive Models with great
          | technical success (i.e., they fit the data well). Still, I
          | have noticed that they are rarely understood or given proper
          | appreciation by stakeholders (a broad category), mostly out
          | of laziness and habit.
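          | 
          | A minimal GAM sketch with the mgcv package (toy data): under
          | the hood a smooth term is still an additive, penalized-
          | linear construction.
          | 
          |     library(mgcv)
          |     set.seed(1)
          |     x <- runif(300)
          |     y <- sin(2 * pi * x) + rnorm(300, sd = 0.3)
          |     fit <- gam(y ~ s(x))  # s() builds a spline basis;
          |                           # smoothness chosen automatically
          |     summary(fit)
          |     plot(fit)   # fitted smooth with confidence band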
        
           | SubiculumCode wrote:
           | I've looked at additive models, but I have so far shied away
           | because I've read that they are not super equipped to deal
           | with non-additive interactions.
        
         | SubiculumCode wrote:
         | Do you have a useful reference for "3)"?
         | 
          | A common problem I encounter in the literature is authors
          | over-interpreting the slopes of a model with quadratic terms
          | (e.g. Y = age + age^2) at the lowest and highest ages.
          | Invariably the plot (not the confidence intervals) will seem
          | to indicate declines (for example) at the oldest ages
          | (example: random example off the internet [1]), when really
          | the apparent negative slope is due to quadratic models not
          | being able to model an asymptote.
          | 
          | The approach I've used, when I do not have a theoretically
          | driven choice to work with, is fractional polynomials [2],
          | e.g. x^s where s = {-2, -1, -0.5, 0, 0.5, 1, 2, 3}, and then
          | picking a strategy to select the best-fitting polynomial
          | while avoiding overfitting.
          | 
          | It's not a bad technique. I've tried others, like piecewise
          | polynomial regression, knots, etc. [3], but I could not
          | figure out how to test (for example) for a group interaction
          | between two knotted splines. I've also tried additive
          | models.
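          | 
          | A sketch of that fractional-polynomial scan in R (toy data;
          | the mfp package automates this with proper model-selection
          | control, this is just the idea):
          | 
          |     set.seed(1)
          |     age <- runif(200, 7, 16)
          |     score <- 100 * age / (age + 2) + rnorm(200, sd = 2)
          |     powers <- c(-2, -1, -0.5, 0, 0.5, 1, 2, 3)
          |     fits <- lapply(powers, function(s) {
          |       # FP convention: power 0 means log(x)
          |       xs <- if (s == 0) log(age) else age^s
          |       lm(score ~ xs)
          |     })
          |     powers[which.min(sapply(fits, AIC))]  # best power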
         | 
          | [1] https://www.researchgate.net/figure/Scatter-plot-of-the-quad...
          | [2] https://journal.r-project.org/articles/RN-2005-017/RN-2005-0...
          | [3] https://bookdown.org/ssjackson300/Machine-Learning-Lecture-N...
        
           | aquafox wrote:
            | For my applications, natural cubic splines via the 'ns'
            | function in R's splines package, combined with trying out
            | where the knots should be positioned, have been
            | sufficient. Maybe have a look at the gratia package [1]
            | for plotting lots of diagnostics around spline fits.
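            | 
            | A minimal natural-cubic-spline fit (toy data; the knot
            | positions here are arbitrary):
            | 
            |     library(splines)
            |     set.seed(1)
            |     x <- runif(200, 0, 10)
            |     y <- sin(x) + rnorm(200, sd = 0.3)
            |     # still an ordinary linear model in the basis columns
            |     fit <- lm(y ~ ns(x, knots = c(2.5, 5, 7.5)))
            |     summary(fit)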
           | 
           | [1] https://cran.r-project.org/web/packages/gratia/vignettes/
           | gra...
        
         | esafak wrote:
          | Re 2): then you end up doing feature engineering. For
          | applications where you don't know the data-generating
          | process, it is often better to just throw everything at the
          | model and let it extract the features.
        
         | waveBidder wrote:
          | An SVM is purely a linear model from the right perspective,
          | and if you're being really reductive, ReLU neural networks
          | are piecewise linear. I think this framing may obscure more
          | than it helps; picking the right transformation for your
          | particular case is a highly nontrivial problem: why sin(x)
          | and x^2, rather than, say, tanh(x) and x^(1/2)?
        
         | parpfish wrote:
          | If you want to convert people into loving linear models (and
          | you should), we need to make sure they learn the difference
          | between "linear models" and "linear models fit using OLS".
          | 
          | I've met smart people that can't wrap their head around how
          | it's possible to create a linear model where the number of
          | parameters exceeds the number of data points (that's an OLS
          | restriction, not a property of linear models).
          | 
          | Or they're worried about how they can apply their formula
          | for calculating the standard error of the parameters. Bruh,
          | it's the future and we have big computers. Just bootstrap
          | 'em and don't make any assumptions.
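          | 
          | A minimal bootstrap-SE sketch (toy data): no closed-form
          | standard-error formula, just resample rows and refit.
          | 
          |     set.seed(1)
          |     n <- 100
          |     x <- rnorm(n)
          |     y <- 1 + 2 * x + rnorm(n)
          |     boot_coefs <- replicate(2000, {
          |       i <- sample(n, replace = TRUE)
          |       coef(lm(y[i] ~ x[i]))
          |     })
          |     apply(boot_coefs, 1, sd)  # SEs for intercept and slope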
        
       | yu3zhou4 wrote:
        | This looks very interesting. Do you know a way to convert
        | this PDF to a mobile-optimized form?
        
       | SubiculumCode wrote:
        | The most important skill in regression is to RECOGNIZE the
        | intercept. It sounds trivial, and it is, until you start
        | including interactions between terms. The number of times I've
        | seen a young graduate student screw this up...
       | 
        | Take a simple linear model involving a test score, age in
        | years (age range 7-16 years), and a binary categorical
        | variable, autism diagnosis (0 = control, 1 = autism):
        | 
        |     score ~ age + diagnosis + age:diagnosis
        |     score = (X1)*age + (X2)*diagnosis + (X3)*age:diagnosis
       | 
        | If X2 is significant, the naive student would say, "Look, a
        | group difference!!", not realizing this is the predicted group
        | difference at the intercept, i.e. when participants were 0
        | years old. [[The fix: center age at the mean, the median, or,
        | better yet, the age you are most interested in. Once
        | interactions are in the equation, all "lower order" parameter
        | estimates are in reference to the intercept.]]
       | 
        | They might also note a significant effect of age and then
        | assume it applies to both groups, but the parameter X1 only
        | tells you the predicted slope for the reference group
        | (controls), while the interaction tests whether the age slopes
        | differ between groups. Moreover, even if the interaction isn't
        | significant, the age effect in the autism group might not
        | significantly differ from zero. The data is in the
        | wishy-washy zone, and you have to be careful in how you
        | interpret it.
       | 
        | To some here all this will seem obvious, but for many, getting
        | their head firmly into the conditional space of parameters
        | when there are interaction terms takes work. (Note: for now I
        | am ignoring other ways of coding groups (grand mean vs. one
        | group as the reference), but the lesson still remains:
        | understand what the intercept means and to whom/what it
        | refers.)
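        | 
        | A toy R sketch of the centering point (data and effect sizes
        | hypothetical): the diagnosis coefficient is the group
        | difference at age = 0 unless you recenter age.
        | 
        |     set.seed(1)
        |     age <- runif(200, 7, 16)
        |     dx <- rbinom(200, 1, 0.5)   # 0 = control, 1 = autism
        |     score <- 50 + 2*age + 5*dx + 1*age*dx + rnorm(200, sd = 3)
        | 
        |     coef(lm(score ~ age * dx))    # dx = difference at age 0!
        |     age_c <- age - 10             # recenter at age 10
        |     coef(lm(score ~ age_c * dx))  # dx = difference at age 10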
        
         | SubiculumCode wrote:
         | If I said something stupid above, please let me know. I'm
         | always learning. If you are a strong Bayesian who doesn't like
         | p-values, that is also fine. I get it. I just wanted to provide
         | my observations about a great number of bright students I've
         | worked with who have nevertheless struggled to fluidly
         | interpret models with interaction terms, and point them in the
         | right direction.
        
         | mturmon wrote:
         | I think this is accurate.
         | 
         | A significant loading on diagnosis (X2) does not tell you
         | anything about the effect of diagnosis at any particular age
         | (except age 0).
         | 
         | You'd have to recenter the model about the age of interest.
        
         | aquafox wrote:
          | I always struggle to get a good intuition for models with
          | interaction terms. I usually try to write down, for every
          | class of responses, which terms of the model go into it, and
          | often that helps with interpretation. There's also the
          | ExploreModelMatrix package [1], which helps with that task.
         | 
         | [1]
         | https://www.bioconductor.org/packages/release/bioc/html/Expl...
        
       | minimaxir wrote:
        | When I was at CMU a decade ago I took 36-401 and 36-402 (then
        | taught by Shalizi). They were both very good statistics
        | classes, and they forced me to learn base R, for better or
        | for worse.
        | 
        | A big weakness of linear regression that I had to learn the
        | hard way: the academic assumptions needed for valid
        | interpretation of the coefficients are easy to satisfy in
        | small educational datasets but rarely hold for messy
        | real-world data.
        
         | aquafox wrote:
          | It depends. The most important assumption is independence of
          | the observations. If that does not hold, you have to either
          | account for correlated responses using a mixed-effects model
          | or mean-aggregate those responses (computing the mean
          | decreases the variance but also reduces the number of data
          | points, and the two roughly cancel in the t-statistic of the
          | Wald test).
          | 
          | With regard to other assumptions, e.g. normality of the
          | residuals, linear models can often tolerate some degree of
          | violation. But I agree that it's always good to understand
          | the influence of those violations, e.g. by running
          | simulations and making p-value histograms of null data.
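          | 
          | A sketch of that simulation idea (toy null data with skewed
          | errors): under the null, the p-value histogram should look
          | roughly flat.
          | 
          |     set.seed(1)
          |     pvals <- replicate(5000, {
          |       x <- rnorm(30)
          |       y <- rexp(30)  # non-normal errors, no true effect
          |       summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
          |     })
          |     hist(pvals, breaks = 20)  # ~uniform => test holds up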
        
         | bdjsiqoocwk wrote:
          | The flip side is that with messy real-world data you often
          | just need a model that's OK enough, rather than worrying
          | whether the p-value is this or that.
        
           | minimaxir wrote:
           | At that point, if you don't care about interpretable
           | coefficients, you might as well use gradient-boosted trees or
           | a full neural network instead.
        
             | borroka wrote:
             | It depends on the "severity" of the violation of
             | assumptions--you can also use GAMs to add flexible
             | nonlinear relationships--and the amount of data you are
             | working with. Statistical modeling is a nuanced job.
        
       | g42gregory wrote:
        | It looks like this article does not mention it, but linear
        | regression will also exhibit the double descent phenomenon
        | commonly seen in deep learning. You would need to introduce
        | some regularization in order to see this. It would be nice to
        | add that discussion.
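        | 
        | A toy sketch of the shape (hypothetical setup): the minimum-
        | norm least-squares solution via the pseudoinverse acts as an
        | implicit, ridge-like regularizer; test error spikes near
        | p = n and then descends again.
        | 
        |     library(MASS)  # for ginv()
        |     set.seed(1)
        |     n <- 40; p_max <- 120; n_test <- 500
        |     beta <- rnorm(p_max) / sqrt(p_max)
        |     X <- matrix(rnorm(n * p_max), n, p_max)
        |     Xt <- matrix(rnorm(n_test * p_max), n_test, p_max)
        |     y <- X %*% beta + rnorm(n, sd = 0.5)
        |     yt <- Xt %*% beta + rnorm(n_test, sd = 0.5)
        |     ps <- seq(5, p_max, by = 5)
        |     err <- sapply(ps, function(p) {
        |       bhat <- ginv(X[, 1:p]) %*% y  # min-norm fit, p > n OK
        |       mean((yt - Xt[, 1:p] %*% bhat)^2)
        |     })
        |     plot(ps, err, type = "b", xlab = "p", ylab = "test MSE")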
        
         | gotoeleven wrote:
         | Are there some papers in particular that you're referring to?
         | Does the second descent happen after the model becomes
         | overparameterized, like with neural nets? What kind of
         | regularization?
        
       | brcmthrowaway wrote:
        | Pfft, couldn't build an LLM with it.
        
       | rkp8000 wrote:
        | I love that ridge regression is introduced in the context of
        | multicollinearity. It seems almost everyone these days learns
        | about it as a regularization technique to prevent overfitting,
        | but one of its fundamental use cases (and, I believe, its
        | origin) is balancing weights among highly correlated (or
        | nearly linearly dependent) predictors, which can cause huge
        | problems even if you have plenty of data.
        
       ___________________________________________________________________
       (page generated 2024-07-30 23:00 UTC)