[HN Gopher] Machine Learning, Kolmogorov Complexity, and Squishy...
       ___________________________________________________________________
        
       Machine Learning, Kolmogorov Complexity, and Squishy Bunnies (2019)
        
       Author : coolvision
       Score  : 58 points
       Date   : 2021-02-04 15:34 UTC (7 hours ago)
        
 (HTM) web link (www.theorangeduck.com)
 (TXT) w3m dump (www.theorangeduck.com)
        
       | chriswarbo wrote:
       | Very cool. My only niggle is that PCA doesn't 'feel like'
       | Kolmogorov Complexity to me: it's much closer to Shannon
       | information than Algorithmic information.
       | 
       | Shannon information depends on population statistics of messages,
       | usually has a 'shallow' and domain-specific interpreter (to turn
       | encodings back into messages, e.g. like a Huffman tree), and
       | tries to minimise the average length of a _set_ of encoded
       | messages. In this case the NN is the interpreter, the length of
       | the encoding is the number of deformation axes, and the
       | population is the training data (which we hope is representative
       | of the real data).
       | 
       | Algorithmic information is different. It depends on patterns
       | within an _individual_ message, rather than across a population.
       | The interpreter is a Universal Turing Machine (or equivalent),
       | which is general-purpose. Encodings are programs which output a
        | message, and can be arbitrarily 'deep' by running for an
       | arbitrary amount of time.
       | 
       | Another way to look at it is: using PCA to solve the problem of
       | 'automatically decide which deformation axes to include' is
       | slightly interesting; much _more_ interesting is the problem of
       | 'how can we automatically decide that this data should be
       | modelled as a set of deformations along various axes?'. That's a
       | _much_ harder problem, and tends to require a lot of search
       | through program-space (whether NNs or GOFAI).
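        | 
        | (A rough sketch of that 'Shannon-style' reading, assuming numpy
        | and a made-up population of mesh snapshots; PCA plays the role
        | of the shallow, domain-specific interpreter:)
        | 
        |     import numpy as np
        | 
        |     # Population of 'messages': 1000 mesh snapshots, 300 coordinates each
        |     rng = np.random.default_rng(0)
        |     X = rng.normal(size=(1000, 300)) @ rng.normal(size=(300, 300))
        | 
        |     # Fit the interpreter on population statistics (PCA via SVD)
        |     mean = X.mean(axis=0)
        |     U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
        |     k = 8                           # length of each encoding
        |     axes = Vt[:k]                   # the chosen deformation axes
        | 
        |     # Encode / decode an individual snapshot
        |     code = (X[0] - mean) @ axes.T   # 8 numbers instead of 300
        |     approx = mean + code @ axes     # lossy reconstruction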
        
         | yters wrote:
         | Shannon info and KC are asymptotically equivalent.
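          | 
          | (Roughly, one standard form of that equivalence, for a
          | computable distribution P with entropy H(P): the expected
          | prefix complexity matches the entropy up to an additive
          | constant that depends only on P.)
          | 
          |     0 <= sum_x P(x)*K(x) - H(P) <= K(P) + O(1)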
        
           | chriswarbo wrote:
           | That's why I used 'scare quotes'. It's very similar (perhaps
           | provably equivalent, I'm not sure) to how Frequentist and
            | Bayesian probabilities are asymptotically equivalent, yet each
            | offers its own perspective on few/one-shot cases.
           | 
           | In my experience, general practice w.r.t. Shannon information
           | is to manually pick a model with some numerical parameters
           | (Huffman tree, deformation-axes-NN, etc.), then set those
           | parameters to work well on an example population.
           | 
           | General practice w.r.t. algorithmic information seems to
           | involve searching program space for a good model. The
           | adjustable part is the language, which can certainly be
           | manually tailored to the application, but doesn't tend to be
            | the bulk of the work (e.g. _automatically_ searching for an
            | efficient language tends to be redundant once we're doing
            | program search, since those programs will contain their own
            | abstractions in any case).
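            | 
            | (For the Huffman case, a minimal sketch of 'fit the
            | hand-picked model to an example population', assuming
            | Python's heapq and a toy corpus:)
            | 
            |     import heapq
            |     from collections import Counter
            | 
            |     def huffman_code(population):
            |         # The model's parameters: per-symbol frequencies
            |         freq = Counter(population)
            |         heap = [[n, [sym, ""]] for sym, n in freq.items()]
            |         heapq.heapify(heap)
            |         while len(heap) > 1:
            |             lo = heapq.heappop(heap)
            |             hi = heapq.heappop(heap)
            |             for pair in lo[1:]:
            |                 pair[1] = "0" + pair[1]
            |             for pair in hi[1:]:
            |                 pair[1] = "1" + pair[1]
            |             heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
            |         return dict(heap[0][1:])
            | 
            |     codes = huffman_code("abracadabra")   # short codes for common symbols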
        
       | wfhpw wrote:
       | Hugged until KO:
       | https://web.archive.org/web/20210204154139/http://www.theora...
        
       | mywittyname wrote:
       | PCA can solve a surprisingly large number of problems. And when
        | it can't directly solve a problem, PCA can be used as a
        | preliminary analytics step to rank features by weight/impact,
        | which can then be combined with iterative methods to solve other
       | types of problems.
       | 
       | PCA was my introduction to data science and it remains one of my
       | favorite tools to pull out. I'm really surprised by the number of
        | data scientists I meet who never use it, or have never even
        | heard of it. Then again, even Data Science From Scratch dedicates
       | like two pages to the subject, so maybe it is not something
       | modern data science programs spend any time on.
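        | 
        | (A minimal sketch of that preliminary step, assuming
        | scikit-learn and a hypothetical feature matrix X; the component
        | loadings are what you'd use to rank features by impact:)
        | 
        |     import numpy as np
        |     from sklearn.decomposition import PCA
        | 
        |     X = np.random.rand(500, 20)               # hypothetical: 500 rows, 20 features
        |     pca = PCA(n_components=5).fit(X)
        | 
        |     print(pca.explained_variance_ratio_)      # how much each axis explains
        |     impact = np.abs(pca.components_).sum(axis=0)
        |     ranking = np.argsort(impact)[::-1]        # feature indices, highest loading first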
        
         | uoaei wrote:
          | I mostly use it to compress my input features, i.e., to
          | minimize the number of inputs a given model needs, while
          | accepting a tiny loss in accuracy. This is fine because with
         | linear models (such as the first/last layer of an NN) it merely
         | amounts to a rotation, and the mapping back to the original
         | feature space is well-defined.
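          | 
          | (A small sketch of that use, assuming scikit-learn and a
          | hypothetical input matrix; the orthonormal components are why
          | it's 'merely a rotation' and why the map back is well-defined:)
          | 
          |     import numpy as np
          |     from sklearn.decomposition import PCA
          | 
          |     X = np.random.rand(1000, 50)          # hypothetical raw inputs
          |     pca = PCA(n_components=10).fit(X)
          | 
          |     Z = pca.transform(X)                  # 10 inputs instead of 50
          |     X_back = pca.inverse_transform(Z)     # well-defined map to feature space
          | 
          |     # the axes are orthonormal, so the projection is a rotation (plus a drop)
          |     assert np.allclose(pca.components_ @ pca.components_.T, np.eye(10))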
        
       | fwsgonzo wrote:
       | Very illuminating. Must have taken a long time to write that blog
       | post.
       | 
        | Does the PS5 have a chip that can execute small neural networks?
        | My PC doesn't have one right now, but the latest GPUs already
        | include dedicated hardware for this. This is all very new to me.
        
       | [deleted]
        
       | derbOac wrote:
       | I remember this essay and liked it quite a bit. I do research in
       | this area and, although there's an armchair aspect to the piece,
       | there's also something insightful in it about the "memory versus
        | computation" paradigm. The way the algorithmic complexity
        | literature discusses modeling is missing something, I think --
        | an intuitive treatment of the tradeoffs involved in different
        | types of data representations -- and this schema is on to
        | something.
        
         | [deleted]
        
       | PartiallyTyped wrote:
        | Another achievement of ML is the Support Vector Machine (SVM),
        | and I'd say that it better represents the tradeoff between space
        | and computation.
        | 
        | In fact, SVMs literally make that tradeoff by using the kernel
        | trick to save on the computational cost of projecting to higher,
        | possibly infinite, dimensions. As we know, projecting up to
        | higher dimensions can give us 100% accuracy on the training
        | data, but of course that's overfitting, and we use
        | constraints/support vectors to regularize the learner.
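        | 
        | (A minimal sketch of that tradeoff, assuming scikit-learn and a
        | made-up 2-D dataset; the RBF kernel computes inner products in
        | an infinite-dimensional feature space without ever materialising
        | the projection, and C controls the regularization:)
        | 
        |     import numpy as np
        |     from sklearn.svm import SVC
        | 
        |     X = np.random.rand(200, 2)
        |     y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 0.5).astype(int)
        | 
        |     # k(x, z) = exp(-gamma * ||x - z||^2), evaluated directly
        |     clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)
        |     print(clf.n_support_)                 # support vectors per class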
        
       ___________________________________________________________________
       (page generated 2021-02-04 23:01 UTC)