[HN Gopher] My Python code is a neural network
       ___________________________________________________________________
        
       My Python code is a neural network
        
       Author : gnyeki
       Score  : 235 points
       Date   : 2024-07-01 12:47 UTC (10 hours ago)
        
 (HTM) web link (blog.gabornyeki.com)
 (TXT) w3m dump (blog.gabornyeki.com)
        
       | lawlessone wrote:
       | Edit: ok i see it detects code.
       | 
       | I thought it was replacing bits of ANN with custom python
       | functions.
        
       | sdwr wrote:
       | This is new to me, and therefore bad and scary.
       | 
       | It's great that you know NN well enough to fold it into regular
       | work. But think of all us poor regular developers! Who now have
       | to grapple with:
       | 
       | - an unfamiliar architecture
       | 
       | - uncertainty / effectively non-deterministic results in program
       | flow
        
         | sva_ wrote:
          | NNs are in principle deterministic (unless you add
          | randomness, as with top-p/top-k sampling and temperature in
          | LLMs).
         | 
         | Uncertainty is probably the better word of the two, but I feel
         | like there should be a different term.
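A minimal illustration of that point (not from the article; all names here are made up): the same logits always give the same greedy answer, and randomness only enters when you deliberately sample from the temperature-scaled distribution.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature rescales before normalizing."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """Deterministic: always the argmax, no randomness involved."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample(logits, temperature=1.0, rng=random):
    """Stochastic: draw from the (temperature-scaled) distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5]
assert greedy(logits) == 0        # same answer on every call
print(sample(logits, temperature=1.5))  # may differ from call to call
```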
        
       | alsxnt wrote:
       | Recurrent neural networks can be used for arbitrary computations,
       | the equivalence to Turing machines has been proven. However, they
       | are utterly impractical for the task.
       | 
       | This seems to be a state machine that is somehow learned. The
       | article could benefit from a longer synopsis and "Python" does
       | not appear to be relevant at all. Learning real Python semantics
       | would prove quite difficult due to the nature of the language (no
       | standard, just do as CPython does).
        
         | danans wrote:
         | > Recurrent neural networks can be used for arbitrary
         | computations, the equivalence to Turing machines has been
         | proven. However, they are utterly impractical for the task.
         | 
         | Karpathy's 2015 RNN article [1] demonstrated that RNNs trained
         | character-wise on Shakespeare's works could produce
         | Shakespeare-esque text (albeit without the narrative coherence
         | of LLMs). Given that, why wouldn't they be able to handle
         | natural language as formulaic as code review comments?
         | 
         | In that case inference was run with randomized inputs in order
         | to generate random "Shakespeare", but the structure of the
         | language and style was still learned by the RNN. Perhaps it
         | could be used for classification also.
         | 
         | 1. https://karpathy.github.io/2015/05/21/rnn-effectiveness/
        
           | vidarh wrote:
           | For RNN abilities, RWKV is worth a look[1]
           | 
           | It's billed as "an RNN with GPT-level LLM performance".
           | 
           | [1] https://www.rwkv.com/
        
       | danans wrote:
       | I'd like to see a cost vs precision/recall comparison of using a
       | RNN vs an LLM (local or API) for a problem like this.
        
         | danans wrote:
         | Hah... Ask, and HN (sorta) delivers. Not RNN, but self-
         | finetuned LLM cost vs performance compared to GPT-4.
         | 
         | https://openpipe.ai/blog/mixture-of-agents
        
       | Fripplebubby wrote:
       | Love this post! Gets into the details of what it _really_ means
       | to take some function and turn it into an RNN, and comparing that
       | to the "batteries included" RNNs included in PyTorch, as a
       | learning experience.
       | 
       | Question:
       | 
       | > To model the state, we need to add three hidden layers to the
       | network.
       | 
       | How did you determine that it would be three hidden layers? Is it
       | a consequence of the particular rule you were implementing, or is
       | that generally how many layers you would use to implement a rule
       | of this shape (using your architecture rather than Elman's -
       | could we use fewer layers with Elman's?)?
        
         | gnyeki wrote:
         | I'm glad you found it valuable! Both are good questions and I
         | haven't gone far enough mapping the code to Elman's
         | architecture to know the answer to the second.
         | 
         | For your first question, using three hidden layers makes it a
         | little clearer what the network does. Each layer performs one
         | step of the calculation. The first layer collects what is known
         | from the current token and what we knew after the calculation
         | for the previous token. The second layer decides whether the
         | current token looks like program code, by checking if it
         | satisfies the decision rule. The third layer compares the
         | decision with what we decided for previous tokens.
         | 
         | I think that this could be compressed into a single hidden
         | layer, too. A ReLU should be good enough at capturing non-
         | linearities so this should work.
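A hedged sketch of the kind of single-hidden-layer recurrence being discussed (the weights below are hand-picked for illustration and are not taken from the post): each step computes the new state from the current token and the previous state, and a unit with a recurrent weight carries information forward across tokens.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def step(x_t, h_prev, Wx, Wh, b):
    """One Elman-style update: h_t = relu(Wx @ x_t + Wh @ h_prev + b)."""
    h = []
    for i in range(len(b)):
        s = b[i]
        s += sum(Wx[i][j] * x_t[j] for j in range(len(x_t)))
        s += sum(Wh[i][j] * h_prev[j] for j in range(len(h_prev)))
        h.append(s)
    return relu(h)

# Illustrative 2-d state, 1-d input.  The second unit has recurrent
# weights, so it "remembers" whether an earlier token was nonzero.
Wx = [[1.0], [0.0]]
Wh = [[0.0, 0.0], [1.0, 1.0]]
b = [0.0, 0.0]

h = [0.0, 0.0]
for x in ([1.0], [0.0], [1.0]):
    h = step(x, h, Wx, Wh, b)
```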
        
           | Fripplebubby wrote:
           | Ah, that makes sense. So, we consider two hidden layers more
           | as "memory" or "buffers", and actually the rule is
           | implemented in just one layer, at least for a single token.
        
       | ryjo wrote:
       | Really awesome. Thanks for this thorough write-up. I don't
       | totally understand the deeper math concepts mentioned in this
       | article around RNNs, but it's sparked some of my own thoughts. It
       | feels similar to things I've been exploring lately-- that is:
       | building your app interwoven with forward chaining algorithms. In
       | your case, you're using RNNs, and in mine, I'm building into the
       | Rete algorithm.
       | 
       | You also touch on something in this article that I've found quite
       | powerful: putting things in terms of digesting an input string
       | character-by-character. Then, we offload all of the reasoning
       | logic to our algorithm. We write very thin i/o logic, and then
       | the algorithm does the rest.
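A minimal sketch of that pattern (all names hypothetical): the I/O loop only feeds characters, and whatever "reasoning" there is lives entirely in a pluggable step function.

```python
def run(text, step, state):
    """Thin I/O loop: feed one character at a time; all logic lives in `step`."""
    for ch in text:
        state = step(ch, state)
    return state

# Toy "reasoner": count characters that look like code punctuation.
CODE_CHARS = set("(){}[];=")

def count_code_chars(ch, n):
    return n + (1 if ch in CODE_CHARS else 0)

hits = run("x = f(1); y = [2]", count_code_chars, 0)
```

The same loop works unchanged whether `step` is a hand-written rule, a Rete-style rule engine, or an RNN cell; only the state representation differs.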
        
       | pakl wrote:
       | There exists the Universal (Function) Approximation Theorem for
       | neural networks -- which states that they can represent/encode
       | any function to a desired level of accuracy[0].
       | 
       | However there does not exist a theorem stating that those
       | approximations can be learned (or how).
       | 
       | [0]
       | https://en.m.wikipedia.org/wiki/Universal_approximation_theo...
        
         | arketyp wrote:
         | Makes you wonder what is meant by learning...
        
           | dekhn wrote:
           | Learning is using observations to create/update a model that
           | makes predictions which are more accurate than chance. At
           | some point the model ends up having generalizability beyond
           | the domain.
        
         | jb1991 wrote:
          | FYI, there are actually many algorithms, going back further
          | than neural networks, that have been proven to be universal
          | function approximators. Neural networks are certainly not
          | the only ones, and not the first. Quite a few are actually
          | much more appropriate than a neural network for many cases.
        
           | derangedHorse wrote:
           | What other algorithms can do this and which situations would
           | they be more useful than neural networks?
        
             | gnyeki wrote:
             | This area is covered by non-parametric statistics more
             | generally. There are many other methods to non-
             | parametrically estimate functions (that satisfy some
             | regularity conditions). Tree-based methods are one family
             | of such methods, and the consensus still seems to be that
             | they perform better than neural networks on tabular data.
             | For example:
             | 
             | https://arxiv.org/abs/2106.03253
        
             | someoneontenet wrote:
              | Newton's method approximates square roots. It's useful
              | if you want to approximate something like that without
              | pulling in the computational power required by a NN.
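For concreteness, a minimal sketch of Newton's iteration for square roots (the tolerance and starting point here are chosen arbitrarily):

```python
def newton_sqrt(a, tol=1e-12):
    """Approximate sqrt(a), a > 0, by iterating x <- (x + a/x) / 2."""
    x = a if a > 1 else 1.0          # any positive starting guess works
    while abs(x * x - a) > tol:
        x = 0.5 * (x + a / x)
    return x

assert abs(newton_sqrt(2.0) - 2.0 ** 0.5) < 1e-9
```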
        
               | kristjansson wrote:
               | Newton's method related to universal function
               | approximation in the same way a natural oil seep is
               | related to a modern IC engine...
        
               | astrobe_ wrote:
               | I think the problem to solve is more like : given a set
               | of inputs and outputs, find a function that gives the
               | expected output for each input [1]. This is like Newton's
               | method on a higher order ;-). One can find such a tool in
               | Squeak or Pharo Smalltalk, IIRC.
               | 
               | [1] https://stackoverflow.com/questions/1539286/create-a-
               | functio...
        
             | jb1991 wrote:
             | https://www.quora.com/Is-there-any-universal-function-
             | approx...
        
         | richrichie wrote:
          | Not any function, though. There are restrictions on the
          | types of functions the "universal" approximation theorem
          | applies to. Interestingly, the theorem is about a single-
          | layer network. In practice, that does not work as well as
          | having many layers.
        
         | visarga wrote:
          | They can model only continuous functions, more specifically
          | any continuous function on compact subsets of R^n. They can
          | approximate such functions to an arbitrary level of
          | accuracy, given sufficient neurons.
        
         | montebicyclelo wrote:
         | People throw that proof around all the time; but all it does is
         | show that a neural net is equivalent to a lookup table; and a
         | lookup table with enough memory can approximate any function.
         | It's miles away from explaining how real world, useful, neural
         | nets, like conv-nets, transformers, LSTMs, etc. actually work.
        
       | scotchmi_st wrote:
       | This is an interesting article if you read it like a howto for
       | constructing a neural network for performing a practical task.
       | But if you take it at face-value, and follow a similar method the
       | next time you need to parse some input, then, well, I don't know
       | what to say really.
       | 
       | The author takes a hard problem (parsing arbitrary input for
       | loosely-defined patterns), and correctly argues that this is
       | likely to produce hard-to-read 'spaghetti' code.
       | 
       | They then suggest replacing that with code that is so hard to
       | read that there is still active research into how it works, (i.e
       | a neural net).
       | 
       | Don't over-index something that's inscrutable versus something
       | that you can understand but is 'ugly'. Sometimes, _maybe_, a ML
       | model is what you want for a task. But a lot of the time,
       | something that you can read and see why it's doing what it's
       | doing, even if that takes some effort, is better than something
       | that's impossible.
        
         | thoughtlede wrote:
         | I think the mention of 'spaghetti code' is a red herring from
         | the author. If the output from an algorithm cannot be defined
         | precisely as a function of the input, but you have some
         | examples to show, that's where machine learning (ML) is useful.
         | 
          | In the end, ML provides one more option to choose from.
          | Whether it works for you depends on evaluations and on how
          | much determinism and explainability you need from the
          | chosen algorithm/option.
         | 
          | The thing that struck me is whether an RNN is the right
          | choice, given that it would need to be trained and we'd
          | need more examples than we might have. That said, maybe
          | based on known 'rules' we can produce synthetic data for
          | both +ve and -ve cases.
        
       | jlturner wrote:
       | If this interests you, it's worth taking a look at Genetic
       | Programming. I find it to be a simpler approach at the same
       | problem, no math required. It simply recombines programs by their
       | AST, and given some heuristic, optimizes the program for it. The
       | magic is in your heuristic function, where you can choose what
       | you want to optimize for (ie. Speed, program length, minimize
       | complex constructs or function calls, network efficiency, some
       | combination therein, etc).
       | 
       | https://youtu.be/tTMpKrKkYXo
        
         | PixelN0va wrote:
         | hmm thanks for the link
        
         | nickpsecurity wrote:
         | I'll add the Humies Awards that highlight human-competitive
         | results. One can learn a lot about what can or can't be done in
         | this field by just skimming across all the submitted papers.
         | 
         | https://www.human-competitive.org/
        
       | dekhn wrote:
       | Are RNNs completely subsumed by transformers? IE, can I forget
       | about learning anything about how to work with RNNs, and instead
       | focus on transformers?
        
         | Voloskaya wrote:
         | Not if you want to be a PhD/Researcher in ML, yes otherwise.
         | 
         | Source: Working on ML/LLMs as a research engineer for the past
         | 7 years, including for one of the FAANG's research lab, always
         | wanted to take time to learn about RNN but never did and never
         | needed to.
        
           | rolisz wrote:
           | Oh, I'm sure plenty of recent PhDs don't know about RNNs.
           | They've been dropped like a hot potato in the last 4-5 years.
        
             | Voloskaya wrote:
             | I think to do pure research it's definitely worth knowing
             | about the big ideas of the past, why we moved on from them,
             | what we learned etc.
        
           | derangedHorse wrote:
           | I haven't read it in a while but I remember this post giving
           | a good rundown of rnns
           | 
           | https://dennybritz.com/posts/wildml/recurrent-neural-
           | network...
        
         | toxik wrote:
          | Transformers have finite context; RNNs don't. In practice
          | the RNN gradient signal is limited by backpropagation
          | through time: it decays. This is in fact the whole selling
          | point of transformers; association is no harder at long
          | distances than at short ones. But in theory an RNN can
          | remember infinitely far away.
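A toy illustration of that decay (the numbers are arbitrary): backprop through time multiplies one Jacobian factor per step, and with a tanh unit and a recurrent weight below 1 each factor is less than 1, so the product shrinks geometrically and distant tokens contribute almost nothing to the gradient.

```python
import math

# Each backprop-through-time step contributes the factor
# d/dh tanh(w*h) = w * (1 - tanh(w*h)^2), which here is always < 1.
w = 0.9
h = 0.5
grad = 1.0
for t in range(50):
    a = math.tanh(w * h)
    grad *= w * (1.0 - a * a)
    h = a

print(grad)  # tiny after 50 steps
```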
        
         | Fripplebubby wrote:
         | To further problematize this question (which I don't feel like
         | I can actually answer), consider this paper: "Transformers are
         | RNNs: Fast Autoregressive Transformers with Linear Attention" -
         | https://arxiv.org/pdf/2006.16236
         | 
         | What this shows is that actually a specific narrow definition
         | of transformer (a transformer with "causal masking" - see
         | paper) is equivalent to an RNN, and vice versa.
         | 
         | Similarly Mamba (https://arxiv.org/abs/2312.00752), the other
         | hot architecture at the moment, has an equivalent unit to a
         | gated RNN. For performance reasons, I believe they use an
         | equivalent CNN during training and an RNN during inference!
        
           | visarga wrote:
           | There still are important distinctions. RNNs have constant
           | memory while transformers expand their memory with each new
           | token. They are related, but one could in theory process an
           | unbounded sequence while the other cannot because of growing
           | memory usage.
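A toy sketch of that memory distinction (the updates are stand-ins, not real layers): the RNN carries one fixed-size state vector no matter how long the sequence, while a decoder-style cache grows with every token processed.

```python
def rnn_process(tokens, hidden=8):
    state = [0.0] * hidden                       # fixed-size state
    for t in tokens:
        state = [0.5 * s + t for s in state]     # stand-in recurrent update
    return state

def decoder_process(tokens):
    cache = []                                   # grows with every token
    for t in tokens:
        cache.append(t)                          # stand-in for caching keys/values
    return cache

assert len(rnn_process(range(1000))) == 8        # constant, regardless of length
assert len(decoder_process(range(1000))) == 1000 # linear in sequence length
```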
        
           | Fripplebubby wrote:
           | To be more concrete: you might decide not to learn about
           | RNNs, but still find them lurking in the things you did learn
           | about!
        
       | thih9 wrote:
       | > Of course, we should try and avoid writing spaghetti code if we
       | can. But there are problems that are so ill-specified that any
       | serious attempt to solve them results in just that.
       | 
       | Can you elaborate or do you have an example?
       | 
       | Based on just the above, I disagree - I'd say it's the job of the
       | programmer to make sure that the problem is well-specified and
       | that they can write maintainable code.
        
       | fnord77 wrote:
       | > To model the state, we need to add three hidden layers to the
       | network
       | 
       | Why 3?
       | 
       | And why use "h" for layer names?
        
         | muricula wrote:
         | `h` is for "hidden" layers.
        
       | skybrian wrote:
       | This article doesn't talk much about testing or getting training
       | data. It seems like that part is key.
       | 
       | For code that you think you understand, it's because you've
       | informally proven to yourself that it has some properties that
       | generalize to all inputs. For example, a sort algorithm will sort
       | any list, not just the ones you tested.
       | 
       | The thing we're uncertain about for a neural network is that we
       | don't know how it will generalize; there are no properties that
       | we think are guaranteed for unseen input, even if it's slightly
       | different input. It might be because we have an ill-specified
       | problem and we don't know how to mathematically specify what
       | properties we want.
       | 
        | If you can actually specify a property well enough to write a
        | property-based test (like QuickCheck), then you can generate
        | large amounts of tests / training data through randomization.
        | Start with one example of what you want, then write tests
        | that generate every possible version of both positive and
        | negative examples.
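A hedged sketch of that idea applied to the article's task (everything below is made up for illustration): generators produce labeled positive and negative examples by construction, so the "property" you care about holds on all the synthetic data, which can then double as training or test input.

```python
import random

rng = random.Random(0)  # seeded for reproducibility

def make_code_line():
    """Synthetic positive example: something shaped like an assignment."""
    name = rng.choice(["x", "total", "idx"])
    value = rng.randint(0, 99)
    return f"{name} = {value}"

def make_prose_line():
    """Synthetic negative example: plain words, no code punctuation."""
    words = rng.choices(["the", "network", "reads", "tokens", "slowly"], k=4)
    return " ".join(words)

# Labels come for free: every pair is a (text, is_code) example.
dataset = [(make_code_line(), True) for _ in range(100)] + \
          [(make_prose_line(), False) for _ in range(100)]

# The property we'd want a model to satisfy holds on all generated data.
assert all(("=" in text) == is_code for text, is_code in dataset)
```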
       | 
       | It's not a proof, but it's a start. At least you know what you
       | would prove, if you could.
       | 
        | If you have such a thing, relying on spaghetti code or on a
        | neural network seems kind of similar? If you want another
        | property to
       | hold, you can write another property-based test for it. I suppose
       | with the neural network you can train it instead of doing the
       | edits yourself, but then again we have AI assistance for code
       | fixes.
       | 
       | I think I'd still trust code more. At least you can debug it.
        
       | dinobones wrote:
       | This article was going decently and then it just falls off a
       | cliff.
       | 
        | The article basically says:
        | 
        | 1) Here's this complex problem
        | 
        | 2) Here's some hand-written heuristics
        | 
        | 3) Here's a shitty neural net
        | 
        | 4) Here's another neural net with some guy's last name from
        | the PyTorch library
        | 
        | 5) Here are the constraints on adopting neural nets
        | 
        | You can see why this is so unsatisfying: the leaps in logic
        | become more and more generous.
       | 
       | What I would have loved to see, is a comparison of a spaghetti
       | code implementation vs a neural net implementation on a large
       | dataset/codebase, then show examples in the validation set that
       | maybe the neural net _generalizes to_ , or fails at, but the
       | heuristic fails at, and so on.
       | 
       | This would demonstrate the value of neural nets, if for example,
       | there's a novel example that the neural net finds that the
       | spaghetti heuristic can't.
       | 
       | Show tangible results, show some comparison, show something,
       | giving some rough numbers on the performance of each in aggregate
       | would be really useful.
        
       | ultra_nick wrote:
       | I feel like neural networks are increasingly going to look like
       | code.
       | 
       | The next big innovation will be whoever figures out how to
       | convert MOE style models into something like function calls.
        
       | godelski wrote:
       | > Humans are bad at managing spaghetti code. Of course, we should
       | try and avoid writing spaghetti code if we can. But there are
       | problems that are so ill-specified that any serious attempt to
       | solve them results in just that.
       | 
       | Sounds like a skill issue.
       | 
       | But seriously, how many programmers do you know that reach for
       | the documents or help pages (man pages?) instead of just looking
       | for the first SO post with a similar question? That's how you
       | start programming because you're just trying to figure out how to
       | do anything in the first place, but not where you should be years
       | later. If you've been programming in a language for years you
       | should have read a good portion of the docs in that time (in
        | addition to SO posts), blogs, and so much more. Because
        | things change too, you have to keep up, and the truth is that
        | this will never happen if you only read SO posts to answer
        | your one question (and the next, and the next), because SO
        | will always lag behind what tools exist -- and likely lag
        | significantly, because more recent posts have had less time
        | to gain upvotes.
       | 
       | It kinda reminds me of the meme "how to exit vim." And how people
       | state that it is so hard to learn. Not only does just typing
       | `vim` into the terminal literally tell you how to quit, but
       | there's a built in `vimtutor` that'll tell you how to use it and
       | doesn't take very long to use. I've seen people go through this
       | and be better than people that have "used" vim for years. And
       | even then, how many people write `:help someFunction` into vim
       | itself? Because it is FAR better than googling your question and
       | you'll actually end up learning how the whole thing fits together
       | because it is giving you context. The same is true for literally
       | any programming language.
       | 
       | You should also be writing docs to your code because if you have
       | spaghetti code, there's a puzzle you haven't solved yet. And
       | guess what, documenting is not too different from the rubber
       | ducky method. Here's the procedure: write code to make shit work,
       | write docs and edit your code as you realize you can make things
       | better, go on and repeat but not revisit functions as you fuck
       | them up with another function. It's not nearly as much work as it
       | sounds and the investments compound. But quality takes time and
       | nothing worth doing is easy. It takes time to learn any habit and
       | skill. If you always look for the quickest solution to "just get
       | it done" and you never come back, then you probably haven't
       | learned anything, you've just parroted someone else. Moving fast
       | and breaking things is great, but once you have done that you got
       | to clean up your mess. You don't clean your kitchen by breaking
       | your dining room table. And your house isn't clean if all your
       | dishes are on the table! You might have to temporarily move stuff
       | around, but eventually you need to clean shit up. And code is
       | exactly the same way. If you regularly clean your house, it stays
       | clean and is easy to keep clean. But if you do it once a year it
       | is a herculean effort that you'll dread.
        
       | suzukigsx1100g wrote:
       | That's pretty lightwork for a snake in general. Send me a direct
       | message if you come up with something better.
        
       ___________________________________________________________________
       (page generated 2024-07-01 23:00 UTC)