[HN Gopher] I made a transformer to predict a simple sequence ma...
       ___________________________________________________________________
        
       I made a transformer to predict a simple sequence manually
        
       Author : lukastyrychtr
       Score  : 194 points
       Date   : 2023-09-22 08:22 UTC (14 hours ago)
        
 (HTM) web link (vgel.me)
 (TXT) w3m dump (vgel.me)
        
       | dang wrote:
       | [stub for sweeping offtopicness under the rug]
        
         | seabass-labrax wrote:
         | Minor request: can we have 'neural network' or something in the
         | title? This is related to the machine learning 'transformer'
         | architecture, rather than the bundle of coils that couples two
         | circuits electromagnetically.
        
           | sottol wrote:
           | "no training" gave it away for me.
        
             | seabass-labrax wrote:
             | I misinterpreted that as "I have no training; I am a
             | beginner" rather than "I am not going to train this thing"
             | :)
        
               | chungy wrote:
                | That's how I interpreted it as well.
               | 
               | "Electrical transformer without any formal training in
               | electric engineering" is basically how the title read to
               | me. Followed by a lot of confusion when the article was
               | not that.
               | 
               | Good on him, it's just... an odd way to phrase it :)
        
               | eichin wrote:
               | Heh, I went a step further and thought "ooh, a novice EE
               | project, but titling it like that everyone's going to be
               | confused about it involving machine learning" until I
               | realized I was the one that got it backwards...
        
               | kulahan wrote:
               | I didn't even consider that it COULD be about LLMs until
               | I saw the comment above.
        
               | poniko wrote:
                | Thought he was a novice and had no training in the field.
        
           | cypress66 wrote:
           | Funny how even though I studied EE, it didn't cross my mind
           | it could be an electrical transformer.
        
           | freecodyx wrote:
            | I also thought of the physical transformer.
        
           | 3abiton wrote:
           | It's always mildly annoying when different technologies have
           | the same name or acronym
        
           | u_name wrote:
           | And not about "alien robots who can disguise themselves by
           | transforming into everyday machinery, primarily vehicles" [1]
           | either. :)
           | 
           | [1] https://en.wikipedia.org/wiki/Transformers_(film)
        
             | seabass-labrax wrote:
             | Ha, I didn't think of that. But if it was, that might be
             | more impressive than either coiling wire or writing Python
             | code :)
        
               | cycomanic wrote:
               | You might appreciate this then
               | https://m.youtube.com/watch?v=uFmV0Xxae18
        
             | HankB99 wrote:
             | "Transformers: More than meets the eye!"
             | 
             | That was my second thought. My first thought was coils of
             | wire wrapping iron cores done by someone who does not know
             | what they're doing.
        
         | p1esk wrote:
         | Somehow I expected he made it out of mechanical parts. Like
         | literally "by hand".
        
           | jebarker wrote:
           | Nitpick: the author is female
        
             | GaggiX wrote:
             | The author is a trans woman, isn't male and female about
             | sex and woman and man about gender? At least that's what a
             | trans nonbinary person taught me.
        
               | hatthew wrote:
               | I don't think that is the prevailing opinion among trans
               | folks.
        
               | GaggiX wrote:
               | Well I don't think it was an opinion.
        
             | corenen wrote:
             | [flagged]
        
             | melenaboija wrote:
              | Yo, Twitter is not closed, it's just that they renamed it
              | X. These colorful comments deserve to have the right
              | audience.
        
               | beebeepka wrote:
               | What the hell are you talking about?
        
               | jacoblambda wrote:
                | They aren't being mean. They are just nitpicking that
                | s/he/she/g
        
               | jebarker wrote:
                | This isn't a colorful comment. It's a small correction
                | to a factual error. No culture war to be had here.
        
               | iraqmtpizza wrote:
               | [flagged]
        
               | SalmoShalazar wrote:
               | Why are people allowed to post this hateful culture war
               | shit on this website?
        
               | iraqmtpizza wrote:
               | [flagged]
        
               | dredmorbius wrote:
               | <https://news.ycombinator.com/item?id=37603449>
        
           | sitzkrieg wrote:
           | me too, i was hoping to see a flyback transformer wound with
           | a power drill or something
        
         | [deleted]
        
         | b20000 wrote:
         | i thought, here is a guy building an electrical transformer by
         | hand, but no, more AI shite
        
           | artursapek wrote:
           | same lol but this post actually looks rly good
        
       | teddykoker wrote:
       | A related line of work is "Thinking Like Transformers" [1]. They
       | introduce a primitive programming language, RASP, which is
       | composed of operations capable of being modeled with transformer
       | components, and demonstrate how different programs can be written
       | with it, e.g. histograms, sorting. Sasha Rush and Gail Weiss have
        | an excellent blog post on it as well [2]. Follow-on work
        | demonstrated how RASP-like programs can actually be compiled
        | into model weights without training [3].
       | 
       | [1] https://arxiv.org/abs/2106.06981
       | 
       | [2] https://srush.github.io/raspy/
       | 
       | [3] https://arxiv.org/abs/2301.05062
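        | 
        | To give a flavor of RASP, a histogram program is roughly "for
        | each position, count how many positions hold the same token".
        | Below is a plain-Python sketch of that idea; the select /
        | selector_width names mimic the paper's primitives but are not
        | the real RASP or raspy API, so treat them as assumptions.
        | 
        |   # Toy RASP-style primitives over a token sequence (sketch).
        |   def select(keys, queries, predicate):
        |       # selector[q][k]: does query position q attend to key k?
        |       return [[predicate(k, q) for k in keys] for q in queries]
        | 
        |   def selector_width(selector):
        |       # how many positions each query selected -- a histogram
        |       return [sum(row) for row in selector]
        | 
        |   tokens = list("hello")
        |   same = select(tokens, tokens, lambda k, q: k == q)
        |   print(selector_width(same))  # [1, 1, 2, 2, 1]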
        
         | newhouseb wrote:
         | Huge fan of RASP et al. If you enjoy this space, might be fun
         | to take a glance at some of my work on HandCrafted Transformers
         | [1] wherein I hand-pick the weights in a transformer model to
          | do longhand addition similar to how humans learn to do it in
          | grade school.
         | 
         | [1]
         | https://colab.research.google.com/github/newhouseb/handcraft...
        
       | Supply5411 wrote:
        | I've been kicking around a similar idea for a while. Why can't we
       | have an intuitive interface to the weights of a model, that a
       | domain expert can tweak by hand to accelerate training? For
       | example, in a vision model, they can increase the "orangeness"
       | collection of weights when detecting traffic cones. That way,
       | instead of requiring thousands/millions more examples to
       | calibrate "orangeness" right, it's accelerated by a human expert.
       | The difficulty is obviously having this interface map to the
       | collections of weights that mean different things, but is there a
       | technical reason this can't be done?
        
         | amilios wrote:
         | The technical reason it can't be done (or would be very
         | difficult to do) is that weights are typically very
         | uninterpretable. There aren't specific clusters of neurons that
          | map to one concept or another; everything kind of does
          | everything.
        
           | Supply5411 wrote:
           | I wonder if an expert can "impose" weights onto a model and
           | the model will opt to continue with them when it resumes
           | training. For example, in the vision example, the expert may
           | not know where "orangeness" currently exists, but if they
           | impose their own collection of weight adjustments that
           | represent orangeness, will the model continue to use these
           | weights as the path of least resistance when continuing to
           | optimize? Just spitballing, but if we can't pick out which
           | neurons do what, the alternative would seem to be to
           | encourage the model to adopt a control interface of neurons.
        
             | astrange wrote:
             | That would make it less efficient - since learning is
             | compression, a less compressed model will also learn less
             | at the same size.
        
         | klysm wrote:
         | The attention mechanisms present in transformers don't seem
         | easy to map to semantics that humans can understand. There are
         | too many parameters involved
        
         | elesiuta wrote:
         | > weights of a model, that a domain expert can tweak by hand
         | 
         | This sounds similar to how image recognition was done before
         | deep learning [1]
         | 
         | [1] https://www.youtube.com/watch?v=8SF_h3xF3cE&t=1358s
        
           | Supply5411 wrote:
           | Great example. Right, the deep learning approach uncovers all
           | kinds of hidden features and relationships automatically that
           | a team of humans might miss.
           | 
           | I guess I'm thinking about this problem from the perspective
           | of these GPT models requiring more training data than a
            | normal person can acquire. Currently, it seems you need the
            | entire internet's worth of training data (and a lot of money)
           | to get something that can communicate reasonably well. But
           | most people can communicate reasonably well, so it would be
           | cool if that basic communication knowledge could be somehow
           | used to accelerate training and minimize the reliance on
           | training data.
        
             | mistrial9 wrote:
             | > the deep learning approach uncovers all kinds of hidden
             | features and relationships automatically that a team of
             | humans might miss
             | 
              | Sitting in a lecture from a decent deep learning
              | practitioner, I heard two questions from the audience
              | (among others). The first question asked "How can we check
             | the results using other models, so that computers will
             | catch the errors that humans miss?"
             | 
             | The second question was more like "when a model is built
             | across a non-trivial input space, the features and classes
             | that come out are one set of possibilities, but there are
             | many more possibilities. How can we discover more about the
             | model that is built, knowing that there are inherent
             | epistemological conflicts in any model?"
             | 
             | I also thought it was interesting that the two questioners
             | were from large but very different demographic groups, and
             | at different stages of learning and practice (the second
             | question was from a senior coder).
        
             | EricMausler wrote:
             | I am still learning transformers, but I believe part of the
             | issue may be that the weights do not necessarily correlate
             | to things like "orangeness"
             | 
             | Instead of a transformer for each color, you have like 5 to
             | 100 weights that represent some arbitrary combination of
             | colors. Literally the arbitrariness is defined by the
             | dataset and the number of weights allocated.
             | 
             | They may even represent more than just color.
             | 
             | So I am not sure if a weight is actually a "dial" like you
             | are describing it, where you can turn up or down different
             | qualities. I think the relationship between weights and
             | features is relatively chaotic.
             | 
             | Like you may increase orangeness but decrease "cone
             | shapedness" or accidentally make it identify deer as trees
             | or something, all by just changing 1 value on 1 weight
        
         | jarde wrote:
          | The number of layers and weights is really not at a scale we
          | can handle updating manually, and even if we could, the
          | downstream effects of modifying weights are way too hard to
          | manage. Say you are updating the picture to be better at
          | orange: unless you can monitor all the other colours for
          | correctness at the same time, you are probably creating issues
          | for other colours without realizing it.
        
       | PaulHoule wrote:
        | It's some kind of abstract machine, like a Turing machine or the
        | machine that parses regexes, isn't it?
        
         | nerdponx wrote:
         | It's kind of hard to interpret these things as "automata" in
         | the sense that one might usually think of them.
         | 
         | Everything is usually a little fuzzy in a neural network.
         | There's rarely anything like an if/else statement, although (as
         | in the transformer example) you have some cases of "masking"
          | values with 0 or -∞. The output is almost always fuzzy as
         | well, being a collection of scores or probabilities. For
         | example, a model that distinguishes cat pictures and dog
         | pictures might emit a result like "dog:0.95 cat:0.05", and we
          | say that it predicted a dog because the dog score is higher
         | than the cat score.
         | 
         | In fact, the core of the transformer, the attention mechanism,
         | is based on a kind of "soft lookup" operation. In a non-fuzzy
         | system, you might want to do something like loop through each
         | token in the sequence, check if that token is relevant to the
         | current token, and take some action if it's relevant. But in a
         | transformer, relevance is not a binary decision. Instead, the
         | attention mechanism computes a continuous relevance score
         | between each pair of tokens in the sequence, and uses those
         | scores to take further action.
         | 
         | But some things are _not_ easily generalized directly from a
          | system based on binary decisions. For example, those
         | relevance scores are used as weights to compute a weighted
         | average over tokens in the vocabulary, and thereby obtain an
         | "average token" for the current position in the sequence. I
         | don't think there's an easy way to interpret this as an
         | extension of some process based on branching logic.
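          | 
          | To make the "soft lookup" concrete, here's a minimal NumPy
          | sketch of a single attention head. The shapes and random
          | weights are made up for illustration, not taken from the
          | article.
          | 
          |   import numpy as np
          | 
          |   def attention(X, Wq, Wk, Wv):
          |       # X: (seq_len, d); one row per token position
          |       Q, K, V = X @ Wq, X @ Wk, X @ Wv
          |       s = Q @ K.T / np.sqrt(K.shape[-1])  # pair scores
          |       w = np.exp(s - s.max(axis=-1, keepdims=True))
          |       w /= w.sum(axis=-1, keepdims=True)  # softmax rows
          |       return w @ V  # weighted average of values
          | 
          |   rng = np.random.default_rng(0)
          |   X = rng.normal(size=(4, 8))  # 4 tokens, 8-dim vectors
          |   Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
          |   print(attention(X, Wq, Wk, Wv).shape)  # (4, 8)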
        
         | enriquto wrote:
         | Neural networks _are_ Turing machines. You can make them
         | perform any computation by carefully setting up their weights.
         | It would be nice to have compilers for them that were not based
         | on approximation, though.
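          | 
          | As a tiny concrete instance (my own toy sketch, not something
          | from the article): a two-unit ReLU network computes XOR
          | exactly, with every weight chosen by hand rather than trained.
          | 
          |   def relu(x):
          |       return max(x, 0.0)
          | 
          |   def xor_net(x1, x2):
          |       # hidden layer: two ReLU units, hand-chosen weights
          |       h1 = relu(x1 + x2)      # counts inputs that are on
          |       h2 = relu(x1 + x2 - 1)  # fires only when both are on
          |       return h1 - 2 * h2      # hand-chosen output weights
          | 
          |   for a in (0, 1):
          |       for b in (0, 1):
          |           print(a, b, xor_net(a, b))  # XOR truth table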
        
           | shawntan wrote:
           | Would I be able to solve the Travelling Salesman Problem with
           | a Transformer with the appropriately assigned weights? That
           | would be an achievement. You'd beat some known bounds of the
           | complexity of TSP.
        
           | PartiallyTyped wrote:
           | > It would be nice to have compilers for them that were not
           | based on approximation, though.
           | 
           | Could you elaborate?
        
             | enriquto wrote:
             | People typically set the weights of a neural network using
             | heuristic approximation algorithms, by looking at a large
             | set of example inputs/outputs and trying to find weights
             | that perform the needed computation as accurately as
             | possible. This approximation process is called _training_.
             | But this approximation happens because nobody really knows
             | how to set the weights otherwise. It would be nice if we
             | had  "compilers" for neural networks, where you write an
             | algorithm in a programming language, and you get a neural
             | network (architecture+weights) that performs the same
             | computation.
             | 
             | TFA is a beautiful step in that direction. What I want is
             | an automated way to do this, without having to hire vgel
             | every time.
        
               | justo-rivera wrote:
                | That makes no entropy or cybernetic sense at all. You
                | would just get a neural network that outputs the exact
                | formula, or algorithm. Like, if you just did a sine, it
                | would be a Taylor series encoded into neurons.
                | 
                | It's like going from computing pi as a constant to
                | computing it as a gigantic float.
                | 
                | You lose info.
        
               | PartiallyTyped wrote:
                | Why would you do that when it's better to do the
                | opposite? Given a model, quantize it and compile it to
                | direct code objects that do the same thing much, much
                | faster.
                | 
                | The generality of the approach [NNs] implies that they
                | are effectively a union of all programs that may be
                | represented, and as such there needs to be the capacity
                | for that; this capacity shows up as size, which makes
                | them wasteful for exact solutions.
               | 
                | It is fairly trivial to create FFNNs that behave as
                | decision trees using just ReLUs if you can encode your
                | problem as a continuous problem with a finite set of
                | inputs. Then you can very well say that this decision
                | tree is, well, a program, and there you have it.
                | 
                | The actual problem is the encoding, which is why NNs are
                | so powerful: they learn the encodings themselves through
                | gradient descent and its variants.
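                | 
                | A minimal sketch of that construction, assuming a
                | single threshold split and made-up leaf values (so
                | purely illustrative, not a general compiler):
                | 
                |   def relu(x):
                |       return max(x, 0.0)
                | 
                |   def step(x, t=0.5, k=1e4):
                |       # steep ramp from two ReLUs: ~0 below
                |       # the threshold t, ~1 above it
                |       z = k * (x - t)
                |       return relu(z) - relu(z - 1)
                | 
                |   def tree_net(x, left=1.0, right=3.0):
                |       # linear readout: left + (right-left)*step
                |       return left + (right - left) * step(x)
                | 
                |   print(tree_net(0.2), tree_net(0.9))  # 1.0 3.0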
        
               | shawntan wrote:
               | Something close exists:
               | 
               | RASP https://arxiv.org/abs/2106.06981
               | 
               | python implementation: https://srush.github.io/raspy/
        
               | Legend2440 wrote:
               | The point of training is to create computer programs
               | through optimization, because there are many problems
               | (like understanding language) that we just don't know how
               | to write programs to do.
               | 
               | It's not that we don't know how to set the weights -
               | neural networks are only designed with weights because it
               | makes them easy to optimize.
               | 
               | There is no reason to use them if you plan to write your
               | own code for them. You won't be able to do anything that
               | you couldn't do in a normal programming language, because
               | what makes NNs special _is_ the training process.
        
               | nyrikki wrote:
                | A Turing-complete system isn't necessarily useful; it
                | just means that it's equivalent to a Turing machine. The
                | ability to describe any possible algorithm is not that
                | powerful in itself.
               | 
               | As an example, algebraic type systems are often TC simply
               | because general recursion is allowed.
               | 
                | Feed-forward networks are effectively DAGs, and while you
                | may be able to express any algorithm using them, they are
                | also piecewise linear with respect to their inputs.
                | 
                | Statistical learning is powerful at finding and matching
                | patterns, but graph rewriting, which is what you're doing
                | with initial random weights and training, is not trivial.
                | 
                | More importantly, it doesn't make issues like the halting
                | problem decidable.
               | 
                | I don't see why the same limits that hit the graph
                | rewriting languages explored in the 90s won't also hit
                | feed-forward networks used as computation systems,
                | outside of the application of nation-state-scale
                | computing power.
               | 
               | But I am open to understanding where I am wrong.
        
               | shawntan wrote:
               | Short version of this without the caveats: It's not even
               | Turing complete.
               | 
               | I review a few papers on the topic here:
               | https://blog.wtf.sg/posts/2023-02-03-the-new-xor-problem/
        
           | ComputerGuru wrote:
           | The whole point is that we don't know what rules to encode
           | and we want the system to derive the weights for itself as
           | part of the training process. We have compilers that can do
           | deterministic code - that's the normal approach.
        
         | ntonozzi wrote:
         | Yes! Check out this paper describing how Linear Transformers
         | are secretly Fast Weight Programmers:
         | https://arxiv.org/abs/2102.11174.
        
         | thewataccount wrote:
         | This is simplified a bit - It's just a "machine" that maps [set
         | of inputs] -> [set of probabilities of the next output]
         | 
          | First you define a list of tokens - let's say 24 letters
          | because that's easier.
         | 
          | It's a machine that takes an input sequence of tokens, does a
          | deterministic series of matrix operations, and outputs a list
          | with the probability of every token.
         | 
         | "learning" is just the process of setting some of the numbers
         | inside of a matrix(s) used for some of the operations.
         | 
         | Notice that there's only a single "if" statement in their final
         | code, and it's for evaluating the result's accuracy. All of the
         | "logic" is from the result of these matrix operations.
        
         | taneq wrote:
         | It's just any pile of linear algebra that's been in contact
         | with the AllSpark, right?
        
       | tostadora1 wrote:
       | I always wanted to at least have a shallow understanding of
       | Transformers but the paper was way too technical for me.
       | 
       | This really helped me understand how they work! Or at least I
       | understood your example, it was very clear. And I also got to
       | brush up my matrix stuff from uni lol.
       | 
       | Thanks!
        
       | tayo42 wrote:
       | > maybe even feel inspired to make your own model by hand as
       | well!
       | 
        | Other than a learning exercise to satisfy your curiosity, what
        | are you doing with this? I'm starting to get the feeling that
        | anything complex with ML models is unreasonable for an at-home
        | blog reader?
        
         | nerdponx wrote:
         | It's an _excellent_ learning exercise, not just to satisfy
         | curiosity but to develop and deepen understanding.
        
         | onemoresoop wrote:
         | Author states in the first paragraph of their blog post:
         | 
         | "I've been wanting to understand transformers and attention
         | better for awhile now--I'd read The Illustrated Transformer,
         | but still didn't feel like I had an intuitive understanding of
         | what the various pieces of attention were doing. What's the
         | difference between q and k? And don't even get me started on
         | v!"
        
           | tayo42 wrote:
            | lol, three people said the same thing, I get that. That's
            | why I said other than learning and satisfying curiosity...
        
         | taneq wrote:
         | I dunno, maybe they actually enjoy hacking on projects like
          | this? Weird, I know.
        
       | utopcell wrote:
       | Darn it Theia: I can't stop playing your hoverator game!
        
       ___________________________________________________________________
       (page generated 2023-09-22 23:00 UTC)