[HN Gopher] I made a transformer to predict a simple sequence manually
___________________________________________________________________
I made a transformer to predict a simple sequence manually
Author : lukastyrychtr
Score : 194 points
Date : 2023-09-22 08:22 UTC (14 hours ago)
(HTM) web link (vgel.me)
(TXT) w3m dump (vgel.me)
| dang wrote:
| [stub for sweeping offtopicness under the rug]
| seabass-labrax wrote:
| Minor request: can we have 'neural network' or something in the
| title? This is related to the machine learning 'transformer'
| architecture, rather than the bundle of coils that couples two
| circuits electromagnetically.
| sottol wrote:
| "no training" gave it away for me.
| seabass-labrax wrote:
| I misinterpreted that as "I have no training; I am a
| beginner" rather than "I am not going to train this thing"
| :)
| chungy wrote:
| That's how I interpreted it as well.
|
| "Electrical transformer without any formal training in
| electrical engineering" is basically how the title read to
| me. Followed by a lot of confusion when the article was
| not that.
|
| Good on him, it's just... an odd way to phrase it :)
| eichin wrote:
| Heh, I went a step further and thought "ooh, a novice EE
| project, but titling it like that everyone's going to be
| confused about it involving machine learning" until I
| realized I was the one that got it backwards...
| kulahan wrote:
| I didn't even consider that it COULD be about LLMs until
| I saw the comment above.
| poniko wrote:
| Thought he was a novice and had no training in the field.
| cypress66 wrote:
| Funny how even though I studied EE, it didn't cross my mind
| it could be an electrical transformer.
| freecodyx wrote:
| I also thought of the physical transformer
| 3abiton wrote:
| It's always mildly annoying when different technologies have
| the same name or acronym
| u_name wrote:
| And not about "alien robots who can disguise themselves by
| transforming into everyday machinery, primarily vehicles" [1]
| either. :)
|
| [1] https://en.wikipedia.org/wiki/Transformers_(film)
| seabass-labrax wrote:
| Ha, I didn't think of that. But if it was, that might be
| more impressive than either coiling wire or writing Python
| code :)
| cycomanic wrote:
| You might appreciate this then
| https://m.youtube.com/watch?v=uFmV0Xxae18
| HankB99 wrote:
| "Transformers: More than meets the eye!"
|
| That was my second thought. My first thought was coils of
| wire wrapping iron cores done by someone who does not know
| what they're doing.
| p1esk wrote:
| Somehow I expected he made it out of mechanical parts. Like
| literally "by hand".
| jebarker wrote:
| Nitpick: the author is female
| GaggiX wrote:
| The author is a trans woman. Aren't "male" and "female"
| about sex, and "woman" and "man" about gender? At least
| that's what a trans nonbinary person taught me.
| hatthew wrote:
| I don't think that is the prevailing opinion among trans
| folks.
| GaggiX wrote:
| Well I don't think it was an opinion.
| corenen wrote:
| [flagged]
| melenaboija wrote:
| Yo, Twitter is not closed, it's just that they renamed it X.
| These colorful comments deserve to have the right audience.
| beebeepka wrote:
| What the hell are you talking about?
| jacoblambda wrote:
| They aren't being mean. They are just nitpicking:
| s/he/she/g
| jebarker wrote:
| This isn't a colorful comment. It's a small correction to a
| factual error. No culture war to be had here.
| iraqmtpizza wrote:
| [flagged]
| SalmoShalazar wrote:
| Why are people allowed to post this hateful culture war
| shit on this website?
| iraqmtpizza wrote:
| [flagged]
| dredmorbius wrote:
| <https://news.ycombinator.com/item?id=37603449>
| sitzkrieg wrote:
| me too, i was hoping to see a flyback transformer wound with
| a power drill or something
| [deleted]
| b20000 wrote:
| i thought, here is a guy building an electrical transformer by
| hand, but no, more AI shite
| artursapek wrote:
| same lol but this post actually looks rly good
| teddykoker wrote:
| A related line of work is "Thinking Like Transformers" [1]. They
| introduce a primitive programming language, RASP, which is
| composed of operations capable of being modeled with transformer
| components, and demonstrate how different programs can be written
| with it, e.g. histograms, sorting. Sasha Rush and Gail Weiss have
| an excellent blog post on it as well [2]. Follow-on work
| demonstrated how RASP-like programs can actually be compiled
| into model weights without training [3]; a rough sketch of the
| core primitives follows below the links.
|
| [1] https://arxiv.org/abs/2106.06981
|
| [2] https://srush.github.io/raspy/
|
| [3] https://arxiv.org/abs/2301.05062
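|
| As a rough illustration (not the official RASP syntax; the
| names and shapes here are my own), the core select/aggregate
| primitives can be sketched in a few lines of NumPy, enough to
| write the histogram example from the paper:
|
|     import numpy as np
|
|     # select builds a boolean "attention" matrix from a pairwise
|     # predicate; selector_width counts selected keys per query.
|     def select(keys, queries, predicate):
|         return np.array([[predicate(k, q) for k in keys]
|                          for q in queries])
|
|     def selector_width(sel):
|         return sel.sum(axis=1)
|
|     # Histogram: for each token, count occurrences of that token.
|     tokens = list("hello")
|     same = select(tokens, tokens, lambda k, q: k == q)
|     print(selector_width(same))  # [1 1 2 2 1]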
| newhouseb wrote:
| Huge fan of RASP et al. If you enjoy this space, might be fun
| to take a glance at some of my work on HandCrafted Transformers
| [1] wherein I hand-pick the weights in a transformer model to
| do longhand addition similar to how humans learn to do it in
| grade school.
|
| [1]
| https://colab.research.google.com/github/newhouseb/handcraft...
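|
| For anyone unfamiliar with the term: longhand addition is just
| the digit-by-digit, carry-the-one procedure from grade school.
| A plain-Python version of the algorithm the hand-picked weights
| imitate (my own sketch, not code from [1]):
|
|     def long_add(a: str, b: str) -> str:
|         # pad to equal length, then add digit by digit with a carry
|         n = max(len(a), len(b))
|         a, b = a.zfill(n), b.zfill(n)
|         carry, digits = 0, []
|         for da, db in zip(reversed(a), reversed(b)):
|             s = int(da) + int(db) + carry
|             digits.append(str(s % 10))
|             carry = s // 10
|         if carry:
|             digits.append(str(carry))
|         return "".join(reversed(digits))
|
|     assert long_add("478", "64") == "542"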
| Supply5411 wrote:
| I've been kicking around a similar idea for a while. Why can't we
| have an intuitive interface to the weights of a model, that a
| domain expert can tweak by hand to accelerate training? For
| example, in a vision model, they can increase the "orangeness"
| collection of weights when detecting traffic cones. That way,
| instead of requiring thousands/millions more examples to
| calibrate "orangeness" right, it's accelerated by a human expert.
| The difficulty is obviously having this interface map to the
| collections of weights that mean different things, but is there a
| technical reason this can't be done?
| amilios wrote:
| The technical reason it can't be done (or would be very
| difficult to do) is that weights are typically very
| uninterpretable. There aren't specific clusters of neurons that
| map to one concept or another, everything kind of does
| everything.
| Supply5411 wrote:
| I wonder if an expert can "impose" weights onto a model and
| the model will opt to continue with them when it resumes
| training. For example, in the vision example, the expert may
| not know where "orangeness" currently exists, but if they
| impose their own collection of weight adjustments that
| represent orangeness, will the model continue to use these
| weights as the path of least resistance when continuing to
| optimize? Just spitballing, but if we can't pick out which
| neurons do what, the alternative would seem to be to
| encourage the model to adopt a control interface of neurons.
| astrange wrote:
| That would make it less efficient - since learning is
| compression, a less compressed model will also learn less
| at the same size.
| klysm wrote:
| The attention mechanisms present in transformers don't seem
| easy to map to semantics that humans can understand. There are
| too many parameters involved
| elesiuta wrote:
| > weights of a model, that a domain expert can tweak by hand
|
| This sounds similar to how image recognition was done before
| deep learning [1]
|
| [1] https://www.youtube.com/watch?v=8SF_h3xF3cE&t=1358s
| Supply5411 wrote:
| Great example. Right, the deep learning approach uncovers all
| kinds of hidden features and relationships automatically that
| a team of humans might miss.
|
| I guess I'm thinking about this problem from the perspective
| of these GPT models requiring more training data than a
| normal person can acquire. Currently, it seems you need the
| entire internet's worth of training data (and a lot of money)
| to get something that can communicate reasonably well. But
| most people can communicate reasonably well, so it would be
| cool if that basic communication knowledge could be somehow
| used to accelerate training and minimize the reliance on
| training data.
| mistrial9 wrote:
| > the deep learning approach uncovers all kinds of hidden
| features and relationships automatically that a team of
| humans might miss
|
| Sitting in a lecture by a decent deep learning
| practitioner, I heard two questions from the audience
| (among others). The first question asked "How can we check
| the results using other models, so that computers will
| catch the errors that humans miss?"
|
| The second question was more like "when a model is built
| across a non-trivial input space, the features and classes
| that come out are one set of possibilities, but there are
| many more possibilities. How can we discover more about the
| model that is built, knowing that there are inherent
| epistemological conflicts in any model?"
|
| I also thought it was interesting that the two questioners
| were from large but very different demographic groups, and
| at different stages of learning and practice (the second
| question was from a senior coder).
| EricMausler wrote:
| I am still learning transformers, but I believe part of the
| issue may be that the weights do not necessarily correlate
| to things like "orangeness"
|
| Instead of a weight for each color, you have like 5 to
| 100 weights that represent some arbitrary combination of
| colors. Literally the arbitrariness is defined by the
| dataset and the number of weights allocated.
|
| They may even represent more than just color.
|
| So I am not sure if a weight is actually a "dial" like you
| are describing it, where you can turn up or down different
| qualities. I think the relationship between weights and
| features is relatively chaotic.
|
| Like you may increase orangeness but decrease "cone
| shapedness" or accidentally make it identify deer as trees
| or something, all by just changing 1 value on 1 weight
| jarde wrote:
| The number of layers and weights is really not at a scale we
| can handle updating manually, and even if we could the
| downstream effects of modifying weights are way too hard to
| manage. Say you are updating the model to be better at
| orange, but unless you can monitor all the other colours for
| correctness at the same time, you are probably creating
| issues for other colours without realizing it.
| PaulHoule wrote:
| It's some kind of abstract machine, like a turing machine or the
| machine that parses regexes, isn't it?
| nerdponx wrote:
| It's kind of hard to interpret these things as "automata" in
| the sense that one might usually think of them.
|
| Everything is usually a little fuzzy in a neural network.
| There's rarely anything like an if/else statement, although (as
| in the transformer example) you have some cases of "masking"
| values with 0 or -∞. The output is almost always fuzzy as
| well, being a collection of scores or probabilities. For
| example, a model that distinguishes cat pictures and dog
| pictures might emit a result like "dog:0.95 cat:0.05", and we
| say that it predicted a dog because the dog score is higher
| than the cat score.
|
| In fact, the core of the transformer, the attention mechanism,
| is based on a kind of "soft lookup" operation. In a non-fuzzy
| system, you might want to do something like loop through each
| token in the sequence, check if that token is relevant to the
| current token, and take some action if it's relevant. But in a
| transformer, relevance is not a binary decision. Instead, the
| attention mechanism computes a continuous relevance score
| between each pair of tokens in the sequence, and uses those
| scores to take further action.
|
| But some things are _not_ easily generalized directly from a
| system based on binary decisions. For example, those
| relevance scores are used as weights to compute a weighted
| average over the tokens in the sequence, and thereby obtain
| an "average token" for the current position in the sequence. I
| don't think there's an easy way to interpret this as an
| extension of some process based on branching logic.
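|
| In case a concrete picture helps, here's a minimal NumPy
| sketch of that soft lookup (toy shapes, random weights, no
| claim to match any real model's internals):
|
|     import numpy as np
|
|     def softmax(x):
|         e = np.exp(x - x.max(axis=-1, keepdims=True))
|         return e / e.sum(axis=-1, keepdims=True)
|
|     def attention(Q, K, V):
|         # continuous relevance score for every pair of positions
|         scores = Q @ K.T / np.sqrt(K.shape[-1])
|         weights = softmax(scores)  # each row sums to 1
|         return weights @ V         # weighted average of values
|
|     rng = np.random.default_rng(0)
|     n, d = 5, 8  # 5 tokens, 8-dim embeddings
|     Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
|     out = attention(Q, K, V)  # one blended vector per position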
| enriquto wrote:
| Neural networks _are_ Turing machines. You can make them
| perform any computation by carefully setting up their weights.
| It would be nice to have compilers for them that were not based
| on approximation, though.
| shawntan wrote:
| Would I be able to solve the Travelling Salesman Problem with
| a Transformer with the appropriately assigned weights? That
| would be an achievement. You'd beat some known bounds of the
| complexity of TSP.
| PartiallyTyped wrote:
| > It would be nice to have compilers for them that were not
| based on approximation, though.
|
| Could you elaborate?
| enriquto wrote:
| People typically set the weights of a neural network using
| heuristic approximation algorithms, by looking at a large
| set of example inputs/outputs and trying to find weights
| that perform the needed computation as accurately as
| possible. This approximation process is called _training_.
| But this approximation happens because nobody really knows
| how to set the weights otherwise. It would be nice if we
| had "compilers" for neural networks, where you write an
| algorithm in a programming language, and you get a neural
| network (architecture+weights) that performs the same
| computation.
|
| TFA is a beautiful step in that direction. What I want is
| an automated way to do this, without having to hire vgel
| every time.
| justo-rivera wrote:
| That makes no entropy or cybernetic sense at all. You
| would just get a neural network that outputs the exact
| formula, or algo. Like, if you would just do a sine it
| would be a Taylor series encoded into neurons.
|
| It's like going from computing pi as a constant to
| computing it as a gigantic float.
|
| You lose info.
| PartiallyTyped wrote:
| Why would you do that when it's better to do the
| opposite? Given a model, quantize it and compile it to
| direct code objects that do the same thing much, much
| faster?
|
| The generality of the approach [NNs] implies that they
| are effectively a union of all programs that may be
| represented, and as such there needs to be the capacity
| for that; this capacity is in size, which makes them
| wasteful for exact solutions.
|
| It is fairly trivial to create FFNNs that behave as
| decision trees using just ReLUs if you can encode your
| problem as a continuous problem with a finite set of
| inputs. Then you can very well say that this decision
| tree is, well, a program, and there you have it.
|
| The actual problem is the encoding, which is why NNs are
| so powerful, that is, they learn the encodings themselves
| through grad descent and variants.
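|
| To make the ReLU point concrete, here's a hand-set example
| (my own toy, with made-up numbers): a pair of shifted ReLUs
| forms a near-step function, i.e. a decision-tree-style
| threshold split, with no training involved:
|
|     import numpy as np
|
|     def relu(x):
|         return np.maximum(x, 0.0)
|
|     # "x > 0.5 -> 1, else 0", encoded as a ramp of width 1/k;
|     # larger k makes the step sharper.
|     def step(x, t=0.5, k=1000.0):
|         return relu(k * (x - t)) - relu(k * (x - t) - 1.0)
|
|     for x in [0.2, 0.499, 0.501, 0.9]:
|         print(x, step(x))  # ~0 below the threshold, ~1 above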
| shawntan wrote:
| Something close exists:
|
| RASP https://arxiv.org/abs/2106.06981
|
| python implementation: https://srush.github.io/raspy/
| Legend2440 wrote:
| The point of training is to create computer programs
| through optimization, because there are many problems
| (like understanding language) that we just don't know how
| to write programs to do.
|
| It's not that we don't know how to set the weights -
| neural networks are only designed with weights because it
| makes them easy to optimize.
|
| There is no reason to use them if you plan to write your
| own code for them. You won't be able to do anything that
| you couldn't do in a normal programming language, because
| what makes NNs special _is_ the training process.
| nyrikki wrote:
| A turing complete system doesn't necessarily mean it's
| useful, it just means that it's equivalent with a turing
| machine. The ability to describe any possible algorithm
| is not that powerful in itself.
|
| As an example, algebraic type systems are often TC simply
| because general recursion is allowed.
|
| Feed forward networks are effectively DAGs, and while you
| may be able to express any algorithm using them, they are
| also piecewise linear with respect to inputs.
|
| Statistical learning is powerful in finding and matching
| patterns, but graph rewriting, which is what you're doing
| with initial random weights and training, is not trivial.
|
| More importantly it doesn't make issues like the halting
| problem decidable.
|
| I don't see why the same limits in graph rewriting
| languages, which were explored in the 90s, won't apply to
| using feed forward networks as computation systems, outside
| of the application of nation-state scale computing power.
|
| But I am open to understanding where I am wrong.
| shawntan wrote:
| Short version of this without the caveats: It's not even
| Turing complete.
|
| I review a few papers on the topic here:
| https://blog.wtf.sg/posts/2023-02-03-the-new-xor-problem/
| ComputerGuru wrote:
| The whole point is that we don't know what rules to encode
| and we want the system to derive the weights for itself as
| part of the training process. We have compilers that can do
| deterministic code - that's the normal approach.
| ntonozzi wrote:
| Yes! Check out this paper describing how Linear Transformers
| are secretly Fast Weight Programmers:
| https://arxiv.org/abs/2102.11174.
| thewataccount wrote:
| This is simplified a bit - It's just a "machine" that maps [set
| of inputs] -> [set of probabilities of the next output]
|
| First you define a list of tokens - let's say 24 letters because
| that's easier.
|
| They are a machine that takes an input sequence of tokens, does
| a deterministic series of matrix operations, and outputs a list
| with the probability of every token.
|
| "learning" is just the process of setting some of the numbers
| inside of a matrix(s) used for some of the operations.
|
| Notice that there's only a single "if" statement in their final
| code, and it's for evaluating the result's accuracy. All of the
| "logic" is from the result of these matrix operations.
| taneq wrote:
| It's just any pile of linear algebra that's been in contact
| with the AllSpark, right?
| tostadora1 wrote:
| I always wanted to at least have a shallow understanding of
| Transformers but the paper was way too technical for me.
|
| This really helped me understand how they work! Or at least I
| understood your example, it was very clear. And I also got to
| brush up my matrix stuff from uni lol.
|
| Thanks!
| tayo42 wrote:
| > maybe even feel inspired to make your own model by hand as
| well!
|
| Other than a learning exercise to satisfy your curiosity, what
| are you doing with this? I'm starting to get the feeling that
| anything complex with ML models is unreasonable for an at-home
| blog reader?
| nerdponx wrote:
| It's an _excellent_ learning exercise, not just to satisfy
| curiosity but to develop and deepen understanding.
| onemoresoop wrote:
| Author states in the first paragraph of their blog post:
|
| "I've been wanting to understand transformers and attention
| better for awhile now--I'd read The Illustrated Transformer,
| but still didn't feel like I had an intuitive understanding of
| what the various pieces of attention were doing. What's the
| difference between q and k? And don't even get me started on
| v!"
| tayo42 wrote:
| lol three people said the same thing, i get that. that's why i
| said other than learning and satisfying curiosity...
| taneq wrote:
| I dunno, maybe they actually enjoy hacking on projects like
| this? Weird I know.
| utopcell wrote:
| Darn it Theia: I can't stop playing your hoverator game!
___________________________________________________________________
(page generated 2023-09-22 23:00 UTC)