[HN Gopher] Reverse engineering a neural network's clever solution to binary addition
___________________________________________________________________
Reverse engineering a neural network's clever solution to binary
addition
Author : Ameo
Score : 451 points
Date : 2023-01-16 10:32 UTC (12 hours ago)
(HTM) web link (cprimozic.net)
(TXT) w3m dump (cprimozic.net)
| moyix wrote:
| Related: an independent reverse engineering of a network that
| "grokked" modular addition:
|
| https://www.alignmentforum.org/posts/N6WM6hs7RQMKDhYjB/a-mec...
|
| One interesting thing is that this network similarly does
| addition using a Fourier transform.
| AstralStorm wrote:
| When you expect your network to learn a binary adder and it
| learns FFT addition instead. :)
|
| (For an ANN, an FFT is more natural, as it's a projection
| algorithm.)
| jccalhoun wrote:
| These stories remind me of a story from Discover Magazine
| https://www.discovermagazine.com/technology/evolving-a-consc... A
| researcher was using a process to "evolve" an FPGA, and the
| result was a circuit that was super efficient but worked in
| unexpected ways: part of the circuit seemed unconnected to the
| rest, yet removing it stopped the whole thing from working, and
| the circuit only worked at a specific temperature.
| teaearlgraycold wrote:
| Overfitting IRL
| gooseyard wrote:
| oh weird, I first heard about a story like this in
| https://www.damninteresting.com/on-the-origin-of-circuits/, I
| think perhaps they're both about the same lab.
| jccalhoun wrote:
| Same guy: Adrian Thompson
| davesque wrote:
| I can't believe someone dropped a link to this story. I
| remember reading this and feeling like it broadened my sense of
| what evolutionary processes can produce.
| googlryas wrote:
| Reading that article made me start playing with evolutionary
| algorithms and start to think they were the future, as
| opposed to neural nets. Oops!
| weakfortress wrote:
| It's interesting to see these unconventional solutions. Genetic
| algorithms evolving antenna design produce similar illogical
| but very efficient designs. Humans are drawn to aesthetics;
| robots have no such limitation.
| qikInNdOutReply wrote:
| Ah, the good old radio component you can program into FPGAs.
| pjc50 wrote:
| Very easy to design a radio component into electronics, much
| harder to design it out.
| nomel wrote:
| I mistakenly made two AM radios in my electronics
| classes, from trying to make an amplifier and a PLL. The FM
| radio was mistakenly made while trying to make an AM radio.
| :-|
| nkrisc wrote:
| I've had all kinds of cheap electronics that came with
| unadvertised radio features for free.
| TeMPOraL wrote:
| It's sad that those pesky regulators from the FCC make it
| so hard to get devices with unintended radio
| functionality these days!
| gumby wrote:
| I know, if people really cared about devices rejecting
| interference the free market would sort that out,
| amirite?
| WalterBright wrote:
| The free market doesn't allow you to pour sewage onto
| your neighbor's property, either.
|
| Think "property rights" are required for a free market.
| gumby wrote:
| OMG you mean the libertarians are mistaken? Sacre bleu!
| WalterBright wrote:
| I don't know what you heard, but you're mistaken about
| what free markets are.
| WalterBright wrote:
| My fillings tune in Radio Moscow. Keeps me awake at
| night.
| kordlessagain wrote:
| Yeah, IIRC it grew an antenna that was utilizing the clock in
| the computer it was sitting next to.
| the8472 wrote:
| Primary source (or at least one of several published versions):
| https://cse-robotics.engr.tamu.edu/dshell/cs625/Thompson96Ev...
| rehevkor5 wrote:
| Not too surprising. Each layer can multiply each input by a set
| of weights, and add them up. That's all you need, in order to
| convert from base 2 to base 10. In base 2, we calculate the value
| of a set of digits by multiplying each by 2^0, 2^1, ..., 2^n.
| Each weight is twice the previous. The activation function is
| just making it easier to center those factors on zero to reduce
| the magnitude of the regularization term. I am guessing you could
| probably get the job done with just two neurons in the first
| layer, which output the real value of each input, one neuron in
| the second layer to do the addition (weights 1, 1), and 8 neurons
| in the final layer to convert back to binary digits.
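| For illustration, a minimal Python sketch of that hypothetical
| 2-1-8 setup with hand-picked weights rather than trained ones
| (the bit-extraction step at the end is plain arithmetic
| standing in for the 8 output neurons):
|
|     import numpy as np
|
|     N = 8
|     powers = 2.0 ** np.arange(N)  # each weight twice the previous
|
|     def add_via_analog(a_bits, b_bits):
|         # layer 1: two neurons turn each operand's bits into a value
|         a_val = powers @ np.asarray(a_bits)
|         b_val = powers @ np.asarray(b_bits)
|         # layer 2: one neuron with weights (1, 1) does the addition
|         s = a_val + b_val
|         # layer 3: convert the analog sum back to binary digits
|         return [(int(s) >> i) & 1 for i in range(N)]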
| tonmoy wrote:
| While your analysis sounds good, I don't think you mean base
| 10; you probably mean a "single float"?
| wwarner wrote:
| So important. Here's a prediction. Work like this on model
| explainability will continue and eventually will reveal patterns
| in the input that reliably produce features in the resulting
| models. That knowledge will tell us about _nature_, in addition
| to ML.
| dejongh wrote:
| Interesting idea to reverse engineer the network. Are there other
| sources that have done this?
| SonOfLilit wrote:
| The field of NN explainability tries, but usually there are
| only handwavy things to be done (because there are too many
| weights). This project involved intentionally building very,
| very small networks that can be understood completely.
| nerdponx wrote:
| Maybe not exactly what you had in mind, but there is a lot of
| literature in general on trying to extract interpretations from
| neural network models, and to a lesser extent from other
| complicated nonlinear models like gradient boosted trees.
|
| Somewhat famously, you can plot the weight activations on a
| heatmap from a CNN for image processing and obtain a visual
| representation of the "filter" that the model has learned,
| which the model (conceptually) slides across the image until it
| matches something. For example:
| https://towardsdatascience.com/convolutional-neural-network-...
|
| Many techniques don't look directly at the numbers in the
| model. Instead, they construct inputs to the model that attempt
| to trace out its behavior under various constraints. Examples
| include Partial Dependence, LIME, and SHAP.
|
| Also, those "deep dream" images that were popular a couple
| years ago are generated by running parts of a deep NN model
| without running the whole thing.
| zozbot234 wrote:
| Yes, more generally this entails looking at the numbers
| learned by a model not as singular "weights" (which doesn't
| scale as the model gets larger) but more as a way of
| approximating a non-parametric representation (i.e. something
| function-like or perhaps even more general). There's a well-
| established field of variously "explainable" semi-parametric
| and non-parametric modeling and statistics, which aims to
| seamlessly scale to large volumes of data and modeling
| complexity much like NNs do.
| hoseja wrote:
| Reminded me of the 0xFFFF0000 0xFF00FF00 0xF0F0F0F0 0xCCCCCCCC
| 0xAAAAAAAA trick, the use of which I don't currently recall...
|
| edit: several, but probably not very relevant
| https://graphics.stanford.edu/~seander/bithacks.html
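| For what it's worth, one use of exactly those masks on that
| page (if I'm remembering it right) is finding the log base 2
| of a value known to be a power of 2 - each mask reads off one
| bit of the answer. Roughly, in Python:
|
|     MASKS = [0xAAAAAAAA, 0xCCCCCCCC, 0xF0F0F0F0,
|              0xFF00FF00, 0xFFFF0000]
|
|     def log2_of_power_of_two(v):
|         r = 0
|         for i, mask in enumerate(MASKS):
|             if v & mask:
|                 r |= 1 << i
|         return r  # e.g. log2_of_power_of_two(16) == 4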
| MontyCarloHall wrote:
| A couple questions:
|
| 1. How much of this outcome is due to the unusual (pseudo)
| periodic activation function? Seems like a lot of the DAC-like
| behavior is coming from the periodicity of the first layer's
| output, which seems to be due to the unique activation function.
|
| 2. Would the behavior of the network change if the binary strings
| were encoded differently? The author encodes them as 1D arrays
| with 1 corresponding to 1 and 0 corresponding to -1, which is an
| unusual way of doing things. What if the author encoded them as
| literal binary arrays (i.e. 1->1, 0->0)? What about one-hot
| arrays (i.e. 2D arrays with 1->[1, 0] and 0->[0, 1]), which is
| the most common way to encode categorical data?
| evrimoztamur wrote:
| For 1., the author used Ameo (the weird activation function) in
| the first layer and tanh for others, later on notes:
|
| "While playing around with this setup, I tried re-training the
| network with the activation function for the first layer
| replaced with sin(x) and it ends up working pretty much the
| same way. Interestingly, the weights learned in that case are
| fractions of π rather than 1."
|
| By the looks of it, any activation function that covers both a
| positive and a negative range should work. Haven't tested that
| myself. The 1 vs. π is likely due to the peaks of the
| functions, Ameo at 1 and sine at π/2.
|
| Regardless, it's not specific to Ameo.
| Ameo wrote:
| I think that the activation function definitely is important
| for this particular case, but it's not actually periodic; it
| saturates (with a configurable amount of leakyness) on both
| ends. The periodic behavior happens due to patterns in
| individual bits as you count up.
|
| As for the encoding, I think it's a pretty normal way to encode
| binary inputs like this. Having the values be -1 and 1 is
| pretty common since it makes the data centered at 0 rather than
| 0.5 which can lead to better training results.
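| Concretely, the encoding in question is something like this
| (a sketch; the function name and shape are my own):
|
|     import numpy as np
|
|     def encode_bits(n, n_bits=8):
|         # map each bit to +1/-1 instead of 1/0, centering at zero
|         return np.array([1.0 if (n >> i) & 1 else -1.0
|                          for i in range(n_bits)])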
| KhoomeiK wrote:
| A more theoretical but increasingly popular approach to
| identifying behavior like this is interchange intervention:
| https://ai.stanford.edu/blog/causal-abstraction/
| nerdponx wrote:
| I liked this article a lot, but
|
| > One thought that occurred to me after this investigation was
| the premise that the immense bleeding-edge models of today with
| billions of parameters might be able to be built using orders of
| magnitude fewer network resources by using more efficient or
| custom-designed architectures.
|
| Transformer units themselves are already specialized things.
| Wikipedia says that GPT-3 is a standard transformer network, so
| I'm sure there is additional room for specialization. But that's
| not a new idea either, and it's often the case that after a
| model is released, smaller versions tend to follow.
| [deleted]
| sebzim4500 wrote:
| Transformers are specialized to run on our hardware, whereas
| the article is suggesting architectures which are specialized
| for a specific task.
| nerdponx wrote:
| Are transformers not already very specialized to the task of
| learning from sequences of word vectors? I'm sure there is
| more that can be done with them other than making the input
| sequences really long, but my point was that LLMs are hardly
| lacking in design specialized to their purpose.
| sebzim4500 wrote:
| > Are transformers not already very specialized to the task
| of learning from sequences of word vectors?
|
| No, you can use transformers for vision, image generation,
| audio generation/recognition, etc.
|
| They are 'specialized' in that they are for working with
| sequences of data, but almost everything can be nicely
| encoded as a sequence. In order to input images, for
| example, you typically split the image into blocks and then
| use a CNN to produce a token for each block. Then you
| concatenate the tokens and feed them into a transformer.
| nerdponx wrote:
| That's fair, and I didn't realize people were using them
| for images that way. I would still argue that they are at
| least somewhat more specialized than plain fully-
| connected layers, much like convolutional layers.
|
| It is definitely interesting that we can do so much with
| a relatively small number of generic "primitive"
| components in these big models. But I suppose that's part
| of the point.
| kens wrote:
| The trick of performing binary addition by using analog voltages
| was used in the IAS family of computers (1952), designed by John
| von Neumann. It implemented a full adder by converting two input
| bits and a carry-in bit into voltages that were summed. Vacuum
| tubes converted the analog voltage back into bits by using a
| threshold to generate the carry-out and more complex thresholds
| to generate the sum-out bit.
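| A sketch of that scheme in Python rather than tubes (the
| thresholds are the interesting part; the real circuit worked
| on voltages, not integers):
|
|     def analog_full_adder(a, b, carry_in):
|         level = a + b + carry_in  # the summed "voltage", 0..3
|         carry_out = 1 if level >= 2 else 0     # simple threshold
|         sum_out = 1 if level in (1, 3) else 0  # the complex one
|         return sum_out, carry_out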
| elcritch wrote:
| Nifty, makes one wonder if logarithmic or sigmoid functions for
| ML could be done using this method. Especially as we approach
| the node size limit, perhaps dealing with fuzzy analog will
| become more valuable.
| djmips wrote:
| There are a few startups making analog ML compute. Mythic &
| Aspinity for example
|
| https://www.eetimes.com/aspinity-puts-neural-networks-back-t...
| shagie wrote:
| Veritasium has a bit of a video on Mythic - Future
| Computers Will Be Radically Different (Analog Computing) -
| https://youtu.be/GVsUOuSjvcg
| greenbit wrote:
| That's awesome - the network's solution is essentially to convert
| the input to analog, _perform the actual addition in analog_ ,
| and then convert that back to digital. And the first two parts of
| that all happened in the input weights, no less.
| 082349872349872 wrote:
| Sometimes when doing arithmetic with _sed_ I use the unary
| representation, but usually it's better to use lookup tables:
| https://en.wikipedia.org/wiki/IBM_1620#Transferred_to_San_Jo...
| agapron1 wrote:
| An 8-bit adder is too small and allows such a 'hack'. He should
| try training a 32- or 64-bit adder, reduce the weight precision
| to bfloat16, and introduce dropout regularization (or another
| kind of noise) to get more 'interesting' results.
|
| Addendum: another interesting variation to try is a small
| transformer network, feeding the bits in sequentially as
| symbols. This kind of architecture could handle bignum-sized
| integers.
| Jensson wrote:
| Yeah, the current solution is similar to overfitting; this
| won't generalize to harder math where the operation doesn't
| correspond to the activation function of the network.
| rileymat2 wrote:
| Is this true? If it can add, it can probably subtract? If
| it can add and subtract it may be able to multiply
| (repeated addition) and divide? If it can multiply it can
| do exponents?
|
| I don't know, but I cannot jump to your conclusion without
| much more domain knowledge.
| eru wrote:
| > If it can add and subtract it may be able to multiply
| (repeated addition) and divide? If it can multiply it can
| do exponents?
|
| It was just a simple feed-forward network. It can't do
| arbitrary amounts of repeated addition (nor repeat any
| other operation arbitrarily often).
| sebzim4500 wrote:
| None of these require arbitrary amounts of repeated
| addition though. E.g. multiplying two 8 bit numbers
| requires at most 7 additions.
| eru wrote:
| Are you doing the equivalent of repeated squaring here?
|
| Otherwise, you'd need up to 255 (or so) additions to
| multiply two 8 bit numbers, I think?
| adwn wrote:
| An 8x8 bit multiplication only requires 7 additions,
| either in parallel or sequentially. Remember long-form
| multiplication? [1] It's the same principle. Of course,
| high-speed digital multiplication circuits use a much
| more optimized, much more complex implementation.
|
| [1] https://en.wikipedia.org/wiki/Multiplication_algorithm#Examp...
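| The same principle in Python - at most 8 partial products, so
| at most 7 additions for 8-bit operands:
|
|     def mul8(a, b):
|         acc = 0
|         for i in range(8):
|             if (b >> i) & 1:
|                 acc += a << i  # add the shifted partial product
|         return acc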
| rileymat2 wrote:
| Yeah I did not mean a loop, repeated in the network.
| im3w1l wrote:
| Repeated additions may be hard, but computing a*b as
| exp(log(a)+log(b)) should be learnable.
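| E.g. (only for positive operands, and only up to
| floating-point error):
|
|     import math
|     a, b = 12.0, 5.0
|     print(math.exp(math.log(a) + math.log(b)))  # ~60.0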
| torginus wrote:
| A 32-bit adder is just 4 8-bit adders with carry connected. I
| don't see why it'd be significantly more difficult.
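| As a sketch, the naive ripple composition (the replies below
| note why real hardware avoids doing it this way):
|
|     def add32(a, b):
|         # chain four 8-bit adds, rippling the carry between chunks
|         result, carry = 0, 0
|         for i in range(4):
|             a8 = (a >> (8 * i)) & 0xFF
|             b8 = (b >> (8 * i)) & 0xFF
|             chunk = a8 + b8 + carry
|             result |= (chunk & 0xFF) << (8 * i)
|             carry = chunk >> 8
|         return result  # wraps at 32 bits; final carry dropped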
| dehrmann wrote:
| In my EE class that discussed adders, we learned this
| wasn't exactly true. A lot of adders compute the carry bit
| separately because it takes time to propagate if you
| actually capture it as output.
| Cyph0n wrote:
| For a naive adder, sure. Actual adder circuits use carry
| lookahead. The basic idea is to "parallelize" carry
| computation as much as possible. As with anything else,
| there is a cost: area and power.
|
| Edit: This wiki page has a good explanation of the concept:
| https://en.wikipedia.org/wiki/Carry-lookahead_adder
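| A sketch of the generate/propagate recurrence (computed
| sequentially here; the hardware trick is flattening this
| recurrence into wide parallel gates):
|
|     def cla_add(a, b, width=8):
|         c, out = 0, 0
|         for i in range(width):
|             ai, bi = (a >> i) & 1, (b >> i) & 1
|             g, p = ai & bi, ai ^ bi  # generate, propagate
|             out |= (p ^ c) << i      # sum bit
|             c = g | (p & c)          # next carry
|         return out, c                # sum and carry-out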
| GTP wrote:
| Yes, but the adder that the NN came up with has no carry.
| stabbles wrote:
| It's better to say the NN "converged to" something than
| "came up with" something.
| canadianfella wrote:
| [dead]
| amelius wrote:
| Carry is just the 9th bit?
| GTP wrote:
| Not really, to compose adders you need specific lines for
| carry.
| [deleted]
| sebzim4500 wrote:
| It's not harder computationally but finding that solution
| with SGD might be hard. I'd be really interested in the
| results.
| ape4 wrote:
| It's converted the input into its native language, in a way
| waynecochran wrote:
| Does this relate to the addition theorem of Fourier transforms?
| The weights of his NN not only looked like sine waves
| but were also crudely orthogonal.
|
| Maybe in the future we will think of the word "digital" as
| archaic and "analog" will be the moniker for advanced tech.
| amelius wrote:
| So that's how savants do it ...
| teknopaul wrote:
| There is one savant who can do crazy math but is otherwise
| normal, after an epileptic fit.
|
| His system is base 10000: he can add any two numbers below
| 5000 and come up with a single-digit response in one loop,
| then convert to base 10 for the rest of us. Each digit in his
| base-10000 system has a different visual representation, like
| we have 0-9.
|
| It's integer accurate, so I don't think it uses sine wave
| approximations. The guy can remember pi for 24 hours.
|
| I don't think the boffins or he himself got close to
| understanding what is going on in his head.
| [deleted]
| [deleted]
| monkpit wrote:
| I was interested in reading more about this, but the only
| google result is your comment. Could you remember a name or
| any other details that might help find the story?
| WalterBright wrote:
| In a similar vein, in the 1980s a "superoptimizer" was built for
| optimizing snippets of assembler by doing an exhaustive search of
| all combinations of instructions. It found several delightful and
| unexpected optimizations, which quickly became incorporated into
| compiler code generators.
| convolvatron wrote:
| you can mention Massalin by name. she's pretty brilliant, idk
| where she is now
| WalterBright wrote:
| I didn't recall who wrote the paper; thanks for the tip, which
| enabled me to find it:
|
| H. Massalin, "Superoptimizer: A Look at the Smallest
| Program," ACM SIGARCH Comput. Archit. News, pp. 122-126,
| 1987.
|
| https://web.stanford.edu/class/cs343/resources/superoptimize...
|
| The H is for "Henry".
| mgliwka wrote:
| It's Alexia Massalin now:
| https://en.m.wikipedia.org/wiki/Alexia_Massalin
| puttycat wrote:
| Recent work on NNs where applying Kolmogorov complexity
| optimization leads to a network that solves binary addition with
| a carry-over counter and 100% provable accuracy: (see Results
| section)
|
| https://arxiv.org/abs/2111.00600
| aargh_aargh wrote:
| The essay linked from the article is interesting:
| http://www.incompleteideas.net/IncIdeas/BitterLesson.html
| sva_ wrote:
| I think there is no doubt that there must be more efficient
| model architectures out there; take for example the sample
| efficiency of GPT-3:
|
| > If you think about what a human, a human probably in a
| human's lifetime, 70 years, processes probably about a half a
| billion words, maybe a billion, let's say a billion. So when
| you think about it, GPT-3 has been trained on 57 billion times
| the number of words that a human in his or her lifetime will
| ever perceive.[0]
|
| 0. https://hai.stanford.edu/news/gpt-3-intelligent-directors-co...
| IanCal wrote:
| I cannot wrap my head around that. I listened to the audio to
| check it wasn't a transcription error and don't think it is.
|
| He is claiming GPT3 was trained on 57 billion _billion_
| words. The training dataset is something like 500B tokens and
| not all of that is used (Common Crawl is processed less than
| once), and I'm damn near certain that it wasn't trained for
| a hundred million epochs. Their original paper says the
| largest model was trained on 300B tokens [0]
|
| Assuming a token is a word, as we're going for orders of
| magnitude, you're actually looking at a few hundred
| times more text. The point kind of stands: it's more, but not
| billions of times.
|
| I wouldn't be surprised if I'm wrong here because they seem
| to be an expert but this didn't pass the sniff test and
| looking into it doesn't support what they're saying to me.
|
| [0] https://arxiv.org/pdf/2005.14165.pdf appendix D
| sva_ wrote:
| I didn't even catch that. Surely they meant 57 billion
| words.
|
| Some words are broken up into several tokens, which might
| explain the 300B tokens.
| IanCal wrote:
| It's at about 7 minutes in the video, they really do say
| it several times in a few ways. He starts by saying it's
| trained on 570 billion megabytes, which is probably where
| this confusion starts. Looking again at the paper, Common
| Crawl after filtering is 570GB or 570 billion _bytes_. So
| he makes two main mistakes - one is straight up
| multiplying by another million, then by assuming one byte
| is equivalent to one word. Then a bit more because less
| than half of it is used. That's probably taking it out
| by a factor of about ten million or more.
|
| 300B is then the "training budget" in a sense: not every
| dataset is used in its entirety, and some are processed more
| than once, but each of the GPT-3 sizes was trained on
| 300B tokens.
| lostmsu wrote:
| Humans are not trained on words.
| lgas wrote:
| What do you mean? That we are not trained ONLY on words? Or
| that the training we receive on words is not the same as
| the training of a NN? Or something else?
| gnfargbl wrote:
| One thing that the essay doesn't consider is the importance of
| _efficiency_ in computation. Efficiency is important because in
| practice it is often the factor which most limits the
| scalability of a computational system.
|
| The human brain only consumes around 20W [1], but for numerical
| calculations it is massively outclassed by an ARM chip
| consuming a tenth of that. Conversely, digital models of neural
| networks need a huge energy budget to get anywhere close to a
| brain; this estimate [2] puts training GPT-3 at about a TWh,
| which is about six million years of power for a single brain.
|
| [1] https://www.pnas.org/doi/10.1073/pnas.2107022118
|
| [2] https://www.numenta.com/blog/2022/05/24/ai-is-harming-our-pl...
| swyx wrote:
| > a TWh, which is about six million years of power
|
| i'm not a physics guy but wanted to check this - a Watt is a
| per second measure, and a Terawatt is 1 trillion watts, so 1
| TWh is 50 billion seconds of 20 Watts, which is 1585 years of
| power for a single brain, not 6 million.
|
| i'm sure i got this wrong as i'm not a physics guy but where
| did i go wrong here?
|
| a more neutral article (that doesn't have a clear "AI is
| harming our planet" agenda) estimates closer to 1404MWh to
| train GPT3:
| https://blog.scaleway.com/doing-ai-without-breaking-the-bank...
| i dont know either way but i'd like as good an estimate as
| possible since this seems an important number.
| belval wrote:
| A watt is not a per-second measure; Wh is the energy
| measurement corresponding to W, the power. TWh = 10^12 Wh,
| which means a trillion watts for 1 hour.
|
| 10^12 Wh / 20 W (power of brain) / 24 (hours in a day) / 365
| (days in a year) ≈ 5,707,762 years.
| pfdietz wrote:
| One watt is one joule per second. What exactly do you
| mean by "a per-second measure"?
| belval wrote:
| The seconds are not a denominator; OP was doing TWh / W
| as if W = (1/3600) Wh, which is not the case.
| primis wrote:
| Watts are actually a time-independent measurement; note
| that the TWh has "hour" affixed to the end. This is 1
| terawatt sustained over the course of one hour, not one
| second. Your numbers are off by a factor of 3600.
|
| 1TWh / 20 Watt brain = 50,000,000,000 (50 Billion) Hours.
|
| 50 Billion Hours / (24h * 365.25) = 5,703,855.8 Years
| gnfargbl wrote:
| Thanks for explaining the calculation! There is a huge
| error though, which is that I mis-read the units in the
| second link I posted. The actual energy estimate for GPT-3
| is more like 1 GWh (not 1 TWh), so about 6,000 years and
| not 6 million...!
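| For anyone checking the corrected arithmetic:
|
|     energy_wh = 1e9                # ~1 GWh, per the correction
|     brain_w = 20.0
|     hours = energy_wh / brain_w    # 5e7 hours
|     years = hours / (24 * 365.25)  # ~5,700 years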
| swyx wrote:
| thank you both.. was googling this and found some
| stackexchange type answers and still got confused.
| apparently it is quite common to see the "per time" part
| of the definition of a Watt and get it mixed up with
| Watt-hours. i think i got it now, thank you
| jeffbee wrote:
| If you're going to reinterpret watts on one side you have
| to do it on both sides. So, by your thinking, which is
| not wrong, 1h*1TJ/s vs 20J/s. As you can see, you can
| drop J/s from both sides and just get 1/20th of a
| trillion hours.
| TeMPOraL wrote:
| It is. The main culprit, arguably, is the ubiquitous use
| of kilowatt-hours as a unit of energy, particularly for
| electricity in the context of the power company billing
| you for it - kilowatt-hours are what you see on the power
| meter and on your power bill.
| gnfargbl wrote:
| I can't edit my post any more, but I'm a moron: the article
| says about a GWh, not a TWh. So the calculation is out by a
| factor of 1000.
| [deleted]
| dsign wrote:
| Thanks for those numbers! They are frankly scary.
|
| Six million years of power for a single brain is one year of
| power for six million brains. The knowledge that ChatGPT
| contains is far greater than the knowledge a random sample of
| six million people have produced in a year.
|
| With that said, I think we should apply ML and the results of
| the bitter lesson to things which are more intractable than
| searching the web or playing Go. Have you ever talked to a
| Systems Biologist? Ask them about how anything works, and
| they'll start with "oh, it's so complicated. You have X, and
| then Y, and then Z, and nobody knows about W. And then how
| they work together? Madness!"
| [deleted]
| TeMPOraL wrote:
| > _Six millions years of power for a single brain is one
| year of power for six million brains. The knowledge that
| ChatGPT contains is far higher than the knowledge a random
| sample of six million people have produced in a year._
|
| On the other hand, I don't think it would be larger than
| that of six million people selected more carefully (though
| I suppose you'd have to include the costs of the _selection
| process_ in the tally). I also imagine it wouldn't be
| larger than that of six thousand people specifically raised
| and taught in a coordinated fashion to fulfill this role.
|
| On the _other_ other hand, a human brain does a lot more
| than learning language, ideas and conversion between one
| and the other. It also, simultaneously, learns a lot of
| video and audio processing, not to mention smell,
| proprioception, touch (including pain and temperature),
| and... a bunch of other stuff (I was actually surprised by
| the size of the "human" part of the Wikipedia infobox
| here:
| https://en.wikipedia.org/wiki/Template:Sensation_and_percept...).
| So it's probably hard to compare to
| specialized NN models until we learn to better classify and
| isolate how biological brains learn and encode all the
| various things they do.
| sweezyjeezy wrote:
| In the years since this came out I have shifted my opinion from
| 100% agreement to kind of the opposite in a lot of cases, a
| bitter lesson from using AI to solve complex end-to-end tasks.
| If you want to build a system that will get the best score on a
| test set - he's right, get all the data you possibly can, try
| to make your model as e2e as possible. This often has the
| advantage of being the easiest approach.
|
| The problem is: your input dataset definitely has biases that
| you don't know about, and the model is going to learn spurious
| things that you don't want it to. It can make some indefensible
| decisions with high confidence, and you may not know why. In
| some domains your model may be basically useless, and you won't
| know this until it happens. This often isn't good enough in
| industry.
|
| To stop batshit things coming out of your system, you may have
| to do the opposite - use domain knowledge to break down the
| problem the model is trying to solve into steps you can reason
| about. Use this to improve your data. Use this to stop the
| system from doing anything you know makes no sense. This is
| really hard and time consuming, but IMO complex e2e models are
| something to be wary of.
| version_five wrote:
| Yes I just read it too. It appears to have been discussed here
| multiple times:
|
| https://hn.algolia.com/?q=bitter+lesson
|
| I'd say it has aged well and there are probably lots of new
| takes on it based on some of the achievements of the last 6
| months.
|
| At the same time, I don't agree with it. To simplify, the
| opposite of "don't over-optimize" isn't "it's too complicated
| to understand so never try". It is thought-provoking, though.
| agapon wrote:
| Very interesting. But I missed how the network handled the
| overflow.
|
| Spotted one small typo: digital to audio converter should be
| digital to analog.
| Ameo wrote:
| Oh nice catch on the typo - thanks!!
| sebzim4500 wrote:
| It converted the inputs to analog so there isn't really a
| notion of 'overflow'.
| krisoft wrote:
| Putting aside the idea that "analog" doesn't have overflow
| (any physical implementation of that analog signal would have
| limits in the real world - if nothing else, then how much
| current or voltage your PSU can supply, or how much current
| can go through your wire).
|
| But that is not even the real issue. The real issue is that it
| is not analog, it is just "analog" with quotes. The neural
| network is executed by a digital computer, on digital
| hardware. So that "analog" is just the underlying float type
| of the CPU or GPU executing the network. That is still
| very much digital. Just on a different abstraction level.
| allisdust wrote:
| Are there any true analog adder circuits? If so, are they
| also less error-prone, like digital ones?
| krisoft wrote:
| > Are there any true analog adder circuits?
|
| Sure. Here is a schematic with an op-amp:
| https://www.tutorialspoint.com/linear_integrated_circuits_ap...
| ilyt wrote:
| Of course it is; there is a whole business around overflowing
| analog signals in the right way in analog guitar effects
| pedals.
| SonOfLilit wrote:
| Do you know how DAC/ADC circuits work? Build one that can
| handle one more bit of input than you have and you have
| overflow handling.
| YesThatTom2 wrote:
| Is this going to change how 8-bit adders are implemented in
| chips?
| GTP wrote:
| No, there was a time when we had analog calculators, but there
| are good reasons why digital electronics won in the computing
| space.
| Lisa_Novak wrote:
| [dead]
| juunge wrote:
| Awesome reverse engineering project. Next up: reverse engineer
| a NN's solution to the Fourier transform!
| mananaysiempre wrote:
| The Fourier transform is also linear, so the same solution
| should work. No clue if an NN would find it though.
| perihelions wrote:
| What does an analog implementation of a Fourier transform look
| like? It sounds interesting!
| mananaysiempre wrote:
| Depends on what kind of "analog" you want.
|
| In an NN context, given that you already have "transform
| with a matrix" as a primitive, probably something very much
| like sticking a https://en.wikipedia.org/wiki/DFT_matrix
| somewhere. (You are already extremely familiar with the
| 2-input DFT, for example: it's the (x, y) -> (x+y, x-y)
| map.)
|
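| As a sketch, using numpy's FFT convention:
|
|     import numpy as np
|
|     N = 8
|     k = np.arange(N)
|     F = np.exp(-2j * np.pi * np.outer(k, k) / N)  # DFT matrix
|     x = np.random.rand(N)
|     assert np.allclose(F @ x, np.fft.fft(x))
|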
| If you want a physical implementation of a Fourier
| transform, it gets a little more fun. A sibling comment
| already mentioned one possibility. Another is that far-
| field (i.e. long-distance; "Fraunhofer") diffraction of
| coherent light on a semi-transparent planar screen gives
| you the Fourier transform of the transmissivity (i.e.
| transparency) of that screen[1]. That's extremely neat and
| covers all textbook examples of diffraction (e.g. a finite-
| width slit gives a sinc for the usual reasons), but
| probably impractical to mention in an introductory course
| because the derivation is to get a gnarly general formula
| then apply the far-field approximation to it.
|
| A related application is any time the "reciprocal lattice"
| is mentioned in solid-state physics; e.g. in X-ray
| crystallography, what you see on the CRT screen in the
| simplest case once the X-rays have passed through the
| sample is the (continuous) Fourier transform of (a bunch of
| Dirac deltas stuck at each center of) its crystal
| lattice[2], and that's because it's basically the same
| thing as the Fraunhofer diffraction in the previous
| paragraph.
|
| Of course, the mammalian inner ear is also a spectral
| analyzer[3].
|
| [1] https://en.wikipedia.org/wiki/Fourier_optics#The_far_field_a...
|
| [2] https://en.wikipedia.org/wiki/Laue_equations
|
| [3] https://en.wikipedia.org/wiki/Basilar_membrane#Frequency_dis...
| bradrn wrote:
| > Another is that far-field (i.e. long-distance;
| "Fraunhofer") diffraction of coherent light on a semi-
| transparent planar screen gives you the Fourier transform
| of the transmissivity (i.e. transparency) of that screen
|
| Ooh, yes, I'd forgotten that! And I've actually done this
| experiment myself -- it works impressively well when you
| set it up right. I even recall being able to create
| filters (low-pass, high-pass etc.) simply by blocking the
| appropriate part of the light beam and reconstituting the
| final image using another lens. Should have mentioned it
| in my comment...
| bradrn wrote:
| It looks like Albert Michelson's Harmonic Analyzer: see
| https://engineerguy.com/fourier/ (and don't miss the
| accompanying video series! It's pretty cool.)
| ilyt wrote:
| A prism and a CCD sensor. Probably a bit more fuckery to get
| the phase info out of that tho.
| allisdust wrote:
| Totally off topic: any pointers to tutorials/articles that
| teach implementation of common neural network models in
| JavaScript/Rust (from scratch)?
| 2bitencryption wrote:
| > One thought that occurred to me after this investigation was
| the premise that the immense bleeding-edge models of today with
| billions of parameters might be able to be built using orders of
| magnitude fewer network resources by using more efficient or
| custom-designed architectures.
|
| I'm pretty convinced that something equivalent to GPT could run
| on consumer hardware today, and the only reason it doesn't is
| because OpenAI has a vested interest in selling it as a service.
|
| It's the same as Dall-E and Stable Diffusion - Dall-E makes no
| attempt to run on consumer hardware because it benefits OpenAI to
| make it so large that you must rely on someone with huge
| resources (i.e. them) to use it. Then some new research shows
| that effectively the same thing can be done on a consumer GPU.
|
| I'm aware that there's plenty of other GPT-like models available
| on Huggingface, but (to my knowledge) there is nothing that
| reaches the same quality that can run on consumer hardware -
| _yet_.
| dagss wrote:
| Impressive. Am I right that what is really going on here is that
| the network is implementing the "+" operator by actually having
| the CPU of the host carry out the "+" when executing the
| network?
|
| I.e., the network converts the binary input and output to
| floating point, and then it is the CPU of the host the network is
| running on that _really_ does the addition in floating point.
|
| So usually one does a bunch of FLOPs to get "emergent"
| behaviour that isn't doing arithmetic. But in this case, the
| network instead does a bunch of transforms in and out so that
| at a critical point, the addition executed by the network
| runner is used for exactly its original purpose: addition.
|
| And I guess saying it is "analog" is a good analogy for this.
| TeMPOraL wrote:
| I'd say it's more that the network "discovered" that addition
| is something its own structure does "naturally", and reduced
| the problem into decoding binary input, and encoding binary
| output. In particular, I don't think it's specifically about
| _the CPU_ having an addition operator.
|
| My intuition is as follows: if I were to train this network
| with pencil, paper and a slide rule, I'd expect the same
| result. Addition (or maybe rather _integration_) is embedded
| in the abstract structure of a neural network as a computation
| artifact.
|
| Sure, specifics of the substrate may "leak through" - e.g. in
| the pen-and-paper case, were I to round everything to the
| first decimal place, or in the computer case, were the network
| implemented with 4-bit floats, I'd expect it not to converge
| because of the loss of precision range (or maybe to figure out
| the logic gate solution). But if the substrate can execute the
| mathematical model of a neural network to sufficient precision,
| I'd expect the same result to occur regardless of whether the
| network is run on paper, on a CPU, a fluid-based analog
| computer, or a beam of light and a clever arrangement of semi-
| transparent plastic plates.
| greenbit wrote:
| You know how the basic node does a weighted sum of inputs and
| feeds the result to a nonlinear function for output? This
| network exploited the addition operation implicit in that
| first part, the weighted sum.
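| I.e., for a node computing f(w . x + b), the w . x + b part is
| already an adder. A sketch:
|
|     import numpy as np
|     w = np.array([1.0, 1.0])  # pass both operands straight through
|     x = np.array([3.0, 5.0])  # two decoded operand values
|     pre_activation = w @ x    # the weighted sum *is* the add: 8.0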
| xyzzy_plugh wrote:
| Fascinating! Is it really exploitation? It seems to me that
| the node _should_ optimize to the minimal inputs (weights) in
| the happy path. I.e. it's optimal for all non-operand inputs
| to be weighted to zero.
|
| I'm curious how reproducible this is. I'm also curious how
| networks form abstractions; if we can verify those
| abstractions in isolation (or have other networks verify
| them, like this), then the building blocks become far less
| opaque.
| Filligree wrote:
| That's an accurate enough description of what's going on here,
| yeah, but I'm very curious if this could be implemented in
| hardware. What we've got is an inexact not-quite-binary adder,
| but one that's potentially smaller and faster than the regular
| binary ones.
| garganzol wrote:
| It can be implemented in hardware but the implementation
| would be more complex than a digital adder based on logical
| gates.
| em3rgent0rdr wrote:
| Analog addition is actually really easy if you can tolerate
| noise and heat and can convert to/from representing the
| signal as an analog current. Using Kirchhoff's Current Law,
| addition of N currents from current sources is achieved by
| joining those N wires to a shared exit wire which contains
| the summed current.
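| In code form the "circuit" degenerates to a plain sum, which
| is the point (hypothetical values, in mA):
|
|     input_currents = [0.5, 1.25, 2.0]     # one per current source
|     output_current = sum(input_currents)  # shared exit wire: 3.75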
| garganzol wrote:
| Addition is easy, DAC is relatively easy, but ADC is not.
| em3rgent0rdr wrote:
| There is a valid use case for when designers simply need
| "fuzzy logic" or "fuzzy math" which doesn't need _exact_
| results but can tolerate inaccurate ones. If using fuzzy
| hardware saves resources (time, energy, die space, etc.) then
| it might be a valid tradeoff to use inaccurate fuzzy hardware
| instead of exact hardware.
|
| For instance, instead of evaluating an exponential function in
| digital logic, it might be quicker and more energy-efficient
| to just evaluate it using a diode, if the value is available
| as an analog voltage and if some noise is tolerable, as done
| with analog computers.
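| A sketch of the diode idea via the Shockley equation (Is and
| Vt are made-up device constants here):
|
|     import math
|
|     Is, Vt = 1e-12, 0.025
|     def diode_current(v):
|         # current through the diode is ~exponential in voltage
|         return Is * math.expm1(v / Vt)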
| vletal wrote:
| It works here because the "analog" representation is not
| influenced by noise, since it's simulated on digital
| hardware.
|
| On the other hand, we use digital hardware precisely
| because it's robust against the noise present in hardware
| analog circuits.
| TeMPOraL wrote:
| So in other words, this is analog computation on digital
| hardware on an analog substrate - the analog-to-digital step
| eliminates the noise of our physical reality, and the
| subsequent digital-to-analog step reintroduces a certain
| flexibility of design thinking.
|
| I wonder, is there ever a case where analog-on-digital is
| better to work with as an abstraction layer, or is it always
| easier to work with digital signals directly?
| chaorace wrote:
| Yes, at least I think so. That's why we so often try to
| approximate analog with pseudo-continuous data patterns
| like samples and floats. Even going beyond electronic
| computing, we invented calculus due to similar
| limitations with discrete data in mathematics.
|
| Of course, these are all just approximations of analog.
| Much like emulation, there are limitations, penalties,
| and inaccuracies that will inevitably kneecap
| applications when compared to a native implementation
| running on bare metal (though, it seems that humans lack
| a proper math coprocessor, so calculus could be
| considered no more "native" than traditional algebra).
|
| We do sometimes see specialized hardware emerge (DSPs in
| the case of audio signal processing, FPUs in the case of
| floats, NPUs/TPUs in the case of neural networks), but
| these are almost always highly optimized for operations
| specific to analog-_like_ data patterns (e.g. Fourier
| transforms) rather than true analog operations. This is
| probably because scalable/reliable/fast analog memory
| remains an unsolved problem (semiconductors are simply
| _too_ useful).
| Filligree wrote:
| Right, but there exists software that would be robust to
| noise, e.g. ...neural networks. I'm curious if not-quite-
| perfect adders might be beneficial in some cases.
| Someone wrote:
| Quad-level flash memory cells
| (https://en.wikipedia.org/wiki/Multi-level_cell#Quad-level_ce...)
| can store 4 bits for months.
|
| Because of that, I would (hand-wavingly) expect it can be
| made to work for a three-bit adder (with four bits of output)
___________________________________________________________________
(page generated 2023-01-16 23:00 UTC)