[HN Gopher] "AI" on a Calculator (2023)
___________________________________________________________________
"AI" on a Calculator (2023)
Author : mariuz
Score : 44 points
Date : 2025-01-03 11:57 UTC (1 day ago)
(HTM) web link (z80.me)
(TXT) w3m dump (z80.me)
| zeeland wrote:
| Great project! I'm amazed by the ingenuity of running a CNN on a
| TI-84 Plus CE.
|
| Given the ez80 CPU's limitations with floating-point operations,
| do you think fixed-point arithmetic could significantly speed up
| inference without adding too much complexity?
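|
| For concreteness, something like the following Q8.8
| multiply-accumulate is the kind of kernel I have in mind; the
| Q8.8 format and the names here are only illustrative, not taken
| from the project:
|
|     /* Minimal sketch: a Q8.8 fixed-point dot product of the kind a
|      * CNN inference loop might use on a CPU without an FPU. */
|     #include <stdint.h>
|     #include <stdio.h>
|
|     #define FRAC_BITS 8  /* Q8.8: 8 integer bits, 8 fraction bits */
|
|     static int16_t to_q8_8(float x) { return (int16_t)(x * (1 << FRAC_BITS)); }
|     static float from_q8_8(int32_t q) { return (float)q / (1 << FRAC_BITS); }
|
|     /* 32-bit accumulator gives the 16x16-bit products headroom. */
|     static int32_t dot_q8_8(const int16_t *w, const int16_t *x, int n)
|     {
|         int32_t acc = 0;
|         for (int i = 0; i < n; i++)
|             acc += (int32_t)w[i] * x[i];  /* each product is Q16.16 */
|         return acc >> FRAC_BITS;          /* rescale back to Q8.8   */
|     }
|
|     int main(void)
|     {
|         int16_t w[3] = { to_q8_8(0.5f), to_q8_8(-1.25f), to_q8_8(2.0f) };
|         int16_t x[3] = { to_q8_8(1.0f), to_q8_8(0.5f), to_q8_8(0.25f) };
|         printf("%f\n", from_q8_8(dot_q8_8(w, x, 3)));  /* ~0.375 */
|         return 0;
|     }
|
| Everything is shifts and integer multiplies, which is the whole
| appeal on a chip without hardware floats.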
| PaulHoule wrote:
| One funny topic is "why did it take us so long to develop
| advanced neural networks?" Part of the current advance is
| about improving hardware performance, but I'd imagine that if
| you went back to the 1980s with the knowledge we have, you
| could have gotten much more out of neural nets than we did
| then.
|
| As for fixed point vs floating, I got into an argument with a
| professor in grad school about it. I'd argue that floating
| point often gets used out of familiarity and convenience. If
| you look at fields such as signal processing it is very common
| to do algorithms like FFT in fixed point, and it seems absurd
| that an activation in a neural net could range from (say)
| 10^-6 to 10^6 and have the first digit be meaningful the whole
| way. If everything was properly scaled and normalized you
| shouldn't need to waste bits on exponents but maybe when you
| have a complex network with a lot of layers that normalization
| is easier said than done.
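|
| A tiny worked example of that dynamic-range point (an
| illustrative sketch, not from the article): in Q1.15, a 16-bit
| fixed point scaled to [-1, 1), the smallest nonzero step is
| 2^-15, about 3e-5, so an unnormalized 1e-6 activation quantizes
| to zero, while a value normalized to sit near 1 keeps roughly
| four decimal digits.
|
|     #include <stdint.h>
|     #include <stdio.h>
|
|     static int16_t to_q15(double x) { return (int16_t)(x * 32768.0); }
|     static double from_q15(int16_t q) { return q / 32768.0; }
|
|     int main(void)
|     {
|         printf("1e-6  -> %g\n", from_q15(to_q15(1e-6)));   /* 0: underflows */
|         printf("0.731 -> %g\n", from_q15(to_q15(0.731)));  /* ~0.73099      */
|         return 0;
|     }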
| pmarreck wrote:
| > why did it take us so long to develop advanced neural
| networks?
|
| I have a pet theory (with some empirical support) that Marvin
| Minsky put somewhat of a multi-year kibosh on
| "perceptron"/neural-net research due to his (true, but
| arguably irrelevant) assertion that trained neural nets were
| essentially "nondeterministic black boxes" which was
| basically heretical.
| PaulHoule wrote:
| That's not just a pet theory. I'll tell you though that
|
| https://en.wikipedia.org/wiki/Perceptrons_(book)
|
| is a brilliant book from the early days of computer science
| that asks questions about the fundamentals of algorithms and
| what you can and can't do with different methods (like the
| result that it takes at least N log N comparisons to sort N
| items).
| hansvm wrote:
| > I'd argue that floating point often gets used out of
| familiarity and convenience
|
| You're probably right. I can't tell you how often I've seen
| floats used in places they really shouldn't have been, or
| where nobody bothered to even write an algorithm that got the
| dynamic range for _floats_ (much less fixed-points) correct.
|
| > if you look at fields such as signal processing it is very
| common to do algorithms like FFT in fixed point
|
| FFT is almost uniquely suited to a fixed-point
| implementation. All the multiplications are with powers of
| roots of unity, and a typical problem size in that domain
| only involves O(10k) additions, keeping the result firmly in
| an appropriately scaled fixed-point input's dynamic range.
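|
| A sketch of what that looks like in practice: one radix-2
| butterfly with Q1.15 twiddle factors (|w| = 1, so the product
| never grows) and a divide-by-two per stage to keep sums in
| range. The format and names are illustrative:
|
|     #include <stdint.h>
|     #include <stdio.h>
|
|     typedef struct { int16_t re, im; } cq15;      /* complex Q1.15 */
|
|     static int16_t mul_q15(int16_t a, int16_t b)  /* Q15*Q15 -> Q15 */
|     {
|         return (int16_t)(((int32_t)a * b) >> 15);
|     }
|
|     /* a' = (a + w*b)/2, b' = (a - w*b)/2 */
|     static void butterfly(cq15 *a, cq15 *b, cq15 w)
|     {
|         int16_t tr = mul_q15(w.re, b->re) - mul_q15(w.im, b->im);
|         int16_t ti = mul_q15(w.re, b->im) + mul_q15(w.im, b->re);
|         cq15 na = { (int16_t)((a->re + tr) >> 1), (int16_t)((a->im + ti) >> 1) };
|         cq15 nb = { (int16_t)((a->re - tr) >> 1), (int16_t)((a->im - ti) >> 1) };
|         *a = na;
|         *b = nb;
|     }
|
|     int main(void)
|     {
|         cq15 a = { 16384, 0 }, b = { 16384, 0 };  /* 0.5 + 0i each     */
|         cq15 w = { 32767, 0 };                    /* twiddle ~= 1 + 0i */
|         butterfly(&a, &b, w);
|         /* a ~ 0.5 (the scaled sum), b ~ 0 (the scaled difference) */
|         printf("a = %d %+di, b = %d %+di\n", a.re, a.im, b.re, b.im);
|         return 0;
|     }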
|
| > If everything was properly scaled and normalized you
| shouldn't need to waste bits on exponents but maybe when you
| have a complex network with a lot of layers that
| normalization is easier said than done.
|
| It's much harder to accomplish that even in vanilla neural
| networks (compared to FFT), much less if they have anything
| "interesting" (softmax, millions of inputs, ...) going on.
|
| Take a look at a single matmul operation, or even just a dot
| product, with f32 vs i32 weights. Unless you do something in
| the architecture to prevent large weights from ever being
| lined up with large inputs (which would be an odd thing to
| attempt in the first place; the whole point of a dot-product
| in a lot of these architectures is to measure similarity
| between two pieces of information, and forcing inputs to be
| less similar to the weights in that way doesn't sound
| intuitively helpful), you'll eventually have to compute some
| product `x y` where both are roughly as large as you've
| allowed with whatever normalization scheme you've used up to
| that point. The resulting large activation will be one of the
| more important pieces of information from that dot product,
| so clipping the value is not a good option. With i32 inputs,
| you're left with one of:
|
| 1. Cap the inputs to 15 bits of precision (as opposed to 23
| with f32)
|
| 2. Compute the outputs as i64, perhaps even larger (or still
| partially restricting input precision) when you factor in the
| additions and other things you still have to handle
|
| 3. Rework the implementation of your dot product to perhaps,
| somehow, avoid large multiplications (not actually possible
| in general though; only when the result fits in your target
| dynamic range)
|
| The first option is just straight-up worse than an f32
| implementation. The third isn't general-purpose enough (in
| that you again start having to design a network amenable to
| fixed points). The second might work, but to keep the bit
| count from exploding through the network you immediately
| require some form of normalization. That's probably fine if
| you have a layer-norm after every layer, but it still
| complicates the last layer, and it doesn't work at all if you
| have any fancy features at that particular matmul precluding
| layer normalization (like enforcing a left-unitary Jacobian
| at each composite layer (including any normalization and
| whatnot)).
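|
| A sketch of what option 2 looks like in code, with the Q16.16
| format, the values, and the per-layer shift all illustrative:
| the i32*i32 products and the accumulator are widened to i64,
| and the shift at the end is the normalization that keeps the
| bit count from exploding.
|
|     #include <stdint.h>
|     #include <stdio.h>
|
|     #define FRAC 16  /* Q16.16 */
|
|     static int32_t dot_q16(const int32_t *w, const int32_t *x,
|                            int n, int norm_shift)
|     {
|         int64_t acc = 0;                  /* i32 * i32 needs 64 bits */
|         for (int i = 0; i < n; i++)
|             acc += (int64_t)w[i] * x[i];  /* each product is Q32.32  */
|         /* Drop back to Q16.16; norm_shift stands in for the
|          * per-layer scale you would otherwise get from something
|          * like a layer-norm. */
|         return (int32_t)(acc >> (FRAC + norm_shift));
|     }
|
|     int main(void)
|     {
|         /* Both operands near the top of their allowed range. */
|         int32_t w[2] = { 256 << FRAC, 200 << FRAC };
|         int32_t x[2] = { 250 << FRAC, 255 << FRAC };
|         int32_t y = dot_q16(w, x, 2, 8);  /* >>8 ~ divide by 256 */
|         /* (256*250 + 200*255) / 256 = 449.21875 */
|         printf("%f\n", (double)y / (1 << FRAC));
|         return 0;
|     }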
|
| You might, reasonably, counter that you don't need more than
| 15 bits of precision anyway. That's a fair counter (mostly,
| though the argument you're making is that by not wasting space
| on the exponent you have room for more real data), but the
| common <f32 floating-point formats still have more precision
| than what's left of an equally sized fixed-point after the
| reduction it needs to avoid overflow. Plus, you still
| need _some_ kind of normalization after the matmul to keep
| the dynamic range correct, and doing that by tweaking the
| model architecture precludes some features you might
| otherwise reasonably want to include, so you need to do
| something like separately keep track of a multiplicative
| scaling factor at each layer, which itself screws with how
| most activation functions behave, ....
|
| Plus, at a hardware level, almost nobody provides operations
| like fmadd for integers/fixed-point yet. As a collective
| action problem, you're starting off at least twice as slow if
| you try to popularize fixed-point networks.
|
| Not to mention, when you use fixed-points you generally start
| having to build larger computational blocks rather than
| working in smaller, re-usable units. As a simple example, the
| order of operations when computing a linear projection almost
| doesn't matter for floats, but for fixed-points you have all
| the same precision problems I mentioned for multiplications
| earlier, with the added anti-benefit of the internal division
| actively eating your precision. You can't have a component
| that just computes similarity and then applies that
| similarity without also doubling the bits of the return type
| and multiplying the result by something like (1<<32) if the
| divisor is large and some other hackery if it's small. Since
| that component is unwieldy to write and use, you start having
| to code in larger units like computing the entire linear
| projection.
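|
| As a concrete version of that division problem (values and the
| Q16.16 choice are illustrative): computing a scalar projection
| coefficient (x.y)/(y.y) with a plain integer divide throws the
| fraction away, so the numerator has to be widened and
| pre-scaled before dividing.
|
|     #include <stdint.h>
|     #include <stdio.h>
|
|     #define FRAC 16  /* Q16.16 */
|
|     int main(void)
|     {
|         int64_t xy = (int64_t)3 << FRAC;  /* x.y = 3.0, held in 64 bits */
|         int64_t yy = (int64_t)7 << FRAC;  /* y.y = 7.0 */
|
|         int32_t naive  = (int32_t)(xy / yy);            /* 0: fraction lost    */
|         int32_t scaled = (int32_t)((xy << FRAC) / yy);  /* ~0.428571 in Q16.16 */
|
|         printf("naive  = %f\n", (double)naive  / (1 << FRAC));
|         printf("scaled = %f\n", (double)scaled / (1 << FRAC));
|         return 0;
|     }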
|
| I've probably gone on more than long enough. Floats are a
| nice middle ground when you're doing a mix of nearly
| unconstrained additions and multiplications, and they allow
| you to focus a bit more on the actual ML. For a fixed-point
| network to have advantages, I think it'd need to be built
| specifically to take advantage of fixed-point's benefits,
| especially since so many aspects of these networks instead
| explicitly play into its drawbacks.
| PaulHoule wrote:
| Great!
| JKCalhoun wrote:
| That's wild -- focusing on the much narrower problem of digit
| recognition, I wonder now if hardware using op-amps could be
| designed to tackle this smaller problem in the analog domain.
| kordlessagain wrote:
| Mostly unrelated, but I remember typing in a simple calculator
| program (hex codes I think) from a magazine for my Apple IIe that
| would then speak the answer aloud. I was fascinated by the fact
| you could type in things that then created the sound of a voice.
| Googling isn't yielding much but I'll dig around and see if I can
| find it.
| sjsdaiuasgdia wrote:
| Maybe you had a speech card for your IIe, like
| https://en.wikipedia.org/wiki/Echo_II_(expansion_card)
| pmarreck wrote:
| Not necessarily. There was a lot of experimentation at the
| time, and hacks like blipping the mono speaker at different
| frequencies to produce interesting sounds were not out of the
| question.
| fferen wrote:
| I first started coding in high school on the TI-84 calculators.
| My first language was TI-BASIC and my second was Z80 assembly -
| quite a big step - and I quit when I faced some tricky bugs
| that my teenage self could not figure out :-) Back then, I
| don't believe there was a C/C++ toolchain. Some time later, I
| tried the Small Device C Compiler (SDCC), but ran into several
| compiler bugs which I couldn't fix but duly reported. Great to
| see there is such excellent tooling nowadays.
___________________________________________________________________
(page generated 2025-01-04 23:00 UTC)