[HN Gopher] "AI" on a Calculator (2023)
       ___________________________________________________________________
        
       "AI" on a Calculator (2023)
        
       Author : mariuz
       Score  : 44 points
       Date   : 2025-01-03 11:57 UTC (1 day ago)
        
 (HTM) web link (z80.me)
 (TXT) w3m dump (z80.me)
        
       | zeeland wrote:
       | Great project! I'm amazed by the ingenuity of running a CNN on a
       | TI-84 Plus CE.
       | 
       | Given the eZ80 CPU's limitations with floating-point operations,
       | do you think fixed-point arithmetic could significantly speed up
       | inference without adding too much complexity?
        
         | PaulHoule wrote:
         | One funny topic is "why did it take us so long to develop
         | advanced neural networks?" Part of the current advance is
         | about improved hardware performance, but I'd imagine that if
         | you went back to the 1980s with the knowledge we have now,
         | you could have gotten much more out of neural nets than we
         | did then.
         | 
         | As for fixed point vs. floating point, I got into an argument
         | with a professor in grad school about it. I'd argue that
         | floating point often gets used out of familiarity and
         | convenience: if you look at fields such as signal processing,
         | it is very common to implement algorithms like the FFT in
         | fixed point. It seems absurd that an activation in a neural
         | net could range from (say) 10^-6 to 10^6 and have the first
         | digit be meaningful the whole way. If everything were
         | properly scaled and normalized you shouldn't need to waste
         | bits on exponents, but maybe when you have a complex network
         | with a lot of layers that normalization is easier said than
         | done.
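         | 
         | To make that concrete, a minimal sketch of the arithmetic in
         | C (Q16.16 format; illustrative only, not the article's code):
         | 
         |     #include <stdint.h>
         | 
         |     /* Q16.16: 16 integer bits, 16 fractional bits. */
         |     typedef int32_t q16_16;
         | 
         |     #define Q_ONE (1 << 16)   /* 1.0 in Q16.16 */
         | 
         |     /* Multiply two Q16.16 values; the 64-bit intermediate
         |      * holds the full product before shifting back down. */
         |     static q16_16 q_mul(q16_16 a, q16_16 b)
         |     {
         |         return (q16_16)(((int64_t)a * (int64_t)b) >> 16);
         |     }
         | 
         | If activations stay normalized to roughly [-1, 1], every
         | stored bit is significant and none are spent on an exponent,
         | which is the appeal; the catch is keeping that normalization
         | true across many layers.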
        
           | pmarreck wrote:
           | > why did it take us so long to develop advanced neural
           | networks?
           | 
           | I have a pet theory (with some empirical support) that Marvin
           | Minsky put somewhat of a multi-year kibosh on
           | "perceptron"/neural-net research due to his (true, but
           | arguably irrelevant) assertion that trained neural nets were
           | essentially "nondeterministic black boxes" which was
           | basically heretical.
        
             | PaulHoule wrote:
             | That's not just a pet theory. I'll tell you, though, that
             | 
             | https://en.wikipedia.org/wiki/Perceptrons_(book)
             | 
             | is a brilliant book from the early days of computer
             | science that asks questions about the fundamentals of
             | algorithms and what you can and can't do with different
             | methods (like the result that it takes at least N log N
             | comparisons to sort N items).
        
           | hansvm wrote:
           | > I'd argue that floating point often gets used out of
           | familiarity and convenience
           | 
           | You're probably right. I can't tell you how often I've seen
           | floats used in places they really shouldn't have been, or
           | where nobody bothered to even write an algorithm that got the
           | dynamic range for _floats_ (much less fixed-points) correct.
           | 
           | > if you look at fields such as signal processing it is very
           | common to do algorithms like FFT in fixed point
           | 
           | FFT is almost uniquely suited to a fixed-point
           | implementation. All the multiplications are with powers of
           | roots of unity, and a typical problem size in that domain
           | only involves O(10k) additions, keeping the result firmly in
           | an appropriately scaled fixed-point input's dynamic range.
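           | 
           | For reference, a radix-2 butterfly in Q1.15 with per-stage
           | scaling looks roughly like this (a sketch, not anyone's
           | production FFT):
           | 
           |     #include <stdint.h>
           | 
           |     /* One radix-2 butterfly. The twiddle (wr, wi) is a
           |      * root of unity in Q1.15, so a single >> 15 brings
           |      * each product back into range, and the >> 1 per
           |      * stage keeps the running sums inside int16_t. */
           |     static void butterfly(int16_t *ar, int16_t *ai,
           |                           int16_t *br, int16_t *bi,
           |                           int16_t wr, int16_t wi)
           |     {
           |         int32_t tr = ((int32_t)*br * wr
           |                       - (int32_t)*bi * wi) >> 15;
           |         int32_t ti = ((int32_t)*br * wi
           |                       + (int32_t)*bi * wr) >> 15;
           |         int16_t sr = (int16_t)((*ar + tr) >> 1);
           |         int16_t si = (int16_t)((*ai + ti) >> 1);
           |         *br = (int16_t)((*ar - tr) >> 1);
           |         *bi = (int16_t)((*ai - ti) >> 1);
           |         *ar = sr;
           |         *ai = si;
           |     }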
           | 
           | > If everything was properly scaled and normalized you
           | shouldn't need to waste bits on exponents but maybe when you
           | have a complex network with a lot of layers that
           | normalization is easier said than done.
           | 
           | It's much harder to accomplish that even in vanilla neural
           | networks (compared to FFT), much less if they have anything
           | "interesting" (softmax, millions of inputs, ...) going on.
           | 
           | Take a look at a single matmul operation, or even just a dot
           | product, with f32 vs i32 weights. Unless you do something in
           | the architecture to prevent large weights from ever being
           | lined up with large inputs (which would be an odd thing to
           | attempt in the first place; the whole point of a dot-product
           | in a lot of these architectures is to measure similarity
           | between two pieces of information, and forcing inputs to be
           | less similar to the weights in that way doesn't sound
           | intuitively helpful), you'll eventually have to compute some
           | product `x y` where both are roughly as large as you've
           | allowed with whatever normalization scheme you've used up to
           | that point. The resulting large activation will be one of the
           | more important pieces of information from that dot product,
           | so clipping the value is not a good option. With i32 inputs,
           | you're left with one of:
           | 
           | 1. Cap the inputs to 15 bits of precision (as opposed to 23
           | with f32)
           | 
           | 2. Compute the outputs as i64, perhaps even larger (or still
           | partially restricting input precision) when you factor in the
           | additions and other things you still have to handle
           | 
           | 3. Rework the implementation of your dot product to perhaps,
           | somehow, avoid large multiplications (not actually possible
           | in general though; only when the result fits in your target
           | dynamic range)
           | 
           | The first option is just straight-up worse than an f32
           | implementation. The third isn't general-purpose enough (in
           | that you again start having to design a network amenable to
           | fixed points). The second might work, but to keep the bit
           | count from exploding through the network you immediately
           | require some form of normalization. That's probably fine if
           | you have a layer-norm after every layer, but it still
           | complicates the last layer, and it doesn't work at all if you
           | have any fancy features at that particular matmul precluding
           | layer normalization (like enforcing a left-unitary Jacobian
           | at each composite layer (including any normalization and
           | whatnot)).
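           | 
           | A rough sketch of where option 2 lands in C (the names and
           | Q-format are made up for illustration):
           | 
           |     #include <stddef.h>
           |     #include <stdint.h>
           | 
           |     /* Accumulate the full i32*i32 products in i64, then
           |      * requantize with an explicit per-layer shift and
           |      * saturate, so the next layer sees the same
           |      * fixed-point format. The inputs still have to be
           |      * partially restricted (or n kept small) so the i64
           |      * accumulator itself can't overflow. */
           |     static int32_t dot_fx(const int32_t *x,
           |                           const int32_t *w,
           |                           size_t n, int requant_shift)
           |     {
           |         int64_t acc = 0;
           |         for (size_t i = 0; i < n; i++)
           |             acc += (int64_t)x[i] * w[i];
           |         acc >>= requant_shift;   /* the "normalization" */
           |         if (acc > INT32_MAX) acc = INT32_MAX;
           |         if (acc < INT32_MIN) acc = INT32_MIN;
           |         return (int32_t)acc;
           |     }
           | 
           | That per-layer requant_shift is essentially the
           | normalization bookkeeping a float network gets from its
           | exponent bits.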
           | 
           | You might, reasonably, counter that you don't need more
           | than 15 bits of precision anyway. That's a fair counter
           | (mostly, other than that the argument you're making is that
           | by not wasting space on the exponent you have room for more
           | real data), but all the common <f32 floating-point formats
           | still keep more precision than an equally sized fixed-point
           | has left after the reduction that multiplication forces on
           | it. Plus, you still need _some_ kind of normalization after
           | the matmul to keep the dynamic range correct. Doing that by
           | tweaking the model architecture precludes some features you
           | might otherwise reasonably want to include, so you need to
           | do something like separately keeping track of a
           | multiplicative scaling factor at each layer, which itself
           | screws with how most activation functions behave, ....
           | 
           | Plus, at a hardware level, almost nobody provides operations
           | like fmadd for integers/fixed-point yet. As a collective
           | action problem, you're starting off at least twice as slow if
           | you try to popularize fixed-point networks.
           | 
           | Not to mention, when you use fixed-points you generally start
           | having to build larger computational blocks rather than
           | working in smaller, re-usable units. As a simple example, the
           | order of operations when computing a linear projection almost
           | doesn't matter for floats, but for fixed-points you have all
           | the same precision problems I mentioned for multiplications
           | earlier, with the added anti-benefit of the internal division
           | actively eating your precision. You can't have a component
           | that just computes similarity and then applies that
           | similarity without also doubling the bits of the return type
           | and multiplying the result by something like (1<<32) if the
           | divisor is large and some other hackery if it's small. Since
           | that component is unwieldy to write and use, you start having
           | to code in larger units like computing the entire linear
           | projection.
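           | 
           | As a small illustration of the division problem, even a
           | plain Q16.16 divide already needs the widening and
           | pre-scaling (a sketch; den assumed nonzero):
           | 
           |     #include <stdint.h>
           | 
           |     typedef int32_t q16_16;   /* Q16.16 */
           | 
           |     /* Without pre-scaling the numerator into a 64-bit
           |      * intermediate, the quotient loses every fractional
           |      * bit of precision. */
           |     static q16_16 q_div(q16_16 num, q16_16 den)
           |     {
           |         return (q16_16)(((int64_t)num << 16) / den);
           |     }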
           | 
           | I've probably gone on more than long enough. Floats are a
           | nice middle ground when you're doing a mix of nearly
           | unconstrained additions and multiplications, and they allow
           | you to focus a bit more on the actual ML. For a fixed-point
           | network to have advantages, I think it'd need to be built
           | specifically to take advantage of fixed-point's benefits,
           | especially since so many aspects of these networks instead
           | explicitly play into its drawbacks.
        
             | PaulHoule wrote:
             | Great!
        
       | JKCalhoun wrote:
       | That's wild. Given the focus on the much narrower problem of
       | digit recognition, I wonder now if hardware using op-amps could
       | be designed to tackle this smaller problem in the analog domain.
        
       | kordlessagain wrote:
       | Mostly unrelated, but I remember typing in a simple calculator
       | program (hex codes I think) from a magazine for my Apple IIe that
       | would then speak the answer aloud. I was fascinated by the fact
       | you could type in things that then created the sound of a voice.
       | Googling isn't yielding much but I'll dig around and see if I can
       | find it.
        
         | sjsdaiuasgdia wrote:
         | Maybe you had a speech card for your IIe, like
         | https://en.wikipedia.org/wiki/Echo_II_(expansion_card)
        
           | pmarreck wrote:
           | Not necessarily. There was a lot of experimentation at the
           | time, and programs that did hacky things like blipping the
           | mono speaker at different frequencies to produce
           | interesting sounds were not out of the question.
        
       | fferen wrote:
       | I first started coding in high school on the TI-84 calculators.
       | My first language was TI-BASIC, my second was Z80 assembly -
       | quite a big step - and I quit when I faced some tricky bugs that
       | my teenage self could not figure out :-) Back then, I don't
       | believe there was a C/C++ toolchain. Some time later, I tried
       | using the Small Device C Compiler (SDCC), but encountered
       | several compiler bugs which I couldn't fix but duly reported.
       | Great to see there is such excellent tooling nowadays.
        
       ___________________________________________________________________
       (page generated 2025-01-04 23:00 UTC)