[HN Gopher] Implementing neural networks on the "3 cent" 8-bit m...
       ___________________________________________________________________
        
       Implementing neural networks on the "3 cent" 8-bit microcontroller
        
       Author : cpldcpu
       Score  : 146 points
       Date   : 2024-10-19 18:09 UTC (1 days ago)
        
 (HTM) web link (cpldcpu.wordpress.com)
 (TXT) w3m dump (cpldcpu.wordpress.com)
        
       | magicalhippo wrote:
       | Fun to see neural nets pushed to such extremes, really enjoyed
       | the post.
       | 
       | > The smallest models had to be trained without data
       | augmentation, as they would not converge otherwise.
       | 
       | Was this also the case for the 2-bit model you ended up with?
        
         | cpldcpu wrote:
         | Yes, as far as I remember the limit was somewhere around
         | 1 kbyte of total parameter size.
        
       | malwrar wrote:
       | Super interesting!
       | 
       | I wish tfa would have found some way to measure the PMS150C
       | implementation the headline brags about, but even the PFS154 (2x
       | mem, 3x price) version is super neat! Interesting to see how the
       | net in particular is built at such small scale. I also wish they
       | included numbers about performance like they do in their linked
       | CH32V003 post. I'm wondering how quick these MCUs are compared to
       | each other and e.g. OP's PC, and how hot they get under sustained
       | load.
        
         | cpldcpu wrote:
         | There are no performance profiling mechanisms on these small
         | devices, and the timers are rather coarse.
         | 
         | But it is easily possible to estimate the execution time:
         | 
         | - mulacc of one weight takes 11 clock cycles.
         | 
         | - There are 1696 weights in the model, each one is only touched
         | once.
         | 
         | - We can assume ~25%-50% overhead for loops and housekeeping
         | (1:4 unrolled)
         | 
         | => ~23000-28000 clock cycles per inference, which is less than
         | 2ms at 16MHz
         | 
         | Since this is an MLP, the inference time directly scales with
         | the number of weights. (This would be different for a CNN)
         | 
         | As for verifying on the PMS150C - I considered using an LED for
         | valid/nonvalid output. But iterating with OTP devices is quite
         | tedious when you do not have an emulator. Since both devices
         | are code compatible, we can assume that the code works on the
         | smaller devices, though.
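
The cycle estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch that uses only the figures quoted in the comment (11 cycles per multiply-accumulate, 1696 weights, 25% to 50% overhead, 16 MHz clock); the function name `estimate_inference` is hypothetical, and the fixed-fraction overhead model is an assumption.

```python
def estimate_inference(n_weights, cycles_per_mac, overhead, clock_hz):
    """Return (clock cycles, milliseconds) for one inference of an MLP.

    Assumes every weight is touched exactly once and that loop and
    housekeeping overhead is a fixed fraction of the multiply-accumulate work.
    """
    cycles = n_weights * cycles_per_mac * (1.0 + overhead)
    return cycles, cycles / clock_hz * 1e3


# Figures quoted in the comment above.
low_cycles, low_ms = estimate_inference(1696, 11, 0.25, 16_000_000)
high_cycles, high_ms = estimate_inference(1696, 11, 0.50, 16_000_000)

print(f"{low_cycles:.0f} to {high_cycles:.0f} cycles")
print(f"{low_ms:.2f} to {high_ms:.2f} ms at 16 MHz")
```

With these inputs the result is 23320 to 27984 cycles, which is consistent with the "~23000-28000" range and the "less than 2ms" figure quoted above.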
        
           | wongarsu wrote:
           | If flipping one of the output pins is fast enough you could
           | use that in combination with an oscilloscope as a coarse but
           | very accurate profiling method.
           | 
           | Though I believe for most people "roughly 2ms" is good enough
        
       | Lerc wrote:
       | I feel like to really get to the level of hypothetically useful
       | it should be able to take the samples from an input source.
       | 
       | I wonder if you could do it on the full 28*28 by never holding
       | the full image in memory at once, just as an input stream. Say, a
       | 1D convolution on each line as it comes in to turn a [1,28] into a
       | [3,7], and buffer two lines of the [3,7] = 42. Then, once three
       | results of the third line's convolution have been produced
       | ([3,3] = 9), start performing a 2D convolution using the first two
       | lines [2,3,:3], replacing the data at the start (as it has already
       | been processed).
        
         | cpldcpu wrote:
         | Yes, you could implement it in a way where the first layer is
         | streamed and the output activations are accumulated in parallel
         | in memory. This would limit the memory requirements for the
         | input activations, but would increase execution time, as more
         | activations have to be shuffled around.
         | 
         | In this case I am streaming from ROM anyway, so it does not
         | matter if the inputs are read only once or multiple times.
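
The trade-off described above can be illustrated with a small sketch. It assumes a plain fully connected first layer with a ReLU activation; the function names, the weight layout `weights[i][j]`, and the toy numbers are all hypothetical and not taken from the article.

```python
def first_layer_streamed(pixel_stream, weights, n_hidden):
    """Consume inputs one at a time, keeping only the accumulators in memory.

    weights[i][j] is the (hypothetical) weight from input i to hidden unit j.
    """
    acc = [0] * n_hidden
    for i, x in enumerate(pixel_stream):
        row = weights[i]
        for j in range(n_hidden):
            acc[j] += row[j] * x  # every accumulator is updated per input
    return [max(a, 0) for a in acc]  # ReLU, assumed here for illustration


def first_layer_buffered(pixels, weights, n_hidden):
    """Conventional order: one output at a time, needs all inputs buffered."""
    out = []
    for j in range(n_hidden):
        s = 0
        for i, x in enumerate(pixels):
            s += weights[i][j] * x
        out.append(max(s, 0))
    return out


# Tiny made-up example: 4 inputs, 3 hidden units, small integer weights.
weights = [[1, -1, 0], [0, 1, 1], [-1, 0, 1], [1, 1, -1]]
pixels = [3, 0, 2, 1]

streamed = first_layer_streamed(iter(pixels), weights, 3)
buffered = first_layer_buffered(pixels, weights, 3)
print(streamed, buffered)
```

Both orderings give the same activations. The streamed version never needs the full input in RAM but reads and writes every accumulator once per input, which is the extra shuffling mentioned above; the buffered version keeps a single running sum but needs random access to all inputs.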
        
       | amelius wrote:
       | This challenges only the memory of the MCU, not the speed.
       | 
       | And it is a bit disappointing that they didn't finish the project
       | by adding an 8x8 pixel camera and a 7-segment display.
        
         | robertclaus wrote:
         | Is an 8bit camera an off-the-shelf part? Or do you just mean
         | data streaming in and out in general?
        
           | amelius wrote:
           | Here is an example of an 8x8 camera sensor for hobby use. You
           | can filter out the IR if desired. There are many similar
           | sensors, and they are often used for motion tracking.
           | 
           | https://learn.adafruit.com/adafruit-amg8833-8x8-thermal-
           | came...
        
             | numpad0 wrote:
             | AMG8833 is an actual thermal sensor array, not e.g. an 8x8
             | photodiode array with IR sensitivity. It's of the thermopile
             | type, which differs from the microbolometer type used in FLIR
             | cameras (which have to go blank periodically, unlike
             | thermopiles, which need not).
             | 
             | Mouse sensors are 8x8ish cameras, but very few of them
             | (basically only the genuine HP/Agilent/Avago ADNS-2610 at
             | bottom tier price) have a raw image export feature, for
             | some reason.
             | 
             | There are many other tiny potato camera parts on various
             | markets, but most of them are missing a datasheet && require
             | complicated interfacing.
             | 
             | Overall it's actually not so trivial to get a small image
             | sensor for hobby experiments.
        
           | vardump wrote:
           | Optical mice have similar cameras.
        
       | pjmlp wrote:
       | As proof of concept, it is quite cool.
       | 
       | However for going into production with something like this, maybe
       | writing everything in Assembly, and not just some parts, would be
       | much better.
       | 
       | But after a quick search it seems the macro assembler story for
       | RISC-V isn't that great.
        
         | kragen wrote:
         | The Padauk chips being discussed here aren't RISC-V. Gas is an
         | adequate macro assembler for RISC-V, but I think the Padauk
         | chips don't have anything similar. Still, you can get pretty
         | far with m4... or writing a shell script with an echo function.
        
       | Someone wrote:
       | FTA: _"One major issue when programming these devices in C is
       | that every function call consumes RAM for the return stack and
       | function parameters. This is unavoidable"_
       | 
       | It's not completely unavoidable: don't use function parameters
       | (globals are your friends on these CPUs). You can't avoid having
       | a return stack, but you can make as few function calls as
       | possible (ideally zero, but you may have to write functions to
       | fit things into ROM)
       | 
       | > "To solve this, I flattened the inference code"
       | 
       | I think that's "make as few function calls as possible"
       | 
       | > and implemented the inner loop in assembly to optimize variable
       | usage.
       | 
       | That _should_ only make a difference for memory usage if your C
       | compiler isn't perfect (but of course, it never is, certainly on
       | CPUs like this one, which is a poor fit for C)
        
         | dpassens wrote:
         | You could kind of avoid the return stack, if you only ever do
         | tail calls. Obviously, that's pretty unrealistic, but it's
         | possible.
        
           | ska wrote:
           | Assuming your compiler implements TCO properly, also.
        
         | cpldcpu wrote:
         | >That _should_ only make a difference for memory usage if your
         | C compiler isn't perfect
         | 
         | Considering that the PMS150 has an accumulator-based 8-bit
         | architecture which is almost hostile to C, it is safe to assume
         | that the compiler is not perfect :)
        
         | whobre wrote:
         | I bet something like Forth would work better on such a
         | microcontroller. It is known to produce very high code density
         | and embedding assembly is usually very straightforward.
        
         | varispeed wrote:
         | There is probably a way to change the calling convention to use
         | something else instead of the stack.
        
       ___________________________________________________________________
       (page generated 2024-10-20 23:01 UTC)