[HN Gopher] Implementing neural networks on the "3 cent" 8-bit m...
___________________________________________________________________
Implementing neural networks on the "3 cent" 8-bit microcontroller
Author : cpldcpu
Score : 146 points
Date : 2024-10-19 18:09 UTC (1 days ago)
(HTM) web link (cpldcpu.wordpress.com)
(TXT) w3m dump (cpldcpu.wordpress.com)
| magicalhippo wrote:
| Fun to see neural nets pushed to such extremes, really enjoyed
| the post.
|
| > The smallest models had to be trained without data
| augmentation, as they would not converge otherwise.
|
| Was this also the case for the 2-bit model you ended up with?
| cpldcpu wrote:
| Yes, as far as I remember the limit was somewhere around 1 kbyte
| total parameter size.
| malwrar wrote:
| Super interesting!
|
| I wish tfa would have found some way to measure the PMS150C
| implementation the headline brags about, but even the PFS154 (2x
| mem, 3x price) version is super neat! Interesting to see how the
| net in particular is built at such small scale. I also wish they
| included numbers about performance like they do in their linked
| CH32V003 post. I'm wondering how quick these MCUs are compared to
| each other and e.g. OP's PC, and how hot they get under sustained
| load.
| cpldcpu wrote:
| There are no performance profiling mechanisms on these small
| devices, and the timers are rather coarse.
|
| But it is easy to estimate the execution time:
|
| - mulacc of one weight takes 11 clock cycles.
|
| - There are 1696 weights in the model, each one is only touched
| once.
|
| - We can assume ~25%-50% overhead for loops and housekeeping
| (1:4 unrolled)
|
| => ~23000-28000 clock cycles per inference, which is less than
| 2ms at 16MHz
|
| Since this is an MLP, the inference time directly scales with
| the number of weights. (This would be different for a CNN)
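The estimate above can be reproduced with a few lines of arithmetic. This is a rough sketch that uses only the figures quoted in the comment (11 cycles per multiply-accumulate, 1696 weights, 25%-50% overhead, 16 MHz); the function and constant names are made up for illustration.

```python
# Back-of-the-envelope inference time from the figures quoted above.
CYCLES_PER_MAC = 11      # multiply-accumulate of one weight
NUM_WEIGHTS = 1696       # each weight is touched exactly once
CLOCK_HZ = 16_000_000    # 16 MHz system clock


def estimate_cycles(overhead):
    """Clock cycles per inference, given fractional loop overhead."""
    return round(NUM_WEIGHTS * CYCLES_PER_MAC * (1 + overhead))


def cycles_to_ms(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to milliseconds at the given clock."""
    return cycles * 1000 / clock_hz


low = estimate_cycles(0.25)   # 25% overhead
high = estimate_cycles(0.50)  # 50% overhead
print(low, high)
print(f"{cycles_to_ms(low):.2f} ms to {cycles_to_ms(high):.2f} ms")
```

Because every weight of an MLP is used exactly once, doubling `NUM_WEIGHTS` doubles the estimate. A CNN reuses its weights across positions, so this simple product would not apply there.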
|
| As for verifying on the PMS150C: I considered using an LED for
| valid/invalid output. But iterating with OTP devices is quite
| tedious when you do not have an emulator. Since both devices
| are code compatible, we can assume that the code works on the
| smaller device, though.
| wongarsu wrote:
| If flipping one of the output pins is fast enough you could
| use that in combination with an oscilloscope as a coarse but
| very accurate profiling method.
|
| Though I believe for most people "roughly 2ms" is good enough.
| Lerc wrote:
| I feel like to really get to the level of hypothetically useful
| it should be able to take the samples from an input source.
|
| I wonder if you could do it on the full 28*28 by never holding
| the full image in memory at once, just treating it as an input
| stream. Say, a 1D convolution on each line as it comes in, to
| turn a [1,28] into a [3,7]; buffer two lines of the [3,7] = 42
| values. Then, once three results of the third line's convolution
| have been produced ([3,3] = 9), start performing a 2D convolution
| using the first two lines [2,3,:3], replacing the data at the
| start (as it has already been processed).
| cpldcpu wrote:
| Yes, you could implement it in a way where the first layer is
| streamed and accumulates into the output activations in parallel
| in memory. This would limit the memory requirements for the
| input activations, but would increase execution time, as more
| activations have to be shuffled around.
|
| In this case I am streaming from ROM anyway, so it does not
| matter if the inputs are read only once or multiple times.
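A minimal sketch of the streamed first layer described above, assuming a plain fully connected layer with integer weights. The function names, toy weights, and input values are illustrative and not taken from the article. Each input is consumed as it arrives and added into every output accumulator, so only the accumulators need to stay in RAM, never the full input vector.

```python
def dense_batch(inputs, weights):
    """Reference version: needs the whole input vector in memory."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def dense_streamed(input_stream, weights):
    """Consume inputs one at a time; only the accumulators are kept."""
    acc = [0] * len(weights)          # one accumulator per output neuron
    for i, x in enumerate(input_stream):
        for j, row in enumerate(weights):
            acc[j] += row[i] * x      # touch every output for this input
    return acc


# Hypothetical toy layer: 4 inputs, 3 outputs.
weights = [
    [1, -1, 0, 1],
    [0, 1, 1, -1],
    [-1, 0, 1, 1],
]
pixels = [3, 0, 2, 5]

streamed = dense_streamed(iter(pixels), weights)
batch = dense_batch(pixels, weights)
print(streamed, batch)
```

Both versions perform the same number of multiply-accumulates. The streamed one trades input storage for repeated read-modify-write access to the accumulators, which matches the extra shuffling mentioned in the comment.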
| amelius wrote:
| This challenges only the memory of the MCU, not the speed.
|
| And it is a bit disappointing that they didn't finish the project
| by adding an 8x8 pixel camera and a 7-segment display.
| robertclaus wrote:
| Is an 8bit camera an off-the-shelf part? Or do you just mean
| data streaming in and out in general?
| amelius wrote:
| Here is an example of an 8x8 camera sensor for hobby use. You
| can filter out the IR if desired. There are many similar
| sensors, and they are often used for motion tracking.
|
| https://learn.adafruit.com/adafruit-amg8833-8x8-thermal-
| came...
| numpad0 wrote:
| AMG8833 is an actual thermal sensor array, not e.g. an 8x8
| photodiode array with IR sensitivity. It's of the thermopile
| type, which is different from the microbolometer type used in
| FLIR cameras (which have to go blank periodically, unlike
| thermopiles, which need not).
|
| Mouse sensors are 8x8ish cameras, but very few of them
| (basically only the genuine HP/Agilent/Avago ADNS-2610 at
| bottom-tier price) have a raw image export feature, for
| some reason.
|
| There are many other tiny potato camera parts on various
| markets, but most of them are missing datasheets and require
| complicated interfacing.
|
| Overall it's actually not so trivial to get a small image
| sensor for hobby experiments.
| vardump wrote:
| Optical mice have similar cameras.
| pjmlp wrote:
| As proof of concept, it is quite cool.
|
| However for going into production with something like this, maybe
| writing everything in Assembly, and not just some parts, would be
| much better.
|
| But after a quick search it seems the macro assembler story for
| RISC-V isn't that great.
| kragen wrote:
| The Padauk chips being discussed here aren't RISC-V. Gas is an
| adequate macro assembler for RISC-V, but I think the Padauk
| chips don't have anything similar. Still, you can get pretty
| far with m4... or writing a shell script with an echo function.
| Someone wrote:
| FTA: _"One major issue when programming these devices in C is
| that every function call consumes RAM for the return stack and
| function parameters. This is unavoidable"_
|
| It's not completely unavoidable: don't use function parameters
| (globals are your friends on these CPUs). You can't avoid having
| a return stack, but you can make as few function calls as
| possible (ideally zero, but you may have to write functions to
| fit things into ROM)
|
| > "To solve this, I flattened the inference code"
|
| I think that's "make as few function calls as possible"
|
| > and implemented the inner loop in assembly to optimize variable
| usage.
|
| That _should_ only make a difference for memory usage if your C
| compiler isn't perfect (but of course, it never is, certainly on
| CPUs like this one, which is a poor fit for C)
| dpassens wrote:
| You could kind of avoid the return stack, if you only ever do
| tail calls. Obviously, that's pretty unrealistic, but it's
| possible.
| ska wrote:
| Assuming your compiler implements TCO properly, also.
| cpldcpu wrote:
| >That _should_ only make a difference for memory usage if your
| C compiler isn't perfect
|
| Considering that the PMC150 has an accumulator-based 8-bit
| architecture which is almost hostile to C, it is safe to assume
| that the compiler is not perfect :)
| whobre wrote:
| I bet something like Forth would work better on such a
| microcontroller. It is known to produce very high code density
| and embedding assembly is usually very straightforward.
| varispeed wrote:
| There is probably a way to change the calling convention to use
| something other than the stack.
___________________________________________________________________
(page generated 2024-10-20 23:01 UTC)