[HN Gopher] Towards the Cutest Neural Network
___________________________________________________________________
Towards the Cutest Neural Network
Author : surprisetalk
Score : 106 points
Date : 2025-05-01 18:10 UTC (4 days ago)
(HTM) web link (kevinlynagh.com)
(TXT) w3m dump (kevinlynagh.com)
| mattdesl wrote:
| I wonder how well BitNet (ternary weights) would work for this.
| It seems like a promising way forward for constrained hardware.
|
| https://arxiv.org/abs/2310.11453
|
| https://github.com/cpldcpu/BitNetMCU/blob/main/docs/document...
| gitroom wrote:
| I gotta say, I'm always interested in new ways to make stuff
| lighter especially for small devices - you think these clever
| tricks actually hold up for real-world use or just look cool on
| paper?
| Onavo wrote:
| > _since our input data comes from multiple sensors and the the
| output pose has six components (three spatial positions and three
| spatial rotations)_
|
| Typo: two "the"
|
| For robotics/inverse pose applications, don't people usually use
| a 3x3 matrix (three rotations, three spatial) for coordinate
| representation? Otherwise you get weird gimbal lock issues (I
| think).
| lynaghk wrote:
| For my application I need just the translations and Euler
| angles. The range of poses is mechanically constrained so I
| don't have to worry about gimbal lock. But yeah, my limited
| understanding matches yours that other parameterizations are
| more useful in general contexts.
|
| This post and interactive explanations have been on my backlog
| to read and internalize:
| https://thenumb.at/Exponential-Rotations/
|
| (Also: Thanks for pointing out the typo, I just deployed a
| fix.)
| 01HNNWZ0MV43FF wrote:
| Hey there op. I don't know what your sensors are measuring
| (distance to a point maybe? Or angle from a Valve lighthouse
| for inside-out tracking?)
|
| But here's my "why didn't you just"
|
| Since you have a forward simulation function (pose to
| measurements), why didn't you use an iterative solver to
| reverse it? Coordinate descent is easy to code and if you
| have a constrained range of poses you can probably just use
| multiple starting points to avoid getting stuck with a local
| minimum. Then use the last solution as a starting point for
| the next one to save iterations.
|
| Sure, it's not closed-form like an NN and it can still have
| pathological cases, but the code is a little more transparent.
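|
| Roughly what I have in mind (an untested sketch -- f stands in
| for your forward simulation, pose in, predicted measurements
| out):
|
|     import numpy as np
|
|     def solve_pose(f, measured, starts, iters=100, step=0.01):
|         # Coordinate descent: nudge one pose component at a
|         # time, keep the nudge if it lowers the squared error.
|         def err(p):
|             return np.sum((f(p) - measured) ** 2)
|         best, best_err = None, np.inf
|         for p in starts:                  # several start points
|             p = np.array(p, dtype=float)  # to dodge local minima
|             for _ in range(iters):
|                 for i in range(len(p)):
|                     for delta in (step, -step):
|                         q = p.copy()
|                         q[i] += delta
|                         if err(q) < err(p):
|                             p = q
|             if err(p) < best_err:
|                 best, best_err = p, err(p)
|         return best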
| lynaghk wrote:
| That's a reasonable idea, but unfortunately wouldn't work
| in my case since the simulation relies on a lot of
| scientific libraries in Python and I need the inversion to
| happen on the microcontroller.
|
| When you say "coordinate descent" do you mean gradient
| descent? I.e., updating a potential pose using the gradient
| of a loss term (e.g., (predicted sensor reading - actual
| sensor reading)**2)?
|
| I bet that would work, but a tricky part would be
| calculating gradients. I'm not sure if the Python libraries
| I'm using support that. My understanding is that automatic
| differentiation through libraries might be easier in a
| language like Julia where dual numbers flow through
| everything via the multiple dispatch mechanism.
| thomassmith65 wrote:
| What benefit does jax.nn provide over rolling one's own? There
| are countless examples on the web of small neural networks,
| written from scratch.
| thih9 wrote:
| Could you point to an example that you like more? One of the
| author's goals is to:
|
| > solicit "why don't you just ..." emails from experienced
| practitioners who can point me to the library/tutorial I've
| been missing =D (see the alternatives-considered down the page
| for what I struck out on)
| light_hue_1 wrote:
| All of this is absurdly complicated. Exactly what I would expect
| from a new student who doesn't know what they're doing and has no
| one to teach them how to do engineering in a systematic manner.
| I don't mean this as an insult. I teach this stuff and have seen
| it hundreds of times.
|
| You should look for "post training static quantization" also
| called . There are countless ways to quantize. This will quantize
| both the weights and the activations after training.
|
| You're doing this on hard mode for no reason. This is typical and
| something I often need to break people out of. Optimizing for
| performance by doing custom things in Jax when you're a beginner
| is a terrible path to take.
|
| Performance is not your problem. You're training a trivial
| network that would have run on a CPU 20 years ago.
|
| There's no clear direction here, just trying complicated stuff in
| no logical order with no learning or dependencies between steps.
| You need to treat these problems as scientific experiments. What
| do I do to learn more about my domain, what do I change depending
| on the answer I get, etc. Not, now it's time to try something
| else random like jax.
|
| Worse. You need to learn the key lesson in this space. Credit
| assignment for problems is extremely hard. If something isn't
| working why isn't it? Because of a bug? A hopeless problem? Using
| a crappy optimizer? Etc. That's why you should start in a
| framework that works and escape it later if you want.
|
| Here's a simple plan to do this:
|
| First forget about quantization. Use pytorch. Implement your
| trivial network in 5 lines. Train it with Adam. Make sure it
| works. Make sure your problem is solvable with the data that you
| have and the network you've chosen and your activation functions
| and the loss and the optimizer (use Adam, forget about doing this
| stuff by hand for now).
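|
| Something like this (a rough sketch -- the layer sizes and the
| random stand-in data are made up, only the 6-component pose
| output comes from the post):
|
|     import torch
|     import torch.nn as nn
|
|     # Tiny fp32 baseline: sensor readings in, 6-DoF pose out.
|     net = nn.Sequential(nn.Linear(8, 32), nn.ReLU(),
|                         nn.Linear(32, 6))
|     opt = torch.optim.Adam(net.parameters(), lr=1e-3)
|     loss_fn = nn.MSELoss()
|
|     x = torch.randn(1024, 8)  # stand-in for simulated sensors
|     y = torch.randn(1024, 6)  # stand-in for matching poses
|
|     for _ in range(1000):
|         opt.zero_grad()
|         loss = loss_fn(net(x), y)
|         loss.backward()
|         opt.step()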
|
| > Unless I had an expert guide who was absolutely sure it'd be
| straightforward (email me!), I'd avoid high-level frameworks like
| TensorFlow and PyTorch and instead implement the quantization-
| aware training myself.
|
| This is exactly backwards. Unless you have an expert, never
| implement anything yourself. If you don't have one, rely on what
| already exists. Because you can logically narrow down the options
| for what works and what's wrong. If you do it yourself you're
| always lost.
|
| Once you have that working start backing off. Slowly change the
| working network into what you need. Step by step. At every step
| write down why you think your change is good and what you would
| do if it isn't. Then look at the results.
|
| Forget about microflow-rs or whatever. Train with pytorch, export
| to onnx, generate C code from your onnx for inference.
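|
| The export step itself is one call (sketch; the toy layer sizes
| are placeholders):
|
|     import torch
|     import torch.nn as nn
|
|     net = nn.Sequential(nn.Linear(8, 32), nn.ReLU(),
|                         nn.Linear(32, 6))
|     # Trace with a dummy input of the right shape; onnx2c or
|     # similar can then turn model.onnx into plain C.
|     torch.onnx.export(net, torch.randn(1, 8), "model.onnx")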
|
| Read the pytorch guide on PTSQ and use it.
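|
| The eager-mode version looks roughly like this (a sketch; exact
| module paths move around between torch versions, so treat it as
| the shape of the API, not gospel):
|
|     import torch
|     import torch.nn as nn
|
|     class Wrapped(nn.Module):
|         # Eager-mode PTSQ wants explicit quant/dequant stubs.
|         def __init__(self, net):
|             super().__init__()
|             self.quant = torch.quantization.QuantStub()
|             self.net = net
|             self.dequant = torch.quantization.DeQuantStub()
|         def forward(self, x):
|             return self.dequant(self.net(self.quant(x)))
|
|     net = nn.Sequential(nn.Linear(8, 32), nn.ReLU(),
|                         nn.Linear(32, 6))
|     m = Wrapped(net).eval()
|     m.qconfig = torch.quantization.get_default_qconfig("fbgemm")
|     torch.quantization.prepare(m, inplace=True)
|     with torch.no_grad():
|         m(torch.randn(256, 8))   # calibration on sample inputs
|     torch.quantization.convert(m, inplace=True)
|     # Weights are now int8, with per-layer scales and zero
|     # points you can read off and bake into firmware.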
| revskill wrote:
| Well said. Thanks.
| omneity wrote:
| Despite the tone this is excellent advice! I had similar
| impressions reading the article and was wondering if I missed
| something.
| seletskiy wrote:
| I kind of see your point, but only in the context of working on a
| time-sensitive task which others rely upon. But if it is a
| hobby/educational project, what is wrong with doing things
| yourself, and resorting to decomposing an existing solution if
| you can't figure out why yours is not working?
|
| There's nothing better for understanding something than trying to
| do that "something" from scratch yourself.
| bubblyworld wrote:
| I think the point is that OP is learning things about a wide
| variety of topics that aren't really relevant to their stated
| goal, i.e. solving the sensor/state inference problem.
|
| Which, as you say, can be valuable! There's nothing wrong
| with that. But the more complexity you add the less likely
| you are to actually solve the problem (all else being equal,
| some problems are just inherently complex).
| jasonjmcghee wrote:
| Targeting ONNX and using something like
| https://github.com/kraiskil/onnx2c as parent mentioned is good
| advice.
| JanSchu wrote:
| Nice write-up. A couple of notes from doing roughly the same
| dance on Cortex-M0 and M3 boards for sensor fusion.
|
| 1. You can, in fact, get rid of every FP instruction on M0. The
| trick is to pre-bake the scale and zero_point into a single
| fixed-point multiplier per layer (the dyadic form you mentioned).
| The formula is
|
| y = ((W*x + b) * M) >> s
|
| where M fits in an int32 and s is the power-of-two shift. You
| compute M and s once on the host, write them as const tables, and
| your inner loop is literally a MAC followed by a multiply and a
| shift. No soft-float library, no division. (See the sketch at the
| end of this comment for how M and s fall out of the float scale.)
|
| 2. CMSIS-NN already gives you the fast int8 kernels. The docs are
| painful but you can steal just four files:
| arm_fully_connected_q7.c, arm_nnsupportfunctions.c, and their
| headers. On M0 this compiled to ~3 kB for me. Feed those kernels
| fixed-point activations and you only pay for the ops you call.
|
| 3. Workflow that kept me sane
|
| Prototype in PyTorch. Tiny net, ReLU, MSE, Adam, done.
|
| torch.quantization.quantize_qat for quantization-aware training.
| Export to ONNX, then run a one-page Python script that dumps .h
| files with weight, bias, M, s.
|
| Hand-roll the inference loop in C. It is about 40 lines per
| layer, easy to unit-test on the host with the same vectors you
| trained on.
|
| By starting with a known-good fp32 model you always have a
| checksum: the int8 path must match fp32 within tolerance or you
| know exactly where to look.
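|
| Concretely, computing M and s from the float scale looks roughly
| like this (sketch; assumes the combined scale is below 1, which
| it normally is for requantization):
|
|     def dyadic(scale, bits=31):
|         # Approximate scale as M * 2**-(bits + s), M in int32.
|         s = 0
|         while scale < 0.5:
|             scale *= 2.0
|             s += 1
|         M = round(scale * (1 << bits))
|         return M, s + bits
|
|     # On the host: combined = in_scale * w_scale / out_scale
|     M, s = dyadic(0.0007)
|
|     # On the device, per output (all integer; the multiply
|     # needs a 64-bit intermediate on a 32-bit MCU):
|     acc = 12345             # accumulated W*x + b for one output
|     y = (acc * M) >> s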
| lynaghk wrote:
| Awesome, thanks! This is exactly the kind of experienced take I
| was hoping my blog post would summon =D
|
| Re: computing M and s, does torch.quantization.quantize_qat do
| this or do you do it yourself from the (presumably f32)
| activation scaling that torch finds?
|
| I don't have much experience with this kind of numerical
| computing, so I have no intuition about how much the
| "quantization" of selecting M and s might impact the overall
| performance of the network. I.e., whether
|
| - M and s should be trained as part of QAT (e.g., the "Learned
| Step Size Quantization" paper)
|
| - it's fine to just deterministically compute M and s from the
| f32 activation scaling.
|
| Also: Thanks for the tips re: CMSIS-NN, glad to know it's
| possible to use in a non-framework way. Any chance your example
| is open source somewhere?
| alexpotato wrote:
| If people like "give me the simplest possible working coding
| example" of neural networks, I highly recommend this one:
|
| "A Neural Network in 11 lines of Python (Part 1)":
| https://iamtrask.github.io/2015/07/12/basic-python-network/
| QuadmasterXLII wrote:
| Experienced practitioner here: the second half of the post
| describes doing everything exactly the way I have done it (the
| only differences are that I picked C++ and Eigen instead of Rust
| and nalgebra for inference, and I used torch's ndarray and
| backprop tools instead of jax's, with the analogous "just print
| out C++ code from python" approach to weight serialization). You
| picked up on the key insight, which is that the size of the code
| needed to just directly implement the inference equations is much
| smaller than the size of the configuration file of any possible
| framework flexible enough to meet your requirements (Rust, no
| inference-time allocation, no inference-time floating point,
| trained from scratch, ultra small parameter count, ...)
| hansvm wrote:
| The last time I did anything like this, the easiest workflow I
| found was to use your favorite high-level runtime for training
| and just implement a serializer converting the model into source
| code for your target embedded system. Hand-code the inference
| loop. This is exactly the strategy TFA landed on.
|
| One advantage of having it implemented in code is that you can
| observe and think about the instructions being generated. TFA
| didn't talk at all about something pretty important for
| small/fast neural networks -- the normal "cleanup" code (padding,
| alignment, length alignment, data-dependent horizontal sums, etc)
| can dwarf the actual mul->add execution times. You might want to,
| e.g., ensure your dimensions are all multiples of 8. You
| definitely want to store weights as column-major instead of row-
| major if the network is written as vec @ mat instead of mat @ vec
| (and vice versa for the latter).
|
| When you're baking weights and biases into code like that, use an
| affine representation -- explicitly pad the input with the number
| one, plus however many extra zeroes are needed by whatever other
| length-padding requirements make sense for your problem (usually
| zero extra for embedded, but this is a similar workflow to low-
| resource networks on traditional computers, where you probably
| want vectorization).
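|
| A toy example of what I mean by the affine trick (my sketch, not
| from TFA):
|
|     import numpy as np
|
|     W = np.array([[2.0, 0.0], [0.0, 3.0]])  # 2x2 weights
|     b = np.array([1.0, -1.0])                # bias
|
|     # Append the bias as an extra column and feed the input
|     # with a trailing 1, so inference is one mat-vec per layer.
|     Wa = np.hstack([W, b[:, None]])          # 2x3 affine weights
|     x = np.array([5.0, 7.0])
|     xa = np.append(x, 1.0)                   # padded input
|
|     assert np.allclose(Wa @ xa, W @ x + b)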
|
| Floats are a tiny bit hard to avoid for dot products. For similar
| precision, you require nearly twice the bit count in a fixed-
| point representation just to make the multiplies work, plus some
| extra bits proportional to the log2 of the dimension. E.g., if
| you trained on f16 inputs then you'll have roughly comparable
| precision with i32 fixed-point weights, and that's assuming you
| go through the effort to scale and shift everything into an
| appropriate numerical regime. Twice the instruction count (or
| thereabouts) on twice the register width makes fixed-point 2-4x
| slower for similar precision than a hardware float, supposing
| those wide instructions exist for your microcontroller, and soft
| floats are closer to 10x slower for multiply-accumulate. If
| you're emulating wide integer instructions, just use soft floats.
| If you don't care about a 4x slowdown, just use soft floats.
|
| Training can be a little finicky for small networks. At a
| minimum, you probably want to create train/test/validate sets and
| have many training runs. There are other techniques if you want
| to go down a rabbit hole.
|
| Other ML architectures can be much more performant here.
| Gradient-boosted trees are already SOTA on many of these
| problems, and oblivious trees map extremely well to normal
| microcontroller instruction sets. By skipping the multiplies,
| your fixed-point precision is on par with floats of similar bit-
| width, making quantization a breeze.
| imtringued wrote:
| This is such a confused blogpost I swear this had to be a
| teenager just banging their head against the wall.
|
| Wanting to natively train a quantized neural network is stupid
| unless you are training directly on your microcontroller. I was
| constantly waiting for the author to explain their special
| circumstances and it turns out they don't have any. They just
| have a standard TinyML [0] use case that's been done to death
| with fixed-point quantization-aware training, which, unlike what
| the author of the blog post said, doesn't rely on terabytes of
| data.
|
| QAT is done on a conventionally trained model with much less data
| than the full training process. Doing QAT early has no benefits.
| The big downside of QAT isn't that you need a lot of data, it's
| that you need the same data distribution as the original training
| data and nobody has access to that, because only the weights are
| published.
|
| [0] https://medium.com/@thommaskevin/tinyml-quantization-
| aware-t...
| jasonjmcghee wrote:
| Out of curiosity, did you consider bayesian state estimators?
|
| For example, an unscented kalman filter:
| https://www.mathworks.com/help/control/ug/nonlinear-state-es...
| nico wrote:
| Great article. For a moment, I thought this would be about a gen
| AI that would turn any input into a "kawaii" version of it
|
| Anyway, excellent insights and detail
| gwern wrote:
| My suggestion would be that, since you want a tiny integer-only
| NN tailored for a specific computer, are only occasionally
| training one for a specific task, and you have a simulator to
| generate unlimited data, you simply do random search or an
| evolutionary method like CMA-ES.
|
| They are easy to understand and code up by hand in a few lines
| (which is one reason you won't find any libraries for them - they
| are the 'leftpad' or 'isEven' of NNs, the effort it would take to
| install and understand and use a library often exceeds what it
| would take to just write it yourself), will handle any NN
| topology or numeric type you can invent, and will train very fast
| in this scenario.
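|
| To give a sense of how little code this is, a bare-bones version
| (a sketch; score() is assumed to be your simulator-backed loss
| over a flat integer weight vector, lower is better):
|
|     import numpy as np
|
|     def evolve(score, n_params, iters=20000, sigma=4.0):
|         # (1+1)-style search over int8 weights: mutate the
|         # current best, keep the mutant if it scores better.
|         rng = np.random.default_rng(0)
|         best = rng.integers(-128, 128, n_params).astype(np.int8)
|         best_loss = score(best)
|         for _ in range(iters):
|             cand = best + rng.normal(0, sigma, n_params).round()
|             cand = np.clip(cand, -128, 127).astype(np.int8)
|             loss = score(cand)
|             if loss < best_loss:
|                 best, best_loss = cand, loss
|         return best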
___________________________________________________________________
(page generated 2025-05-05 23:01 UTC)