[HN Gopher] Efficient Computer's Electron E1 CPU - 100x more eff...
       ___________________________________________________________________
        
       Efficient Computer's Electron E1 CPU - 100x more efficient than
       Arm?
        
       Author : rpiguy
       Score  : 123 points
       Date   : 2025-07-25 16:30 UTC (6 hours ago)
        
 (HTM) web link (morethanmoore.substack.com)
 (TXT) w3m dump (morethanmoore.substack.com)
        
       | rpiguy wrote:
       | The architecture diagram in the article resembles the approach
       | Apple took in the design of their neural engine.
       | 
       | https://www.patentlyapple.com/2021/04/apple-reveals-a-multi-...
       | 
       | Typically these architectures are great for compute. How will it
       | do on scalar tasks with a lot of branching? I doubt well.
        
       | SoftTalker wrote:
       | > Efficient's goal is to approach the problem by static
       | scheduling and control of the data flow - don't buffer, but run.
       | No caches, no out-of-order design, but it's also not a VLIW or
       | DSP design. It's a general purpose processor.
       | 
       | Sounds like a mainframe. Is there any similarity?
        
         | wmf wrote:
         | This has nothing to do with mainframes (which are fairly normal
         | general purpose computers).
        
       | Grosvenor wrote:
        | Is this the return of Itanium? Static scheduling and pushing
        | everything to the compiler - it sounds like it.
        
         | darksaints wrote:
         | It kinda sounds like it, though the article explicitly said
         | it's not VLIW.
         | 
          | I've always felt like Itanium was a great idea that came too
          | soon and was too poorly executed. It seemed like the majority
          | of the commercial failure came down to friction from switching
          | architectures and the inane pricing rather than the merits of
          | the architecture itself. Basically Intel being Intel.
        
           | bri3d wrote:
            | I disagree; Itanium was fundamentally flawed for general
            | purpose computing, and especially for time-shared general
            | purpose computing. VLIW is not practical in time-sharing
            | systems without completely rethinking the way cache works,
            | and Itanium didn't really do that.
           | 
           | As soon as a system has variable instruction latency, VLIW
           | completely stops working; the entire concept is predicated on
           | the compiler knowing how many cycles each instruction will
           | take to retire ahead of time. With memory access hierarchy
           | and a nondeterministic workload, the system inherently cannot
           | know how many cycles an instruction will take to retire
           | because it doesn't know what tier of memory its data
           | dependencies live in up front.
           | 
           | The advantage of out-of-order execution is that it
           | dynamically adapts to data availability.
           | 
           | This is also why VLIW works well where data availability is
           | _not_ dynamic, for example in DSP applications.
           | 
           | As for this Electron thing, the linked article is too puffed
           | to tell what it's actually doing. The first paragraph says
           | something about "no caches" but the block diagram has a bunch
           | of caches in it. It sort of sounds like an FPGA with bigger
           | primitives (configurable instruction tiles rather than
           | gates), which means that synchronization is going to continue
           | to be the problem and I don't know how they'll solve for
           | variable latency.
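
The variable-latency argument above can be illustrated with a toy model (all cycle counts and names here are invented for illustration, not taken from Itanium): a static schedule is only as good as the compiler's latency guess, and a single cache miss introduces stalls that no static schedule can hide.

```python
# Toy model (not Itanium-specific; cycle counts are invented): a static
# schedule assumes every load retires in a fixed number of cycles, so a
# cache miss forces stalls the compiler could not have scheduled around.

ASSUMED_LOAD_LATENCY = 3                    # what the compiler planned for
ACTUAL_LATENCY = {"hit": 3, "miss": 30}     # hypothetical cycle counts

def run_schedule(load_outcomes):
    """Run a chain of (load -> dependent op) pairs; return (cycles, stalls)."""
    cycles = stalls = 0
    for outcome in load_outcomes:
        actual = ACTUAL_LATENCY[outcome]
        cycles += actual + 1                # the load, then the dependent op
        stalls += max(0, actual - ASSUMED_LOAD_LATENCY)
    return cycles, stalls

print(run_schedule(["hit"] * 4))                    # (16, 0): schedule is perfect
print(run_schedule(["hit", "miss", "hit", "hit"]))  # (43, 27): one miss, 27 wasted cycles
```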
        
             | als0 wrote:
             | > VLIW is not practical in time-sharing systems without
             | completely rethinking the way cache works
             | 
             | Just curious as to how you would rethink the design of
             | caches to solve this problem. Would you need a dedicated
             | cache per execution context?
        
               | bri3d wrote:
               | That's the simplest and most obvious way I can think of.
               | I know the Mill folks were deeply into this space and
               | probably invented something more clever but I haven't
               | kept up with their research in many years.
        
             | hawflakes wrote:
              | Not to detract from your point, but Itanium's design was
              | meant to address code compatibility between generations. You
             | could have code optimized for a wider chip run on a
             | narrower chip because of the stop bits. The compiler still
             | needs to know how to schedule to optimize for a specific
             | microarchitecture but the code would still run albeit not
             | as efficiently.
             | 
              | As an aside, I never looked into the perf numbers, but
              | having adjustable register windows, while cool, probably made
              | for terrible context switching and/or spilling performance.
        
           | cmrdporcupine wrote:
           | It does feel maybe like the world has changed a bit now that
           | LLVM is ubiquitous with its intermediate representation form
           | being available for specialized purposes. Translation from IR
           | to a VLIW plan _should_ be easier now than the state of
           | compiler tech in the 90s.
           | 
           | But "this is a good idea just poorly executed" seems to be
           | the perennial curse of VLIW, and how Itanium ended up shoved
           | onto people in the first place.
        
         | mochomocha wrote:
         | On the other hand, Groq seems pretty successful.
        
         | wood_spirit wrote:
         | The Mill videos are worth watching again - there are variations
         | on NaT handling and looping and branching etc that make DSPs
         | much more general-purpose.
         | 
         | I don't know how similar this Electron is, but the Mill
         | explained how it could be done.
         | 
         | Edit: aha, found them!
         | https://m.youtube.com/playlist?list=PLFls3Q5bBInj_FfNLrV7gGd...
        
       | vendiddy wrote:
       | I don't know much about CPUs so maybe someone can clarify.
       | 
       | Is this effectively having a bunch of tiny processors on a single
       | chip each with its own storage and compute?
        
         | lawlessone wrote:
          | I think it's more like having the instructions your program
          | does spread across multiple tiny processors.
          | 
          | So one instruction gets done... its output is passed to the next.
          | 
          | Hopefully I've made somebody mad enough to explain why I am
          | wrong.
        
       | kendalf89 wrote:
       | This grid based architecture reminds me of a programming game
       | from zactronics, TIS-100.
        
       | wolfi1 wrote:
        | reminds me of the architecture of transputers, but on the same
        | silicon
        
         | fidotron wrote:
         | Yep, or the old GreenArrays GA144 or even maybe XMOS with more
         | compiler magic.
         | 
         | One of the big questions here is how quickly it can switch
         | between graphs, or if that will be like a context switch from
         | hell. In an embedded context that's likely to become a headache
         | way too fast, so the idea of a magic compiler fixing it so you
         | don't have to know what it's doing sounds like a fantasy
         | honestly.
        
           | nolist_policy wrote:
            | Also, what would cycle-accurate assembly look like for this
            | chip?
        
       | ZiiS wrote:
        | Percentage chance this is 100X more efficient at the general
        | purpose computing ARM is optimized for: 1/100%
        
       | renewiltord wrote:
       | Is there a dev board available? Seems hard to find. I am curious.
        
       | pclmulqdq wrote:
       | This is a CGRA. It's like an FPGA but with bigger cells. It's not
       | a VLIW core.
       | 
       | I assume that like all past attempts at this, it's about 20x more
       | efficient when code fits in the one array (FPGAs get this ratio),
       | but if your code size grows past something very trivial, the grid
       | config needs to switch and that costs tons of time and power.
        
         | rf15 wrote:
         | I agree this is very "FPGA-shaped" and I wonder if they have
         | further switching optimisations on hand.
        
       | artemonster wrote:
        | As a person who is highly invested and interested in the CPU
        | space, especially embedded, I am HIGHLY skeptical of such claims.
        | Somebody played TIS-100, remembered GA144 failed and decided to
        | try their own. You know what would be a simple proof of your
        | claims? No, not a press release. No, not a pitch deck or a
        | youtube video. And NO, not even working silicon, you silly. A
        | SIMPLE FUCKING ISA EMULATOR WITH A PROFILER. Instead we got a
        | bunch of whitepapers. Yeah, I'd call it a 90% chance of total BS
        | and vaporware.
        
         | bmenrigh wrote:
          | I like Ian but he's rapidly losing credibility by posting so
          | much sponsored content. Many of his videos and articles now are
          | basically just press releases.
        
         | wmf wrote:
         | There's >20 years of academic research behind dataflow
         | architectures going back to TRIPS and MIT RAW. It's not
         | literally a scam but the previous versions weren't practical
         | and it's unlikely this version succeeds either. I agree that if
         | the compiler was good they would release it and if they don't
         | release it that's probably because it isn't good.
        
         | jecel wrote:
         | The 2022 PhD thesis linked from their web site includes a
         | picture of what they claim was an actual chip made using a 22nm
         | process. I understand that the commercial chip might be
         | different, but it is possible that the measurements made for
         | the thesis could be valid for their future products as well.
        
       | lazyeye wrote:
       | https://www.efficient.computer/technology
        
       | archipelago123 wrote:
       | It's a dataflow architecture. I assume the hardware
       | implementation is very similar to what is described here:
       | https://csg.csail.mit.edu/pubs/memos/Memo-229/Memo-229.pdf. The
       | problem is that it becomes difficult to exploit data locality,
        | and there is only so much optimization you can perform at compile
        | time. Also, the motivation for these types of architectures (e.g.
        | the lack of ILP in von Neumann-style architectures) is non-existent
        | in modern OoO cores.
        
         | timschmidt wrote:
         | Out of order cores spend an order of magnitude more logic and
         | energy than in-order cores handling invalidation, pipeline
         | flushes, branch prediction, etc etc etc... All with the goal of
         | increasing performance. This architecture is attempting to
         | lower the joules / instruction at the cost of performance, not
         | increase energy use in exchange for performance.
        
       | Imustaskforhelp wrote:
        | Pardon me, but could somebody here explain this to me like I am
        | 15? It's late at night and I can't go down another rabbit hole,
        | so I would appreciate it. Cheers and good night, fellow HN users.
        
         | wmf wrote:
         | Probably not. This is graduate-level computer architecture.
        
         | hencoappel wrote:
         | Found this video a good explanation.
         | https://youtu.be/xuUM84dvxcY?si=VPBEsu8wz70vWbX4
        
         | elseless wrote:
         | Sure. You can think of a (simple) traditional CPU as executing
         | instructions in _time_ , one-at-a-time[1] -- it fetches an
         | instruction, decodes it, performs an arithmetic/logical
         | operation, or maybe a memory operation, and then the
         | instruction is considered to be complete.
         | 
         | The Efficient architecture is a CGRA (coarse-grained
         | reconfigurable array), which means that it executes
         | instructions in _space_ instead of time. At compile time, the
         | Efficient compiler looks at a graph made up of all the
         | "unrolled" instructions (and data) in the program, and decides
         | how to map it all spatially onto the hardware units. Of course,
         | the graph may not all fit onto the hardware at once, in which
         | case it must also be split up to run in batches over time. But
         | the key difference is that there's this sort of spatial
         | unrolling that goes on.
         | 
         | This means that a lot of the work of fetching and decoding
         | instructions and data can be eliminated, which is good.
         | However, it also means that the program must be mostly, if not
         | completely, static, meaning there's a very limited ability for
         | data-dependent branching, looping, etc. to occur compared to a
         | CPU. So even if the compiler claims to support C++/Rust/etc.,
         | it probably does not support, e.g., pointers or dynamically-
         | allocated objects as we usually think of them.
         | 
         | [1] Most modern CPUs don't actually execute instructions one-
         | at-a-time -- that's just an abstraction to make programming
         | them easier. Under the hood, even in a single-core CPU, there
         | is all sorts of reordering and concurrent execution going on,
         | mostly to hide the fact that memory is much slower to access
         | than on-chip registers and caches.
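
The "execute in space" idea in the comment above can be sketched as a toy dataflow simulation in Python. This is a generic dataflow model, not Efficient's actual tile design: each tile is pinned to one operation at "compile time" and fires whenever all of its operand slots are filled.

```python
# Toy dataflow sketch (generic, not Efficient's actual tile ISA): each
# tile holds one fixed operation and fires when its operand slots fill.
import operator

class Tile:
    def __init__(self, op, consumers, arity=2):
        self.op = op                  # fixed operation pinned to this tile
        self.consumers = consumers    # (tile, slot) pairs fed by our result
        self.slots = [None] * arity   # operand registers

    def deliver(self, slot, value):
        self.slots[slot] = value
        if all(v is not None for v in self.slots):   # fire when full
            result = self.op(*self.slots)
            self.slots = [None] * len(self.slots)
            for tile, s in self.consumers:
                tile.deliver(s, result)

# Spatial layout for (a + b) * (c - d), wired at "compile time".
results = []
sink = Tile(results.append, [], arity=1)
mul = Tile(operator.mul, [(sink, 0)])
add = Tile(operator.add, [(mul, 0)])
sub = Tile(operator.sub, [(mul, 1)])

# Data flows in; no program counter sequences the operations.
add.deliver(0, 2); add.deliver(1, 3)    # a=2, b=3
sub.deliver(0, 10); sub.deliver(1, 4)   # c=10, d=4
print(results)                          # [30]
```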
        
           | pclmulqdq wrote:
           | Pointers and dynamic objects are probably fine given the
           | ability to do indirect loads, which I assume they have (Side
           | note: I have built b-trees on FPGAs before, and these kinds
           | of data structures are smaller than you think). It's actually
           | pure code size that is the problem here rather than specific
           | capabilities, as long as the hardware supports those
           | instructions.
           | 
           | Instead of assembly instructions taking time in these
           | architectures, they take space. You will have a capacity of
           | 1000-100000 instructions (including all the branches you
           | might take), and then the chip is full. To get past that
           | limit, you have to store state to RAM and then reconfigure
           | the array to continue computing.
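
The capacity limit described above lends itself to a toy cost model (all numbers invented for illustration): once the instruction graph exceeds the array, the program must be split into sequential configurations, paying a reconfiguration cost (state spilled to RAM, array rewired) between each batch.

```python
# Toy cost model (all numbers invented): programs larger than the array
# pay a reconfiguration penalty between sequential configurations.

TILE_CAPACITY = 4      # hypothetical: instructions that fit at once
RECONFIG_COST = 50     # hypothetical: cycles per configuration swap

def run_cost(num_instructions):
    """Cycles for a straight-line program at one instruction per cycle."""
    configs = -(-num_instructions // TILE_CAPACITY)   # ceiling division
    return num_instructions + max(0, configs - 1) * RECONFIG_COST

print(run_cost(4))     # 4: fits in one configuration, no swap cost
print(run_cost(12))    # 112: three configurations, two expensive swaps
```

The cliff between the two cases is the point other commenters raise: the claimed efficiency holds only while the whole graph fits at once.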
        
             | kannanvijayan wrote:
             | Hmm. You'd be able to trade off time for that space by
             | using more general configurations that you can dynamically
             | map instruction-sequences onto, no?
             | 
             | The mapping wouldn't be as efficient as a bespoke
             | compilation, but it should be able to avoid the
             | configuration swap-outs.
             | 
             | Basically a set of configurations that can be used as an
             | interpreter.
        
       | icandoit wrote:
       | I wondered if this was using interaction combinators like the
       | vine programming language does.
       | 
       | I haven't read much that explains how they do it.
       | 
       | I have been very slowly trying to build a translation layer
       | between starlark and vine as a proof of concept of massively
       | parallel computing. If someone better qualified finds a better
        | solution, the market is sure to have demand for you. A translation
       | layer is bound to be cheaper than teaching devs to write in jax
       | or triton or whatever comes next.
        
       | variadix wrote:
       | Pretty interesting concept, though as other commenters have
       | pointed out the efficiency gains likely break down once your
       | program doesn't fit onto the mesh all at once. Also this looks
       | like it requires a "sufficiently smart compiler", which isn't a
       | good sign either. The need to do routing etc. reminds me of the
        | problems FPGAs have during place and route (effectively the
        | minimum-cut problem on a graph, i.e. NP-hard); hopefully
        | compilation doesn't take as long as FPGA synthesis does.
        
       | gamache wrote:
       | Sounds a lot like GreenArray GA144 (https://www.greenarraychips.c
       | om/home/documents/greg/GA144.ht...)! Sadly, without a bizarre and
       | proprietary FORTH dialect to call its own, I fear the E1 will not
       | have the market traction of its predecessor.
        
         | jnpnj wrote:
         | That was my first thought too. I really like the idea of
         | interconnected nodes array. There's something biological,
         | thinking in topology and neighbours diffusion that I find
         | appealing.
        
           | londons_explore wrote:
           | One day someone will get it working...
           | 
           | Data transfer is slow and power hungry - it's obvious that
           | putting a little bit of compute next to every bit of memory
           | is the way to minimize data transfer distance.
           | 
           | The laws of physics can't be broken, yet people demand more
            | and more performance, so eventually this issue will be worth
            | the difficulty of solving.
        
       | trhway wrote:
       | > spatial data flow model. Instead of instructions flowing
       | through a centralized pipeline, the E1 pins instructions to
       | specific compute nodes called tiles and then lets the data flow
       | between them. A node, such as a multiply, processes its operands
       | when all the operand registers for that tile are filled. The
       | result then travels to the next tile where it is needed. There's
       | no program counter, no global scheduler. This native data-flow
       | execution model supposedly cuts a huge amount of the energy
       | overhead typical CPUs waste just moving data.
       | 
       | should work great for NN.
        
       ___________________________________________________________________
       (page generated 2025-07-25 23:00 UTC)