[HN Gopher] Efficient Computer's Electron E1 CPU - 100x more eff...
___________________________________________________________________
Efficient Computer's Electron E1 CPU - 100x more efficient than
Arm?
Author : rpiguy
Score : 123 points
Date : 2025-07-25 16:30 UTC (6 hours ago)
(HTM) web link (morethanmoore.substack.com)
(TXT) w3m dump (morethanmoore.substack.com)
| rpiguy wrote:
| The architecture diagram in the article resembles the approach
| Apple took in the design of their neural engine.
|
| https://www.patentlyapple.com/2021/04/apple-reveals-a-multi-...
|
| Typically these architectures are great for compute. How will it
| do on scalar tasks with a lot of branching? I doubt well.
| SoftTalker wrote:
| > Efficient's goal is to approach the problem by static
| scheduling and control of the data flow - don't buffer, but run.
| No caches, no out-of-order design, but it's also not a VLIW or
| DSP design. It's a general purpose processor.
|
| Sounds like a mainframe. Is there any similarity?
| wmf wrote:
| This has nothing to do with mainframes (which are fairly normal
| general purpose computers).
| Grosvenor wrote:
| Is this the return of Itanium? Static scheduling and pushing
| everything to the compiler - it sounds like it.
| darksaints wrote:
| It kinda sounds like it, though the article explicitly said
| it's not VLIW.
|
| I've always felt like Itanium was a great idea that came too
| soon and was too poorly executed. It seemed like the majority
| of the commercial failure came down to friction from switching
| architectures and the inane pricing rather than the merits of
| the architecture itself. Basically Intel being Intel.
| bri3d wrote:
| I disagree; Itanium was fundamentally flawed for general
| purpose computing, and especially for time-shared general
| purpose computing. VLIW is not practical in time-sharing
| systems without completely rethinking the way cache works,
| and Itanium didn't really do that.
|
| As soon as a system has variable instruction latency, VLIW
| completely stops working; the entire concept is predicated on
| the compiler knowing how many cycles each instruction will
| take to retire ahead of time. With memory access hierarchy
| and a nondeterministic workload, the system inherently cannot
| know how many cycles an instruction will take to retire
| because it doesn't know what tier of memory its data
| dependencies live in up front.
|
| The advantage of out-of-order execution is that it
| dynamically adapts to data availability.
|
| This is also why VLIW works well where data availability is
| _not_ dynamic, for example in DSP applications.
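The latency argument above can be sketched as a toy model. This is a minimal illustration, not any real ISA: the 3-cycle assumed load latency and the 20-cycle miss penalty are invented numbers, chosen only to show how a static schedule slides when the compiler's latency guess is wrong.

```python
# Toy model of why VLIW needs fixed latencies: the compiler bakes an
# assumed load latency into the schedule, placing a dependent add in
# the slot where the load is supposed to retire. If the load misses
# in cache, there is no hardware reordering to hide it -- the whole
# statically-scheduled bundle slides.

ASSUMED_LOAD_LATENCY = 3      # cycles the compiler assumed at build time

def static_schedule_cycles(actual_load_latency: int) -> int:
    """Cycles for: load r1; (filler ops); add r2, r1, r3."""
    # The add was statically placed at cycle ASSUMED_LOAD_LATENCY;
    # any extra latency becomes a machine-wide stall.
    stall = max(0, actual_load_latency - ASSUMED_LOAD_LATENCY)
    return ASSUMED_LOAD_LATENCY + stall + 1   # +1: the add itself

print(static_schedule_cycles(3))    # cache hit: runs as scheduled -> 4
print(static_schedule_cycles(20))   # cache miss: everything slides -> 21
```

An out-of-order core would instead let independent instructions retire around the stalled load, which is exactly the dynamic adaptation described above.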
|
| As for this Electron thing, the linked article is too puffed
| to tell what it's actually doing. The first paragraph says
| something about "no caches" but the block diagram has a bunch
| of caches in it. It sort of sounds like an FPGA with bigger
| primitives (configurable instruction tiles rather than
| gates), which means that synchronization is going to continue
| to be the problem and I don't know how they'll solve for
| variable latency.
| als0 wrote:
| > VLIW is not practical in time-sharing systems without
| completely rethinking the way cache works
|
| Just curious as to how you would rethink the design of
| caches to solve this problem. Would you need a dedicated
| cache per execution context?
| bri3d wrote:
| That's the simplest and most obvious way I can think of.
| I know the Mill folks were deeply into this space and
| probably invented something more clever but I haven't
| kept up with their research in many years.
| hawflakes wrote:
| Not to detract from your point, but Itanium's design was to
| address the code compatibility between generations. You
| could have code optimized for a wider chip run on a
| narrower chip because of the stop bits. The compiler still
| needs to know how to schedule to optimize for a specific
| microarchitecture but the code would still run albeit not
| as efficiently.
|
| As an aside, I never looked into the perf numbers but
| having adjustable register windows while cool probably made
| for terrible context switching and/or spilling performance.
| cmrdporcupine wrote:
| It does feel like the world has changed a bit now that LLVM
| is ubiquitous, with its intermediate representation available
| for specialized purposes. Translation from IR to a VLIW plan
| _should_ be easier now than it was with the state of compiler
| tech in the 90s.
|
| But "this is a good idea just poorly executed" seems to be
| the perennial curse of VLIW, and how Itanium ended up shoved
| onto people in the first place.
| mochomocha wrote:
| On the other hand, Groq seems pretty successful.
| wood_spirit wrote:
| The Mill videos are worth watching again - there are variations
| on NaT handling and looping and branching etc that make DSPs
| much more general-purpose.
|
| I don't know how similar this Electron is, but the Mill
| explained how it could be done.
|
| Edit: aha, found them!
| https://m.youtube.com/playlist?list=PLFls3Q5bBInj_FfNLrV7gGd...
| vendiddy wrote:
| I don't know much about CPUs so maybe someone can clarify.
|
| Is this effectively having a bunch of tiny processors on a single
| chip each with its own storage and compute?
| lawlessone wrote:
| I think it's more like having the instructions your program
| executes spread across multiple tiny processors.
|
| So one instruction gets done... its output is passed to the
| next.
|
| Hopefully I've made somebody mad enough to explain why I am
| wrong.
| kendalf89 wrote:
| This grid based architecture reminds me of a programming game
| from Zachtronics, TIS-100.
| wolfi1 wrote:
| reminds me of the architecture of transputers, but on the
| same silicon
| fidotron wrote:
| Yep, or the old GreenArrays GA144 or even maybe XMOS with more
| compiler magic.
|
| One of the big questions here is how quickly it can switch
| between graphs, or if that will be like a context switch from
| hell. In an embedded context that's likely to become a headache
| way too fast, so the idea of a magic compiler fixing it so you
| don't have to know what it's doing sounds like a fantasy
| honestly.
| nolist_policy wrote:
| Also, what would cycle-accurate assembly look like for this
| chip?
| ZiiS wrote:
| Percentage chance this is 100x more efficient at the general
| purpose computing ARM is optimized for: 1/100%
| renewiltord wrote:
| Is there a dev board available? Seems hard to find. I am curious.
| pclmulqdq wrote:
| This is a CGRA. It's like an FPGA but with bigger cells. It's not
| a VLIW core.
|
| I assume that, like all past attempts at this, it's about 20x
| more efficient when code fits in the array (FPGAs get this
| ratio), but if your code size grows past something very
| trivial, the grid config needs to switch, and that costs tons
| of time and power.
| rf15 wrote:
| I agree this is very "FPGA-shaped" and I wonder if they have
| further switching optimisations on hand.
| artemonster wrote:
| As a person who is heavily invested and interested in the CPU
| space, especially embedded, I am HIGHLY skeptical of such claims.
| Somebody played TIS-100, remembered GA144 failed and decided to
| try their own. You know what can be a simple proof of your
| claims? No, not a press release. No, not a pitch deck or a
| youtube video. And NO, not even working silicon, you silly. A
| SIMPLE FUCKING ISA EMULATOR WITH A PROFILER. Instead we got a
| bunch of whitepapers. Yeah, I call it a 90% chance of total BS
| and vaporware.
| bmenrigh wrote:
| I like Ian but he's rapidly losing credibility by posting so
| much sponsored content. Many of his videos and articles now
| are basically just press releases.
| wmf wrote:
| There's >20 years of academic research behind dataflow
| architectures going back to TRIPS and MIT RAW. It's not
| literally a scam but the previous versions weren't practical
| and it's unlikely this version succeeds either. I agree that if
| the compiler was good they would release it and if they don't
| release it that's probably because it isn't good.
| jecel wrote:
| The 2022 PhD thesis linked from their web site includes a
| picture of what they claim was an actual chip made using a 22nm
| process. I understand that the commercial chip might be
| different, but it is possible that the measurements made for
| the thesis could be valid for their future products as well.
| lazyeye wrote:
| https://www.efficient.computer/technology
| archipelago123 wrote:
| It's a dataflow architecture. I assume the hardware
| implementation is very similar to what is described here:
| https://csg.csail.mit.edu/pubs/memos/Memo-229/Memo-229.pdf. The
| problem is that it becomes difficult to exploit data locality,
| and there is only so much optimization you can perform at
| compile time. Also, the motivation for these types of
| architectures (e.g. the lack of ILP in von Neumann style
| architectures) is largely absent in modern OoO cores.
| timschmidt wrote:
| Out-of-order cores spend an order of magnitude more logic and
| energy than in-order cores on invalidation, pipeline flushes,
| branch prediction, etc., all with the goal of increasing
| performance. This architecture is attempting to lower the
| joules per instruction at the cost of performance, not to
| increase energy use in exchange for performance.
| Imustaskforhelp wrote:
| Pardon me, but could somebody here explain this to me like I
| am 15? It's late at night and I can't go down another
| rabbithole, and I would appreciate it. Cheers and good night,
| fellow HN users.
| wmf wrote:
| Probably not. This is graduate-level computer architecture.
| hencoappel wrote:
| I found this video to be a good explanation.
| https://youtu.be/xuUM84dvxcY?si=VPBEsu8wz70vWbX4
| elseless wrote:
| Sure. You can think of a (simple) traditional CPU as executing
| instructions in _time_ , one-at-a-time[1] -- it fetches an
| instruction, decodes it, performs an arithmetic/logical
| operation, or maybe a memory operation, and then the
| instruction is considered to be complete.
|
| The Efficient architecture is a CGRA (coarse-grained
| reconfigurable array), which means that it executes
| instructions in _space_ instead of time. At compile time, the
| Efficient compiler looks at a graph made up of all the
| "unrolled" instructions (and data) in the program, and decides
| how to map it all spatially onto the hardware units. Of course,
| the graph may not all fit onto the hardware at once, in which
| case it must also be split up to run in batches over time. But
| the key difference is that there's this sort of spatial
| unrolling that goes on.
|
| This means that a lot of the work of fetching and decoding
| instructions and data can be eliminated, which is good.
| However, it also means that the program must be mostly, if not
| completely, static, meaning there's a very limited ability for
| data-dependent branching, looping, etc. to occur compared to a
| CPU. So even if the compiler claims to support C++/Rust/etc.,
| it probably does not support, e.g., pointers or dynamically-
| allocated objects as we usually think of them.
|
| [1] Most modern CPUs don't actually execute instructions one-
| at-a-time -- that's just an abstraction to make programming
| them easier. Under the hood, even in a single-core CPU, there
| is all sorts of reordering and concurrent execution going on,
| mostly to hide the fact that memory is much slower to access
| than on-chip registers and caches.
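The "execute in space" idea above can be sketched as a toy dataflow interpreter. This is a hypothetical illustration, not Efficient's actual compiler or ISA: each "tile" holds one pinned operation and fires when all of its operand slots are full, and there is no program counter. A real CGRA does this in parallel in hardware; the loop below just emulates the firing rule.

```python
# Toy spatial/dataflow execution of y = (a + b) * (c - d): each tile
# is pinned to one operation, waits until every input has arrived,
# then fires and forwards its result to consumer tiles by name.

tiles = {
    "add": {"op": lambda x, y: x + y, "inputs": ["a", "b"]},
    "sub": {"op": lambda x, y: x - y, "inputs": ["c", "d"]},
    "mul": {"op": lambda x, y: x * y, "inputs": ["add", "sub"]},
}

def run(tiles, values):
    """Fire every tile whose inputs are available until quiescent."""
    fired = set()
    while True:
        ready = [name for name, t in tiles.items()
                 if name not in fired
                 and all(i in values for i in t["inputs"])]
        if not ready:            # nothing left to fire: done
            return values
        for name in ready:
            args = [values[i] for i in tiles[name]["inputs"]]
            values[name] = tiles[name]["op"](*args)
            fired.add(name)

result = run(tiles, {"a": 2, "b": 3, "c": 10, "d": 4})
print(result["mul"])   # (2 + 3) * (10 - 4) = 30
```

Note that the control question raised above shows up here too: a data-dependent branch would mean the graph itself changes shape, which this firing model cannot express without extra machinery.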
| pclmulqdq wrote:
| Pointers and dynamic objects are probably fine given the
| ability to do indirect loads, which I assume they have (Side
| note: I have built b-trees on FPGAs before, and these kinds
| of data structures are smaller than you think). It's actually
| pure code size that is the problem here rather than specific
| capabilities, as long as the hardware supports those
| instructions.
|
| Instead of assembly instructions taking time in these
| architectures, they take space. You will have a capacity of
| 1000-100000 instructions (including all the branches you
| might take), and then the chip is full. To get past that
| limit, you have to store state to RAM and then reconfigure
| the array to continue computing.
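A back-of-envelope model of that capacity cliff, with all numbers invented (the real capacity, per-configuration runtime, and reconfiguration cost are unknown); the point is only the shape of the cost curve once a program no longer fits in one configuration.

```python
import math

# Toy cost model for the capacity limit described above: a program
# whose instruction graph fits in the array runs in one
# configuration; a larger one must be split into several, spilling
# state to RAM and reconfiguring the grid between them.

TILE_CAPACITY = 1000      # instructions the array can hold at once
RECONFIG_COST = 50_000    # cycles to spill state and reload the grid

def total_cycles(program_size: int, cycles_per_config: int = 2_000) -> int:
    configs = math.ceil(program_size / TILE_CAPACITY)
    return configs * cycles_per_config + (configs - 1) * RECONFIG_COST

print(total_cycles(800))     # fits in one config: no reconfiguration
print(total_cycles(5_000))   # 5 configs: reconfiguration dominates
```

With these made-up constants the second case spends roughly 20x more cycles reconfiguring than computing, which is the "20x efficient until the code doesn't fit" pattern described upthread.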
| kannanvijayan wrote:
| Hmm. You'd be able to trade off time for that space by
| using more general configurations that you can dynamically
| map instruction-sequences onto, no?
|
| The mapping wouldn't be as efficient as a bespoke
| compilation, but it should be able to avoid the
| configuration swap-outs.
|
| Basically a set of configurations that can be used as an
| interpreter.
| icandoit wrote:
| I wondered if this was using interaction combinators like the
| vine programming language does.
|
| I haven't read much that explains how they do it.
|
| I have been very slowly trying to build a translation layer
| between starlark and vine as a proof of concept of massively
| parallel computing. If someone better qualified finds a better
| solution, the market is sure to have demand for you. A
| translation layer is bound to be cheaper than teaching devs to
| write in jax or triton or whatever comes next.
| variadix wrote:
| Pretty interesting concept, though as other commenters have
| pointed out the efficiency gains likely break down once your
| program doesn't fit onto the mesh all at once. Also this looks
| like it requires a "sufficiently smart compiler", which isn't a
| good sign either. The need to do routing etc. reminds me of the
| problems FPGAs have during place and route (effectively the
| minimum-cut problem on a graph, i.e. NP-hard); hopefully
| compilation doesn't take as long as FPGA synthesis does.
| gamache wrote:
| Sounds a lot like the GreenArrays GA144
| (https://www.greenarraychips.com/home/documents/greg/GA144.ht...)!
| Sadly, without a bizarre and proprietary FORTH dialect to call
| its own, I fear the E1 will not have the market traction of its
| predecessor.
| jnpnj wrote:
| That was my first thought too. I really like the idea of
| interconnected node arrays. There's something biological about
| it - thinking in topology and neighbour diffusion - that I
| find appealing.
| londons_explore wrote:
| One day someone will get it working...
|
| Data transfer is slow and power hungry - it's obvious that
| putting a little bit of compute next to every bit of memory
| is the way to minimize data transfer distance.
|
| The laws of physics can't be broken, yet people demand more
| and more performance, so eventually this issue will be worth
| the difficulty of solving.
| trhway wrote:
| > spatial data flow model. Instead of instructions flowing
| through a centralized pipeline, the E1 pins instructions to
| specific compute nodes called tiles and then lets the data flow
| between them. A node, such as a multiply, processes its operands
| when all the operand registers for that tile are filled. The
| result then travels to the next tile where it is needed. There's
| no program counter, no global scheduler. This native data-flow
| execution model supposedly cuts a huge amount of the energy
| overhead typical CPUs waste just moving data.
|
| should work great for NN.
___________________________________________________________________
(page generated 2025-07-25 23:00 UTC)