[HN Gopher] Esperanto Champions the Efficiency of Its 1,092-Core...
___________________________________________________________________
Esperanto Champions the Efficiency of Its 1,092-Core RISC-V Chip
Author : rbanffy
Score : 91 points
Date : 2021-08-29 18:12 UTC (4 hours ago)
(HTM) web link (www.hpcwire.com)
(TXT) w3m dump (www.hpcwire.com)
| [deleted]
| klelatti wrote:
| Mentions that each ET-Minion core has a vector / tensor unit.
| From [1]
|
| > The ET-Minion core, based on the open RISC-V ISA, adds
| proprietary extensions optimized for machine learning. This
| general-purpose 64-bit microprocessor executes instructions in
| order, for maximum efficiency, while extensions support vector
| and tensor operations on up to 256 bits of floating-point data
| (using 16-bit or 32-bit operands) or 512 bits of integer data
| (using 8-bit operands) per clock period.
|
| So sounds like at least 8736 SP FP operations per cycle.
|
| [1] https://www.esperanto.ai/technology/
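|
|         A quick back-of-envelope check of that figure (a sketch of the
|         arithmetic only; it assumes all 1,092 cores issue one full-width
|         256-bit FP32 vector operation per cycle and does not count FMAs
|         as two ops):
|
|           # rough sanity check of the 8,736 SP ops/cycle figure
|           cores = 1092                  # headline core count
|           vector_bits = 256             # FP vector width per core per cycle
|           fp32_bits = 32
|           ops_per_core = vector_bits // fp32_bits   # 8 SP ops per cycle
|           print(cores * ops_per_core)               # 8736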
| langarto wrote:
| More information (HotChips 33 presentation):
| https://www.esperanto.ai/wp-content/uploads/2021/08/HC2021.E...
| turminal wrote:
 | I'm not any kind of expert in the field, but trading single-chip
 | speed for more chips surely has its downsides, which aren't
 | mentioned in the article at all.
| jmercouris wrote:
 | Of course it does; usually systems have both slower and faster
 | compute units to compensate for the performance penalty of
 | non-parallelisable operations.
| cogman10 wrote:
 | The biggest one that comes up quickly is memory bandwidth.
 |
 | The more cores you have, the more memory bandwidth is needed to
 | keep them all fed with data.
| touisteur wrote:
 | Well, it really depends on the computational intensity your
 | algorithm needs. I've stumbled upon things of beauty porting
 | things to GPUs, especially if you're going to perform a huge
 | number of operations based on a very small amount of data. As
 | long as you don't have too much intermediate data, register
 | spilling, etc., these GPU things do fly. They're also very
 | impressive on NN-based workloads... Even something 2 or 3
 | gens behind can be game-changing, with some optimization
 | effort. Tensor libraries leave a lot on the floor to pick up,
 | especially if you're not using the canned 'competition
 | winning' networks.
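 |
 | A minimal roofline-style sketch of that trade-off (the peak
 | throughput and bandwidth numbers below are placeholders, not
 | Esperanto's or any particular GPU's):
 |
 |   # attainable throughput = min(peak compute, intensity * bandwidth)
 |   def attainable_gflops(flops_per_byte, peak_gflops, bw_gbs):
 |       return min(peak_gflops, flops_per_byte * bw_gbs)
 |
 |   # a streaming kernel touching each byte once is bandwidth-bound
 |   print(attainable_gflops(0.25, 20000, 800))    # 200 GFLOP/s
 |   # a large dense matmul reuses each byte many times: compute-bound
 |   print(attainable_gflops(100.0, 20000, 800))   # 20000 GFLOP/s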
| Mikeb85 wrote:
 | Read the article. It's about ML workloads, which scale well
 | across many cores. It's also being compared to GPUs. The whole
 | point of what they're doing is to be able to pack in more cores
 | than a CPU, but with a larger instruction set than a GPU core.
| zozbot234 wrote:
| I like ML but it's not a very good language for this highly
| parallel HPC'ish stuff. We'll see how Rust does, it should be
| a lot closer to what's actually needed here.
| medo-bear wrote:
| ML as in machine learning
| mirker wrote:
| Yes, but it's meant to do ML inference, which can be
| parallelized decently. On those workloads, you can use GPUs,
| which are also composed of thousands of "wimpy" cores.
| R0b0t1 wrote:
 | A full CPU is useful for decision-intensive or time-series-intensive
 | data. Normal ML inference is not necessarily either of those. You
 | could have more complicated neurons (or just make normal compute
 | tiles, which they may be doing).
| goldenkey wrote:
 | I thought the same thing back in 2015, considering the way
 | GPUs supposedly handle branches with warps. However, my
 | stock trading simulator ran way better on 3 GTX Titans
 | than on the Intel "Knights Many Cores" Phi preview I had
 | exclusively been able to obtain. I was excited because it
 | had something like 100 Pentium 4 cores on it, and was
 | supposed to be much faster than a GPU for logical code.
 | Disappointment set in when the GPU stomped it
 | performance-wise. I still don't even understand why, but I
 | do know now that the whole "GPUs can't handle branching
 | performantly" thing is a bit overstated. Intel discontinued
 | their Phi, which I can only guess was due to its lack of
 | competitiveness.
| sdenton4 wrote:
 | A standard way to handle branching in GPU code is with
 | masking, like so (where x is a vector, and broadcasting
 | is implied):
 |
 |   M = x > 0
 |   y = M * f(x) + (1 - M) * g(x)
|
| So you end up evaluating both sides of the decision
| branch and adding the results. But this is fine if you've
| got a dumb number of cores. And often traditional cpus
| wind up evaluating both branches anyway.
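 |
 | A minimal runnable sketch of that masking idiom (f and g here
 | are placeholder branch bodies, chosen only for illustration):
 |
 |   import numpy as np
 |
 |   def f(x): return np.sqrt(np.abs(x))   # "then" branch
 |   def g(x): return x * x                # "else" branch
 |
 |   x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
 |   M = (x > 0).astype(x.dtype)           # 1.0 where x > 0, else 0.0
 |   # both branches are evaluated for every element, then blended
 |   y = M * f(x) + (1 - M) * g(x)
 |   print(y)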
| monocasa wrote:
| > And often traditional cpus wind up evaluating both
| branches anyway.
|
 | That's actually really overstated. Evaluating both sides
 | isn't really something CPUs tend to do; instead they
 | predict one path and roll back on a mispredict. This is
 | because the out-of-order hardware isn't fundamentally a
 | tree, but is generally better thought of as a ring
 | buffer where uncommitted state is what's between the head
 | and tail. Storing diverging paths is incredibly expensive
 | there. I'm not going to say something as strong as "it's
 | never been done", but I certainly don't know of a
 | general-purpose CPU arch that'll compute both sides of an
 | architectural branch; they instead rely on making good
 | predictions down one instruction stream, then rolling
 | back and restarting when it's clear they mispredicted.
| jasonwatkinspdx wrote:
| It's not even about the expense of implementing diverging
| paths in hardware.
|
 | This concept of exploring like a tree vs. a path was
 | explored under the name Disjoint Eager Execution. You
 | know what killed it? Branch predictors. In a world where
 | branch predictors are maybe only 75% effective, DEE could
 | make sense. We live in a world where branch predictors
 | are _far_ better than that. So it just isn't worth
 | speculating off the predicted most likely path.
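 |
 | A rough way to see why (toy numbers; the flush penalty is an
 | assumed figure, not a measured one):
 |
 |   penalty_cycles = 15           # assumed mispredict flush/refill cost
 |   for accuracy in (0.75, 0.97):
 |       wasted = (1 - accuracy) * penalty_cycles
 |       print(f"{accuracy:.0%} accurate -> ~{wasted:.2f} wasted cycles/branch")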
| monocasa wrote:
 | What killed it was more the effectiveness of Tomasulo-
 | style out-of-order machinery, and the fact that the real
 | problem isn't control hazards but data hazards. DEE was
 | conceived in a day when memory was about as fast as the
 | processor. That's why it's always being compared with
 | cores like the R3000.
| monocasa wrote:
 | Sort of. GPU "cores" in the CPU space would be called SIMD
 | lanes. An apples-to-apples count of GPU cores using CPU
 | terminology would put an Nvidia 3060 at 28 cores and a 3090
 | at 82 cores.
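 |
 | For concreteness, the lane math behind those counts (a sketch;
 | it assumes Ampere's 128 FP32 lanes per SM, the unit Nvidia
 | markets as "CUDA cores"):
 |
 |   for name, sms in [("RTX 3060", 28), ("RTX 3090", 82)]:
 |       print(name, sms, "SMs x 128 lanes =", sms * 128, "'CUDA cores'")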
| klyrs wrote:
| Sure, but this is a coprocessor on an expansion card, similar
| to a GPU. I've worked on a few systolic algorithms and this
| kind of chip has massive potential in that space. TPUs have
| been a big letdown in that regard, as they don't even have the
| comparison operation needed for the matrix-based shortest-path
| algorithm.
| joe_the_user wrote:
 | Looking at this, I'm confused by basic questions. Is this a MIMD
 | or SIMD architecture chip? [1] What is the memory/caching
 | structure here, and would it be fast or slow? Is this to replace a
 | GPU, or to replace the CPU you connect to the GPU, or both? Would
 | you get code and/or data divergence here? I.e., "many cores" seems
 | to imply each has its own instruction stream, but ML usually runs
 | on vector machines like a GPU.
|
| Edit: OK, I can see this has "network on a chip" architecture but
| I think that only answers some of my questions.
|
| [1] https://en.wikipedia.org/wiki/Flynn%27s_taxonomy
| jasonwatkinspdx wrote:
 | MIMD via 1088 cores per chip; each core has 512-bit short-vector
 | SIMD and a 1024-bit tensor unit.
 |
 | ~100 MB of SRAM on chip, 8 GDDR busses to DRAM off chip.
|
| It's purpose designed for parallel sparse matrix ML problems.
| It's more efficient than both a CPU and GPU at these, as well
| as faster in absolute terms, taking their numbers at face
| value.
| dragontamer wrote:
| I mean, GPUs are only SIMD for 32 lanes (Nvidia or AMD RDNA) or
| 64 lanes (AMD CDNA).
|
| The rest of those lanes come from MIMD techniques.
|
| -------
|
| CPU cores are more MIMD today than SISD because of out of order
| and superscalar operations. So honestly, I think it's about
| time to retire Flynn's taxonomy. Everything is MIMD.
| joe_the_user wrote:
| _I mean, GPUs are only SIMD for 32 lanes (Nvidia or AMD RDNA)
| or 64 lanes (AMD CDNA).
|
| The rest of those lanes come from MIMD techniques._
|
 | Not sure what you mean here. Are there places where different
 | groups of kernels can simultaneously execute different code?
| qwerty456127 wrote:
| With all this RISC-V hype going on I'm curious how compatible
| RISC-V processors from different vendors actually are.
| zozbot234 wrote:
| The RISC-V folks are in the process of ratifying standards for
| vector processing ("V") and lightweight packed-SIMD/DSP compute
| ("P") that should make it easier to expand compatibility in
| these domains. As of now, these standards are not quite ready
| and Esperanto are still using proprietary extensions for their
| stuff.
| OneEyedRobot wrote:
| So what in the heck does the CPU interconnect look like?
| sakras wrote:
| This looks similar to a research project I worked on called the
| Hammerblade Manycore. The cores were connected by a network,
| where you could read/write to an address by sending a packet
| which contained the address of the word, whether you were
| reading or writing, and if writing, the new value. The packet
| would then hop along the network one unit per cycle until it
| reached its destination.
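 |
 | Roughly, each remote access travels as a small request packet; a
 | sketch of what one might carry (the field names here are invented
 | for illustration, not HammerBlade's actual packet format):
 |
 |   from dataclasses import dataclass
 |   from typing import Optional
 |
 |   @dataclass
 |   class MemRequest:
 |       dest_x: int                  # destination tile on the mesh
 |       dest_y: int
 |       addr: int                    # word address at the destination
 |       is_write: bool               # False = read, True = write
 |       data: Optional[int] = None   # payload, only used for writes
 |
 |   # e.g. a remote store that hops one router per cycle toward (3, 5)
 |   pkt = MemRequest(dest_x=3, dest_y=5, addr=0x1000, is_write=True, data=42)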
| monocasa wrote:
| Classic "network on chip" for those that want the most
| searchable term.
| vletal wrote:
| - Transistor on chip
|
| - CPU on chip
|
| - System on chip
|
| - Network on chip
|
| I'm looking forward to the next X on chip which I'm not
| aware of.
| trissylegs wrote:
 | I've heard the term "cluster on a chip", which I guess
 | would apply here too.
| zh3 wrote:
 | Looks efficient, at least on the face of it. Certainly seems
 | credible (Dave Ditzel), and as a way of lowering the cost /
 | improving the efficiency of targeted ad-serving they could
 | be on to a winner.
| russellbeattie wrote:
 | Off topic thought: "Esperanto Technologies" is apparently the
 | name of the company, in case you were confused by the headline
 | like I was. I was amused to discover their offices are literally
 | 3 blocks away from where I live. (So is YCombinator, along with
 | a thousand other tech companies, so not that surprising, really,
 | but amusing.)
|
| At this point I think we need to go back to descriptive old-
| school 70s company names like, "West Coast Microprocessor
| Solutions", "Digital Logic, Inc.", "Mountain View Artificial
| Intelligence Laboratories", etc.
|
| You know, something that would blend into this map:
| https://s3.amazonaws.com/rumsey5/silicon/11492000.jpg
|
| Edit: Looking at that map, some of the company names are
| fantastically generic! "Electronics Corporation", "California
| Devices", "General Technology", "Test International".
| drmpeg wrote:
| Descriptive would be "Bus Stop Systems". No, drop the
| "Systems", just "Bus Stop".
| Causality1 wrote:
| Given recent trends I'm just glad it isn't named Esperantr
| Technologr.
| iamtedd wrote:
| Esprantly Technify.
___________________________________________________________________
(page generated 2021-08-29 23:00 UTC)