[HN Gopher] Ascenium Wants to Reinvent the CPU - and Kill Instru...
___________________________________________________________________
Ascenium Wants to Reinvent the CPU - and Kill Instruction Sets
Altogether
Author : rbanffy
Score : 109 points
Date : 2021-07-14 11:50 UTC (1 day ago)
(HTM) web link (www.tomshardware.com)
(TXT) w3m dump (www.tomshardware.com)
| fouc wrote:
| Nice, NISC for "No Instruction Set Computing".
|
| I can see it possibly disrupting the CPU industry in 10-20 years.
| Seems like a classic scenario right out of Clayton M.
| Christensen's "Innovator's Dilemma" book.
| pavlov wrote:
| If you run serverless NoSQL on NISC before the rooster crows,
| you must go outside and weep bitterly.
| spijdar wrote:
| So I won't pretend to be anywhere close to having the domain
| knowledge to really understand this, but this feels like a sort
| of more logical extreme version of the rationale behind VLIW
| processors like Itanium, e.g. "remove hard logic from the
| microprocessor and put it at the compiler level"
|
| It was my understanding that this approach failed, partially
| because it's really hard to compute/guess certain things about
| code ahead-of-time, and modern CPUs with their branch predictors,
| prefetchers and speculative execution do more at runtime than a
| compiler could effectively do. Has this changed enough for this
| to be _generally_ useful, or are they hoping to market this for
| niche use-cases?
| flatiron wrote:
| This reminded me especially of
| https://en.m.wikipedia.org/wiki/Transmeta
| mhh__ wrote:
| Transmeta is probably the only approach that could really
| make VLIW work in a general purpose system, I think. Dynamic
| processors get a good chunk of their speed not just from being
| able to schedule (maybe not as well as a compiler, in the good
| times) but from being able to _re_-schedule in adverse
| conditions - if you have the power of software to be able to
| do that speculation then I think it could work.
|
| Hard to know with Transmeta because they had to implement X86
| which is a legal minefield - FWIW I've read a lot of
| Transmeta engineers saying that they were completely sold on
| the idea but just couldn't make it stick in time - I'm too
| young to have been around when it was being produced so I
| don't know.
| my123 wrote:
| And then that design philosophy ended up at NVIDIA. (in the
| Tegra processor family, with the Denver, Denver 2 and
| Carmel CPU cores)
|
| I wonder what will happen next... they haven't released a new
| CPU core since 2018. (and this generation of the cycle uses
| stock Arm cores, so Tegra Orin gets Cortex-A78AE)
| beginrescueend wrote:
| Yeah, when I saw it was a Software-Defined Instruction set, I
| immediately thought of Transmeta, as well.
| Taniwha wrote:
| I think that Transmeta (and the others working on the same
| stuff at the same time) had fixed instruction sets, along
| with hardware and software to recompile x86 code into that
| instruction set.
|
| The difference here I think is not that the chips don't have
| an instruction set, they do, but they don't have an
| ARCHITECTURAL instruction set, the next version of the chip
| will have a different instruction set and a matching LLVM
| back end - they expect you to recompile your code for every
| new CPU.
|
| What I don't see in the literature is any mention of MMUs and
| system level stuff - I'm sure it's there
| hermitdev wrote:
| Didn't we already have this with Java? Write once, run
| anywhere and the hardware would magically adapt the
| incoming code to run on the new hardware? Except now, it's
| LLVM byte-code instead of Java (and x86 asm)?
|
| I'm not trying to be cynical here, but I can see how this
| would sound like that. I guess I'm just confused about what
| this actually is and how it is new/different than all of
| the things that have been tried before.
| Taniwha wrote:
| I think the difference (my impression from little data
| :-) is that this is (mostly) not JIT but instead more in-depth
| static compilation of basic blocks into code.
|
| The big change here is the abandonment of the concept of
| an architectural ISA - it depends on software people (all
| of them/us) giving up assembler - I think it's probably
| the right way to approach high-ILP, VLIW-like CPUs - it
| means you don't get hung up on your old CPU designs as
| you move forwards
| giantrobot wrote:
| I think it's hard to guess things not just about the code but
| also about the _data_ that's coming in. The "Sufficiently Advanced
| Compiler" could make a lot of good decisions if it knew ahead
| of time the shape of the data, e.g. a big uninterrupted batch
| of fixed transactions.
|
| That's why the target for this sort of technology is GPGPU
| work. The streams of data are very regular and largely a series
| of batch jobs.
|
| For interactive systems with millions of context switches and
| branches, all the "Sufficiently Advanced Compilers" fall
| down. There's just not enough commonality between consecutive
| operations for ahead of time optimizations to occur. Hardware
| that does great in batch jobs ends up suboptimal for the
| insanity that is interactive code.
| denton-scratch wrote:
| When they say "kill instruction sets", do they mean "abandon
| trying to replicate the instruction sets of other manufacturers"?
|
| I don't grok how a processor could not have _instructions_; I
| thought an instruction set was simply the set of instructions
| implemented by some processor model. I read the article because I
| was intrigued at the idea of a processor with no instructions. Or
| perhaps, with some kind of fluid instruction set, or loadable
| instruction set.
|
| But that doesn't seem to be what it's about; the article says
| that what they want to abandon is "deep pipelines".
| gfody wrote:
| there are whole other paradigms, eg
| https://m.youtube.com/watch?v=O3tVctB_VSU
| geocrasher wrote:
| Do I understand correctly that this reinvention of the CPU moves
| the microcode to the code being run (such as an OS kernel) rather
| than the CPU itself, giving the compiler the responsibility to
| use the CPU efficiently?
| ginko wrote:
| >The company, helmed by Peter Foley, CEO and co-founder, who
| previously worked on Apple's Apple I and Apple II computers as
| well as a long list of hardware-design focused companies.
|
| This doesn't seem right.. As far as I know the Apple I was pretty
| much exclusively designed by Steve Wozniak.
| luma wrote:
| According to his LinkedIn, he "Developed chips for the Mac and
| Mac II, including the Apple Sound Chip." Presumably the author
| here didn't catch the difference between the Mac and the
| original Apple computers.
| homarp wrote:
| http://www.byrdsight.com/apple-macintosh/ describes his work
| at Apple.
|
| Nothing on Apple 1, mostly on Mac
| Taniwha wrote:
| I've worked with Pete, he's the real deal, worked on early Mac
| hardware, did the 'hobbit' chip - a CRISP implementation
| intended for the Newton (cancelled after working silicon came
| back)
| GeorgeTirebiter wrote:
| Hobbit was used by EO for the EO 440 and EO 880.
| https://en.wikipedia.org/wiki/EO_Personal_Communicator and
| the chips, for their time, were astounding:
| https://en.wikipedia.org/wiki/AT%26T_Hobbit
| Taniwha wrote:
| I don't think that they used the chip that Pete did for
| Apple
| mkj wrote:
| Are the hyperscalers (proclaimed target market?) likely to be
| willing to give up control of the compiler stack to a third party
| like that? Generally the trend seems to be keeping software
| expertise in-house.
| rbanffy wrote:
| They could write their own compilers, as long as the thing is
| well documented.
| notacoward wrote:
| They'll insist that the toolchain be open source, then they'll
| make their own local modifications which they "never get around
| to" releasing back.
| incrudible wrote:
| Here's a comprehensive list of every time the "magic compiler
| will make our CPU competitive"-approach worked out:
| Animats wrote:
| Yes. This has been tried before. A lot. It's straightforward to
| put a lot of loosely coupled compute units on a single chip, or
| at least a single box. The question is, then what?
|
| - ILLIAC 4 (64-cpu mainframe, 1970s): "A matrix multiply is a
| master's thesis, a matrix inversion is a PhD thesis, and a
| compiler may be beyond the power of the human mind".
|
| - Connection Machine. (SIMD in lockstep, which just wasn't that
| useful.)
|
| - NCube (I tried using one of those, 64 CPUs in an array, each
| with local memory, message passing hardware. It was donated to
| Stanford because some oil company couldn't find a use for it.
| Someone got a chess program going, which works as a distributed
| search problem.)
|
| - The Cell CPU in the Playstation. (Not enough memory per CPU
| to do much locally, and slow access to main memory. Tied up the
| entire staff of Sony Computer Entertainment America for years
| trying to figure out a way to make it useful.)
|
| - Itanium. (I went to a talk once by the compiler group from
| HP. Optimal instruction ordering for the thing seemed to be NP-
| hard, without easy approximate solutions.)
|
| Those are just the major ones that made it to production.
|
| But then came GPUs, which do useful things with a lot of
| loosely coupled compute units on a single chip. GPUs have
| turned out to be useful for a reasonable range of compute-heavy
| tasks other than graphics. They're good for neural nets, which
| are a simple inner loop with massive parallelism and not too
| much data sharing. So there may now be a market in this space
| for architectures which failed at "general purpose computing".
| notacoward wrote:
| Also Multiflow and more recently Convey.
|
| http://www.multiflowthebook.com/
| https://www.dmagazine.com/publications/d-ceo/2012/december/c...
|
| Multiflow definitely reached production. I even worked at a
| company that contracted to write some software for them
| (though I personally wasn't involved). They were not a total
| flop, but obviously not a stellar long-term success either.
| I'm not sure if Convey actually reached production, but their
| approach seems much more similar to what Ascenium is trying
| to do.
| incrudible wrote:
| > The Cell CPU in the Playstation. (Not enough memory per CPU
| to do much locally, and slow access to main memory. Tied up
| the entire staff of Sony Computer Entertainment America for
| years trying to figure out a way to make it useful.)
|
| At least towards EOL, developers _did_ figure out what to use
| it for, like using the massive readback performance on the
| SRAM for tiled deferred shading:
|
| https://de.slideshare.net/DICEStudio/spubased-deferred-shadi...
|
| Good luck teaching _that_ trick to your magic compiler.
| dragontamer wrote:
| NVidia's CUDA into PTX into SASS is pretty darn good.
|
| SASS, NVidia's native instruction set, seems to encode
| read/write barriers manually at the assembly level instead of
| leaving it up to the decoder. PTX exists as
| the intermediate step before SASS for a reason: all of that
| read/write barrier placement is extremely complicated.
|
| CUDA cheats by making the programmer do a significant amount of
| the heavy lifting: the programmer needs to write in an
| implicitly parallel style. But once written in that manner, the
| compiler / computer can execute the code in parallel. The
| biggest win for NVidia was convincing enough programmers to
| change paradigms and write code differently.
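|
| (To make that "implicitly parallel style" concrete, here's a
| minimal sketch of the textbook saxpy kernel - not Ascenium's or
| NVidia's internals, and the names are just the usual example
| ones. The point is the model: you write the body for a single
| element, and the hardware/runtime decides how many run at once.)
|
|       __global__ void saxpy(int n, float a,
|                             const float* x, float* y) {
|         // each thread handles one element: no loop, and no
|         // ordering between elements for hardware to discover
|         int i = blockIdx.x * blockDim.x + threadIdx.x;
|         if (i < n) y[i] = a * x[i] + y[i];
|       }
|       // host side: one 256-wide block per 256 elements, e.g.
|       // saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);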
| dleslie wrote:
| ICC
|
| When Intel shipped a compiler that artificially, and
| purposefully, crippled performance of resulting binaries when
| run on an AMD CPU.
|
| Not what you had in mind, but that dirty trick compiler did
| give Intel an advantage.
| incrudible wrote:
| > Not what you had in mind, but that dirty trick compiler did
| give Intel an advantage.
|
| You can only pull off that trick if you _already_ effectively
| own the market. Otherwise nobody would use your compiler, at
| least not exclusively.
| rbanffy wrote:
| There were dozens of incidents of compilers that looked for
| known benchmark code and optimized the hell out of those
| cases. At the time I remember some debate as to whether it
| was "fair", on one side people saying it was not, on the
| other people saying that they _actually_ improved
| benchmark-like code.
| spijdar wrote:
| While ICC did do AMD processors dirty by intentionally
| disabling optimizations that they (technically) supported in
| a clearly subversive move, this "worked" much better as a
| tactic because ICC did legitimately produce better code for
| Intel CPUs than either MSVC or GCC, and can still produce
| better optimizations.
|
| At least years ago, if you really wanted to rice up your
| system on Gentoo, you'd combine ICC with -O3 and get a small
| but measurable performance bump.
| gnufx wrote:
| The Intel compiler(s) -- I don't know whether ifort and icc
| are really distinct -- may produce better code than GCC,
| and vice versa. The bottom line I got for a set of Fortran
| benchmarks on SKX was essentially a tie with options as
| similar as I could make them. (It's a set that seemed to be
| used for marketing proprietary compilers.) If icc is as
| reliable as ifort, I wouldn't want to build my OS with it.
| fulafel wrote:
| It has somewhat worked out for GPUs, if you phrase it more
| charitably (eg "our chip needs specialized compilers and
| programming languages to perform well, but will pay off well
| enough that a critical mass of developers will use them").
|
| Not that GPUs and their proprietary fragmented & buggy tooling
| are that nice for developers even now, 15-20 years into the
| attempt, and the vast majority of apps still don't bother with
| it. And of course the whole GPGPU thing was just riding on the
| wing of gaming for most of its existence so had a really long
| artificial runway.
| robmccoll wrote:
| Extracting enough fine grain parallelism is hard. Manually
| creating it is tedious and difficult. Memory is slow. Latency
| hiding is hard. Ahead-of-time speculation is hard. Runtime
| speculation is hard. Good luck.
| dragontamer wrote:
| Your comment is very high quality, despite being short.
|
| > Extracting enough fine grain parallelism is hard. Manually
| creating it is tedious and difficult.
|
| A big win for NVidia was tricking enough programmers into
| manually describing large swaths of parallelism available for a
| GPU to take advantage of.
|
| Manually creating fine grained parallelism through classical
| structures (mutex, semaphores, async, etc. etc.) is tedious and
| probably counterproductive. But it seems like describing
| __some__ extremely common forms of parallelism is in fact very
| useful in certain applications (matrix multiplications).
|
| OpenMP, CUDA, OpenCL, (and even Fortran/Matlab to some
| extent)... even Tensorflow... are showing easy ways to describe
| a parallel problem.
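|
| (A hedged sketch of what "describing the parallelism" looks like
| in the OpenMP flavour - nothing from the article, the function
| and names are made up. You keep the ordinary loop and simply
| tell the compiler that the iterations are independent.)
|
|       void scale(int n, float a, const float* x, float* y) {
|         // the pragma is the entire description of parallelism:
|         // "these iterations are independent, split them up"
|         #pragma omp parallel for
|         for (int i = 0; i < n; ++i)
|           y[i] = a * x[i];
|       }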
|
| CPUs seem to be the best architecture at discovering latent
| parallelism in general purpose code that the original
| programmer wasn't aware of. (Typical programmers are not aware
| of the multiple-pipelines that a modern CPU tries to fill up...
| and yet that code continues to accelerate from generation-to-
| generation... now with Apple's M1 proving that even 8-way
| decode is reasonable and provides performance gains). CPU
| designers seem to be experts at finding fine-grain parallelism
| as it is, and I don't think anyone can beat them at it. (Only
| another CPU designer will probably make something better).
|
| ---------
|
| But for these alternative compute paradigms, such as GPU-
| compute, there are far fewer programmers and compiler experts
| working on that architecture. It's far easier to find little
| gold-nuggets here and there that lead to improved execution.
|
| SIMD-languages such as C-Star, Lisp-Star, CUDA, OpenCL,
| OpenMP... have been describing parallel patterns for decades
| now. And yet, it only feels like we've begun to scratch that
| surface.
| georgeecollins wrote:
| Everything you are saying makes good sense, but I wanted to
| ask: Is taking a computer program and then re-configuring it for
| some exotic and complex CPU a great problem for AI?
|
| I bet people on this forum know why that is / isn't true and I
| am curious.
| dragontamer wrote:
| If you consider "compiler theory" an AI problem, then sure.
|
| But most of us just call that compiler theory. If you're
| going down into FPGA routing, then maybe we also call that
| synthesis (slightly different than compilers, but very very
| similar in problem scope).
|
| Today's compilers have an idea of how modern CPUs execute
| instructions, and optimize accordingly. (Intel's ICC compiler
| is pretty famous for doing this specifically for Intel
| chips). Any FPGA synthesis will similarly be written to
| optimize for resources on the FPGA.
|
| ------------
|
| EDIT: Just so that you know... compiler theory and
| optimization is largely the study of graphs turning into
| other, provably equivalent but more efficient, graphs.
|
| All computer programs can be described as a graph traversal.
| Reconfiguring the graph to be shorter / smaller / more
| parallel (more instruction-level parallelism for the CPU to
| discover), etc. etc. leads to faster execution.
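|
| (A toy illustration of that "graphs into provably equivalent but
| cheaper graphs" view - a made-up three-opcode IR, nothing like
| LLVM's real data structures, with a single constant-folding
| pass.)
|
|       #include <memory>
|
|       struct Node {                 // toy IR: Const or binary op
|         enum Kind { Const, Add, Mul } kind;
|         long value;                 // used when kind == Const
|         std::shared_ptr<Node> lhs, rhs;   // used for Add / Mul
|       };
|
|       // one "pass": rewrite a subgraph into an equivalent but
|       // cheaper one by evaluating Const-op-Const at compile time
|       std::shared_ptr<Node> fold(std::shared_ptr<Node> n) {
|         if (n->kind == Node::Const) return n;
|         n->lhs = fold(n->lhs);
|         n->rhs = fold(n->rhs);
|         if (n->lhs->kind != Node::Const ||
|             n->rhs->kind != Node::Const)
|           return n;                 // nothing to fold here
|         long a = n->lhs->value, b = n->rhs->value;
|         long v = (n->kind == Node::Add) ? a + b : a * b;
|         return std::make_shared<Node>(
|             Node{Node::Const, v, nullptr, nullptr});
|       }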
|
| You don't really need deep-learning or an AI to do a lot of
| these graph operations. There are a few NP-complete problems
| (knapsack problem: minimizing memory moves by optimizing
| register allocation) along the way that maybe an AI could try
| to solve. But I have my doubts that throwing a deep-learning
| AI at NP complete problems is the best strategy (especially
| given the strength of modern 3SAT solvers)
| brandmeyer wrote:
| > If you're going down into FPGA routing, then maybe we
| also call that synthesis (slightly different than
| compilers, but very very similar in problem scope).
|
| Synthesis is the process of taking constructs in a high-
| level language (SystemVerilog, VHDL, etc) and translating
| the code into the primitives provided by the target
| architecture (LUT, blockram, DSP, distributed RAM, muxes,
| etc). I agree that this looks like an ordinary compiler
| pipeline.
|
| Place and route is a totally different beast. There is
| potential for an AlphaGo-like structure to help out here
| tremendously, where an AI has been trained to produce a
| cost estimate that aids a decision process and steers the
| search space.
| dragontamer wrote:
| > Place and route is a totally different beast. There is
| potential for an AlphaGo-like structure to help out here
| tremendously, where an AI has been trained to produce a
| cost estimate that aids a decision process and steers the
| search space.
|
| As far as I can tell, that's just another typical "Damn
| it, yet ANOTHER NP-complete problem..." that pops up
| every few minutes when looking through compiler code.
|
| EDIT: Place and route is finite in scope (there's only a
| finite number of blocks on any FPGA or ASIC design). I
| haven't thought too hard about the problem, but it seems
| like it's similar to knapsack problems. It also seems
| related to non-NP-complete problems like Planar Graphs
| (https://en.wikipedia.org/wiki/Planar_graph).
|
| The current state-of-the-art 3SAT solvers are a different
| branch of AI than deep learning. I'm sure deep learning
| can have some application somewhere, but... the state of
| the art is pretty damn good.
|
| There's probably a lot of provably optimal subproblems
| (ex: Planar) that are simpler than NP-complete. But then
| the final solution is going to be some complicated beast
| of a problem that comes down to guess and check (of
| which, 3SAT solvers are the current state of the art).
| jcranmer wrote:
| Your assessment is pretty much correct; it boils down to
| a constrained optimization problem (a la linear
| programming or SAT). And this is a family of problem-
| solving techniques for which we already have decently
| powerful algorithms.
|
| Historically, elements of constrained optimization have
| been considered AI--most intro to AI courses will
| probably cover some variant of backtracking solver or
| various hill-climbing techniques--but it doesn't really
| mesh with what modern AI tends to focus on. Deep learning
| isn't likely to provide any wowza-level results in such a
| space.
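|
| (For flavour, a toy sketch of that hill-climbing, guess-and-check
| style applied to placement - cells on a grid, cost = total
| Manhattan wire length, swap two cells and keep the swap only if
| the estimate improves. Everything here is made up; no real P&R
| tool is this naive.)
|
|       #include <cstdlib>
|       #include <utility>
|       #include <vector>
|
|       struct Net { int a, b; };        // a wire between two cells
|       using Pos = std::vector<std::pair<int, int>>;  // cell -> (x, y)
|
|       // the cost estimate that steers the search
|       int cost(const Pos& pos, const std::vector<Net>& nets) {
|         int c = 0;
|         for (const Net& n : nets)
|           c += std::abs(pos[n.a].first - pos[n.b].first) +
|                std::abs(pos[n.a].second - pos[n.b].second);
|         return c;
|       }
|
|       // guess and check: propose a random swap, keep it only if
|       // the estimate improves (real tools also accept some bad
|       // moves, simulated-annealing style, to escape local minima)
|       void place(Pos& pos, const std::vector<Net>& nets, int iters) {
|         int n = (int)pos.size();
|         for (int i = 0; i < iters; ++i) {
|           int x = std::rand() % n, y = std::rand() % n;
|           int before = cost(pos, nets);
|           std::swap(pos[x], pos[y]);
|           if (cost(pos, nets) > before) std::swap(pos[x], pos[y]);
|         }
|       }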
| gopalv wrote:
| > You don't really need deep-learning or an AI to do a lot
| of these graph operations.
|
| That is right, but we're probably not hunting for
| absolutely optimal solutions every time.
|
| It usually boils down to heuristics - every time there's a
| graph reorganization problem where a sub-optimal, but
| better solution is viable, the solutions always involve
| short-cuts derived from the domain knowledge in there.
|
| This is usually very similar to the AI training operations
| where there's a guess from an engineer, experiments to
| validate and modify the heuristic guess to improve the
| results - finally to give up when the results look
| acceptable.
|
| AIs would do well at parameter tuning, at the very least.
|
| My register allocation work would have near infinite loops
| reordering priorities, until I started forcing invariants
| into the system which didn't exist in the original design.
|
| And the usual argument about "code is compiled once,
| executed millions of times" doesn't quite apply to a JIT.
| freemint wrote:
| A more technical description:
|
| Hardware: DSP slices (Multiply-ACcumulate) and look-up tables
| (LUTs) in fixed-function blocks connected to registers hooked up
| to a crossbar switch.
|
| Software: LLVM target of what amounts to an FPGA with a ton of
| DSP slices
|
| Company: has been around for a long time, since 2005(?), under a
| slightly different company name; the huge marketing push is due
| to raising funds.
|
| Learn more: https://llvm.org/ProjectsWithLLVM/Ascenium.pdf
| baybal2 wrote:
| Very informative. Basically they have reinvented an FPGA with
| fixed blocks...
|
| > The idea of a compiler-based software solution embedded in an
| architecture would theoretically allow the Aptos processor to
| interpret workload instructions and distribute them across
| processing resources in such a way that the amount of work
| being parallelised is as close as possible to the theoretical
| maximum, whilst taking advantage of having much lesser
| architectural inefficiencies than instruction-based processors.
|
| Whether there is an instruction set or not, you will have to
| have some kind of convention for the bitstream of the re-
| programmable logic, and very likely you will end up
| re-implementing a kind of instruction set to do that with a
| high level of efficiency.
|
| Modern CPUs already all run microcode on some kind of
| programmable fabric in the front-end to give some degree of
| programmability for how the front-end and back-end interact.
| jcranmer wrote:
| The actual hardware details are kind of danced around a lot
| in both this article and the somewhat more detailed article
| it links to
| (https://www.nextplatform.com/2021/07/12/gutting-decades-of-a...).
|
| From what I can tell, it appears to be a mix between a
| systolic array and Itanium. The systolic array piece is that
| it's a grid of (near?) identical ALUs that communicate with
| adjacent ALUs in lieu of registers. But it also seems that
| there's an instruction-stream element as well - something like
| the function each ALU performs changing every clock
| cycle or few? It definitely sounds like there's some non-
| spatial component that requires faster reconfiguration times
| than the FPGA reconfiguration logic.
|
| As for viability, as another commenter points out, GPUs are
| currently the competitor to beat. And beating them requires
| either having stellar results in the don't-rewrite-code space
| or targeting use cases that GPUs don't do well at. The latter
| cases involve things like divergent branches or
| irregular/indirect memory accesses, and these cases tend to
| be handled very poorly by compilers and very well by the
| standard superscalar architecture anyways.
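|
| (A small, hypothetical example of the kind of code meant by
| "divergent branches / irregular memory accesses": both the next
| address and the branch outcome depend on data that was just
| loaded, which is what an out-of-order superscalar core handles
| well and a wide SIMD machine does not.)
|
|       struct ListNode { int key; ListNode* next; };
|
|       int count_odd(const ListNode* n) {
|         int c = 0;
|         while (n) {              // irregular: trip count unknown
|           if (n->key & 1) ++c;   // divergent, data-dependent branch
|           n = n->next;           // indirect, pointer-chasing load
|         }
|         return c;
|       }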
| dragontamer wrote:
| Not even that.
|
| Xilinx's processors have a VLIW instruction set feeding a
| SIMD-DSP surrounded by reconfigurable LUTs.
|
| Ascenium will have to try to displace Xilinx's processors,
| which are clearly aiming at this "systolic processor" like
| architecture.
|
| https://www.xilinx.com/support/documentation/white_papers/wp...
| freemint wrote:
| To be fair DSPs generally tend to be VLIW even without
| SIMD as seen in this video on the Sega Saturn
| https://www.youtube.com/watch?v=wU2UoxLtIm8 .
| desireco42 wrote:
| So I am not an expert, but from their description, they move the
| more complex parts to the compiler... To me that sounds a little
| bit like Forth on a chip, just an accumulator and some basic
| instructions? Is this fair to say, or am I not understanding it?
| femto wrote:
| Related to Reconfigurable Computing?
|
| https://en.wikipedia.org/wiki/Reconfigurable_computing
|
| It was big in the 1990s, but never took off. Maybe its time has
| come?
| jerf wrote:
| I post this in the spirit of being corrected if I am wrong, the
| Internet's core skill.
|
| If I'm reading this correctly, the competition for this
| technology isn't CPUs, it's GPUs. If that's the case, it seems
| like they'd be able to show a lot of artificially inflated
| numbers simply by running already-known-GPU-heavy workloads on
| their hardware vs. CPUs and showing massive performance
| gains... but what about an optimal GPU algorithm?
|
| Traditionally, making the hardware that can support
| reconfigurability at a highly granular level (FPGAs, for
| instance) has never been able to catch up with conventional
| silicon, because you can't recover the _quite_ significant
| costs of all that reconfigurability with improved architecture.
| The conventional silicon will still beat you with much faster
| speeds even if you take a penalty vs. some theoretical maximum
| they could have achieved with some other architecture.
|
| Plus, again, GPUs _really_ cut into the whole space of "things
| CPU can't do well". They haven't claimed the whole space, but
| they certainly put their marker down on the second-most
| profitable segment of that space. Between a strong CPU and a
| strong GPU you've got quite a lot of computing power across a
| very wide set of possible computation tasks.
|
| I am not an expert in this space, but it seems like the only
| algorithms I've seen lately people proposing custom hardware for
| at scale are neural network evaluation. By "at scale", I mean,
| there's always "inner loop" sorts of places where someone has
| some specific task they need an ASIC for, encryption being one of
| the classic examples of something that has swung back and forth
| between being done in CPUs and being done in ASICs for decades
| now. But those problems aren't really a candidate for this sort
| of tech because they are _so_ in need of hyperoptimization that
| they do in fact go straight to ASICs. It doesn't seem to me like
| there's a _huge_ middle ground here anymore between stuff worth
| dedicating the effort to make ASICs for (which almost by
| definition will outperform anything else), and the stuff covered
| by some combination of CPU & GPU.
|
| So... how am I misreading this, oh great and mightily contentious
| Internet?
| pclmulqdq wrote:
| You are correct that the main competing technology is
| throughput compute (GPUs, TPUs, ML processors) rather than
| "general-purpose" CPU-style compute with batch size of 1. The
| way we use CPUs is fairly unique, though, where we time share a
| lot of different code on each core of a system. Trying to move
| CPU applications to an FPGA-style device would be a disaster.
|
| I used to believe that FPGAs could compete with GPUs in
| throughput compute cases thanks to high-level synthesis and
| OpenCL, but it looks like most applications have gone with
| something more ASIC-like than general-purpose (eg TPUs and
| matrix units in GPUs for ML).
| dragontamer wrote:
| CPUs __ARE__ ASICs.
|
| You can get a commodity CPU today at 4.7+ GHz clock and 64 MB
| of L3 cache (AMD Ryzen 9 5950X). There's no FPGA / configurable
| logic in the world that comes even close. AMD's even shown off
| a test chip with 2x96 MB of L3 cache.
|
| GPUs __ARE__ ASICs, but with a different configuration than
| CPUs. An AMD MI100 comes with 120 compute units, each CU has
| 4x4x16 SIMD-lanes and each lane has 256 x 32-bit __registers__.
| (AMD runs each lane once-per-4 clock ticks)
|
| That's 32MB of __registers__, accessible once every 4th
| clock tick. Sure, a clock of 1.5GHz is much slower but you
| ain't getting this number of registers or SRAM on any FPGA.
|
| > I am not an expert in this space, but it seems like the only
| algorithms I've seen lately people proposing custom hardware
| for at scale are neural network evaluation
|
| The FPGA / ASIC stuff is almost always "pipelined systolic
| array". The systolic array is even more parallel than a GPU,
| and requires data to move "exactly" as planned. The idea is
| that instead of moving data to / from central registers, you
| move data across the systolic array to each area that needs it.
|
| ----------
|
| CPUs: Today's CPUs __discover__ parallelism in standard
| assembly code (aka instruction-level parallelism). They spend a
| lot of energy "decoding", which is really "scheduling" which
| instruction should run in which of some 8 to 16 pipelines
| (depending on AMD Zen vs Skylake vs Apple M1). Each pipeline
| has different characteristics (this one can add / multiply, but
| another one can add / multiply / divide). Sometimes some
| instructions take up the whole pipeline (divide), other times,
| multiply takes 5 clock ticks but can "take an instruction"
| every clock tick.
|
| CPUs spend a huge amount of effort discovering the optimal,
| parallel, execution for any chunk of assembly code, and today
| scan ~400 to 700 instructions to do so (the rough size
| of a typical CPU's reorder buffer). That's why branch
| prediction is so important.
|
| In effect: CPUs are today's ultimate MIMD (multiple
| instructions / multiple data) computers... with a big decoder
| in front ensuring the various pipelines remain full.
|
| -----------
|
| GPUs: SIMD execution. If parallelism is known ahead of time,
| this is a better architecture. Instead of spending so much
| space and energy on discovering parallelism, you have the
| programmer explicitly write a parallel program from the start.
|
| GPUs are today's ultimate SIMD (single
| instruction / multiple data) computers.
|
| -------------
|
| Systolic Array: Will never be a general processor, must be
| designed for each specific task at hand. Can only work if all
| data-movements are set in stone ahead of time (such as matrix
| multiplication). Under these conditions, is even more parallel
| and efficient than a GPU.
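|
| (A toy software model of that "operands move, results stay put"
| idea - an output-stationary systolic multiply of two small
| matrices, where values enter at the edges and hop one neighbour
| per tick. Purely illustrative; it isn't modelled on any
| particular chip.)
|
|       #include <cstdio>
|
|       const int N = 3;
|
|       int main() {
|         int A[N][N] = {{1,2,3},{4,5,6},{7,8,9}};
|         int B[N][N] = {{9,8,7},{6,5,4},{3,2,1}};
|         int a[N][N] = {}, b[N][N] = {}, c[N][N] = {};
|         for (int t = 0; t < 3 * N - 2; ++t) {
|           // each cell hands its a to the right neighbour and
|           // its b to the neighbour below
|           for (int i = N - 1; i >= 0; --i)
|             for (int j = N - 1; j >= 0; --j) {
|               a[i][j] = (j == 0) ? 0 : a[i][j - 1];
|               b[i][j] = (i == 0) ? 0 : b[i - 1][j];
|             }
|           // feed the skewed edges: row i of A enters i ticks
|           // late, column j of B enters j ticks late
|           for (int i = 0; i < N; ++i)
|             if (t >= i && t < i + N) a[i][0] = A[i][t - i];
|           for (int j = 0; j < N; ++j)
|             if (t >= j && t < j + N) b[0][j] = B[t - j][j];
|           // every cell does one multiply-accumulate per tick;
|           // the accumulator c[i][j] never moves
|           for (int i = 0; i < N; ++i)
|             for (int j = 0; j < N; ++j)
|               c[i][j] += a[i][j] * b[i][j];
|         }
|         std::printf("%d\n", c[0][0]);   // prints 30 = (A*B)[0][0]
|       }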
| moonbug wrote:
| Remind me what the ASIC acronym means
| dragontamer wrote:
| Application specific integrated circuit.
|
| Intel / AMD CPUs are ASICs that execute x86 as
| quickly as possible. Turns out to be a very important
| application :-)
|
| My point being: you aren't going to beat a CPU at a CPU's
| job. A CPU is an ASIC that executes assembly language as
| quickly as possible. Everything in that chip is designed to
| accelerate the processing of assembly language.
|
| --------
|
| You win vs CPUs by grossly changing the architecture.
| Traditionally, by using a systolic array. (Which is very
| difficult to make a general purpose processor for).
| Brian_K_White wrote:
| I think it's at least a fair question to ask what makes
| "execute x86" such an important application?
|
| Sure the x86 cpu is the best way to execute x86
| instructions, but so what?
|
| I do not actually care about x86. I don't write it or
| read it.
|
| I write bash and c and kicad and ooenscad and markdown
| etc..., and really even those are just todays convenient
| means of expression. The same "what's so untouchable
| about x86" is true for c.
|
| I actually care about manipulating data.
|
| I don't mean databases, I mean everything, like the input
| from a sensor and the output to an actuator or display is
| all just manipulating data at the lowest level.
|
| Maybe this new architecture idea can not perform my
| freecad modelling task faster or more efficiently than my
| i7, but I see nothing about the macroscopic job that
| dictates it's already being mapped to hardware in the
| most elegant way possible by translating it to the x86 ISA
| and executing it on an x86 ASIC.
| dragontamer wrote:
| > I think it's at least a fair question to ask what makes
| "execute x86" such an important application?
|
| x86, ARM, POWER9, and RISC-V are all the same class of
| assembly languages. There's really not much difference
| today in their architectures.
|
| All of them are heavily pipelined, heavily branch
| predicted, superscalar out-of-order speculative
| processors with cache coherence / snooping to provide
| some kind of memory model that's standardizing upon
| Acquire/Release semantics. (Though x86 remains in Total-
| store ordering model instead).
|
| It has been demonstrated that this architecture is the
| fastest for executing high level code from Bash, C, Java,
| Python, etc. etc. Any language that compiles down into a
| set of registers / jumps / calls (including indirect
| calls) and supports threads of execution is inevitably
| going to look a hell of a lot like x86 / ARM / POWER9.
|
| ----------
|
| If you're willing to change to OpenCL / CUDA, then you
| can execute on SIMD-computers such as NVidia Ampere or
| AMD CDNA. It's a completely different execution model than
| x86 / ARM / POWER9 / RISC-V, with a different language to
| support the differences in performance (ex: x86 / POWER9
| have very fast spinlocks. CUDA / OpenCL has very fast
| thread-barriers).
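|
| (For concreteness, a minimal sketch of what a spinlock and those
| acquire/release semantics look like at the source level -
| standard C++ atomics, not tied to any ISA; on TSO machines like
| x86 the acquire/release orderings cost essentially nothing,
| which is part of the point being made.)
|
|       #include <atomic>
|
|       std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;
|
|       void lock() {
|         // acquire: later reads/writes can't be hoisted above this
|         while (lock_flag.test_and_set(std::memory_order_acquire)) {}
|       }
|       void unlock() {
|         // release: earlier reads/writes can't sink below this
|         lock_flag.clear(std::memory_order_release);
|       }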
|
| There's a CUDA-like compiler for x86 AVX + ARM-NEON
| called "ispc" for people who want to have CUDA-like
| programming on a CPU. But it executes slower, because
| CPUs have much smaller SIMD-arrays than a GPU. (but
| there's virtually no latency, because x86 AVX / ARM-NEON
| SIMD registers are in the same core as the rest of their
| register space. Like... 1 clock or 2 clocks latency but
| nothing like the 10,000+ clock ticks to communicate to a
| remote GPU)
|
| ----------
|
| Look, if webservers and databases and Java JITs / Bash
| interpreters / Python interpreters could be executed by
| something different (ex: a systolic array), I'm sure
| someone would have tried by now.
|
| But look at the companies who made Java: IBM and Sun.
| What kind of computers did they make? POWER9 / SPARC.
| That's the fruit of their research: the computer
| architecture that best suits Java programming according
| to at least two different sets of researchers.
|
| And what is POWER9? It's a heavily pipelined, heavily
| branch predicted, superscalar out-of-order speculative
| core with cache-coherent acquire/release semantics for
| multicore communication. Basically the same model as a
| x86 processor.
|
| POWER9 even has AES acceleration and 128-bit
| vector SIMD units similar to x86's AVX or SSE
| instructions.
|
| You get a few differences (SMT4 on POWER9 and bigger L3
| cache), but the overall gameplan is extremely similar to
| x86.
| gnufx wrote:
| There are some strange assertions there apart from the
| definition of ASIC. ISAs aren't assembly languages. I
| could believe there may be more ARM and,
| particularly, RISC-V chips without all those features
| than with. Since when has C, the language, specified what
| it compiles to, so as to exclude Lisp machines? I read about
| Oak (later Java) on my second or third generation of
| SPARC workstation, when I'd used RS/6000. Few people
| remember what Sun did market to run Java exclusively. IBM
| might argue about the similarity of our POWER9 to x86 but
| I don't much care.
| mhh__ wrote:
| In comparing with CPUs I would also want to see thorough
| measurement of latency.
| dw-im-here wrote:
| Yes you are
___________________________________________________________________
(page generated 2021-07-15 23:02 UTC)