[HN Gopher] Ascenium Wants to Reinvent the CPU - and Kill Instru...
       ___________________________________________________________________
        
       Ascenium Wants to Reinvent the CPU - and Kill Instruction Sets
       Altogether
        
       Author : rbanffy
       Score  : 109 points
        Date   : 2021-07-14 11:50 UTC (1 day ago)
        
 (HTM) web link (www.tomshardware.com)
 (TXT) w3m dump (www.tomshardware.com)
        
       | fouc wrote:
       | Nice, NISC for "No Instruction Set Computing".
       | 
       | I can see it possibly disrupting the CPU industry in 10-20 years.
       | Seems like a classic scenario right out of Clayton M.
       | Christensen's "Innovator's Dilemma" book.
        
         | pavlov wrote:
         | If you run serverless NoSQL on NISC before the rooster crows,
         | you must go outside and weep bitterly.
        
       | spijdar wrote:
       | So I won't pretend to be anywhere close to having the domain
       | knowledge to really understand this, but this feels like a sort
        | of more logical extreme version of the rationale behind VLIW
       | processors like Itanium, e.g. "remove hard logic from the
       | microprocessor and put it at the compiler level"
       | 
       | It was my understanding that this approach failed, partially
       | because it's really hard to compute/guess certain things about
       | code ahead-of-time, and modern CPUs with their branch predictors,
       | prefetchers and speculative execution do more at runtime than a
       | compiler could effectively do. Has this changed enough for this
       | to be _generally_ useful, or are they hoping to market this for
       | niche use-cases?
        
         | flatiron wrote:
         | This reminded me especially of
         | https://en.m.wikipedia.org/wiki/Transmeta
        
           | mhh__ wrote:
           | Transmeta is probably the only approach that could really
            | make VLIW work in a general purpose system, I think. Dynamic
            | processors get a good chunk of their speed not just from
            | scheduling (maybe not as well as a compiler in the good
            | times) but from being able to _re_-schedule in adverse
            | conditions - if you have the power of software to do that
            | speculation then I think it could work.
           | 
           | Hard to know with Transmeta because they had to implement X86
           | which is a legal minefield - FWIW I've read a lot of
           | Transmeta engineers saying that they were completely sold on
           | the idea but just couldn't make it stick in time - I'm too
           | young to have been around when it was being produced so I
           | don't know.
        
             | my123 wrote:
             | And then that design philosophy ended up at NVIDIA. (in the
             | Tegra processor family, with the Denver, Denver 2 and
             | Carmel CPU cores)
             | 
              | I wonder what will happen next... they haven't released a
              | new CPU core since 2018. (and they are on the generation
              | using stock Arm cores in this cycle, so Tegra Orin gets
              | Cortex-A78AE)
        
           | beginrescueend wrote:
           | Yeah, when I saw it was a Software-Defined Instruction set, I
           | immediately thought of Transmeta, as well.
        
           | Taniwha wrote:
           | I think that Transmeta (and the others working on the same
           | stuff at the same time) had fixed instruction sets, along
           | with hardware and software to recompile x86 code into that
           | instruction set.
           | 
           | The difference here I think is not that the chips don't have
           | an instruction set, they do, but they don't have an
            | ARCHITECTURAL instruction set; the next version of the chip
           | will have a different instruction set and a matching LLVM
           | back end - they expect you to recompile your code for every
           | new CPU.
           | 
           | What I don't see in the literature is any mention of MMUs and
           | system level stuff - I'm sure it's there
        
             | hermitdev wrote:
             | Didn't we already have this with Java? Write once, run
             | anywhere and the hardware would magically adapt the
             | incoming code to run on the new hardware? Except now, it's
             | LLVM byte-code instead of Java (and x86 asm)?
             | 
             | I'm not trying to be cynical here, but I can see how this
             | would sound like that. I guess I'm just confused about what
             | this actually is and how it is new/different than all of
             | the things that have been tried before.
        
               | Taniwha wrote:
               | I think the difference (my impression from little data
               | :-) is that this is (mostly) not JIT but instead more in
               | depth static compilation of basic blocks into code.
               | 
               | The big change here is the abandonment of the concept of
               | an architectural ISA - it depends on software people (all
               | of them/us) giving up assembler - I think it's probably
               | the right way to approach high ILP VLIW-like cpus - it
               | means you don't get hung up on your old CPU designs as
               | you move forwards
        
         | giantrobot wrote:
          | I think it's hard to guess things not just about the code but
          | also about the _data_ that's coming in. The "Sufficiently
          | Advanced Compiler" could make a lot of good decisions if it
          | knew ahead of time the shape of the data, e.g. a big
          | uninterrupted batch of fixed transactions.
         | 
         | That's why the target for this sort of technology is GPGPU
         | work. The streams of data are very regular and largely a series
         | of batch jobs.
         | 
          | For interactive systems with millions of context switches and
          | branches, all the "Sufficiently Advanced Compilers" fall
          | down. There's just not enough commonality between consecutive
         | operations for ahead of time optimizations to occur. Hardware
         | that does great in batch jobs ends up suboptimal for the
         | insanity that is interactive code.
        
       | denton-scratch wrote:
       | When they say "kill instruction sets", do they mean "abandon
       | trying to replicate the instruction sets of other manufacturers"?
       | 
        | I don't grok how a processor could not have _instructions_; I
       | thought an instruction set was simply the set of instructions
       | implemented by some processor model. I read the article because I
       | was intrigued at the idea of a processor with no instructions. Or
       | perhaps, with some kind of fluid instruction set, or loadable
       | instruction set.
       | 
       | But that doesn't seem to be what it's about; the article says
       | that what they want to abandon is "deep pipelines".
        
         | gfody wrote:
         | there are whole other paradigms, eg
         | https://m.youtube.com/watch?v=O3tVctB_VSU
        
       | geocrasher wrote:
       | Do I understand correctly that this reinvention of the CPU moves
        | the microcode to the code being run (such as an OS kernel) rather
       | than the CPU itself, giving the compiler the responsibility to
       | use the CPU efficiently?
        
       | ginko wrote:
       | >The company, helmed by Peter Foley, CEO and co-founder, who
       | previously worked on Apple's Apple I and Apple II computers as
       | well as a long list of hardware-design focused companies.
       | 
        | This doesn't seem right... As far as I know the Apple I was pretty
       | much exclusively designed by Steve Wozniak.
        
         | luma wrote:
         | According to his LinkedIn, he "Developed chips for the Mac and
         | Mac II, including the Apple Sound Chip." Presumably the author
         | here didn't catch the difference between the Mac and the
         | original Apple computers.
        
           | homarp wrote:
           | http://www.byrdsight.com/apple-macintosh/ describes his work
           | at Apple.
           | 
           | Nothing on Apple 1, mostly on Mac
        
         | Taniwha wrote:
          | I've worked with Pete, he's the real deal, worked on early Mac
          | hardware, did the 'hobbit' chip - a CRISP implementation
         | intended for the Newton (cancelled after working silicon came
         | back)
        
           | GeorgeTirebiter wrote:
           | Hobbit was used by EO for the EO 440 and EO 880.
           | https://en.wikipedia.org/wiki/EO_Personal_Communicator and
           | the chips, for their time, were astounding:
           | https://en.wikipedia.org/wiki/AT%26T_Hobbit
        
             | Taniwha wrote:
             | I don't think that they used the chip that Pete did for
             | Apple
        
       | mkj wrote:
       | Are the hyperscalers (proclaimed target market?) likely to be
       | willing to give up control of the compiler stack to a third party
       | like that? Generally the trend seems to be keeping software
       | expertise in-house.
        
         | rbanffy wrote:
         | They could write their own compilers, as long as the thing is
         | well documented.
        
         | notacoward wrote:
         | They'll insist that the toolchain be open source, then they'll
         | make their own local modifications which they "never get around
         | to" releasing back.
        
       | incrudible wrote:
       | Here's a comprehensive list of every time the "magic compiler
       | will make our CPU competitive"-approach worked out:
        
         | Animats wrote:
         | Yes. This has been tried before. A lot. It's straightforward to
         | put a lot of loosely coupled compute units on a single chip, or
         | at least a single box. The question is, then what?
         | 
          | - ILLIAC IV (64-CPU mainframe, 1970s): "A matrix multiply is a
         | master's thesis, a matrix inversion is a PhD thesis, and a
         | compiler may be beyond the power of the human mind".
         | 
         | - Connection Machine. (SIMD in lockstep, which just wasn't that
         | useful.)
         | 
         | - NCube (I tried using one of those, 64 CPUs in an array, each
         | with local memory, message passing hardware. It was donated to
         | Stanford because some oil company couldn't find a use for it.
         | Someone got a chess program going, which works as a distributed
          | search problem.)
         | 
         | - The Cell CPU in the Playstation. (Not enough memory per CPU
         | to do much locally, and slow access to main memory. Tied up the
         | entire staff of Sony Computer Entertainment America for years
         | trying to figure out a way to make it useful.)
         | 
         | - Itanium. (I went to a talk once by the compiler group from
         | HP. Optimal instruction ordering for the thing seemed to be NP-
         | hard, without easy approximate solutions.)
         | 
         | Those are just the major ones that made it to production.
         | 
         | But then came GPUs, which do useful things with a lot of
         | loosely coupled compute units on a single chip. GPUs have
         | turned out to be useful for a reasonable range of compute-heavy
         | tasks other than graphics. They're good for neural nets, which
         | are a simple inner loop with massive parallelism and not too
         | much data sharing. So there may now be a market in this space
         | for architectures which failed at "general purpose computing".
        
           | notacoward wrote:
           | Also Multiflow and more recently Convey.
           | 
           | http://www.multiflowthebook.com/ https://www.dmagazine.com/pu
           | blications/d-ceo/2012/december/c...
           | 
           | Multiflow definitely reached production. I even worked at a
           | company that contracted to write some software for them
           | (though I personally wasn't involved). They were not a total
           | flop, but obviously not a stellar long-term success either.
           | I'm not sure if Convey actually reached production, but their
           | approach seems much more similar to what Ascenium is trying
           | to do.
        
           | incrudible wrote:
           | > The Cell CPU in the Playstation. (Not enough memory per CPU
           | to do much locally, and slow access to main memory. Tied up
           | the entire staff of Sony Computer Entertainment America for
           | years trying to figure out a way to make it useful.)
           | 
           | At least towards EOL, developers _did_ figure out what to use
           | it for, like using the massive readback performance on the
           | SRAM for tiled deferred shading:
           | 
           | https://de.slideshare.net/DICEStudio/spubased-deferred-
           | shadi...
           | 
           | Good luck teaching _that_ trick to your magic compiler.
        
         | dragontamer wrote:
         | NVidia's CUDA into PTX into SASS is pretty darn good.
         | 
          | SASS, NVidia's native instruction set, seems to encode
          | read/write barriers manually at the assembly level instead of
          | leaving it up to the decoder. PTX exists as the intermediate
          | step before SASS for a reason: all of that read/write barrier
          | placement is extremely complicated.
         | 
         | CUDA cheats by making the programmer do a significant amount of
         | the heavy lifting: the programmer needs to write in an
         | implicitly parallel style. But once written in that manner, the
         | compiler / computer can execute the code in parallel. The
         | biggest win for NVidia was convincing enough programmers to
         | change paradigms and write code differently.
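          | 
          | As an illustration (not Ascenium's design), here is a minimal
          | CUDA sketch of that "implicitly parallel" style: the
          | programmer writes only the per-element work, and the
          | compiler/hardware map it onto however many lanes exist. The
          | names and launch configuration are arbitrary.
          | 
          |     // SAXPY: y = a*x + y, one element per thread.
          |     __global__ void saxpy(int n, float a,
          |                           const float *x, float *y) {
          |       int i = blockIdx.x * blockDim.x + threadIdx.x;
          |       if (i < n)                 // guard the ragged tail
          |         y[i] = a * x[i] + y[i];
          |     }
          | 
          |     // Host side: enough 256-thread blocks to cover n.
          |     // saxpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);
          | 
          | nvcc lowers this through PTX and then to SASS, which is where
          | the scheduling and barrier details described above end up.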
        
         | dleslie wrote:
         | ICC
         | 
         | When Intel shipped a compiler that artificially, and
         | purposefully, crippled performance of resulting binaries when
         | run on an AMD CPU.
         | 
         | Not what you had in mind, but that dirty trick compiler did
         | give Intel an advantage.
        
           | incrudible wrote:
           | > Not what you had in mind, but that dirty trick compiler did
            | > give Intel an advantage.
           | 
           | You can only pull off that trick if you _already_ effectively
           | own the market. Otherwise nobody would use your compiler, at
           | least not exclusively.
        
             | rbanffy wrote:
             | There were dozens of incidents of compilers that looked for
             | known benchmark code and optimized the hell out of those
             | cases. At the time I remember some debate as to whether it
             | was "fair", on one side people saying it was not, on the
             | other people saying that they _actually_ improved
             | benchmark-like code.
        
           | spijdar wrote:
           | While ICC did do AMD processors dirty by intentionally
           | disabling optimizations that they (technically) supported in
           | a clearly subversive move, this "worked" much better as a
           | tactic because ICC did legitimately produce better code for
           | Intel CPUs than either MSVC or GCC, and can still produce
           | better optimizations.
           | 
           | At least years ago, if you really wanted to rice up your
           | system on Gentoo, you'd combine ICC with -O3 and get a small
           | but measurable performance bump.
        
             | gnufx wrote:
             | The Intel compiler(s) -- I don't know whether ifort and icc
             | are really distinct -- may produce better code than GCC,
             | and vice versa. The bottom line I got for a set of Fortran
             | benchmarks on SKX was essentially a tie with options as
             | similar as I could make them. (It's a set that seemed to be
             | used for marketing proprietary compilers.) If icc is as
             | reliable as ifort, I wouldn't want to build my OS with it.
        
         | fulafel wrote:
         | It has somewhat worked out for GPUs, if you phrase it more
         | charitably (eg "our chip needs specialized compilers and
          | programming languages to perform well, but will pay off well
         | enough that a critical mass of developers will use them").
         | 
         | Not that GPUs and their proprietary fragmented & buggy tooling
         | are that nice for developers even now, 15-20 years into the
         | attempt, and the vast majority of apps still don't bother with
         | it. And of course the whole GPGPU thing was just riding on the
         | wing of gaming for most of its existence so had a really long
         | artificial runway.
        
       | robmccoll wrote:
       | Extracting enough fine grain parallelism is hard. Manually
       | creating it is tedious and difficult. Memory is slow. Latency
       | hiding is hard. Ahead-of-time speculation is hard. Runtime
       | speculation is hard. Good luck.
        
         | dragontamer wrote:
         | Your comment is very high quality, despite being short.
         | 
         | > Extracting enough fine grain parallelism is hard. Manually
         | creating it is tedious and difficult.
         | 
         | A big win for NVidia was tricking enough programmers into
         | manually describing large swaths of parallelism available for a
         | GPU to take advantage of.
         | 
         | Manually creating fine grained parallelism through classical
         | structures (mutex, semaphores, async, etc. etc.) is tedious and
         | probably counterproductive. But it seems like describing
         | __some__ extremely common forms of parallelism is in fact very
         | useful in certain applications (matrix multiplications).
         | 
         | OpenMP, CUDA, OpenCL, (and even Fortran/Matlab to some
         | extent)... even Tensorflow... are showing easy ways to describe
         | a parallel problem.
         | 
         | CPUs seem to be the best architecture at discovering latent
         | parallelism in general purpose code that the original
         | programmer wasn't aware of. (Typical programmers are not aware
         | of the multiple-pipelines that a modern CPU tries to fill up...
         | and yet that code continues to accelerate from generation-to-
         | generation... now with Apple's M1 proving that even 8-way
         | decode is reasonable and provides performance gains). CPU
         | designers seem to be experts at finding fine-grain parallelism
         | as it is, and I don't think anyone can beat them at it. (Only
         | another CPU designer will probably make something better).
         | 
         | ---------
         | 
         | But for these alternative compute paradigms: such as GPU-
         | compute, there are far fewer programmers and compiler experts
          | working on that architecture. It's far easier to find little
         | gold-nuggets here and there that lead to improved execution.
         | 
         | SIMD-languages such as C-Star, Lisp-Star, CUDA, OpenCL,
         | OpenMP... have been describing parallel patterns for decades
         | now. And yet, it only feels like we've begun to scratch that
         | surface.
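          | 
          | A toy example of describing the pattern rather than the
          | schedule, again in CUDA; the kernel name and the deliberately
          | naive atomicAdd are only for illustration, not a tuned
          | reduction:
          | 
          |     // A data-parallel sum: each thread contributes one
          |     // element. The programmer states the pattern; the
          |     // hardware decides how many lanes actually run at once.
          |     __global__ void sum(const float *x, float *total, int n) {
          |       int i = blockIdx.x * blockDim.x + threadIdx.x;
          |       if (i < n)
          |         atomicAdd(total, x[i]);  // correct, not the fast path
          |     }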
        
         | georgeecollins wrote:
         | Everything you are saying makes good sense, but I wanted to
          | ask: Is taking a computer program and then re-configuring it
          | for some exotic and complex CPU a great problem for AI?
         | 
         | I bet people on this forum know why that is / isn't true and I
         | am curious.
        
           | dragontamer wrote:
           | If you consider "compiler theory" an AI problem, then sure.
           | 
           | But most of us just call that compiler theory. If you're
           | going down into FPGA routing, then maybe we also call that
           | synthesis (slightly different than compilers, but very very
           | similar in problem scope).
           | 
           | Today's compilers have an idea of how modern CPUs execute
           | instructions, and optimize accordingly. (Intel's ICC compiler
           | is pretty famous for doing this specifically for Intel
           | chips). Any FPGA synthesis will similarly be written to
           | optimize for resources on the FPGA.
           | 
           | ------------
           | 
           | EDIT: Just so that you know... compiler theory and
           | optimization is largely the study of graphs turning into
           | other, provably equivalent but more efficient, graphs.
           | 
           | All computer programs can be described as a graph traversal.
           | Reconfiguring the graph to be shorter / smaller / more
           | parallel (more instruction-level parallelism for the CPU to
           | discover), etc. etc. leads to faster execution.
           | 
           | You don't really need deep-learning or an AI to do a lot of
           | these graph operations. There's a few NP-complete problems
           | (knapsack problem: minimizing memory moves by optimizing
           | register allocation) along the way that maybe an AI could try
           | to solve. But I have my doubts that throwing a deep-learning
            | AI at NP-complete problems is the best strategy (especially
            | given the strength of modern 3SAT solvers).
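            | 
            | A tiny, made-up example of that graph-to-graph rewriting
            | (common subexpression elimination), written as plain host
            | code; real compilers do this on an IR, not on source:
            | 
            |     // Before: the expression graph contains a*b twice.
            |     float before(float a, float b, float c) {
            |       return a * b + c * (a * b);
            |     }
            | 
            |     // After: the rewritten, equivalent graph shares the
            |     // a*b node, so it is only computed once.
            |     float after(float a, float b, float c) {
            |       float t = a * b;
            |       return t + c * t;
            |     }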
        
             | brandmeyer wrote:
             | > If you're going down into FPGA routing, then maybe we
             | also call that synthesis (slightly different than
             | compilers, but very very similar in problem scope).
             | 
             | Synthesis is the process of taking constructs in a high-
             | level language (SystemVerilog, VHDL, etc) and translating
             | the code into the primitives provided by the target
             | architecture (LUT, blockram, DSP, distributed RAM, muxes,
             | etc). I agree that this looks like an ordinary compiler
             | pipeline.
             | 
             | Place and route is a totally different beast. There is
             | potential for an AlphaGo-like structure to help out here
             | tremendously, where an AI has been trained to produce a
             | cost estimate that aids a decision process and steers the
             | search space.
        
               | dragontamer wrote:
               | > Place and route is a totally different beast. There is
               | potential for an AlphaGo-like structure to help out here
               | tremendously, where an AI has been trained to produce a
               | cost estimate that aids a decision process and steers the
               | search space.
               | 
               | As far as I can tell, that's just another typical "Damn
               | it, yet ANOTHER NP-complete problem..." that pops up
               | every few minutes when looking through compiler code.
               | 
                | EDIT: Place and route is finite in scope (there's only a
                | finite number of blocks on any FPGA or ASIC design). I
                | haven't thought too hard about the problem, but it seems
                | like it's similar to knapsack problems. It also seems
                | related to non-NP-complete problems like planar graphs
                | (https://en.wikipedia.org/wiki/Planar_graph).
               | 
               | The current state-of-the-art 3SAT solvers are a different
               | branch of AI than deep learning. I'm sure deep learning
               | can have some application somewhere, but... the state of
               | the art is pretty damn good.
               | 
               | There's probably a lot of provably optimal subproblems
               | (ex: Planar) that are simpler than NP-complete. But then
               | the final solution is going to be some complicated beast
               | of a problem that comes down to guess and check (of
               | which, 3SAT solvers are the current state of the art).
        
               | jcranmer wrote:
               | Your assessment is pretty much correct; it boils down to
               | a constrained optimization problem (a la linear
               | programming or SAT). And this is a family of problem-
               | solving techniques for which we already have decently
               | powerful algorithms.
               | 
               | Historically, elements of constrained optimization have
               | been considered AI--most intro to AI courses will
               | probably cover some variant of backtracking solver or
               | various hill-climbing techniques--but it doesn't really
               | mesh with what modern AI tends to focus on. Deep learning
               | isn't likely to provide any wowza-level results in such a
               | space.
        
             | gopalv wrote:
             | > You don't really need deep-learning or an AI to do a lot
             | of these graph operations.
             | 
             | That is right, but we're probably not hunting for
             | absolutely optimal solutions every time.
             | 
              | It usually boils down to heuristics - every time there's a
              | graph reorganization problem where a sub-optimal but still
              | better solution is viable, the solutions always involve
              | short-cuts derived from the domain knowledge in there.
             | 
              | This is usually very similar to AI training loops: an
              | engineer makes a guess, experiments validate and modify
              | the heuristic to improve the results, and finally you give
              | up when the results look acceptable.
             | 
             | AIs would do well at parameter tuning, at the very least.
             | 
             | My register allocation work would have near infinite loops
             | reordering priorities, until I started forcing invariants
             | into the system which didn't exist in the original design.
             | 
              | And the usual argument about "code is compiled once,
              | executed millions of times" doesn't quite apply to a JIT.
        
       | freemint wrote:
       | A more technical description:
       | 
        | Hardware: DSP slices (multiply-accumulate, MAC) and look-up
        | tables (LUTs) in fixed-function blocks connected to registers
        | hooked up to a crossbar switch.
       | 
       | Software: LLVM target of what amounts to an FPGA with a ton of
       | DSP slices
       | 
        | Company: Has been around for a long time, since 2005(?), under a
        | slightly different company name; the huge marketing push is due
        | to a new round of funding.
       | 
       | Learn more: https://llvm.org/ProjectsWithLLVM/Ascenium.pdf
        
         | baybal2 wrote:
         | Very informative. Basically they have reinvented an FPGA with
         | fixed blocks...
         | 
         | > The idea of a compiler-based software solution embedded in an
         | architecture would theoretically allow the Aptos processor to
         | interpret workload instructions and distribute them across
         | processing resources in such a way that the amount of work
         | being parallelised is as close as possible to the theoretical
         | maximum, whilst taking advantage of having much lesser
         | architectural inefficiencies than instruction-based processors.
         | 
          | Whether there is an instruction set or not, you will have to
          | have some kind of convention for the bitstream that configures
          | the re-programmable logic, and very likely you will end up
          | re-implementing a kind of instruction set to do that with a
          | high level of efficiency.
         | 
          | Modern CPUs already all run microcode on some kind of
          | programmable fabric in the front-end to give some degree of
          | programmability over how the front-end and back-end interact.
        
           | jcranmer wrote:
           | The actual hardware details are kind of danced around a lot
           | in both this article and the somewhat more detailed article
           | it links to (https://www.nextplatform.com/2021/07/12/gutting-
           | decades-of-a...).
           | 
           | From what I can tell, it appears to be a mix between a
           | systolic array and Itanium. The systolic array piece is that
           | it's a grid of (near?) identical ALUs that communicate with
           | adjacent ALUs in lieu of registers. But it also seems that
           | there's an instruction stream element as well, something like
           | the function that each ALU is processing changes every clock
           | cycle or few? It definitely sounds like there's some non-
           | spatial component that requires faster reconfiguration times
           | than the FPGA reconfiguration logic.
           | 
           | As for viability, as another commenter points out, GPUs are
           | currently the competitor to beat. And beating them requires
           | either having stellar results in the don't-rewrite-code space
           | or targeting use cases that GPUs don't do well at. The latter
           | cases involve things like divergent branches or
           | irregular/indirect memory accesses, and these cases tend to
           | be handled very poorly by compilers and very well by the
           | standard superscalar architecture anyways.
        
             | dragontamer wrote:
             | Not even that.
             | 
             | Xilinx's processors have a VLIW instruction set feeding a
             | SIMD-DSP surrounded by reconfigurable LUTs.
             | 
             | Ascenium will have to try to displace Xilinx's processors,
             | which are clearly aiming at this "systolic processor" like
             | architecture.
             | 
             | https://www.xilinx.com/support/documentation/white_papers/w
             | p...
        
               | freemint wrote:
               | To be fair DSPs generally tend to be VLIW even without
               | SIMD as seen in this video on the Sega Saturn
               | https://www.youtube.com/watch?v=wU2UoxLtIm8 .
        
       | desireco42 wrote:
        | So I am not an expert, but from their description, they move the
        | more complex parts to the compiler... To me that sounds a little
        | bit like Forth on a chip, just an accumulator and some basic
        | instructions? Is this fair to say, or am I not understanding it?
        
       | femto wrote:
       | Related to Reconfigurable Computing?
       | 
       | https://en.wikipedia.org/wiki/Reconfigurable_computing
       | 
        | It was big in the 1990s, but never took off. Maybe its time has
        | come?
        
       | jerf wrote:
       | I post this in the spirit of being corrected if I am wrong, the
       | Internet's core skill.
       | 
       | If I'm reading this correctly, the competition for this
       | technology isn't CPUs, it's GPUs. If that's the case, it seems
       | like they'd be able to show a lot of artificially inflated
       | numbers vs. CPUs simply by running already-known-GPU-heavy
       | workloads on their hardware vs. CPUs and show massive performance
       | gains vs. the CPUs... but what about an optimal GPU algorithm?
       | 
       | Traditionally, making the hardware that can support
       | reconfigurability at a highly granular level (FPGAs, for
       | instance) has never been able to catch up with conventional
        | silicon, because you can't recover the _quite_ significant
       | costs of all that reconfigurability with improved architecture.
       | The conventional silicon will still beat you with much faster
       | speeds even if you take a penalty vs. some theoretical maximum
       | they could have achieved with some other architecture.
       | 
        | Plus, again, GPUs _really_ cut into the whole space of "things
       | CPU can't do well". They haven't claimed the whole space, but
       | they certainly put their marker down on the second-most
       | profitable segment of that space. Between a strong CPU and a
       | strong GPU you've got quite a lot of computing power across a
       | very wide set of possible computation tasks.
       | 
       | I am not an expert in this space, but it seems like the only
       | algorithms I've seen lately people proposing custom hardware for
       | at scale are neural network evaluation. By "at scale", I mean,
       | there's always "inner loop" sorts of places where someone has
       | some specific task they need an ASIC for, encryption being one of
       | the classic examples of something that has swung back and forth
       | between being done in CPUs and being done in ASICs for decades
       | now. But those problems aren't really a candidate for this sort
       | of tech because they are _so_ in need of hyperoptimization that
        | they do in fact go straight to ASICs. It doesn't seem to me like
       | there's a _huge_ middle ground here anymore between stuff worth
       | dedicating the effort to make ASICs for (which almost by
       | definition will outperform anything else), and the stuff covered
        | by some combination of CPU & GPU.
       | 
       | So... how am I misreading this, oh great and mightily contentious
       | Internet?
        
         | pclmulqdq wrote:
         | You are correct that the main competing technology is
         | throughput compute (GPUs, TPUs, ML processors) rather than
         | "general-purpose" CPU-style compute with batch size of 1. The
         | way we use CPUs is fairly unique, though, where we time share a
         | lot of different code on each core of a system. Trying to move
         | CPU applications to an FPGA-style device would be a disaster.
         | 
         | I used to believe that FPGAs could compete with GPUs in
         | throughput compute cases thanks to high-level synthesis and
         | OpenCL, but it looks like most applications have gone with
         | something more ASIC-like than general-purpose (eg TPUs and
         | matrix units in GPUs for ML).
        
         | dragontamer wrote:
         | CPUs __ARE__ an ASIC.
         | 
         | You can get a commodity CPU today at 4.7+ GHz clock and 64 MB
         | of L3 cache (AMD Ryzen 9 5950X). There's no FPGA / configurable
         | logic in the world that comes even close. AMD's even shown off
         | a test chip with 2x96 MB of L3 cache.
         | 
         | GPUs __ARE__ an ASIC, but with a different configuration than
         | CPUs. An AMD MI100 comes with 120 compute units, each CU has
         | 4x4x16 SIMD-lanes and each lane has 256 x 32-bit __registers__.
         | (AMD runs each lane once-per-4 clock ticks)
         | 
          | That's 32MB of __registers__, accessible once every 4th
         | clock tick. Sure, a clock of 1.5GHz is much slower but you
         | ain't getting this number of registers or SRAM on any FPGA.
         | 
         | > I am not an expert in this space, but it seems like the only
         | algorithms I've seen lately people proposing custom hardware
         | for at scale are neural network evaluation
         | 
         | The FPGA / ASIC stuff is almost always "pipelined systolic
         | array". The systolic array is even more parallel than a GPU,
         | and requires data to move "exactly" as planned. The idea is
         | that instead of moving data to / from central registers, you
         | move data across the systolic array to each area that needs it.
         | 
         | ----------
         | 
         | CPUs: Today's CPUs __discover__ parallelism in standard
          | assembly code (aka: instruction level parallelism). They spend
          | a lot of energy "decoding", which is really "scheduling" which
         | instruction should run in which of some 8 to 16 pipelines
         | (depending on AMD Zen vs Skylake vs Apple M1). Each pipeline
         | has different characteristics (this one can add / multiply, but
         | another one can add / multiply / divide). Sometimes some
         | instructions take up the whole pipeline (divide), other times,
         | multiply takes 5 clock ticks but can "take an instruction"
         | every clock tick.
         | 
          | CPUs spend a huge amount of effort discovering the optimal,
          | parallel execution for any chunk of assembly code, and today
          | scan ~400 to 700 instructions to do so (the rough size
         | of a typical CPU's reorder buffer). That's why branch
         | prediction is so important.
         | 
         | In effect: CPUs are today's ultimate MIMD (multiple
         | instructions / multiple data) computers... with a big decoder
         | in front ensuring the various pipelines remain full.
         | 
         | -----------
         | 
         | GPUs: SIMD execution. If parallelism is known ahead of time,
         | this is a better architecture. Instead of spending so much
         | space and energy on discovering parallelism, you have the
         | programmer explicitly write a parallel program from the start.
         | 
          | GPUs are today's ultimate SIMD (single instruction / multiple
          | data) computers.
         | 
         | -------------
         | 
         | Systolic Array: Will never be a general processor, must be
         | designed for each specific task at hand. Can only work if all
         | data-movements are set in stone ahead of time (such as matrix
         | multiplication). Under these conditions, is even more parallel
         | and efficient than a GPU.
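          | 
          | For a feel of data movement that is planned ahead of time,
          | here is a standard tiled matrix multiply in CUDA. It is only
          | an analogy for a systolic array (a GPU still has a register
          | file), but every load into the on-chip tiles is scheduled by
          | the programmer rather than discovered by the hardware.
          | Assumes N is a multiple of T and a TxT thread block.
          | 
          |     #define T 16   // tile edge
          | 
          |     __global__ void mm(const float *A, const float *B,
          |                        float *C, int N) {
          |       __shared__ float a[T][T], b[T][T];
          |       int tx = threadIdx.x, ty = threadIdx.y;
          |       int r = blockIdx.y * T + ty;
          |       int c = blockIdx.x * T + tx;
          |       float acc = 0.f;
          |       for (int t = 0; t < N / T; ++t) {
          |         // stage one tile of A and one of B on-chip
          |         a[ty][tx] = A[r * N + t * T + tx];
          |         b[ty][tx] = B[(t * T + ty) * N + c];
          |         __syncthreads();         // tile loaded
          |         for (int k = 0; k < T; ++k)
          |           acc += a[ty][k] * b[k][tx];
          |         __syncthreads();         // tile consumed
          |       }
          |       C[r * N + c] = acc;
          |     }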
        
           | moonbug wrote:
            | Remind me what the ASIC acronym means.
        
             | dragontamer wrote:
             | Application specific integrated circuit.
             | 
             | The Intel / AMD CPUs are an ASIC that execute x86 as
             | quickly as possible. Turns out to be a very important
             | application :-)
             | 
             | My point being: you aren't going to beat a CPU at a CPU's
             | job. A CPU is an ASIC that executes assembly language as
             | quickly as possible. Everything in that chip is designed to
             | accelerate the processing of assembly language.
             | 
             | --------
             | 
             | You win vs CPUs by grossly changing the architecture.
             | Traditionally, by using a systolic array. (Which is very
             | difficult to make a general purpose processor for).
        
               | Brian_K_White wrote:
               | I think it's at least a fair question to ask what makes
               | "execute x86" such an important application?
               | 
               | Sure the x86 cpu is the best way to execute x86
               | instructions, but so what?
               | 
               | I do not actually care about x86. I don't write it or
               | read it.
               | 
                | I write bash and c and kicad and openscad and markdown
                | etc..., and really even those are just today's convenient
               | means of expression. The same "what's so untouchable
               | about x86" is true for c.
               | 
               | I actually care about manipulating data.
               | 
               | I don't mean databases, I mean everything, like the input
               | from a sensor and the output to an actuator or display is
               | all just manipulating data at the lowest level.
               | 
                | Maybe this new architecture idea cannot perform my
                | freecad modelling task faster or more efficiently than my
                | i7, but I see nothing about the macroscopic job that
                | dictates it's already being mapped to hardware in the
                | most elegant way possible by translating to the x86 ISA
                | and executing on an x86 ASIC.
        
               | dragontamer wrote:
               | > I think it's at least a fair question to ask what makes
               | "execute x86" such an important application?
               | 
               | x86, ARM, POWER9, and RISC-V are all the same class of
               | assembly languages. There's really not much difference
               | today in their architectures.
               | 
               | All of them are heavily pipelined, heavily branch
               | predicted, superscalar out-of-order speculative
               | processors with cache coherence / snooping to provide
               | some kind of memory model that's standardizing upon
               | Acquire/Release semantics. (Though x86 remains in Total-
               | store ordering model instead).
               | 
               | It has been demonstrated that this architecture is the
               | fastest for executing high level code from Bash, C, Java,
               | Python, etc. etc. Any language that compiles down into a
               | set of registers / jumps / calls (including indirect
                | calls) that supports threads of execution is inevitably
               | going to look a hell of a lot like x86 / ARM / POWER9.
               | 
               | ----------
               | 
               | If you're willing to change to OpenCL / CUDA, then you
               | can execute on SIMD-computers such as NVidia Ampere or
                | AMD CDNA. It's a completely different execution model than
               | x86 / ARM / POWER9 / RISC-V, with a different language to
               | support the differences in performance (ex: x86 / POWER9
               | have very fast spinlocks. CUDA / OpenCL has very fast
               | thread-barriers).
               | 
               | There's a CUDA-like compiler for x86 AVX + ARM-NEON
               | called "ispc" for people who want to have CUDA-like
               | programming on a CPU. But it executes slower, because
               | CPUs have much smaller SIMD-arrays than a GPU. (but
               | there's virtually no latency, because x86 AVX / ARM-NEON
               | SIMD registers are in the same core as the rest of their
               | register space. Like... 1 clock or 2 clocks latency but
               | nothing like the 10,000+ clock ticks to communicate to a
               | remote GPU)
               | 
               | ----------
               | 
               | Look, if webservers and databases and Java JITs / Bash
               | interpreters / Python interpreters could be executed by
               | something different (ex: a systolic array), I'm sure
               | someone would have tried by now.
               | 
               | But look at the companies who made Java: IBM and Sun.
               | What kind of computers did they make? POWER9 / SPARC.
               | That's the fruit of their research: the computer
                | architecture that best suits Java programming according
               | to at least two different sets of researchers.
               | 
               | And what is POWER9? Its a heavily pipelined, heavily
               | branch predicted, superscalar out-of-order speculative
               | core with cache-coherent acquire/release semantics for
               | multicore communication. Basically the same model as a
               | x86 processor.
               | 
               | POWER9 even has the same AES-acceleration and 128-bit
               | vector SIMD units similar to x86's AVX or SSE
               | instructions.
               | 
               | You get a few differences (SMT4 on POWER9 and bigger L3
               | cache), but the overall gameplan is extremely similar to
               | x86.
        
               | gnufx wrote:
               | There are some strange assertions there apart from the
               | definition of ASIC. ISAs aren't assembly languages. I
                | could believe there may be more ARM and,
               | particularly, RISC-V chips without all those features
               | than with. Since when has C the language specified what
               | it compiles to, to exclude Lisp machines? I read about
               | Oak (later Java) on my second or third generation of
               | SPARC workstation, when I'd used RS/6000. Few people
               | remember what Sun did market to run Java exclusively. IBM
               | might argue about the similarity of our POWER9 to x86 but
               | I don't much care.
        
         | mhh__ wrote:
          | In comparing with CPUs I would also want to see thorough
          | measurements of latency.
        
         | dw-im-here wrote:
         | Yes you are
        
       ___________________________________________________________________
       (page generated 2021-07-15 23:02 UTC)