[HN Gopher] What's new in CPUs since the 80s? (2015)
___________________________________________________________________
What's new in CPUs since the 80s? (2015)
Author : snvzz
Score : 140 points
Date : 2022-04-20 06:11 UTC (2 days ago)
(HTM) web link (danluu.com)
(TXT) w3m dump (danluu.com)
| susrev wrote:
| The vast expanses of text with no formatting rules always make
| it hard for me to follow along. I added some simple rules that
| make it much easier to read.
|
| p { max-width: 1000px; text-align: center; margin-left: auto;
| margin-right: auto }
|
| body { text-align: center }
|
| MUCH easier (code snippets = toast)
| jimjambw wrote:
| What about reader mode?
| jcadam wrote:
| My Ryzen 9 is just a bit more performant than the MOS6502 my
| Apple ][ was rocking back in the 80s.
|
| Ok... it also has things like a built-in FPU, multiple cores,
| cache, pipelining, branch prediction, more address space, more
| registers, and is manufactured on a significantly better (and
| smaller) process.
| phendrenad2 wrote:
| Nice article. Kind of low-hanging fruit though. A comparison
| between CPUs in 2022 vs CPUs in 2002 would be much more
| interesting. ;)
| xmprt wrote:
| Not really low-hanging fruit if the last time you studied CPU
| design was in a university course. I personally found a lot of
| the information pretty interesting.
| nimbius wrote:
| my favourite addition since the 80s has been the unrelenting,
| unquestioned, ubiquitous and permanent inclusion of numerous
| iterations of poorly planned and executed management engine
| frameworks designed to completely ablate the user from the
| experience of general computing in the service of perpetual and
| mandatory DRM and security theatricality. the best aspect of this
| new feature is that not only is your processor effectively
| indistinguishable from a rented pressure washer, but on a long enough
| timeline the indelible omniscient slavemaster to which your IO is
| subservient can and will always find itself hacked. One of the
| biggest features missing from 80s processors was the ability to
| watch a C-level cog from a multinational conglomerate squirm in
| an overstarched Tom Ford three-piece as tech journalists
| methodically connect the dots between a corpulent scumbag and a
| hastily constructed excuse to hobble user freedoms at the behest
| of the next Minions movie, to arrive finally at a conclusion
| that takes said chipmaker's stock to the barber.
|
| oh and chips come with light-up fans and crap now but there's no
| open standard on how to control the light color, so everyone
| just leaves it in Liberace disco mode and it's like a wonderful
| little rainbow coloured pride parade is marching through my case.
| Kuinox wrote:
| Is that GPT-3 generated text? It must be.
| Epiphany21 wrote:
| >oh and chips come with light-up fans and crap now but there's
| no open standard on how to control the light color so everyone
| just leaves it in Liberace disco mode
|
| This is why I continue to pay a premium for workstation/server
| grade hardware even when I'm assembling the system myself.
| hengheng wrote:
| Are you alright
| kccqzy wrote:
| > With substantial hardware effort, Google was able to avoid
| interference, but additional isolation features could allow this
| to be done at higher efficiency with less effort.
|
| This is surprising to me. Running any workload that's not your
| own will trash the CPU caches and make your workload slower.
|
| Consider, for example, that your performance-sensitive code has
| nothing to do for the next 500 microseconds. If the core runs
| some other best-effort work, it _will_ trash the CPU caches, so
| that after those 500 microseconds, even if that other work is
| immediately preempted by the kernel, your performance-sensitive
| code is now dealing with a cold cache.
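|
| You can see the effect with a toy microbenchmark (a minimal
| sketch; the buffer sizes are assumptions, tune them to your own
| cache hierarchy):
|
|     #include <chrono>
|     #include <cstdio>
|     #include <vector>
|
|     // Time one pass over a "hot" working set before and after
|     // a cache-trashing pass standing in for best-effort work.
|     int main() {
|         std::vector<char> hot(256 * 1024);         // ~L2-sized
|         std::vector<char> junk(64 * 1024 * 1024);  // evicts all
|         volatile long sink = 0;
|         auto pass = [&](std::vector<char>& v) {
|             long s = 0;
|             for (std::size_t i = 0; i < v.size(); i += 64)
|                 s += v[i];
|             sink += s;
|         };
|         auto time_pass = [&](std::vector<char>& v) {
|             auto t0 = std::chrono::steady_clock::now();
|             pass(v);
|             auto t1 = std::chrono::steady_clock::now();
|             return std::chrono::duration_cast<
|                 std::chrono::nanoseconds>(t1 - t0).count();
|         };
|         pass(hot);                        // warm the cache
|         long long warm = time_pass(hot);
|         pass(junk);                       // "best effort" work
|         long long cold = time_pass(hot);
|         std::printf("warm %lld ns, cold %lld ns\n", warm, cold);
|     }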
| ant6n wrote:
| 2015? I think there's a date missing on the page.
| MBCook wrote:
| Near the end, in one section, the author provides an update and
| refers to it being 2016, a year after they wrote the article.
| gotaquestion wrote:
| The section on power really understates the complexity.
| Throttling didn't appear until the mid-90s, as coarse chip-wide
| clock gating. Voltage/frequency scaling appeared a few years
| later (gradual P-state transitions). Then power control units
| monitored key activity signals and could not only scale the
| voltage, but also estimate power and target specific blocks
| (e.g., turning off the L1 D$).
|
| There are some more details in there but that's the main gist.
| The power control unit is its own operating system!
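|
| You can watch the DVFS half of that machinery from userspace; a
| minimal sketch, assuming a Linux box with the cpufreq sysfs
| interface:
|
|     #include <fstream>
|     #include <iostream>
|     #include <string>
|
|     // Read one cpufreq attribute for cpu0.
|     static std::string attr(const std::string& name) {
|         std::ifstream f(
|             "/sys/devices/system/cpu/cpu0/cpufreq/" + name);
|         std::string v;
|         std::getline(f, v);
|         return v;
|     }
|
|     int main() {
|         // The governor picks the P-state; values are in kHz.
|         std::cout << "governor: " << attr("scaling_governor")
|                   << "\ncur kHz:  " << attr("scaling_cur_freq")
|                   << "\nmax kHz:  " << attr("scaling_max_freq")
|                   << "\n";
|     }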
| Aardwolf wrote:
| Off the top of my head (before reading the article):
|
| caches, pipelining, branch prediction, memory protections, SIMD,
| floating point at all, hyper-threading, multi-core, needing
| cooling fins, let alone active cooling
|
| I wonder how much I've forgotten
| yvdriess wrote:
| All of those already existed by the 80s.
| ithkuil wrote:
| The US patent for the technology behind hyper-threading was
| granted to Kenneth Okin at Sun Microsystems in November 1994.
| code_biologist wrote:
| I don't want to dismiss hyper-threading as trite -- it's
| not, especially in implementation, but it is pretty
| obvious.
|
| Prior to 1994 the CPU-memory speed delta wasn't so bad that
| you needed to cover for stalled execution units constantly.
| Looking at the core clock vs FSB of 1994 Intel chips is a
| great throwback! [1] Then CPU speed exploded relative to
| memory, as was probably anticipated by forward looking CPU
| architects in 1994.
|
| With slow memory there are a few obvious changes you make, to
| the degree you need to cover for load stalls: 1) OoO execution,
| 2) data prefetching, 3) finding other computation (that likely
| has its own memory stalls) to interleave. The thread is a pretty
| obvious granularity at which to interleave work, even if it's
| deeply non-trivial to actually implement.
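|
| For (2), compilers even expose software prefetch directly; a
| minimal sketch using GCC/Clang's __builtin_prefetch (the node
| layout is made up for illustration):
|
|     struct Node { Node* next; long payload; };
|
|     long sum_list(Node* n) {
|         long s = 0;
|         while (n) {
|             // Hint the next node into cache while we work on
|             // this one; real code would prefetch further ahead.
|             if (n->next) __builtin_prefetch(n->next, 0, 3);
|             s += n->payload;
|             n = n->next;
|         }
|         return s;
|     }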
|
| Performance-oriented programmers have always had to think about
| memory access patterns. Needing to be friendly to your
| architecture there isn't new since the 80s.
|
| [1] https://en.wikipedia.org/wiki/Pentium#Pentium
| formerly_proven wrote:
| The CDC 6600 ran ten threads on one processor in a way that
| seems a lot like the Niagara T1 on paper.
| varjag wrote:
| Most of that (possibly all) existed by the 1980s. The Z80 in my
| Spectrum had no heatsink ;)
| [deleted]
| amelius wrote:
| Getting rid of segmented memory.
| infogulch wrote:
| Have you heard of this newfangled device called a Graphics
| Processing Unit and "VRAM"?
| amelius wrote:
| Yeah, but that's GPU, not CPU. Hopefully we will see
| similar progress there in the next 40 years.
| vardump wrote:
| The alternative wasn't that great either. Having just 16 address
| bits, with code and data stuck into the same 64 kB RAM area, was
| a lot worse (the 8051 was like that, for example). Or 65816-style
| banks, ugh.
|
| If you had to have _just 16_ address bits, having code (CS),
| stack (SS), data (DS), extra (ES), etc. segments was actually
| pretty nice. Memory copying and scanning operations were natural
| without needing to swap banks in the innermost loop.
|
| Of course if you could afford 32-bit addressing, there's no
| comparison. Flat memory space is the best option, but I don't
| think it came for free.
| fpoling wrote:
| Segmented memory may come back to provide a cheap way to sandbox
| code within a process.
| VyseofArcadia wrote:
| > Even though incl is a single instruction, it's not guaranteed
| to be atomic. Internally, incl is implemented as a load followed
| by an add followed by a store.
|
| I've heard the joke that RISC won because modern CISC processors
| are just complex interfaces on top of RISC cores, but there
| really is some truth to that, isn't there?
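|
| Which is exactly why shared counters need std::atomic, whose
| fetch_add compiles to a lock-prefixed add on x86. A minimal
| sketch of the difference:
|
|     #include <atomic>
|     #include <cstdio>
|     #include <thread>
|
|     int plain = 0;             // increments can be lost
|     std::atomic<int> safe{0};  // lock add: atomic RMW
|
|     int main() {
|         auto work = [] {
|             for (int i = 0; i < 1000000; ++i) {
|                 ++plain;  // load + add + store (a data race,
|                           // shown only to show lost updates)
|                 safe.fetch_add(1, std::memory_order_relaxed);
|             }
|         };
|         std::thread a(work), b(work);
|         a.join(); b.join();
|         // plain is usually < 2000000; safe is exactly 2000000.
|         std::printf("plain=%d safe=%d\n", plain, safe.load());
|     }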
| terafo wrote:
| > _but there really is some truth to that, isn 't there?_
|
| There was an Arm version of AMD's Zen, it was called K12, but
| it never made it to the market since AMD had to choose their
| bets very carefully back then.
| [deleted]
| gchadwick wrote:
| Interesting point about L1 cache sizes and their relationship to
| page size:
|
| > Also, first-level caches are usually limited by the page size
| times the associativity of the cache. If the cache is smaller
| than that, the bits used to index into the cache are the same
| regardless of whether you're looking at the virtual address or
| the physical address, so you don't have to do a virtual to
| physical translation before indexing into the cache. If the cache
| is larger than that, you have to first do a TLB lookup to index
| into the cache (which will cost at least one extra cycle), or
| build a virtually indexed cache (which is possible, but adds
| complexity and coupling to software). You can see this limit in
| modern chips. Haswell has an 8-way associative cache and 4kB
| pages. Its l1 data cache is 8 * 4kB = 32kB.
|
| Having helped build the virtually indexed cache of the Arm A55,
| I can confirm it's a complete nightmare, and I can see why Intel
| and AMD have kept to the L1 data cache limit required to avoid
| it.
|
| Interestingly Apple may have gone down the virtually indexed
| route (or possibly some other cunning design corner) for the M1
| with their 128 kB data cache. However, I believe they
| standardized on 16k pages, which would still allow physical
| indexing with an 8-way associative cache. So what do they do
| when they're running x86 code with 4k pages? Do they drop 75% of
| their L1 cache to maintain physical indexing? Do they
| aggressively try and merge the x86 4k pages into 16k pages with
| some slow back-up when they can't do that? Maybe they've gone
| with some special-purpose hardware support for emulating x86 4k
| pages on their 16k-page architecture. Or have they indeed
| implemented a virtually indexed cache?
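|
| The constraint itself is just arithmetic: the set-index bits
| have to fit inside the page offset, i.e. cache_size <=
| associativity * page_size. A minimal sketch with the numbers
| from this thread (the M1 cases are the speculation above):
|
|     #include <cstdio>
|
|     // True iff the cache can be indexed before translation.
|     constexpr bool phys_indexable(long cache, long ways,
|                                   long page) {
|         return cache <= ways * page;
|     }
|
|     int main() {
|         std::printf("Haswell 32kB 8-way, 4k pages: %d\n",
|                     phys_indexable(32 << 10, 8, 4 << 10));   // 1
|         std::printf("M1 128kB 8-way, 16k pages:    %d\n",
|                     phys_indexable(128 << 10, 8, 16 << 10)); // 1
|         std::printf("M1 128kB 8-way, 4k pages:     %d\n",
|                     phys_indexable(128 << 10, 8, 4 << 10));  // 0
|     }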
| zozbot234 wrote:
| > Do they aggressively try and merge the x86 4k pages into 16k
| pages with some slow back-up when they can't do that?
|
| This does not seem feasible because the 16k pages on ARM are
| not "huge" pages; it's a completely different arrangement of
| the virtual address space and page tables. The two are not
| interoperable.
| heavyset_go wrote:
| Please add an RSS or Atom feed to your blog :)
| [deleted]
| jeffbee wrote:
| 2015. A good exercise would be "What's new in CPUs since 2015?" A
| few I can think of: branch target alignment has returned as a key
| to achieving peak optimization, after a brief period of
| irrelevance on x86; x86 user-space monitor/wait/pause have
| exposed for the first time explicit power controls to user
| programs.
|
| One thing I would have added to "since the 80s" is the x86
| timestamp counter. It really changed the way we get timing
| information.
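|
| (These days reading it is a single intrinsic; a minimal sketch,
| noting that on modern "invariant TSC" parts it ticks at a fixed
| reference rate rather than the core clock:)
|
|     #include <x86intrin.h>
|     #include <cstdio>
|
|     int main() {
|         unsigned long long t0 = __rdtsc(); // read the TSC
|         volatile int x = 0;
|         for (int i = 0; i < 1000; ++i) x += i;
|         unsigned long long t1 = __rdtsc();
|         std::printf("%llu reference cycles\n", t1 - t0);
|     }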
| flakiness wrote:
| Big.Little-like architecture? Even Intel has adopted that in
| their 12th gen.
|
| I believe a lot has happened around mobile and power as well.
| Apple boasts about their progress every year, and at least some
| of it is real. But they are too secretive to talk about the
| details. I hope some competitors have written some related
| papers. For example, the OP talks about dark silicon. What's
| going on around it these days?
| titzer wrote:
| Spectre. It was a vulnerability before 2015, but not known
| publicly until early 2018. It's hugely disruptive to
| microarchitecture, particularly with crossing kernel/user space
| boundaries, separating state between hyperthreads, etc.
| dragontamer wrote:
| L3 caches have grown monstrously.
|
| The new AMD Ryzen 5800X3D has 96MB of L3 cache. This is so
| monstrous that the 2048-entry TLB with 4kB pages can only cover
| 8MB.
|
| That's right: you run out of TLB entries before you run out of
| L3 cache these days. (Or you start using hugepages, damn it.)
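|
| (On Linux the cheap way to ask for hugepages is transparent
| hugepages via madvise; a minimal sketch, error handling mostly
| elided:)
|
|     #include <sys/mman.h>
|     #include <cstddef>
|
|     void* alloc_huge(std::size_t len) {
|         void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
|                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
|         if (p == MAP_FAILED) return nullptr;
|         // Ask THP to back this range with 2MB pages: one TLB
|         // entry then covers what 512 4kB entries would.
|         madvise(p, len, MADV_HUGEPAGE);
|         return p;
|     }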
|
| ----------
|
| I think Intel's PEXT and PDEP were introduced around the 2015
| era. But AMD chips now execute PEXT / PDEP quickly, so it's now
| feasible to use them on most people's modern systems (assuming
| Zen 3 or a 2015+ era Intel CPU). Obviously those instructions
| don't exist in the ARM / POWER9 world, but they're really fun to
| experiment with.
|
| PEXT / PDEP are effectively bitwise-gather and bitwise-scatter
| instructions, and can be used to perform extremely fast and
| arbitrary bit-permutations. I played around with them to
| implement some relational-database operations (join, select,
| etc.) over bit-relations for the 4-coloring theorem. (Just a
| toy to amuse myself with. A 16-bit bitset of
| "0001_1111_0000_0000" means "(Var1 == Color4 and Var2 == Color1)
| or (Var2 == Color2)".)
|
| There's probably some kind of tight relational algebra /
| automatic logic proving / binary decision diagram stuff that you
| can do with PEXT/PDEP. It really seems like an unexplored field.
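|
| If you want to play with them, the BMI2 intrinsics are in
| immintrin.h (compile with -mbmi2; the masks here are arbitrary
| examples):
|
|     #include <immintrin.h>
|     #include <cstdio>
|
|     int main() {
|         // PEXT gathers the bits of src selected by mask into
|         // the low bits; PDEP scatters low bits back out to the
|         // positions set in mask.
|         unsigned long long src  = 0xB6; // 1011'0110
|         unsigned long long mask = 0xF0; // 1111'0000
|         unsigned long long g = _pext_u64(src, mask); // 0xB
|         unsigned long long s = _pdep_u64(g, mask);   // 0xB0
|         std::printf("pext=%llx pdep=%llx\n", g, s);
|     }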
|
| ----
|
| EDIT: Oh, another big one. ARMv8 and POWER9 standardized upon
| the C++11 memory model of acquire-release. This was inevitable:
| Java and C++ standardized upon the memory model in the 00s /
| early 10s, so chips would inevitably be tailored to that model.
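|
| That model is the usual release/acquire handoff; a minimal
| sketch of what the chips now promise to make work:
|
|     #include <atomic>
|     #include <cassert>
|     #include <thread>
|
|     int data = 0;
|     std::atomic<bool> ready{false};
|
|     int main() {
|         std::thread producer([] {
|             data = 42;                       // plain write
|             ready.store(true,
|                 std::memory_order_release);  // publish
|         });
|         std::thread consumer([] {
|             while (!ready.load(std::memory_order_acquire)) {}
|             assert(data == 42); // acquire pairs with release
|         });
|         producer.join();
|         consumer.join();
|     }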
| seoaeu wrote:
| > That's right, you run out of TLB-entries before you run out
| of L3 cache these days.
|
| This is more reasonable than it sounds. A TLB _miss_ can in
| many cases be faster than an L3 cache _hit_.
| jeffbee wrote:
| It's also misleading because it has 8 cores and each of
| them has 2048 L2 TLB entries. Altogether they can cover
| 64MiB of memory with small pages.
| dragontamer wrote:
| But the 5800X3D has 96MB of L3. So even if all 8 cores are
| independently working on different memory addresses, you
| still can't cover all 96MB of L3 with the TLB.
|
| EDIT: Well, unless you use 2MB hugepages of course.
| jeffbee wrote:
| That's another thing which is recent. Before Haswell, x86
| cores had almost no hugepage TLB entries. Ivy Bridge only
| had 32 entries in 2MiB mode, compared to 64 + 512 in 4KiB
| mode.
| dragontamer wrote:
| Are you sure? TLB misses mean a pagewalk. Sure, the
| directory tree is probably in L3 cache, but repeatedly
| pagewalking through L3 to find a memory address is going to
| be slower than just fetching it from the in-core TLB.
|
| I know that modern cores have dedicated page walking units
| these days, but I admit that I've never tested the speed of
| them.
| seoaeu wrote:
| It only takes ~200KB to store page tables for 96MB of
| address space (96MB / 4kB = 24,576 PTEs at 8 bytes each, or
| ~192KB). So the page table entries might mostly stay in the
| L1 and L2 caches.
| jcranmer wrote:
| Intel PT is another thing that's worth calling out since 2015
| (see the other article on the front page right now,
| https://news.ycombinator.com/item?id=31121319, for something
| that benefits from it).
|
| It does look like Hardware Lock Elision / Transactional Memory
| will be consigned to the dustbin of history (again).
| jeffbee wrote:
| Intel did not ship even one working implementation of TSX, so
| it's not like anyone will be inconvenienced that they
| cancelled it.
| luhn wrote:
| One thing I've always wondered: How do pre-compiled binaries take
| advantage of new instructions, if they do at all? Since the
| compiler needs to create a binary that will work on any modern-
| ish machine, is there a way to use new instructions without
| breaking compatibility for older CPUs?
| Jasper_ wrote:
| Some compilers have a dynamic dispatch for this; you run the
| "cpuid" instruction and check for capability bits, and then
| dispatch to the version you can support. Some dynamic linkers
| can even link different versions of a function depending on the
| CPU capabilities -- gcc has a special
| __attribute__((__target__(...))) function attribute for this.
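|
| A minimal sketch of that gcc function-multiversioning pattern
| (the resolver does the cpuid check at load time; the function
| body is a toy):
|
|     #include <cstdio>
|
|     __attribute__((target("avx2")))
|     float sum(const float* a, int n) {       // fast path
|         float s = 0;
|         for (int i = 0; i < n; ++i) s += a[i];
|         return s;
|     }
|
|     __attribute__((target("default")))
|     float sum(const float* a, int n) {       // older CPUs
|         float s = 0;
|         for (int i = 0; i < n; ++i) s += a[i];
|         return s;
|     }
|
|     int main() {
|         float a[4] = {1, 2, 3, 4};
|         std::printf("%f\n", sum(a, 4));      // picked at runtime
|     }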
| jeabays wrote:
| Unfortunately, the answer is usually just to recompile.
| runnerup wrote:
| You'd need to recompile the binary to take advantage of new
| instructions.
|
| The compiler alone, but also explicit code, can create branches
| where the binary checks whether certain instructions are
| available and, if they are not, uses a less optimal operation.
|
| Basically backwards compatibility for modern binaries, but no
| forward ability to use instructions that haven't been invented
| yet.
|
| Not all binaries are fully backwards compatible. If you're
| missing AVX, a surprising number of games won't run. Sometimes
| only because the launcher won't run, even though the game plays
| without AVX.
| HideousKojima wrote:
| I've actually sometimes seen this as an argument in favor of
| JITed languages like C# and Java, that you can take advantage
| of newer CPU features and instructions etc. without having to
| recompile. In practice languages that compile to native
| binaries still win at performance, but it was interesting to
| see it turned into a talking point.
| wvenable wrote:
| JIT languages still have a bit of a trade-off.
|
| But for a pure pre-compiled example there is Apple Bitcode,
| which is meant to be compiled for the destination architecture
| before running. It's mandatory for Apple watchOS apps, and when
| they released a watch with a 64-bit CPU they just recompiled all
| the apps.
___________________________________________________________________
(page generated 2022-04-22 23:00 UTC)