[HN Gopher] Intel's Lion Cove Architecture Preview
       ___________________________________________________________________
        
       Intel's Lion Cove Architecture Preview
        
       Author : zdw
       Score  : 159 points
       Date   : 2024-06-04 03:11 UTC (19 hours ago)
        
 (HTM) web link (chipsandcheese.com)
 (TXT) w3m dump (chipsandcheese.com)
        
       | andrewia wrote:
       | I'm very interested to see independent testing of cores without
       | SMT/hyperthreading. Of course it's one less function for the
       | hardware and thread scheduler to worry about. But hyperthreading
       | was a useful way to share resources between multiple threads that
       | had light-to-intermediate workloads. Synthetic benchmarks might
        | show an improvement, but I'm interested to see how everyday
        | workloads, like web browsing while streaming a video, will react.
        
         | adrian_b wrote:
          | I was surprised that disabling SMT improved the Geekbench 6
          | multi-threaded results by a few percent on a Zen 3 (5900X)
          | CPU.
         | 
          | While there are also other tasks where SMT brings no
          | advantage, for the compilation of a big software project SMT
          | does bring an obvious performance improvement, about 20% on
          | the same Zen 3 CPU.
         | 
         | In any case, Intel has said that they have designed 2 versions
         | of the Lion Cove core, one without SMT for laptop/desktop
         | hybrid CPUs and one with SMT for server CPUs with P cores (i.e.
         | for the successor of Granite Rapids, which will be launched
         | later this year, using P-cores similar to those of Meteor
         | Lake).
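          | 
          | A quick way to get a feel for this on your own machine is to
          | time the same fixed amount of parallel work with one thread
          | per physical core versus one per logical CPU. A minimal
          | sketch, assuming a 2-way SMT machine and an otherwise idle
          | system (a real test should pin threads and use a
          | representative workload such as a compile):
          | 
          |   #include <chrono>
          |   #include <cstdio>
          |   #include <thread>
          |   #include <vector>
          | 
          |   static volatile unsigned long long sink;
          | 
          |   // Dependent integer chain standing in for "real" work.
          |   static void work(unsigned long long iters) {
          |       unsigned long long x = 1;
          |       for (unsigned long long i = 0; i < iters; ++i)
          |           x = x * 6364136223846793005ULL + 1;
          |       sink = x;
          |   }
          | 
          |   static double run(unsigned threads,
          |                     unsigned long long total) {
          |       auto t0 = std::chrono::steady_clock::now();
          |       std::vector<std::thread> ts;
          |       for (unsigned i = 0; i < threads; ++i)
          |           ts.emplace_back(work, total / threads);
          |       for (auto& t : ts) t.join();
          |       std::chrono::duration<double> dt =
          |           std::chrono::steady_clock::now() - t0;
          |       return dt.count();
          |   }
          | 
          |   int main() {
          |       unsigned logical = std::thread::hardware_concurrency();
          |       unsigned physical = logical / 2;  // assumes 2-way SMT
          |       unsigned long long total = 2'000'000'000ULL;
          |       std::printf("%2u threads: %.2f s\n",
          |                   physical, run(physical, total));
          |       std::printf("%2u threads: %.2f s\n",
          |                   logical, run(logical, total));
          |   }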
        
           | papichulo2023 wrote:
           | Probably because the benchmark is not using all cores so the
           | cores hit the cache more often.
        
         | dagmx wrote:
         | Generally HT/SMT has never been favored for high utilization
         | needs or low wattage needs.
         | 
          | On the high utilization end, stuff like offline rendering or
          | even some realtime games would see significant performance
          | degradation when HT/SMT is enabled. It was incredibly
          | noticeable when I worked in film.
         | 
         | And on the low wattage end, it ends up causing more overhead
         | versus just dumping the jobs on an E core.
        
           | The_Colonel wrote:
           | > And on the low wattage end, it ends up causing more
           | overhead versus just dumping the jobs on an E core.
           | 
           | For most of the HT's existence there weren't any E cores
           | which conflicts with your "never" in the first sentence.
        
             | dagmx wrote:
             | It doesn't because a lot of low wattage silicon doesn't
             | support HT/SMT anyway.
             | 
             | The difference is that now low wattage doesn't have to mean
             | low performance, and getting back that performance is
             | better suited to E cores than introducing HT.
        
               | The_Colonel wrote:
               | > It doesn't
               | 
               | Saying "no" doesn't magically remove your contradiction.
               | E cores didn't exist in laptop/PC/server CPUs before 2022
               | and using HT was a decent way to increase capacity to
               | handle many (e.g. IO) threads without expensive context
               | switches. I'm not saying E cores are a bad solution, but
               | somehow you're trying to erase historical context of HT
               | (or more likely just sloppy writing which you don't want
               | to admit).
        
               | dagmx wrote:
               | I've explained what I meant. You've interjected your own
               | interpretation of my comment and then gotten huffy about
               | it.
               | 
               | We could politely discuss it or you can continue being
               | rude by making accusations of sloppy writing and denials.
        
           | jeffbee wrote:
           | Backend-bound workloads that amount to hours of endless
           | multiplication are not that common. For workloads that are
           | just grab-bags of hundreds of unrelated tasks on a machine,
           | which describes the entire "cloud" thing and most internal
           | crud at every company, HT significantly increases the
           | apparent capacity of the machine.
        
         | mmaniac wrote:
         | The need for hyperthreading has diminished with increasing core
         | counts and shrinking power headroom. You can just run those
         | tasks on E cores now and save energy.
        
         | pjmlp wrote:
         | Since side-channel attacks became a common thing, there is
         | hardly a reason to keep hyperthreading around.
         | 
         | It was a product of its time, a way to get cheap multi-cores
         | when getting real cores was too expensive for regular consumer
         | products.
         | 
          | Besides the security issues, for high-performance workloads
          | hyperthreads have always been a problem, stealing resources
          | across shared CPU units.
        
           | sapiogram wrote:
           | > there is hardly a reason to keep hyperthreading around.
           | 
           | Performance is still a reason. Anecdote: I have a pet project
           | that involves searching for chess puzzles, and hyperthreading
           | improves throughput 22%. Not massive, but definitely not
           | nothing.
        
             | magnio wrote:
             | You mean 4 cores 8 threads give 22% more throughput than 8
             | cores 8 threads or 4 cores 4 threads?
        
               | rbanffy wrote:
                | Remember that core-to-core coordination takes longer
                | than coordination between threads of the same core.
        
               | sapiogram wrote:
               | 4c/8t gives more throughput than 4c/4t.
        
             | pjmlp wrote:
             | My experience with high performance computing is that the
             | shared execution units and smaller caches are worse than
             | dedicated cores.
        
               | rbanffy wrote:
               | As always, the answer is "it depends". If you are getting
               | too many cache misses, and are memory bound, adding more
                | threads will not help you much. If you have idling
                | processor backends, with FP, integer, or memory units
                | sitting there doing nothing, adding more threads might
                | extract more performance from the part.
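                | 
                | As a toy illustration of the two regimes (a sketch,
                | not a benchmark; the names are made up): the first
                | kernel is bound by load-to-use latency, so a sibling
                | SMT thread mostly fights it for cache, while the
                | second is one long dependency chain that leaves most
                | execution ports idle for a second thread to use.
                | 
                |   #include <cstddef>
                |   #include <vector>
                | 
                |   // Memory-latency bound: every load depends on the
                |   // previous one (a pointer chase through next[]).
                |   size_t chase(const std::vector<size_t>& next,
                |                size_t n) {
                |       size_t i = 0;
                |       for (size_t k = 0; k < n; ++k) i = next[i];
                |       return i;
                |   }
                | 
                |   // Backend-bound on one dependency chain: most of
                |   // the core's FP/integer units sit idle, which is
                |   // exactly the slack a second SMT thread can use.
                |   double chain(double x, size_t n) {
                |       for (size_t k = 0; k < n; ++k)
                |           x = x * 1.0000001 + 1.0;
                |       return x;
                |   }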
        
           | binkHN wrote:
           | For what it's worth, for security reasons, OpenBSD disables
           | hyperthreading by default.
        
       | klooney wrote:
       | > The removal of Hyperthreading makes a lot of sense for Lunar
       | Lake to both reduce the die size of the version of Lion Cove
       | found in Lunar Lake along with simplifying the job of Thread
       | Director.
       | 
       | And, you know, stop the security vulnerability bleeding.
        
         | andrewia wrote:
         | I don't think hyperthreading was the bulk of the attack
         | surface. It definitely presented opportunities for processes to
         | get out of bounds, but I think preemptive scheduling is the
          | bulk of the issue. That genie is not going back in the
          | bottle, as it's another way to significantly improve processor
          | performance for the same amount of instructions.
        
           | nextaccountic wrote:
           | I think the real problem is cache sharing and hyperthreading
           | kind of depends on it, so it was only ever secure to run two
           | threads from the same security domain in the same core
        
             | antoniojtorres wrote:
              | Newbie question: if the cores share an L3 cache, does that
              | factor into the branch prediction vulnerabilities? Or does
              | the data affected by the vulnerability stay in caches
              | closer to the individual core? I assume so, otherwise all
              | cores would be impacted, but I'm unclear on where it sits.
        
       | andrewia wrote:
       | It's interesting to see that modern processor optimization still
       | revolves around balancing hardware for specific tasks. In this
       | case, the vector scheduler has been separated from the integer
       | scheduler, and the integer pipeline has been made much wider. I'm
        | sure it made sense for this revision, but I wonder if things
        | will change in a few generations and the pendulum will swing
        | back to simplifying and integrating more parts of the
        | arithmetic scheduler(s) and ALUs.
       | 
       | It's also interesting to see that FPGA integration hasn't gone
       | far, and good vector performance is still important (if less
       | important than integer). I wonder what percentage of consumer and
       | professional workloads make significant use of vector operations,
       | and how much GPU and FPGA offload would alleviate the need for
       | good vector performance. I only know of vector operations in the
       | context of multimedia processing, which is also suited for GPU
       | acceleration.
        
         | dogma1138 wrote:
          | AMD tried that with HSA in the past; it doesn't really work.
          | Unless your CPU can magically offload vector processing to the
          | GPU or another sub-processor, you are still reliant on new
          | code to get this working, which means you break backward
          | compatibility with previously compiled code.
         | 
         | The best case scenario here is if you can have the compiler do
         | all the heavy lifting but more realistically you'll end up
         | having to make developers switch to a whole new programming
         | paradigm.
        
           | andrewia wrote:
           | I understand that you can't convince developers to
           | rewrite/recompile their applications for a processor that
           | breaks compatibility. I'm wondering how many existing
           | applications would be negatively impacted by cutting down
           | vector throughput. With some searching, I see that some
            | applications make mild use of it, like Firefox. However, there
            | are applications that would be negatively affected, such as
            | noise suppression in Microsoft Teams, and crypto acceleration
            | in libssl and the Linux kernel. Acceleration of crypto
           | functions seems essential enough to warrant not touching
           | vector throughput, so it seems vector operations are here to
           | stay in CPUs.
        
             | alexhutcheson wrote:
             | Modern hash table implementations use vector instructions
             | for lookups:
             | 
             | - Folly: https://github.com/facebook/folly/blob/main/folly/
             | container/...
             | 
             | - Abseil: https://abseil.io/about/design/swisstables
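              | 
              | Roughly how the SwissTable-style probe uses SIMD (a
              | simplified sketch with SSE2 intrinsics, not Abseil's or
              | Folly's actual code): 16 one-byte control tags are
              | compared against a 7-bit hash fragment in one shot,
              | yielding a bitmask of candidate slots.
              | 
              |   #include <cstdint>
              |   #include <emmintrin.h>  // SSE2
              | 
              |   // Bit i of the result is set if ctrl[i] == h2,
              |   // i.e. slot i deserves a full key comparison.
              |   uint32_t match_group(const int8_t* ctrl, int8_t h2) {
              |       __m128i group = _mm_loadu_si128(
              |           reinterpret_cast<const __m128i*>(ctrl));
              |       __m128i eq =
              |           _mm_cmpeq_epi8(_mm_set1_epi8(h2), group);
              |       return static_cast<uint32_t>(
              |           _mm_movemask_epi8(eq));
              |   }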
        
               | josephg wrote:
               | Sure; but it's hard to do and very few programs get
               | optimised to this point. Before reaching for vector
               | instructions, I'll:
               | 
               | - Benchmark, and verify that the code is hot.
               | 
               | - Rewrite from Python, Ruby, JS into a systems language
               | (if necessary). Honorary mention for C# / Go / Java,
               | which are often fast enough.
               | 
               | - Change to better data structures. Bad data structure
               | choices are still so common.
               | 
               | - Reduce heap allocations. They're more expensive than
               | you think, especially when you take into account the
               | effect on the cpu cache
               | 
               | Do those things well, and you can often get 3 or more
               | orders of magnitude improved performance. At that point,
               | is it worth reaching for SIMD intrinsics? Maybe. But I
               | just haven't written many programs where fast code
               | written in a fast language (c, rust, etc) still wasn't
               | fast enough.
               | 
               | I think it would be different if languages like rust had
               | a high level wrapper around simd that gave you similar
               | performance to hand written simd. But right now, simd is
               | horrible to use and debug. And you usually need to write
                | it per-architecture. Even Intel and AMD need different
                | code paths because Intel has dumped AVX-512 on its
                | consumer parts.
               | 
               | Outside generic tools like Unicode validation, json
               | parsing and video decoding, I doubt modern simd gets much
               | use. Llvm does what it can but ....
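                | 
                | A contrived example of the "data structures and
                | allocations first" point: both functions below
                | compute the same sum, but the second avoids a heap
                | allocation per element and keeps the data
                | contiguous, which usually buys far more than
                | hand-rolled SIMD on the first layout ever could.
                | 
                |   #include <memory>
                |   #include <numeric>
                |   #include <vector>
                | 
                |   // One heap allocation per value; every element
                |   // is a potential cache miss.
                |   double sum_boxed(
                |       const std::vector<std::unique_ptr<double>>& v) {
                |       double s = 0.0;
                |       for (const auto& p : v) s += *p;
                |       return s;
                |   }
                | 
                |   // Contiguous layout: cache-friendly, and easily
                |   // auto-vectorized by the compiler.
                |   double sum_flat(const std::vector<double>& v) {
                |       return std::accumulate(v.begin(), v.end(), 0.0);
                |   }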
        
               | nickpeterson wrote:
               | Indeed, people really fixate on "slow languages" but for
               | all but the most demanding of applications, the right
                | algorithm and data structures make the lion's share of
               | the difference.
        
             | dogma1138 wrote:
             | Newer SoCs come with co-processors such as NPUs so it's
             | just a question of how long it would take for those
             | workloads to move there.
             | 
             | And this would highly depend on how ubiquitous they'll
             | become and how standardized the APIs will be so you won't
             | have to target IHV specific hardware through their own
             | libraries all the time.
             | 
             | Basically we need a DirectX equivalent for general purpose
             | accelerated compute.
        
               | rbanffy wrote:
                | It's a lot more work to push data to a GPU or NPU than to
                | just do a couple of vector ops. Crypto is important enough
                | that many architectures have hardware accelerators just
                | for that.
        
               | dogma1138 wrote:
               | For servers no, but we're talking about endpoints here.
               | Also this isn't only about reducing the existing vector
               | bandwidth but also about not increasing it outside of
               | dedicated co-processors.
        
           | JonChesterfield wrote:
           | Persuading people to write their C++ as a graph for
            | heterogeneous execution hasn't gone well. The machinery works,
            | though, and it's the right thing for heterogeneous compute,
            | so it should see adoption from XLA / PyTorch etc.
        
           | hajile wrote:
           | I think the answer here is dedicated cores of different types
           | on the same die.
           | 
           | Some cores will be high-performance, OoO CPU cores.
           | 
           | Now you make another core with the same ISA, but built for a
           | different workload. It should be in-order. It should have a
           | narrow ALU with fairly basic branch prediction. Most of the
            | core will be occupied with two 1024-bit SIMD units and an
           | 8-16x SMT implementation to hide the latency of the threads.
           | 
           | If your CPU and/or OS detects that a thread is packed with
           | SIMD instructions, it will move the thread over to the wide,
           | slow core with latency hiding. Normal threads with low SIMD
           | instruction counts will be put through the high-performance
           | CPU core.
        
             | celrod wrote:
              | Different vector widths for different cores aren't currently
             | feasible, even with SVE. So all cores would need to support
             | 1024-bit SIMD.
             | 
             | I think it's reasonable for the non-SIMD focused cores to
             | do so via splitting into multiple micro-ops or
             | double/quadruple/whatever pumping.
             | 
             | I do think that would be an interesting design to
             | experiment with.
        
             | Symmetry wrote:
             | I've had thoughts along the same lines, but this would
             | require big changes in kernel schedulers, ELF to provide
             | the information, and probably other things.
        
               | soulbadguy wrote:
                | +1: Heterogeneous/non-uniform core configurations
                | always require a lot of very complex adjustment to
                | the kernel schedulers and core binding policies.
                | Even now, after almost a decade of big.LITTLE (from
                | Arm) configurations and/or chiplet designs (from
                | AMD), the (Linux) kernel scheduling still requires a
                | lot of tuning for things like games etc. Adding
                | cores with very different performance
                | characteristics would probably require the thread
                | scheduling to be delegated to the CPU itself, with
                | only hints from the kernel scheduler.
        
               | hajile wrote:
               | There are a couple methods that could be used.
               | 
               | Static analysis would probably work in this case because
               | the in-order core would be _very_ GPU-like while the
               | other core would not.
               | 
               | In cases where performance characteristics are closer,
               | the OS could switch cores, monitor the runtimes, and add
               | metadata about which core worked best (potentially even
               | about which core worked best at which times).
        
             | dogma1138 wrote:
              | This is what's happening now with NPUs and other co-
              | processors. It's just not fully OS managed/directed yet,
              | but Microsoft is most likely working on that part at least.
             | 
              | The key part is that now there are far more use cases than
              | there were in the early Bulldozer days, and that the current
             | main CPU design does not compromise on vector performance
             | like the original AMD design did (outside of extreme cases
             | of very wide vector instructions).
             | 
             | And they are also targeting new use cases such as edge
             | compute AI rather than trying to push the industry to move
             | traditional applications towards GPU compute with HSA.
        
         | gumby wrote:
         | > good vector performance is still important (if less important
         | than integer)
         | 
         | This is in part (major part IMHO) because few languages support
         | vector operations as first class operators. We are still
         | trapped in the tyranny that assumes a C abstract machine.
         | 
          | And so because so few languages support vectors, the
          | instruction mix doesn't emphasize them, therefore there's
          | less incentive to work on new language paradigms, and we
          | remain trapped in a suboptimal loop.
         | 
         | I'm not claiming there are any villains here, we're just stuck
         | in a hill-climbing failure.
        
           | davedx wrote:
           | But isn't that why we have things like CUDA? Who exactly is
           | "we" here, people who only have access to CPU's? :)
        
             | gumby wrote:
             | I'm not saying that you cannot write vector code, but that
             | it's typically a special case. CUDA APIs and annotations
             | are bolted onto existing languages rather than reflecting
             | languages with vector operations as natural first class
             | operations.
             | 
             | C or Java have no concept of `a + b` being a vector
             | operation the way a language like, say, APL does. You can
             | come closer in C++, but in the end the memory model of C
             | and C++ hobbles you. FORTRAN is better in this regard.
        
               | davedx wrote:
               | Makes sense. I guess that's why some python libs use it
               | under the hood
        
               | chasil wrote:
               | I see two options from this perspective.
               | 
               | It is always possible to inline assembler in C, and
               | present vector operators as functions in a library.
               | 
               | Otherwise, R does perceive vectors, so another language
               | that performs well might be a better choice. Julia comes
               | to mind, but I have little familiarity with it.
               | 
               | With Java, linking the JRE via JNI would be an (ugly)
               | option.
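                | 
                | For the first option, intrinsics are the usual route
                | today rather than inline assembler: the vector
                | operation hides behind an ordinary C function and
                | callers never see the SIMD. A sketch assuming x86
                | with AVX (the function name is made up):
                | 
                |   #include <immintrin.h>  // AVX intrinsics
                | 
                |   // Hypothetical library routine: c[i] = a[i] + b[i].
                |   // For brevity, n must be a multiple of 8.
                |   void vec_add_f32(const float* a, const float* b,
                |                    float* c, int n) {
                |       for (int i = 0; i < n; i += 8) {
                |           __m256 va = _mm256_loadu_ps(a + i);
                |           __m256 vb = _mm256_loadu_ps(b + i);
                |           __m256 vc = _mm256_add_ps(va, vb);
                |           _mm256_storeu_ps(c + i, vc);
                |       }
                |   }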
        
             | NekkoDroid wrote:
              | When the data is generated on the CPU, shoveling it to the
              | GPU to do possibly a single or a few vector operations and
              | then shoveling it back to the CPU to continue is most
              | likely going to be more expensive than the time saved.
             | 
             | And CUDA is Nvidia specific.
        
               | Filligree wrote:
               | Doesn't CUDA also let you execute on the CPU? I wonder
               | how efficiently.
        
               | HarHarVeryFunny wrote:
               | No - a CUDA program consists of parts that run on the CPU
               | as well as on the GPU, but the CPU (aka host) code is
               | just orchestrating the process - allocating memory,
               | copying data to/from the GPU, and queuing CUDA kernels to
               | run on the GPU. All the work (i.e. running kernels) is
               | done on the GPU.
               | 
               | There are other libraries (e.g. OpenMP, Intel's oneAPI)
               | and languages (e.g. SYCL) that do let the same code be
               | run on either CPU or GPU.
        
             | rbanffy wrote:
             | When you use a GPU, you are using a different processor
             | with a different ISA, running its own barebones OS, with
             | which you communicate mostly by pushing large blocks of
             | memory through the PCIe bus. It's a very different feel
             | from, say, adding AVX512 instructions to your program flow.
        
           | dan-robertson wrote:
           | It's not obvious that that's what's happened here. Eg vector
           | scheduling is separated but there are more units for actually
           | doing certain vector operations. It may be that lots of
           | vector workloads are more limited by memory bandwidth than
           | ILP so adding another port to the scheduler mightn't add
           | much. Being able to run other parts of the cpu faster when
           | vectorised instructions aren't being used could be worth a
           | lot.
        
             | packetlost wrote:
             | That matches with recent material I've read on vectorized
             | workloads: memory bandwidth can become the limiting factor.
        
               | semi-extrinsic wrote:
               | Always nice to see people rediscovering the roofline
               | model.
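                | 
                | For anyone unfamiliar, the roofline model boils down
                | to: attainable throughput = min(peak compute, memory
                | bandwidth x arithmetic intensity). A sketch of the
                | arithmetic with placeholder numbers (not any
                | particular CPU):
                | 
                |   #include <algorithm>
                |   #include <cstdio>
                | 
                |   int main() {
                |       const double peak_gflops = 500.0;  // placeholder
                |       const double bw_gbs = 50.0;        // placeholder
                |       // Arithmetic intensity: FLOPs per byte moved.
                |       for (double ai : {0.1, 1.0, 10.0, 100.0}) {
                |           double roof =
                |               std::min(peak_gflops, bw_gbs * ai);
                |           std::printf("AI %6.1f -> %6.1f GFLOP/s\n",
                |                       ai, roof);
                |       }
                |   }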
        
         | Symmetry wrote:
         | As CPU cores get larger and larger it makes sense to always
          | keep looking for opportunities to decouple things. AMD went
          | with separate schedulers in the Athlon, three architectural
          | overhauls ago, and hasn't reversed that decision.
        
         | frutiger wrote:
         | > It's interesting to see that modern processor optimization
         | still revolves around balancing hardware for specific tasks
         | 
         | Asking sincerely: what's specifically so interesting about
         | that? That is what I would naively expect.
        
           | xw390112 wrote:
           | It's also important to note that in modern hardware the
           | processor core proper is just one piece in a very large
           | system.
           | 
           | Hardware designers are adding a lot of speciality hardware,
           | they're just not putting it into the core, which also makes a
           | lot of sense.
           | 
           | https://www.researchgate.net/figure/Architectural-
           | specializa...
        
         | jandrewrogers wrote:
         | The CPU vector performance is important for throughput-oriented
         | processing of data e.g. databases. A powerful vector
         | implementation gives you most of the benefits of an FPGA for a
         | tiny fraction of the effort but has fewer limitations than a
         | GPU. This hits a price-performance sweet spot for a lot of
         | workloads and the CPU companies have been increasingly making
         | this a first-class "every day" feature of their processors.
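          | 
          | A typical shape of that kind of code in a database engine: a
          | predicate over a contiguous column, written as a plain loop
          | that the compiler turns into packed compares and the core can
          | stream through. A sketch; real engines work on compressed
          | blocks and selection vectors:
          | 
          |   #include <cstdint>
          |   #include <vector>
          | 
          |   // Mark rows where price > threshold, one byte per row.
          |   // A simple loop over a contiguous column auto-vectorizes
          |   // well on AVX2/AVX-512 hardware.
          |   void filter_gt(const std::vector<int64_t>& price,
          |                  int64_t threshold,
          |                  std::vector<uint8_t>& selected) {
          |       selected.resize(price.size());
          |       for (size_t i = 0; i < price.size(); ++i)
          |           selected[i] = price[i] > threshold ? 1 : 0;
          |   }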
        
       | KingOfCoders wrote:
       | Main thing for me, they take the Apple hint and again increase
       | caches (and add another cache layer L0)
        
         | trynumber9 wrote:
         | I don't think the L0 is the "new" cache. It's still 48K like
          | the old L1. The 9-cycle 192K cache, about halfway between the
          | old L1 and L2 in size and latency, is really the new cache
          | level. And that's now called L1.
        
           | hajile wrote:
           | L0 latency is 4 cycles while old L1 was 5 cycles, so it's
           | definitely a new cache even though the size is the same.
           | 
            | Apple does a lot better. I'm not sure about newer chips, but
            | the M1 has the same 192KB of L1, but with the 4-cycle latency
            | of Intel's tiny 48KB cache.
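            | 
            | Those load-to-use latencies are the kind of thing you can
            | see for yourself with a pointer chase over working sets
            | sized to each cache level. A rough sketch; a careful
            | version shuffles the permutation so the prefetcher can't
            | hide the latency, and pins the thread:
            | 
            |   #include <chrono>
            |   #include <cstddef>
            |   #include <cstdio>
            |   #include <vector>
            | 
            |   static volatile size_t sink;  // defeat dead-code removal
            | 
            |   // Time dependent loads over a cycle through `bytes` of
            |   // memory. NB: a sequential cycle like this is easy to
            |   // prefetch; shuffle next[] for a true latency figure.
            |   double ns_per_load(size_t bytes, size_t hops) {
            |       size_t n = bytes / sizeof(size_t);
            |       std::vector<size_t> next(n);
            |       for (size_t i = 0; i < n; ++i)
            |           next[i] = (i + 1) % n;
            |       size_t i = 0;
            |       auto t0 = std::chrono::steady_clock::now();
            |       for (size_t k = 0; k < hops; ++k) i = next[i];
            |       auto t1 = std::chrono::steady_clock::now();
            |       sink = i;
            |       std::chrono::duration<double, std::nano> dt = t1 - t0;
            |       return dt.count() / double(hops);
            |   }
            | 
            |   int main() {
            |       std::printf(" 48 KiB: %.2f ns/load\n",
            |                   ns_per_load(48 * 1024, 200'000'000));
            |       std::printf("192 KiB: %.2f ns/load\n",
            |                   ns_per_load(192 * 1024, 200'000'000));
            |   }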
        
       | formerly_proven wrote:
        | The end of the unified scheduler is big news; Intel has had a
        | unified scheduler since Core. Both AMD and Apple have had split
        | scheduling for a long time.
        
       | bee_rider wrote:
       | Now that the floating point vector units have gotten large enough
       | to engage in mitosis, I wonder if there will be room in the
       | future for little SIMD units stuck on the integer side. Bring
       | back SSE, haha.
        
       | Symmetry wrote:
        | Removing a core's SMT aka "hyperthreading" yields some modest
        | hardware savings, but SMT's biggest cost is that it makes
        | testing and validation much more complicated. Given the
        | existence of
        | Intel's e-cores, I'm not surprised they're getting rid of it.
        
         | dralley wrote:
          | From Intel's perspective, I doubt that's true, taking into
          | consideration the constant stream of side channel
          | vulnerabilities they have had to deal with.
        
           | Symmetry wrote:
           | It's exactly the potential for side channel vulnerabilities
           | that makes SMT so hard to get right.
        
             | rbanffy wrote:
              | You can always do like Sun and IBM and dilute the side
              | channel across so many other threads that it stops being
              | reliable. IIRC, both POWER10 and SPARC do 8 threads per
              | core.
        
               | Symmetry wrote:
                | It's also a matter of workload. For a database where
                | threads are often waiting on trips to RAM, SMT can
                | provide a very large boost to performance.
        
         | dan-robertson wrote:
         | I think you're saying that testing smt makes it expensive,
         | which sounds mostly right to me, though I can imagine some
         | arguments that it isn't much worse. When I first read your
         | comment, I thought it said the opposite - that removing smt
         | requires expensive testing and validation work.
        
         | trynumber9 wrote:
         | Lion Cove doesn't remove hyper-threading in general. Some
         | variants will have it so Intel still has to validate it
         | (eventually). But the first part shipping, Lunar Lake, removes
         | hyper-threading. It may have saved them validation time for
         | this particular part but they'll still have to do it.
        
           | adrian_b wrote:
           | Intel has said that they have designed two different versions
           | of the Lion Cove core.
           | 
           | One with SMT, for server CPUs, and one in which all circuits
           | related to SMT are completely removed, for smaller area and
           | power consumption, to be used in hybrid CPUs for laptops and
           | desktops, together with Skymont cores.
        
       | therealmarv wrote:
        | And the main question: is it overall better (faster and less
        | power hungry) than the powerful reduced-instruction-set,
        | ARM-based laptop CPUs which are around the corner (Qualcomm)?
        | Guessing not...
        
         | Rinzler89 wrote:
          | There's still an insurmountable number of apps, both new and
          | legacy, that are x86 exclusive. So the chip not beating the
          | best of ARM in benchmarks is largely irrelevant.
          | 
          | A Ferrari will beat a tractor on every test bench number and
          | every track, but I can't plow a field with a Ferrari, so any
          | improvement in tractor technology is still welcome even though
          | tractors will never beat Ferraris.
        
         | jeffbee wrote:
         | Amazing that you can look at an ISA like ARM and say "reduced
         | instruction set". It has 1300+ opcodes.
        
           | mhh__ wrote:
           | At best ARM is regular rather than reduced
        
           | JonChesterfield wrote:
           | Aarch64 looks a lot like x86-64 to me. Deep pipelines, loads
           | of silicon spent on branch prediction, vector units.
        
           | aidenn0 wrote:
           | It seems that "RISC" has just become a synonym for "load-
           | store architecture"
           | 
           | Non-embedded POWER implementations are around 1000 opcodes,
           | depending on the features supported, and even MIPS eventually
           | got a square-root instruction.
        
           | yjftsjthsd-h wrote:
           | I used to think this too, but apparently RISC isn't about the
           | number of instructions, but the complexity or execution time
           | of each; as https://en.wikipedia.org/wiki/Reduced_instruction
           | _set_comput... puts it,
           | 
           | > The key operational concept of the RISC computer is that
           | each instruction performs only one function (e.g. copy a
           | value from memory to a register).
           | 
           | and in fact that page even mentions at https://en.wikipedia.o
           | rg/wiki/Reduced_instruction_set_comput... that
           | 
           | > Some CPUs have been specifically designed to have a very
           | small set of instructions--but these designs are very
           | different from classic RISC designs, so they have been given
           | other names such as minimal instruction set computer (MISC)
           | or transport triggered architecture (TTA).
        
         | ffgjgf1 wrote:
         | IIRC they claimed that it supposedly is
        
           | therealmarv wrote:
            | Would be great if true! Competition is always good for us
            | consumers.
        
         | yjftsjthsd-h wrote:
         | What are you guessing from? Historically, generation for
         | generation, x86 is good at performance and awful at power
         | consumption. Even when Apple (not aarch64 in general, just
         | Apple) briefly pulled ahead on both, subsequent x86 chips kept
         | winning on raw performance, even as they got destroyed on perf
         | per Watt.
        
           | dur-randir wrote:
           | >kept winning on raw performance
           | 
            | The 13900K lost a couple of % in single-thread performance,
            | which led to the 14900K being so overclocked/overvolted that
            | it became useless for what it's made for - crunching
            | numbers. See https://www.radgametools.com/oodleintel.htm.
        
       | SilverBirch wrote:
       | I'm just wondering, how much of this analysis is real? I mean,
       | how much of this analysis is a true weighing up of design
       | decisions and performance, and how much is copy/pasted Intel
       | marketing bumf?
        
         | antoniojtorres wrote:
         | That's typically the question with all marketing (probably
         | cherry-picked) benchmarks. Have to wait until independent
         | reviewers get their hands on them.
        
         | dur-randir wrote:
         | Nothing real. They cherry-pick for what they're trying to sell
         | you this time - advent of 2nd core, +HT, E-cores, -HT, you-
         | name-it-next-time. Without independent benchmark + additional
         | interpretation for your particular workload you can safely
         | ignore all their claims.
        
       | JonChesterfield wrote:
       | The dropping hyperthreading is interesting. I'm hoping x64 will
       | trend in the other direction - I'd like four or eight threads
       | scheduled across a pool of execution units as a way of hiding
       | memory latency. Sort of a step towards the GPU model.
        
         | jandrewrogers wrote:
         | This has not been successful historically because software
         | developers struggle to design code that has mechanical sympathy
         | with these architectural models. These types of processors will
         | run idiomatic CPU code with mediocre efficiency, which provides
         | an element of backward compatibility, but achieving their
         | theoretical efficiency requires barrel-processing style data
         | structures and algorithms, which isn't something widely taught
         | or known to software developers.
         | 
         | I like these architectures but don't expect it to happen. Every
         | attempt to produce a CPU that works this way has done poorly
         | because software developers don't know how to use them
         | properly. The hardware companies have done the "build it and
         | they will come" thing multiple times and it has never panned
         | out for them. You would need a killer app that would strongly
          | incentivize software developers to learn how to design
          | efficient code for these architectures.
        
           | bri3d wrote:
           | I think another huge issue is the combination of general-
           | purpose/multi-tenant computing, especially with
           | virtualization in the mix, and cache. Improvements gained
           | from masking memory latency by issuing instructions from more
           | thread execution contexts are suddenly lost when each of
           | those thread contexts is accessing a totally different part
           | of memory and evicting the cache line that the next thread
           | wanted anyway.
           | 
           | In many ways barrel processing and hyperthreading are
           | remarkably similar to VLIW - they're two _very_ different
           | sides of the same coin. At the end of the day, both work well
           | for situations where multiple simultaneous instruction
           | streams need to access relatively small amounts of adjacent
           | data. Realtime signal processing, dedicated databases, and
           | gaming are obvious applications here, which I think is why
           | VLIW has done so well in DSP and hyperthreading did well in
           | gaming and databases. Once the simultaneous instruction
           | streams are accessing completely disparate parts of main
           | memory, it all blows up.
           | 
           | Plus, in a multi-tenant environment, the security issues
           | inherent to sharing even more execution state context (and
           | therefore side channels) between domains also become
           | untenable fairly quickly.
        
           | pradn wrote:
           | Hyperthreading is also useful for simply making a little
           | progress on more threads. Even if you're not pinning a core
           | by making full use of two hyperthreads, you can handle double
           | the threads at once. Now, I don't know how important it is,
           | but I assume that for desktop applications, this could lead
           | to a snappier experience. Most code on the desktop is just
           | waiting for memory loads and context switching.
           | 
            | Of course, the big elephant in the room is security - timing
            | attacks on shared cores are a big problem. Sharing anything
            | is a big problem for security-conscious customers.
           | 
           | Maybe it's the case of the server leading the client here.
        
             | immibis wrote:
             | Just this morning I benchmarked my 7960X on RandomX (i.e.
              | Monero mining). That's a 24-core CPU. With 24 threads, it
             | gets about 10kH/s and uses about 100 watts extra. With 48
             | threads, about 15kH/s and 150 watts. It does make sense
             | with the nature of RandomX.
             | 
             | Another benchmark I've done is .onion vanity address
             | mining. Here it's about a 20% improvement in total
             | throughput when using hyperthreading. It's definitely not
             | useless.
             | 
             | However, I didn't compare to a scenario with hyperthreading
             | disabled in the BIOS. Are you telling me the threads get
             | 20-50% faster, each, with it disabled?
        
         | Covzire wrote:
         | I've been disabling HT for security reasons on all my machines
         | ever since the first vulnerabilities appeared, with no
         | discernible negative impact on daily usage.
        
           | groos wrote:
           | The gain from HT for doing large builds has been 20% at most.
           | Daily usage is indistinguishable, as you say.
        
             | MisterTea wrote:
             | HT really only benefits you if you spend a lot of time
             | waiting on IO. E.g. file serving where you're waiting on
             | slow spinning disks.
        
           | binkHN wrote:
           | For security reasons, OpenBSD disables hyperthreading by
           | default.
        
           | duffyjp wrote:
           | I'm running a quite old CPU currently, a 6-core Haswell-E
           | i7-5930K. Disabling HT gave me a huge boost in gaming
           | workloads. I'm basically never doing anything that pegs the
           | entire CPU so getting that extra 10-15% IPC for single
           | threaded tasks is huge.
           | 
           | And like you said, the vulnerability consideration makes HT a
           | hard sell for workloads that would benefit (hypervisors).
           | 
           | That i7 is a huge downgrade from what I was running before
           | (long story) so I'm looking forward to Arrow Lake and I like
           | everything I've read. In addition to removing HT, they're
           | supporting Native JEDEC DDR5-6400 Memory so XMP won't be
           | necessary. I've never liked XMP/Expo...
        
         | AzzyHN wrote:
         | To summarize Intel's reasoning, the extra hardware (and I guess
         | associated firmware or whatever it'd be called) required to
         | manage hyperthreading in their P-cores takes up package space
         | and power, meaning the core can't boost as high.
         | 
         | And since hyperthreading only increases IPC by 30% (according
         | to Intel), they're planning on making up the loss of threads
         | with more E-cores.
         | 
         | But we'll have to see how that turns out, especially since
         | Intel's first chiplet design (the Core Ultra series 1) had
         | performance degradations compared to their 13th Gen mobile
          | counterparts.
        
         | killerstorm wrote:
         | That's what SPARC and Power architectures do now, with 8
         | threads per core.
         | 
          | The only issue with these architectures is that they are
         | priced as "high-end server stuff" while x64 is priced like a
         | commodity.
        
       | osigurdson wrote:
       | From an investor perspective, I'd rather hear: "we pay more for
       | talent than our competition". That is the first step toward
       | greatness.
        
       | tambourine_man wrote:
       | > Moving to a more customizable design will allow Intel to better
       | optimize their P-cores for specific designs moving forward
       | 
       | Amateur question: is that due to the recent advances in
       | interconnect or are they designing multiple versions of this
       | chip?
        
         | wmf wrote:
         | Cheese is talking about different chips (e.g. laptop, desktop,
         | and server) that contain Lion Cove cores. Intel doesn't really
         | reuse chips between different segments.
        
       | ein0p wrote:
       | Once (and if) Intel manages to get out of the woods with
       | lithography, they will have an incredibly strong product lineup.
        
       ___________________________________________________________________
       (page generated 2024-06-04 23:02 UTC)