[HN Gopher] Intel's Lion Cove Architecture Preview
___________________________________________________________________
Intel's Lion Cove Architecture Preview
Author : zdw
Score : 159 points
Date : 2024-06-04 03:11 UTC (19 hours ago)
(HTM) web link (chipsandcheese.com)
(TXT) w3m dump (chipsandcheese.com)
| andrewia wrote:
| I'm very interested to see independent testing of cores without
| SMT/hyperthreading. Of course it's one less function for the
| hardware and thread scheduler to worry about. But hyperthreading
| was a useful way to share resources between multiple threads that
| had light-to-intermediate workloads. Synthetic benchmarks might
| show an improvement, but I'm interested to see how everyday
| workloads, like web browsing while streaming a video, will react.
| adrian_b wrote:
| I was surprised that disabling SMT improved the Geekbench 6
| multi-threaded results by a few percent on a Zen 3 (5900X)
| CPU.
|
| While there are also other tasks where SMT does not bring
| advantages, for the compilation of a big software project SMT
| does bring an obvious performance improvement, of about 20% for
| the same Zen 3 CPU.
|
| In any case, Intel has said that they have designed 2 versions
| of the Lion Cove core, one without SMT for laptop/desktop
| hybrid CPUs and one with SMT for server CPUs with P cores (i.e.
| for the successor of Granite Rapids, which will be launched
| later this year, using P-cores similar to those of Meteor
| Lake).
| papichulo2023 wrote:
| Probably because the benchmark is not using all cores so the
| cores hit the cache more often.
| dagmx wrote:
| Generally HT/SMT has never been favored for high utilization
| needs or low wattage needs.
|
| On the high utilization end, stuff like offline rendering or
| even some realtime games would have significant performance
| degradation when HT/SMT is enabled. It was incredibly
| noticeable when I worked in film.
|
| And on the low wattage end, it ends up causing more overhead
| versus just dumping the jobs on an E core.
| The_Colonel wrote:
| > And on the low wattage end, it ends up causing more
| overhead versus just dumping the jobs on an E core.
|
| For most of HT's existence there weren't any E cores,
| which conflicts with your "never" in the first sentence.
| dagmx wrote:
| It doesn't because a lot of low wattage silicon doesn't
| support HT/SMT anyway.
|
| The difference is that now low wattage doesn't have to mean
| low performance, and getting back that performance is
| better suited to E cores than introducing HT.
| The_Colonel wrote:
| > It doesn't
|
| Saying "no" doesn't magically remove your contradiction.
| E cores didn't exist in laptop/PC/server CPUs before 2022
| and using HT was a decent way to increase capacity to
| handle many (e.g. IO) threads without expensive context
| switches. I'm not saying E cores are a bad solution, but
| somehow you're trying to erase the historical context of HT
| (or more likely it's just sloppy writing which you don't want
| to admit).
| dagmx wrote:
| I've explained what I meant. You've interjected your own
| interpretation of my comment and then gotten huffy about
| it.
|
| We could politely discuss it or you can continue being
| rude by making accusations of sloppy writing and denials.
| jeffbee wrote:
| Backend-bound workloads that amount to hours of endless
| multiplication are not that common. For workloads that are
| just grab-bags of hundreds of unrelated tasks on a machine,
| which describes the entire "cloud" thing and most internal
| crud at every company, HT significantly increases the
| apparent capacity of the machine.
| mmaniac wrote:
| The need for hyperthreading has diminished with increasing core
| counts and shrinking power headroom. You can just run those
| tasks on E cores now and save energy.
| pjmlp wrote:
| Since side-channel attacks became a common thing, there is
| hardly a reason to keep hyperthreading around.
|
| It was a product of its time, a way to get cheap multi-cores
| when getting real cores was too expensive for regular consumer
| products.
|
| Besides the security issues, hyperthreads have always been a
| problem for high performance workloads, stealing resources
| from shared CPU units.
| sapiogram wrote:
| > there is hardly a reason to keep hyperthreading around.
|
| Performance is still a reason. Anecdote: I have a pet project
| that involves searching for chess puzzles, and hyperthreading
| improves throughput 22%. Not massive, but definitely not
| nothing.
| magnio wrote:
| You mean 4 cores 8 threads give 22% more throughput than 8
| cores 8 threads or 4 cores 4 threads?
| rbanffy wrote:
| Remember core to core coordination takes longer than
| between threads of the same core.
| sapiogram wrote:
| 4c/8t gives more throughput than 4c/4t.
| pjmlp wrote:
| My experience with high performance computing is that the
| shared execution units and smaller caches are worse than
| dedicated cores.
| rbanffy wrote:
| As always, the answer is "it depends". If you are getting
| too many cache misses, and are memory bound, adding more
| threads will not help you much. If you have idling
| processor backends, with FP, integer, or memory units
| sitting there doing nothing, adding more threads might
| extract more performance from the part.
| binkHN wrote:
| For what it's worth, for security reasons, OpenBSD disables
| hyperthreading by default.
| klooney wrote:
| > The removal of Hyperthreading makes a lot of sense for Lunar
| Lake to both reduce the die size of the version of Lion Cove
| found in Lunar Lake along with simplifying the job of Thread
| Director.
|
| And, you know, stop the security vulnerability bleeding.
| andrewia wrote:
| I don't think hyperthreading was the bulk of the attack
| surface. It definitely presented opportunities for processes to
| get out of bounds, but I think speculative execution is the
| bulk of the issue. That genie is not going back in the bottle;
| there isn't another way to significantly improve processor
| performance for the same amount of instructions.
| nextaccountic wrote:
| I think the real problem is cache sharing, and hyperthreading
| kind of depends on it, so it was only ever secure to run two
| threads from the same security domain on the same core.
| antoniojtorres wrote:
| Newbie question: if the cores share an L3 cache, does that
| factor into the branch prediction vulnerabilities? Or does the
| data affected by the vulnerability stay in caches closer to
| the individual core? I assume so, otherwise all cores would
| be impacted, but I'm unclear on where it sits.
| andrewia wrote:
| It's interesting to see that modern processor optimization still
| revolves around balancing hardware for specific tasks. In this
| case, the vector scheduler has been separated from the integer
| scheduler, and the integer pipeline has been made much wider. I'm
| sure it made sense for this revision, but I wonder if things
| will change in a few generations and the pendulum will swing
| back to simplifying and integrating more parts of the
| arithmetic scheduler(s) and ALUs.
|
| It's also interesting to see that FPGA integration hasn't gone
| far, and good vector performance is still important (if less
| important than integer). I wonder what percentage of consumer and
| professional workloads make significant use of vector operations,
| and how much GPU and FPGA offload would alleviate the need for
| good vector performance. I only know of vector operations in the
| context of multimedia processing, which is also suited for GPU
| acceleration.
| dogma1138 wrote:
| AMD tried that with HSA in the past; it doesn't really work.
| Unless your CPU can magically offload vector processing to the
| GPU or another sub-processor, you are still reliant on new
| code to get this working, which means you break backward
| compatibility with previously compiled code.
|
| The best case scenario here is if you can have the compiler do
| all the heavy lifting but more realistically you'll end up
| having to make developers switch to a whole new programming
| paradigm.
| andrewia wrote:
| I understand that you can't convince developers to
| rewrite/recompile their applications for a processor that
| breaks compatibility. I'm wondering how many existing
| applications would be negatively impacted by cutting down
| vector throughput. With some searching, I see that some
| applications, like Firefox, make mild use of it. However,
| there are applications that would be negatively affected, such
| as noise suppression in Microsoft Teams, and crypto acceleration
| in libssl and the Linux kernel. Acceleration of crypto
| functions seems essential enough to warrant not touching
| vector throughput, so it seems vector operations are here to
| stay in CPUs.
| alexhutcheson wrote:
| Modern hash table implementations use vector instructions
| for lookups:
|
| - Folly: https://github.com/facebook/folly/blob/main/folly/
| container/...
|
| - Abseil: https://abseil.io/about/design/swisstables
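|
| A minimal sketch of that trick (illustrative, not Abseil's
| actual internals; assumes SSE2): compare one byte of the hash
| against 16 control bytes at once and get back a bitmask of
| candidate slots to probe.
|
|     #include <emmintrin.h>  // SSE2 intrinsics
|     #include <cstdint>
|
|     // Compare 16 control bytes against one hash fragment at
|     // once and return a bitmask of candidate slots.
|     inline uint32_t match_group(const int8_t* ctrl, int8_t tag) {
|         __m128i group = _mm_loadu_si128(
|             reinterpret_cast<const __m128i*>(ctrl));
|         __m128i eq = _mm_cmpeq_epi8(group, _mm_set1_epi8(tag));
|         return static_cast<uint32_t>(_mm_movemask_epi8(eq));
|     }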
| josephg wrote:
| Sure; but it's hard to do and very few programs get
| optimised to this point. Before reaching for vector
| instructions, I'll:
|
| - Benchmark, and verify that the code is hot.
|
| - Rewrite from Python, Ruby, JS into a systems language
| (if necessary). Honorary mention for C# / Go / Java,
| which are often fast enough.
|
| - Change to better data structures. Bad data structure
| choices are still so common.
|
| - Reduce heap allocations. They're more expensive than
| you think, especially when you take into account the
| effect on the CPU cache.
|
| Do those things well, and you can often get 3 or more
| orders of magnitude improved performance. At that point,
| is it worth reaching for SIMD intrinsics? Maybe. But I
| just haven't written many programs where fast code
| written in a fast language (c, rust, etc) still wasn't
| fast enough.
|
| I think it would be different if languages like rust had
| a high level wrapper around simd that gave you similar
| performance to hand written simd. But right now, simd is
| horrible to use and debug. And you usually need to write
| it per-architecture. Even Intel and AMD need different
| code paths because Intel has dropped AVX-512.
|
| Outside generic tools like Unicode validation, json
| parsing and video decoding, I doubt modern simd gets much
| use. Llvm does what it can but ....
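|
| To illustrate the ergonomics gap, here's a minimal sketch (not
| from any real codebase): a scalar sum the compiler can often
| auto-vectorize, next to a hand-written SSE version that is
| x86-only and would still want separate AVX2/AVX-512/NEON paths.
|
|     #include <xmmintrin.h>  // SSE
|     #include <cstddef>
|
|     // Portable and readable; -O2/-O3 will usually vectorize it.
|     float sum_scalar(const float* a, std::size_t n) {
|         float s = 0.0f;
|         for (std::size_t i = 0; i < n; ++i) s += a[i];
|         return s;
|     }
|
|     // Explicit SIMD: faster on paper, but tied to one ISA and
|     // much harder to read, test, and debug.
|     float sum_sse(const float* a, std::size_t n) {
|         __m128 acc = _mm_setzero_ps();
|         std::size_t i = 0;
|         for (; i + 4 <= n; i += 4)
|             acc = _mm_add_ps(acc, _mm_loadu_ps(a + i));
|         float lanes[4];
|         _mm_storeu_ps(lanes, acc);
|         float s = lanes[0] + lanes[1] + lanes[2] + lanes[3];
|         for (; i < n; ++i) s += a[i];  // scalar tail
|         return s;
|     }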
| nickpeterson wrote:
| Indeed, people really fixate on "slow languages" but for
| all but the most demanding of applications, the right
| algorithm and data structures make the lion's share of
| the difference.
| dogma1138 wrote:
| Newer SoCs come with co-processors such as NPUs so it's
| just a question of how long it would take for those
| workloads to move there.
|
| And this would highly depend on how ubiquitous they'll
| become and how standardized the APIs will be so you won't
| have to target IHV specific hardware through their own
| libraries all the time.
|
| Basically we need a DirectX equivalent for general purpose
| accelerated compute.
| rbanffy wrote:
| It's a lot more work to push data to a GPU or NPU than to
| just do a couple of vector ops. Crypto is important enough
| that many architectures have hardware accelerators just for
| that.
| dogma1138 wrote:
| For servers no, but we're talking about endpoints here.
| Also this isn't only about reducing the existing vector
| bandwidth but also about not increasing it outside of
| dedicated co-processors.
| JonChesterfield wrote:
| Persuading people to write their C++ as a graph for
| heterogeneous execution hasn't gone well. The machinery works
| though, and it's the right thing for heterogeneous compute,
| so it should see adoption from XLA / pytorch etc.
| hajile wrote:
| I think the answer here is dedicated cores of different types
| on the same die.
|
| Some cores will be high-performance, OoO CPU cores.
|
| Now you make another core with the same ISA, but built for a
| different workload. It should be in-order. It should have a
| narrow ALU with fairly basic branch prediction. Most of the
| core will be occupied with two 1024-bit SIMD units and an
| 8-16x SMT implementation to hide the latency of the threads.
|
| If your CPU and/or OS detects that a thread is packed with
| SIMD instructions, it will move the thread over to the wide,
| slow core with latency hiding. Normal threads with low SIMD
| instruction counts will be put through the high-performance
| CPU core.
| celrod wrote:
| Different vector widths for different cores isn't currently
| feasible, even with SVE. So all cores would need to support
| 1024-bit SIMD.
|
| I think it's reasonable for the non-SIMD focused cores to
| do so via splitting into multiple micro-ops or
| double/quadruple/whatever pumping.
|
| I do think that would be an interesting design to
| experiment with.
| Symmetry wrote:
| I've had thoughts along the same lines, but this would
| require big changes in kernel schedulers, ELF to provide
| the information, and probably other things.
| soulbadguy wrote:
| +1: Heterogeneous/non-uniform core configurations always
| require a lot of very complex adjustment to the kernel
| schedulers and core binding policies. Even now, after
| almost a decade of big.LITTLE (from Arm) configurations
| and/or chiplet designs (from AMD), the (Linux) kernel
| scheduling still requires a lot of tuning for things like
| games etc... Adding cores with very different performance
| characteristics would probably require the thread
| scheduling to be delegated to the CPU itself, with only
| hints from the kernel scheduler.
| hajile wrote:
| There are a couple methods that could be used.
|
| Static analysis would probably work in this case because
| the in-order core would be _very_ GPU-like while the
| other core would not.
|
| In cases where performance characteristics are closer,
| the OS could switch cores, monitor the runtimes, and add
| metadata about which core worked best (potentially even
| about which core worked best at which times).
| dogma1138 wrote:
| This is what's happening now with NPUs and other co-
| processors. It's just not fully OS managed / directed yet,
| but Microsoft is most likely working on that part at least.
|
| The key part is that now there are far more use cases than
| there were in the early dozer days and that the current
| main CPU design does not compromise on vector performance
| like the original AMD design did (outside of extreme cases
| of very wide vector instructions).
|
| And they are also targeting new use cases such as edge
| compute AI rather than trying to push the industry to move
| traditional applications towards GPU compute with HSA.
| gumby wrote:
| > good vector performance is still important (if less important
| than integer)
|
| This is in part (major part IMHO) because few languages support
| vector operations as first class operators. We are still
| trapped in the tyranny that assumes a C abstract machine.
|
| And so because so few languages support vectors, the
| instruction mix doesn't emphasize it, therefore there's less
| incentive to work on new language paradigms, and we remain
| trapped in a suboptimal loop.
|
| I'm not claiming there are any villains here, we're just stuck
| in a hill-climbing failure.
| davedx wrote:
| But isn't that why we have things like CUDA? Who exactly is
| "we" here, people who only have access to CPU's? :)
| gumby wrote:
| I'm not saying that you cannot write vector code, but that
| it's typically a special case. CUDA APIs and annotations
| are bolted onto existing languages rather than reflecting
| languages with vector operations as natural first class
| operations.
|
| C or Java have no concept of `a + b` being a vector
| operation the way a language like, say, APL does. You can
| come closer in C++, but in the end the memory model of C
| and C++ hobbles you. FORTRAN is better in this regard.
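|
| A toy example of how close standard C++ gets: std::valarray
| makes `a + b` a real elementwise operation, APL-style, though
| whether the compiler actually emits vector instructions for it
| is another matter.
|
|     #include <valarray>
|     #include <cstdio>
|
|     int main() {
|         std::valarray<double> a = {1.0, 2.0, 3.0, 4.0};
|         std::valarray<double> b = {10.0, 20.0, 30.0, 40.0};
|         std::valarray<double> c = a + b;  // elementwise add
|         std::printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);
|     }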
| davedx wrote:
| Makes sense. I guess that's why some python libs use it
| under the hood
| chasil wrote:
| I see two options from this perspective.
|
| It is always possible to inline assembler in C, and
| present vector operators as functions in a library.
|
| Otherwise, R does perceive vectors, so another language
| that performs well might be a better choice. Julia comes
| to mind, but I have little familiarity with it.
|
| With Java, linking the JRE via JNI would be an (ugly)
| option.
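|
| As a sketch of the library-function route (using GCC/Clang
| vector extensions rather than inline asm, so compiler-specific
| rather than standard C), the operator can be wrapped in an
| ordinary function:
|
|     // A 4-float vector type; + maps to a SIMD add (e.g. addps
|     // on x86 with SSE).
|     typedef float v4sf __attribute__((vector_size(16)));
|
|     static inline v4sf vec4_add(v4sf a, v4sf b) {
|         return a + b;
|     }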
| NekkoDroid wrote:
| When the data is generated on CPU shoveling it to the GPU
| to do possibly a single or few vector operations and then
| shoveling it back to the CPU to continue is most likely
| going to be more expensive than the time saved.
|
| And CUDA is Nvidia specific.
| Filligree wrote:
| Doesn't CUDA also let you execute on the CPU? I wonder
| how efficiently.
| HarHarVeryFunny wrote:
| No - a CUDA program consists of parts that run on the CPU
| as well as on the GPU, but the CPU (aka host) code is
| just orchestrating the process - allocating memory,
| copying data to/from the GPU, and queuing CUDA kernels to
| run on the GPU. All the work (i.e. running kernels) is
| done on the GPU.
|
| There are other libraries (e.g. OpenMP, Intel's oneAPI)
| and languages (e.g. SYCL) that do let the same code be
| run on either CPU or GPU.
| rbanffy wrote:
| When you use a GPU, you are using a different processor
| with a different ISA, running its own barebones OS, with
| which you communicate mostly by pushing large blocks of
| memory through the PCIe bus. It's a very different feel
| from, say, adding AVX512 instructions to your program flow.
| dan-robertson wrote:
| It's not obvious that that's what's happened here. Eg vector
| scheduling is separated but there are more units for actually
| doing certain vector operations. It may be that lots of
| vector workloads are more limited by memory bandwidth than
| ILP so adding another port to the scheduler mightn't add
| much. Being able to run other parts of the cpu faster when
| vectorised instructions aren't being used could be worth a
| lot.
| packetlost wrote:
| That matches with recent material I've read on vectorized
| workloads: memory bandwidth can become the limiting factor.
| semi-extrinsic wrote:
| Always nice to see people rediscovering the roofline
| model.
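|
| Illustrative numbers: a streaming update like a[i] += b[i] on
| doubles moves roughly 24 bytes per element (two loads plus a
| store) for a single flop, i.e. an arithmetic intensity of
| about 0.04 flop/byte. At ~100 GB/s of DRAM bandwidth that caps
| you near 4 GFLOP/s no matter how wide the vector units are, so
| extra scheduler ports only pay off once the working set fits
| in cache.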
| Symmetry wrote:
| As CPU cores get larger and larger it makes sense to always
| keep looking for opportunities to decouple things. AMD went
| with separate schedulers in the Athlon three architectural
| overhauls ago and hasn't reversed their decision.
| frutiger wrote:
| > It's interesting to see that modern processor optimization
| still revolves around balancing hardware for specific tasks
|
| Asking sincerely: what's specifically so interesting about
| that? That is what I would naively expect.
| xw390112 wrote:
| It's also important to note that in modern hardware the
| processor core proper is just one piece in a very large
| system.
|
| Hardware designers are adding a lot of speciality hardware,
| they're just not putting it into the core, which also makes a
| lot of sense.
|
| https://www.researchgate.net/figure/Architectural-
| specializa...
| jandrewrogers wrote:
| The CPU vector performance is important for throughput-oriented
| processing of data e.g. databases. A powerful vector
| implementation gives you most of the benefits of an FPGA for a
| tiny fraction of the effort but has fewer limitations than a
| GPU. This hits a price-performance sweet spot for a lot of
| workloads and the CPU companies have been increasingly making
| this a first-class "every day" feature of their processors.
| KingOfCoders wrote:
| The main thing for me: they take the Apple hint and again
| increase caches (and add another cache layer, L0).
| trynumber9 wrote:
| I don't think the L0 is the "new" cache. It's still 48K like
| the old L1. The 9 cycle 192K cache about halfway between the
| old L1 and L2 is really the new cache level in size and
| latency. And that's now called L1.
| hajile wrote:
| L0 latency is 4 cycles while old L1 was 5 cycles, so it's
| definitely a new cache even though the size is the same.
|
| Apple does a lot better. I'm not sure about newer chips, but
| the M1 has the same 192 KB of L1, but with the 4-cycle latency
| of Intel's tiny 48 KB cache.
| formerly_proven wrote:
| The end of the unified scheduler is big news; Intel has had a
| unified scheduler since Core. Both AMD and Apple have split
| their scheduling up for a long time.
| bee_rider wrote:
| Now that the floating point vector units have gotten large enough
| to engage in mitosis, I wonder if there will be room in the
| future for little SIMD units stuck on the integer side. Bring
| back SSE, haha.
| Symmetry wrote:
| Removing a core's SMT aka "hyperthreading" has some modest
| hardware savings, but SMT's biggest cost is that it makes
| testing and validation much more complicated. Given the
| existence of Intel's e-cores I'm not surprised they're getting
| rid of it.
| dralley wrote:
| From Intel's perspective, I doubt that's true, when taking into
| consideration the constant stream of side channel
| vulnerabilities they were needing to deal with.
| Symmetry wrote:
| It's exactly the potential for side channel vulnerabilities
| that makes SMT so hard to get right.
| rbanffy wrote:
| You can always do like Sun and IBM and dilute the side
| channel across so many other threads that it stops being
| reliable. IIRC, both POWER10 and SPARC do 8 threads per core.
| Symmetry wrote:
| It's also a matter of workload. For a database where
| threads are often waiting on trips to RAM then SMT can
| provide a very large boost to performance.
| dan-robertson wrote:
| I think you're saying that testing smt makes it expensive,
| which sounds mostly right to me, though I can imagine some
| arguments that it isn't much worse. When I first read your
| comment, I thought it said the opposite - that removing smt
| requires expensive testing and validation work.
| trynumber9 wrote:
| Lion Cove doesn't remove hyper-threading in general. Some
| variants will have it so Intel still has to validate it
| (eventually). But the first part shipping, Lunar Lake, removes
| hyper-threading. It may have saved them validation time for
| this particular part but they'll still have to do it.
| adrian_b wrote:
| Intel has said that they have designed two different versions
| of the Lion Cove core.
|
| One with SMT, for server CPUs, and one in which all circuits
| related to SMT are completely removed, for smaller area and
| power consumption, to be used in hybrid CPUs for laptops and
| desktops, together with Skymont cores.
| therealmarv wrote:
| And the main question: is it overall better (faster and less
| power hungry) than the powerful reduced-instruction-set,
| ARM-based laptop CPUs that are around the corner (Qualcomm)?
| Guessing not...
| Rinzler89 wrote:
| There's still an insurmountable number of apps that are
| x86-exclusive, both new and legacy. So the chip not beating
| the best of ARM in benchmarks is largely irrelevant.
|
| A Ferrari will beat a tractor in test bench numbers and on
| every track, but I can't plow a field with a Ferrari, so any
| improvement in tractor technology is still welcome even though
| tractors will never beat Ferraris.
| jeffbee wrote:
| Amazing that you can look at an ISA like ARM and say "reduced
| instruction set". It has 1300+ opcodes.
| mhh__ wrote:
| At best ARM is regular rather than reduced
| JonChesterfield wrote:
| Aarch64 looks a lot like x86-64 to me. Deep pipelines, loads
| of silicon spent on branch prediction, vector units.
| aidenn0 wrote:
| It seems that "RISC" has just become a synonym for "load-
| store architecture"
|
| Non-embedded POWER implementations are around 1000 opcodes,
| depending on the features supported, and even MIPS eventually
| got a square-root instruction.
| yjftsjthsd-h wrote:
| I used to think this too, but apparently RISC isn't about the
| number of instructions, but the complexity or execution time
| of each; as https://en.wikipedia.org/wiki/Reduced_instruction
| _set_comput... puts it,
|
| > The key operational concept of the RISC computer is that
| each instruction performs only one function (e.g. copy a
| value from memory to a register).
|
| and in fact that page even mentions at https://en.wikipedia.o
| rg/wiki/Reduced_instruction_set_comput... that
|
| > Some CPUs have been specifically designed to have a very
| small set of instructions--but these designs are very
| different from classic RISC designs, so they have been given
| other names such as minimal instruction set computer (MISC)
| or transport triggered architecture (TTA).
| ffgjgf1 wrote:
| IIRC they claimed that it supposedly is
| therealmarv wrote:
| Would be great if true! Competition always good for us
| consumers.
| yjftsjthsd-h wrote:
| What are you guessing from? Historically, generation for
| generation, x86 is good at performance and awful at power
| consumption. Even when Apple (not aarch64 in general, just
| Apple) briefly pulled ahead on both, subsequent x86 chips kept
| winning on raw performance, even as they got destroyed on perf
| per Watt.
| dur-randir wrote:
| >kept winning on raw performance
|
| The 13900K lost a couple of % in single thread performance,
| which led to the 14900K being so overclocked/overvolted that
| it ended up useless for what it's made for - crunching
| numbers. See https://www.radgametools.com/oodleintel.htm.
| SilverBirch wrote:
| I'm just wondering, how much of this analysis is real? I mean,
| how much of this analysis is a true weighing up of design
| decisions and performance, and how much is copy/pasted Intel
| marketing bumf?
| antoniojtorres wrote:
| That's typically the question with all marketing (probably
| cherry-picked) benchmarks. Have to wait until independent
| reviewers get their hands on them.
| dur-randir wrote:
| Nothing real. They cherry-pick for what they're trying to sell
| you this time - advent of 2nd core, +HT, E-cores, -HT, you-
| name-it-next-time. Without independent benchmarks + additional
| interpretation for your particular workload, you can safely
| ignore all their claims.
| JonChesterfield wrote:
| The dropping of hyperthreading is interesting. I'm hoping x64 will
| trend in the other direction - I'd like four or eight threads
| scheduled across a pool of execution units as a way of hiding
| memory latency. Sort of a step towards the GPU model.
| jandrewrogers wrote:
| This has not been successful historically because software
| developers struggle to design code that has mechanical sympathy
| with these architectural models. These types of processors will
| run idiomatic CPU code with mediocre efficiency, which provides
| an element of backward compatibility, but achieving their
| theoretical efficiency requires barrel-processing style data
| structures and algorithms, which isn't something widely taught
| or known to software developers.
|
| I like these architectures but don't expect it to happen. Every
| attempt to produce a CPU that works this way has done poorly
| because software developers don't know how to use them
| properly. The hardware companies have done the "build it and
| they will come" thing multiple times and it has never panned
| out for them. You would need a killer app that would strongly
| incentivize software developers to learn how design efficient
| code for these architectures.
| bri3d wrote:
| I think another huge issue is the combination of general-
| purpose/multi-tenant computing, especially with
| virtualization in the mix, and cache. Improvements gained
| from masking memory latency by issuing instructions from more
| thread execution contexts are suddenly lost when each of
| those thread contexts is accessing a totally different part
| of memory and evicting the cache line that the next thread
| wanted anyway.
|
| In many ways barrel processing and hyperthreading are
| remarkably similar to VLIW - they're two _very_ different
| sides of the same coin. At the end of the day, both work well
| for situations where multiple simultaneous instruction
| streams need to access relatively small amounts of adjacent
| data. Realtime signal processing, dedicated databases, and
| gaming are obvious applications here, which I think is why
| VLIW has done so well in DSP and hyperthreading did well in
| gaming and databases. Once the simultaneous instruction
| streams are accessing completely disparate parts of main
| memory, it all blows up.
|
| Plus, in a multi-tenant environment, the security issues
| inherent to sharing even more execution state context (and
| therefore side channels) between domains also become
| untenable fairly quickly.
| pradn wrote:
| Hyperthreading is also useful for simply making a little
| progress on more threads. Even if you're not pinning a core
| by making full use of two hyperthreads, you can handle double
| the threads at once. Now, I don't know how important it is,
| but I assume that for desktop applications, this could lead
| to a snappier experience. Most code on the desktop is just
| waiting for memory loads and context switching.
|
| Of course, the big elephant in the room is security - timing
| attacks on shared cores is a big problem. Sharing anything is
| a big problem for security conscious customers.
|
| Maybe it's the case of the server leading the client here.
| immibis wrote:
| Just this morning I benchmarked my 7960X on RandomX (i.e.
| Monero mining). That's a 24-core CPU. With 24 threads, it
| gets about 10kH/s and uses about 100 watts extra. With 48
| threads, about 15kH/s and 150 watts. It does make sense
| with the nature of RandomX.
|
| Another benchmark I've done is .onion vanity address
| mining. Here it's about a 20% improvement in total
| throughput when using hyperthreading. It's definitely not
| useless.
|
| However, I didn't compare to a scenario with hyperthreading
| disabled in the BIOS. Are you telling me the threads get
| 20-50% faster, each, with it disabled?
| Covzire wrote:
| I've been disabling HT for security reasons on all my machines
| ever since the first vulnerabilities appeared, with no
| discernible negative impact on daily usage.
| groos wrote:
| The gain from HT for doing large builds has been 20% at most.
| Daily usage is indistinguishable, as you say.
| MisterTea wrote:
| HT really only benefits you if you spend a lot of time
| waiting on IO. E.g. file serving where you're waiting on
| slow spinning disks.
| binkHN wrote:
| For security reasons, OpenBSD disables hyperthreading by
| default.
| duffyjp wrote:
| I'm running a quite old CPU currently, a 6-core Haswell-E
| i7-5930K. Disabling HT gave me a huge boost in gaming
| workloads. I'm basically never doing anything that pegs the
| entire CPU so getting that extra 10-15% IPC for single
| threaded tasks is huge.
|
| And like you said, the vulnerability consideration makes HT a
| hard sell for workloads that would benefit (hypervisors).
|
| That i7 is a huge downgrade from what I was running before
| (long story) so I'm looking forward to Arrow Lake and I like
| everything I've read. In addition to removing HT, they're
| supporting Native JEDEC DDR5-6400 Memory so XMP won't be
| necessary. I've never liked XMP/Expo...
| AzzyHN wrote:
| To summarize Intel's reasoning, the extra hardware (and I guess
| associated firmware or whatever it'd be called) required to
| manage hyperthreading in their P-cores takes up package space
| and power, meaning the core can't boost as high.
|
| And since hyperthreading only increases IPC by 30% (according
| to Intel), they're planning on making up the loss of threads
| with more E-cores.
|
| But we'll have to see how that turns out, especially since
| Intel's first chiplet design (the Core Ultra series 1) had
| performance degradations compared to their 13th Gen mobile
| counterparts
| killerstorm wrote:
| That's what SPARC and Power architectures do now, with 8
| threads per core.
|
| The only issue with these architectures is that they are
| priced as "high-end server stuff" while x64 is priced like a
| commodity.
| osigurdson wrote:
| From an investor perspective, I'd rather hear: "we pay more for
| talent than our competition". That is the first step toward
| greatness.
| tambourine_man wrote:
| > Moving to a more customizable design will allow Intel to better
| optimize their P-cores for specific designs moving forward
|
| Amateur question: is that due to the recent advances in
| interconnect or are they designing multiple versions of this
| chip?
| wmf wrote:
| Cheese is talking about different chips (e.g. laptop, desktop,
| and server) that contain Lion Cove cores. Intel doesn't really
| reuse chips between different segments.
| ein0p wrote:
| Once (and if) Intel manages to get out of the woods with
| lithography, they will have an incredibly strong product lineup.
___________________________________________________________________
(page generated 2024-06-04 23:02 UTC)