[HN Gopher] Bulldozer, AMD's Crash Modernization: Front End and ...
___________________________________________________________________
Bulldozer, AMD's Crash Modernization: Front End and Execution
Engine
Author : ingve
Score : 183 points
Date : 2023-01-23 08:38 UTC (14 hours ago)
(HTM) web link (chipsandcheese.com)
(TXT) w3m dump (chipsandcheese.com)
| unity1001 wrote:
 | AMD FX-8150 worked well for me for a loooong time, from gaming
 | the top titles of the time to programming and other tasks.
| heelix wrote:
| I had one running as an ESX server for a long time. Back then,
| this was the cheapest path to 8 cores and was ideal for carving
 | up into smaller VMs. The 8-core/32M spec was the most you
 | could put on a personal 'free' box.
| chaboud wrote:
| Well before Bulldozer had taped out, AMD briefed us on the high
| level processor plans (our software was included in BAPCo Sysmark
| and we'd been a very early adopter of AMD64, so we had lots of
| these meetings). I told them that it was a total mismatch for our
| workloads (audio and video production tools) and would prove to
| be a bad fit. They protested at first, as our inputs and outputs
| were fixed point formats and the doubled-up integer/logic units
| would allow us to thread more. That's when I spelled it out:
|
| 1. Our business logic and UI each basically operated in single
| threads.
|
| 2. Our intermediate audio and video processing buffers were
| floating point or moving in that direction in the future.
|
| 3. Wherever possible in filtering and effects, we avoided
| branching, relying on zero-folding in vector instructions.
|
| I went on to explain where they'd be in trouble for game engines,
| document processing, and other key use-cases.
|
| Their hardware architects slowly sank in demeanor over the course
| of that all-day meeting. I later heard from contacts at AMD that
| it was an "oh shit" moment for them. That said, having been at
| Sony at the time, we'd gone down the long dark road of the
| Cell...
|
| Sometimes people try bold and interesting things that don't work.
| Bulldozer was definitely one of those, and it should have been
| killed (or at least diverted), like Larrabee. Nonetheless, AMD
| survived, and their current architectures owe some of their
| thinking to the hard-learned lessons of Bulldozer.
| zozbot234 wrote:
| Meanwhile the "efficiency cores" in the latest processors (both
| x86 and ARM) are basically optimal for precisely those highly
| threaded integer workloads. Bulldozer was just too early for
| its time.
| brnt wrote:
 | Hmm, that's an interesting idea: two different kinds of
| (integer) cores (big and small) sharing an FPU...
| bee_rider wrote:
| Big cores and little cores sharing an AVX-512
| implementation could be funny.
| Symmetry wrote:
| I believe ARM's newest little cores do the same thing, two
| cores sharing an FPU and L2$.
| chaboud wrote:
| I totally agree that Bulldozer was ahead of its time. In some
| ways, it was a case of being _too_ clever, trying to do
 | things under the hood for the FPU, similarly to Hyper-
 | Threading for decoder/front-end resources (which also caused
| problems). Getting integrated with the underlying software
| (e.g., the kernel and its scheduler) in big.LITTLE was a
| comparative revelation.
|
| Conversely, the Cell took this too far, requiring that high
| level software be deeply SPE and memory-shuffling aware,
| almost like having a bank of DSPs in-package. The
| instruction-set/functional difference between the PPE and SPE
| made for encumbrances that likely would never have been fully
| solved by the toolchain (even if the toolchain had shipped
| ahead of the processor rather than well afterwards).
|
| Larrabee made attempts to solve the instruction set
| difference problem (in a very Intel way), but the topology,
| cache behavior, and pure core size proved to be far too
| limiting in real applications. In the end, Intel missed on
| the balance of core/execution-unit size vs. count when
| compared to Nvidia and ATI. GPGPU progress has definitely
| demonstrated the value of toolchain abstraction, though, and
| adherence to a given instruction set (and the architectural
| implications that came from that) proved to be a limiting
| constraint for Larrabee.
|
| The last 20-25 years in compute architecture design have seen
| many interesting articulations of clever ideas. Some deaths
| were predictable, but each of the explorations should be
| commended for the ways that they pushed boundaries and
| cleared the path for their successors.
| cpgxiii wrote:
| I think you could draw somewhat different lessons from
| this, though:
|
| - Cell failed because it required too much work and was too
| incompatible with any other platform (save for DSPs).
|
| - Larrabee failed because KNC wasn't binary compatible and
| the toolchain was too much of a pain to work with, and
| KNL/KNM "failed" because they were too hard to get
 | (although I think "failed" is perhaps too strong here; the
| "lots of general-purpose cores with vector units" is
| exactly where Intel, AMD, and the aarch64 vendors have gone
| since). KNC's lack of binary compatibility is really what
 | got things off on the wrong foot: the whole advantage it
 | had over AMD/Nvidia GPUs was its amenability to GPU-
 | unfriendly code, but I think the barrier of having to
 | rebuild everything was prohibitive for many of the smaller
 | users who would have been well-positioned to use it.
|
| - NVidia "won" in the GPGPU space because CUDA and its
| libraries have provided a stable-enough basis for complex
| applications to use, and the underlying hardware has
| evolved to be ever more performant and capable.
|
| - AMD "failed" in the GPGPU space because their toolchain
| abstraction is too weak and compatibility is poor. Unlike
| CUDA, there's no capable API that works on the whole range
| (even with OpenCL AMD's support is questionable at times,
| with two totally different driver stacks).
|
| I'd say the biggest lessons are: (1) compatibility with
| existing code is king, and (2) toolchain and hardware
| accessibility is critical. You can survive breaking (1) if
| your tools are good enough, but you can't win by breaking
| both of them.
| dist1ll wrote:
| Could you explain what zero-folding is?
| sounds wrote:
| Since OP didn't answer, it's probably multiplying terms by
| zero to mask them out.
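 | A minimal scalar sketch of that idea (my own illustration,
 | with made-up variables, not from the thread):
 |
 |   // branchy:  if (x > 0.0f) sum += x;
 |   // "folded": multiply by a 0.0f/1.0f mask instead
 |   float mask = (x > 0.0f) ? 1.0f : 0.0f;
 |   sum += mask * x;
 |
 | Vectorized, the comparison produces the mask for every lane
 | at once, so the loop body carries no branch.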
| maximilianburke wrote:
| You don't need to multiply by zero; there are comparison
| operations which will generate a mask value which can then
| be used to select the appropriate value (`vsel` for
 | AltiVec, `_mm_blendv_ps` for SSE4.1, or a combination of
 | `_mm_and_ps` / `_mm_andnot_ps` / `_mm_or_ps` for SSSE3 and
 | earlier).
|
| At least on Xbox 360 / Cell, high performance number
 | crunching code would often compute more than needed (i.e.,
 | both the if-case and else-case) and then use a branchless
 | select to pick the right result. It may seem like a waste,
 | but it was faster than paying for branch misprediction and
 | for shuffling values from one register file to another to
 | do scalar comparisons.
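 | To make that concrete, here is a minimal sketch in SSE
 | intrinsics (my own illustration; the arithmetic on each
 | side is arbitrary, only the select pattern matters). It
 | evaluates both sides of a per-lane `if (a > b)` and picks
 | the result with a mask instead of a branch:
 |
 |   #include <xmmintrin.h>  // SSE: cmp/mul/and/andnot/or
 |   #include <smmintrin.h>  // SSE4.1: _mm_blendv_ps
 |
 |   // Per lane: (a > b) ? a * 2 : b * 0.5, without branching.
 |   __m128 select_sse41(__m128 a, __m128 b) {
 |       __m128 mask = _mm_cmpgt_ps(a, b); // all-ones where a > b
 |       __m128 t = _mm_mul_ps(a, _mm_set1_ps(2.0f)); // if-case
 |       __m128 f = _mm_mul_ps(b, _mm_set1_ps(0.5f)); // else-case
 |       return _mm_blendv_ps(f, t, mask); // take t where mask set
 |   }
 |
 |   // The same select for pre-SSE4.1 code, without blendv:
 |   __m128 select_sse(__m128 a, __m128 b) {
 |       __m128 mask = _mm_cmpgt_ps(a, b);
 |       __m128 t = _mm_mul_ps(a, _mm_set1_ps(2.0f));
 |       __m128 f = _mm_mul_ps(b, _mm_set1_ps(0.5f));
 |       return _mm_or_ps(_mm_and_ps(mask, t),
 |                        _mm_andnot_ps(mask, f));
 |   }
 |
 | Both sides are always computed, but on deeply pipelined
 | cores that is cheaper than a mispredicted branch per lane.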
| sounds wrote:
 | Then they would have said "masking" instead of "folding".
|
| Thanks for mentioning masking though, in case anyone
| thought we didn't already know about that. ;-)
| chaboud wrote:
 | Others have covered it here. Using masks, multiplies, or
 | horizontal operations to avoid branching, by running the
 | logic for all of the branches, is common in vectorized code
 | (and in shader code, for similar reasons), where raw compute
 | outruns branch misprediction costs on deeply pipelined
 | processors.
|
| In older SSE architectures, memory alignment and register-to-
 | register result latencies _sometimes_ made it easier/better
| to stride operations on partially-masked registers rather
| than reorganize the data. With high architectural register
| counts and newer shuffling instructions, that need is lower.
|
| Definitely avoid branching in tight compute code, though.
| Branching kills performance, sometimes in very data-dependent
| and annoying ways.
| ahartmetz wrote:
| ...and Lisa Su was, at IBM, the team lead of the Cell CPU. Zen,
| about ten years later, was designed to be largely "performance
| compatible" with Intel Core. Similar cache, buffer, decode, etc
| sizes and characteristics to Core. Lessons learned, I guess.
| phkahler wrote:
| >> Each module was designed to run two threads. The frontend,
| FPU, and L2 cache are shared by two threads, while the integer
| core and load/store unit are private to each thread.
|
| If I recall correctly, they ended up calling each module 2 cores.
| I thought in the beginning that each module would be a single
| core running 2 threads. I've always thought the problem was
 | marketing. If Bulldozer had been sold as having half the
 | cores, its multithreaded performance would have been
 | spectacular while
| single thread was meh. But some have said those modules were too
| big (expensive) to be sold as a single core.
|
| I have no proof, but the use of the word "module" seems to
| support the confusion around what constitutes a core vs thread.
| johnklos wrote:
 | There is enough actual hardware that each integer core is, in
 | fact, a separate, distinct core. The issue is that there was
 | technically only one FPU per two integer cores. AMD paid
| out $12 million because they called these individual cores, and
| the case was decided against AMD because of the single FPU per
| core module:
|
| https://www.theregister.com/2019/08/27/amd_chip_compensation...
| paulmd wrote:
| > and the case was decided against AMD because of the single
| FPU per core module:
|
| this has always been a massive oversimplification and
| misrepresentation from the AMD fanclub. the case was decided
| because of _shared resources_ including the L2 and the
| frontend (fetch-and-decode, etc) which very legitimately do
| impact performance in a way that "independent cores" do not.
|
| Citing the actual judgement:
|
| > Plaintiffs allege that the Bulldozer CPUs, advertised
 | as having eight cores, actually contain eight "sub-processors"
| which share resources, such as L2 memory caches and floating
| point units ("FPUs"). Id. P 37-49. Plaintiffs allege that the
| sharing of resources in the Bulldozer CPUs results in
| bottlenecks during data processing, inhibiting the chips from
| "simultaneously multitask[ing]." Id. PP 38, 41. Plaintiffs
| allege that, because resources are shared between two
| "cores," the Bulldozer CPUs functionally only have four
| cores. Id. P 38-43. Therefore, Plaintiffs claim the products
| they purchased are inferior to the products as represented by
| the Defendant. Id. P 39
|
 | This is completely correct: the chip only has one frontend
 | per module, which has to alternate between servicing the two
 | "cores", and this does bottleneck their independent
 | operation. It is, for example, not the same thing as a
 | "core" on Phenom, and this significantly hurts performance
 | when the "second thread" is used on a given module.
|
 | It's fine to _do_ it; this same approach is used in SPARC,
 | for example, but SPARC doesn't market that as (e.g.) a "64
 | core processor"; they market it as "8 cores, 64 _threads_". But
| AMD wanted to have the marketing bang of having "twice the
| cores" as intel (and you can see people representing that
| even in this thread, that programmers bought AMD for
| compiling because "it had twice the cores", etc). And that is
| not how anyone else has marketed CMT before or since, because
| it's really not a fair representation of what the hardware is
| doing.
|
 | Alternately... if that's a core, then an Intel hyperthread
 | is definitely a core too, because CMT is basically just SMT
 | with some resources pinned to specific threads, if you want
 | to look at it like that. After all, where is the definition
 | that says an integer unit alone is what constitutes a core?
 | Isn't it enough that there are two independent execution
 | contexts which can execute simultaneously, and isn't it a
 | good thing that one of them can use as many resources as
 | possible rather than bottlenecking because of an execution
 | unit "pinned" to another thread? If you accept the very
 | expansive AMD definition of "core", lots of weird stuff
 | shakes out too. I think consumers would have found it very
 | deceptive if Intel had marketed that the way AMD did; that's
 | obviously not what a "core" is, but it is if you believe AMD.
|
| AMD did a sketch thing and got caught, end of story. No
| reason they should call it anything different than the other
| companies who implement CMT.
|
| I hate hate hate the "AMD got skewered because they didn't
| have an FPU" narrative. No, it was way more than the FPU, and
| plaintiffs said as much, and it's completely deceptive and
| misrepresentative to pretend that's the actual basis for the
| case. That's the fanclub downplaying and minimizing again,
 | like they do every time AMD pulls a sketchy move (like any
| company, there have been a fair few over the years). And that
| certainly can include _El Reg_ too. Everyone loves an
| underdog story.
|
| Companies getting their wrists slapped when they do sketchy
 | shit is how the legal system prevents it from happening again,
| and downplaying it as a fake suit over crap is bad for
| consumers as a whole and should not be done even for the
| "underdog". The goal shouldn't be to stay just on the right
| side of a grey area, it should be to market honestly and
| fairly... like the other companies that use CMT did. Simple
| as.
|
| To wit: NVIDIA had to pay out on the 3.5GB lawsuit even
 | though their cards really do have 4GB. Why? Because it
| affected performance and the expectation isn't mere technical
| correctness, it's that you stay well on the right side of the
| line with your marketing's honesty. It was sketch and they
| got their wrist slapped. As did AMD.
| monocasa wrote:
 | The UltraSPARC T1 shared the one FPU and the one logical L2
 | between all 32 threads/8 cores. It's very common to share L2
 | between cores; the world has more or less converged on a
 | shared L2 per core complex, so 4 to 8 cores. And you still
 | see vector units shared between cores where it makes sense,
 | too; for instance, Apple's AMX unit is shared between all of
 | its cores.
|
 | It's really only the frontend and its data path to L1
| that's a good argument here, but that's not actually listed
| in the complaint.
|
| And even then, I can see where AMD was going. The main
| point of SMT is to share backend resources that would
| otherwise be unused on a given cycle, but these have
| dedicated execution units so it really is a different
| beast.
| paulmd wrote:
| > And even then, I can see where AMD was going. The main
| point of SMT is to share backend resources that would
| otherwise be unused on a given cycle, but these have
| dedicated execution units so it really is a different
| beast.
|
| Sure, but wouldn't it be ideal that if a thread wasn't
| using its integer unit and the other thread had code that
| could run on it, you'd allow the other thread to run?
|
| "CMT" is literally just "SMT with dedicated resources"
 | _and that's a suboptimal choice because it impairs per-
| thread performance in situations where there's not
| anything to run on that unit_. Sharing is _better_.
|
| If the scheduler is insufficiently fair, that's a problem
| that can be solved. Guarantee that if there is enough
| work, that each thread gets one of the integer units, or
| guarantee a maximum latency of execution. But
 | _preventing_ a thread from using an integer unit that's
| available is just wasted cycles, and that's what CMT
| does.
|
| Again: CMT is not that different from SMT. It's SMT where
| resources are fixed to certain threads, and that's
| suboptimal from a scheduling perspective. And if you
| think that's enough to be called a "core", well, Intel
| has been making 8-core chips for a while then. Just 2
| cores per module ;)
|
| Consumers would not agree that's a core. And pinning some
| resources to a particular thread (while sharing most of
 | the rest of the datapath) does not change that; actually,
 | it makes it _worse_.
|
| > It's really only the frontend and it's data path to L1
| that's a good argument here, but that's not actually
| listed in the complaint.
|
| That's just a summary ;) _El Reg_ themselves discussed
| the shared datapath when the suit was greenlit.
|
 | https://www.theregister.com/2019/01/22/judge_green_lights_am...
|
| And you can note the "such as" in the summary, even. That
| is an expansive term, meaning "including but not limited
| to".
|
| If you feel that was not addressed in the lawsuit and it
| was incorrectly settled... please cite.
|
| Again: it's pretty simple, stay far far clear of
| deceptive marketing and it won't be a problem. Just like
| NVIDIA got slapped for "3.5GB" even though their cards
| did actually have 4GB.
|
| With AMD, "cores" that have to alternate their datapath
| on every other cycle are pretty damn bottlenecked and
| that's not what consumers generally think of as
| "independent cores".
| monocasa wrote:
| > Sure, but wouldn't it be ideal that if a thread wasn't
| using its integer unit and the other thread had code that
| could run on it, you'd allow the other thread to run?
|
| > "CMT" is literally just "SMT with dedicated resources"
| and that's a suboptimal choice because it impairs per-
| thread performance in situations where there's not
| anything to run on that unit. Sharing is better.
|
| > If the scheduler is insufficiently fair, that's a
| problem that can be solved. Guarantee that if there is
| enough work, that each thread gets one of the integer
| units, or guarantee a maximum latency of execution. But
| preventing a thread from using an integer unit that's
| available is just wasted cycles, and that's what CMT
| does.
|
| Essentially, no, what you're suggesting is a really poor
| choice for the gate count and numbers of execution units
| in a Jaguar. The most expensive parts are the ROBs and
| their associated bypass networks between the execution
| units. Doubling that combinatorial complexity would
| probably lead to a much larger, hotter single core that
| wouldn't clock nearly as fast (or have so many pipeline
| stages that branches are way more expensive (aka the
| netburst model)).
|
| > And you can note the "such as" in the summary, even.
| That is an expansive term, meaning "including but not
| limited to".
|
| Well, except that I argue it doesn't include those at
| all; shared L2 is extremely common, and shared FPU is
| common enough that people don't really bat an eye at it.
|
| > If you feel that was not addressed in the lawsuit and
| it was incorrectly settled... please cite.
|
 | I'm going off your own citation. If you feel that these
 | were nonetheless brought up in the court case itself,
 | you're more than welcome to cite another example (ideally
| not a literal tabloid, but keeping the standards of the
| court documents you cited before).
|
| > With AMD, "cores" that have to alternate their datapath
| on every other cycle are pretty damn bottlenecked and
| that's not what consumers generally think of as
| "independent cores".
|
 | That's not how these work. OoO cores rarely crank their
 | frontends at full tilt; instead they tend to work in
 | batches, filling up a ROB with work that will then be
 | executed as memory dependencies are resolved. The modern
 | solution to taking advantage of that is to aggressively
 | downclock the front end when it isn't being used, to save
 | power, but I can see the idea of instead keeping it
 | clocked with the rest of the logic and simply sharing it
 | between two backends as a valid option.
| monocasa wrote:
 | And they're not even the worst at that. The UltraSPARC T1
 | only had the one FPU, shared among its 8 cores.
|
| But I guess $12M is just the high end of nuisance money for
| AMD.
| irusensei wrote:
| Cue the "fine wine" bulldozer memes. Are the memes grounded on
| reality though? Plenty of youtube videos from 2020 or later
| showing the 8350 displaying some impressive results for their
| age, specially on proper multithreading stuff.
| Maakuth wrote:
 | The 8350 is Piledriver, the design that evolved from
 | Bulldozer. It addressed many of Bulldozer's shortcomings.
| BlueTemplar wrote:
 | Seems like where AMD went wrong was expecting programmers to
 | be able to easily take advantage of multithreading... while
 | in reality the extra performance is often not worth the extra
 | complexity, and sometimes it's even impossible to implement?
| adrian_b wrote:
| No, where AMD went wrong was in marketing.
|
| For some years, and especially during the last six months
 | before the Bulldozer launch in 2011, AMD had continuously
 | made various claims about its performance, always stressing
 | that it would crush Sandy Bridge due to having double the
 | number of "cores".
|
| Immediately after launch, all the AMD claims were proved to
| be shameless lies. There is absolutely no excuse for AMD.
| Intel had published accurate information about the
| performance of Sandy Bridge, one year in advance. Even
| without the advance publication, the performance of Sandy
| Bridge could have been easily extrapolated from the evolution
| of Penryn, Nehalem and Westmere.
|
| The performance of Bulldozer was determined by design
| decisions taken by AMD at least 3 to 4 years before its
 | launch. Therefore, during the last 6 months, when they lied
 | the most, they knew perfectly well that they were lying, and
 | it should have been obvious to them that this was futile,
 | because the lies would be exposed by independent benchmarks
 | immediately after launch.
|
| For most operations, a so-called Bulldozer "module" (with 2
| "cores") had exactly the same execution resources as a Sandy
| Bridge core, while for a few operations, like integer
| multiplication, a Bulldozer "module" had even less execution
| resources than a Sandy Bridge core.
|
| The common FPU of 2 Bulldozer "cores" was more or less
 | equivalent to the FPU of a single Sandy Bridge core.
|
| The Bulldozer integer "cores" were stripped down in
| comparison with the cores of the previous AMD CPUs, e.g. a
| new "core" had 2/3 of the addition throughput and 1/2 of the
| multiplication throughput of an old core.
|
| So 2 Bulldozer integer "cores" were together barely faster
| than one old AMD core, while a single Sandy Bridge core was
 | faster than them, having an equal addition throughput but
 | double the multiplication throughput.
|
 | Bulldozer could have been a decent CPU if only AMD, instead
 | of describing it as a 4-"module"/8-"core" CPU, had described
 | it as a 4-core/8-thread CPU. That would have matched Sandy
 | Bridge, except that in Sandy Bridge the 2 threads of a core
 | dynamically shared most execution resources, while in
 | Bulldozer only a part of the execution resources was shared
 | dynamically, e.g. the FPU, and the rest of the execution
 | resources were allocated statically to the 2 threads of a
 | core, which is less efficient.
|
 | Such a description would have correctly set users'
 | expectations, and they would not have felt cheated.
|
 | I am pretty certain that the huge disappointment that
 | accompanied the Bulldozer launch hurt AMD's sales much more
 | than whatever they gained from the buyers lured by the
 | false advertising about 8-core monster AMD CPUs that would
 | easily beat the puny 4-core Intel CPUs.
| clamchowder wrote:
| (author here) The FPU is not quite equivalent to a Sandy
| Bridge FPU, but the FPU is one of the strongest parts of
| the Bulldozer core. Also, iirc multiply throughput is the
| same on Bulldozer and K10 at 1 per cycle. K10 uses two
| pipes to handle high precision multiply instructions that
| write to two registers, possibly because each pipe only has
| one result bus and writing two regs requires using both
 | pipes' write ports. But that doesn't mean two multiplies
| can complete in a single cycle.
|
| With regard to expectations, I don't think AMD ever said
| that per-thread performance would be a match for Sandy
| Bridge. ST performance imo was Bulldozer's biggest problem.
| Calling it 8 cores or 4 cores does not change that. You
 | could make a CPU with, say, eight Jaguar cores, and market
| it as an eight core CPU. It would get crushed by 4c/8t Zen
| despite the core count difference and no sharing of core
| resources between threads (on Jaguar).
| StuffMaster wrote:
| This comment represents the analysis I remember
| understanding at the time. I was disappointed with AMD.
| BlueTemplar wrote:
| > So 2 Bulldozer integer "cores" were together barely
| faster than one old AMD core
|
 | This sounds like a wild claim: I went from a better-than-
 | midrange 3-core Phenom to a worse-than-midrange ~~8~~ 4-core
 | Bulldozer, and my single-threaded performance not only did
 | not nearly halve, but even improved!
| Paianni wrote:
| Gaming and enthusiast machines are only a fraction of AMD's
 | market; most consumers and clients didn't care about the
| features AMD's marketing department lied about.
| adrian_b wrote:
| AMD did not lie about some particular feature of
| Bulldozer.
|
 | They lied by claiming that Bulldozer would be faster
 | than Sandy Bridge, while knowing that it was completely
| impossible for their claims to be true.
|
| Even without running any benchmark, it was trivial to
| predict that at similar clock frequencies the "8-core"
 | Bulldozer had to be slower than the 4-core Sandy Bridge.
|
| For multi-threaded performance, the complete Bulldozer
| had between 50% and 100% of the number of arithmetic
| units of Sandy Bridge, so in the best case it could match
| but not exceed Sandy Bridge.
|
| For single-threaded integer performance, one Bulldozer
| "core" had between 25% and 50% of the number of
| arithmetic units of Sandy Bridge, so in the best case it
| could be only half as fast as Sandy Bridge.
|
| The only chance for Bulldozer to beat Sandy Bridge as in
| the AMD claims would have been a very high clock
| frequency, in the 5 to 7 GHz range.
|
| However, AMD never made the false claim that Bulldozer
 | would beat Sandy Bridge by having a much higher clock
 | frequency. They claimed that they would beat Intel by
| having more "cores", which implied more arithmetic units,
| while knowing very well that they had decided to remove a
 | large part of the arithmetic units from their cores at the
 | same time as Intel was adding arithmetic units to their
 | cores, instead of removing them.
| yvdriess wrote:
| Most gaming consumers got an AMD whether they wanted it
| or not, as the Jaguar cores were in both PS4 and Xbox
| One.
| selectodude wrote:
 | But they specifically _weren't_ the jacked-up
 | "clustered multi-thread" design that Bulldozer had.
| zokier wrote:
 | Well, one big aspect is gaming, and in particular consoles at
 | that time only had a few cores. That changed a few years
 | later with the PS4/XB1, but of course it took a bit longer
 | for gamedevs to catch up.
 |
 | It didn't help that it was also a time when Intel was at the
 | top of their game, still pushing significant single-thread
 | perf improvements. Sandy Bridge, which was Bulldozer's main
 | competitor, was an exceptionally successful launch in its
 | own right.
| [deleted]
| tormeh wrote:
| Well, multithreading got more important with time, so it makes
| sense to me.
| dist1ll wrote:
| > But as you might think, nobody at AMD envisioned it that way in
| the planning or design stages. No engineer would ever start
| working with the idea to "build a shit product";
|
 | That made me chuckle. There are many reasons why a design
 | turns sour, but actual malice is rarely one of them - although
 | it may feel that way to the user.
| brnt wrote:
| On the other hand: I can very well imagine some senior staff
 | rolling their eyes but not feeling free to 'spoil the mood'
 | by explaining how certain choices are going to come back and
| create problems later on.
|
 | I find it super interesting, but also super hard, to give
 | that sort of feedback while things are ongoing. I've seen
 | both happen (old hats hating progress too), but I was never
 | successful (not in those larger, above-my-paygrade-but-
 | visible-nonetheless settings) in having a conversation about
 | how we're having the discussion and whether or not every
 | point/everyone is seriously heard and considered. There can
 | be a 'mood' that is overriding, and I haven't found a way to
 | override the mood. Things tend to have to break down before
 | that can happen.
| CoastalCoder wrote:
| I've experienced that as well, albeit in the software realm.
|
| It reminds me of the myth of Cassandra.
| deaddodo wrote:
| The myth of Cassandra?
| Tomte wrote:
| Greek mythology:
| https://en.m.wikipedia.org/wiki/Cassandra
| moffkalast wrote:
| The myth that Cassandra DB is faster than Redis for large
| databases, but nobody believes it. /s
| fallous wrote:
 | I've found that in discussions where choices made now will
 | create problems in the future, it's useful to raise the
 | issue by couching it as a question of priorities. Yes,
| the short-term solution may create problems in the long-term
| but sometimes the short-term priorities are more important.
 | Taking on tech debt isn't inherently bad; it goes bad when
 | you take it on unknowingly or without a full understanding
 | of the terms of that debt.
|
| Just as with any other debt, the cost in the long term must
| be balanced against the needs of the short term. And as with
| any other debt, if someone takes on that debt and then
| decides to ignore repayment there will be much wailing and
| gnashing of teeth when the bill comes due. ;)
| muxator wrote:
| However, more often than not the person having to repay
| that debt is not going to be the one that created it.
|
 | The fairness of this dynamic at least has to be taken into
 | consideration.
| [deleted]
| Tempest1981 wrote:
| What does the title mean: "Crash Modernization"?
| nkurz wrote:
| Crash here means something like "speedy" or "rushed". It refers
| to AMD's attempt to modernize their architecture in a hurry,
 | and doesn't have anything to do with how it recovers from or
 | deals with crashes. I was also initially confused.
| leoc wrote:
 | Note also that the expression should be read as "crash
 | (modernisation program)", not as "(crash modernisation)
 | program".
| projektfu wrote:
| Modernizing very fast and suddenly, under duress.
|
| adjective - "marked by a concerted effort and effected in the
| shortest possible time especially to meet emergency conditions"
 | - Merriam-Webster dictionary.
| sylware wrote:
 | jeez... this web site's content is really good. I can even
 | view it cleanly and properly with a noscript/basic (x)html
 | browser, like wikipedia.
 |
 | (irony ahead) What will I be able to complain about then?! :)
| johnklos wrote:
| I remember deliberating between AMD's eight core FX-8150 and the
| latest Intel Sandy Bridge in early 2012.
|
| Even though lots of people consider Bulldozer to be an AMD
| misstep, I don't regret my choice: AMD's eight integer cores
| performed favorably compared with the four real cores plus four
| hyperthreads of the Intel. Considering I bought it to be used
| primarily for compiling, it was a good choice.
|
| This is an interesting read, and seeing the design choices with
| knowledge of how things turned out later makes it even more
| interesting.
| digitallyfree wrote:
| I had a programmer friend who got the FX-8150 in 2012 for
 | compiling - pricewise it was basically equivalent to an i5,
 | but you got 8 integer cores in return. It was a decent choice
 | if you prioritized core count over single-core/FP performance.
|
 | Efficiency-wise those desktop CPUs were space heaters, with the
| FX-9590 coming to mind. Pre-Zen laptops were also notorious for
| poor battery life.
| paulmd wrote:
| > I had a programmer friend who got the FX-8150 in 2012 for
 | compiling - pricewise it was basically equivalent to an i5,
 | but you got 8 integer cores in return.
|
| but the integer cores shared a frontend which alternated
| between feeding the two cores, so using all 8 threads (2
| threads per module) reduced per-thread performance even
 | further from the already abysmal 1TPM (one-thread-per-module)
 | performance levels.
|
 | The FX-8150 failed to outperform a 2500K in compiling, let
 | alone a 2600K. In fact it was _significantly_ slower than a
 | Phenom II X6, which came much closer to the 2600K.
|
| https://images.anandtech.com/graphs/graph4955/41699.png (may
| need to hit enter or open-in-private-window to force a
| referrerless load)
|
| You can see how hard that shared frontend hits - Phenom with
| 6 "modules" curbstomps the Bulldozer in compiles despite
| having "fewer threads". It's not just that Bulldozer IPC is
 | bad per-thread and made up for with more threads; _using
 | more threads actually causes further reductions of the IPC_,
 | which destroys a lot of the benefit of having those threads
 | in the first place. It was not a good processor _even in
 | heavily multithreaded loads_, outside some happy cases.
|
| Frankly there was just as much of a fanboy effect in those
| days as in the early days of Ryzen, which really wasn't
| justified by the actual performance. Most actual productivity
| loads used AVX which strongly favored Intel over Ryzen due to
| the half-rate AVX2, the 5820K did great in those workloads
| for a very affordable price and was also much better in
| gaming/etc, plus had massive overclocking headroom. But
| people tend to laser-focus on the handful of very specific
| workloads Ryzen 1000 did OK at. Intel was generally much more
| potent per-core if you did video encoding, for example, or in
| later Cinebench variants that used AVX. But Ryzen looked real
| good in pre-AVX cinebench R11.5 and R15 so that's what people
| cited.
|
| People bought Bulldozer because they had always bought AMD
| and that was the brand they identified with on a social
 | level. It was a terrible product, only slightly mitigated
| by the Piledriver series (FX-8350)... which had to go up
| against Ivy Bridge and later Haswell, where it was just as
| far behind.
|
| https://images.anandtech.com/graphs/graph6396/51121.png
|
 | Same story as Phenom. Bulldozer would have been a competitive
 | product if it had gotten to market 2 years earlier and hadn't
 | needed another couple of years to fix the deficiencies.
| metalspot wrote:
| i used an fx-6300 for years and it was fine. for most use cases
| the practical need for desktop cpu performance plateaued 10
| years ago. during that time adding more memory and moving from
| hdd to ssd made a much larger difference in user experience
| than increasing the cpu performance.
|
| the biggest problem for amd was not the architecture of the
| cpu, but the lack of integrated graphics on high end parts, and
| power efficiency on mobile. both of which were caused by being
| stuck on an older process.
|
| it's also worth remembering that internally amd had a grand
| vision of unifying the cpu and gpu (HSA) which would have added
| a bunch of fp compute to the integrated chip, but that vision
| required a lot of other things to go right, and a lot of them
| went wrong, so it never happened.
| cptskippy wrote:
 | I agree, and I also think the timing of its release helped.
 | It came out about a year before the new Xbox and PlayStation
 | consoles, both of which used AMD silicon. So from a gaming
 | perspective it was passable.
|
| It could run last gen console ports well and didn't struggle
 | much with next gen. I had an FX-8320E that was used almost
| exclusively for gaming and it was fine.
| jcelerier wrote:
| > for most use cases the practical need for desktop cpu
| performance plateaued 10 years ago
|
 | This drives me insane: literally every day my computer spends
 | minutes - likely hours in total per day - blasting at 100%
 | CPU, and it's not a ten-year-old one. Hell, I got access to a
 | Threadripper 5990X, and even with that one I sometimes end up
 | in situations where I have to wait for CPU operations to
 | finish, even for basic life things such as uncompressing
 | archives or editing pictures, so I really cannot understand
 | this claim.
| dale_glass wrote:
| A lot of core functionality like dealing with zip files
| hasn't been updated for modern times, and doesn't support
| multiple cores. So it's likely you're using just one core
| to uncompress stuff.
| projektfu wrote:
 | High memory use will still peg a CPU at 100%. Cores don't
 | get many opportunities to do other things while waiting for
| cache refills. Increasing core speed/single thread
| performance will not help much in memory-bound tasks.
| MobiusHorizons wrote:
 | Good for you: you actually use a workstation for its
| intended purpose. You would be surprised how rare that is.
| BlueTemplar wrote:
 | I am still using an FX-8xxx (and probably will until RISC-V
 | beats it), and haven't even bothered overclocking it yet.
 |
 | Anyone know what kind of issues having "The frontend, FPU, and
 | L2 cache [] shared by two threads" causes compared to 2 real
 | cores / 2 hyperthreads on 1 core / 4 hyperthreads on 2 cores?
| Dalewyn wrote:
| My ELI5 understanding is it's a problem of throughput and
| resource availability.
|
| Long story short, running two threads with only enough
| hardware to run just one at full speed means any thread that
| tries to run at full speed will not be able to.
| deaddodo wrote:
 | It would work fine with doubled-up integer loads, or heavily
 | integer-based work with the occasional FPU necessity. The
 | problem wasn't throughput so much as the latter: it didn't
 | have enough FPU resources and would get deadlocked waiting
 | on availability, so it would behave as in the worst case (a
 | single core) more often than not.
| paulmd wrote:
 | The problem isn't execution unit throughput at all; it's
 | decode, where running the second thread on the module
 | would bottleneck the decoder and force it to alternate
 | between servicing the two threads. And it doesn't really
 | matter what is running on the other core: the decoder
 | simply cannot service both cores at the same time,
 | regardless of instruction type, even if the first core
 | isn't using all its decode. If it has to decode a macro-
| op, it can even stall the other thread for multiple
| cycles.
|
| > Each decoder can handle four instructions per clock
| cycle. The Bulldozer, Piledriver and Excavator have one
| decoder in each unit, which is shared between two cores.
| When both cores are active, the decoders serve each core
| every second clock cycle, so that the maximum decode rate
| is two instructions per clock cycle per core.
| Instructions that belong to different cores cannot be
| decoded in the same clock cycle. The decode rate is four
| instructions per clock cycle when only one thread is
| running in each execution unit.
|
| ...
|
| > On Bulldozer, Piledriver and Excavator, the shared
| decode unit can handle four instructions per clock cycle.
| It is alternating between the two threads so that each
| thread gets up to four instructions every second clock
| cycle, or two instructions per clock cycle on average.
| This is a serious bottleneck because the rest of the
| pipeline can handle up to four instructions per clock.
|
| > The situation gets even worse for instructions that
| generate more than one macro-op each. All instructions
| that generate more than two macro-ops are handled with
| microcode. The microcode sequencer blocks the decoders
| for several clock cycles so that the other thread is
| stalled in the meantime.
|
| https://www.agner.org/optimize/microarchitecture.pdf
| clamchowder wrote:
| (author here) With hyperthreading/SMT, all of the above are
| shared, along with OoO execution buffers, integer execution
 | units, and the load/store unit.
|
| I wouldn't say there are issues with sharing those
| components. More that the non-shared parts were too small,
 | meaning Bulldozer came up quite short in single-threaded
| performance.
| paulmd wrote:
| Read the "bottlenecks in bulldozer" section, particularly the
| parts about instruction decode (never say x86 doesn't have
| any overhead) and instruction fetch. Or see the excerpt I
| posted below.
|
| https://www.agner.org/optimize/microarchitecture.pdf
| rjsw wrote:
 | I am still using a six-core Bulldozer as my main development
 | machine, and I'm still happy with my choice. For compiling
 | stuff it works well.
| selectodude wrote:
| You must be the beneficiary of very cheap electricity.
| mlvljr wrote:
| [dead]
___________________________________________________________________
(page generated 2023-01-23 23:01 UTC)