[HN Gopher] Intel 3rd gen Xeon Scalable (Ice Lake): generational...
___________________________________________________________________
Intel 3rd gen Xeon Scalable (Ice Lake): generationally big,
competitively small
Author : totalZero
Score : 60 points
Date : 2021-04-06 16:33 UTC (6 hours ago)
(HTM) web link (www.anandtech.com)
(TXT) w3m dump (www.anandtech.com)
| ChuckMcM wrote:
| I found the news of Intel releasing this chip quite encouraging.
| If they have enough capacity on their 10nm node to put it into
| production then they have tamed many of the problems that were
| holding them back. My hope is that Gelsinger's renewed attention
| to engineering excellence will allow the folks who know how to
| iron out a process to work more freely than they did under the
| previous leadership.
|
| That said, fixing Intel is a three step process right? First they
| have to get their process issues under control (seems like they
| are making progress there). Second, they need to figure out the
| third-party use of that process so that they can bank some of
| the revenue that is out there from the chip shortage. And
| finally, they need to answer the "jelly bean" market: "jelly
| bean" type processors have become powerful enough to be the only
| processor in a system, so Intel needs to play there or it will
| lose that whole segment to Nvidia/ARM.
| sitkack wrote:
| If they price it right, it could be amazing. Computing is
| mostly about economics. The new node sizes greatly increase the
| production capacity. Half the dimension in x and y gets you 4x
| the transistors on the same wafer. It is like making 4x the
| number of fabs.
|
| It also has speed and power advantages.
|
| I think this release is excellent news on many levels.
| Retric wrote:
| Intel _10nm_ is really just a marketing term at this point
| and has nothing to do with transistor density.
| [deleted]
| judge2020 wrote:
| Production for a datacenter CPU is not the same as production
| for datacenter + enthusiast-grade consumer CPUs, which is what
| Zen 3 currently manages, unfortunately. Rocket Lake being
| backported to 14nm is still not a good sign for actual
| production volume, although it probably means the next
| generation will be 10nm all the way.
| willis936 wrote:
| Datacenter CPUs are much larger than consumer parts and yield
| goes down with the square of the die area. They start with
| these because the margins go up faster than the square of the
| die area.
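|
| A rough sketch of how that compounds, using the common Poisson
| yield approximation (the defect density and die areas below are
| made-up illustrative numbers, not real fab data; Python):
|
|     import math
|
|     # Poisson yield model: fraction of defect-free dies is roughly
|     # exp(-defect_density * die_area). All inputs are assumptions.
|     def yield_rate(area_mm2, defects_per_mm2=0.001):
|         return math.exp(-defects_per_mm2 * area_mm2)
|
|     def dies_per_wafer(area_mm2, wafer_diameter_mm=300):
|         wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
|         return wafer_area / area_mm2  # ignores edge losses
|
|     for area in (80, 250, 600):  # chiplet / consumer die / server die
|         good = dies_per_wafer(area) * yield_rate(area)
|         print(f"{area} mm^2: yield {yield_rate(area):.0%}, "
|               f"~{good:.0f} good dies per wafer")
|
| Whatever the exact exponent, big monolithic server dies lose a
| lot more silicon per defect, which is why the margins have to
| climb so steeply to justify them.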
| Robotbeat wrote:
| But modern techniques exist to deal with problems in a
| large die (ie testing and then segmenting off cores with
| mistakes on them), so the fact they're starting with large
| chip die sizes doesn't really tell you much, no?
| erik wrote:
| The top end most profitable SKUs are fully enabled dies.
| That they are now able to ship dies this large is a good
| sign. The 10nm laptop chips they have produced so far
| were rumored to have atrocious yield.
| knz_ wrote:
| > Rocket lake being backported to 14nm is still not a good
| sign for actual production volume,
|
| I'm not seeing a good reason for thinking this is the case.
| Server CPUs are harder to fab (much larger die area) and they
| need to fab more of them (desktop CPUs are relatively niche
| compared to mobile and server CPUs).
|
| If anything this is a sign that 10nm is fully ready.
| bushbaba wrote:
| I assume Intel makes more server CPUs per year than the entirety
| of AMD's output.
| bayindirh wrote:
| Intel has momentum and some cult following, but they're no match
| for AMD in certain aspects like PCIe lanes, memory channels, and
| some types of computation which favor AMD's architecture.
|
| Day by day, more data centers get AMD systems by choice or by
| requirement (you want 8x Nvidia A100 modules at maximum
| performance? You need an AMD CPU, since it has more PCIe lanes,
| for example).
|
| You don't see many AMD server CPUs around because the first
| generation and most of the second generation were bought up
| entirely by FAANG, Dropbox, et al.
|
| As production ramps up with the newer generations, the rest of
| us can buy the overflow after most of it is gobbled up by those
| buyers.
| zepmck wrote:
| In terms of PCIe lane efficiency, there is no competition
| between Intel and AMD. Intel is well ahead of AMD. Don't just be
| impressed by the number of lanes available on the board.
| cptskippy wrote:
| I can only assume you're referring to Intel's Rocket Lake
| storage demonstration they tweeted out. This was using
| PCMark 10's Quick Storage Benchmark which is more CPU
| bound than anything else.
|
| All of the other benchmarks in the PCMark test suite push
| the bottleneck down to the storage device.
|
| One would think Intel might want to build a storage array
| that could stress the PCIe lanes but then that might show
| an entirely different picture than the one Intel is
| portraying.
| bayindirh wrote:
| > Don't just be impressed by the number of lanes available on
| the board.
|
| When you configure the system full-out with GPUs & HBAs,
| the number of lanes becomes a matter of necessity rather
| than a spec which you drool over.
|
| A PCIe lane is a PCIe lane. Its capacity, latency and speed are
| fixed, and you need these lanes, with a minimum number of PCIe
| switches in between, to saturate the devices and servers you
| have, at least in our scenario.
| ryan_j_naughton wrote:
| There is a slow but seismic shift to AMD within data
| centers right now.
| mhh__ wrote:
| They still have something like 40% to go to even reach
| parity with Intel though
| [deleted]
| totalZero wrote:
| > Rocket lake being backported to 14nm is still not a good
| sign for actual production volume
|
| I'm genuinely having trouble understanding what you mean by
| this.
|
| Rocket Lake being backported to 14nm means that 10nm can be
| allocated in greater proportion toward higher-priced chips
| like Alder Lake and Ice Lake SP. Seems like it would be good
| for production volume.
| rincebrain wrote:
| I think they mean that the fact that they needed to backport
| Rocket Lake, rather than just having all their production on
| 10nm, implies much more limited 10nm capacity than the
| alternative would.
| zamadatix wrote:
| Production volume here refers to production volume of the node
| as a whole still being low, not to any risk of getting volume of
| these particular SKUs on the node.
| jlawer wrote:
| I was under the impression the issue with 10nm is frequency,
| which led to the Rocket Lake backport. Unfortunately it seems
| that the 10nm node's efficiency point is lower on the frequency
| curve. The reviewed Ice Lake processor was 300MHz lower than the
| previous generation (though with a much higher core count),
| despite higher power draw and the process node shrink.
|
| In laptop processors they can easily show efficiency gains from
| the 10nm process and the IPC improvement; it appears most laptop
| processors end up running at a power envelope below the ideal
| performance-per-watt point. Server processors with higher core
| counts mean you can run more workloads per server, again
| providing efficiency gains. However, desktop / gaming tends to
| be smaller core counts + higher frequency, with little concern
| for efficiency outside of quality-of-life factors (i.e. don't
| make me use a 1kW chiller). Intel has been pushing 5GHz
| processor frequencies for years, and Rocket Lake continues that
| push (5.3GHz boost); when they drop frequency to move to 10nm,
| it's hard to see an IPC improvement able to paper over that.
|
| However, Alder Lake CPUs will have a thread-count advantage, so
| at least with 24 threads they should be able to show a
| generational improvement over the current 8c/16t Rocket Lake
| parts. That will allow them to at least argue their value with
| select benchmarks and Intel-only features. Those 8 efficiency
| cores will likely be a BIG win on laptops, but on desktop I
| doubt they will compare favourably to the full-fat cores on a
| current Ryzen 5900X (i.e. a currently available 12-core /
| 24-thread processor).
|
| Intel is going to have at least one more BAD mainstream desktop
| generation before they can truly compete on the mainstream high
| end, though there is a chance they have something like a HEDT
| part that would allow them to at least save face. That being
| said, given the choice, Intel will give up desktop market share
| for the faster-growing laptop and server markets.
| buu700 wrote:
| What's a "jelly bean" processor? Trying to search for that just
| gets a bunch of hits about Android 4.1.
| madsushi wrote:
| https://news.ycombinator.com/item?id=17376874
|
| > [1] Jelly Bean chips are those that are made in batches of
| 1 - 10 million with a set of functions that are fairly
| specific to their application.
| foobarian wrote:
| Is that the chip-on-board packaging like described here:
| https://electronics.stackexchange.com/questions/9137/what-
| ki... ?
| ChuckMcM wrote:
| Sometimes referred to as an "application-specific processor"
| (ASP) or "system on chip" (SoC). These are the bulk of
| semiconductor sales these days, as they have replaced all of the
| miscellaneous gate logic on devices with a single programmable
| block that has a bunch of built-in peripherals.
|
| Think Atmel ATmega parts; there are trillions of these in
| various roles. When you consider that something like a 555
| timer[1] is now more cost-effectively and capably replaced with
| an 8-pin microcontroller, you can get an idea of the shift.
|
| While these are rarely built on the "leading edge" process node,
| when a new node takes over for high-margin chips, the previous
| node gets used for lower-margin chips, which effectively does a
| shrink on their die, reducing their cost (most of these chips
| seem to keep their performance specs fairly constant, preferring
| cost reduction over performance improvement).
|
| Anyway, the zillions of these chips in lots of different
| "flavors" are colloquially referred to as "jelly bean" chips.
| dragontamer wrote:
| http://sparks.gogo.co.nz/assets/_site_/downloads/smd-
| discret...
|
| > Jellybean is a common term for components that you keep in
| your parts inventory for when your project just needs "a
| transistor" or "a diode" or "a mosfet"
|
| -----------
|
| For many hobbyists, a Raspberry Pi or Arduino is a good example
| of a jellybean. You buy 10x Raspberry Pis and stuff your drawer
| full of them, because they're cheap enough to do most tasks. You
| don't really know what you're going to use all 10x Raspberry Pis
| for, but you know you'll find a use for them a few weeks from
| now.
|
| ---------
|
| At least, in my Comp. Engineering brain, I think of 2N2222
| transistors, or 2N3904 transistors, or the 741 op-amp. There are
| better op-amps and better transistors for any particular job.
| But I choose these parts because they're familiar, comfortable,
| cheap and well understood by a wide variety of engineers.
|
| Well, not the 741 op-amp anymore, anyway. The 741 was a
| jellybean back in the 12V days. Today, 5V has become the
| standard supply voltage (because of USB), so 5V op-amps are the
| more important "jellybean".
| klodolph wrote:
| I don't know how old you are, but the 741 was obsolete in
| the 1980s. It sticks around in EE textbooks because it's
| such an easy way to demonstrate _problems_ with op-amps...
| high input current, low gain-bandwidth product, low slew
| rate, etc.
|
| I think your jellybean op-amps would more likely be TL072,
| LM358, or NE5532.
| dragontamer wrote:
| Old, beaten-up textbooks from the corner of my neighborhood
| library were still talking about the 741 back in the 2000s,
| when I was in high school and started dabbling around with
| electricity more seriously.
|
| Maybe it was fully obsolete by that point, but high school +
| neighborhood libraries aren't exactly filled with up-to-date
| textbooks or the latest and greatest.
|
| I remember that Radio Shack was still selling kits with
| 741 in them, as well as breadboards and common
| components... 12V wall-warts and the like. Online
| shopping was beginning to get popular, but I was still a
| mallrat who picked up components and dug through old
| Radio Shack manuals into 2005 or 2006.
|
| It was the ability to walk around and see those component
| shelves sitting there in Radio Shack that got me curious about
| the hobby and started me researching it. I do wonder how modern
| kids are supposed to get interested in hobbies now that malls
| are less popular (and electronics shops like Radio Shack have
| basically disappeared).
|
| ------------
|
| I don't remember what we used in college. I know I was more
| selective by then and understood the kinds of problems various
| op-amps had. Also, you're not really rich enough to invest in a
| private stockpile of chips, and instead just use whatever the
| labs are stocked with in college.
|
| The LM358 is the jellybean I keep in my drawer today, if you're
| curious. Old habits die hard though; I still think of the 741 as
| the jellybean even though it really is obsolete today.
| ChuckMcM wrote:
| I've got a tube each of 358's and 1458's (dual version)
| in my parts supplies. But my microwave stuff is finding
| them lacking.
| bavell wrote:
| +1 for LM358
| carlhjerpe wrote:
| What I don't understand is: ASML is building these machines for
| making ICs. Why can TSMC use them for 7nm but Intel can only use
| them for 10nm right now? Doesn't ASML make the lenses as well,
| so that you're "only" stuck making the etching thingy (forgot
| what it's called, but the reflective template of a CPU)?
|
| It seems like nobody is talking about this; could anyone shed
| some light?
| dragontamer wrote:
| Consider that the wavelength of red light is 700 nm, and the
| wavelength of UV-C is 100nm to 280nm.
|
| And immediately we see the problem with dropping to 10nm: that's
| literally smaller than the wavelength of the light on its way to
| the final target.
|
| And yeah, 10nm and 7nm are marketing terms, but that doesn't
| change the fact that these processes are all smaller than the
| wavelength of light.
|
| -------
|
| So there are two ways to get around this problem.
|
| 1. Use shorter-wavelength light: "extreme UV" at 13.5nm is even
| shorter than normal UV. Kind of the obvious solution, but it's
| higher energy and changes the chemistry slightly, since the
| light is a different color. Things are getting mighty close to
| literal "X-ray lasers" as it is, so the power requirements are
| getting quite substantial.
|
| 2. Multipatterning -- instead of exposing the entire pattern in
| one shot, do it in multiple shots and "carefully line up" the
| layers between shots. As difficult as it sounds, it's been done
| before at 40nm and other processes.
| (https://en.wikipedia.org/wiki/Multiple_patterning#EUV_Multip...)
|
| 3. Do both at the same time to reach 5nm, 4nm, or 3nm. 10nm and
| 7nm is the point where the various companies had to decide
| whether to do #1 or #2 first; either way, your company needs to
| learn to do both in the long term. TSMC and Samsung went with #1
| (EUV), and I think Intel thought that #2 (multi-patterning)
| would be easier.
|
| And the rest is history. Seems like EUV was easier after all,
| and TSMC's and Samsung's bets paid off.
|
| Mind you, I barely know any of the stuff I'm talking about.
| I'm not a physicist or chemist. But the above is my general
| understanding of the issues. I'm sure Intel had their reasons
| to believe why multipatterning would be easier. Maybe it was
| easier, but other company issues drove away engineers and
| something unrelated caused Intel to fall behind.
| vzidex wrote:
| I'll take a crack at it, though I'm only in undergrad (took a
| course on VLSI this semester).
|
| Making a device at a specific technology node (e.g. 14nm,
| 10nm, 7nm) isn't just about the lithography, although litho
| is crucial too. In effect, lithography is what allows you to
| "draw" patterns onto a wafer, but then you still need to do
| various things to that patterned wafer (deposition, etching,
| polishing, cleaning, etc.). Going from "we have litho
| machines capable of X nm spacing" to "we can manufacture a
| CPU on this node at scale with good yield" requires a huge
| amount of low-level design to figure out transistor sizings,
| spacings, and then how to actually manufacture the designed
| transistors and gates using the steps listed above.
| mqus wrote:
| TSMC's 7nm is roughly equivalent to Intel's 10nm; the numbers
| don't really mean anything and are not comparable.
| lifeisstillgood wrote:
| This might be a very dumb question but it always bothered me:
| silicon wafers are always shown as big circles, but processor
| dies are obviously rectangular. Yet it looks like the etching
| etc. goes right to the circular edges - wouldn't it be better to
| leave that dead space untouched?
| pas wrote:
| I think these are just press/PR wafers and real production ones
| don't pattern the edge. (First of all it takes time, and in the
| case of EUV it means things amortize even faster, because every
| shot damages the "optical elements" a bit.)
|
| edit: it also depends on how many dies the mask (reticle) has on
| it. Intel uses one-die reticles, so in theory their real wafers
| have no situation in which they have partial dies at the edge.
| w0utert wrote:
| Most semiconductor production processes like etching, doping,
| polish etc are done on the full wafer, not on individual
| images/fields. So there is nothing to be gained there in terms
| of production efficiency.
|
| The litho step could in theory be optimized by skipping
| incomplete fields at the edges, but the reduction in exposure
| time would be relatively small, especially for smaller designs
| that fit multiple chips within a single image field. I imagine
| it would also introduce yield risk because of things like uneven
| wafer stress & temperature, higher variability in stage move
| time when stepping edge fields vs center fields, etc.
| andromeduck wrote:
| Many of the process steps involve rotation so this is
| impractical.
| jvanderbot wrote:
| From Anandtech[1]:
|
| "As impressive as the new Xeon 8380 is from a generational and
| technical stand-point, what really matters at the end of the day
| is how it fares up to the competition. I'll be blunt here; nobody
| really expected the new ICL-SP parts to beat AMD or the new Arm
| competition - and it didn't. The competitive gap had been so
| gigantic, with silly scenarios such as where a competing 1-socket
| systems would outperform Intel's 2-socket solutions. Ice Lake SP
| gets rid of those more embarrassing situations, and narrows the
| performance gap significantly, however the gap still remains, and
| is still undeniable."
|
| This sounds about right for a company fraught with so many
| process problems lately: play catch-up for a while and hope you
| hit fewer of them in the future so you can keep narrowing the
| gap.
|
| "Narrow the gap significantly" sounds like good technical
| progress for Intel. But the business message isn't wonderful.
|
| 1. https://www.anandtech.com/show/16594/intel-3rd-gen-xeon-
| scal...
| ajross wrote:
| I don't know that it's all so bad. The final takeaway is that a
| 660mm2 Intel die at 270W got about 70-80% of the performance
| that AMD's 1000mm2 MCM gets at 250W. So performance per
| transistor is similar, but per watt Intel lags. But then the
| idle draw was significantly better (AMD's idle power remains a
| problem across the Zen designs), so for many use cases it's
| probably a draw.
|
| That sounds "competetive enough" to me in the datacenter world,
| given the existing market lead Intel has.
| marmaduke wrote:
| It's impressive how you and parent comment copied over
| to/from the dupe posting verbatim.
|
| _edit_ oops nevermind, I see my comment was also
| mysteriously transported from the dupe.
| Symmetry wrote:
| I'm not sure that's a fair area comparison? AMD only has
| around 600 mm2 of expensive leading edge 7nm silicon and uses
| chiplets to up their yields. The rest is the connecting bits
| from an older and cheaper process. Intel's full size is a
| single monolithic die on a leading edge process.
| ineedasername wrote:
| Do chiplets underperform compared to a monolithic die?
| wmf wrote:
| Yes.
| Symmetry wrote:
| All things being equal a chiplet design will underperform
| a monolithic die. But we've already seen the benchmarks
| on the performance of Milan so looking at chiplets versus
| monolithic is mostly about considering AMD's strategy and
| constraints rather than how the chips perform.
| monocasa wrote:
| Pretty much any time you have signals going off chip, you
| lose out on both bandwidth and latency.
| ComputerGuru wrote:
| I would argue that for high-end servers, idle draw is a bit of a
| non-issue: presumably either you have only one of these machines
| and it's sitting idle (in which case it doesn't matter how
| inefficient it is), or you have hundreds/thousands of them and
| they'll be as far from idle as it's possible to be.
|
| AMD's idle power consumption is a bigger issue for desktop,
| laptop, and HEDT.
| rbanffy wrote:
| If it has 80% of the performance, it will still be
| competitive at 80% of the price.
| ShroudedNight wrote:
| This sounds like a dangerous assumption to make. I would
| expect that needing 25% more machines for the same
| performance would be a non-starter for many potential
| customers.
| throwaway4good wrote:
| I would expect most high-end servers in data centers to sit
| idle most of the time? Do you know of any data on this?
| ajross wrote:
| Most servers are doing things for human beings, and we
| have irregular schedules. Standard rule of thumb is that
| you plan for a peak capacity of 10x average. A datacenter
| that _doesn't_ have significant idle capacity is one
| that's some kind of weird special purpose thing like a
| mining facility.
| adrian_b wrote:
| That's true, but I would expect that most idle servers
| are turned off and they use Wake-on-LAN to become active
| when there is work to do.
|
| Just a few servers could be kept idle, not off, to enable
| a sub-second start-up time for some new work.
| jeffbee wrote:
| Certainly for bit players and corporate datacenters with
| utilization < 1% you'd expect the median server to just
| sit there. For larger (amazon, google, etc) players the
| economic incentives against idleness are just too great.
| JoshTriplett wrote:
| > For larger (amazon, google, etc) players the economic
| incentives against idleness are just too great.
|
| Not all workloads are CPU-bound. Cloud providers have
| _many_ servers for which the CPUs are idle most of the
| time, because they're disk-bound, network-bound, other-
| server-bound, bursty, or similar. They're going to aim to
| minimize the idle time, but they can't eliminate it
| entirely given that they have customer-defined workloads.
| mamon wrote:
| But if the workload is not CPU-bound then why would they
| care about upgrading their CPUs to more performant ones,
| like Ice Lake Xeons?
| JoshTriplett wrote:
| The workloads are determined by their customers, and
| customers don't always pick the exact size system they
| need (or there isn't always an option for the exact size
| system they need). The major clouds are going to upgrade
| and offer faster CPUs as an option, people are going to
| use that option, and some of their workloads will end up
| idling the CPU. Major cloud vendors almost certainly have
| statistics for "here's how much idle time we have, so
| here's approximately how much we'd save with lower power
| consumption on idle".
| ajross wrote:
| Electricity costs for large datacenters are higher than the
| equipment costs. They absolutely care about idle draw.
| bostonsre wrote:
| If you are the one paying the electricity bills for that
| datacenter, then yes, it probably matters to you a lot.
| If you are just renting a server from aws or gcp, it
| probably matters less. Although, I assume costs borne from idle
| inefficiency will probably be passed on to the customer...
| [deleted]
| spideymans wrote:
| Shouldn't datacenters attempt to minimize idle time
| though? A server sitting at idle is a depreciating asset
| that could likely be put to more productive use if tasks
| were rescheduled to take advantage of idle time (this
| would also reduce the total number of servers needed).
| deelowe wrote:
| Utilization is a very difficult problem to solve. The
| difference between peak and off peak utilization can be
| as much as 70% or more depending on the application.
| gumby wrote:
| That is definitely the objective but the reality is that
| load is not* uniform over the day. So you are paying to
| keep some number of servers hot (I don't know about
| spinup/spindown practices in modern datacenters).
|
| I doubt this applies to HPC (the target market for this
| part) as they either schedule jobs closely or could, I
| imagine, shut them down. But I'm not in that space either
| so this is merely conjecture.
|
| * I am sure there are corner cases where the load _is_
| uniform, but they are by definition few.
| ComputerGuru wrote:
| If you have enough servers for idle draw to be more than
| a rounding error in your opex breakdown, then you have a
| strategy to keep idle time to zero. It doesn't make any
| financial sense (no matter how low idle draw is) to have
| a server sit idle (or even powered off, but that's a
| capex problem).
| my123 wrote:
| For a cloud infrastructure, you have a significant part
| at idle, for when customers want to instantly spawn a VM.
| zsmi wrote:
| The target market for this part is not that kind of
| datacenter.
|
| Based on the article they're targeting high performance
| compute, i.e. "application codes used in earth system
| modeling, financial services, manufacturing, as well as
| life and material science."
| klodolph wrote:
| The opposite is true... a major advantage of running
| cloud infrastructure is that you can run your CPUs near
| 100% all the time. CPUs which are not running full bore
| can have jobs moved to them.
| jrockway wrote:
| Yeah, I think it's hard to keep your computers at 100%
| utilization for the entire day. You host services close
| to your users, and your users go to bed at some point,
| many of them at around the same time every day. Then your
| computers have very little work to do.
|
| Some bigger companies have a lot of batch jobs that can
| run overnight and steal idle cycles, but you have to be
| gigantic before that's realistic. (My experience with
| writing gigantic batch jobs is that I just requisitioned
| the compute at "production quality" so I could work on
| them during the day, rather than waiting for them to run
| overnight. Not sure what other people did, and therefore
| not really sure how much runs overnight at big
| companies.)
|
| Cloud providers have spot instances that could take up
| some of this slack, but I bet there is plenty of idle
| capacity precisely because the cost can't go to $0
| because of electricity use. Or I could be completely
| wrong about workloads, maybe everyone has their web
| servers and CI systems running at 100% CPU all night.
| I've never seen it, though.
| thekrendal wrote:
| Or for redundancy's sake, if you're using any kind of sane
| setup. (Yes, YMMV bigly with this particular idea.)
| chomp wrote:
| Can confirm, built out a datacenter space in a past life.
| Power costs were of limited concern - cooling was the
| limited resource. Even then, literally no one went down a
| spec sheet and compared "hmm, this one uses a tiny bit fewer
| watts at idle". We just kept servers dark regardless so that we
| could save on cooling. Nitpicking idle draw for server
| processors just isn't realistic in a lot of cases.
| dahfizz wrote:
| large datacenters have hardware orchestration systems
| that let them turn off unused machines. There really is
| no reason to have lots of machines on but unused. At
| least, that is not a significant enough event to be a
| determining factor in hardware purchasing.
| neogodless wrote:
| A bit off topic from the server CPU discussion, but I was
| curious how well AMD is advancing idle power consumption.
|
| For example, the Ryzen 3000 desktop chips seemed to have
| the issue[0], but the same Zen 2 cores seem to have found
| some improvements in the Ryzen 4000 mobile chips[1].
|
| I didn't want to just rely on Reddit forum comments, so I
| found this measure of the Ryzen 3600[2].
|
| > When one thread is active, it sits at 12.8 W, but as we
| ramp up the cores, we get to 11.2 W per core. The non-core
| part of the processor, such as the IO chip, the DRAM
| channels and the PCIe lanes, even at idle still consume
| around 12-18 W in the system.
|
| My interpretation was expect ~12 W or more idle consumption
| (just from the CPU package), but I'm not sure I understand
| it correctly.
|
| I couldn't find the same information for Ryzen 4000
| laptops, but the same APU is tested in a NUC, where the
| total system draw (at the wall) at idle was about 10-11 W,
| still nearly double that of a Core i7 U-series NUC[3], but
| certainly lower than that of just the CPU package in the
| Ryzen 3600.
|
| Anecdotally, my 45W Ryzen 7 4800H laptop with 15.6" 1080p
| screen lasts about 4 hours on 80% of the 60Wh battery with
| 95% brightness, doing various non-intensive tasks. Though I
| don't know how well the battery holds up on complete non-
| use standby.
|
| [0] https://old.reddit.com/r/AMDHelp/comments/cfm1xa/why_is
| _ryze...
|
| [1] https://old.reddit.com/r/Amd/comments/haq4fg/the_idle_p
| ower_...
|
| [2] https://www.anandtech.com/show/15787/amd-
| ryzen-5-3600-review...
|
| [3] https://www.anandtech.com/show/16236/asrock-4x4-box4800
| u-ren...
| bkor wrote:
| > I couldn't find the same information for Ryzen 4000
| laptops
|
| I measured an Asus Mini PC PN50 with a Ryzen 4500U. The
| idle power usage was 8.5 Watt for the system. This with
| 32GB of memory and a SATA SSD installed. It would be nice
| if it was lower than this, but it isn't too bad.
| Interestingly, the machine used 1.2 Watts while off after it had
| been without power, and 0.5 Watts after starting it up and
| shutting it down.
|
| Recently noticed some people focussing on low power but
| powerful 24/7 home "servers". Systems that are on 24/7,
| but often idle. One system used around 4.5 Watt in idle.
| The "brick" / power adapter often uses too much power,
| even when everything is off.
| wtallis wrote:
| Ryzen 3000 desktop processors use a chiplet design, with
| the IO die built on an older process than the processor
| dies. Ryzen 4000 mobile processors are monolithic dies,
| so they don't have the extra power of the inter-chiplet
| connections and they're entirely 7nm parts instead of a
| mix of 7nm and 14nm.
| monocasa wrote:
| You can't really compare die sizes of a MCM and a single die
| and expect to get transistor counts out of that. So much of
| the area of the MCM is taken up by all the separate phys to
| communicate between the chiplets and the I/O die, and the I/O
| die itself is on GF14nm (about equivalent to Intel 22nm) last
| time I checked, not a new competitive logic node.
|
| There's probably a few more gates still on the AMD side, but
| it's not the half again larger that you'd expect looking at
| area alone.
| jvanderbot wrote:
| Furthermore:
|
| "At the end of the day, Ice Lake SP is a success. Performance
| is up, and performance per watt is up. I'm sure if we were able
| to test Intel's acceleration enhancements more thoroughly, we
| would be able to corroborate some of the results and hype that
| Intel wants to generate around its product. But even as a
| success, it's not a traditional competitive success. The
| generational improvements are there and they are large, and as
| long as Intel is the market share leader, this should translate
| into upgraded systems and deployments throughout the enterprise
| industry. Intel is still in a tough competitive situation
| overall with the high quality the rest of the market is
| enabling."
| jandrese wrote:
| I found it a little weird that the conclusions section didn't
| mention the AMD or ARM competition at all, given that the Intel
| chip seemed to be behind them in most of the tests.
| jvanderbot wrote:
| You mean OP didn't? Yes, that's probably standard PR to
| focus on strengths rather than competition.
| jandrese wrote:
| I mean the Anand piece.
| jvanderbot wrote:
| The conclusions section was quoted in my post and they
| explicitly mention it.
|
| "As impressive as the new Xeon 8380 is from a
| generational and technical stand-point, what really
| matters at the end of the day is how it fares up to the
| competition. I'll be blunt here; nobody really expected
| the new ICL-SP parts to beat AMD or the new Arm
| competition - and it didn't. "
| ksec wrote:
| It is certainly good enough to compete: prioritise fab capacity
| for the server unit and lock in those important (swaying) deals
| with clients. Sales and marketing can work their connections,
| along with the software tools the HPC market needs, where Intel
| AFAIK is still far ahead of AMD.
|
| And I can bet those prices have lots of room for special
| discounts to clients. Since RAM and NAND storage dominate the
| cost of a server, the price difference between Intel and AMD
| shrinks rapidly in the grand scheme of things, giving Intel a
| chance to fight. And there is something not mentioned enough:
| the importance of PCIe 4.0 support.
|
| I wanted to rant about AMD, but I guess there is not much
| point. ARM is coming.
| quelsolaar wrote:
| >This sounds about right for a company fraught with so many
| process problems lately
|
| The problems have only become public lately, but the things that
| caused them happened much further back.
|
| I'm cautiously bullish on Intel. From what I gather, Intel is in
| a much better place internally. They have much better focus,
| there is less infighting, it's more engineering-led than
| sales-led, they have some very good people, and they are no
| longer complacent. It will however take years before this
| becomes visible from the outside.
|
| Given the demand for CPUs and the competition's inability to
| deliver, I think Intel will do OK while they try to catch up,
| even if they are no one's first choice of CPU vendor.
| intricatedetail wrote:
| Why does Intel even bother releasing products that don't bring
| anything new and worthwhile to the table? This is such a massive
| waste of time, resources and the environment.
| w0mbat wrote:
| 10nm? I love retro-computing.
| ajross wrote:
| As gets repeated ad nauseam, industry numbering has gone wonky.
| Intel still hews more or less to the ITRS labelling for its
| nodes, which means that its 10nm process has pitches and density
| values along the same lines as TSMC's or Samsung's 7nm
| processes.
|
| This is, indeed, no longer an industry leading density and it
| lags what you see on "5nm" parts from Apple and Qualcomm. But
| it's the same density that AMD is using for the Zen 2/3 devices
| against which this is competing in the datacenter.
| adrian_b wrote:
| Maybe the density is the same, but the 10-nm process variant
| that Intel is forced to use for Ice Lake Server is much worse
| than the 7-nm TSMC process.
|
| It is worse in the sense that at the same number of active
| cores and the same power consumption, the 10-nm Ice Lake
| Server can reach only a much lower clock frequency than the
| 7-nm Epyc, which results in a much lower performance for
| anything that does not use AVX-512.
|
| It is also worse in the sense that the maximum clock
| frequency when the power limits are not reached is also much
| worse for the 10-nm process used for Ice Lake Server.
|
| Ice Lake Server does not use the improved 10-nm process
| (SuperFin) that is used for Tiger Lake and it is strongly
| handicapped because of that.
| ac29 wrote:
| While I'd agree with you that TSMC's current 7nm seems to be
| better than Intel's current 10nm, comparing Epyc to Ice Lake SP
| isn't quite apples to apples. Intel is putting (up to) 40 cores
| on a single die; AMD only puts 8 cores per die. It looks like
| AMD has the better method for overall performance, and Intel
| will likely follow them - in addition to being able to get more
| cores into a socket, I suspect Intel could also crank frequency
| higher with fewer cores per die.
| adrian_b wrote:
| For the user it does not matter how many cores are on a
| die.
|
| For the user it matters what is included in a package.
| The new Ice Lake Server package (77.5 mm x 56.5 mm) has
| finally reached about the same size as the Epyc package
| (75.4 mm x 58.5 mm), because now Intel offers for the
| first time 8 memory channels, like its competitors have
| offered for many years.
|
| So in packages of the same size, Intel has 40 cores,
| while AMD offers 64 cores. Moreover Intel requires an
| extra package for the I/O controller, while AMD includes
| it in the CPU package.
|
| So for general-purpose users, AMD offers much more in the
| same space.
|
| On the other hand, Ice Lake Server has twice the number
| of FMA units, so it has as many floating-point
| multipliers as 80 AMD cores. This advantage is diminished
| by the fact that the clock frequency for heavy AVX-512
| instructions is only 80% of the nominal frequency, but it
| can still give an advantage to Ice Lake Server for the
| programs that can use AVX-512.
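|
| Rough clock-for-clock arithmetic behind that claim, assuming 2x
| 512-bit FMA units per Ice Lake SP core and 2x 256-bit FMA units
| per Zen 3 core (a sketch in Python, not a benchmark):
|
|     # FP64 multiplier (FMA lane) count per socket, clock-for-clock.
|     # The unit counts are the assumptions stated above.
|     def fp64_lanes(cores, fma_units_per_core, vector_bits):
|         return cores * fma_units_per_core * (vector_bits // 64)
|
|     icelake = fp64_lanes(40, 2, 512)  # 640 lanes
|     milan = fp64_lanes(64, 2, 256)    # 512 lanes
|     equivalent_zen3_cores = icelake / fp64_lanes(1, 2, 256)
|     print(icelake, milan, equivalent_zen3_cores)  # 640 512 80.0
|
| The ~0.8x AVX-512 clock penalty mentioned above then eats into
| that 640-lane figure in practice.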
| totalZero wrote:
| From a yield perspective, if core failures are
| independent events, binning is probably easier with the
| big chiplet approach.
|
| The Epyc 3 approach does have some drawbacks. Looking at
| the Epyc 3 TDP numbers, there's probably a nontrivial
| thermal cost to breaking out the dies as AMD has. Not to
| mention the I/O for Epyc 3 is not on TSMC 7nm.
| mhh__ wrote:
| Intel's process has been a disaster; however, considering that
| for the most part they aren't _that_ far behind (especially
| financially), I don't think they have to catch up much on
| process, at least, to be right back in the fight - I will
| believe that the pecking order has truly changed when AMD's
| documentation and software is as good as Intel's.
| Pr0GrasTiNati0n wrote:
| And only 20 of those cores have back doors.....lulz
| Sephr wrote:
| As disappointing as the perf is for server workloads, what I'm
| really interested in is SLI gaming performance. I can imagine
| that this would be a boon for high end gaming with multiple x16
| PCIe 4.0 slots and 8 DDR4 channels.
|
| SLI really shines on HEDT platforms, and this is probably the
| last non-multi-chip quasi-HEDT CPU for a while with this kind of
| IO.
|
| (Yes, I know SLI is 'dead' with the latest generation of GPUs)
| zamadatix wrote:
| These would be absolute trash for SLI performance vs top-end
| standard consumer desktop parts. The best SKU has a peak boost
| clock of 3.7 GHz, the core-to-core latencies are about twice as
| high as the desktop parts', and the memory + PCIe bandwidth
| means little to nothing for gaming performance (remember SLI
| bandwidth goes over a dedicated bridge as well), which is highly
| sensitive to latencies instead.
| marmaduke wrote:
| Nice to see that AVX512 hasn't died with Xeon Phi. I see it
| coming out in a number of high end but lightweight notebooks too
| (Surface Pro with i7 10XXG7, MacBookPro 13" idem). This is a nice
| way to avoid needing a GPU for heavily vectorizable compute
| assuming you don't need the CUDA ecosystem.
| api wrote:
| The 2020 Intel MacBook Air and 13" Pro have 10nm Ice Lake with
| AVX512. The Ice Lake MacBook Air performs pretty well and very
| close to the Ice Lake Pro, though of course the M1 destroys it.
| mhh__ wrote:
| > though of course the M1 destroys it.
|
| SIMD throughput?
| api wrote:
| Actually I don't know... I suspect Intel still wins in wide
| SIMD. The M1 totally destroys Intel in general purpose code
| performance, especially when you consider power
| consumption.
| bitcharmer wrote:
| AVX-512 is an abomination in my field and we avoid it like the
| plague. It looks like we're not the only ones. Linus has a lot
| to say about it as well.
|
| https://www.phoronix.com/scan.php?page=news_item&px=Linus-To...
| 37ef_ced3 wrote:
| For example, AVX-512 neural net inference: https://NN-512.com
|
| Only interesting if you care about price (dollars spent per
| inference)
|
| For raw speed (no matter the price) the GPU wins
| dragontamer wrote:
| GPGPU will never really be able to take over CPU-based SIMD.
|
| GPUs have far more bandwidth, but CPUs beat them in latency.
| Being able to AVX512 your L1 cached data for a memcpy will
| always be superior to passing data to the GPU.
|
| With Ice Lake's 1MB L2 cache, pretty much all tasks smaller
| than 1MB will be superior in AVX512 rather than sending it to a
| GPU. Sorting 250,000 Float32 elements? Better to SIMD Bitonic
| sort / SIMD Mergepath
| (https://web.cs.ucdavis.edu/~amenta/f15/GPUmp.pdf) on your
| AVX512 rather than spend a 5us PCIe 4.0 traversal to the GPU.
|
| It is better to keep the data hot in your L2 / L3 cache, rather
| than pipe it to a remote computer (even if the 16x PCIe 4.0
| pipe is 32GB/s and the HBM2 RAM is high bandwidth once it gets
| there).
|
| --------
|
| But similarly: CPU SIMD can never compete against GPGPUs at
| what they do. GPUs have access to 8GBs @500GB/s VRAM on the
| low-end and 40GBs @1000GB/s on the high end (NVidia's A100).
| EDIT: Some responses have reminded me about the 80GB @ 2000GB/s
| models NVidia recently released.
|
| CPUs barely scratch 200GB/s on the high end, since DDR4 is just
| slower than GPU RAM. For any problem where data bandwidth and
| parallelism are the bottleneck, and that fits inside GPU VRAM
| (such as many, many sequences of large-scale matrix
| multiplications), it will pretty much always be better to
| compute that sort of thing on a GPU.
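|
| A back-of-envelope sketch of that crossover in Python (the link
| bandwidth and latency are the rough figures quoted above; the
| cache bandwidth is an order-of-magnitude assumption):
|
|     PCIE4_X16_BW = 32e9   # bytes/s, ~32GB/s for a 16x PCIe 4.0 link
|     PCIE_LATENCY = 5e-6   # seconds, ~5us traversal to the GPU
|     CACHE_BW = 200e9      # bytes/s, assumed cache streaming rate
|
|     def ship_to_gpu(nbytes):     # one-way transfer time only
|         return PCIE_LATENCY + nbytes / PCIE4_X16_BW
|
|     def crunch_locally(nbytes):  # one streaming pass in cache
|         return nbytes / CACHE_BW
|
|     for nbytes in (64 * 1024, 1 * 2**20, 64 * 2**20):
|         print(nbytes, ship_to_gpu(nbytes), crunch_locally(nbytes))
|
| For a ~1MB working set, the PCIe hop alone costs several times
| more than streaming the data through cache-resident SIMD (and
| the results still have to come back); only for much larger,
| bandwidth-bound problems does the GPU's VRAM start to pay for
| the round trip.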
| marmaduke wrote:
| In my experience, the most important aspect missing in most
| CPU GPU discussions, is that CPUs have a massive cache
| compared to GPUs, and that cache has pretty good bandwidth
| (~30 GB/core?), even if main memory doesn't. So even if your
| task's hot data doesn't fit in L2 but in L3/core, AVX-
| whatever per core processing is a good bet regardless of what
| a GPU can do.
|
| Another aspect that seems like a hidden assumption in CPU-GPU
| discussions is that you have the time-energy-expertise budget
| to (re)build your application to fit GPUs.
| dragontamer wrote:
| On the memory perspective, I basically see problems in
| roughly the following grouping of categories:
|
| 40TBs+ -- Storage-only solutions. "External Tape Merge sort
| algorithm", "Sequential Table Scan", etc. etc. (SSDs or
| even Hard drives if you go big enough)
|
| 4TB to 40TBs -- Multi-socket DDR4 RAM is king (8-way Ice
| Lake Xeon Scalable Platinum will probably reach 40TBs).
| Single-node distributed memory with NUMA / UPI to scale.
|
| 1TB to 4TB -- Single Socket DDR4 RAM (EPYC, even if at 4x
| NUMA. Or Single-node Ice Lake).
|
| 80GB to 1TB -- DGX / NVlink distributed memory A100 ganging
| up HBM2 together. GPU-distributed RAM is king.
|
| 256MBs to 80GBs -- HBM2 / GDDR6 Graphics RAM is king (80GB
| A100 2TB/s).
|
| 1.5MBs to 256MBs -- L3 cache is king (8x32MBs EPYC L3
| cache, or POWER9 110MB+ L3 cache unified)
|
| 128kB to 1.5MBs -- L2 cache is king (1.25MB Ice Lake Xeons
| L2, this article)
|
| 1kB to 128kB -- L1 cache is king. (128kB L1 cache on Apple
| M1). Note: "GPU __Shared__" is a close analog to L1 and
| competes against it, but is shared between 32 to 256 GPU
| threads, so it's not an apples-to-apples comparison.
|
| 1kB and below -- The realm of register-space solutions.
| (See 64-bit chess engine bitboards and the like). Almost
| fully CPU-constrained / GPU-constrained programming. 256x
| 32-bit GPU registers per GPU-thread / SIMD thread. CPUs
| have fewer nominal registers, but many "out of order"
| buffers or "reorder buffers" that practically count as
| register storage in a practical / pragmatic sense. CPUs
| just use their "real registers" as a mechanism to
| automatically discover parallelism in otherwise single-
| thread written code.
|
| ------------
|
| As you can see: GPUs win in some categories, but CPUs win
| in others. And these numbers change every few months as a
| new CPU and/or GPU comes out. And at the lowest levels:
| CPUs and GPUs cannot be compared due to fundamental
| differences in architecture.
|
| For example: GPU __shared__ memory has gather/scatter
| capabilities (the NVidia PTX instructions / AMD GCN
| instructions permute vs bpermute), while CPUs traditionally
| only accelerate gather capabilities (pshufb), and leave
| vgather/vscatter instructions to the L1 cache instead. GPUs
| have 32x ports to __shared__, so every one of the
| 32-threads in a wave-front can read/write every single
| clock-tick (as long as all 32 are on different
| ports/alignment, or you have a special one-to-all
| broadcast). CPUs only have 2 or 4 ports, so vscatter and
| vgather operate slowly, as if a single thread were
| reading/writing each of the memory locations.
|
| But CPU L1 cache has store-forwarding, MESI + cache
| coherence, and other acceleration features that GPUs don't
| have.
|
| GPUs are therefore more efficient at sharing data within
| workgroups of ~256 threads, but CPUs are more efficient at
| sharing data between cores, or even among out-of-die NUMA
| solutions, thanks to robust MESI messaging.
| ajross wrote:
| FWIW: your DRAM numbers are quoting clock speeds and not
| bandwidth. They aren't linear at all. In fact with enough
| cores you can easily saturate memory that wide, and CPUs are
| getting wider just as fast as GPUs are. The giant Epyc AMD
| pushed out last fall has 8 (!) 64 bit DRAM channels, where
| IIRC the biggest NVIDIA part is still at 6.
| mrb wrote:
| dragontamer is still correct. He quotes correct bandwidth
| numbers. EPYC's 8 channels of DDR4-3200 gets it to 204.8
| GB/s (and, yes, that's _bandwidth_)
|
| Whereas Nvidia's A100 has over 2000 GB/s of memory
| bandwidth. That's 10-fold better.
| dragontamer wrote:
| > 8 (!) 64 bit DRAM channels
|
| Yeah. And at 3200 MT/s, that comes out to ~200GB/s: 3200 MT/s x
| 8 bytes (aka 64 bits) = ~25.6GB/s per channel, x8 channels =
| ~205GB/s.
|
| > where IIRC the biggest NVIDIA part is still at 6.
|
| That's 6x *1024-bit* HBM2 channels. Total bandwidth is
| 2000GB/s, or roughly 10x the speed of the "8x channel EPYC".
| Yeah, HBM2 is fat, extremely fat.
|
| ----------
|
| *ONE* HBM2 channel offers over 300GBps bandwidth. And the
| A100 has *SIX* of them. Literally ONE HBM2 channel beats
| the speed of all 8x DDR4 EPYC memory channels working in
| parallel.
| ajross wrote:
| You're still quoting clock speeds. That's not how this
| works. Go check a timing diagram for a DRAM cycle in your
| part of choice and do the math.
| dragontamer wrote:
| Do you know what 3200MHz / PC4-25600 DDR4 means?
|
| 25600 is the channel rate in (EDIT) MB/sec of the stick
| of RAM. That's 25GB/s for a 3200 MHz DDR4 stick. x8 (for
| 8-channels working in parallel) is 200GB/s.
|
| -----------
|
| This has been measured in practice by Netflix: https://20
| 19.eurobsdcon.org/slides/NUMA%20Optimizations%20in...
|
| As you can see, Netflix's FreeBSD optimizations have
| allowed EPYC to reach 194GB/s measured performance (or
| just under the 200GB/s theoretical). And only with VERY
| careful NUMA-tuning and extreme optimizations were they
| able to get there.
| gbl08ma wrote:
| All of that is bandwidth and clock speed, not latency
| dragontamer wrote:
| Look, if CPUs were better at memory latency, the BVH-
| traversal of raytracing would still be done on CPUs.
|
| BVH-tree traversals are done on the GPU now for a reason.
| GPUs are better at latency hiding and taking advantage of
| larger sets of bandwidth than CPUs. Yes, even on things
| like pointer-chasing through a BVH-tree for AABB bounds
| checking.
|
| GPUs have pushed latency down and latency-hiding up to
| unimaginable figures. In terms of absolute latency,
| you're right, GPUs are still higher latency than CPUs.
| But in terms of "practical" effects (once accounting for
| latency hiding tricks on the GPU, such as 8x way
| occupancy (similar to hyperthreading), as well as some
| dedicated datastructures / programming tricks (largely
| taking advantage of the millions of rays processed in
| parallel per frame), it turns out that you can convert
| many latency-bound problems into bandwidth-constrained
| problems.
|
| -----------
|
| That's the funny thing about computer science. It turns
| out that with enough RAM and enough parallelism, you can
| convert ANY latency-bound problem into a bandwidth-bound
| problem. You just need enough cache to hold the results
| in the meantime, while you process other stuff in
| parallel.
|
| Raytracing is an excellent example of this form of
| latency hiding. Bouncing a ray off of your global data-
| structure of objects involved traversing pointers down
| the BVH tree. A ton of linked-list like current_node =
| current_node->next like operations (depending on which
| current_node->child the ray hit).
|
| From the perspective of any single ray, it looks like it's
| latency-bound. But from the perspective of processing 2.073
| million rays across a 1920 x 1080 video game scene with realtime
| raytracing enabled, it's bandwidth-bound.
| wmf wrote:
| That presentation shows 194 gigabits/s which is only ~24
| gigabytes/s at the NIC; that requires ~96 gigabytes/s of
| memory bandwidth. Usable memory bandwidth on Milan is
| only <120 gigabytes/s which is about 60% of the
| theoretical max. DRAM never gets more than ~80% of
| theoretical max bandwidth because of command overhead
| (which is what I think ajross keeps alluding to).
| https://www.anandtech.com/show/16594/intel-3rd-gen-xeon-
| scal...
| dragontamer wrote:
| I appreciate the correction. It seems like I made the
| mistake of Gbit vs GByte confusion (little-b vs big-B).
|
| > (which is what I think ajross keeps alluding to)
|
| It seems like ajross is accusing me of underestimating
| CPU-bandwidth. At least, that's my interpretation of the
| discussion so far. As you've pointed out however, I'm
| overestimating it.
|
| EDIT: But I'm overestimating it on both sides. The A100's
| 2000GB/s is the "channel bandwidth" as well, as the CAS and RAS
| commands still need to go through the channel and get
| interpreted.
| volta83 wrote:
| > Being able to AVX512 your L1 cached data for a memcpy will
| always be superior to passing data to the GPU.
|
| The two last apps I worked on have been GPU-only. The CPU
| process starts running and launches GPU work, and that's it,
| the GPU does all the work until the process exits.
|
| There is no need to "pass data to the GPU" because data is
| never on CPU memory, so there is nothing to pass from there.
| All network and file I/O goes directly to the GPU.
|
| Once all your software runs on the GPU, passing data to the
| CPU for some small task doesn't make much sense either.
| dragontamer wrote:
| So we know that GPUs are really good at raytracing and
| matrix multiplication, two things that are needed for
| graphics programming.
|
| However, the famous "Moana" scene for Disney-level
| productions is a 93GB (!!!!) scene statically, with another
| 131GBs (!!!) of animation data (trees blowing in the winds,
| waves moving on the shore, etc. etc.).
|
| That's simply never going to fit on a 8GB, 40GB, or even
| 80GB high-end GPU. The only way to work with that kind of
| data is to think about how to split it up, and have the CPU
| store lots of the data, while the GPU processes pieces of
| the data in parallel.
|
| https://www.render-blog.com/2020/10/03/gpu-motunui/
|
| Which has been done before, mind you. But it should be
| noted that the discussion point for GPU-scale compute runs
| into practical RAM-capacity constraints today, even on
| movie-scale problems from 5 years ago (Moana was released
| in 2016, and had to be rendered on hardware years older
| than 2016).
|
| Moana scene is here if you're curious:
| https://www.disneyanimation.com/resources/moana-island-
| scene...
|
| ----------
|
| But yes, if your data fits within the 8GBs GPU (or you can
| afford a 40GB or 80GB VRAM GPU and your data fits in that),
| doing everything on the GPU is absolutely an option.
| oivey wrote:
| We know that GPUs are really good at far more than ray
| tracing and matrix multiplication. Oversimplifying a bit,
| they're great at basically any massively parallel
| operation that has minimal branching and can fit in
| memory. Using a GPU to just add two images together
| probably isn't worth it, but many real world workflows
| allow you to operate solely on the GPU.
|
| If you're Disney, you can afford boxes with 10+ A100s
| with NVLink sharing the memory in a single 400+ GB pool.
| Unknown if that ends up being more economical than the
| equivalent CPU version, but it's important to understand
| in order to evaluate the future of GPUs.
| volta83 wrote:
| >That's simply never going to fit on a 8GB, 40GB, or even
| 80GB high-end GPU. The only way to work with that kind of
| data is to think about how to split it up, and have the
| CPU store lots of the data, while the GPU processes
| pieces of the data in parallel.
|
| There is always a problem size that does not fit into
| memory.
|
| Whether that memory is the GPU memory, or the CPU memory,
| doesn't really matter.
|
| We have been solving this problem for 60 years already.
| It isn't rocket science.
|
| ---
|
| The CPU doesn't have to do anything.
|
| The GPU can map a file stored on hard disk to VRAM
| memory, do random access into it, process chunks of it,
| write the results into network sockets and send them over
| the network, etc.
|
| The only thing the CPU has to do is launch a kernel:
|
|     int main(args...) {
|         main_kernel<<<...>>>(args...);
|         synchronize();
|         return 0;
|     }
|
| and this is a relatively accurate depiction of how the
| "main function" of the two latests apps I've worked on
| look like: the GPU does everything.
|
| ---
|
| > However, the famous "Moana" scene for Disney-level
| productions is a 93GB (!!!!) scene statically, with
| another 131GBs [...] That's simply never going to fit on
| a high-end GPU.
|
| LOL.
|
| V100 with 32GB and 8x per rack gave you 256GB of VRAM
| addressable from any GPU in the rack.
|
| A100 with 80GB and 16x per rack gives you 1.3 TB of VRAM
| addressable from any GPU in the rack.
|
| You can fit Moana in GPU VRAM in a now-old DGX-2.
|
| If you are willing to bet cash on Moana never fitting on
| a single GPU, I'd take you on that bet. Sounds like free
| money to me.
| dragontamer wrote:
| I'll post the link again: https://www.render-
| blog.com/2020/10/03/gpu-motunui/
|
| This person rendered the Moana scene on just 8GB of GPU VRAM.
| They do it by rendering 6.7GB chunks at a time on the GPU, with
| the CPU keeping the RAM-heavy "big picture" in mind. (EDITED
| paragraph. First wording of this paragraph was poor.)
|
| ------
|
| It's not that these problems "cannot be solved", it's that they
| "become grossly more complicated" under RAM / VRAM constraints.
| They're still solvable, but now you have to resort to strange
| techniques.
|
| ------
|
| With regards to a Ray-tracer, tracing the ray-of-light
| that's bouncing around could theoretically touch ANY of
| the 93GBs of static object data (which could have been
| shifted by any of the 131 GBs of animation data). That is
| to say: a ray that bounces off of any leaf on any tree
| could bounce in any direction, hitting potentially any
| other geometry in the scene.
|
| That pretty much forces you to keep the geometry in high-
| speed RAM, and not do an I/O cycle between each ray-
| bounce.
|
| As a rough reminder of the target performance: Raytracers
| aim at ~30 million to 30-billion ray-bounces per second,
| depending on movie-grade vs video-game optimized. Either
| way, that level of performance is really only ever going
| to be solved by keeping all of the geometry data in RAM.
|
| > A100 with 80GB and 16x per rack give you 1.3 TB of VRAM
| addressable from any GPU in the rack.
|
| That doesn't mean it makes sense to traverse a BVH-tree
| across a relatively high-latency NVLink connection off-
| chip. I know GPUs have decent latency hiding but...
| that's a lot of latency to hide.
|
| Again: your CPU-renderers can hit 10s of millions of rays
| per second. I'm not sure if you're gonna get something
| pragmatic by just dropping the entire geometry into
| distributed NVSwitch'd memory and hoping for the best.
|
| Honestly, that's where the 8GB CPU+GPU team becomes
| interesting to me. A methodology for clearly separating
| the geometry and splitting up which local compute-devices
| are responsible for handling which rays is going to scale
| better than a naive dump reliant on remote-connections
| pretending to be RAM.
|
| Video games hit Billions of rays/second. The promise of
| GPU-compute is on that order, and I just doubt that
| remote RAM accesses over NVLink will get you there.
|
| > If you are willing to bet cash on Moana never fitting
| on a single GPU, I'd take you up on that bet. Sounds like
| free money to me.
|
| The issue is not Moana (or other movies from 2016); the
| issue is the movies that will be made in 2022 and into
| the future, especially if they're near photorealistic
| like Marvel movies or Star Wars.
|
| ----------
|
| The other problem is: what's cheaper? A DGX-system could
| very well be faster than one CPU system. But would it be
| faster than a cluster of Ice-Lake Xeons with AVX512 each
| with the precise amount of RAM needed for the problem?
| (Ex: 512GBs in some hypothetical future movie?)
|
| A team probably would be better: CPUs have expandable
| RAM, and that's their biggest advantage; GPUs have fixed
| RAM. Slicing the problem up so that the ray-tracing pieces
| fit on GPUs, while the other, "bulkier" bits fit in CPU
| DDR4 (or DDR5), would probably be the most cost-efficient
| way of solving the ray-tracing problem.
|
| The GPU-Moana experiment showed that "collecting rays
| that bounce outside of RAM" is an efficient methodology:
| slice the scene into 8 GB chunks, process the rays that
| are within the resident chunk, and then collate the rays
| to find where they go, as in the sketch below.
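|
| A rough host-side sketch of that loop (all names here are
| invented for illustration; a real out-of-core renderer like
| the GPU Motunui write-up is far more involved):
|
|     // C++17
|     #include <cstddef>
|     #include <utility>
|     #include <vector>
|
|     struct Ray   { /* origin, direction, payload... */ };
|     struct Chunk { /* <= 8 GB of geometry in CPU RAM */ };
|
|     // placeholder for a cudaMemcpy of the chunk's geometry
|     void upload_to_gpu(const Chunk&) {}
|
|     // placeholder tracer: bounces rays inside the resident
|     // chunk, returning the rays that exit, tagged with the
|     // index of the chunk they enter next
|     std::vector<std::pair<std::size_t, Ray>>
|     trace(const Chunk&, std::vector<Ray>&) { return {}; }
|
|     void render(std::vector<Chunk>& chunks,
|                 std::vector<std::vector<Ray>>& queues) {
|         bool work_left = true;
|         while (work_left) {
|             work_left = false;
|             for (std::size_t c = 0; c < chunks.size(); ++c) {
|                 if (queues[c].empty()) continue;
|                 work_left = true;
|                 upload_to_gpu(chunks[c]);
|                 auto exits = trace(chunks[c], queues[c]);
|                 queues[c].clear();
|                 for (auto& [next, ray] : exits)  // collate
|                     queues[next].push_back(ray);
|             }
|         }
|     }
|
|     int main() {
|         std::vector<Chunk> chunks(28);  // ~28 x 8 GB chunks
|         std::vector<std::vector<Ray>> q(chunks.size());
|         q[0].push_back(Ray{});  // camera rays start here
|         render(chunks, q);
|     }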
| aviraldg wrote:
| > There is no need to "pass data to the GPU" because data
| is never on CPU memory, so there is nothing to pass from
| there. All network and file I/O goes directly to the GPU.
|
| This is very interesting - do you have a link that explains
| how it works / is implemented?
| dragontamer wrote:
| PS5, XBox Series X, and NVidia have a "GPU Direct I/O"
| feature.
|
| https://www.nvidia.com/en-us/geforce/news/rtx-io-gpu-
| acceler...
|
| https://www.amd.com/en/products/professional-
| graphics/radeon...
|
| The GPU itself can send PCIe 4.0 messages out. So why not
| have the GPU make I/O requests on its own behalf? It's a
| bit obscure, but this feature has been around for a
| number of years now. The idea is to remove the CPU and
| DDR4 from the loop entirely, because those just
| bottleneck / slow down the GPU.
|
| --------
|
| From an absolute performance perspective, it seems good.
| But CPUs are really good and standardized at accessing
| I/O in very efficient ways. I'm personally of the opinion
| that blocking and/or event driven I/O from the CPU (with
| the full benefit of threads / OS-level concepts) would be
| easier to think about than high-performance GPU-code.
|
| But still, it's a neat concept, and it seems like there's
| a big demand for it (see PS5 / XBox Series X).
| etaioinshrdlu wrote:
| The CPU is still acting as the PCIe controller though
| (right?), which kind of makes the CPU act like a network
| switch. PCIe is a point-to-point protocol kind of like
| ethernet too. Old-school PCI was a shared bus so devices
| might be able to directly talk to each other, but I don't
| think that was ever actually used.
| d110af5ccf wrote:
| My understanding matches yours, but it's worth noting
| that (IIUC) memory and PCIe are (last time I checked?) a
| separate I/O subsystem that just happens to reside within
| the same package as the CPU on modern chips. So P2PDMA
| avoids burning CPU cycles and RAM bandwidth shuffling
| data around that you never wanted to use on the CPU
| anyway. (Also see: https://lwn.net/Articles/767281/)
| dragontamer wrote:
| Take a look at the Radeon more closely.
|
| I think the Radeon + Premier Pro documentation makes it
| clear how it works:
| https://www.amd.com/system/files/documents/radeon-pro-
| ssg-pr...
|
| As you can see, the GPU is attached to the x16 slot, and
| the 4x NVMe SSDs are attached to the GPU. When the CPU
| wants to store data on the SSD, it communicates first
| with the GPU, which then passes the data through to the
| four SSDs.
|
| That's the simpler example.
|
| --------------
|
| In NVidia's case, they're building on top of GPUDirect
| Storage (https://developer.nvidia.com/blog/gpudirect-
| storage/), which seems to be based on enterprise
| technology where PCIe switches were used.
|
| NVidia's GPUs would command the PCIe switch to grab data,
| without having the PCIe switch send the data to the CPU
| (where it would most likely land in DDR4, or maybe L3 in
| an optimized situation).
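|
| On the software side, GPUDirect Storage is exposed through
| NVidia's cuFile API. A minimal sketch of a direct NVMe-to-
| VRAM read might look roughly like this (the filename and
| sizes are made up, and the exact calls/flags are my best
| reading of the docs rather than verified code):
|
|     #define _GNU_SOURCE  // for O_DIRECT
|     #include <fcntl.h>
|     #include <unistd.h>
|     #include <cuda_runtime.h>
|     #include <cufile.h>
|
|     int main() {
|         cuFileDriverOpen();
|
|         int fd = open("scene.bin", O_RDONLY | O_DIRECT);
|
|         CUfileDescr_t descr = {};
|         descr.handle.fd = fd;
|         descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
|         CUfileHandle_t fh;
|         cuFileHandleRegister(&fh, &descr);
|
|         size_t bytes = 1ull << 30;  // 1 GB chunk
|         void *devPtr = nullptr;
|         cudaMalloc(&devPtr, bytes);
|         cuFileBufRegister(devPtr, bytes, 0);
|
|         // DMA from NVMe into VRAM; no bounce through DDR4
|         cuFileRead(fh, devPtr, bytes, 0, 0);
|
|         cuFileBufDeregister(devPtr);
|         cuFileHandleDeregister(fh);
|         cudaFree(devPtr);
|         close(fd);
|         cuFileDriverClose();
|         return 0;
|     }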
| ASpaceCowboi wrote:
| Will this work on the latest Mac Pro? Probably not, right?
| wmf wrote:
| No, it's a different socket.
| robbyt wrote:
| Classic Intel
| wmf wrote:
| You can't increase memory and PCIe channels while keeping
| the same socket. This isn't a cash grab; it's actual
| progress.
| paulpan wrote:
| TLDR from Anandtech is that while this is a good improvement over
| previous gen, it still falls behind AMD (Epyc) and ARM (Altra)
| counterparts. What's somewhat alarming is that on a per-core
| comparison (28-core 205W designs), the performance increase can
| be a wash. Doesn't bode well for Intel as both their competitors
| are due for refreshes that will re-widen the gap.
|
| Key question will be how quickly Intel will shift to the next
| architecture, Sapphire Rapids. Will this release be like the
| consumer/desktop Rocket Lake? E.g. just a placeholder to
| essentially volume test the 10nm fabrication for datacenter.
| Probably at least a year out at this point since Ice Lake SP was
| supposed to be originally released in 2H2020.
| gsnedders wrote:
| > Key question will be how quickly Intel will shift to the next
| architecture, Sapphire Rapids. Will this release be like the
| consumer/desktop Rocket Lake? E.g. just a placeholder to
| essentially volume test the 10nm fabrication for datacenter.
| Probably at least a year out at this point since Ice Lake SP
| was supposed to be originally released in 2H2020.
|
| Alder Lake is meant to be a consumer part contemporary with
| Sapphire Rapids, which is server only. They're likely based
| on the same (performance) core, with Alder Lake
| additionally having low-power cores.
|
| Last I heard the expectation was still that these new parts
| would enter the market at the end of this year.
| CSSer wrote:
| Lately Intel seems to be getting a lot of flack here. As a
| layperson in the space who's pretty out of the loop (I built a
| home PC about a decade ago), could someone explain to me why that
| is? Is Intel really falling behind or dressing up metrics to
| mislead or something like that? I also partly ask because I feel
| that I only really superficially understand why Apple ditched/is
| ditching Intel, although I understand if that is a bit off-topic
| for the current article.
| s_dev wrote:
| >Is Intel really falling behind
|
| Intel is already behind AMD -- they have no product segment
| where they are absolutely superior. That means AMD is
| setting the market pace.
|
| On top of this Apple is switching to ARM designed CPUs. This
| also looks to be a vote of no confidence in Intel.
|
| The consensus seems to be that Intel -- who have their own
| fabs -- never really nailed anything under 14nm and are now
| being outcompeted.
| meepmorp wrote:
| Apple designs its own chips; it doesn't use ARM's designs.
| They do use the ARM ISA, tho.
| totalZero wrote:
| > Intel is already behind AMD -- they have no product segment
| where they are absolutely superior.
|
| There are some who would argue this claim, but I think it's
| at least a defensible one.
|
| Still, availability is an important factor that isn't
| captured by benchmarking. AMD has had CPU inventory trouble
| in the low-end laptop segment and high-end desktop segment
| alike.
|
| > The consensus seems to be that Intel who have their own
| fabs -- never really nailed anything under 14nm and are now
| being outcompeted.
|
| Intel has done well with 10nm laptop CPUs. They were just
| very late to the party. Desktop and server timelines have
| been quite a bit worse. I agree Intel did not nail 10nm, but
| they're definitely hanging in there. It's one process node at
| the cusp of transition to EUV, so some of the defeatism
| around Intel may be overzealous if we keep in mind that 7nm
| process development has been somewhat parallel to 10nm
| because of the difference in the lithographic technology.
| yoz-y wrote:
| Intel was unable to improve their fabrication process year
| after year, while promising to do so repeatedly. Now, they have
| been practically lapped twice. Apple has a somewhat specific
| use case, but their CPUs have significantly better performance
| per watt.
| matmatmatmat wrote:
| Some of the other comments above have touched on this, but I
| think there is also a bit of latent anti-Intel sentiment in
| many people's minds. Intel extracted a non-trivial price
| premium out of consumers for many, many years (both for chips
| and by forcing people to upgrade motherboards by changing CPU
| sockets) while AMD could only catch up to them for brief
| periods of time. People paid that price premium for one reason
| or another, but it doesn't mean they were thrilled about it.
|
| Many people, I'd say especially enthusiasts, were quite happy
| when AMD was able to compete on a performance/$ basis and then
| outright beat Intel.
|
| Of course, now the tables have turned and AMD is able to
| extract that price premium while Intel cut prices. Who knows
| how long this will last, but Intel is still the 800 lb gorilla
| in terms of capacity, engineering talent, and revenue. I don't
| think we've heard the last from them.
| blackoil wrote:
| A perfect storm. Intel had trouble with its 10nm/7nm
| process engineering, where TSMC succeeded. AMD had a
| resurgence with the Zen architecture, and
| ARM/Apple/TSMC/Samsung put hundreds of billions into
| catching up with x86 performance.
|
| Intel is still the biggest player in the game, because even
| though they are stuck at 14nm, AMD isn't able to
| manufacture enough to take a bigger chunk of the market.
| Apple won't sell into the PC/datacenter space, and the rest
| are still niche.
| ac29 wrote:
| > even though they are stuck at 14nm
|
| I think this isn't quite fair: their laptop 10nm chips have
| been shipping in volume since last year, and their server
| chips were released today, with 200k+ units already shipped
| (according to Anandtech). The only line left on 14nm is
| socketed desktop processors, which is a relatively small
| market compared to laptops and servers.
| colinmhayes wrote:
| Hacker News users generally aren't very interested in
| laptop processors. Sure, business-wise they're incredibly
| important, but as far as getting flack on Hacker News goes,
| laptop chips won't stop it. People here have been waiting
| for Intel 10nm on server and especially desktop for 6 years
| now.
| totalZero wrote:
| Unless you have scraped past posts to perform some kind
| of sentiment analysis, this is pure speculation intended
| to move the goalposts on GP.
| jimbob21 wrote:
| Yes, quite simply they have fallen behind while also promising
| things they have failed to deliver. As an example, their most
| recent flagship release is the 11900K, which has 2 fewer
| cores (now 8) than its predecessor, the 10-core 10900K, and
| almost no improvement to speak of otherwise (in some games
| it's ~1% faster). On the other hand, AMD's flagship, which
| to be fair is $150 more expensive, has 16 cores, very
| similar clock speeds, and is much more energy efficient
| (Intel and AMD calculate TDP differently). Overall, AMD is
| the better choice by a large margin, and Intel is getting
| flack because it rested on its laurels for the last
| decade(?) and hasn't done anything to improve itself.
|
| To put it in numbers alone, look at this benchmark. Flagship vs
| Flagship:
| https://www.cpubenchmark.net/compare/Intel-i9-11900K-vs-AMD-...
| formerly_proven wrote:
| Naturally the 11900K performs quite a bit worse than the
| 10900K in anything which uses all cores, but the remarkable
| thing about the 11900K is that it even performs worse in a
| bunch of game benchmarks, so as a product it genuinely
| doesn't make any sense.
| chx wrote:
| Absolutely. Intel has been stuck on the 14nm node for a very,
| very long time. 10nm CPUs were supposed to ship in 2015;
| they really only did in late 2019/2020. Meanwhile AMD
| caught up, and Intel has been doing the silliest
| shenanigans to appear competitive, like in 2018 when they
| demonstrated a 28-core 5GHz CPU and kinda forgot to mention
| the behind-the-scenes one-horsepower (~745W) industrial
| chiller keeping that beast running.
|
| Also, the first 10nm "Ice Lake" mobile CPUs were not really
| an improvement over the by-then many-times-refined 14nm
| "Comet Lake" chips. It's been a faecal pageant.
| mhh__ wrote:
| Intel's processes (i.e. turning files on a computer into chips)
| have been a complete disaster in recent years, to the point of
| basically _missing_ one of their key die shrinks entirely as
| far as I can tell.
|
| They are, in a certain sense, suffering from their own
| success, in that their competitors were basically
| nonexistent until Zen came about (and even then it wasn't
| until Zen 3 that Intel were truly knocked off their
| single-thread perch). This has led to them getting cagey,
| and a bit ridiculous in the sense that
| they are not only backporting new designs to old processes but
| also pumping them up to genuinely ridiculous power budgets.
| With Apple, AMD, and TSMC they have basically been caught with
| their trousers down by younger and leaner companies.
|
| Ultimately this is where Intel need good leadership. The MBA
| solution is to just give up and do something else (e.g. spin
| off the fabs), but I think they should have the confidence (as
| far as I can tell this is what they are doing) to rise to the
| technical challenge - they will probably never have a run like
| they did from Nehalem to shortly before now, but throwing in
| the towel means that the probability is zero.
|
| Intel have been in situations like this before, e.g. when
| Itanium was clearly doomed and AMD were doing well (amd64),
| they came back with new processors and basically ran away to
| the bank for years - AMD's server market share is still pitiful
| compared to Intel (10% at most), for example.
| Symmetry wrote:
| I don't want to counsel despair, but I'm not as sanguine as
| you either. Intel has had disastrous microarchitectures
| before -- Itanium, P4, and earlier ones -- but it's never
| had to worry about recovering from a _process_ disaster
| before. It might very well be able to, but I worry.
| mhh__ wrote:
| I'm not exactly optimistic either, I just think that the
| doomsaying is overblown (and sometimes looks like a tribal
| thing from Apple and AMD fans if I'm being honest - i.e.
| companies aren't your friends)
| ac29 wrote:
| > Intel's processes (i.e. turning files on a computer into
| chips) have been a complete disaster in recent years, to the
| point of basically missing one of their key die shrinks
| entirely as far as I can tell.
|
| Which one? I don't believe they missed a die shrink, it just
| took a _long_ time. Intel 14nm came out in 2014 with their
| Broadwell processors, and the next node, 10nm, came out in
| 2019 (technically 2018, but very few units shipped that
| year).
| totalZero wrote:
| Intel killed the longstanding "tick-tock" model in 2016
| because of failures with 10nm yield and the higher-than-
| expected costs of 14nm. Intel got too aggressive with the
| timeline of the die shrink, which led to them trying to do
| 10nm on DUV rather than waiting for EUV technology, whose
| light is about an order of magnitude shorter in wavelength
| than DUV's (13.5 nm vs 193 nm) and thus able to resolve
| today's nano-scale features without all the RETs needed
| for DUV.
|
| From the 2015 10-K [0]:
|
| _" We expect to lengthen the amount of time we will
| utilize our 14nm and our next-generation 10nm process
| technologies, further optimizing our products and process
| technologies while meeting the yearly market cadence for
| product introductions."_
|
| Spoiler alert: In the five years after shelving the tick-
| tock model, Intel also missed the yearly market cadence for
| product introductions.
|
| [0] https://www.sec.gov/Archives/edgar/data/50863/000005086
| 31600...
| mhh__ wrote:
| Cannon Lake I believe was basically cancelled.
| chx wrote:
| You wish. It was released because a bunch of Intel
| managers had bonuses tied to launching 10nm and so they
| released it.
| ineedasername wrote:
| They can't get their next-gen fabs (chip factories) into
| production. It's been a problem long enough that they're not
| even next-gen anymore: it's current-gen, about to be previous-
| gen.
|
| So what you're seeing isn't really anti-Intel, it's probably
| often more like bitter disappointment that they haven't done
| better. Though I'm sure there's a tiny bit of fanboy-ism for &
| against Intel.
|
| There's definitely some of that pro-AMD fanboy sentiment in the
| gaming community where people build their own rigs: AMD chips
| are massively cheaper than a comparable Intel chip.
| M277 wrote:
| Just a minor nitpick regarding your last paragraph, this is
| no longer the case. Intel is now significantly cheaper after
| they heavily cut prices across the board.
|
| For instance, you can now get an i7-10700K (which is roughly
| equivalent in single-thread and better in multi-thread
| performance) for less than an R5 5600X.
| robocat wrote:
| Nitpick: you are comparing price where you should be
| comparing performance per dollar, or are you cherry-picking
| the wrong comparison?
|
| My cherry-pick is where the AMD chip is 30% more expensive,
| but multi-threaded performance is 100% better in this
| example:
| https://www.cpubenchmark.net/compare/Intel-i9-11900K-vs-
| AMD-...
|
| Edit: picking individual processors to compare (especially
| low volume ones) is often not useful when talking about how
| well a company is competing in the market.
| makomk wrote:
| The comment you're replying to is "cherry-picking" the
| current-gen AMD processor which offers the best value for
| most users. You're cherry-picking an Intel processor
| which almost no-one has any reason to buy over other
| Intel options (the i9-11900K is much more expensive than
| the 11700K or 10700k for little extra performance; AMD
| had a few chips like this last gen, and they actually
| downplayed how much of a price increase this gen was by
| only comparing to those poor-value chips). One of these
| comparisons is a lot more useful than the other.
| MangoCoffee wrote:
| >So what you're seeing isn't really anti-Intel, it's probably
| often more like bitter disappointment that they haven't done
| better.
|
| It's back to everyone designing their own chips for their
| own products, without needing a fab, because of foundries
| like TSMC and Samsung.
| tyingq wrote:
| Lots of shade because they first missed the whole mobile
| market, then got beaten by AMD Zen after missing the chiplet
| concept and a successful current-gen process node, then were
| finally also overshadowed by Apple's M1. The M1 thing is
| interesting,
| because it likely means the next set of ARM Neoverse CPUs for
| servers, from Amazon and others, will be really impressive.
| Intel is behind on many fronts.
| mhh__ wrote:
| >likely means the next set of ARM Neoverse CPUs from Amazon
| and others will be really impressive
|
| M1 is proof that it can be done, however you can absolutely
| make a bad CPU for a good ISA so I wouldn't take it for
| granted.
| tyingq wrote:
| Might be a hint as to how much of M1's prowess is just the
| process size and how much is Apple.
| JohnJamesRambo wrote:
| https://jamesallworth.medium.com/intels-disruption-is-now-co...
|
| I think that summarizes it pretty well in that one graph.
___________________________________________________________________
(page generated 2021-04-06 23:00 UTC)