[HN Gopher] Computing Performance on the Horizon
___________________________________________________________________
Computing Performance on the Horizon
Author : mrry
Score : 133 points
Date : 2021-07-05 14:24 UTC (8 hours ago)
(HTM) web link (brendangregg.com)
(TXT) w3m dump (brendangregg.com)
| hackermeows wrote:
| Nice talk, covers a lot of bases. He predicts unikernels are
| dead, containers will keep growing, and lightweight VMs will
| take over after that.
| martinpw wrote:
| Slide 26 is interesting - arguing that cloud providers have an
| advantage for future CPU design since they can analyze so many
| real world customer workloads directly.
|
| In previous roles I have worked with CPU vendors who have been
| very keen on getting access to profiling data from our workloads
| for design optimization, and lamenting the fact that it was hard
| to get such data and they were often limited to synthetic
| benchmark workloads when tuning new designs.
|
| So this argument does sound like a valid one, and does imply AWS
| etc will have significant advantages in future designs.
| justicezyx wrote:
| It's already happening. Intel designs are now significantly
| influenced by Google's and Amazon's data center needs.
| handrous wrote:
| An interesting application of a now-familiar pattern: get lots
| of users, spy on them at massive scale, use those data to
| dominate some other market in a way that, at most, a single
| digit count of companies in the world could conceivably compete
| with (because none but they have anything like the data that
| you do). See also: everything to do with "AI".
| uluyol wrote:
| Amazon, Microsoft, and Google have massive applications and
| systems that they run, some of which they sell as a service.
| They have plenty of workload data without having to poke
| around user VMs.
| handrous wrote:
| You disagree with slide 26, then?
| thechao wrote:
| He sort of implies that "just better hardware" will peter out
| in the 2030s. I think he's calling it at least 50 years too
| soon. Here's why: (1) I think logic designers are still faffing
| about in terms of optimizing their designs; and (2), I think
| there's a lot of smart people thinking "incrementally" through
| what we'd consider paradigm shifts in HW implementation. That
| is, our fabs will just _naturally_ segue into 3D, spintronics,
| etc. I think he even mentions 3D circuits? One thing a lot of
| people miss is that layout of the design is materially
| different in 3D vs 2D: in 2D layout is NP-hard (complete)
| without efficient polynomial approximations; in 3D layout is
| low-order polynomial. The reduction in layout complexity will
| allow us to design things that are unthinkable right now, due
| to layout constraints & wire congestion.
| hobofan wrote:
| Aren't "big" 3D circuits unfeasible due to temperature
| limitations, though?
| a1369209993 wrote:
| > Aren't "big" 3D circuits unfeasible due to temperature
| limitations, though?
|
| Less so than you'd think: as long as you can keep leakage
| current under control, heat is only generated when the
| circuits are _active_, i.e. when bits are being flipped. So
| you can have arbitrarily large amounts of
| increasingly-rarely-used circuitry for various purposes.
|
| The naive-but-easy-to-understand example would be having
| separate, optimal circuitry for each machine instruction -
| the total number of gates is O(N*M), but the number of
| gates _activated_ on each clock cycle (and thus the amount
| of waste heat generated) is only O(N), so you can keep
| adding new, perfectly-hardware-accelerated instructions up
| to the limits of physical space. In practice it's more
| complicated and less of a non-issue, but it's not "big
| circuits are useful in proportion to their surface area,
| not their volume", it's more "big circuits are less useful
| than their volume alone would suggest". (You do hit
| physical limits like the Bekenstein bound[0] eventually,
| but that's far enough out that we mostly don't care yet.)
|
| 0: https://en.wikipedia.org/wiki/Bekenstein_bound
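The area-versus-heat argument above can be sketched with a toy model (the gate count per unit is a made-up, purely illustrative number):

```python
# Toy model of the argument above: adding more specialized
# instruction units grows total gate count (silicon area) linearly
# in the number of units, but only the one unit that fires on a
# given cycle dissipates switching power.
N = 10_000  # gates per specialized unit (illustrative assumption)

def total_gates(n_units: int) -> int:
    return N * n_units          # O(N*M) area

def active_gates_per_cycle() -> int:
    return N                    # O(N) heat, regardless of n_units

# Doubling the unit count doubles area but not per-cycle heat.
print(total_gates(100), active_gates_per_cycle())
```
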
| nradov wrote:
| How are we going to cool those 3D chips?
| thechao wrote:
| The commenter above is correct: just stop toggling HW. We
| already do this to a great extent; we're limited in the
| number of custom implementations because we can't wire
| everything together. 3D chips will have a lot more "dark"
| logic than current chips, but will be orders-of-magnitude
| more efficient (& thus powerful) due to deep customization.
|
| Also, remember the argument of my timeline is ~50-80 years
| out from now.
| nradov wrote:
| Dark logic isn't doing any useful work so what's the
| point? Sure you can include specialized circuitry for a
| bunch of rare cases, but that won't lift overall system
| performance much and will kill manufacturing yields.
| yjftsjthsd-h wrote:
| > in 2D layout is NP-hard (complete) without efficient
| polynomial approximations; in 3D layout is low-order
| polynomial
|
| Any chance you could explain to a novice why 3D is easier? To
| my naive intuition, it would have seemed like the more room
| to maneuver is offset by having more stuff to route.
| thechao wrote:
| Try laying out a square with all the vertices connected _in
| a plane_: it's not possible. You must move one of the wires
| "up" a layer. Which wire? Great layout minimizes layer
| transitions while also bunching together related HW blocks.
| The NP-completeness proof is related to work Knuth's student
| (Plass?) did on laying out images in TeX.
| jeffbee wrote:
| A delightful set of slides and references that tickles many of my
| pet topics. In particular, one I'd love to hear more about is why
| so many deployments are still choosing 2-socket servers by
| default when managing them is such a pain in the neck and the
| performance when you do it badly is so poor. Live the life of the
| future, today: choose single sockets!
| syoc wrote:
| Rack space can be quite expensive. Sometimes you need a lot of
| computing power in one or two rack units.
|
| Would be interested in what the management pains are. I agree
| that 2 socket machines require more thought in a lot of
| scenarios, especially IO heavy workloads.
| jeffbee wrote:
| The OpenCompute "Delta Lake" machine mentioned in the article
| occupies only one third of 1RU and peaks at 400W. You will
| certainly be power/cooling limited, rather than volume
| limited, with that kind of density.
| zaroth wrote:
| In my very limited experience it seems like space is much
| less an issue than power density.
|
| You can fit far more kW/U than the datacenter can possibly
| cool.
|
| In the commodity space that I rent, I ran out of power before
| filling even half the rack. I'm sure higher power/cooling
| density is possible to obtain, but I would think you're
| primarily paying for that versus square footage?
| zozbot234 wrote:
| What do you need that power density for? It's a rack, not a
| supercomputer. (I sure hope it's not "mining coins" or
| anything like that.)
| soulbadguy wrote:
| In a world where most workloads are containerized, and where
| each container can be pinned to a NUMA region, does it really
| matter?
| wmf wrote:
| Does any container runtime/orchestrator perform this
| optimization yet? Why wait?
| syoc wrote:
| The kernel scheduler is NUMA-aware and will localize
| workloads. Threads will mostly have their RAM on the sticks
| local to their node, and the core a thread is delegated to
| is also more likely to be local to the disk or NIC being
| used for IO.
|
| This is at least my experience, though I am no expert.
| solarkennedy wrote:
| Titus (Netflix's container orchestrator that I work on)
| does this via: https://github.com/Netflix-Skunkworks/titus-
| isolate
| jeffbee wrote:
| k8s, by default, is oblivious to NUMA topology. You have to
| enable unreleased features and configure them correctly,
| which is the unwanted complexity to which I referred earlier.
| Simply aligning your containers to NUMA domains does not
| solve the problem that your arriving network frames or your
| NVMe completion queues can still be on the wrong domain.
| Isn't it simpler to just have 1 socket and not need to care?
| The number of cores available on a single socket system is
| pretty high these days, and in general the 1S parts are
| cheaper and faster.
| eloff wrote:
| Yeah, it makes a lot of sense to go with single-socket
| servers unless you can't scale horizontally (e.g. a database
| server). Why deal with the complexity when you can just
| sidestep it?
| dragontamer wrote:
| Why would you switch from a 100 GBps (800 gigabits per
| second) NUMA fabric to a 10 Gbps Ethernet fabric?
|
| If you are scaling horizontally, NUMA is a superior fabric
| to Ethernet or InfiniBand (100 Gbps).
|
| Horizontal scaling seems to favor NUMA: 1000 chips over
| Ethernet is less efficient than 500 dual-socket nodes over
| Ethernet. Anything you can do over Ethernet seems easier
| and cheaper over NUMA instead.
| eloff wrote:
| I'm talking mostly about scaling things like app servers
| where they might not need any communication.
|
| But in general, if you can't scale horizontally at 10 Gbps,
| you're in for a world of hurt. NUMA gets you to 8x scale at
| best, on very expensive, very exotic hardware. And then you
| hit the wall.
| dragontamer wrote:
| I'm mostly talking about 2-socket servers, which are IIRC
| more common than even single-socket servers.
|
| Dual socket is cheap, easy, and common. If only to recycle
| fans, power supplies, and racks, it seems useful.
| wmf wrote:
| This is correct if your software is NUMA-optimized (or if
| auto-NUMA works well for you) but if it isn't you can end
| up with slowdowns.
| dragontamer wrote:
| Surely that can be fixed with just a well-placed numactl
| command to set node affinity and CPU affinity.
|
| The root article is discussing rewriting code to fit on
| FPGAs. If NUMA is too complex, then... I dunno. The FPGA
| argument seems dead on arrival.
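In the same spirit as the numactl suggestion above, affinity can also be set from inside a process. A minimal, Linux-only sketch using Python's stdlib (note that numactl additionally binds memory allocation to a node, which has no stdlib equivalent; the "first half of the CPUs" stand-in for node 0 is an assumption, not real topology discovery):

```python
import os

# Pin the current process to a subset of its allowed CPUs, roughly
# what `numactl --physcpubind` does. The CPUs belonging to NUMA
# node 0 would normally be read from
# /sys/devices/system/node/node0/cpulist; here we simply take the
# first half of the available CPUs as a stand-in.
all_cpus = sorted(os.sched_getaffinity(0))
node_cpus = set(all_cpus[: max(1, len(all_cpus) // 2)])

os.sched_setaffinity(0, node_cpus)  # restrict scheduling to those CPUs
print(sorted(os.sched_getaffinity(0)))
```
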
| kortilla wrote:
| Your scaling architecture sucks if it depends on that kind
| of throughput. If you need that, you've only can-kicked
| your way to more capacity without a real scaling fix.
| dragontamer wrote:
| Depend on? Heavens no.
|
| Dual socket has numerous advantages in density and rack
| space. The fact that performance is better is pretty much
| icing on the cake.
|
| It's easier to manage 500 dual socket servers than 1000
| single socket servers. Less equipment, higher utilization
| of parts, etc. Etc.
|
| To suggest dual socket NUMA is going away is... just very
| unlikely to me. I don't see what the benefits would be at
| all. Not just performance, but also routine maintenance
| issues (power, Ethernet, local storage, etc etc)
| dragontamer wrote:
| FPGAs from Xilinx are very complicated. They are no longer
| homogeneous 4-LUTs or 6-LUTs with dedicated multipliers here
| and there.
|
| Today's FPGAs are VLIW minicores capable of SIMD execution
| with custom routing and some LUTs thrown around. They've
| stepped towards a GPU-style architecture while retaining the
| custom logic portions.
|
| FPGAs remain so difficult to use that I find it unlikely
| they'd be mainstream in any capacity. GPUs seem like the
| easier way to get access to HBM + heavy compute, but either
| way the HBM future is imminent.
|
| ------------
|
| GPUs have big questions about ease of use and practicality
| as it is, even with widespread acceptance of their compute
| potential. FPGAs are much less well known; it's hard for me
| to imagine a mainstream future for them.
|
| Since memory, not compute performance, remains the biggest
| bottleneck, I bet that the easiest-to-use accelerator with
| mass production and cheap access to the highest-speed HBM is
| going to be the winner. GPUs are the current frontrunner,
| but the Fujitsu ARM CPU has easy access to HBM and could be
| a wildcard.
|
| POWER10 will be using high performance GDDR6. Not quite HBM, but
| it signals that IBM is also concerned with the memory bandwidth
| problem in the near future.
|
| CPUs could very well switch to HBM in some scenarios.
|
| ------------
|
| If I were to guess the future: I think that AMD and NVidia
| have proven that today's systems need high-speed routers to
| scale practically.
|
| AMD has their IO die on EPYC. NVidia has NVLink and NVSwitch.
| That seems to be how to get more dies / sockets without
| additional NUMA hops.
|
| More efficient networks of chips with explicit switching /
| routing topologies are the only way to scale. The exact form
| of this network is still a mystery, but that's my big bet
| for the future.
|
| HBM is probably the future for high performance. DDR5 for cheaper
| bulk RAM but HBM on high performance CPUs / GPUs / FPGAs is going
| to be key.
|
| ---------
|
| The insight into RAM bottlenecks is interesting but seems to
| be a point in favor of SMT. If your core is 50% waiting on
| RAM, then SMT into another thread to perform work while
| waiting on RAM.
| Robotbeat wrote:
| FPGAs are hard to work with in part because the tools are
| extremely proprietary. But that is starting to change. Open
| source FPGA tools are becoming more common and more powerful.
| volta83 wrote:
| > If your core is 50% waiting on RAM, then SMT into another
| thread to perform work while waiting on RAM.
|
| If your core is 50% waiting on RAM, then SMT into another
| thread, and that other thread will want some memory to work
| on, so it will also wait on RAM. On top of that, this second
| thread now puts extra pressure on the memory subsystem,
| might cause cache evictions for the other thread, etc. etc.
|
| The moment that you include the memory subsystem into the SMT
| picture, SMT goes from a "no brainer; waiting on memory? do
| other work" to a "uhhh... i don't know if this makes things
| better or worse".
| dragontamer wrote:
| Not quite.
|
| DDR4 and DDR5 have 50ns (single socket) to 150ns (dual
| socket) latency.
|
| For a 3GHz processor, that's 150 to 450 cycles.
|
| On any latency-bound problem, SMT helps. However, what you
| say is true of bandwidth-bound problems. Given the sheer
| amount of pointer hopping that happens in typical OOP code
| these days (or Python / JavaScript), I expect SMT to be a
| big help to typical applications.
|
| DDR5 will double bandwidth in the near future. But that's not
| enough: HBM and GDDR6 have a possible future because you can
| only solve the bandwidth problem with more hardware. No
| tricks like SMT can help.
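The cycle counts above follow directly from latency times clock rate; a quick sanity check:

```python
# A cache miss to DRAM stalls a core for roughly
# latency (ns) * clock (GHz) cycles, since 1 GHz = 1 cycle/ns.
def stall_cycles(latency_ns: float, clock_ghz: float) -> float:
    return latency_ns * clock_ghz

print(stall_cycles(50, 3.0))   # local DRAM:  150 cycles
print(stall_cycles(150, 3.0))  # remote node: 450 cycles
```
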
| zozbot234 wrote:
| Memory bandwidth is just barely keeping up with core counts,
| frequencies, and IPC. Bandwidth available per core is still
| going to drop, so newer development workflows that optimize
| for this bottleneck are going to be very relevant.
| dragontamer wrote:
| Memory bandwidth does improve at least.
|
| Latency hasn't improved for the last 30 years. Tricks
| like SMT which can help mitigate the latency issue seem
| like the way forward.
| rbanffy wrote:
| I have enormous respect for Brendan Gregg, but this "one
| socket ought to be enough for anyone" is something I've seen
| too many people get burned by.
|
| I mean, it should, but who knows what the next version of Slack
| will need...
| rektide wrote:
| Some random contemporary musings, that touch some of these
| topics: I really hope we have a rad eBPF based QUIC/HTTP3 front-
| end/reverse-proxy router in the next 5 years.
|
| QUIC is so exciting and I just want it to be both fast & a
| supremely flexible way for a connection from a client to talk to
| a host of backend services. We'll definitely see some classic
| userland based approaches emerge, but gee, really hungry for
|
| For context, I was at the park two days ago, thinking about
| replacing a Node timesync[1] over websockets thing with a NTP-
| over-WebTransport (QUIC) implementation. There weren't any H3
| front-ends (which I kind of need because I just have some random
| colo & VPS boxes), and even if there were I was worried about
| adding latency (which a BPF based solution would significantly
| reduce, while letting me re-use ports 80/443).
|
| Especially as we see more extreme-throughput/HBM memory systems
| arrive, it's just so neat that we have a multiplexed transport
| protocol. Figuring out how to use that connection (a
| semi-stateless "connection", because QUIC is awesome) to
| talk to an array of services is an ultra-interesting
| challenge, and BPF sure seems like the go-to tech for
| routing & managing packets in the world today. QUIC, with
| its multiplexing, adds the complexity that it is now
| subpackets that we want to route. I hope we can find a way
| to keep a lot of that processing in the kernel.
|
| [1] https://www.npmjs.com/package/timesync
| ksec wrote:
| >for storage including new uses for 3D Xpoint as a 3D NAND
| accelerator;
|
| 3D XPoint's future is not entirely certain. Intel, with its
| new CEO, has remained rather quiet on the subject. Micron is
| pulling the plug on it and sold the fab to Texas
| Instruments. The problem is there isn't a clear path forward
| for the technology. It made some sense when NAND and DRAM
| prices were high in 2016 - 2019; once they dropped to a
| normal level, and with newer DDR5 and faster SLC NAND or
| Z-NAND offering lower latency, XPoint's cost benefit becomes
| unclear. I guess we will know once Intel's Optane P5800X [1]
| is out and reviewed. It is quite a beast.
|
| >Multi-Socket is Doomed
|
| Are there really no use-case where 128 Core+ with NUMA offer some
| advantage?
|
| >Slower Rotational
|
| Seagate [2] is actually working on dual-actuator HDDs; think
| of it as something like internal RAID 0. The rationale is
| that as HDDs get bigger, the time to fill those drives
| increases as well.
|
| >ARM on Cloud
|
| Marvell partly confirms that all hyperscalers intend to
| build their own ARM CPUs. But Google just announced their
| Tau instances [3], effectively cutting their cost / perf by
| 50%, where each vCPU is an entire physical CPU core rather
| than an x86 thread.
|
| Not much mention on GPGPU.
|
| [1]
| https://www.intel.com/content/www/us/en/products/docs/memory...
|
| [2] https://www.anandtech.com/show/16544/seagates-
| roadmap-120-tb...
|
| [3] https://cloud.google.com/blog/products/compute/google-
| cloud-...
| infogulch wrote:
| > Are there really no use-case where 128 Core+ with NUMA offer
| some advantage?
|
| Are there any use cases where 128+ core single socket wouldn't
| be preferred to a 128+ core multiple socket design that is
| burdened by NUMA?
|
| AMD has been showing us that integrating the interconnects into
| the CPU package directly and letting it handle all the issues
| is a better design.
| dragontamer wrote:
| When a hypothetical 128-core single socket comes out, will
| there be no workload that prefers to use a 2x128-core dual
| socket instead?
|
| AMD CPUs remain largely dual-socket compatible. Today's
| 64-core EPYCs can be dual-socketed into 2x64-core beasts.
|
| It just seems silly to me that if you're building, say, 200
| computers in 10 racks (20 computers per 40U rack), you'd
| prefer single socket over dual socket. If you're scaling up
| and out so much, what exactly is the problem with dual
| socket? It's not cost: dual socket remains cost-effective on
| a per-core basis over single socket, and dual socket cuts
| the number of computers you need to work with in half. Etc.
| etc.
___________________________________________________________________
(page generated 2021-07-05 23:00 UTC)