[HN Gopher] Arm-Based 128-Core Ampere CPUs Cost a Fraction of x8...
       ___________________________________________________________________
        
       Arm-Based 128-Core Ampere CPUs Cost a Fraction of x86 Price
        
       Author : rbanffy
       Score  : 176 points
       Date   : 2022-01-03 11:32 UTC (1 day ago)
        
 (HTM) web link (www.tomshardware.com)
 (TXT) w3m dump (www.tomshardware.com)
        
       | freemint wrote:
       | During an ARM HPC User Group meetup I got to play with a
       | 160-core machine from Ampere and saw really, really impressive
       | performance (not to mention performance per dollar) for SAT
       | solving. I pitched buying one to my local HPC cluster.
        
         | cameron_b wrote:
         | Where does one find information on ARM HPC User groups and
         | local HPC Clusters?
        
           | freemint wrote:
           | For the ARM HPC user group there is a mailing list
           | https://arm-hpc.groups.io/g/ahug and a Twitter account
           | https://twitter.com/armhpcusergroup . As for the local HPC
           | Cluster that depends on your geographical position. Mine is
           | at the university I am studying at.
        
       | stillicidious wrote:
       | Totally meaningless numbers without at least some normalized
       | performance/watt to compare against.
        
       | jakuboboza wrote:
       | I think Ampere doesn't support hyperthreading. That means these
       | 128 cores are comparable to 64 cores on EPYC/Xeon. The L2/L3
       | cache also matters, and of course the architecture. Arm still
       | has low adoption because code has to be recompiled. While "web"
       | targets are easy, things like financial software that benefit
       | from AVX-512 instructions might be harder because Neoverse
       | doesn't have those instructions.
       | 
       | On the other side, massively concurrent workloads might benefit
       | much more from these new 128/256-core Arm chips. So for sure
       | there is room for this type of solution, and I'm happy we are
       | adding options on top of x86/amd64.
       | 
       | Last but not least, x86/amd64 (unless something changed) is
       | locked to AMD and Intel, so regions like the EU can't rely on
       | it if they want to be independent in terms of silicon
       | production/design. So Arm and maybe RISC-V are the only real
       | paths right now.
        
         | monocasa wrote:
         | Back of the napkin, a hyperthread is only worth ~30% of a
         | full core.
        
           | FpUser wrote:
           | Many years ago, enabling hyperthreading on my clients'
           | single-core PCs doubled performance. It was a
           | multi-threaded multimedia-type application.
        
           | berkut wrote:
           | Can be closer to 40% in my experience with Intel when things
           | are memory/cache bound (i.e. one thread is regularly stalled
           | waiting for data), and I've seen above 55% in some cases with
           | AMD's Zen microarch's SMT (Zen3).
        
           | floatboth wrote:
           | And security people hate them because side-channels.
        
           | water8 wrote:
           | Hyperthreading works by putting multiple arithmetic logic
           | units, such as an adder, bit shifter, or mux, to work in
           | parallel during clock cycles where the executing code can
           | utilize more ALUs. If the executing code was simply using
           | the same ALU over and over again, such as repetitive
           | summations, no additional performance gain would be
           | possible from hyperthreading. A hyperthread doesn't
           | necessarily have the same priority on the ALUs that a
           | normal core thread does.
        
             | cesarb wrote:
             | You seem to be confusing SMT (the formal term for
             | hyperthreading) with superscalar. It's superscalar (and
             | out-of-order which builds on it) which works by using
             | multiple ALUs in parallel when possible. SMT then builds on
             | that by allowing the unused ALUs to be used by other
             | thread(s). If a single thread was simply using the same ALU
             | over and over again, you'd have a large additional
             | performance gain from SMT, since all the other ALUs would
             | be available for the second thread.
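The superscalar-vs-SMT distinction above can be illustrated with a toy issue-width model (an editorial sketch only, with assumed parameters; real pipelines are far more complex):

```python
# Toy model (not a real CPU): a core with 4 ALU "ports" issues up to 4
# independent micro-ops per cycle (superscalar). A fully dependent
# chain can only issue 1 op/cycle, leaving 3 ports idle -- which a
# second SMT thread with more independent work can then fill.

def cycles_needed(ops_a, ops_b, ports=4, ilp_a=1, ilp_b=4):
    """Cycles to retire both streams; ilp_* = independent ops each
    thread can offer per cycle (1 = fully serial dependency chain)."""
    cycles = 0
    while ops_a > 0 or ops_b > 0:
        issued_a = min(ilp_a, ops_a, ports)            # thread A first
        issued_b = min(ilp_b, ops_b, ports - issued_a)  # B fills idle ports
        ops_a -= issued_a
        ops_b -= issued_b
        cycles += 1
    return cycles

# A serial thread alone: 1000 ops at 1 op/cycle -> 1000 cycles.
# Adding an SMT sibling with plenty of ILP costs no extra cycles at
# all in this model: it rides along on the idle ports.
print(cycles_needed(1000, 0), cycles_needed(1000, 1000))
```

In this toy model the second thread finishes "for free", which is the extreme version of cesarb's point: the less ILP one thread offers, the more SMT has to gain.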
        
           | Symmetry wrote:
           | It really depends on the core. An SMT8 POWER8 core might see
           | an 80% throughput increase going from 1 to 2 threads but
           | those have been accused of being two SMT4 cores standing on
           | each other's shoulders wearing a trenchcoat. 30% for a
           | mainstream x86 consumer core is a good average expected value
           | but it'll tend to vary a lot by workload with lower IPC tasks
           | tending to get better results.
        
           | magicalhippo wrote:
           | What a hyperthread brings to the table has huge
           | variability. It can even be a net negative in certain
           | cases. For the software ray-tracer I worked on, it brought
           | anywhere from 7% to 80% of a single core.
        
         | thrwyoilarticle wrote:
         | My impression of Arm-world SMT is that it was added under
         | duress because people kept asking for it, despite the word of
         | God being that it was better just to add additional cores (I
         | wonder if their license structure influences that claim?).
         | Today SMT Arm cores are still very much the minority, so
         | either the fashion sustained itself or their
         | customers/implementors agree that more cores are better than
         | fewer cores with SMT.
         | 
         | There are no AVX-512 instructions. But that's the x86 branding
         | of the vector instructions that you can only implement with the
         | right x86 licence. So it's tautological. Arm can have vector
         | instructions and languages are even beginning to make portable
         | interfaces for vectors on multiple architectures.
        
       | parsimo2010 wrote:
       | I can buy an 8 core AMD Opteron for $18 with shipping on eBay, so
       | actually I can get 128 cores of x86 for $288. Checkmate lazy
       | journalists.
       | 
       | Proof: https://www.ebay.com/itm/234113546803 and that's just the
       | first result when I searched AMD Opteron. I didn't even bother to
       | see whether I could do better with other old CPUs.
        
         | iosjunkie wrote:
         | Now source 5,000 CPUs at that price for your new data center.
        
           | BatteryMountain wrote:
           | And it eats a cool 95W per chip.
        
             | rbanffy wrote:
             | Not counting the 16 PSUs.
             | 
             | There are use cases for having 16 machines instead of a
             | single one 16 times more powerful, but lower TCO isn't
             | one of them.
        
       | [deleted]
        
       | kellengreen wrote:
       | This article just makes me miss the Tom's Hardware of old.
        
       | JohnJamesRambo wrote:
       | I can't believe there are CPUs that cost $8k. Does it really pay
       | off price/performance wise or is it just for people that like hot
       | rods?
        
         | can16358p wrote:
         | Many of them are for datacenters/cloud computing or rendering
         | services/hypervisors. For them yes they do pay the price for
         | sure, given that the CPU time is utilized efficiently.
        
         | PragmaticPulp wrote:
         | It's not for consumers. It's for special-purpose servers and
         | workstations.
         | 
         | Many problems don't scale well to more nodes. In many cases,
         | it's worth spending a lot on a single, very expensive server to
         | avoid having to rewrite the software to be distributed across
         | multiple machines.
        
         | chrisseaton wrote:
         | Are you disbelieving because you think most people would use
         | multiple cheaper CPUs instead? There are some workloads where
         | inter-core communication is the bottleneck, so people pay a
         | premium to get a lot of cores in one place.
        
         | wongarsu wrote:
         | The first "bestselling" consumer CPU I checked costs ~$50 per
         | core (a six-core AMD Ryzen 5600X). Scaled to 128 cores, that
         | Ryzen would cost $6400. Considering how many motherboards,
         | PSUs, fans, etc. you save by having one computer with a
         | 128-core CPU compared to ~21 computers with a 6-core CPU
         | each, the price premium pays for itself (for workloads where
         | this CPU performs at least as well as that many Ryzen
         | 5600Xs).
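The per-core arithmetic in this comment can be sketched directly (the ~$300 retail price for a Ryzen 5 5600X is an assumption for illustration; note 128/6 works out to about 21 machines):

```python
# Back-of-the-envelope: scale a consumer CPU's per-core price to 128
# cores and count how many six-core boxes you'd need instead.
ryzen_price, ryzen_cores = 300, 6   # Ryzen 5 5600X: ~$50/core (assumed)
ampere_cores = 128

per_core = ryzen_price / ryzen_cores           # ~$50/core
scaled_cost = per_core * ampere_cores          # ~$6400 in CPUs alone
boxes_needed = ampere_cores / ryzen_cores      # ~21 six-core machines

print(f"${per_core:.0f}/core, ${scaled_cost:.0f} for {ampere_cores} "
      f"cores, {boxes_needed:.1f} machines")
```

The CPU cost alone roughly matches the Ampere's list price, before counting the motherboards, PSUs, and fans for the extra boxes.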
        
           | JohnJamesRambo wrote:
             | Are RAM, heat, and network I/O not issues with 128 cores
             | in one box?
        
             | wongarsu wrote:
             | It can be a problem, or unlock efficiencies. 8x1Gbit vs
             | 1x10Gbit LAN costs about the same, but having the 10Gbit
             | LAN in one box gives you better burst performance than
             | scattering your I/O across 8 boxes. Having all the RAM in
             | one place might reduce how much RAM you need in total. But
             | if you need huge amounts of RAM or network I/O, scattering
             | it across many boxes is probably easier and cheaper.
             | 
             | The heat from this CPU is fairly reasonable (about an
             | RTX 3070, to stay with consumer comparisons). It pays for
             | that with low clock speed, so your workload has to
             | somehow benefit from having all the cores in one place.
        
       | throwaway4good wrote:
       | Actually it sounds like the arm alternative is a lot more for any
       | practical application.
        
       | jabej wrote:
       | Wake me up when those cores can run x86 code.
        
         | rbanffy wrote:
         | Why? Can't you recompile your code?
        
         | selfhoster11 wrote:
         | Emulation is possible, and now even easy. Windows on ARM
         | (WoA) has built-in x86 emulation, and Linux can do something
         | similar with QEMU user-mode emulation (including stubbing out
         | calls to dynamically linked libraries with calls to native
         | ones).
         | 
         | This is in addition to the fact that a lot of server-based code
         | runs as an interpreted/JIT language, or a bytecode VM like Java
         | or .NET.
        
         | montalbano wrote:
         | You should be awake already then:
         | 
         | https://docs.microsoft.com/en-us/windows/uwp/porting/apps-on...
        
         | sidkshatriya wrote:
         | I can understand requiring consumer systems to offer x86
         | emulation like the Mac M1s do. This is because consumers tend
         | to be less sophisticated, expect their old applications to
         | run like before, etc.
         | 
         | This product is squarely positioned as a server product. I
         | don't see the need to offer performant x86 emulation like the
         | Mac M1 does (you could run QEMU if you wanted, though that's
         | going to be slowish).
         | 
         | You have all the major programming languages, HTTP servers,
         | databases, and OSes running on ARM Linux. If you're buying an
         | Ampere server, it would sure be a waste to expect it to run
         | x86 in emulation!
        
         | samus wrote:
         | Hardware with x86 compatibility is only relevant if for some
         | reason you are stuck with x86-only software somewhere in your
         | stack, or you have specialized hardware that you can't run on
         | another platform for some reason. But I'd argue most new
         | applications aren't dependent on x86: most modern and
         | relevant programming languages have production-grade
         | compilers and runtime environments for ARM, and increasingly
         | also for RISC-V.
        
       | [deleted]
        
       | nextweek2 wrote:
       | In a past life when I would specify servers, the first
       | requirement was ECC RAM.
       | 
       | Is that a thing with ARM or are these servers used for something
       | where you don't care about data assurance (like gaming)?
        
       | 8K832d7tNmiQ wrote:
       | Do enterprises even care about price per core?
       | 
       | > Ampere positions its Altra and Altra Max processors with up to
       | 128 core largely for hyperscale providers of cloud services.
       | 
       | > That leaves the company with a fairly limited number of
       | potential customers.
       | 
       | Even the article itself admits that this is a niche product.
        
       | childintime wrote:
       | Does anyone have an idea how these cores compare to, say,
       | SiFive's latest, the P650?
       | 
       | The P650 has only 16 cores, but it looks like it should be able
       | to compete with the $800 32-core Ampere running at 1.7GHz.
        
         | floatboth wrote:
         | The P650 core is compared to Cortex-A77 (while Neoverse-N1 is
         | an offshoot of A76), but N1 is from 2019. P650 is still in the
         | "dev kit somewhere in 2022" stage, while the new Neoverse-N2 is
         | in the "AWS is letting _EC2 customers_ test drive it already "
         | stage: https://aws.amazon.com/blogs/aws/join-the-preview-
         | amazon-ec2...
        
         | Symmetry wrote:
         | I haven't seen any data on absolute P650 performance. I've seen
         | some on SPECInt2006/GHz which is interesting but not useful for
         | performance comparisons without the clock speed.
        
       | zackmorris wrote:
       | This is great! I've waited for chips approaching something like
       | 256 to 1024 cores since the 90s. These CPUs still leave a lot to
       | be desired, but I sense that there's been a shift towards better
       | uses of transistor count than just single-threaded performance.
       | 
       | These are still roughly 10 times more expensive than they
       | should be because of their memory architecture. I'd vote to
       | drop the idea of buses to external memories and switch to local
       | memories, then have the cores self-organize by using web
       | metaphors like content-addressable memory (CAM) to handle
       | caching. Basically, get rid of all of the cache coherence
       | hardware and treat each core-memory pair as its own computer.
       | The hardware budget that isn't scalable could go to
       | hardware-accelerated hashing for the CAM.
       | 
       | And a somewhat controversial opinion - I'd probably drop 64 bit
       | also and either emulate 64 bit math on an 8/16/32 bit processor,
       | or switch to arbitrary precision. That's because the number of
       | cores is scalable and quickly dwarfs bits calculated per cycle.
       | So we'd take say a 10% performance hit for a 100% increase in the
       | number of cores, something like that. This would probably need to
       | be tested in simulation to know where the threshold is, maybe 64
       | cores or something. Similar arguments could be used for clock
       | speed and bus width, etc.
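The content-addressable idea in this comment can be sketched in software (a purely illustrative toy; real hardware CAMs and cache hierarchies work very differently, and all names here are invented for the example):

```python
import hashlib

# Toy sketch: blocks are named by a hash of their contents and routed
# to a "home" core by that hash, so identical content always lives in
# the same place and no coherence traffic is needed between cores.

class CoreLocalStore:
    """Private memory owned by one core: digest -> block."""
    def __init__(self):
        self.blocks = {}

def store(cores, data: bytes):
    """Place a block on the core chosen by its content hash."""
    key = hashlib.sha256(data).hexdigest()
    owner = int(key[:8], 16) % len(cores)   # hash picks the home core
    cores[owner].blocks[key] = data         # duplicate puts are no-ops
    return key, owner

def load(cores, key: str):
    owner = int(key[:8], 16) % len(cores)   # same hash, same home
    return cores[owner].blocks.get(key)

cores = [CoreLocalStore() for _ in range(8)]
key, owner = store(cores, b"some immutable block")
assert load(cores, key) == b"some immutable block"
```

One property that falls out of content addressing: storing the same data twice lands on the same core under the same key, so deduplication is automatic. Mutable data is the hard part this sketch ignores.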
        
         | samus wrote:
         | Number of cores is just another metric to optimize for. What
         | counts in the end is whether it can efficiently and quickly
         | deal with the load it is expected to handle.
         | 
         | Many cores are great for mostly independent tasks, but
         | performance will suffer as soon as communication is required.
         | Making chip architectures more distributed seems to be the
         | state of the art at the moment, but this doesn't mean we will
         | suddenly be able to escape Amdahl's Law. To be specific, for
         | inherently serial applications where we are absolutely
         | interested in getting the result ASAP, single-thread
         | performance remains crucial.
        
         | rbanffy wrote:
         | > I'd probably drop 64 bit also
         | 
         | There are a lot of other nice things that would have to go
         | along with that. A 32-bit linear address space is not enough
         | for a lot of the things we do today, especially not in
         | servers.
         | 
         | Having some memory dedicated to a given core is clever,
         | however, provided we have the required affinity settings to
         | match (moving a task to a different core would imply copying
         | the scratchpad to the new core and would be extremely costly
         | - much more than the cache misses we account for in current
         | kernels).
         | 
         | What I would drop immediately is ISA compatibility. I have no
         | use for it provided I can compile my code on the new ISA.
        
           | zackmorris wrote:
           | Oh I just meant that the CPU might have a 64 bit interface
           | and address space, but be 32 bit or less internally. Mostly
           | the microcode would provide the emulation, but maybe a few
           | instructions like 64 bit multiply would have dedicated
           | hardware. If that turns out to be the bulk of the hardware
           | needed in the first place, then maybe my idea wouldn't work.
           | Maybe a hybrid design, half the transistors for a 5%
           | performance loss or something.
           | 
           | Ya I agree, ISA is kind of an antiquated concept. Intel chips
           | basically emulate x86 instructions in microcode today, as a
           | glaring example.
        
             | rbanffy wrote:
             | I have the impression reorder buffers take a huge amount
             | of space, especially in irregularly sized ISAs like x86.
             | Simplifying the ISA could free up a lot of space for
             | 64-bit multiply units.
        
             | SV_BubbleTime wrote:
             | Or... just use 64-bit throughout? I'm not following what
             | you are trying to save by adding a mixed system and
             | emulation. I can't imagine dropping 64-bit to pump up the
             | core count.
        
         | dmitrygr wrote:
         | > treat each core-memory as its own computer.
         | 
         | There was a company about a decade back that did this. I seem
         | to remember it was useful for web serving. Bought by AMD. Not
         | sure what happened next. Look them up. Name was SeaMicro
        
           | rbanffy wrote:
           | They were discontinued. I ported some enablement software
           | from Python 2 to 3 for it as part of Canonical's MaaS
           | platform (we had one in the OpenStack Integration Lab, IIRC).
           | 
           | I wouldn't mind finding one on my driveway ;-)
        
         | dragontamer wrote:
         | > These are still roughly 10 times more expensive than they
         | should be because of their memory architecture. I'd vote to
         | drop the idea of busses to external memories and switch to
         | local memories, then have the cores self-organize by using web
         | metaphors like content-addressable memory (CAM) to handle
         | caching. Basically get rid of all of the cache coherence
         | hardware and treat each core-memory as its own computer. The
         | hardware that wasn't scalable could go to hardware-accelerated
         | hashing for the CAM.
         | 
         | If you want "many cores" and "get rid of cache-coherence
         | hardware", its called a GPU.
         | 
         | Yes, a lot of those "cores" are SIMD-lanes, at least by NVidia
         | / AMD naming conventions. But GPU SIMD-lanes have memory-
         | fetching hardware that operates per-lane, so you approximate
         | the effects of a many-many core computer.
         | 
         | -------
         | 
         | Japanese companies are experimenting with more CPUs though.
         | PEZY "villages" are all proper CPUs IIRC, but this architecture
         | isn't very popular outside of Japan. In terms of the global
         | market, your best bet is in fact a GPU.
         | 
         | Fujitsu's supercomputer chip was also ARM-based, with HBM2.
         | But once again, that's a specific Japanese supercomputer and
         | not very popular outside of Japan. It is available though.
        
           | freemint wrote:
           | Japanese companies such as NEC are also selling vector
           | computers, such as the NEC SX-Aurora TSUBASA plug-in cards.
        
       | tromp wrote:
       | When I hear "priced at a fraction of", I would assume the
       | fraction to be well under a half.
       | 
       | Here, it's used for a fraction of 5800/7890 ~ 73.5%, which I find
       | rather misleading.
       | 
       | It would be more accurate to say it costs a fraction _less_ than
       | x86.
        
         | [deleted]
        
         | larsbrinkhoff wrote:
         | That's an odd assumption. Let's be rational: 1.5 is also a
         | fraction. 1/pi isn't, even though it's less than 0.5.
         | 
         | I don't mean to sound negative, but it's not very complex.
        
           | Cantinflas wrote:
           | Well, 5/2 is also a fraction... What would you think of the
           | headline if the fraction they referred to was 5/2?
        
           | hdjjhhvvhga wrote:
           | If you ignore the meaning assigned to it in everyday
           | English and concentrate on the mathematical sense, it leads
           | to absurdities, as practically all physical objects can be
           | described in terms of fractions once you include improper
           | fractions, irrational and complex numbers in the numerator
           | or denominator, etc.
           | 
           | "A fraction of the price" means "significantly less". When
           | someone promises to sell you a new Macbook for a fraction of
           | the price but it turns out the price difference is 5 cents,
           | from your point of view they're correct but you have every
           | right to feel cheated.
        
           | ben_w wrote:
           | It's not a rational fraction, but I think it's still a
           | fraction. And now I'm wondering about 1/i, inverse
           | quaternions (I barely get those even normally), and whether
           | 1/M = M^(-1).
           | 
           | (Nice puns, BTW).
        
             | OJFord wrote:
             | A rational number is one that can be written as a fraction
             | (of two integers).
             | 
             | What's a rational fraction? Integer parts as opposed to
             | e.g. 1.5/2.3?
        
             | samus wrote:
             | 1/i = (1/i) * (i/i) = i/(i*i) = i/(-1) = -i
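The derivation above can be sanity-checked with Python's built-in complex numbers (an editorial aside; `1j` is Python's spelling of the imaginary unit i):

```python
# 1/i = -i, verified with Python's complex type.
assert 1 / 1j == -1j          # the identity derived above
assert (1 / 1j) * 1j == 1     # multiplying back by i recovers 1
```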
        
             | caddybox wrote:
             | A rational number (with a terminating decimal
             | representation or a repeating, non-terminating decimal
             | representation) can always be expressed in fractional p/q
             | form.
             | 
             | I don't think the additional "rational" qualifier is needed
             | for fractions.
        
               | ben_w wrote:
               | Looks like all three replies to this share the same
               | misunderstanding of my intent, so I assume my use of
               | "it's" was unhelpfully vague.
               | 
               | 1/pi is not rational and therefore not "a rational
               | fraction", but I think 1/pi is "a fraction".
        
             | hdjjhhvvhga wrote:
             | Of course these exist and that's why we have the process of
             | rationalization.[0]
             | 
             | https://en.wikipedia.org/wiki/Rationalisation_(mathematics)
        
           | Asraelite wrote:
           | "A fraction of" does not literally mean a fraction in the
           | mathematical sense. Its commonly accepted use in English is
           | much narrower.
        
         | BackBlast wrote:
         | I've been using them in production. I've been pleased with the
         | value proposition offered. It helps that, on our workload, arm
         | overperforms relative to published benchmarks while AMD
         | underperforms.
        
           | sidkshatriya wrote:
             | Which core-count version do you have? Are you able to use
             | all the software you need to? Are the machines "rock
             | solid" or are there some teething issues?
        
             | BackBlast wrote:
             | Not able to use it for the whole stack. Some of our stack
             | doesn't run on arm.
             | 
             | Don't have a bare metal machine, just some modest sized
             | virtual machines with 2 cores each.
             | 
             | They have been stable.
        
         | amelius wrote:
         | Before I came here I just knew the top comment would be about
         | this headline rather than the actual meat of the content.
         | 
         | I just wish we could have a separate moderation type for
         | comments like this, so they can be moved to the bottom.
         | Complaints about a website being slow, popups, etc. also fall
         | in this category.
        
           | simonh wrote:
           | What really irks me is quibbling about headlines that are
           | technically inaccurate, but are actually just common everyday
           | usage that everyone understands perfectly well not to take
           | literally. Comments like that have deeply negative value and
           | drive me up the wall. In this case though I think I agree
           | with the criticism, in everyday usage "a fraction of" is
           | generally taken to mean quite a small fraction, which this
           | isn't.
           | 
           | Another commenter is applying the literal definition of
           | fraction to argue the title is accurate, and quite rightly
           | getting downvoted into oblivion for it. Everybody (who is a
           | native English speaker) knows very well what this phrase is
           | taken to mean in everyday English.
        
           | Snoozus wrote:
           | Well what is the meat of the content? The article says they
           | are a bit cheaper, overstates by how much and does not put
           | that in relation to any measure of performance.
           | 
           | "Lada Cost a Fraction of Bentley" thank you very much.
        
           | bjarneh wrote:
           | > rather than the actual meat of the content
           | 
           | To be fair, there isn't much "meat" in that article, it's
           | mainly about the price - and I guess most would agree that
           | when we hear the word "fraction" used like that we don't
           | think about 73/100.
           | 
           | If you were told you could get a laptop at a fraction of the
           | cost on Black Friday, and the fraction turned out to be
           | 98/100; wouldn't you be somewhat confused?
        
         | panini-grammar wrote:
         | Please note the processors under study:
         | 
         | Ampere: 128 cores ($5.8K, or ~$45 per core); AMD: 64 cores
         | ($7.9K, or ~$123 per core); Intel: 40 cores ($8.1K, or ~$202
         | per core).
         | 
         | Ampere's per-core cost is still more than 50% less than
         | AMD's or Intel's, so I think the article used the word
         | "fraction" correctly.
         | 
         | Yes, there may be other advantages to Intel/AMD that I don't
         | have knowledge of - e.g. clock speed, L1/L2/L3 cache size and
         | speed, peripherals, etc. (I tend to feel Ampere is better in
         | these too; the question is about the chipset, SW base, etc.)
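The per-core figures quoted in this comment are easy to reproduce (list prices in USD as stated in the thread):

```python
# Per-core price from list price and core count, as quoted above.
cpus = {
    "Ampere Altra Max": (5800, 128),
    "AMD EPYC":         (7900, 64),
    "Intel Xeon":       (8100, 40),
}
per_core = {name: price / n for name, (price, n) in cpus.items()}
for name, cost in per_core.items():
    print(f"{name}: ${cost:.0f}/core")
```

This gives roughly $45, $123, and $202 per core respectively, which is where the "more than 50% less per core" claim comes from.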
        
           | ekianjo wrote:
           | are the cores equivalent though?
        
             | calaphos wrote:
             | In benchmarks, Ampere's cores are roughly as fast as a
             | hyperthread on current-gen EPYC. Sometimes more on
             | heavily execution-bound workloads, sometimes less.
        
             | sidkshatriya wrote:
             | Good point. Intel's 40 cores are basically 80 cores due to
             | hyper threading. Some people don't like hyperthreading for
             | various reasons but it would be better to say the
             | comparison is between 128 cores of Ampere and 80 cores of
             | Intel.
             | 
             | But then again, number of cores is really a crude measure.
             | We need to measure what those machines can really do...
             | 
             | P.S. Surprised I am being downvoted. If you only want to
             | compare the number of cores, then I was implying that 80
             | is a better number than 40 to compare with 128. It gives
             | you a rough idea of parallelism, even though obviously HT
             | does not give you 2x the cores all the time.
             | 
             | I also added the caveat that the number of cores is a
             | crude measure anyway.
        
               | Asmod4n wrote:
               | Last time I checked, a hyperthread is ~75% slower than
               | a normal one, so it's roughly a 50-core CPU.
        
               | reitzensteinm wrote:
               | This depends heavily on your memory access patterns.
               | 
               | If you're chasing dependent loads around memory, like
               | traversing a long fragmented linked list due to garbage
               | code, you will get an absolutely perfect 100% speed up.
               | 
               | If you're already saturating the ALU on one thread,
               | adding another to the core will probably slow you down
               | with context switching and cache contention, and indeed
               | well written numeric simulations on supercomputers often
               | turn it off entirely.
               | 
               | However, most software we run resembles the garbage
               | pointer chasing variety more than the finely tuned
               | numerical variety.
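The two workload shapes described in this comment can be sketched as follows (an illustrative editor's sketch: Python timings won't show SMT effects, and measuring the real gain needs native code pinned to sibling hardware threads; this only demonstrates the access patterns):

```python
import random

N = 1 << 16
perm = list(range(N))
random.shuffle(perm)          # randomized "next pointer" table

def pointer_chase(start: int, steps: int) -> int:
    """Each load depends on the previous one: no ILP to extract, so a
    native version of this stalls on memory and leaves execution units
    idle for an SMT sibling to use."""
    i = start
    for _ in range(steps):
        i = perm[i]           # dependent load, like a linked list
    return i

def alu_bound(steps: int) -> int:
    """Independent arithmetic each iteration: a native version already
    keeps the execution ports busy, so SMT has little left to add."""
    acc = 0
    for k in range(steps):
        acc += k * k
    return acc

pointer_chase(0, 10_000)
alu_bound(10_000)
```

The first shape is the "garbage pointer chasing" case where SMT can approach a 100% speedup; the second is the saturated-ALU case where HPC codes often disable it.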
        
               | Const-me wrote:
               | > most software we run resembles the garbage pointer
               | chasing variety more than the finely tuned numerical
               | variety.
               | 
               | To be fair, crappy pointer-heavy data structures are
               | not the only possible reason. Some useful algorithms
               | are inherently serial.
               | 
               | For instance, all streaming parsers, compressors, and
               | cryptography are inherently serial algorithms. An
               | implementation is going to be relatively slow not
               | because of RAM latency, but due to a continuous chain
               | of data dependencies between sequential instructions.
               | 
               | It's technically possible to implement single-threaded
               | code to handle multiple streams concurrently.
               | Practically, more often than not it's prohibitively
               | complicated to achieve. However, doing that in hardware
               | with two hardware threads running on the same core is
               | way more manageable in terms of software complexity.
        
               | reitzensteinm wrote:
               | Yeah, garbage code was a little unfair.
               | 
               | I actually experimented with parallel fetching of
               | multiple values out of a large persistent vector years
               | ago, and saw nearly linear speed up for up to four
               | fetches in parallel.
               | 
               | The code was awful and it would need to be a compiler
               | generated thing for sure.
        
               | goldenkey wrote:
               | Once mitigations for Spectre, Meltdown, and whatnot are
               | enabled, is hyper threading still a real gain?
        
               | sidkshatriya wrote:
               | Newer generation Intel CPUs deal with some of these
               | issues on a hardware level.
               | 
               | See
               | https://www.intel.com/content/www/us/en/developer/topic-
               | tech...
               | 
               | So the level at which your Intel CPU is "crippled" would
               | depend on how old it is, I guess.
        
               | goldenkey wrote:
               | Are these hardware fixes just bandaids or are the actual
               | architectural flaws really fixed? Because didn't other
               | processors that did speculative execution besides Intel's
               | also have similar vulnerabilities?
               | 
               | A lot of expert level folks on here were saying that the
               | architecture surrounding speculative execution would need
               | a total removal or reengineering to fix it.
        
               | holbrad wrote:
               | That isn't how hyperthreading works... It's completely
               | dependent on the workload.
               | 
               | If you're using the CPU core as efficiently as
               | possible, you'd see no benefit from hyperthreading.
               | 
               | If you're using it very poorly, you'd see a massive
               | ~2x benefit.
        
               | sidkshatriya wrote:
               | > If you're using it very poorly, you'd see a massive
               | ~2x benefit.
               | 
               | I find this statement tough to agree with. It depends
               | on the kind of work your processors are doing: is it
               | IO-heavy work or are you running math computations?
               | There are other axes, basically related to how much
               | work can be done by the current thread while the other
               | thread stalls waiting for data to be fetched (or other
               | non-parallelizable dependencies to be available).
               | 
               | So if you're getting a high benefit you shouldn't
               | necessarily feel embarrassed!
        
               | esens wrote:
               | When doing C++ compiles and also renders I see a 2x
               | benefit generally. It is rare that I do not see something
               | close to 1.5 to 2x speed ups.
               | 
               | That said I did switch from Intel over to an Apple M1
               | anyway.
        
               | mschuster91 wrote:
               | Most compilers are one of the _prime_ cases of
               | inefficient architecture, and it gets worse the larger
               | the project gets. There is an awful lot of time lost
               | waiting on I/O for hundreds to thousands of files (even
               | assuming that there is enough RAM for the OS to cache
               | all the files and metadata, every file that is read has
               | _at least_ three syscall context switches for
               | open/read/close, ditto for intermediate writes) and for
               | process creation and destruction.
               | 
               | What I would find really, _really_ interesting: a
               | "single-process" compiler that has a global in-RAM cache
               | for all source contents and intermediate outputs and can
               | avoid the overhead of child processes... basically a
               | model like Webpack or Parcel that has an inotify watcher
               | and is constantly running. The JS world had no other
               | choice with NodeJS/npm all but forcing the tooling to
               | adapt to a lot of incredibly small source files, it's
               | time for the "classic" world to adapt.
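The "single-process compiler" idea above can be sketched minimally. This is a hypothetical illustration, assuming a `compile_fn` standing in for the real compiler front end and an external file watcher (e.g. inotify) calling `update_source` on change:

```python
import hashlib

class InMemoryBuildCache:
    """Sketch of a constantly-running build daemon: keep sources and
    intermediate outputs in RAM, keyed by content hash, so unchanged
    files are never re-read from disk or re-compiled."""

    def __init__(self, compile_fn):
        self.compile_fn = compile_fn  # hypothetical: source text -> object code
        self.sources = {}             # path -> source text
        self.objects = {}             # content hash -> compiled output

    def update_source(self, path, text):
        # In the real design an inotify watcher would call this on change,
        # avoiding the open/read/close syscalls at build time.
        self.sources[path] = text

    def build(self, path):
        key = hashlib.sha256(self.sources[path].encode()).hexdigest()
        if key not in self.objects:   # cache miss: compile exactly once
            self.objects[key] = self.compile_fn(self.sources[path])
        return self.objects[key]
```

Because the cache key is the content hash, touching a file without changing it (a common cause of spurious rebuilds with mtime-based tools like make) hits the cache.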
        
       | l33tman wrote:
       | This article is mostly useless, it just focuses on number of
       | cores... I didn't find even one mention in the article of actual
       | benchmarks.
        
       | berkut wrote:
       | Anandtech had a review of it here (and a 2-socket system) against
       | Xeons and EPYCs:
       | 
       | https://www.anandtech.com/show/16979/the-ampere-altra-max-re...
        
         | Symmetry wrote:
         | On a socket to socket basis scoring significantly better than
         | Intel and AMD's offering on some SPECint tests, significantly
         | worse on others, and doing about the same as an AMD 7763 on
         | average which has 64 cores with 128 threads. On the SPECfp a
         | bit behind AMD but doing better than Intel, which sort of
         | surprised me given AVX512.
        
           | robocat wrote:
           | Linus on AVX512:
           | https://www.phoronix.com/scan.php?page=news_item&px=Linus-
           | To...
           | 
           | Perhaps the implementation of AVX512 instructions has
           | improved, but the earlier versions just caused weird global
           | performance decrease side-effects, making the instructions
           | virtually useless for general purpose code.
           | 
           | Also, if you have an application that really benefits from
           | fast FP, then you nearly always want a GPU.
        
       | imachine1980_ wrote:
       | Oracle Cloud has a free tier with four Ampere vcores; if you
       | want to try out ARM it's probably the best option.
        
       | josephg wrote:
       | So? The real metric isn't the price or the number of cores. It's
       | how much performance you get per dollar. (And sometimes also,
       | perf per watt or perf per RU).
       | 
       | The headline may as well be "Slow CPU cheaper than fast CPU".
       | This is not newsworthy.
        
         | freemint wrote:
         | For SAT solving and compiling those machines have really good
         | performance. As for other workloads I didn't test these.
        
         | capableweb wrote:
         | What do you mean "the real metric isn't X"? Different metrics
         | are useful under different circumstances. Performance/USD
         | doesn't matter if you're not price sensitive for example, but
         | it's still "a real metric" for the ones that are price
         | sensitive.
        
           | masklinn wrote:
           | The comment you're responding to is not that
           | "performance/usd" isn't a metric, it's that price or core
           | count alone are not useful information.
           | 
           | They're specifically saying performance/$ _would_ be useful.
        
         | dahfizz wrote:
         | There's some benchmarks here
         | 
         | https://www.anandtech.com/show/16979/the-ampere-altra-max-re...
         | 
         | Depending on workload, they seem to perform very well
        
           | philjohn wrote:
           | Until you hit the limitation in cache size.
        
         | maxwell86 wrote:
         | For our HPC workload these CPUs perform around 25% better than
         | the AMD ones being compared.
         | 
         | "Fast CPU is cheaper than slow CPU" is newsworthy.
        
           | bee_rider wrote:
           | What kind of workloads do you have?
           | 
           | I often wonder how these Ampere chips would do on sparse
           | matrix operations. Potential for parallelism, but possibly
           | less vector potential, to keep the GPUs away...
        
           | slaymaker1907 wrote:
           | This makes sense to me since the x86 chips have the advantage
           | of running a lot more software out of the box. Sure a lot of
           | software can just be recompiled, but compiling for ARM can be
           | very difficult for software doing a lot of low level
           | performance tricks.
        
           | josephg wrote:
           | Great! That's what I want to know.
           | 
            | Thank you - your comment is more informative for me than
            | the article.
        
             | rbanffy wrote:
              | Even if a Neoverse N1 core had a third of the performance
              | of a Xeon core, the part has about 3x as many cores as a
              | top-of-the-line Xeon for about 75% of the part price.
              | 
              | After reading the article it is quite obvious this has a
              | much higher performance per dollar than a Xeon.
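A quick arithmetic check of the hypothetical ratios in the comment above (a third of the per-core performance, 3x the cores, 75% of the price):

```python
# Hypothetical numbers from the comment: per-core performance 1/3 of a
# Xeon core, 3x the core count, 75% of the part price.
per_core_perf = 1 / 3
core_ratio = 3
price_ratio = 0.75

# Aggregate throughput relative to the Xeon: (1/3) * 3 = 1.0x.
aggregate_perf = per_core_perf * core_ratio

# Performance per dollar: equal throughput at 75% of the price,
# i.e. about a 1.33x advantage.
perf_per_dollar = aggregate_perf / price_ratio
```

So even under that pessimistic per-core assumption, the part comes out roughly a third ahead on performance per dollar.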
        
         | [deleted]
        
       | alberth wrote:
        | AnandTech indicates that the M1 (non-Pro) provides 2x the
        | overall performance of what Amazon's Graviton2 (Ampere) can
        | provide.
       | 
       | https://www.anandtech.com/show/16252/mac-mini-apple-m1-teste...
       | 
       | https://images.anandtech.com/graphs/graph16252/111168.png
       | 
       | Seems like Ampere has a long way to go when an 8-core M1 (4 perf
       | core + 4 eff core) can beat a 64-core Ampere.
        
         | omni wrote:
         | Those are single-core tests, it's expected that a consumer CPU
         | will beat most server CPUs in that category. If you think the
         | M1 can beat a 64-core anything in multicore tests then I have a
         | bridge to sell you.
         | 
         | M1 multi-core: https://www.anandtech.com/show/16252/mac-mini-
         | apple-m1-teste...
         | 
         | Graviton 2 multi-core:
         | https://www.anandtech.com/show/15578/cloud-clash-amazon-grav...
        
           | omni wrote:
           | I can't edit but I screwed up the M1 link, corrected:
           | https://www.anandtech.com/show/16252/mac-mini-
           | apple-m1-teste...
        
             | rbanffy wrote:
             | The number of cores available on M1 is a consequence of
             | Apple's product lineup. There is no 64-core M1 yet because
             | there is no product for it. Yet.
        
         | wmf wrote:
         | Ampere at 3.3 GHz is significantly faster than Graviton2 at 2.5
         | GHz.
        
       | cwillu wrote:
       | Original information from
       | https://www.phoronix.com/scan.php?page=news_item&px=Ampere-A...
        
         | bithavoc wrote:
          | This should be the link of the post. HN should ban links to
          | sites like tomshardware.com, where scrolling is hijacked to
          | force-play videos and which don't provide meaningful
          | information over the equivalent Phoronix articles.
        
           | IntelMiner wrote:
           | Being less information dense than Phoronix is an achievement
           | 
           | That idiot stuffs his "articles" with so many pages littered
           | with ads to milk every single penny he can out of the few
           | Linux users that don't have adblock enabled
        
             | 5e92cb50239222b wrote:
              | That "idiot" also works 20 hours a day and barely breaks
              | even at the end of the month despite all those
              | advertisements. It's a one-man shop. I am no fan of
             | Phoronix, but give him a break.
        
               | stonogo wrote:
               | Maybe there isn't a market for working 20 hours a day
               | posting content-free articles sourced from other freely-
               | available news sites, and he should find other work?
        
               | samus wrote:
               | He also maintains a general-purpose benchmark collection
               | and uses it to perform regular performance tests for to-
               | be-released Linux kernels. Several times serious
               | performance regressions were found (most recently because
               | of AMD power management and on Alder Lake). Maybe he
               | should indeed stop helping big companies save money this
               | way, but the Open Source scene would be worse off as a
               | whole.
        
             | cwillu wrote:
             | Believe me, it hurt my soul to point out that this was a
             | rehash of something michael wrote.
             | 
             | But the fact remains that the only informational content of
             | this post came out of the legwork performed by another
             | party.
             | 
             | I can't read the word "phoronix" without mentally
             | substituting "moronix", but the point stands.
        
       | PedroBatista wrote:
       | These ARM CPUs/boards as a cheaper replacement for x86 in
       | real-world general computing are becoming a "Year of the
       | Linux desktop" by now.
       | 
       | I've been hearing this news for more than a decade and it
       | still hasn't materialized into anything meaningful if you
       | take into account how much of a "fraction of the cost" they
       | are advertised at.
       | 
       | Edit: I was referring to a regular person or SME buying something
       | like a couple "ATX" boards or "regular servers". Since it's a
       | "fraction of the cost" I don't get why it hasn't spread like
       | wildfire yet. I wasn't talking about giant cloud companies, who
       | place orders in the hundreds of thousands at least and many of
       | whom design their own hardware by now. Nor was I talking about
       | a CPU that's attached to +$1000 of gray aluminum.
       | 
       | Raspberry Pi and its "clones" are closer to what I was talking
       | about, but not really.
        
         | rbanffy wrote:
         | The CPU costs a fraction of the equivalent Xeon, but the CPU
         | is not the only part in a server BoM, nor is it the most
         | expensive subsystem of the box. When you add a terabyte of
         | RAM, extra networking, and a bunch of SAS SSDs and HDDs, the
         | CPU cost is almost negligible.
         | 
         | Most companies that buy x86 servers have no desire to recompile
         | their software for a new architecture - they want to run
         | PowerBI or SharePoint. They don't really benefit from a machine
         | like this.
        
         | matwood wrote:
         | Not sure if it's 'real world' enough, but for many workloads
         | Graviton2 (AWS ARM) is a drop in replacement for x86. Last year
         | I moved a lot of workloads over with very little effort.
        
         | jagger27 wrote:
         | What? General computing as in supercomputers?
         | 
         | https://www.fujitsu.com/global/about/innovation/fugaku/
         | 
         | Apple's entire Mac lineup is going to be Arm-powered by year
         | end too.
         | 
         | Cost comparison is almost useless in this market because there
         | simply are not enough wafers to meet demand. The price per core
         | per watt is the real world comparison, and my inaudible MacBook
         | fan demonstrates that beautifully.
        
           | formerly_proven wrote:
           | TOP xx supercomputers are not general computing. The software
           | stack basically doesn't matter because everything is going to
           | be custom anyway. That's why you see custom ISAs, funky ISA
           | extensions, one-off accelerators and single-purpose operating
           | systems in them. None of that would fly for commercial
           | applications.
        
             | thrwyoilarticle wrote:
             | There was a time when people were sceptical about patches
             | to get Linux running on those little 32 bit
             | microcontrollers.
        
             | rbanffy wrote:
             | > and single-purpose operating systems in them
             | 
             | Do they? Aren't most of them running Linux?
        
               | smoldesu wrote:
               | Linux isn't an operating system. There are many single-
               | purpose operating systems built with Linux though, which
               | is why I think the parent comment is still correct. As
               | I'm sure you can imagine, those machines are multiplexed
               | into hundreds or thousands of containers, VMs and
               | supervisor kits, and those are definitely designed to run
               | extremely niche containerized software or specialized
               | distributions.
        
               | rbanffy wrote:
               | > Linux isn't an operating system.
               | 
               | While some of those VMs may be running trimmed down OSs
               | tailored for their individual workloads, the fact remains
               | they are all running under some sort of GNU/Linux or, at
               | least, relying on the services of a Linux kernel.
               | 
               | Now that you mention it, WSL1 should be called
               | GNU/Windows instead, because it's the GNU userland on top
               | of a Windows kernel with Linux lipstick.
        
               | [deleted]
        
         | nine_k wrote:
         | ARM-based instances on AWS are a thing, and they do cost less.
         | 
       | Not _dramatically_ less, maybe 15% less than x64 instances, but
       | still.
        
         | swdev281634 wrote:
         | Oracle's cloud servers are OK. Their physical servers have 160
         | cores (2x Ampere Altra Q80-30, 80 cores/each), 1TB RAM, and 100
         | Gbps network bandwidth (2x 50Gbps cards). They can also cut
         | these servers into VMs and offer these smaller VMs.
         | 
         | The software story is OK by now. I had little to no issues
         | with aarch64 Linux in their VMs. I didn't need a lot though,
         | only mysql, asp.net core runtime, and related OS setup
         | (SELinux, built-in firewall, etc).
        
         | thrwyoilarticle wrote:
         | Not counting cloud computing (and, presumably, Apple computers)
         | is akin to the Linux naysayers not counting Android after the
         | very essence of desktop computing was uprooted.
         | 
         | Standardisation and critical mass is the hard part of the
         | puzzle for Arm64 desktops. But it was also the hard part of the
         | puzzle for supercomputing and cloud servers, where it now has a
         | firm foothold. Personally, I work in an industry where everyone
         | is moving away from x86 and developing on physical Arm64
         | machines because x86 simply can't fit the power budget in
         | production.
        
         | selfhoster11 wrote:
         | ARM has been pretty weak for more than a decade, and only
         | started getting decent CPU performance and RAM sizes with
         | reasonable pricing very recently. Even x86 computing took a
         | while to develop into something useful.
        
       | bullen wrote:
       | 2 watts per core at 3GHz is pretty impressive; what nanometer
       | are they at?
       | 
       | But memory bandwidth is still going to restrain those 128 cores
       | from doing anything jointly parallel; you might actually be
       | better off with many smaller 4-8 core machines.
        
         | jagger27 wrote:
         | TSMC N7.
        
           | bullen wrote:
           | Thx
        
         | voxadam wrote:
         | > what nanometer are they at?
         | 
         | It looks like 7nm.[0] I suppose that means they're coming out
         | of TSMC, which isn't terribly surprising.
         | 
         | [0] https://www.servethehome.com/ampere-altra-
         | max-m128-30-128-co...
        
         | marcan_42 wrote:
         | This is a good point. They have 8 channels of DDR4-3200, which
         | is 204 GB/s, to be shared across 128 cores. That's _half_ the
           | total memory bandwidth of the M1 Max, at 400 GB/s, and it only
         | has 10 cores and a (decently beefy) GPU.
         | 
         | A single M1 P-core can saturate 70GB/s of memory bandwidth, and
         | not even at max clock. If these 128 cores were individually as
         | good as the M1 cores, they'd need _350 channels_ of DDR4-3200
         | RAM to get that nonblocking per-core memory bandwidth. The M1
           | Max can't achieve that, ending up at somewhere around 28GB/s
         | per core with all P-cores competing, but even then this
         | 128-core CPU would need _140 channels_ of DDR4-3200 to achieve
         | the same memory bandwidth per core. So I imagine this kind of
         | CPU will scale well to compute-heavy workloads that mostly stay
         | in cache, but rather poorly for throughput-heavy workloads that
           | stream data to/from RAM very quickly. The RAM is more than 10x
         | oversubscribed; you might find that RAM-bound workloads
         | saturate with just 12-16 of those 128 cores active.
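The channel arithmetic above can be double-checked; the per-core figures (70 GB/s peak, 28 GB/s contended) are taken from the comment, and 25.6 GB/s per DDR4-3200 channel assumes the standard 8-byte (64-bit) channel width:

```python
# DDR4-3200: 3200 MT/s x 8 bytes per channel = 25.6 GB/s per channel.
per_channel = 3.2e9 * 8 / 1e9          # 25.6 GB/s
total = 8 * per_channel                # 204.8 GB/s across 8 channels

# Channels needed to give every one of 128 cores M1-like bandwidth:
peak_per_core = 70                     # GB/s, single M1 P-core (claimed)
contended_per_core = 28                # GB/s, all P-cores competing (claimed)
channels_peak = 128 * peak_per_core / per_channel           # 350 channels
channels_contended = 128 * contended_per_core / per_channel # 140 channels
```

That is, with only 8 real channels, per-core bandwidth under full contention works out to roughly 1.6 GB/s, which is why RAM-bound workloads could saturate with only a dozen or so cores active.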
        
           | crest wrote:
           | According to all benchmarks I've seen, the M1 Max doesn't
           | offer significantly more memory bandwidth to the CPU cores
           | (unless the GPU is busy as well). You have to run a
           | microbenchmark on all P and E cores to reach the internal
           | bus limitation at ~240GB/s. The remaining ~160GB/s can only
           | be utilised by the other parts of this monster SoC (GPU, TPU,
           | video decoders/encoders, etc.) and in my 14" machine the GPU
           | is throttled by a power limit (probably required by the
           | available cooling).
           | 
           | I would love to see how much performance potential there is
           | in the M1 Max die (better cooling with raised voltage and
           | power limits).
        
             | marcan_42 wrote:
             | That's what I said: you get 28 GB/s per P-core for a total
             | of around 220GB/s across 8 P-cores (adding the two E-cores
             | adds a bit more to reach 240). That's still more than the
             | Ampere CPU can offer its 128 cores.
        
           | cma wrote:
           | Most uses that need that few operations per byte would be IO
           | limited I would think, or better suited to GPU compute.
           | 
           | M1 max has a very beefy GPU so the bandwidth is mostly sized
           | for that. Its GPU is 22 single-precision TFLOPS or something
           | but still has less memory bandwidth than a 2080ti which is
           | only 14TFLOPs or so.
        
             | marcan_42 wrote:
             | Keep in mind that the GPU architecture is TBDR, which is
             | designed for lower memory bandwidth (and thus lower power
             | consumption) than typical immediate mode GPUs. Desktop GPUs
             | also use tiles these days, but for rendering use cases at
             | least the TFLOPS:bandwidth ratio is probably not directly
             | comparable between architectures that dissimilar.
             | 
             | The CPUs can still eat half of the M1 Max's memory
             | bandwidth alone, without the GPU to help. How often this
             | happens with real world workloads, I don't know :-)
        
           | ThrowawayR2 wrote:
           | The M1 Max gets that bandwidth from putting the RAM into the
           | processor package. There is no way that server processors can
           | fit the much higher RAM per core used for server workloads
           | onto the processor package so it's unclear that the
           | comparison is meaningful.
        
             | namibj wrote:
             | POWER10's memory architecture offers a TB/s of memory
             | bandwidth per die, and about another TB/s of other
             | bandwidth for SMP and IO. Using all of this for memory,
             | they'd be at an equivalent of around 80 channels. Not
             | quite the 140, but much closer than 8.
        
             | marcan_42 wrote:
             | You need to scale the DRAM channels with the number of
             | cores; you can't just pack more and more cores into one die
             | without adding DRAM width. Yes, you then run into physical
             | limitations, and that's kind of the point I'm trying to
             | make here :)
        
         | nine_k wrote:
         | Often you're after tail latency, not throughput.
         | 
         | It's pretty normal to target e.g. your web backend tier to not
         | exceed 20% CPU load, and be ready to instantly react to
         | incoming I/O.
        
       | silicaroach wrote:
       | should read "... at fraction of the _comparable_ x86 Price" :-D
        
       | gfykvfyxgc wrote:
       | ARM has a big problem .... compatibility.
       | 
        | X86 absolutely destroys ARM on the software front. Give me an x86
        | CPU and I can run any version of Linux on it with wide hardware
       | support.
       | 
       | Give me a random ARM computer, you're lucky if you can get it to
       | boot.
       | 
       | You're even luckier if the ARM CPU manufacturer releases
       | documentation about how their chips work .... They prefer to keep
       | inner workings secret.
        
         | capableweb wrote:
         | > Give me a random ARM computer, you're lucky if you can get it
         | to boot.
         | 
         | That's wildly inaccurate. Many distributions have ARM builds
         | that work perfectly fine (at least for me, YMMV). Have you
         | tried any of them?
         | 
         | Here are some of them:
         | 
         | - Arch Linux - https://archlinuxarm.org/
         | 
         | - Ubuntu - https://ubuntu.com/download/server/arm
         | 
         | - Alpine Linux - https://alpinelinux.org/downloads/
         | 
         | - Elementary OS - https://blog.elementary.io/elementary-os-on-
         | raspberry-pi/
         | 
         | - NixOS - https://nixos.wiki/wiki/NixOS_on_ARM
        
         | fork-bomber wrote:
         | I don't think that's largely true anymore for server class
         | hardware - which is the focus of the article.
         | 
         | Arm came up with a bunch of standardisation requirements quite
         | a while ago (see:
         | https://developer.arm.com/architectures/system-
         | architectures...) which have been quite successful especially
         | for server designs.
         | 
         | That was an absolute requirement in order for AArch64 to even
         | be considered as an alternative in the datacenter space where
         | it is now a very compelling alternative to x86_64.
         | 
         | What I mean specifically is standardised support for firmware,
         | hypervisor and operating system kernel interfaces for things
         | like system bootstrap, power-perf control etc. Think ACPI, EFI,
         | CPU capability discovery, DVFS, Idle management etc.
         | 
         | Being unable to boot Linux on modern AArch64 server class
         | hardware is actually increasingly rare thanks to the
         | standardisation.
         | 
         | Your comments are more applicable to the general Arm embedded
         | systems scene where fragmentation is understandably rife. It
         | was the price Arm had to pay to keep its royalty model in
         | flight - "Pay the license fee, do what you will with the
         | design".
        
         | 3np wrote:
         | The gap for generic arm64 has been closing _a lot_ in recent
         | years. These days IME the vast majority of fiddling doesn't
         | have to do with the ARM SoC/CPU itself but rather with getting
         | the right dtb (and sometimes firmware) for other integrated
         | hardware, and u-boot.
         | 
         | These issues aren't really inherent to the architecture per se,
         | more tied to the setups and practices for the kind of devices
         | you tend to find available ARM SoCs/CPUs in. IME from dealing
         | with embedded x86 devboards some years back it was a similar
         | situation. I'd assume units marketed for server loads are
         | comparable to x86 equivalents.
         | 
         | I've been trying out Arch Linux on ARM on a secondary
         | workstation recently. In most cases the only thing needed to
         | get a non-supported package working is to add 'aarch64' to the
         | list of supported architectures in the PKGBUILD and then proceed
         | like normal.
         | 
         | > You're even luckier if the ARM CPU manufacturer releases
         | documentation about how their chips work .... They prefer to
         | keep inner workings secret.
         | 
         | This is the bigger practical issue. Rockchip are in general
         | good, while the kind of chips you find in flagship phones
         | require significant reverse-engineering and are unapproachable
         | for the non-hacker. But again, not so much "software-to-CPU",
         | more "drivers-to-everything-around-the-CPU".
        
           | foxfluff wrote:
           | I wouldn't generalize Rockchip as good. You're lucky if they
           | release one volume of their six volume reference manual, with
           | "Confidential" plastered all over, and most peripherals
           | completely undocumented beyond vague register names. If
           | you're extra lucky, you'll find another leaked volume on a
           | sketchy Chinese website. Anything else? NDA.
           | 
           | And what about new chips? Look up datasheets for e.g. the
           | (now popular) RK3566. You get a 58 page document which is
           | just a general overview plus pinout and not much else. Does
           | fuck all for you if you actually want to write drivers and
           | get the thing to work.
        
             | 3np wrote:
             | Fair enough!
        
           | floatboth wrote:
           | dtb is for embedded devices. Server/workstation class
           | hardware uses ACPI.
        
             | 3np wrote:
             | Precisely, thanks for making it explicit.
        
         | maxwell86 wrote:
         | We bought a small ARM HPC cluster last summer, and everything
         | worked out of the box. All our apps and dependencies just
         | worked.
         | 
         | Documentation is also excellent. ARM docs are really good, and
         | go in much more detail than Intel and AMD docs about the inner
         | working of their cores.
         | 
         | That's what we expected, since ARM licenses all their IP, and
         | that's what our vendor delivered. Everything we wanted to know
         | about the Hw, there were docs and training materials ready for
         | it.
        
           | freemint wrote:
           | What scheduler do you use there?
        
           | sidkshatriya wrote:
           | Has dealing with the slightly more relaxed
           | memory/concurrency model of ARM been tricky, or is it
           | something you don't really encounter in practice?
        
             | maxwell86 wrote:
             | We don't write relaxed-atomic kind of code directly (who
             | does this, really?). We use MPI, pthreads primitives, C++
             | synchronization primitives, OpenMP, etc. These are portable
             | and "just work". Anecdotally, we haven't run into any
             | incorrect use of these in our apps yet that cause problems
             | on ARM but not on x86 (although that would be a bug on
             | both), but we aren't doing anything super fancy.
        
         | floatboth wrote:
         | > They prefer to keep inner workings secret
         | 
         | https://github.com/tianocore/edk2-platforms/tree/master/Sili...
         | https://github.com/tianocore/edk2-platforms/tree/master/Sili...
         | https://github.com/tianocore/edk2-platforms/tree/master/Sili...
         | https://github.com/tianocore/edk2-platforms/tree/master/Sili...
        
       | mobilio wrote:
       | Just two things:
       | - 250W TDP is HUGE
       | - motherboard for this processor is expensive
        
         | jeffbee wrote:
         | I just bought a new Intel CPU that draws 125-241W (they don't
         | give a TDP any more) and it only has 8 cores. It is very fast,
         | though.
         | 
         | I don't think 250W is outrageous for a chip with this much
         | logic on it.
        
         | OJFord wrote:
         | > 250W TDP is huge
         | 
         | 128 cores is huge too, though; I imagine it scales pretty
         | linearly?
        
         | jakuboboza wrote:
         | Is it? Most EPYC processors have a 180/200W TDP, and there
         | are cheap mobos for them that can host even two sockets. So I
         | don't think that would be a big issue.
         | 
         | Also, we don't even know how they calculate TDP; let's not
         | forget that every single company (Intel, AMD, Nvidia, etc.)
         | has its own weird formula for calculating TDP. Your Intel
         | 12900K has a TDP on paper of 125W but can easily jump to 300W
         | of power consumed. Without knowing each manufacturer's TDP
         | formula, this type of comparison is only a guessing game.
        
         | tehbeard wrote:
         | If this were an Intel chip, then yeah, 250W TDP would be huge
         | given their core counts.
         | 
         | But for something packing a similar core count to
         | EPYC/Threadripper, it's in the right ballpark.
        
         | NavinF wrote:
         | Is it though? My desktop CPU regularly pulls 150W. 250W is
         | pretty normal for a server. That said, I agree that
         | motherboards will be really expensive since we're not gonna see
         | hundreds/thousands of models competing on price like we see
         | with every new x86 socket.
        
       | bee_rider wrote:
       | This article has no measurements, in fact it doesn't tell us
       | anything we couldn't have gotten by looking at the product
       | specifications. They even took their table from Phoronix, so
       | they didn't do the legwork of comparing the marketing material
       | for the products themselves!
       | 
       | Given that this low-effort article picked a metric which everyone
       | knows will benefit the ARM processor, I can only assume it is
       | marketing for Ampere. And yet, the first sentence starts with:
       | 
       | > Ampere's flagship 128-core Altra Max M128-30 may not be the
       | world's highest-performing processor
       | 
       | Ampere is cool. It is really awesome that somebody is putting up
       | a fight in CPU design without Intel/AMD's legacy advantages, or
       | Apple/Amazon's infinite money. I really hope they didn't pay much
       | for this fluff; that would be pretty embarrassing.
       | 
       | --
       | 
       | Edit: It is neat to see that they've got a chip under $1k. I
       | wonder if a Q32-17 workstation could be put together for cheaper
       | than whatever the cheapest Apple M1 pro device is, to experiment
       | with computationally crunchy Arm codes.
        
         | londons_explore wrote:
         | It's also quite a niche use case... An application fine with
         | low single thread performance, that is highly parallelizable,
         | requiring hundreds of threads, but with sufficiently branchy
         | execution that CUDA/GPU doesn't work out... Oh, and it can't
         | have any binary blobs, or you won't be able to port to ARM.
        
           | guiand wrote:
           | Sounds like exactly the use case of a server. Rack space is
           | expensive, condensing it can save the operators money.
        
         | masklinn wrote:
         | > Edit: It is neat to see that they've got a chip under $1k. I
         | wonder if a Q32-17 workstation could be put together for
         | cheaper than whatever the cheapest Apple M1 pro device is, to
         | experiment with computationally crunchy Arm codes.
         | 
         | The cheapest M1P device is currently rather expensive ($2k for
         | the 8-core 14") but there'll almost certainly be an M1P Mini
         | for about the same price as the current (still on Intel) high-
         | end model: $1100.
         | 
         | A Q32-17 leaves you with $300 for a bespoke box around the CPU.
         | For such a CPU class I'd expect the mainboard alone to exceed
         | that budget. Even if the Mini is price-bumped to, say, 1500
         | (which would be somewhat in-line with the 13" -> 14" price
         | differential) I don't think you can get even just the guts of
         | an Altra-based workstation for less than the price of the
         | processor.
        
           | dchichkov wrote:
           | An alternative could be Jetson AGX Orin - 2048 Cores @
           | 1000MHz.
        
           | bee_rider wrote:
           | Yeah. Someone in another branch linked a site selling Ampere
           | workstations, couldn't find anything under ~$7k.
           | 
           | I think if we want to drop ~$7k on ARM computers to play
           | around with for some reason, we'd be better off waiting for a
           | hypothetical M1P mini. Plus you could get a handful of the
           | things and make an adorable mini cluster -- then you get to
           | deal with MPI which will make for a more fun experiment
           | anyway.
        
             | R0b0t1 wrote:
             | There are cheaper ARM workstations that smoke the usual
             | embedded chips. They need to drop the price before devs
             | will really want them.
        
             | jhickok wrote:
             | And that $7k is without storage and with a relatively
             | small RAM selection. Realistically the workstation is more
             | like $10-15k.
             | 
             | It would be nice to have some competition to Apple in this
             | space, but with the rumored Apple Silicon Mac Pro on the
             | horizon that might be the best bet:
             | https://appleinsider.com/articles/22/01/02/smaller-mac-
             | pro-w...
        
         | floatboth wrote:
         | > I wonder if a Q32-17 workstation could be put together for
         | cheaper than whatever the cheapest Apple M1 pro device is
         | 
         | Nah. You can't just get a chip & mainboard retail since there
         | is basically no market for that.
         | 
         | About the only option for an Altra workstation is a prebuilt
         | for "as low as $7661" :(
         | 
         | https://store.avantek.co.uk/ampere-altra-64bit-arm-workstati...
        
           | bee_rider wrote:
           | Ouch. I (hypothetically -- some alternative version of me
           | that had more free time and free money) would be willing to
           | pay some early adopter, "I'm buying fun parts" tax, but $7k
           | for 32 threads @ 1.7GHz is a bit much even for hypothesizing.
           | I wonder what they have in mind for that chip, then.
        
             | fredoralive wrote:
             | It's mainly intended as a server CPU, and those often go
             | for lots of cores / threads over maximum single core
             | clockspeed. That 32 core / 1.7GHz config doesn't seem too
             | hot for either, though.
             | 
             | The workstation is presumably intended as a developer tool
             | really, for those that want real hardware on hand instead
             | of remote access (and don't want a noisy server in an
             | office). Going to be even more niche and low volume than
             | the server versions, so gonna cost a fair chunk.
        
             | floatboth wrote:
             | For fun parts you can get an NXP LX2160A board from
             | SolidRun (or the earlier Marvell Armada8k one).
             | 
             | The performance wouldn't be competitive (Cortex-A72 is
             | really old by now) but it's nice to have a proper arm64
             | desktop.
        
           | leeter wrote:
           | Someone needs the courage to make a motherboard and CPU with
           | off the shelf parts and a socketed chip in the non-x86 space.
           | That was one of the key reasons that the IBM PC succeeded: it
           | was easy to clone. Without that it's likely it would have
           | been just another interesting footnote in history. But
           | because IBM didn't take the time to do everything custom and
           | locked in others were able to enter the market and make
           | things happen. Honestly I don't think anyone sells socketed
           | ARM64 or RISC-V chips. There might be OpenPOWER but I don't
           | think anyone is really fabbing them at a cost that a consumer
           | could even consider.
           | 
           | Sadly, I don't think that sort of courage is in copious
           | supply; it's always easier to vendor-lock people than it is
           | to take the risk of someone else eating your lunch. So until
           | that happens I just don't see non-x86 platforms really
           | eating into the desktop/laptop market in any large share.
        
             | msgilligan wrote:
             | Well, Raspberry Pi, with the Compute Module 4 has done this
             | on the low-end and an ecosystem appears to be developing
             | rapidly with a wide variety of carrier boards and a handful
             | of 3rd-party compute module cards (e.g. Pine64 SOQuartz and
             | the Radxa CM3)
        
             | chasil wrote:
             | AMD actually did this in the distant past with the original
             | 32-bit Athlons.
             | 
             | They used the DEC Alpha EV6 bus from the 21264
             | microprocessor. It would be interesting to have seen DEC
             | StrongARM adapted to the same bus, and a single motherboard
             | able to run any of these CPUs with the right BIOS.
             | 
             | https://en.wikipedia.org/wiki/Slot_A
        
               | leeter wrote:
               | Other than BIOS/UEFI as far as I know nothing is stopping
               | them from using existing chipsets/motherboards. I don't
               | recall anything on the chipset itself being inherently
               | tied to x86 per se. I think even port IO is handled in
               | the SOC/microcode these days. So they could probably do
               | it. Particularly if they reused things like the AMD IO
               | die etc. No need to reinvent the wheel because the cores
               | are different.
        
             | stcredzero wrote:
             | _I don 't think anyone is really fabbing them at a cost
             | that a consumer could even consider._
             | 
             | What are the Chinese getting up to?
        
             | ohazi wrote:
             | > Someone needs the courage to make a motherboard and CPU
             | with off the shelf parts and a socketed chip in the non-x86
             | space.
             | 
             | Honestly? If they want an easy win, that someone should be
             | Intel or AMD.
             | 
             | They already design and release desktop and mobile chipsets
             | with "platform specs" that are designed to be copied
             | wholesale by all the major motherboard and PC vendors.
             | 
             | If Intel released a socketed ARM CPU with a chipset and a
             | platform spec to go with it, you'd be able to buy a
             | motherboard for it from ASUS or MSI for $150 within six
             | months.
        
               | dragontamer wrote:
               | AMD's original plan for Zen was to have an x86 chip +
               | socket-compatible ARM chip as well.
               | 
               | AMD didn't seem to have the funding needed to execute
               | this plan though, and killed the ARM-project. To be fair,
               | AMD was teetering on the edge of bankruptcy at this time,
               | and it probably was just too much risk for them to even
               | attempt such a strange and risky strategy.
               | 
               | AMD did release a few ARM chips to test the waters. But
               | the ARM-based AMD chips never sold in high numbers or got
               | much demand, so AMD doubled-down on x86-only Zen chips
               | instead.
               | 
               | -------
               | 
               | Allegedly, AMD still owns their ARM-based decoder and
               | therefore can convert their Zen-chips over from x86 into
               | ARM-based instructions if they ever felt like it made
               | business sense. Given how successful EPYC is however,
               | there's probably no need for AMD to do something this
               | risky (might as well keep selling these x86 chips like
               | hotcakes, AMD is practically supply-constrained as it is
               | and doesn't need to do weird things like that to sell its
               | chips)
               | 
               | -------
               | 
               | See:
               | 
               | * https://www.amd.com/en/amd-opteron-a1100
               | 
               | * https://en.wikipedia.org/wiki/AMD_K12
        
             | oarsinsync wrote:
             | > That was one of the key reasons that the IBM PC
             | succeeded: it was easy to clone. Without that it's likely
             | it would have been just another interesting footnote in
             | history. But because IBM didn't take the time to do
             | everything custom and locked in others were able to enter
             | the market and make things happen.
             | 
             | I'm reasonably sure IBM PCs were totally custom and locked
             | down, and they got reverse engineered, enabling "IBM
             | Compatible" clones to spring up.
        
               | klondike_ wrote:
               | The IBM PC was the first computer IBM made that used
               | entirely off-the-shelf chips except for one: the BIOS.
               | 
               | The BIOS chip was just copied outright for the first
               | clones, but that was ruled illegal in court, so later
               | compatibles reverse engineered the code and made their
               | own.
        
           | Melatonic wrote:
           | If datacenters adopt it, then in a few years we will be able
           | to get the decomm'd stuff - once the hardware refresh cycles
           | hit, eBay and the like will be flooded with it.
        
         | tkinom wrote:
         | I'd love to see a compiler benchmark (compiling Firefox,
         | Chrome) on this vs. a system with an EPYC 64C/128T or
         | 128C/256T.
        
         | [deleted]
        
         | jcadam wrote:
         | I would love an ARM-based linux workstation, but I'm not
         | willing to pay the extreme Apple premium for whatever an
         | M1-based Mac Pro is going to cost.
        
           | DenseComet wrote:
           | I wonder how the pricing will end up. At the baseline, $1200
           | for a MacBook Air with 16 gigs of ram is great value for the
           | machine you get, even compared to Windows alternatives. For
           | the new MacBook Pros, there are multiple companies who felt
           | that they were worth shelling out for top of the line configs
           | outside of their normal refresh cycles. Maybe Apple will
           | price their top of the line stuff for a reasonable
           | price/performance ratio, or they might think that the people
           | who actually need that config would be willing to pay sky
           | high prices and charge accordingly.
        
         | lmilcin wrote:
         | The Q32-17 may have 32 cores, may have a 45W TDP, and may have
         | a whopping 128 PCIe 4.0 lanes, but it is still only 1.7GHz.
         | 
         | What this means in practice is that it will heavily depend on
         | the type of load you are running. A lot of workstation-type
         | loads just can't make use of 32 threads, and on this CPU they
         | would have to just to offset the slower single-core
         | performance.
        
           | PragmaticPulp wrote:
           | The 1.7GHz boost speed is a significant limitation.
           | 
           | Ignoring architecture differences for a moment, that's about
           | 1/2 the clock speed of something like an M1 or even a modern
           | cell phone chip. It's almost 1/3 of the boost clock of a
           | modern Intel or AMD desktop chip.
           | 
           | Given the choice and assuming equal cache sizes, I'd take a
           | 3.4GHz version with 16 cores over a 1.7GHz version with 32
           | cores.
           | 
           | The only time a higher core count with proportionally lower
           | clock speed really helps is when you're going for raw power
           | efficiency.
        
           | windexh8er wrote:
           | One application where this type of processor makes sense is
           | network-based middleboxes. Think of things like firewalls
           | that have evolved to have a lot of different functionality
           | and a need for a lot of simultaneous thread processing
           | availability. This type of chip on a SmartNIC could be very
           | popular. And while this is a bit niche I'm sure there are
           | plenty of other use cases. E.g. nodes running a lot of FaaS
           | containers, VPS providers looking to offer higher CPU count
           | but lower clock speeds, etc.
        
           | bee_rider wrote:
           | I'm thinking more as a developer machine to prepare for the
           | future, rather than as a real productivity workstation.
           | Assuming the predictions of ARM ascendancy actually come true
           | (certainly remains to be seen) you should want your code to
           | run well on 32 ARM cores @1.7GHz, because that means it'll
           | run really well on the production workstations that come out
           | in a couple years, right?
        
             | wongarsu wrote:
             | If production workstations become ARM based, wouldn't it be
             | more likely that we get 2-4 high-speed (3-4GHz) cores and a
             | large number of low speed cores? That's analogous to
             | smartphone architectures with their low-power cores and
             | high-speed/high-power cores, and better matches how real-
             | world workloads behave.
        
               | bee_rider wrote:
               | So the smaller cores could be more like throughput-tuned
               | efficiency cores, rather than 100% power saving cores --
               | kind of like Alder Lake? Seems like a neat idea. Almost
               | like a Xeon + Xeon Phi, but on one chip. Seems like
               | that'd get around the communication issue of the Phi and
               | probably the compatibility issues as well.
        
             | zitterbewegung wrote:
             | Apple used an A12Z Bionic dev kit for people to prepare
             | for the M1 and its derivatives, which have a big.LITTLE
             | configuration. I don't see this or any other server with
             | that configuration at all. An ARM workstation would
             | definitely have it, which suggests that a Mac with Apple
             | silicon would be the developer machine to prepare for the
             | future.
        
       ___________________________________________________________________
       (page generated 2022-01-04 23:02 UTC)