[HN Gopher] Arm-Based 128-Core Ampere CPUs Cost a Fraction of x8...
___________________________________________________________________
Arm-Based 128-Core Ampere CPUs Cost a Fraction of x86 Price
Author : rbanffy
Score : 176 points
Date : 2022-01-03 11:32 UTC (1 day ago)
(HTM) web link (www.tomshardware.com)
(TXT) w3m dump (www.tomshardware.com)
| freemint wrote:
| During an ARM HPC User Group meetup I got to play with a 160-core
| machine from Ampere and saw really, really impressive performance
| (not to mention performance per dollar) for SAT solving. I
| pitched buying one to my local HPC cluster.
| cameron_b wrote:
| Where does one find information on ARM HPC User groups and
| local HPC Clusters?
| freemint wrote:
| For the ARM HPC user group there is a mailing list
| https://arm-hpc.groups.io/g/ahug and a Twitter account
| https://twitter.com/armhpcusergroup . As for the local HPC
| Cluster that depends on your geographical position. Mine is
| at the university I am studying at.
| stillicidious wrote:
| Totally meaningless numbers without at least some normalized
| performance/watt to compare against
| jakuboboza wrote:
| I think Ampere doesn't support hyper-threading. That means these
| 128 cores are comparable to 64 cores on EPYC/Xeon. The L2/L3
| cache is also important, and of course the architecture. Arm
| still has low adoption because code has to be recompiled. While
| "web" targets are easy, things like financial software that
| benefits from AVX-512 instructions might be harder because
| Neoverse doesn't have those instructions.
|
| On the other hand, massively concurrent solutions might benefit
| much more from these new 128/256-core Arm chips. So for sure
| there is room for this type of solution, and I'm happy we are
| adding on top of x86/amd64.
|
| Last but not least, x86/amd64 is (unless something changed)
| locked to AMD and Intel, so regions like the EU can't rely on it
| if they want to be independent in terms of silicon
| production/design. So Arm and maybe RISC-V are the only real
| paths right now.
| monocasa wrote:
| A hyperthread is only worth ~30% of a full core back of the
| napkin.
| FpUser wrote:
| Many years ago, enabling hyperthreading on my clients'
| single-core PCs doubled performance. It was a multi-threaded
| multimedia-type application.
| berkut wrote:
| Can be closer to 40% in my experience with Intel when things
| are memory/cache bound (i.e. one thread is regularly stalled
| waiting for data), and I've seen above 55% in some cases with
| AMD's Zen microarch's SMT (Zen3).
| floatboth wrote:
| And security people hate them because of side-channels.
| water8 wrote:
| Hyperthreading works by putting multiple arithmetic logic
| units, such as an adder, bit shifter, or mux, to work in
| parallel during clock cycles where the executing code can
| utilize more ALUs. If the executing code were simply using the
| same ALU over and over again, such as repetitive summations, no
| additional performance gain would be possible from
| hyperthreading. A hyperthread doesn't necessarily have the same
| priority over the ALUs that a normal core thread does.
| cesarb wrote:
| You seem to be confusing SMT (the formal term for
| hyperthreading) with superscalar. It's superscalar (and
| out-of-order which builds on it) which works by using
| multiple ALUs in parallel when possible. SMT then builds on
| that by allowing the unused ALUs to be used by other
| thread(s). If a single thread was simply using the same ALU
| over and over again, you'd have a large additional
| performance gain from SMT, since all the other ALUs would
| be available for the second thread.
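|
| A minimal sketch (purely illustrative) of the "same ALU over
| and over" case - one long dependency chain that leaves most of
| a wide core's execution units idle, which is exactly the slack
| an SMT sibling thread can use:
|
|     #include <stddef.h>
|     #include <stdint.h>
|
|     /* Every iteration depends on the previous value of sum, so
|      * (assuming the compiler doesn't reassociate or vectorize
|      * it) a wide out-of-order core retires roughly one add per
|      * cycle from this chain and leaves its other integer units
|      * idle - slack a sibling SMT thread can use. */
|     uint64_t serial_sum(const uint64_t *a, size_t n) {
|         uint64_t sum = 0;
|         for (size_t i = 0; i < n; i++)
|             sum += a[i];  /* one long sum -> sum -> sum chain */
|         return sum;
|     }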
| Symmetry wrote:
| It really depends on the core. An SMT8 POWER8 core might see
| an 80% throughput increase going from 1 to 2 threads but
| those have been accused of being two SMT4 cores standing on
| each other's shoulders wearing a trenchcoat. 30% for a
| mainstream x86 consumer core is a good average expected value
| but it'll tend to vary a lot by workload with lower IPC tasks
| tending to get better results.
| magicalhippo wrote:
| What a hyperthread brings to the table has huge variability.
| It can even be a net negative in certain cases. For the
| software ray-tracer I worked on, it brought 7-80% of a single
| core.
| thrwyoilarticle wrote:
| My impression of Arm-world SMT is that it was added under duress
| because people kept asking for it, despite the word-of-God being
| that it was better just to add additional cores (I wonder if
| their license structure influences that claim?). Today SMT Arm
| cores are still very much the minority, so either the fashion
| held or their customers/implementors agree that more cores are
| better than fewer cores with SMT.
|
| There are no AVX-512 instructions, but AVX-512 is just the x86
| branding of vector instructions that you can only implement
| with the right x86 licence, so the point is tautological. Arm
| has its own vector instructions (NEON, SVE), and languages are
| even beginning to offer portable interfaces for vectors across
| multiple architectures.
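|
| A minimal sketch of such a portable interface, using the
| GCC/Clang vector extensions (illustrative only); the same
| source lowers to SSE on x86-64 and to NEON on AArch64:
|
|     #include <stddef.h>
|
|     /* A 4-lane float vector type; the arithmetic below compiles
|      * to SSE on x86-64 and to NEON on AArch64, with no #ifdefs. */
|     typedef float v4f __attribute__((vector_size(16)));
|
|     void saxpy4(v4f *y, const v4f *x, float a, size_t n) {
|         v4f av = {a, a, a, a};        /* broadcast the scalar    */
|         for (size_t i = 0; i < n; i++)
|             y[i] = av * x[i] + y[i];  /* one multiply-add / lane */
|     }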
| parsimo2010 wrote:
| I can buy an 8 core AMD Opteron for $18 with shipping on eBay, so
| actually I can get 128 cores of x86 for $288. Checkmate lazy
| journalists.
|
| Proof: https://www.ebay.com/itm/234113546803 and that's just the
| first result when I searched AMD Opteron. I didn't even bother to
| see whether I could do better with other old CPUs.
| iosjunkie wrote:
| Now source 5,000 CPUs at that price for your new data center.
| BatteryMountain wrote:
| And it eats a cool 95W per chip.
| rbanffy wrote:
| Not counting the 16 PSUs.
|
| There are use cases for having 16 machines instead of one
| machine 16 times more powerful, but lower TCO isn't one of
| them.
| [deleted]
| kellengreen wrote:
| This article just makes me miss the Tom's Hardware of old.
| JohnJamesRambo wrote:
| I can't believe there are CPUs that cost $8k. Does it really pay
| off price/performance wise or is it just for people that like hot
| rods?
| can16358p wrote:
| Many of them are for datacenters/cloud computing or rendering
| services/hypervisors. For those uses, yes, it does pay off for
| sure, given that the CPU time is utilized efficiently.
| PragmaticPulp wrote:
| It's not for consumers. It's for special-purpose servers and
| workstations.
|
| Many problems don't scale well to more nodes. In many cases,
| it's worth spending a lot on a single, very expensive server to
| avoid having to rewrite the software to be distributed across
| multiple machines.
| chrisseaton wrote:
| Are you disbelieving because you think most people would use
| multiple cheaper CPUs instead? There are some workloads where
| inter-core communication is the bottleneck, so people pay a
| premium to get a lot of cores in one place.
| wongarsu wrote:
| The first "bestselling" consumer CPU I checked costs ~$50/core
| (a 6-core AMD Ryzen 5600X). Scaled to 128 cores, that Ryzen
| would cost $6400. Considering how many motherboards, PSUs, fans,
| etc. you save by having one computer with a 128-core CPU instead
| of 8.3 computers with a 6-core CPU each, the price premium pays
| for itself (for workloads where this CPU performs at least as
| well as 8.3 Ryzen 5600Xs).
| JohnJamesRambo wrote:
| Is ram, heat, and network I/O not an issue with 128 cores in
| one box?
| wongarsu wrote:
| It can be a problem, or unlock efficiencies. 8x1Gbit vs
| 1x10Gbit LAN costs about the same, but having the 10Gbit
| LAN in one box gives you better burst performance than
| scattering your I/O across 8 boxes. Having all the RAM in
| one place might reduce how much RAM you need in total. But
| if you need huge amounts of RAM or network I/O, scattering
| it across many boxes is probably easier and cheaper.
|
| The heat from this CPU is fairly reasonable (about that of an
| RTX 3070, to stay with consumer comparisons). It pays for that
| with low clock speed, so your workload has to somehow
| profit from having all the cores in one place.
| throwaway4good wrote:
| Actually it sounds like the arm alternative is a lot more for any
| practical application.
| jabej wrote:
| Wake me up when those cores can run x86 code.
| rbanffy wrote:
| Why? Can't you recompile your code?
| selfhoster11 wrote:
| Emulation is possible, and now even easy. Windows on ARM (WoA)
| has built-in x86 emulation, and Linux can do something similar
| with QEMU user-mode emulation (including stubbing out calls to
| dynamically linked libraries with calls to native ones).
|
| This is in addition to the fact that a lot of server-side code
| runs in an interpreted/JIT language or on a bytecode VM like
| Java or .NET.
| montalbano wrote:
| You should be awake already then:
|
| https://docs.microsoft.com/en-us/windows/uwp/porting/apps-on...
| sidkshatriya wrote:
| I can understand requiring consumer systems to offer x86
| emulation like the Mac M1s do. This is because consumers tend
| to be less sophisticated, expect their old applications to run
| like before, etc.
|
| This product is squarely positioned as a server product. I
| don't see the need to offer performant x86 emulation like the
| Mac M1 does (you could run QEMU if you wanted, though that's
| going to be slowish).
|
| You have all the major programming languages, HTTP servers,
| databases, and OSes running on ARM Linux. If you're buying an
| Ampere server it would sure be a waste to expect it to run x86
| in emulation!
| samus wrote:
| Hardware with x86 compatibility is only relevant if for some
| reason you are stuck with x86-only software somewhere in your
| stack, or you have specialized hardware that you can't run on
| another platform for some reason. But I'd argue most new
| applications aren't dependent on x86: most modern and relevant
| programming languages have production-grade compilers and
| runtime environments for ARM, and increasingly also for RISC-V.
| [deleted]
| nextweek2 wrote:
| In a past life when I would specify servers, the first
| requirement was ECC RAM.
|
| Is that a thing with ARM or are these servers used for something
| where you don't care about data assurance (like gaming)?
| 8K832d7tNmiQ wrote:
| Do enterprises even care about price per core?
|
| > Ampere positions its Altra and Altra Max processors with up to
| 128 core largely for hyperscale providers of cloud services.
|
| > That leaves the company with a fairly limited number of
| potential customers.
|
| Even the article itself admits that this is a niche product.
| childintime wrote:
| Does anyone have an idea how these cores compare to, say,
| SiFive's latest, the P650?
|
| The P650 has only 16 cores, but it looks like it should be able
| to compete with the $800 32-core Ampere running at 1.7GHz.
| floatboth wrote:
| The P650 core is compared to Cortex-A77 (while Neoverse-N1 is
| an offshoot of A76), but N1 is from 2019. P650 is still in the
| "dev kit somewhere in 2022" stage, while the new Neoverse-N2 is
| in the "AWS is letting _EC2 customers_ test drive it already"
| stage: https://aws.amazon.com/blogs/aws/join-the-preview-
| amazon-ec2...
| Symmetry wrote:
| I haven't seen any data on absolute P650 performance. I've seen
| some on SPECInt2006/GHz which is interesting but not useful for
| performance comparisons without the clock speed.
| zackmorris wrote:
| This is great! I've waited for chips approaching something like
| 256 to 1024 cores since the 90s. These CPUs still leave a lot to
| be desired, but I sense that there's been a shift towards better
| uses of transistor count than just single-threaded performance.
|
| These are still roughly 10 times more expensive than they should
| be because of their memory architecture. I'd vote to drop the
| idea of buses to external memories and switch to local memories,
| then have the cores self-organize by using web metaphors like
| content-addressable memory (CAM) to handle caching. Basically,
| get rid of all of the cache-coherence hardware and treat each
| core-memory pair as its own computer. The hardware that wasn't
| scalable could go toward hardware-accelerated hashing for the
| CAM.
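|
| A toy sketch of the routing idea (names and constants are
| purely illustrative, nothing standardized): hash the address to
| find the core whose local memory owns it, so no global
| coherence traffic is needed:
|
|     #include <stdint.h>
|
|     #define NUM_CORES 128u
|
|     /* Hypothetical helper: map an address to the core whose
|      * local memory "owns" it; a real design would do this hash
|      * in hardware. */
|     static inline uint32_t owner_core(uint64_t addr) {
|         uint64_t h = addr * 0x9E3779B97F4A7C15ull; /* Fibonacci hash */
|         return (uint32_t)(h >> 32) % NUM_CORES;
|     }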
|
| And a somewhat controversial opinion - I'd probably drop 64 bit
| also and either emulate 64 bit math on an 8/16/32 bit processor,
| or switch to arbitrary precision. That's because the number of
| cores is scalable and quickly dwarfs bits calculated per cycle.
| So we'd take say a 10% performance hit for a 100% increase in the
| number of cores, something like that. This would probably need to
| be tested in simulation to know where the threshold is, maybe 64
| cores or something. Similar arguments could be used for clock
| speed and bus width, etc.
| samus wrote:
| Number of cores is just another metric to optimize for. What
| counts in the end is whether it can efficiently and quickly
| deal with the load it is expected to handle.
|
| Many cores are great for mostly independent tasks, but
| performance will suffer as soon as communication is required.
| Making chip architectures more distributed seems to be the
| state of the art at the moment, but this doesn't mean we will
| suddenly be able to escape Amdahl's Law. To be specific, for
| inherently serial applications where we are absolutely
| interested in getting the result ASAP, single-thread
| performance remains crucial.
| rbanffy wrote:
| > I'd probably drop 64 bit also
|
| There are a lot of other nice things that would have to go
| along with that. A 32-bit linear address space is not enough
| for a lot of the things we do today, especially not in servers.
|
| Having some memory dedicated to a given core is clever,
| however, provided we have the required affinity settings to
| match (moving a task to a different core would imply copying
| the scratchpad to the new core and would be extremely costly -
| much more than the cache misses we account for in current
| kernels).
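|
| The affinity settings already exist on Linux; a minimal sketch
| with the standard sched_setaffinity API (illustrative, nothing
| specific to these parts):
|
|     #define _GNU_SOURCE
|     #include <sched.h>
|     #include <stdio.h>
|
|     /* Pin the calling thread to one core so its working set
|      * stays next to that core's caches (or, hypothetically, its
|      * scratchpad). */
|     int pin_to_core(int core) {
|         cpu_set_t set;
|         CPU_ZERO(&set);
|         CPU_SET(core, &set);
|         return sched_setaffinity(0, sizeof(set), &set);
|     }
|
|     int main(void) {
|         if (pin_to_core(3) != 0)  /* e.g. pin to core 3 */
|             perror("sched_setaffinity");
|         return 0;
|     }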
|
| What I would drop immediately is ISA compatibility. I have no
| use for it provided I can compile my code on the new ISA.
| zackmorris wrote:
| Oh I just meant that the CPU might have a 64 bit interface
| and address space, but be 32 bit or less internally. Mostly
| the microcode would provide the emulation, but maybe a few
| instructions like 64 bit multiply would have dedicated
| hardware. If that turns out to be the bulk of the hardware
| needed in the first place, then maybe my idea wouldn't work.
| Maybe a hybrid design, half the transistors for a 5%
| performance loss or something.
|
| Ya I agree, ISA is kind of an antiquated concept. Intel chips
| basically emulate x86 instructions in microcode today, as a
| glaring example.
| rbanffy wrote:
| I have the impression reorder buffers take a huge amount of
| space, especially with irregularly sized ISAs like x86.
| Simplifying the ISA could free up a lot of space for 64-bit
| multiply units.
| SV_BubbleTime wrote:
| Or... just use 64-bit throughout? I'm not following what
| you are trying to save by adding a mixed system and
| emulation. I can't imagine dropping 64-bit to pump cores up.
| dmitrygr wrote:
| > treat each core-memory as its own computer.
|
| There was a company about a decade back that did this. I seem
| to remember it was useful for web serving. Bought by AMD. Not
| sure what happened next. Look them up. Name was SeaMicro
| rbanffy wrote:
| They were discontinued. I ported some enablement software
| from Python 2 to 3 for it as part of Canonical's MaaS
| platform (we had one in the OpenStack Integration Lab, IIRC).
|
| I wouldn't mind finding one on my driveway ;-)
| dragontamer wrote:
| > These are still roughly 10 times more expensive than they
| should be because of their memory architecture. I'd vote to
| drop the idea of busses to external memories and switch to
| local memories, then have the cores self-organize by using web
| metaphors like content-addressable memory (CAM) to handle
| caching. Basically get rid of all of the cache coherence
| hardware and treat each core-memory as its own computer. The
| hardware that wasn't scalable could go to hardware-accelerated
| hashing for the CAM.
|
| If you want "many cores" and "get rid of cache-coherence
| hardware", it's called a GPU.
|
| Yes, a lot of those "cores" are SIMD-lanes, at least by NVidia
| / AMD naming conventions. But GPU SIMD-lanes have memory-
| fetching hardware that operates per-lane, so you approximate
| the effects of a many-many core computer.
|
| -------
|
| Japanese companies are experimenting with more CPUs though.
| PEZY "villages" are all proper CPUs IIRC, but this architecture
| isn't very popular outside of Japan. In terms of the global
| market, your best bet is in fact a GPU.
|
| The Fujitsu supercomputer (Fugaku) was also ARM-based + HBM2.
| But once again, that's a specific Japanese supercomputer and
| not very popular outside of Japan. It is available, though.
| freemint wrote:
| Japanese companies such as NEC are also selling vector
| computers such as the NEC SX-Aurora TSUBASA plug-in cards.
| tromp wrote:
| When I hear "priced at a fraction of", I would assume the
| fraction to be well under a half.
|
| Here, it's used for a fraction of 5800/7890 ~ 73.5%, which I find
| rather misleading.
|
| It would be more accurate to say it costs a fraction _less_ than
| x86.
| [deleted]
| larsbrinkhoff wrote:
| That's an odd assumption. Let's be rational: 1.5 is also a
| fraction. 1/pi isn't, even though it's less than 0.5.
|
| I don't mean to sound negative, but it's not very complex.
| Cantinflas wrote:
| Well, 5/2 is also a fraction... What would you think of the
| headline if the fraction they referred to was 5/2?
| hdjjhhvvhga wrote:
| If you ignore the meaning assigned to it in everyday English
| and concentrate on the mathematical sense, it leads to
| absurdities as practically all physical objects can be
| described in terms of fractions as they include improper
| fractions, irrational and complex numbers in the nominator or
| denominator etc.
|
| "A fraction of the price" means "significantly less". When
| someone promises to sell you a new Macbook for a fraction of
| the price but it turns out the price difference is 5 cents,
| from your point of view they're correct but you have every
| right to feel cheated.
| ben_w wrote:
| It's not a rational fraction, but I think it's still a
| fraction. And now I'm wondering about 1/i, inverse
| quaternions (I barely get those even normally), and whether
| 1/M = M^(-1).
|
| (Nice puns, BTW).
| OJFord wrote:
| A rational number is one that can be written as a fraction
| (of two integers).
|
| What's a rational fraction? Integer parts as opposed to
| e.g. 1.5/2.3?
| samus wrote:
| 1/i = (1/i) * (i/i) = i/(i*i) = i/(-1) = -i
| caddybox wrote:
| A rational number (with a terminating decimal
| representation or a repeating, non-terminating decimal
| representation) can always be expressed in fractional p/q
| form.
|
| I don't think the additional "rational" qualifier is needed
| for fractions.
| ben_w wrote:
| Looks like all three replies to this share the same
| misunderstanding of my intent, so I assume my use of
| "it's" was unhelpfully vague.
|
| 1/pi is not rational and therefore not "a rational
| fraction", but I think 1/pi is "a fraction".
| hdjjhhvvhga wrote:
| Of course these exist and that's why we have the process of
| rationalization.[0]
|
| https://en.wikipedia.org/wiki/Rationalisation_(mathematics)
| Asraelite wrote:
| "A fraction of" does not literally mean a fraction of in the
| mathematical sense. Its commonly accepted use in English is
| much more refined.
| BackBlast wrote:
| I've been using them in production. I've been pleased with the
| value proposition offered. It helps that, on our workload, arm
| overperforms relative to published benchmarks while AMD
| underperforms.
| sidkshatriya wrote:
| Which core-count version do you have? Are you able to use all
| the software you need to? Are the machines "rock solid" or
| are there some teething issues?
| BackBlast wrote:
| Not able to use it for the whole stack. Some of our stack
| doesn't run on arm.
|
| Don't have a bare metal machine, just some modest sized
| virtual machines with 2 cores each.
|
| They have been stable.
| amelius wrote:
| Before I came here I just knew the top comment would be about
| this headline rather than the actual meat of the content.
|
| I just wish we could have a separate moderation type for
| comments like this, so they can be moved to the bottom.
| Complaints about a website being slow, popups, etc. also fall
| in this category.
| simonh wrote:
| What really irks me is quibbling about headlines that are
| technically inaccurate, but are actually just common everyday
| usage that everyone understands perfectly well not to take
| literally. Comments like that have deeply negative value and
| drive me up the wall. In this case though I think I agree
| with the criticism, in everyday usage "a fraction of" is
| generally taken to mean quite a small fraction, which this
| isn't.
|
| Another commenter is applying the literal definition of
| fraction to argue the title is accurate, and quite rightly
| getting downvoted into oblivion for it. Everybody (who is a
| native English speaker) knows very well what this phrase is
| taken to mean in everyday English.
| Snoozus wrote:
| Well what is the meat of the content? The article says they
| are a bit cheaper, overstates by how much and does not put
| that in relation to any measure of performance.
|
| "Lada Cost a Fraction of Bentley" thank you very much.
| bjarneh wrote:
| > rather than the actual meat of the content
|
| To be fair, there isn't much "meat" in that article, it's
| mainly about the price - and I guess most would agree that
| when we hear the word "fraction" used like that we don't
| think about 73/100.
|
| If you were told you could get a laptop at a fraction of the
| cost on Black Friday, and the fraction turned out to be
| 98/100; wouldn't you be somewhat confused?
| panini-grammar wrote:
| Please note the processors under study:
|
| The Ampere processor has 128 cores ($5.8K, or $45 per core),
| the AMD processor has 64 cores ($7.9K, or $123 per core), and
| the Intel processor has 40 cores ($8.1K, or $202 per core).
|
| Still, Ampere's per-core cost is less than half of AMD's or
| Intel's. So I think the article used the word 'fraction'
| correctly.
|
| Yes, there may be other advantages to Intel/AMD which I don't
| have knowledge of, e.g. clock speed, L1/L2/L3 cache size and
| speed, peripherals, etc. (I tend to feel Ampere is better in
| these too; the question is more about chipset, SW base, etc.)
| ekianjo wrote:
| are the cores equivalent though?
| calaphos wrote:
| In benchmarks, Ampere's cores are roughly as fast as a
| hyperthread on current-gen Epyc. Sometimes more on heavily
| execution-bound workloads, sometimes less.
| sidkshatriya wrote:
| Good point. Intel's 40 cores are basically 80 cores due to
| hyper threading. Some people don't like hyperthreading for
| various reasons but it would be better to say the
| comparison is between 128 cores of Ampere and 80 cores of
| Intel.
|
| But then again, number of cores is really a crude measure.
| We need to measure what those machines can really do...
|
| P.S. Surprised I am being downvoted. If you only want to
| compare the number of cores, then I was implying that 80 is a
| better number than 40 to compare with 128. It gives you a
| rough idea of parallelism, even though obviously HT does not
| give you 2x the available cores all the time.
|
| I also added the caveat that the number of cores is a crude
| measure anyway.
| Asmod4n wrote:
| Last time I checked, a hyperthread is ~75% slower than a
| normal core, so 40 cores plus 40 hyperthreads is roughly a
| 50-core CPU.
| reitzensteinm wrote:
| This depends heavily on your memory access patterns.
|
| If you're chasing dependent loads around memory, like
| traversing a long fragmented linked list due to garbage
| code, you will get an absolutely perfect 100% speed up.
|
| If you're already saturating the ALU on one thread,
| adding another to the core will probably slow you down
| with context switching and cache contention, and indeed
| well written numeric simulations on supercomputers often
| turn it off entirely.
|
| However, most software we run resembles the garbage
| pointer chasing variety more than the finely tuned
| numerical variety.
| Const-me wrote:
| > most software we run resembles the garbage pointer
| chasing variety more than the finely tuned numerical
| variety.
|
| To be fair, crappy pointer-heavy data structures are not
| the only possible reason. Some useful algorithms are
| inherently serial.
|
| For instance, all streaming parsers, compressors, and
| cryptography are inherently serial algorithms. An
| implementation is going to be relatively slow not because
| of RAM latency, but due to a continuous chain of data
| dependencies between sequential instructions.
|
| It's technically possible to implement single-threaded
| code that handles multiple streams concurrently.
| Practically, more often than not it's prohibitively
| complicated to achieve. However, doing that in hardware
| with two hardware threads running on the same core is way
| more manageable in terms of software complexity.
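|
| A minimal sketch of that single-threaded, two-streams-at-once
| approach with a toy checksum (illustrative only, not a real
| codec): the two chains are independent, so a wide core's spare
| execution units can advance both - which is what SMT buys you
| across two hardware threads without the software contortions:
|
|     #include <stddef.h>
|     #include <stdint.h>
|
|     /* Toy "streaming" checksum over two independent buffers in
|      * one thread: sum0 and sum1 are separate dependency chains,
|      * so the core can advance both per cycle. */
|     void checksum2(const uint8_t *s0, const uint8_t *s1, size_t n,
|                    uint64_t *out0, uint64_t *out1) {
|         uint64_t sum0 = 0, sum1 = 0;
|         for (size_t i = 0; i < n; i++) {
|             sum0 = sum0 * 31 + s0[i];  /* chain 0 */
|             sum1 = sum1 * 31 + s1[i];  /* chain 1 */
|         }
|         *out0 = sum0;
|         *out1 = sum1;
|     }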
| reitzensteinm wrote:
| Yeah, garbage code was a little unfair.
|
| I actually experimented with parallel fetching of
| multiple values out of a large persistent vector years
| ago, and saw nearly linear speed up for up to four
| fetches in parallel.
|
| The code was awful and it would need to be a compiler
| generated thing for sure.
| goldenkey wrote:
| Once mitigations for Spectre, Meltdown, and whatnot are
| enabled, is hyper threading still a real gain?
| sidkshatriya wrote:
| Newer generation Intel CPUs deal with some of these
| issues on a hardware level.
|
| See
| https://www.intel.com/content/www/us/en/developer/topic-
| tech...
|
| So the level at which your Intel CPU is "crippled" would
| depend on how old it is, I guess.
| goldenkey wrote:
| Are these hardware fixes just bandaids or are the actual
| architectural flaws really fixed? Because didn't other
| processors that did speculative execution besides Intel's
| also have similar vulnerabilities?
|
| A lot of expert level folks on here were saying that the
| architecture surrounding speculative execution would need
| a total removal or reengineering to fix it.
| holbrad wrote:
| That isn't how hyperthreading works... It's completely
| dependent on the workload.
|
| If you're using the CPU core as efficiently as possible,
| you'd see no benefit from hyperthreading.
|
| If you're using it very poorly, you'd see a massive ~2x
| benefit.
| sidkshatriya wrote:
| > If you're using it very poorly, you'd see a massive ~2x
| benefit.
|
| I find this statement tough to agree with. It depends on
| the kind of work your processors are doing: is it IO-heavy
| work or are you running math computations? There are other
| axes, basically related to how much work can be done by
| the current thread while the other thread stalls waiting
| for data to be fetched (or for other non-parallelizable
| dependencies to become available).
|
| So if you're getting a high benefit you shouldn't
| necessarily feel embarrassed!
| esens wrote:
| When doing C++ compiles and also renders I see a 2x
| benefit generally. It is rare that I do not see something
| close to 1.5 to 2x speed ups.
|
| That said I did switch from Intel over to an Apple M1
| anyway.
| mschuster91 wrote:
| Most compilers are one of the _prime_ cases of
| inefficient architecture, and it gets worse the larger
| the project gets. There is an awful lot of time lost
| waiting on I/O for hundreds to thousands of files (even
| assuming that there is enough RAM for the OS to cache all
| the files and metadata, every file that is read has _at
| least_ three syscall context switches for
| open/read/close, ditto for intermediate writes) and for
| process creation and destruction.
|
| What I would find really, _really_ interesting: a
| "single-process" compiler that has a global in-RAM cache
| for all source contents and intermediate outputs and can
| avoid the overhead of child processes... basically a
| model like Webpack or Parcel that has an inotify watcher
| and is constantly running. The JS world had no other
| choice, with NodeJS/npm all but forcing the tooling to
| adapt to a lot of incredibly small source files; it's
| time for the "classic" world to adapt.
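|
| The always-running watcher half of that is already cheap to
| build on Linux; a minimal sketch using the standard inotify
| API (illustrative only, not an existing tool):
|
|     #include <stdint.h>
|     #include <stdio.h>
|     #include <unistd.h>
|     #include <sys/inotify.h>
|
|     /* Watch a source directory and report changed files; a
|      * resident compiler daemon would recompile just those
|      * translation units from its in-RAM cache. */
|     int main(int argc, char **argv) {
|         const char *dir = argc > 1 ? argv[1] : ".";
|         int fd = inotify_init1(0);
|         uint32_t mask = IN_CLOSE_WRITE | IN_MOVED_TO;
|         if (fd < 0 || inotify_add_watch(fd, dir, mask) < 0) {
|             perror("inotify");
|             return 1;
|         }
|         char buf[4096]
|             __attribute__((aligned(__alignof__(struct inotify_event))));
|         for (;;) {
|             ssize_t len = read(fd, buf, sizeof(buf));
|             if (len <= 0)
|                 break;
|             for (char *p = buf; p < buf + len;) {
|                 struct inotify_event *ev = (struct inotify_event *)p;
|                 if (ev->len)  /* recompile hook would go here */
|                     printf("changed: %s/%s\n", dir, ev->name);
|                 p += sizeof(*ev) + ev->len;
|             }
|         }
|         return 0;
|     }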
| l33tman wrote:
| This article is mostly useless, it just focuses on number of
| cores... I didn't find even one mention in the article of actual
| benchmarks.
| berkut wrote:
| Anandtech had a review of it here (and a 2-socket system) against
| Xeons and EPYCs:
|
| https://www.anandtech.com/show/16979/the-ampere-altra-max-re...
| Symmetry wrote:
| On a socket-to-socket basis, it scores significantly better
| than Intel's and AMD's offerings on some SPECint tests,
| significantly worse on others, and about the same on average
| as an AMD 7763, which has 64 cores with 128 threads. On SPECfp
| it's a bit behind AMD but does better than Intel, which sort
| of surprised me given AVX-512.
| robocat wrote:
| Linus on AVX512:
| https://www.phoronix.com/scan.php?page=news_item&px=Linus-
| To...
|
| Perhaps the implementation of AVX512 instructions has
| improved, but the earlier versions just caused weird global
| performance decrease side-effects, making the instructions
| virtually useless for general purpose code.
|
| Also, if you have an application that really benefits from
| fast FP, then you nearly always want a GPU.
| imachine1980_ wrote:
| Oracle Cloud has a free tier with four Ampere vCores; if you
| want to test on ARM it's probably the best option.
| josephg wrote:
| So? The real metric isn't the price or the number of cores. Its
| how much performance you get per dollar. (And sometimes also,
| perf per watt or perf per RU).
|
| The headline may as well be "Slow CPU cheaper than fast CPU".
| This is not newsworthy.
| freemint wrote:
| For SAT solving and compiling those machines have really good
| performance. As for other workloads I didn't test these.
| capableweb wrote:
| What do you mean "the real metric isn't X"? Different metrics
| are useful under different circumstances. Performance/USD
| doesn't matter if you're not price sensitive for example, but
| it's still "a real metric" for the ones that are price
| sensitive.
| masklinn wrote:
| The comment you're responding to is not that
| "performance/usd" isn't a metric, it's that price or core
| count alone are not useful information.
|
| They're specifically saying performance/$ _would_ be useful.
| dahfizz wrote:
| There's some benchmarks here
|
| https://www.anandtech.com/show/16979/the-ampere-altra-max-re...
|
| Depending on workload, they seem to perform very well
| philjohn wrote:
| Until you hit the limitation in cache size.
| maxwell86 wrote:
| For our HPC workload these CPUs perform around 25% better than
| the AMD ones being compared.
|
| "Fast CPU is cheaper than slow CPU" is newsworthy.
| bee_rider wrote:
| What kind of workloads do you have?
|
| I often wonder how these Ampere chips would do on sparse
| matrix operations. Potential for parallelism, but possibly
| less vector potential, to keep the GPUs away...
| slaymaker1907 wrote:
| This makes sense to me since the x86 chips have the advantage
| of running a lot more software out of the box. Sure a lot of
| software can just be recompiled, but compiling for ARM can be
| very difficult for software doing a lot of low level
| performance tricks.
| josephg wrote:
| Great! That's what I want to know.
|
| Thank you - your comment is more informative for me than the
| article.
| rbanffy wrote:
| Even if a Neoverse N1 core had a third of the performance
| of a Xeon core, the part has about 3x as many cores as a
| top-of-the-line Xeon for about 75% of the part price.
|
| After reading the article it is quite obvious this has a
| much higher performance per dollar than a Xeon.
| [deleted]
| alberth wrote:
| AnandTech indicates that the M1 (non-Pro) provides 2x the
| overall performance of what Amazon's Graviton2 (Ampere) can
| provide.
|
| https://www.anandtech.com/show/16252/mac-mini-apple-m1-teste...
|
| https://images.anandtech.com/graphs/graph16252/111168.png
|
| Seems like Ampere has a long way to go when an 8-core M1 (4
| perf cores + 4 eff cores) can beat a 64-core Ampere.
| omni wrote:
| Those are single-core tests, it's expected that a consumer CPU
| will beat most server CPUs in that category. If you think the
| M1 can beat a 64-core anything in multicore tests then I have a
| bridge to sell you.
|
| M1 multi-core: https://www.anandtech.com/show/16252/mac-mini-
| apple-m1-teste...
|
| Graviton 2 multi-core:
| https://www.anandtech.com/show/15578/cloud-clash-amazon-grav...
| omni wrote:
| I can't edit but I screwed up the M1 link, corrected:
| https://www.anandtech.com/show/16252/mac-mini-
| apple-m1-teste...
| rbanffy wrote:
| The number of cores available on M1 is a consequence of
| Apple's product lineup. There is no 64-core M1 yet because
| there is no product for it. Yet.
| wmf wrote:
| Ampere at 3.3 GHz is significantly faster than Graviton2 at 2.5
| GHz.
| cwillu wrote:
| Original information from
| https://www.phoronix.com/scan.php?page=news_item&px=Ampere-A...
| bithavoc wrote:
| This should be the link of the post. HN should ban links to
| sites like tomshardware.com, where the scroll is hijacked to
| force-play videos and which don't provide meaningful
| information over the original Phoronix articles.
| IntelMiner wrote:
| Being less information dense than Phoronix is an achievement
|
| That idiot stuffs his "articles" with so many pages littered
| with ads to milk every single penny he can out of the few
| Linux users that don't have adblock enabled
| 5e92cb50239222b wrote:
| That "idiot" also works 20 hours a day and barely breaks
| even at the end of the month despite all those
| advertisements. It's a one-man shop. I am no fan of
| Phoronix, but give him a break.
| stonogo wrote:
| Maybe there isn't a market for working 20 hours a day
| posting content-free articles sourced from other freely-
| available news sites, and he should find other work?
| samus wrote:
| He also maintains a general-purpose benchmark collection
| and uses it to perform regular performance tests for to-
| be-released Linux kernels. Several times serious
| performance regressions were found (most recently because
| of AMD power management and on Alder Lake). Maybe he
| should indeed stop helping big companies save money this
| way, but the Open Source scene would be worse off as a
| whole.
| cwillu wrote:
| Believe me, it hurt my soul to point out that this was a
| rehash of something michael wrote.
|
| But the fact remains that the only informational content of
| this post came out of the legwork performed by another
| party.
|
| I can't read the word "phoronix" without mentally
| substituting "moronix", but the point stands.
| PedroBatista wrote:
| These ARM CPUs/boards as a cheaper replacement for x86 in
| real-world general computing are becoming a "Year of the Linux
| desktop" by now.
|
| I've been hearing this news for more than a decade and it still
| hasn't materialized into anything meaningful if you take into
| account how much of a "fraction of the cost" they are
| advertised as.
|
| Edit: I was referring to a regular person or SME buying
| something like a couple of "ATX" boards or "regular servers".
| Since it's a "fraction of the cost" I don't get why it hasn't
| spread like wildfire yet. I wasn't talking about giant cloud
| companies who place orders in the hundreds of thousands at
| least and many of whom design their own hardware by now. Nor
| was I talking about a CPU that comes attached to +$1000 of gray
| aluminum.
|
| Raspberry Pi and its "clones" are closer to what I was talking
| about, but not really.
| rbanffy wrote:
| The CPU costs a fraction of the equivalent Xeon, but the CPU is
| not the only part in a server BoM, nor is it the most expensive
| subsystem of the box. When you add a terabyte of RAM, extra
| networking, and a bunch of SAS SSDs and HDDs, the CPU cost is
| almost negligible.
|
| Most companies that buy x86 servers have no desire to recompile
| their software for a new architecture - they want to run
| PowerBI or SharePoint. They don't really benefit from a machine
| like this.
| matwood wrote:
| Not sure if it's 'real world' enough, but for many workloads
| Graviton2 (AWS ARM) is a drop in replacement for x86. Last year
| I moved a lot of workloads over with very little effort.
| jagger27 wrote:
| What? General computing as in supercomputers?
|
| https://www.fujitsu.com/global/about/innovation/fugaku/
|
| Apple's entire Mac lineup is going to be Arm-powered by year
| end too.
|
| Cost comparison is almost useless in this market because there
| simply are not enough wafers to meet demand. The price per core
| per watt is the real world comparison, and my inaudible MacBook
| fan demonstrates that beautifully.
| formerly_proven wrote:
| TOP xx supercomputers are not general computing. The software
| stack basically doesn't matter because everything is going to
| be custom anyway. That's why you see custom ISAs, funky ISA
| extensions, one-off accelerators and single-purpose operating
| systems in them. None of that would fly for commercial
| applications.
| thrwyoilarticle wrote:
| There was a time when people were sceptical about patches
| to get Linux running on those little 32 bit
| microcontrollers.
| rbanffy wrote:
| > and single-purpose operating systems in them
|
| Do they? Aren't most of them running Linux?
| smoldesu wrote:
| Linux isn't an operating system. There are many single-
| purpose operating systems built with Linux though, which
| is why I think the parent comment is still correct. As
| I'm sure you can imagine, those machines are multiplexed
| into hundreds or thousands of containers, VMs and
| supervisor kits, and those are definitely designed to run
| extremely niche containerized software or specialized
| distributions.
| rbanffy wrote:
| > Linux isn't an operating system.
|
| While some of those VMs may be running trimmed down OSs
| tailored for their individual workloads, the fact remains
| they are all running under some sort of GNU/Linux or, at
| least, relying on the services of a Linux kernel.
|
| Now that you mention it, WSL1 should be called
| GNU/Windows instead, because it's the GNU userland on top
| of a Windows kernel with Linux lipstick.
| [deleted]
| nine_k wrote:
| ARM-based instances on AWS are a thing, and they do cost less.
|
| Not _dramatically_ less, like 15% of x64 instances, but still.
| swdev281634 wrote:
| Oracle's cloud servers are OK. Their physical servers have 160
| cores (2x Ampere Altra Q80-30, 80 cores/each), 1TB RAM, and 100
| Gbps network bandwidth (2x 50Gbps cards). They can also cut
| these servers into VMs and offer these smaller VMs.
|
| The software story is OK by now. I had little to no issues with
| aarch64 Linux in their VMs. I didn't need a lot though, only
| MySQL, the ASP.NET Core runtime, and related OS setup (SELinux,
| built-in firewall, etc).
| thrwyoilarticle wrote:
| Not counting cloud computing (and, presumably, Apple computers)
| is akin to the Linux naysayers not counting Android after the
| very essence of desktop computing was uprooted.
|
| Standardisation and critical mass is the hard part of the
| puzzle for Arm64 desktops. But it was also the hard part of the
| puzzle for supercomputing and cloud servers, where it now has a
| firm foothold. Personally, I work in an industry where everyone
| is moving away from x86 and developing on physical Arm64
| machines because x86 simply can't fit the power budget in
| production.
| selfhoster11 wrote:
| ARM has been pretty weak for more than a decade, and only
| started getting decent CPU performance and RAM sizes with a
| reasonable pricing very recently. Even x86 computing took a
| while to develop into something useful.
| bullen wrote:
| 2 watts per core at 3GHz is pretty impressive, what nanometer are
| they at?
|
| But still, memory bandwidth is going to keep those 128 cores
| from doing anything jointly parallel; you might actually be
| better off with many smaller 4-8 core machines.
| jagger27 wrote:
| TSMC N7.
| bullen wrote:
| Thx
| voxadam wrote:
| > what nanometer are they at?
|
| It looks like 7nm.[0] I suppose that means they're coming out
| of TSMC, which isn't terribly surprising.
|
| [0] https://www.servethehome.com/ampere-altra-
| max-m128-30-128-co...
| marcan_42 wrote:
| This is a good point. They have 8 channels of DDR4-3200, which
| is 204 GB/s, to be shared across 128 cores. That's _half_ the
| total memory bandwidth of the M1 Max, at 400 GB/s, and it only
| has 10 cores and a (decently beefy) GPU.
|
| A single M1 P-core can saturate 70GB/s of memory bandwidth, and
| not even at max clock. If these 128 cores were individually as
| good as the M1 cores, they'd need _350 channels_ of DDR4-3200
| RAM to get that nonblocking per-core memory bandwidth. The M1
| Max can't achieve that, ending up at somewhere around 28GB/s
| per core with all P-cores competing, but even then this
| 128-core CPU would need _140 channels_ of DDR4-3200 to achieve
| the same memory bandwidth per core. So I imagine this kind of
| CPU will scale well to compute-heavy workloads that mostly stay
| in cache, but rather poorly for throughput-heavy workloads that
| stream data to/from RAM very quickly. The RAM is more than 10x
| oversubscribed; you might find that RAM-bound workloads
| saturate with just 12-16 of those 128 cores active.
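|
| For reference, the per-channel arithmetic behind the 204 GB/s
| figure (standard DDR4-3200 numbers, worked as a tiny program):
|
|     #include <stdio.h>
|
|     /* Back-of-the-envelope DDR4-3200 arithmetic (standard DDR4
|      * figures; the 128-core split is this thread's example). */
|     int main(void) {
|         double per_channel = 3200e6 * 8 / 1e9;  /* 25.6 GB/s   */
|         double total       = per_channel * 8;   /* ~204.8 GB/s */
|         double per_core    = total / 128;       /* ~1.6 GB/s   */
|         printf("%.1f GB/s/channel, %.1f total, %.2f per core\n",
|                per_channel, total, per_core);
|         return 0;
|     }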
| crest wrote:
| According to all the benchmarks I've seen, the M1 Max doesn't
| offer significantly more memory bandwidth to the CPU cores
| (unless the GPU is busy as well). You have to run a
| microbenchmark on all P and E cores to reach the internal bus
| limitation at ~240GB/s. The remaining ~160GB/s can only be
| utilised by the other parts of this monster SoC (GPU, TPU,
| video decoders/encoders, etc.), and in my 14" machine the GPU
| is throttled by a power limit (probably required by the
| available cooling).
|
| I would love to see how much performance potential there is
| in the M1 Max die (better cooling with raised voltage and
| power limits).
| marcan_42 wrote:
| That's what I said: you get 28 GB/s per P-core for a total
| of around 220GB/s across 8 P-cores (adding the two E-cores
| adds a bit more to reach 240). That's still more than the
| Ampere CPU can offer its 128 cores.
| cma wrote:
| Most uses that need that few operations per byte would be IO
| limited I would think, or better suited to GPU compute.
|
| M1 max has a very beefy GPU so the bandwidth is mostly sized
| for that. Its GPU is 22 single-precision TFLOPS or something
| but still has less memory bandwidth than a 2080ti which is
| only 14TFLOPs or so.
| marcan_42 wrote:
| Keep in mind that the GPU architecture is TBDR, which is
| designed for lower memory bandwidth (and thus lower power
| consumption) than typical immediate mode GPUs. Desktop GPUs
| also use tiles these days, but for rendering use cases at
| least the TFLOPS:bandwidth ratio is probably not directly
| comparable between architectures that dissimilar.
|
| The CPUs can still eat half of the M1 Max's memory
| bandwidth alone, without the GPU to help. How often this
| happens with real world workloads, I don't know :-)
| ThrowawayR2 wrote:
| The M1 Max gets that bandwidth from putting the RAM into the
| processor package. There is no way that server processors can
| fit the much higher RAM per core used for server workloads
| onto the processor package so it's unclear that the
| comparison is meaningful.
| namibj wrote:
| POWER10's memory architecture offers a TB/s in memory
| bandwidth per die. And about another TB/s in other
| bandwidth for SMP and IO. Only using this all for memory,
| they'd be at an equivalent of around 80 channels. Not quite
| the 140, but much closer than 8.
| marcan_42 wrote:
| You need to scale the DRAM channels with the number of
| cores; you can't just pack more and more cores into one die
| without adding DRAM width. Yes, you then run into physical
| limitations, and that's kind of the point I'm trying to
| make here :)
| nine_k wrote:
| Often you're after tail latency, not throughput.
|
| It's pretty normal to target e.g. your web backend tier to not
| exceed 20% CPU load, and be ready to instantly react to
| incoming I/O.
| silicaroach wrote:
| should read "... at fraction of the _comparable_ x86 Price" :-D
| gfykvfyxgc wrote:
| ARM has a big problem .... compatibility.
|
| x86 absolutely destroys ARM on the software front. Give me an
| x86 CPU and I can run any version of Linux on it with wide
| hardware support.
|
| Give me a random ARM computer, you're lucky if you can get it to
| boot.
|
| You're even luckier if the ARM CPU manufacturer releases
| documentation about how their chips work .... They prefer to keep
| inner workings secret.
| capableweb wrote:
| > Give me a random ARM computer, you're lucky if you can get it
| to boot.
|
| That's wildly inaccurate. Many distributions have ARM builds
| that work perfectly fine (at least for me, YMMV); have you
| tried any of them?
|
| Here is some of them:
|
| - Arch Linux - https://archlinuxarm.org/
|
| - Ubuntu - https://ubuntu.com/download/server/arm
|
| - Alpine Linux - https://alpinelinux.org/downloads/
|
| - Elementary OS - https://blog.elementary.io/elementary-os-on-
| raspberry-pi/
|
| - NixOS - https://nixos.wiki/wiki/NixOS_on_ARM
| fork-bomber wrote:
| I don't think that's largely true anymore for server class
| hardware - which is the focus of the article.
|
| Arm came up with a bunch of standardisation requirements quite
| a while ago (see:
| https://developer.arm.com/architectures/system-
| architectures...) which have been quite successful especially
| for server designs.
|
| That was an absolute requirement in order for AArch64 to even
| be considered as an alternative in the datacenter space where
| it is now a very compelling alternative to x86_64.
|
| What I mean specifically is standardised support for firmware,
| hypervisor and operating system kernel interfaces for things
| like system bootstrap, power-perf control etc. Think ACPI, EFI,
| CPU capability discovery, DVFS, Idle management etc.
|
| Being unable to boot Linux on modern AArch64 server class
| hardware is actually increasingly rare thanks to the
| standardisation.
|
| Your comments are more applicable to the general Arm embedded
| systems scene where fragmentation is understandably rife. It
| was the price Arm had to pay to keep its royalty model in
| flight - "Pay the license fee, do what you will with the
| design".
| 3np wrote:
| The gap for generic arm64 has been closing _a lot_ in recent
| years. These days IME the vast majority of fiddling doesn't
| have to do with the ARM SoC/CPU itself but rather with getting
| the right dtb (and sometimes firmware) for other integrated
| hardware, and u-boot.
|
| These issues aren't really inherent to the architecture per-se,
| more tied to the setups and practices for the kind of devices
| you tend to find available ARM SoCs/CPUs in. IME from dealing
| with embedded x86 devboards some years back it was a similar
| situation. I'd assume units marketed for server loads are
| comparable to x86 equivalents.
|
| I've been trying out Arch Linux on ARM on a secondary
| workstation recently. In most cases the only thing needed to
| get a non-supported package working is to add 'aarch64' to the
| list supported architectures in the PKGBUILD and then proceed
| like normal.
|
| > You're even luckier if the ARM CPU manufacturer releases
| documentation about how their chips work .... They prefer to
| keep inner workings secret.
|
| This is the bigger practical issue. Rockchip are in general
| good, while the kind of chips you find in flagship phones
| require significant reverse-engineering and are unapproachable
| for the non-hacker. But again, not so much "software-to-CPU",
| more "drivers-to-everything-around-the-CPU".
| foxfluff wrote:
| I wouldn't generalize Rockchip as good. You're lucky if they
| release one volume of their six volume reference manual, with
| "Confidential" plastered all over, and most peripherals
| completely undocumented beyond vague register names. If
| you're extra lucky, you'll find another leaked volume on a
| sketchy Chinese website. Anything else? NDA.
|
| And what about new chips? Look up datasheets for e.g. the
| (now popular) RK3566. You get a 58 page document which is
| just a general overview plus pinout and not much else. Does
| fuck all for you if you actually want to write drivers and
| get the thing to work.
| 3np wrote:
| Fair enough!
| floatboth wrote:
| dtb is for embedded devices. Server/workstation class
| hardware uses ACPI.
| 3np wrote:
| Precisely, thanks for making it explicit.
| maxwell86 wrote:
| We bought a small ARM HPC cluster last summer, and everything
| worked out of the box. All our apps and dependencies just
| worked.
|
| Documentation is also excellent. ARM docs are really good, and
| go into much more detail than Intel and AMD docs about the
| inner workings of their cores.
|
| That's what we expected, since ARM licenses all their IP, and
| that's what our vendor delivered. Everything we wanted to know
| about the Hw, there were docs and training materials ready for
| it.
| freemint wrote:
| What scheduler do you use there?
| sidkshatriya wrote:
| Has dealing with the slightly more relaxed memory/concurrency
| model of ARM been tricky, or is it something you don't really
| encounter in practice?
| maxwell86 wrote:
| We don't write relaxed-atomics kind of code directly (who
| does that, really?). We use MPI, pthreads primitives, C++
| synchronization primitives, OpenMP, etc. These are portable
| and "just work". Anecdotally, we haven't run into any
| incorrect use of these in our apps yet that causes problems
| on ARM but not on x86 (although that would be a bug on
| both), but we aren't doing anything super fancy.
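|
| A minimal sketch of that kind of portable parallel code
| (illustrative only): the same OpenMP loop builds and runs
| unchanged on x86-64 and AArch64 with gcc -fopenmp:
|
|     #include <omp.h>
|     #include <stdio.h>
|
|     /* Portable parallel reduction: no architecture-specific
|      * synchronization; OpenMP handles the threading. */
|     int main(void) {
|         double sum = 0.0;
|         #pragma omp parallel for reduction(+:sum)
|         for (int i = 1; i <= 1000000; i++)
|             sum += 1.0 / i;
|         printf("threads: %d  sum: %f\n",
|                omp_get_max_threads(), sum);
|         return 0;
|     }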
| floatboth wrote:
| > They prefer to keep inner workings secret
|
| https://github.com/tianocore/edk2-platforms/tree/master/Sili...
| https://github.com/tianocore/edk2-platforms/tree/master/Sili...
| https://github.com/tianocore/edk2-platforms/tree/master/Sili...
| https://github.com/tianocore/edk2-platforms/tree/master/Sili...
| mobilio wrote:
| Just two things: the 250W TDP is HUGE, and the motherboard for
| this processor is expensive.
| jeffbee wrote:
| I just bought a new Intel CPU that draws 125-241W (they don't
| give a TDP any more) and it only has 8 cores. It is very fast,
| though.
|
| I don't think 250W is outrageous for a chip with this much
| logic on it.
| OJFord wrote:
| > 250W TDP is huge
|
| 128 cores is too though, I imagine it's pretty linear?
| jakuboboza wrote:
| Is it? Most EPYC processors have a TDP of 180/200 watts and
| there are cheap mobos for them that can host even two sockets.
| So I don't think that would be a big issue.
|
| Also, we don't even know how they calculate TDP; let's not
| forget every single company - Intel, AMD, Nvidia, etc. - has
| its own weird formula for calculating TDP. Your Intel 12900K
| has a TDP on paper of 125W but can easily jump to 300W of power
| consumed. Without knowing the TDP formula from every
| manufacturer this type of comparison is only a guessing game.
| tehbeard wrote:
| If this were an Intel chip, then yeah, a 250W TDP would be huge
| given their core counts.
|
| But for something packing a similar core count to
| EPYC/Threadripper, it's in the right ballpark.
| NavinF wrote:
| Is it though? My desktop CPU regularly pulls 150W. 250W is
| pretty normal for a server. That said, I agree that
| motherboards will be really expensive since we're not gonna see
| hundreds/thousands of models competing on price like we see
| with every new x86 socket.
| bee_rider wrote:
| This article has no measurements, in fact it doesn't tell us
| anything we couldn't have gotten by looking at the product
| specifications. They even took their table from Phoronix, so they
| didn't even do the legwork of comparing the marketing material
| for the products!
|
| Given that this low-effort article picked a metric which everyone
| knows will benefit the ARM processor, I can only assume it is
| marketing for Ampere. And yet, the first sentence starts with:
|
| > Ampere's flagship 128-core Altra Max M128-30 may not be the
| world's highest-performing processor
|
| Ampere is cool. It is really awesome that somebody is putting up
| a fight in CPU design without Intel/AMD's legacy advantages, or
| Apple/Amazon's infinite money. I really hope they didn't pay
| much for this fluff; that would be pretty embarrassing.
|
| --
|
| Edit: It is neat to see that they've got a chip under $1k. I
| wonder if a Q32-17 workstation could be put together for cheaper
| than whatever the cheapest Apple M1 pro device is, to experiment
| with computationally crunchy Arm codes.
| londons_explore wrote:
| It's also quite a niche use case... An application fine with
| low single-thread performance, that is highly parallelizable,
| requiring hundreds of threads, but with sufficiently branchy
| execution that CUDA/GPU doesn't work out... Oh, and it also
| has no binary blobs, or you won't be able to port it to ARM.
| guiand wrote:
| Sounds like exactly the use case of a server. Rack space is
| expensive, condensing it can save the operators money.
| masklinn wrote:
| > Edit: It is neat to see that they've got a chip under $1k. I
| wonder if a Q32-17 workstation could be put together for
| cheaper than whatever the cheapest Apple M1 pro device is, to
| experiment with computationally crunchy Arm codes.
|
| The cheapest M1P device is currently rather expensive ($2k for
| the 8-cores 14") but there'll almost certainly be an M1P mini
| for about the same price as the current (still on intel) high-
| ender: $1100.
|
| A Q32-17 leaves you with $300 for a bespoke box around the CPU.
| For such a CPU class I'd expect the mainboard alone to exceed
| that budget. Even if the Mini is price-bumped to, say, 1500
| (which would be somewhat in-line with the 13" -> 14" price
| differential) I don't think you can get even just the guts of
| an Altra-based workstation for less than the price of the
| processor.
| dchichkov wrote:
| An alternative could be Jetson AGX Orin - 2048 Cores @
| 1000MHz.
| bee_rider wrote:
| Yeah. Someone in another branch linked a site selling Ampere
| workstations, couldn't find anything under ~$7k.
|
| I think if we want to drop ~$7k on ARM computers to play
| around with for some reason, we'd be better off waiting for a
| hypothetical M1P mini. Plus you could get a handful of the
| things and make an adorable mini cluster -- then you get to
| deal with MPI which will make for a more fun experiment
| anyway.
| R0b0t1 wrote:
| There's cheaper arm workstations that smoke the usual
| embedded chips. They need to drop the price before devs
| will really want it.
| jhickok wrote:
| and that 7k is without storage and with a relatively small
| ram selection. Realistically the workstation is more like
| 10-15k.
|
| It would be nice to have some competition to Apple in this
| space, but with the rumored Apple Silicon Mac Pro on the
| horizon that might be the best bet:
| https://appleinsider.com/articles/22/01/02/smaller-mac-
| pro-w...
| floatboth wrote:
| > I wonder if a Q32-17 workstation could be put together for
| cheaper than whatever the cheapest Apple M1 pro device is
|
| Nah. You can't just get a chip & mainboard retail since there
| is basically no market for that.
|
| About the only option for an Altra workstation is a prebuilt
| for "as low as $7661" :(
|
| https://store.avantek.co.uk/ampere-altra-64bit-arm-workstati...
| bee_rider wrote:
| Ouch. I (hypothetically -- some alternative version of me
| that had more free time and free money) would be willing to
| pay some early adopter, "I'm buying fun parts" tax, but $7k
| for 32 threads @ 1.7GHz is a bit much even for hypothesizing.
| I wonder what they have in mind for that chip, then.
| fredoralive wrote:
| It's mainly intended as a server CPU, and those often go for
| lots of cores / threads over maximum single-core clock speed.
| That 32 core / 1.7GHz config doesn't seem too hot for either,
| though.
|
| The workstation is presumably intended as a developer tool
| really, for those that want real hardware on hand instead
| of remote access (and don't want a noisy server in an
| office). Going to be even more niche and low volume than
| the server versions, so gonna cost a fair chunk.
| floatboth wrote:
| For fun parts you can get an NXP LX2160A board from
| SolidRun (or the earlier Marvell Armada8k one).
|
| The performance wouldn't be competitive (Cortex-A72 is
| really old by now) but it's nice to have a proper arm64
| desktop.
| leeter wrote:
| Someone needs the courage to make a motherboard and CPU with
| off the shelf parts and a socketed chip in the non-x86 space.
| That was one of the key reasons that the IBM PC succeeded: it
| was easy to clone. Without that it's likely it would have
| been just another interesting footnote in history. But
| because IBM didn't take the time to do everything custom and
| locked in others were able to enter the market and make
| things happen. Honestly I don't think anyone sells socketed
| ARM64 or RISC-V chips. There might be OpenPOWER but I don't
| think anyone is really fabbing them at a cost that a consumer
| could even consider.
|
| Sadly I don't think that sort of courage is in copious supply;
| it's always easier to vendor-lock people than it is to take
| the risk of someone else eating your lunch. So until that
| happens I just don't see non-x86 platforms really eating into
| the desktop/laptop market in any large share.
| msgilligan wrote:
| Well, Raspberry Pi, with the Compute Module 4, has done this
| on the low end, and an ecosystem appears to be developing
| rapidly with a wide variety of carrier boards and a handful
| of 3rd-party compute module cards (e.g. the Pine64 SOQuartz
| and the Radxa CM3).
| chasil wrote:
| AMD actually did this in the distant past with the original
| 32-bit Athlons.
|
| They used the DEC Alpha EV6 bus from the 21264
| microprocessor. It would be interesting to have seen DEC
| StrongARM adapted to the same bus, and a single motherboard
| able to run any of these CPUs with the right BIOS.
|
| https://en.wikipedia.org/wiki/Slot_A
| leeter wrote:
| Other than BIOS/UEFI, as far as I know nothing is stopping
| them from using existing chipsets/motherboards. I don't
| recall anything on the chipset itself being inherently
| tied to x86 per se. I think even port IO is handled in
| the SoC/microcode these days. So they could probably do
| it, particularly if they reused things like the AMD IO
| die etc. No need to reinvent the wheel just because the
| cores are different.
| stcredzero wrote:
| _I don 't think anyone is really fabbing them at a cost
| that a consumer could even consider._
|
| What are the Chinese getting up to?
| ohazi wrote:
| > Someone needs the courage to make a motherboard and CPU
| with off the shelf parts and a socketed chip in the non-x86
| space.
|
| Honestly? If they want an easy win, that someone should be
| Intel or AMD.
|
| They already design and release desktop and mobile chipsets
| with "platform specs" that are designed to be copied
| wholesale by all the major motherboard and PC vendors.
|
| If Intel released a socketed ARM CPU with a chipset and a
| platform spec to go with it, you'd be able to buy a
| motherboard for it from ASUS or MSI for $150 within six
| months.
| dragontamer wrote:
| AMD's original plan for Zen was to have an x86 chip +
| socket-compatible ARM chip as well.
|
| AMD didn't seem to have the funding needed to execute
| this plan though, and killed the ARM project. To be fair,
| AMD was teetering on the edge of bankruptcy at the time,
| and such a strange strategy probably carried too much
| risk for them to even attempt.
|
| AMD did release a few ARM chips to test the waters. But
| the ARM-based AMD chips never sold in high numbers or got
| much demand, so AMD doubled-down on x86-only Zen chips
| instead.
|
| -------
|
| Allegedly, AMD still owns their ARM-based decoder and
| therefore could convert their Zen chips over from x86 to
| ARM instruction decoding if they ever felt it made
| business sense. Given how successful EPYC is, however,
| there's probably no need for AMD to do something this
| risky (might as well keep selling x86 chips like
| hotcakes; AMD is practically supply-constrained as it is
| and doesn't need to do anything so exotic to sell its
| chips).
|
| -------
|
| See:
|
| * https://www.amd.com/en/amd-opteron-a1100
|
| * https://en.wikipedia.org/wiki/AMD_K12
| oarsinsync wrote:
| > That was one of the key reasons that the IBM PC
| succeeded: it was easy to clone. Without that it's likely
| it would have been just another interesting footnote in
| history. But because IBM didn't take the time to do
| everything custom and locked in others were able to enter
| the market and make things happen.
|
| I'm reasonably sure IBM PCs were totally custom and locked
| down, and they got reverse engineered, enabling "IBM
| Compatible" clones to spring up.
| klondike_ wrote:
| The IBM PC was the first computer IBM made that used
| entirely off-the-shelf chips except one: the BIOS.
|
| The BIOS chip was just copied outright for the first
| clones, but that was ruled illegal in court, so later
| compatibles reverse-engineered the code and made their
| own.
| Melatonic wrote:
| If datacenters adopt it, in a few years we will be able to get
| the decomm'd stuff - once the hardware refresh cycles hit,
| eBay and the like will be flooded with it.
| tkinom wrote:
| I'd love to see a compiler benchmark (compiling Firefox,
| Chrome) on this vs. a system with an EPYC 64C/128T or
| 128C/256T.
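|
| A rough way to collect such a number (sketch only; the ninja
| build dir and source path are placeholders for e.g. a
| Chromium checkout):
|
|     import subprocess, time
|
|     def time_build(cmd, cwd):
|         # run the build and return wall-clock seconds
|         start = time.perf_counter()
|         subprocess.run(cmd, cwd=cwd, check=True)
|         return time.perf_counter() - start
|
|     secs = time_build(["ninja", "-C", "out/Default"],
|                       cwd="/path/to/chromium/src")
|     print(f"build took {secs / 60:.1f} min")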
| [deleted]
| jcadam wrote:
| I would love an ARM-based linux workstation, but I'm not
| willing to pay the extreme Apple premium for whatever an
| M1-based Mac Pro is going to cost.
| DenseComet wrote:
| I wonder how the pricing will end up. At the baseline, $1200
| for a MacBook Air with 16 GB of RAM is great value for the
| machine you get, even compared to Windows alternatives. For
| the new MacBook Pros, multiple companies felt they were
| worth shelling out for top-of-the-line configs outside
| their normal refresh cycles. Maybe Apple will price their
| top-of-the-line stuff at a reasonable price/performance
| ratio, or they might decide that the people who actually
| need that config will pay sky-high prices and charge
| accordingly.
| lmilcin wrote:
| The Q32-17 may have 32 cores, may have a 45W TDP and may have
| a whopping 128 PCIe 4.0 lanes, but it is still only 1.7GHz.
|
| What this means in practice is that its value will heavily
| depend on the type of load you are running. A lot of
| workstation-type loads just can't make use of 32 threads, and
| on this CPU they would have to, just to offset the slower
| single-core performance.
| PragmaticPulp wrote:
| The 1.7GHz boost speed is a significant limitation.
|
| Ignoring architecture differences for a moment, that's about
| 1/2 the clock speed of something like an M1 or even a modern
| cell phone chip. It's almost 1/3 of the boost clock of a
| modern Intel or AMD desktop chip.
|
| Given the choice and assuming equal cache sizes, I'd take a
| 3.4GHz version with 16 cores over a 1.7GHz version with 32
| cores.
|
| The only time a higher core count with proportionally lower
| clock speed really helps is when you're going for raw power
| efficiency.
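|
| A toy Amdahl's-law comparison backs that up (it ignores caches,
| memory bandwidth and power entirely, and the 3.4GHz baseline is
| just the hypothetical above): the 32-core/1.7GHz part only
| catches the 16-core/3.4GHz part when the workload is
| essentially 100% parallel.
|
|     # speedup vs. one 3.4GHz core for n cores at clock f
|     # and a workload with parallel fraction p (Amdahl)
|     def speedup(n, f, p, base=3.4):
|         return (f / base) / ((1 - p) + p / n)
|
|     for p in (0.5, 0.9, 0.99):
|         a = speedup(16, 3.4, p)
|         b = speedup(32, 1.7, p)
|         print(f"p={p}: 16x3.4GHz {a:.1f}x, 32x1.7GHz {b:.1f}x")
|     # p=0.5: 16x3.4GHz 1.9x, 32x1.7GHz 1.0x
|     # p=0.9: 16x3.4GHz 6.4x, 32x1.7GHz 3.9x
|     # p=0.99: 16x3.4GHz 13.9x, 32x1.7GHz 12.2x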
| windexh8er wrote:
| One application where this type of processor makes sense is
| network-based middleboxes. Think of things like firewalls
| that have evolved to take on a lot of different functionality
| and need a lot of threads available to run simultaneously.
| This type of chip on a SmartNIC could be very
| popular. And while this is a bit niche I'm sure there are
| plenty of other use cases. E.g. nodes running a lot of FaaS
| containers, VPS providers looking to offer higher CPU count
| but lower clock speeds, etc.
| bee_rider wrote:
| I'm thinking more as a developer machine to prepare for the
| future, rather than as a real productivity workstation.
| Assuming the predictions of ARM ascendancy actually come true
| (certainly remains to be seen) you should want your code to
| run well on 32 ARM cores @1.7GHz, because that means it'll
| run really well on the production workstations that come out
| in a couple years, right?
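|
| One crude way to check that today: time the same CPU-bound job
| over growing worker counts and see how far from linear the
| scaling gets (minimal sketch; the busy loop just stands in for
| real work):
|
|     import time
|     from multiprocessing import Pool
|
|     def crunch(n):
|         # stand-in for "computationally crunchy" work
|         return sum(i * i for i in range(n))
|
|     if __name__ == "__main__":
|         jobs = [2_000_000] * 64
|         for workers in (1, 2, 4, 8, 16, 32):
|             t0 = time.perf_counter()
|             with Pool(workers) as pool:
|                 pool.map(crunch, jobs)
|             dt = time.perf_counter() - t0
|             print(f"{workers:>2} workers: {dt:.2f}s")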
| wongarsu wrote:
| If production workstations become ARM-based, wouldn't it be
| more likely that we get 2-4 high-speed (3-4GHz) cores and a
| large number of low-speed cores? That's analogous to
| smartphone architectures with their low-power cores and
| high-speed/high-power cores, and better matches how real-
| world workloads behave.
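|
| If such a mix did show up, software would probably have to
| steer work onto the right cores explicitly; on Linux that
| could look roughly like this (the core numbering is purely
| hypothetical):
|
|     import os
|
|     # assumed layout: cores 0-3 fast, the rest efficiency
|     FAST = {0, 1, 2, 3}
|     SLOW = set(range(4, os.cpu_count()))
|
|     # pin this process to the fast cores for latency-
|     # sensitive work (Linux-only call)
|     os.sched_setaffinity(0, FAST)
|     # ...and park background/batch work on SLOW instead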
| bee_rider wrote:
| So the smaller cores could be more like throughput-tuned
| efficiency cores, rather than 100% power saving cores --
| kind of like Alder Lake? Seems like a neat idea. Almost
| like a Xeon + Xeon Phi, but on a chip. Seems like that'd
| get around the communication issue of the Phi and
| probably the compatibility issues as well.
| zitterbewegung wrote:
| Apple used an A12Z Bionic, which has a big.LITTLE
| configuration, for people to prepare for the M1 and its
| derivatives. I don't see this or any other server with such
| a configuration at all. An ARM workstation would definitely
| have it, which suggests that a Mac with Apple silicon would
| be the developer machine to prepare for the future.
___________________________________________________________________
(page generated 2022-01-04 23:02 UTC)