[HN Gopher] Why is Apple Rosetta 2 fast?
___________________________________________________________________
Author : fanf2
Score : 65 points
Date : 2024-11-19 21:42 UTC
(HTM) web link (dougallj.wordpress.com)
| Syonyk wrote:
| Post got the big one: Total Store Ordering (TSO).
|
| The rest are all techniques in reasonably common use, but unless
| you have hardware support for x86's strong memory ordering, you
| _cannot_ get very good x86-on-ARM performance, because it's by
| no means clear from inspecting existing code when strong memory
| ordering matters and when it doesn't - so you have to liberally
| sprinkle memory barriers around, which really kills performance.
|
| The huge and fast L1I/L1D cache doesn't hurt things either...
| emulation tends to be cache-intensive.
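A minimal C11 sketch of the conservative approach described above, assuming a translator with no hardware TSO mode that lowers every guest x86 load to an acquire load and every guest store to a release store. The helper names guest_load/guest_store and the message-passing harness are hypothetical illustrations, not code from Rosetta 2 or any Windows emulator.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical translator helpers: how a guest x86 MOV to/from memory might
   be lowered when the host has no TSO mode. Acquire loads and release stores
   forbid load->load, load->store and store->store reordering, which x86-TSO
   promises; TSO still lets a store be delayed past a later load, so ordinary
   MOVs need nothing stronger than this. */
static inline uint64_t guest_load(_Atomic uint64_t *addr) {
    /* Typically compiled to LDAR (or LDAPR) on AArch64. */
    return atomic_load_explicit(addr, memory_order_acquire);
}

static inline void guest_store(_Atomic uint64_t *addr, uint64_t v) {
    /* Typically compiled to STLR on AArch64. */
    atomic_store_explicit(addr, v, memory_order_release);
}

/* Message-passing pattern that x86 code often relies on implicitly: a reader
   that sees flag == 1 must also see the payload written before it. */
static _Atomic uint64_t payload, flag;

static void *writer(void *arg) {
    (void)arg;
    guest_store(&payload, 42);
    guest_store(&flag, 1);
    return NULL;
}

static void *reader(void *arg) {
    int *violations = arg;
    while (guest_load(&flag) == 0) {
        /* spin until the writer publishes */
    }
    if (guest_load(&payload) != 42) {
        (*violations)++;
    }
    return NULL;
}

/* Runs the pattern `rounds` times and returns how often the reader saw the
   flag without the payload; acquire/release guarantees this is 0. */
static int run_message_passing(int rounds) {
    int violations = 0;
    for (int i = 0; i < rounds; i++) {
        atomic_store(&payload, 0);
        atomic_store(&flag, 0);
        pthread_t w, r;
        pthread_create(&r, NULL, reader, &violations);
        pthread_create(&w, NULL, writer, NULL);
        pthread_join(w, NULL);
        pthread_join(r, NULL);
    }
    return violations;
}
```

The cost the comment refers to is that these ordered accesses get applied to every guest memory access, because the translator cannot tell which ones the original program actually depended on. With a hardware TSO mode enabled, as Rosetta 2 uses on Apple Silicon, the same MOVs can be lowered to plain loads and stores.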
| jsheard wrote:
| It's surprising that (AFAIK) Qualcomm didn't implement TSO in
| the chips they made for the recent-ish Windows ARM machines. If
| anything they need fast x86 emulation even more than Apple does
| since Windows has a much longer tail of software support than
| macOS, there's going to be important Windows apps that
| stubbornly refuse to support native ARM basically forever.
| scottlamb wrote:
| Does Windows's translation take advantage of those where they
| exist? E.g. if I launch an aarch64 Windows VM on my M2, does
| it use the M2's support for TSO when running x86_64 .exes or
| does it insert these memory barriers?
|
| If not, it makes sense that Qualcomm didn't bother adding
| them.
| Syonyk wrote:
| I would expect it to not use TSO, because the toggle for it
| isn't, to the best of my knowledge, a general userspace
| toggle. It's something the kernel has to toggle, and so a
| VM may or may not (probably does not) even have access to
| the SCRs (system control registers) to change it.
| zeusk wrote:
| TSO toggle on Apple Silicon is a user-space
| accessible/writable register.
|
| It is used when you install Rosetta 2 for Linux VMs.
|
| https://developer.apple.com/documentation/virtualization/
| run...
| Syonyk wrote:
| Are you sure it's userspace accessible?
|
| Based on
| https://github.com/saagarjha/TSOEnabler/blob/master/TSOEnabl...,
| it's a field in ACTLR_EL1, which is explicitly (per the ARMv8
| spec, at least...) _not_ accessible to userspace (EL0) execution.
|
| There may be some kernel interface to allow userspace to
| toggle that, but that's not the same as being a
| userspace-accessible SCR (and I also wouldn't expect it
| to be passed through to a VM - you'd likely need a
| hypercall to toggle it, unless the hypervisor emulated
| that, though admittedly I'm not quite as deep in the weeds on
| ARMv8 virtualization as I would prefer at the moment).
| zeusk wrote:
| The OS can use what hardware supports, Mac OS does because
| SEG is a tightly integrated group at Apple whereas
| Microsoft treats hardware vendors at arm's length (pun
| unintended). There are roadmap sharing, planning events
| through leadership but it is not as cohesive as it is at
| Apple.
| deaddodo wrote:
| Microsoft's AoT+JiT techniques still pull off impressive
| performance (90+% in almost every case, 96-99% in the
| majority).
|
| But yes, if they were actually serious about Windows on ARM,
| they would have implemented TSO in their "custom" Qualcomm
| SQ1/SQ2 chips.
| Syonyk wrote:
| My guess is that the sort of "legacy x86-forever" apps for
| Windows don't really need much in the way of performance.
| Think your classic Visual Basic 6 sort of thing that a
| business relies on for decades.
|
| I'm also fairly certain that the TSO changes to the memory
| system are non-trivial, and it's possible that Qualcomm
| doesn't see it as a value-add in their chips - and they're
| _probably right._ Windows machines are such a hot mess that
| outside a relatively small group of users (who _probably_ run
| Linux anyway, so aren't anyone's target market), nobody
| would know or care what TSO is. If it adds cost and power and
| doesn't matter, why bother?
| jsheard wrote:
| > My guess is that the sort of "legacy x86-forever" apps
| for Windows don't really need much in the way of
| performance.
|
| Games are a pretty notable exception that demand high
| performance, and even if we reach a point where gamedevs
| start shipping ARM binaries of new games, it's extremely
| unlikely that anything released before that point will be
| retroactively updated to be ARM native.
| p_l wrote:
| Qualcomm has been phoning it in in various forms for over a
| decade, including forcing MS to ship machines that do not
| really pass Windows requirements (like broken firmware
| support). Maybe it got fixed with recent Snapdragon X, but I
| won't hold my breath.
|
| We're talking about a company that, if certain personal
| sources are to be believed, started the Snapdragon brand by
| deciding to cheap out on memory bandwidth despite feedback
| that increasing it was critical and leaving the client to
| find out too late in the integration stage.
|
| Deciding that they make better money by not spending on
| implementing TSO, or not spending transistors on bigger
| caches, and getting more volume at lower cost, is perfectly
| normal.
| brycewray wrote:
| (2022)
| leshokunin wrote:
| Super interesting. Putting my PM hat on, I wonder: how many x86
| apps on Apple still benefit from this much performance? What's
| the coverage? The switch to M1 happened 4 years ago, so the
| software was designed for hardware nearly half a decade old.
|
| Excellent engineering and nice that it was built properly. Is
| this something that Linux / Wine / the Steam compatibility layer
| already benefit from?
| spockz wrote:
| I think it is less of a numbers game and more of a guarantee
| thing. As a user of a new Apple silicon machine you do not have
| to worry about running x86 software. (Aside from maybe specific
| audio software and such that are a pain to run on any other
| hardware and software combination.)
|
| As such it may very well be a loss leader and that is fine.
| Probably most development has been done and there is little
| maintenance needed.
|
| Also, while most native macOS apps that I encounter have an
| Apple silicon version now, I still find Docker images for amd64
| without an arm64 version present. Rosetta 2 also helps with
| these applications.
| aaomidi wrote:
| Games. So many games.
|
| Also, x86 containers.
| jsheard wrote:
| Then again games didn't stop Apple from dropping x86-32
| support, which nuked half of the Mac Steam library. It
| wouldn't be out of character for them to drop x86-64 support
| and nuke the rest which haven't been updated to native ARM.
| p_l wrote:
| For games on Intel Macs they had the fallback of Boot Camp, so
| combined with not really caring about games outside of
| random bursts like support for Unity, they were fine
| telling people to run Windows. (Ironically, the only Mac I
| owned ran faster under Windows than under macOS...)
| Syonyk wrote:
| "Apple M-series chips emulating x86," in certain benchmarks and
| behaviors, was right up there with the fastest x86 chips at the
| time - I'd guess largely in stuff that benefited from the huge
| L1I/L1D cache (compared to x86).
|
| I had an M1 Mini for a while, and it played Kerbal Space Program
| (x86) _far_ better than my previous Intel Mini, which had Intel
| Integrated Graphics that could barely manage a 4k monitor, much
| less actual gaming.
|
| I believe there's a way to use Rosetta with Linux VMs, too (to
| translate x86 VM applications to ARM and run them natively) -
| but I no longer have any Macs, so I've not had a chance to play
| with it.
| dhosek wrote:
| I wonder if these lessons might be applied to Wasm runtimes where
| the Wasm could be JIT compiled into native code. Of course this
| does raise the possibility of security concerns if the Wasm
| compilation has some bug, and then of course there's also the
| question of whether Wasm's requirements might mean native
| compilation doesn't give much of a performance boost (as seems to
| be the case with e.g., Java bytecode).