[HN Gopher] Why is Apple Rosetta 2 fast?
       ___________________________________________________________________
        
       Why is Apple Rosetta 2 fast?
        
       Author : fanf2
       Score  : 65 points
       Date   : 2024-11-19 21:42 UTC (1 hour ago)
        
 (HTM) web link (dougallj.wordpress.com)
 (TXT) w3m dump (dougallj.wordpress.com)
        
       | Syonyk wrote:
       | Post got the big one: Total Store Ordering (TSO).
       | 
       | The rest are all techniques in reasonably common use, but unless
       | you have hardware support for x86's strong memory ordering, you
       | _cannot_ get very good x86-on-ARM performance, because it's by
       | no means clear from inspecting existing code when strong memory
       | ordering matters and when it doesn't - so you have to liberally
       | sprinkle memory barriers around, which really kills performance.
       | 
       | The huge and fast L1I/L1D cache doesn't hurt things either...
       | emulation tends to be cache-intensive.
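
To make the cost described above concrete, here is a toy sketch of a translator that has to preserve x86 ordering on ARM. It is an illustration under stated assumptions, not how Rosetta 2 or any real translator is written: `translate`, `ordered_access_count`, and the tuple format are hypothetical, and it swaps the explicit barriers mentioned in the comment for acquire loads (`ldar`) and release stores (`stlr`), which is one conservative mapping among several.

```python
# Toy model only: hypothetical names, not Rosetta 2's actual code generator.
# It shows why a translator without hardware TSO pays for ordering on every
# guest memory access, while one with hardware TSO can emit plain accesses.

def translate(x86_ops, hardware_tso):
    """Map simplified x86 ops to simplified ARM64 mnemonics.

    x86_ops: list of ("load" | "store" | "alu", operand_text) tuples.
    hardware_tso: True if the CPU enforces x86-style ordering itself.
    """
    out = []
    for kind, operand in x86_ops:
        if kind == "load":
            # Without TSO, use a load-acquire so later accesses stay ordered.
            out.append(("ldr " if hardware_tso else "ldar ") + operand)
        elif kind == "store":
            # Without TSO, use a store-release so earlier accesses stay ordered.
            out.append(("str " if hardware_tso else "stlr ") + operand)
        elif kind == "alu":
            # Register-only work needs no ordering in either mode.
            out.append("add " + operand)
        else:
            raise ValueError("unknown op kind: " + kind)
    return out


def ordered_access_count(arm_ops):
    """Count accesses that carry ordering semantics (ldar/stlr)."""
    return sum(op.split()[0] in ("ldar", "stlr") for op in arm_ops)


guest = [("store", "x0, [x1]"), ("alu", "x2, x2, #1"), ("load", "x3, [x4]")]
weak = translate(guest, hardware_tso=False)
tso = translate(guest, hardware_tso=True)
```

In this model every guest load and store becomes an ordered access when `hardware_tso` is false, because the translator cannot tell from the binary which accesses other threads depend on. With `hardware_tso` true, the same guest code maps to plain `ldr`/`str`.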
        
         | jsheard wrote:
         | It's surprising that (AFAIK) Qualcomm didn't implement TSO in
         | the chips they made for the recent-ish Windows ARM machines. If
         | anything they need fast x86 emulation even more than Apple does
         | since Windows has a much longer tail of software support than
         | macOS, there's going to be important Windows apps that
         | stubbornly refuse to support native ARM basically forever.
        
           | scottlamb wrote:
           | Does Windows's translation take advantage of those where they
           | exist? E.g. if I launch an aarch64 Windows VM on my M2, does
           | it use the M2's support for TSO when running x86_64 .exes or
           | does it insert these memory barriers?
           | 
           | If not, it makes sense that Qualcomm didn't bother adding
           | them.
        
             | Syonyk wrote:
             | I would expect it to not use TSO, because the toggle for it
             | isn't, to the best of my knowledge, a general userspace
             | toggle. It's something the kernel has to toggle, and so a
             | VM may or may not (probably does not) even have access to
             | the SCRs (system control registers) to change it.
        
               | zeusk wrote:
               | TSO toggle on Apple Silicon is a user-space
               | accessible/writable register.
               | 
               | It is used when you install Rosetta 2 for Linux VMs:
               | 
               | https://developer.apple.com/documentation/virtualization/
               | run...
        
               | Syonyk wrote:
               | Are you sure it's userspace accessible?
               | 
               | Based on https://github.com/saagarjha/TSOEnabler/blob/mas
               | ter/TSOEnabl..., it's a field in ACTLR_EL1, which is
               | explicitly (per the ARMv8 spec, at least...) _not_
               | accessible to userspace (EL0) execution.
               | 
               | There may be some kernel interface to allow userspace to
               | toggle that, but that's not the same as being a
               | userspace-accessible SCR (and I also wouldn't expect it
               | to be passed through to a VM - you'd likely need a
               | hypercall to toggle it, unless the hypervisor emulated
               | that, though admittedly I'm not quite as deep in the weeds
               | on ARMv8 virtualization as I would prefer at the moment).
        
             | zeusk wrote:
             | The OS can use what the hardware supports. macOS does because
             | SEG is a tightly integrated group at Apple, whereas
             | Microsoft treats hardware vendors at arm's length (pun
             | unintended). There is roadmap sharing and there are planning
             | events through leadership, but it is not as cohesive as it
             | is at Apple.
        
           | deaddodo wrote:
           | Microsoft's AoT+JiT techniques still pull off impressive
           | performance (90+% in almost every case, 96-99% in the
           | majority).
           | 
           | But yes, if they were actually serious about Windows on ARM,
           | they would have implemented TSO in their "custom" Qualcomm
           | SQ1/SQ2 chips.
        
           | Syonyk wrote:
           | My guess is that the sort of "legacy x86-forever" apps for
           | Windows don't really need much in the way of performance.
           | Think your classic Visual Basic 6 sort of thing that a
           | business relies on for decades.
           | 
           | I'm also fairly certain that the TSO changes to the memory
           | system are non-trivial, and it's possible that Qualcomm
           | doesn't see it as a value-add in their chips - and they're
           | _probably right._ Windows machines are such a hot mess that
           | outside a relatively small group of users (who _probably_ run
           | Linux anyway, so aren't anyone's target market), nobody
           | would know or care what TSO is. If it adds cost and power and
           | doesn't matter, why bother?
        
             | jsheard wrote:
             | > My guess is that the sort of "legacy x86-forever" apps
             | for Windows don't really need much in the way of
             | performance.
             | 
             | Games are a pretty notable exception that demand high
             | performance, and even if we reach a point where gamedevs
             | start shipping ARM binaries of new games, it's extremely
             | unlikely that anything released before that point will be
             | retroactively updated to be ARM native.
        
           | p_l wrote:
           | Qualcomm has been phoning it in in various forms for over a
           | decade, including forcing MS to ship machines that do not
           | really pass Windows requirements (like broken firmware
           | support). Maybe it got fixed with the recent Snapdragon X, but
           | I won't hold my breath.
           | 
           | We're talking about a company that, if certain personal
           | sources are to be believed, started the Snapdragon brand by
           | deciding to cheap out on memory bandwidth despite feedback
           | that increasing it was critical and leaving the client to
           | find out too late in the integration stage.
           | 
           | Deciding that they make better money by not spending on
           | implementing TSO, or not spending transistors on bigger
           | caches, and getting more volume at lower cost, is perfectly
           | normal.
        
       | brycewray wrote:
       | (2022)
        
       | leshokunin wrote:
       | Super interesting. Putting my PM hat on, I wonder: how many x86
       | apps on Apple still benefit from this much performance? What's
       | the coverage? The switch to M1 happened 4 years ago, so the
       | software was designed for hardware nearly half a decade old.
       | 
       | Excellent engineering and nice that it was built properly. Is
       | this something that Linux / Wine / the Steam compatibility layer
       | already benefit from?
        
         | spockz wrote:
         | I think it is less of a numbers game and more of a guarantee
         | thing. As a user of a new Apple silicon machine you do not have
         | to worry about running x86 software. (Aside from maybe specific
         | audio software and such that are a pain to run on any other
         | hardware and software combination.)
         | 
         | As such it may very well be a loss leader and that is fine.
         | Probably most development has been done and there is little
         | maintenance needed.
         | 
         | Also, while most native macOS apps that I encounter have an
         | Apple silicon version now, I still find docker images for amd64
         | without an arm64 version present. Rosetta2 also helps with
         | these applications.
        
         | aaomidi wrote:
         | Games. So many games.
         | 
         | Also, x86 containers.
        
           | jsheard wrote:
           | Then again games didn't stop Apple from dropping x86-32
           | support, which nuked half of the Mac Steam library. It
           | wouldn't be out of character for them to drop x86-64 support
           | and nuke the rest which haven't been updated to native ARM.
        
             | p_l wrote:
             | For games on Intel Macs they had the fallback of Boot Camp,
             | so combined with not really caring about games outside of
             | random bursts like support for Unity, they were fine
             | telling people to run Windows. (Ironically, the only Mac I
             | owned ran faster under Windows than under macOS...)
        
         | Syonyk wrote:
         | "Apple M-series chips emulating x86," in certain benchmarks and
         | behaviors, was right up there with the fastest x86 chips at the
         | time - I'd guess largely in stuff that benefited from the huge
         | L1I/L1D cache (compared to x86).
         | 
         | I had an M1 Mini for a while, and it played Kerbal Space Program
         | (x86) _far_ better than my previous Intel Mini, which had Intel
         | Integrated Graphics that could barely manage a 4k monitor, much
         | less actual gaming.
         | 
         | I believe there's a way to use Rosetta with Linux VMs, too (to
         | translate x86 VM applications to ARM and run them natively) -
         | but I no longer have any Macs, so I've not had a chance to play
         | with it.
        
       | dhosek wrote:
       | I wonder if these lessons might be applied to Wasm runtimes where
       | the Wasm could be JIT compiled into native code. Of course this
       | does raise the possibility of security concerns if the Wasm
       | compilation has some bug, and then of course there's also the
       | question of whether Wasm's requirements might mean native
       | compilation doesn't give much of a performance boost (as seems to
       | be the case with, e.g., Java bytecode).
        
       ___________________________________________________________________
       (page generated 2024-11-19 23:00 UTC)