[HN Gopher] Box86/Box64 vs. QEMU vs. FEX (Vs Rosetta2)
       ___________________________________________________________________
        
       Box86/Box64 vs. QEMU vs. FEX (Vs Rosetta2)
        
       Author : pantalaimon
       Score  : 113 points
       Date   : 2022-07-19 08:51 UTC (1 days ago)
        
 (HTM) web link (box86.org)
 (TXT) w3m dump (box86.org)
        
       | CoastalCoder wrote:
       | Anyone know why qemu is so slow vs. the others?
       | 
       | The article discusses differences in floating point handling and
       | GPU passthrough, but I don't think the 7z benchmark uses either
       | of those.
        
         | yjftsjthsd-h wrote:
         | Possibly because they don't care as much. Until very recently,
         | the heaviest use of qemu was to run hardware accelerated
         | virtual machines on the same architecture. If you're using it
         | with KVM/HAXM/whatever, it is fast. I expect they would be
         | happy to take performance enhancements for emulation, but that
         | it simply hasn't been a priority.
        
         | lunixbochs wrote:
         | TCG has historically had more of a focus on accuracy than
         | performance. It lifts a lot of guest architectures to a lot of
         | host architectures, and isn't particularly specialized to any
         | given host cpu type. It lifts many instructions to C helpers
         | instead of bothering to jit them. Last I checked it had no
         | vector -> vector jit. It's also not single address mapped -
         | memory IO undergoes indirection, which is expensive. I think
         | Rosetta for example has a shared address space for the guest
         | and host code. Honestly on 64-bit CPUs, especially with pointer
         | authentication on M1, the risk of the guest accidentally
         | messing with host/jit memory is low.
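
A toy sketch of the memory-indirection point above, assuming a software TLB in front of every guest load versus a single shared address space. Every name here (`guest_page_table`, `soft_tlb`, `GUEST_BASE`, the two load functions) is hypothetical and chosen for illustration; this is not QEMU's or Rosetta's actual code.

```python
# Toy model only: contrasts an indirect (software-TLB) guest load with a
# direct, single-address-space load. All names here are hypothetical.
PAGE_BITS = 12
PAGE_MASK = (1 << PAGE_BITS) - 1

host_memory = bytearray(1 << 16)    # pretend host RAM

# Guest page number -> host offset of that page (a stand-in page table).
guest_page_table = {0x10: 0x3000, 0x11: 0x8000}
soft_tlb = {}                       # cache of recent translations
stats = {"tlb_miss": 0}


def load_indirect(guest_addr):
    """Each guest access pays for a lookup before touching host memory."""
    page, offset = guest_addr >> PAGE_BITS, guest_addr & PAGE_MASK
    base = soft_tlb.get(page)
    if base is None:                # slow path: walk the table, fill the TLB
        stats["tlb_miss"] += 1
        base = guest_page_table[page]
        soft_tlb[page] = base
    return host_memory[base + offset]


GUEST_BASE = 0x4000                 # assumed fixed offset of guest RAM


def load_direct(guest_addr):
    """Shared address space: translation is a single add."""
    return host_memory[GUEST_BASE + guest_addr]


host_memory[0x3000 + 0x24] = 0xAB   # byte visible at guest address 0x10024
host_memory[GUEST_BASE + 0x24] = 0xCD
```

In a real JIT the direct form can collapse into one host load instruction, while the indirect form costs extra compares and branches on every access even when the TLB hits.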
        
       | simjnd wrote:
       | This is a great post, petitSeb is doing an outstanding job on
       | Box86/64. I'm also keeping an eye on FEX which is evolving very
       | rapidly. There have been 4 releases since the linked blog post
       | which was written in late March, introducing very welcome
       | features such as support for pressure-vessel or better OpenGL and
       | Vulkan thunking.
        
         | olliej wrote:
         | yeah, I liked how they explicitly distinguished the benchmark
         | apps that made significant use of x87 as supporting x87 is
         | necessarily an all software floating point implementation -
         | it's impossible to get close to native performance for x87
         | heavy code on non-x86 architectures.
        
           | lunixbochs wrote:
           | I don't know if I agree with "impossible". There's a lot of
           | performance left on the table with SoftFloat. A non x86
           | architecture can add an 80-bit FPU if they want. There are
           | architectures with 128-bit float, and CPUs with FPGA
           | coprocessors. I suspect x87 is also not the most optimized
           | path in modern x86 cpus (some instructions may even be fully
           | emulated in microcode).
           | 
           | Realistically, an x87-specific JIT could do significant
           | instruction reordering, lift/reoptimize the underlying code
           | (much of existing x87 code was compiled a very long time ago
           | on older compilers), and vectorize the underlying integer
           | float emulation, or even trace and move some computation to
           | another core or a coprocessor like a GPU or DSP (often idle
           | in embedded cpus).
           | 
           | Many games work fine with x87 lowered to 64-bit or even
           | 32-bit floats, and depending on the workload there's a middle
           | ground where you could understand (or approximate) the
           | current level of precision error for a value, generally run
           | at a lower precision, and trace operations / "catch up" on
           | precision at batched intervals.
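
To make "lowering x87 to 64-bit floats" concrete, here is a minimal sketch that decodes an 80-bit extended value exactly and then rounds it to a double. It assumes the standard x87 layout (1 sign bit, 15-bit exponent with bias 16383, 64-bit significand with an explicit integer bit); the function name `decode_x87` is made up for this example, and infinities and NaNs are deliberately not handled.

```python
from fractions import Fraction

EXP_BIAS = 16383  # bias of the 15-bit exponent field


def decode_x87(sign, exponent, significand):
    """Exact value of a finite 80-bit extended number as a Fraction.

    `exponent` is the raw 15-bit field, `significand` the raw 64-bit field
    (explicit integer bit included). Inf/NaN are out of scope here.
    """
    if exponent == 0x7FFF:
        raise ValueError("infinity/NaN not handled in this sketch")
    # Denormals use the same scale as the smallest normal exponent.
    unbiased = (exponent if exponent != 0 else 1) - EXP_BIAS
    value = Fraction(significand) * Fraction(2) ** (unbiased - 63)
    return -value if sign else value


# 1 + 2**-63: the smallest step above 1.0 that the 64-bit significand can hold.
exact = decode_x87(0, EXP_BIAS, (1 << 63) | 1)
lowered = float(exact)  # rounds to the nearest double (53-bit significand)
```

Here `lowered` comes out as plain `1.0`: the low 11 significand bits are simply gone, which is harmless for many games and fatal for code that relies on them.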
        
             | olliej wrote:
             | Sorry, it is obviously possible to add hardware support for
             | the 80bit ieee754 format (the format itself is not great,
             | and in reality the precision isn't necessary in all but the
             | most extreme cases, and those where it is are likely to
             | prefer 128bit float), but it isn't something that is going
             | to happen in the real world, and even if it was we're
             | talking about software for generally available systems.**
             | 
             | You could also emulate it by arbitrarily dropping
             | precision, but as a translator that means breaking
              | bincompat, and more importantly breaking programs that use
             | 80bit format (a lot of fortran).
             | 
             | Obviously many games (especially old ones) perform fine as
             | they're only using 80bit because at the time x87 was the
             | only hardware fp available on x86 hardware, not because
             | they needed that perf.
             | 
             | Even lowering the precision of the x87 unit isn't
             | sufficient as that only reduces the precision of the
             | mantissa not the exponent.
             | 
             | Even outside of the core arithmetic (excluding negation
             | which is really easy in all ieee754 formats) there is a
             | whole bunch of state that you need to keep track of to
             | ensure identical behavior.
             | 
             | Obviously if you are willing to break precision guarantees,
             | etc then breaking state isn't a problem, but if you're
             | trying to be something like Rosetta - eg completely general
             | and running anything - you don't really have the freedom to
             | do that.
             | 
             | ** sorry skim reading I missed your 128bit and x87 perf
             | questions. Yes an emulator can (should?) use hw 128bit for
             | the arithmetic if it's available but on vast majority of
             | hardware it isn't.
             | 
             | You are also right about x87 perf being slow compared to
             | everything else, but it's still faster than anything you
              | can do in software (addition especially does not interact
              | nicely) due to the GRS tracking a software impl needs to
              | do through many bitwise operations.
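
For readers unfamiliar with the GRS (guard/round/sticky) bookkeeping mentioned above, this is a rough sketch of the bitwise work a software float implementation does when it narrows an integer significand with round-to-nearest-even. The function name is hypothetical, and for this rounding mode the round bit is folded into the sticky bit; real softfloat code also has to renormalise and adjust the exponent afterwards.

```python
def round_nearest_even(significand, drop_bits):
    """Drop the low `drop_bits` bits, rounding to nearest with ties to even."""
    if drop_bits == 0:
        return significand
    kept = significand >> drop_bits
    # Guard bit: the most significant of the bits being discarded.
    guard = (significand >> (drop_bits - 1)) & 1
    # Sticky bit: OR of everything below the guard bit.
    sticky = 1 if significand & ((1 << (drop_bits - 1)) - 1) else 0
    # Round up when above the halfway point, or exactly halfway and odd.
    if guard and (sticky or (kept & 1)):
        kept += 1  # may carry into a new top bit; the caller must renormalise
    return kept
```

For example, `round_nearest_even(0b1010, 2)` is an exact tie (2.5) and stays at 2, while `round_nearest_even(0b1110, 2)` (3.5) rounds up to 4. Addition needs this after aligning two operands with different exponents, which is part of why it is awkward in software.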
        
               | lunixbochs wrote:
               | My middle paragraph up-thread proposes that you can
               | emulate it much faster than we're doing now, at full
               | precision with integer SIMD and a specialized JIT. I'll
               | reiterate the 80-bit softfloat stuff I've seen in use now
               | is not really optimized. I suspect that beating the
               | performance of a cpu on x87 from the era where x87 was
               | relevant is somewhere between realistic and trivial.
               | Beating a modern cpu on x87 from another architecture
               | still feels possible (but it's a less useful thing to
               | spend time on).
               | 
               | > Even lowering the precision of the x87 unit isn't
               | sufficient as that only reduces the precision of the
               | mantissa not the exponent.
               | 
               | I don't know what you mean by "isn't sufficient". To be
               | clear, I'm speaking from experience emulating x86 games
               | on low resource arm devices, where I had success
               | emulating x87 in lower precision.
               | 
               | For QEMU, IMO the bigger performance issue is that it
               | doesn't natively JIT _any_ FPU or vector instructions,
               | and the indirect memory mapping hurts general performance
               | quite a bit too.
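
The mantissa-versus-exponent disagreement quoted above can be illustrated with a small sketch. It assumes only that x87 extended values keep a 15-bit exponent regardless of precision setting, while a real double has 11 exponent bits. `Fraction` is used purely as a stand-in for "more exponent range" and does not model x87 significand rounding.

```python
import math
from fractions import Fraction

a = 1e200

# Real IEEE double arithmetic: the intermediate 1e400 overflows to infinity,
# and the division cannot bring it back.
double_result = (a * a) / a

# Stand-in for a format with a wider exponent range: the intermediate stays
# finite, so dividing again recovers the original value.
wide_intermediate = Fraction(a) * Fraction(a)
wide_result = float(wide_intermediate / Fraction(a))

# Finite x87 extended values are all below 2**16384, so ~1e400 would fit.
fits_in_x87_range = wide_intermediate < Fraction(2) ** 16384
```

Code that never leaves the double exponent range will not notice the difference, which is consistent with games running fine at lower precision; code that depends on the extra range will.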
        
       | lunixbochs wrote:
       | I'm excited Rosetta2 can be used in Linux VMs as of macOS
       | Ventura. QEMU tends to be the most accurate emulation option for
       | me on Linux and is nowhere near the speed of Rosetta. (Rosetta is
       | quite accurate as well, it just wasn't available for Linux). I
       | only have one remaining edge case with Rosetta around the FPU
       | config register not behaving quite the same way as an x86 CPU,
       | everything else has been great.
       | 
       | FEX can be quite fast at some workloads, but was slower than QEMU
       | for others, and had some glitches for me. I ended up porting my
       | app to arm64 Linux for Linux dev on M1 rather than continue to
       | slog through the issues I had with emulation on Linux.
       | 
       | > I couldn't include FEX in the bench as it's not compatible with
       | the 16k page actualy used on Asahi/M1.
       | 
       | FEX ran fine for me in a Parallels VM on M1.
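
One common way page size bites x86 emulators, sketched under assumptions: x86 programs expect 4K page granularity, and a fixed-address mapping must be aligned to the host page size. Whether this is the exact issue FEX hit on Asahi is not stated in the thread, and the helper name below is made up.

```python
import mmap

GUEST_PAGE = 4 * 1024    # page granularity x86 programs assume
ASAHI_PAGE = 16 * 1024   # page size the quoted article reports for Asahi/M1


def host_can_map_at(addr, host_page):
    """A fixed-address mapping needs an address aligned to the host page."""
    return addr % host_page == 0


guest_request = 0x1000   # valid from a 4K guest's point of view
ok_on_4k_host = host_can_map_at(guest_request, GUEST_PAGE)
ok_on_16k_host = host_can_map_at(guest_request, ASAHI_PAGE)

# Page size of whatever machine runs this script (the VM's, if inside one).
this_host_page = mmap.PAGESIZE
```

A VM guest can run its own kernel with 4K pages even on hardware whose native OS uses 16K, which may be why the experience differs between Asahi and a Parallels VM.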
        
       | ThatPlayer wrote:
       | I've set up box86 on an ARM gaming handheld with a Qualcomm SDM845
       | recently and it's pretty amazing what it can do. The SDM845 has
       | pretty good Linux support (with postmarketOS supporting the
       | mainline kernel [0]). The open source drivers for the Adreno GPU
       | even support Vulkan and full desktop OpenGL.
       | 
       | With box86/box64, I've been able to run Steam and even
       | Wine/Proton with DXVK translating DirectX to Vulkan. I can even
       | run older 3D Windows games like Skyrim! Though it did glitch on
       | the infamous cart intro.
       | 
       | [0] https://wiki.postmarketos.org/wiki/SDM845_Mainlining
        
         | phh wrote:
         | Can you share more about your setup? I also have a sdm845
         | gaming handheld (Ayn Odin, currently running Android), and I
         | was contemplating installing Windows on it to get Steam, but I
         | much prefer your way.
         | 
         | Do you have some clean distro that boots into something usable
         | without mouse/keyboard? Some documentation for first stage
         | boots? Gits?
        
           | ThatPlayer wrote:
           | Yes, that's the same one I have. It's a similar install to
           | Windows, running an edk2 bootloader on another partition. The
           | developer for that has released a Debian 11 install that has
           | working touch controls and software keyboard, though I have
           | been using a USB-C hub with actual mouse and keyboard for
           | setup as the UI isn't scaled well for a 5" screen.
           | 
           | https://github.com/ProjectValhalla/OdinMultiBootGuides
           | 
           | I don't think it's ready yet for full time use, as the
           | joystick is mapped incorrectly for most games, but something
           | to keep an eye on.
        
         | eptcyka wrote:
         | What's the device you're using? Color me interested.
        
           | ThatPlayer wrote:
           | The Ayn Odin. I'm not sure I'd recommend it with all the new
           | x86_64 gaming handhelds coming out soon with similar
           | pricing, better GPU drivers, and not having to deal with
           | box86 compatibility issues.
           | 
           | https://liliputing.com/2022/06/compare-handheld-gaming-pc-
           | sp...
        
       | rnk wrote:
       | I don't have an apple arm device, I was waiting until I could run
       | x86 vms reasonably efficiently, because I need that all the time.
       | Up to now it seemed the answer was it's too slow if you want to
       | use an x86 basically in a normal way with a vm. This article
       | suggests rosetta2 would let you have usable performance, can
       | someone provide the high level view? r2 was about 2/3 the speed
       | of native exec on that last benchmark in the article.
        
         | olliej wrote:
         | You would be able to run x86 code on an arm Linux vm. There is
         | no VM option better than qemu or similar.
         | 
         | The problem as far as I can infer is that for a binary
         | translator the translator is given a bunch of context a full VM
         | can't have (what random clump of bytes is an executable, etc.).
        
       ___________________________________________________________________
       (page generated 2022-07-20 23:02 UTC)