Post AdSpk2mtu83GLPSwIi by agraf@fosstodon.org
(DIR) Post #AdRpAxbYzJJCybTQwa by zhuowei@notnow.dev
2024-01-02T21:25:14.433843Z
0 likes, 0 repeats
Current silly project: “User Mode Linux, but worse”. I modified QEMU to map userspace code from the virtual machine directly into QEMU’s (host) address space and jump to it.

I want to speed up a QEMU virtual machine when hardware acceleration is not available, but the host and guest architecture match; in this case, arm64->arm64.

I’m hoping that this would allow QEMU to only emulate the kernel code but run userspace code natively.

It kinda works! I had to avoid libc and make my own syscalls so I can replace the svc instruction with brk, the address is hardcoded, security goes out the window, and I didn’t even map a stack (it’s using the host stack!), but it prints.

[ 0.779478] Run /init as init process
fault! 400010214
translated! 2c32f0214 a9be7bfd
400010000 2c32f0000
sigtrap! 400010200
handled signal
fault! 400000158
Hello from zhuowei's init!
sigtrap! 400010200
handled signal
Hello from zhuowei's init!
sigtrap! 400010200
handled signal
Hello from zhuowei's init!
sigtrap! 400010200
handled signal
Hello from zhuowei's init!
sigtrap! 400010200
handled signal
Hello from zhuowei's init!
sigtrap! 400010200
handled signal
Hello from zhuowei's init!
sigtrap! 400010200
handled signal
Hello from zhuowei's init!
sigtrap! 400010200
handled signal
Hello from zhuowei's init!
sigtrap! 400010200
handled signal
Hello from zhuowei's init!
sigtrap! 400010200
handled signal
Hello from zhuowei's init!
sigtrap! 40001020c
handled signal
[ 0.797666] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00002a00
[ 0.797954] CPU: 0 PID: 1 Comm: init Not tainted 6.5.0-9-generic-64k #9-Ubuntu
[ 0.798166] Hardware name: linux,dummy-virt (DT)
[ 0.798366] Call trace:
[ 0.798540] dump_backtrace+0xa0/0x150
[ 0.798730] show_stack+0x24/0x50
[ 0.798822] dump_stack_lvl+0x78/0xf8
[ 0.798908] dump_stack+0x1c/0x38
[ 0.798980] panic+0x360/0x400
[ 0.799050] do_exit+0x56c/0x5d8
[ 0.799124] __arm64_sys_exit+0x24/0x30
[ 0.799204] invoke_syscall+0x7c/0x128
[ 0.799284] el0_svc_common.constprop.0+0x5c/0x168
[ 0.799376] do_el0_svc+0x38/0x68
[ 0.799448] el0_svc+0x30/0xe0
[ 0.799520] el0t_64_sync_handler+0x148/0x158
[ 0.799610] el0t_64_sync+0x1b0/0x1b8
[ 0.799968] Kernel Offset: 0x388cf2000000 from 0xffff800080000000
[ 0.800088] PHYS_OFFSET: 0xffffe1ea80000000
[ 0.800176] CPU features: 0x40000100,9e010000,0000421b
[ 0.800382] Memory Limit: none
[ 0.800686] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00002a00 ]---
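Editor's sketch, not the author's code: the post describes replacing the svc instruction with brk so that a syscall raises SIGTRAP on the host instead of entering the host kernel. In the post this was done by hand in a custom init; one way an automated rewrite could look is a scan over 4-byte A64 words that swaps the SVC encoding for BRK while keeping the 16-bit immediate. The function name `patch_svc_to_brk` is hypothetical, and the scan assumes the buffer holds only instructions (a literal pool that happens to look like svc would be rewritten too).

```python
import struct

# A64 exception-generation encodings: imm16 sits in bits 20:5.
FIXED_BITS_MASK = 0xFFE0001F  # every bit except imm16
SVC_BITS = 0xD4000001         # svc #imm16
BRK_BITS = 0xD4200000         # brk #imm16
IMM16_MASK = 0x001FFFE0


def patch_svc_to_brk(code: bytes):
    """Return (patched_code, offsets): a copy of `code` with each svc
    replaced by brk, plus the byte offsets that were rewritten."""
    if len(code) % 4:
        raise ValueError("A64 code must be a whole number of 4-byte words")
    count = len(code) // 4
    words = list(struct.unpack(f"<{count}I", code))
    offsets = []
    for i, word in enumerate(words):
        if (word & FIXED_BITS_MASK) == SVC_BITS:
            # Keep the immediate so the trap handler can still read it.
            words[i] = BRK_BITS | (word & IMM16_MASK)
            offsets.append(i * 4)
    return struct.pack(f"<{count}I", *words), offsets


# mov x8, #93 ; svc #0 ; ret
sample = struct.pack("<3I", 0xD2800BA8, 0xD4000001, 0xD65F03C0)
patched, offsets = patch_svc_to_brk(sample)
```

A SIGTRAP handler on the host would then look up the faulting address in `offsets` to decide whether the trap is a rewritten syscall; that part is not shown here.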
(DIR) Post #AdRvappf6lse7ln0hE by jevinskie@mastodon.social
2024-01-02T21:40:16Z
1 likes, 0 repeats
@zhuowei Nooow I get where you’re going. Spiffy!
(DIR) Post #AdS1cWmvU5J3MdhTay by Nukular@chaos.social
2024-01-02T23:37:16Z
0 likes, 0 repeats
@zhuowei noice! I wonder if you could skip the s/SVC/BRK dance by using seccomp with SECCOMP_RET_USER_NOTIF and outright blocking all syscalls. I think that's the trick gVisor uses, but I might be completely off.
(DIR) Post #AdS1cXjlxINUJ9ETJo by zhuowei@notnow.dev
2024-01-02T23:44:43.356446Z
0 likes, 0 repeats
@Nukular The host is macOS.

macOS has similar syscall filtering, although it’s not as comprehensive (some syscalls, such as mach_absolute_time and mach_continuous_time, don’t check sandbox).

If this actually works, I’d need to rebuild any OSes running on this to:
- use a 16k or 64k page kernel (RHEL ships with 64K support, but many other distros only expect to be run on 4K page size)
- avoid register x18 (macOS reserves it for Rosetta and zeroes it out on non-Rosetta apps to make sure no-one can use it)

so I’ll worry about syscalls when I get that far.
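Editor's sketch of the page-size constraint, under an assumption the post does not spell out: the host can only map and protect memory at its own page granularity (16K on Apple Silicon macOS), so mapping guest pages directly only works when each guest page covers a whole number of host pages. The helper name `can_map_guest_pages` is hypothetical.

```python
import mmap


def can_map_guest_pages(guest_page_size: int,
                        host_page_size: int = mmap.PAGESIZE) -> bool:
    """True if every guest page boundary is also a host page boundary."""
    return (guest_page_size >= host_page_size
            and guest_page_size % host_page_size == 0)


HOST_16K = 16 * 1024  # assumed host page size for the comparison below
for guest_kib in (4, 16, 64):
    ok = can_map_guest_pages(guest_kib * 1024, HOST_16K)
    print(f"{guest_kib}K guest pages on a 16K host: {'ok' if ok else 'not mappable 1:1'}")
```

Under this reading, a 4K-page guest kernel fails the check on a 16K host, while 16K and 64K kernels pass, which matches the list in the post.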
(DIR) Post #AdS1n1NEceZJ7TgeQq by zhuowei@notnow.dev
2024-01-02T23:46:39.497459Z
0 likes, 0 repeats
@Nukular (I have seen https://gvisor.dev/blog/2023/04/28/systrap-release/ a while back, but thanks for letting me know)
(DIR) Post #AdS4IabDu7eFFtPVnU by Nukular@chaos.social
2024-01-02T23:55:27Z
0 likes, 0 repeats
@zhuowei ouch, I didn't know about that x18 issue on macOS. Does it only zero it on context switch or is this a hardware thing somehow?
(DIR) Post #AdS4IbS2jjtnti7h7w by zhuowei@notnow.dev
2024-01-03T00:14:46.997894Z
0 likes, 0 repeats
@Nukular x18 is reserved on iOS and macOS, and (for macOS apps compiled >macOS 13.0, and all iOS apps) it's zeroed upon context switch/kernel return.

https://developer.apple.com/documentation/xcode/writing-arm64-code-for-apple-platforms
https://twitter.com/never_released/status/1588403958296375296

Initially Apple didn't zero out the register, but they ended up needing the x18 register for Meltdown mitigation on older iPhones.

Fun fact: when Apple first implemented the Meltdown mitigation, they zeroed the register too late, which meant the mitigation that was supposed to prevent attackers from getting kernel pointers ended up giving attackers a kernel pointer: https://bazad.github.io/2018/04/kernel-pointer-crash-log-ios/
(DIR) Post #AdS5K7Ek919QiqBg80 by zhuowei@notnow.dev
2024-01-03T00:26:16.043991Z
0 likes, 0 repeats
@Nukular As for other uses of x18:
- It's marked as reserved in the official Arm ABI
- Linux lets apps use it freely
- Recent Android reserves it for Clang ShadowStack, although if your app doesn't use ShadowStack, you can use it freely
- Windows uses it for the TEB (Thread Environment Block: thread-local storage and stuff)
- This means Wine needs to carefully save and restore x18 when running an aarch64 Windows app (which uses x18 to point to the TEB) on Linux (which lets any function set x18): https://bugs.winehq.org/show_bug.cgi?id=38780
(DIR) Post #AdSHKWv3r9J3pDwfEO by rpetrich@hachyderm.io
2024-01-03T01:44:56Z
0 likes, 0 repeats
@zhuowei you might consider looking at https://github.com/linux-noah/noah. It takes a somewhat different approach to a similar goal
(DIR) Post #AdSHKXoiWDpGbpz6yu by zhuowei@notnow.dev
2024-01-03T02:40:46.099703Z
0 likes, 0 repeats
@rpetrich Doesn't gVisor also use KVM to trap syscalls? And Ryujinx uses Hypervisor.framework to run Switch games' arm64 CPU code on Apple Silicon... Anyways, Noah's innovation is using hardware virtualization to trap syscalls (https://events.static.linuxfound.org/sites/events/files/slides/Noah%20Hypervisor-Based%20Darwin%20Subsystem%20for%20Linux.pdf); I'm trying to avoid hardware virtualization.
(DIR) Post #AdSpk2mtu83GLPSwIi by agraf@fosstodon.org
2024-01-03T07:54:01Z
0 likes, 0 repeats
@zhuowei Are you trying to avoid Hvf because you want to run this on iOS?

How are you planning to handle multiple different virtual address spaces? I guess you could run nommu Linux 🤔. With that, guest virt == host virt + offset. Then jump between "1:1 executed" (could even be via another region plus a translator that only replaces svc insns in place) and "fully translated" (tcg) through a global broker. Nifty 😁👍
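Editor's sketch of the "guest virt == host virt + offset" idea: with a single fixed offset, translating between the two address spaces is one addition or subtraction. The two page addresses below are taken from the log in the first post; reading them as a guest page and the host page it was mapped to is an assumption, and the function names are hypothetical.

```python
# Page pair printed in the first post's log, read here (as an assumption)
# as "guest page" followed by "host page it was mapped at".
GUEST_PAGE = 0x400010000
HOST_PAGE = 0x2C32F0000

# guest virt == host virt + OFFSET
OFFSET = GUEST_PAGE - HOST_PAGE


def guest_to_host(guest_addr: int) -> int:
    """Translate a guest virtual address to the host address backing it."""
    return guest_addr - OFFSET


def host_to_guest(host_addr: int) -> int:
    """Translate a host address inside the mapped region back to the guest."""
    return host_addr + OFFSET


print(hex(guest_to_host(0x400010214)))
```

A single offset only holds while the guest has one address space, which is why the reply suggests nommu Linux; per-process guest address spaces would need a lookup table instead of a constant.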
(DIR) Post #AdSpk44f7ZqyKmmdcm by zhuowei@notnow.dev
2024-01-03T09:06:21.508086Z
0 likes, 0 repeats
@agraf Jailbroken iOS, eventually, which will let me use more processes. (I believe it's possible to spawn more processes on iOS with NSExtensions and such, but they'll be severely throttled as background processes)
(DIR) Post #AdUQu71Uu2DHq1uw6K by zhuowei@notnow.dev
2024-01-04T03:37:24.381342Z
0 likes, 0 repeats
@jevinskie Yeah, that's my other inspiration for this. (I actually downloaded the QEMU 0.11 source to see where kqemu gets called in the emulation loop)