[HN Gopher] A toy RTOS inside Super Mario Bros. using emulator s...
       ___________________________________________________________________
        
       A toy RTOS inside Super Mario Bros. using emulator save states
        
       This started as a throwaway metaphor in a blog post, but is now
       fully runnable: a toy RTOS with preemptive multitasking inside of
       Super Mario Bros. on the NES.  Essentially, this is:  - A
       rudimentary preemptive RTOS  - Using an unmodified NES emulator
       (FCEUX) as the CPU                   - "Unmodified" depending on
       how you define terms       - With emulator save states as the
       thread contexts  - With support for (very basic) mutexes, interrupt
       masking, and condition variables  - Demonstrated using Super Mario
       Bros. 1-1 with sections of the map dedicated to various
       synchronization primitives  There are many simplifications and
       shortcuts taken (doesn't even have task priorities), and it doesn't
       map 1:1 to true multithreading (e.g., emulator save states
       represent the state of the entire machine including RAM, whereas
       thread contexts represent a much more minimal slice), but I think
       it's A) pretty interesting and B) a unique visceral explanation of
       threads.
        
       Author : notorious_pgb
       Score  : 242 points
       Date   : 2025-05-28 20:15 UTC (1 days ago)
        
 (HTM) web link (prettygoodblog.com)
 (TXT) w3m dump (prettygoodblog.com)
        
       | ignormies wrote:
       | This is a super cool visual demonstration of RTOS/scheduling! I
       | love the region-based critical sections!
       | 
       | I took a real-time operating systems course in university as an
       | elective. One of the hardest courses I took the whole four years,
       | but also one of the most interesting. Had a great professor, who
       | gave really demanding, but very instructive, project-based
       | assignments.
       | 
       | I need to find a toy project to play around with this domain
       | again.
        
         | notorious_pgb wrote:
         | Thank you!
         | 
         | I'm curious how effective you feel this specific example
         | might've been if it were delivered during your course. I
         | _suspect_ I 've stumbled across a really helpful teaching tool,
         | but having not gone to university, I don't actually know how
         | this stuff is being taught :v
        
           | throwanem wrote:
           | Nor I, but _I 've_ learned from the metaphor, for what that's
           | worth. And the demystification of primitives has considerable
           | value these days.
        
             | notorious_pgb wrote:
             | > Nor I, but I've learned from the metaphor, for what
             | that's worth.
             | 
             | It's the sickest possible outcome for me, so: worth a lot!
        
       | Refreeze5224 wrote:
       | This is a fantastic way of demonstrating threading and
       | scheduling, well done!
        
       | bitwize wrote:
       | "...But first we need to talk about parallel universes."
        
       | dsfdsfsfew wrote:
       | Kind of pushing the definition of RTOS here, I was expecting
       | something like using ACE to run a concurrent program, still an
       | interesting and informative post.
        
         | notorious_pgb wrote:
         | You're goddamn right it's pushing it.
         | 
         | I struggled with how to convey this; the ultimate goal is to
         | _viscerally_ demonstrate _what a thread IS_ , so I'm
         | comfortable with it being... something strange... but I do want
         | it to resemble actual concurrent systems to a useful degree.
         | 
         | Like, the thread scheduler code is userland code -- similar to,
         | say, FreeRTOS -- but the "threads" themselves aren't _exactly_
         | threads in an exact sense, since their context packages are
         | entire save states for a console, not just its processor.
         | 
         | Also, the game code (the true userland?) has no ability to
         | access the threading API whatsoever, which really harms the
         | analogy.
         | 
         | So I went with RTOS because it's a preemptive scheduler with
         | synchronous reschedule-on-block; where the scheduler is
         | implemented in the same codebase as the code using it. But to
         | be honest, nothing I know of fits.
        
       | loas wrote:
       | I really enjoyed this, thank you.
        
       | shakna wrote:
       | If you're going to make this your DooM thing, clearly the next
       | thing to do is grab a DooM engine and get to work. (PsyDoom has
       | Lua support even).
       | 
       | But congrats. This is worth being included in a uni course.
        
         | notorious_pgb wrote:
         | It'd be really interesting to push this to that level -- the
         | big challenge likely being that DooM's state (including video
         | memory) is certainly much larger and expensive to swap than an
         | NES save state.
         | 
         | One would have to get really clever...
        
           | noone_youknow wrote:
           | ... and introduce something like virtual memory?
        
       | anthk wrote:
       | Forth does that too.
       | https://www.bradrodriguez.com/papers/mtasking.html
        
       | gwbas1c wrote:
       | Last night I read an article about eliminating lag in emulators.
       | It's done with a similar concept. Basically, for each frame, the
       | emulator calculates the state for different button combinations.
       | Then, based on the button you actually push, the state shown
       | moves to the precalculated one.
        
         | jchw wrote:
         | Not 100% familiar with exactly what this is but I am familiar
         | with run-ahead, which is basically also the same idea as GGPO,
         | but for eliminating lag:
         | 
         | - Run the emulator one frame normally, using real polled input.
         | Somehow snapshot the state.
         | 
         | - Then, run the emulator n frames (usually just one) with the
         | same input. Present the video + audio from the _last_ frame.
         | 
         | - Synchronize. (Some emulators can get very fancy; instead of
         | just waiting for vsync, they'll also delay until the end of the
         | window minus estimated processing time, to poll input at the
         | last possible moment.)
         | 
         | - Roll back to the saved snapshot. (I believe you can also
         | optimize if you know the inputs really didn't change, avoiding
         | a lot of rollbacks at the cost of less predictable frame
         | times.)
         | 
         | The main reason this is even a good idea is because most games
         | will have some of their own processing latency by design, so
         | jumping a frame or two ahead usually doesn't have any
         | noticeable side-effects. This is a pretty cool idea since
         | obviously modern computers with LCD screens have a lot more
         | latency basically everywhere versus older simpler machines
         | connected to CRTs.
         | 
         | Unfortunately, this sort of approach only works when your
         | emulator's state is small enough and fast enough to restore.
         | 
         | I actually have been dying to experiment with designing an
         | emulator to have fast incremental snapshots from the ground up
         | to see if you could manage to make this feasible for more
         | modern consoles. You could, for example, track dirty memory
         | pages with userfaultfd/MEM_WRITE_WATCH, and design structures
         | like JIT caches to be able to handle rewinding without having
         | to drop the entire cache. I'm actually not sure that all
         | emulators clear their caches upon loading state, but you know,
         | more generally, I would like to know how fast and small you
         | could get save states to be if you were designing for that from
         | the ground up.
        
           | achierius wrote:
           | What do you mean by JIT cache in this case -- like an inline
           | data cache, or like a cache for the jitted binary code?
        
             | jchw wrote:
             | The latter, though I'm just using it as an example really.
             | I think for most emulator core designs today you wind up
             | dumping a lot of ephemeral state when you rewind or load
             | state (and sometimes even saving state has some overhead
             | due to synchronization.)
        
         | cortesoft wrote:
         | i think this is kind of the inverse of how some emulators that
         | add remote play deal with network lag... they will predict what
         | control the opponent will give for the current frame and
         | display it, but then when they actually receive the real
         | opponent control some number of frames later, they will go back
         | to that frame and re-calculate based on knowing the actual
         | control input for that frame, and then catch up to the current
         | frame. That way the game isn't waiting for remote inputs to
         | refresh the screen, but can then reconcile once all information
         | is known.
        
         | potato3732842 wrote:
         | Sounds a lot like speculative execution for hardware.
        
           | sparky_ wrote:
           | Was just thinking the same, this is effectively branch
           | prediction/precalculation.
        
         | cevn wrote:
         | Umm, that's brilliant. Is this technique employed in all modern
         | emulators or a recent thing?
        
         | notorious_pgb wrote:
         | Interesting; which emulator(s) was this for? I have to imagine
         | this strategy is mostly effective below a certain total
         | complexity level -- however you want to define that -- and the
         | more modern you get, the more you're lucky to even render a
         | frame at all.
        
       | liamwire wrote:
       | Brilliant visual demonstration, really helps build an intuition
       | for the concepts at play.
        
       | Sohcahtoa82 wrote:
       | That's pretty neat.
       | 
       | The metaphor I usually go for when it comes to for threads is
       | having several widgets that need to be assembled. To begin
       | working on Widget B, you have to pause the work on Widget A.
       | Adding SMP is like adding a second person to help assemble
       | widgets, but you're still using the same toolbox, so if you both
       | need the same screwdriver, there's no performance benefit to
       | having a second person. Multicore is having multiple workbenches
       | each with their own toolbox so they can operate completely in
       | parallel.
       | 
       | A mutex is like having a specialized tool that can only be used
       | by one widget assembly at a time. Like, each workbench might have
       | a full set of screwdrivers, hammers, wrenches, sockets, etc., but
       | you only have 1 welder.
       | 
       | A semaphore is a tool that can be used by a limited number of
       | widgets, like an oven that can fit up to 4 widgets.
        
         | eru wrote:
         | As a special challenge: explain (Software) Transactional
         | Memory, as eg implemented in Haskell, in your metaphor.
         | 
         | STM is pretty similar to how some databases implement
         | transactions.
        
         | ImPostingOnHN wrote:
         | _> Adding SMP is like adding a second person to help assemble
         | widgets, but you 're still using the same toolbox, so if you
         | both need the same screwdriver, there's no performance benefit
         | to having a second person._
         | 
         | This sounds like hyperthreading - or, maybe I'm missing the
         | screwdriver in the metaphor :)
        
           | ranger207 wrote:
           | Hyperthreading is Intel's implementation of SMP
        
             | Sohcahtoa82 wrote:
             | Funny thing is, when I was first typing that comment, I
             | started with "Hyperthreading" and deleted it in favor of
             | SMP to be vendor agnostic.
             | 
             | EDIT: And it should have been SMT anyways!
        
             | _ihaque wrote:
             | I think you mean SMT (simultaneous multithreading), as
             | opposed to SMP (symmetric multiprocessing)?
        
               | Sohcahtoa82 wrote:
               | Ah hell, that was MY mistake. It should be SMT in my
               | original comment and it's way too late to edit it now.
        
         | notorious_pgb wrote:
         | I think this is a pretty damn good metaphor for the
         | _introduction_ of these concepts, especially in a context where
         | the learner(s) might need to imminently apply this knowledge.
         | 
         | What drove me to write this post (as someone who didn't attend
         | college where I might've learned exactly what I'm about to say
         | -- big grain of salt) is that it took me being forced by
         | circumstance to become a firmware engineer for a time to
         | actually understand not just _how to multithread well_ , but
         | _what the hell this all even was_.
         | 
         | And I guess I saw one too many Reddit comments (mistake #1)
         | which downplayed the idea that there's any benefit to knowing
         | the answer to such things.
         | 
         | That all said, this blog post / metaphor is definitely aimed at
         | someone who's already been using threads effectively for some
         | time -- I don't think it'd fare very well as a first
         | introduction.
        
         | iwontberude wrote:
         | SMP is a second processor, even more redundant than a second
         | core. Its own cache, memory, Northbridge access, etc.
        
           | Sohcahtoa82 wrote:
           | Yeah, I messed up SMT vs SMP. Can't edit it now, though.
        
             | notorious_pgb wrote:
             | Ah well -- the article itself has its own fair share of
             | accuracy issues -- and I learned something from your
             | conversation, which wouldn't have happened without said
             | mistake.
        
       | aftbit wrote:
       | https://web.archive.org/web/20250528203416/https://prettygoo...
       | 
       | https://archive.ph/msHkO
        
       | hei-lima wrote:
       | Wow! So cool!
        
       | noone_youknow wrote:
       | This is simply amazing work, I take my hat off to you. And not
       | just the threading in an emulator (which is a genius concept all
       | by itself) but for the article itself - I have spent years trying
       | to find the right way to articulate exactly your points in hopes
       | of engaging people to learn more about the systems they're
       | building on top of.
       | 
       | From now on, I'll simply send out a link to this. Thank you!
        
         | notorious_pgb wrote:
         | My person!
         | 
         | I worried that I'd come across as a high-horse dickhead in the
         | mystification section, when that's the fundamental opposite of
         | my intention and attitude here. I'm glad to hear that it
         | resonated with another.
         | 
         | Although I guess we could both be high-horse dickheads.
        
           | upghost wrote:
           | Not only that but now, thanks to you, we can all be high-
           | horse dickheads _concurrently_.
        
           | peterkelly wrote:
           | The mystification section was my favorite part - you make
           | some excellent points there.
        
           | noone_youknow wrote:
           | IMO the mystification section is right on the money, I
           | couldn't have said it better myself (and believe me, I've
           | tried).
        
       | CGamesPlay wrote:
       | Would be neat if the subworld memory was shared between the save
       | states, to really drive home the mutex concept.
        
         | notorious_pgb wrote:
         | This is one of the core weaknesses of the analogy that I'm
         | trying to figure out how to resolve: there's no shared memory,
         | unlike actual threads.
         | 
         | I think the solution requires that the _game code_ is what 's
         | executing threading syscalls, not the Lua around the game code.
         | 
         | It'd definitely be a super sick evolution to add true shared
         | in-game memory.
        
           | kelnos wrote:
           | I guess in a way the three Mario threads are more like
           | processes? And they're using OS-level or shared-memory
           | mutexes and condvars that can be shared between processes?
        
             | notorious_pgb wrote:
             | I see what you're getting at -- though then it raises the
             | question: what would it mean for those processes to have
             | multiple threads within them?
             | 
             | The way I've been looking at it, if I want to take this
             | further, processes would maybe be _NES ROMs_, and each
             | process can have multiple threads. Swapping between
             | processes would inherently be more expensive than between
             | threads within a process, which I think would be a really
             | instructive aspect. Also, the whole idea of "sharing
             | memory" between _entirely different games_ is obviously
             | ridiculous, which could be yet another teaching tool.
             | 
             | Then if we want to expand it even _further_ , maybe
             | multicore support would look like multiple emulator
             | instances, connected via sockets to support cross-core IPC.
             | You'd probably want to institute an "all the threads of a
             | process have to be on the same core" rule (so they can
             | locally share primitives in the same Lua script), which
             | would be a great way to demonstrate the principle of
             | building systems which adapt to their realistic
             | constraints, as opposed to building systems which exactly
             | model a textbook.
             | 
             | I've gotten ahead of myself, though.
        
               | Ericson2314 wrote:
               | Easier to leave out the parallelism and just look at
               | concurrency.
               | 
               | For a game like Mario, you can split up the memory and
               | decide what you want to be shared and what do you want to
               | be per-thread. E.g. starting small you can rig up just a
               | few fixed variables like scores or whatever to be
               | persisted from one game to the next. trying to push that
               | further without causing endless corruption should be fun
               | :)
        
           | leni536 wrote:
           | You have critical section in the bonus section behind the
           | pipe. Maybe shared state protected by that critical section
           | could be your shared memory, like the state of the coins
           | there.
        
       | Cloudef wrote:
       | Isn't this more like coroutines / green threads? I guess it
       | doesn't matter as they are essentially same concept apart from
       | cooperative scheduling vs. preemptive scheduling.
        
         | notorious_pgb wrote:
         | I suppose the _actual thing it is_ is closer to what you 're
         | describing, but the _thing that it makes viscerally
         | understandable_ is threads itself -- I hope.
        
       | abyesilyurt wrote:
       | Tsoding recently built coroutines for C from scratch. The concept
       | of coroutines is related to threads, so I bet you'd like the
       | video if you liked the article.
       | 
       | https://youtu.be/sYSP_elDdZw
        
       | Ericson2314 wrote:
       | Sorry to be an pedant.
       | 
       | From your previous post
       | 
       | > The first thing to understand about threads is that processors
       | do not support them.
       | 
       | Yes, but I would say by changing the emulator (augmenting the
       | emulator software with your other software, sure) in the
       | _outside_ world, not doing something in the _inside_ world, you
       | have are implementing something closer akin to hardware
       | concurrency support or processes.
       | 
       | > It doesn't map 1:1 to true multithreading (e.g., emulator save
       | states represent the state of the entire machine including RAM,
       | whereas thread contexts represent a much more minimal slice
       | 
       | Right, but I think if we're going to make a threads vs process
       | distinction (vs concurrency in general) whether we have only a
       | small opt-in set of communication channels / synchronization
       | primitives (the mutex and barrier), or a wide open one (shared
       | memory, be careful!) is the operative distinction.
        
       | mort96 wrote:
       | This is probably a stupid question, but why real-time? Isn't it
       | enough to "just" make an OS in SMB, doesn't adding real-time
       | guarantees make it even more difficult, by restricting what
       | allocator and scheduling algorithms you can use?
       | 
       | In any case, this looks like a really cool project.
        
       ___________________________________________________________________
       (page generated 2025-05-29 23:01 UTC)