[HN Gopher] Fixing the Loading in Myst IV: Revelation
       ___________________________________________________________________
        
       Fixing the Loading in Myst IV: Revelation
        
       Author : davikr
       Score  : 248 points
       Date   : 2024-12-10 01:13 UTC (3 days ago)
        
 (HTM) web link (medium.com)
 (TXT) w3m dump (medium.com)
        
       | tomcam wrote:
       | > the author explains they used a tool called Luke Stackwalker to
       | profile the game
       | 
       | Can anyone confirm my memory that Microsoft had a tool called
        | Luke Heapwalker in the mid-1980s, and that Lucasfilm demanded
       | they change the name?
        
         | keyle wrote:
          | I can't find an exact story, but this PDF [1] has a footnote
          | about the name change (search for Luke).
         | 
         | [1]
         | https://libros.metabiblioteca.org/server/api/core/bitstreams...
         | [pdf]
        
           | tomcam wrote:
           | Many thanks. I love the whole book.
        
       | EDEdDNEdDYFaN wrote:
        | Very good read! I love detailed explanations of the "bad"
        | original code and the steps taken toward improving it. A lot of
        | it comes down to personal preference, and the author did a good
        | job of respecting what might have been an intentional design
        | decision by making all of their optimizations configurable.
        
       | zetafunction wrote:
       | Great read! Though there is an unnecessary double map lookup in
       | part 2:
       | https://github.com/tomysshadow/M4Revolution/blob/094764c87aa...
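        | 
        | A generic sketch of the fix (my illustration, not the linked
        | code): reuse the iterator from a single find() instead of
        | querying the map twice.
        | 
        |   #include <string>
        |   #include <unordered_map>
        | 
        |   int lookupOnce(std::unordered_map<std::string, int>& m,
        |                  const std::string& key) {
        |       // Double lookup: count() walks the map, then operator[]
        |       // walks it again:
        |       //   if (m.count(key)) return m[key];
        |       auto it = m.find(key);  // single lookup
        |       return it != m.end() ? it->second : -1;
        |   }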
        
       | rkagerer wrote:
       | This is awesome (and very impressive)!
       | 
       | Two questions:
       | 
       | 1. What tool was used to generate that "Ange Albertini-inspired
       | file format diagram"?
       | 
       | 2. Is there an emulator that would make this easy to play under
       | Android?
        
       | kubb wrote:
       | My manager takes one look at this and asks: so, in the end the
       | effort was unsuccessful? No impact? That's OK, let's get you
       | refocused on something productive :)
        
       | throwaway284534 wrote:
       | I really enjoyed the author's technical deep-dive and approach to
       | debugging performance issues. Mild spoilers for anyone who hasn't
       | played Riven, but the method for fixing Gehn's faulty linking
       | books is a perfect analogy for the author's more counterintuitive
       | performance optimizations.
       | 
       | While I don't have a write-up as detailed as this one, I spent a
       | month on a similar journey optimizing an animated ASCII art
       | rasterizer. What started as an excuse to learn more about browser
       | performance became a deep dive into image processing, WebGL, and
       | the intricacies of the Canvas API. I'm proud of the results but
       | I've annotated the source for a greater mind to squeeze another 5
       | or 10 FPS out of the browser.
       | 
       | Maybe it's time to brush up on those WebGL docs again...
       | 
       | - [1] https://asciify.sister.software/
       | 
       | - [2] https://github.com/sister-
       | software/asciify/blob/main/Asciify...
        
       | iforgotpassword wrote:
       | Great writeup. A typical "this shouldn't be too hard" story with
       | yet another surprise around every corner. Seems familiar... :)
       | 
        | One thing I wondered: with that optimized loader library, is it
        | even still necessary to do the DXT conversion at all? Sounds
        | like mango and pixman could be fast enough already...
        
       | lbj wrote:
        | My hat is off to this; I really appreciate how he documented
       | every step he took. It's lengthy but definitely worth the read.
        
       | tomovo wrote:
        | stb_image didn't get used in the end because some other library
        | was faster, but I think the author missed the possibility of
        | #defining their own allocator using STBI_MALLOC (which could just
        | return a pointer to an existing memory block).
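        | 
        | A rough sketch of that idea (the pool below is hypothetical, not
        | anything from the article): stb_image lets you swap its
        | allocator by defining these macros before including the
        | implementation, and STBI_REALLOC_SIZED also passes the old size,
        | which makes a simple bump allocator workable.
        | 
        |   #include <cstddef>
        |   #include <cstring>
        | 
        |   static unsigned char g_pool[64u * 1024u * 1024u]; // pool block
        |   static size_t g_used = 0;
        | 
        |   static void* poolAlloc(size_t sz) {
        |       if (g_used + sz > sizeof(g_pool)) return nullptr; // full
        |       void* p = g_pool + g_used;
        |       g_used += sz;
        |       return p;
        |   }
        | 
        |   static void* poolRealloc(void* p, size_t oldSz, size_t newSz) {
        |       void* q = poolAlloc(newSz);
        |       if (q && p)
        |           std::memcpy(q, p, oldSz < newSz ? oldSz : newSz);
        |       return q;
        |   }
        | 
        |   #define STBI_MALLOC(sz)                poolAlloc(sz)
        |   #define STBI_REALLOC_SIZED(p, old, sz) poolRealloc(p, old, sz)
        |   #define STBI_FREE(p)                   ((void)(p)) // bulk free
        |   #define STB_IMAGE_IMPLEMENTATION
        |   #include "stb_image.h"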
        
       | feintruled wrote:
        | I think any software engineer can identify with the feeling of
        | doing the first run of a solution you are 100% sure has to fix
        | the problem, only to find nothing has changed.
        
         | finnh wrote:
         | This has happened to me so many times. Especially in the
         | distributed database I work on ... "hmm maybe I need to let the
         | experiment run for longer, this data is noisy so it probably
         | needs more time to show a trend line".
        
         | ramon156 wrote:
          | This reminds me of when I was trying to do Minecraft-style
          | chunking in Bevy. Instead of doing the not-so-obvious fix, I
          | threw parallelization, compiler optimization, caching, release
          | flags, etc. at my project, and nothing made it go faster. I
          | could not figure out why it was so slow. Turns out what I was
          | doing was so unoptimized that I might as well have loaded the
          | whole world every frame.
         | 
         | You live and you learn :)
        
         | danudey wrote:
          | Corollary: the relief/anguish when you discover that the reason
          | none of your fixes worked, and none of your debugging print
          | statements produced output, is that you were editing a
          | different copy of the file than the one getting built/run,
          | because you moved or renamed something and your editor didn't
          | notice.
        
           | 01HNNWZ0MV43FF wrote:
           | Running the Windows build but editing the WSL files
        
       | shiomiru wrote:
       | > As any good programmer knows, division is slow, it's a
       | serializing instruction, and we want to avoid it as much as
       | possible. The favourite programmer tricks to avoid division are
       | to use bit shifts (for division by multiples of two) or flip it
       | into a multiplication -- for example, to multiply by 0.333
       | instead of dividing by 3. In this case though, we are dividing by
       | the alpha, so we can't know what number we will be dividing by in
       | advance.
       | 
       | > However, because the channels are 8-bit, we will only ever be
       | dividing numbers from 1 to 255 (yes, some values will be zero --
       | but we sure won't be dividing by them then!) That means there are
       | only about 65K possible combinations, so we can use another
       | classic solution: a lookup table! This is a perfect place to use
       | constexpr to bake the array directly into the compiled result.
       | 
       | Interestingly, when I benchmarked this same problem, three
       | integer divisions would easily beat the LUT on my computer. Maybe
        | because it's easier on the cache? (Or I did something wrong.)
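        | 
        | For reference, a minimal sketch of the LUT being described (my
        | reconstruction, not the article's code; needs C++17): bake a
        | 256x256 table of rounded value*255/alpha results and index it
        | instead of dividing.
        | 
        |   #include <array>
        |   #include <cstdint>
        | 
        |   constexpr auto makeUnpremultiplyLUT() {
        |       std::array<std::array<uint8_t, 256>, 256> lut{};
        |       for (int a = 1; a < 256; ++a)      // never divide by 0
        |           for (int v = 0; v < 256; ++v) {
        |               int r = (v * 255 + a / 2) / a;  // rounded division
        |               lut[a][v] = uint8_t(r > 255 ? 255 : r);
        |           }
        |       return lut;
        |   }
        | 
        |   // ~64 KB, baked into the binary at compile time.
        |   constexpr auto kUnpremul = makeUnpremultiplyLUT();
        | 
        |   inline uint8_t unpremultiply(uint8_t value, uint8_t alpha) {
        |       return alpha ? kUnpremul[alpha][value] : 0;
        |   }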
        
         | Palomides wrote:
          | I think, on a modern processor, a lookup table of 65K entries
          | is way too big to be worthwhile for something as simple as a
          | division.
          | 
          | Memory access is comparatively very slow; OP probably should
          | have benchmarked it.
        
           | Jerrrry wrote:
            | Depends on the use and context. Benchmarking can hint at the
            | "truth" that it is faster to just divide, but ultimately the
            | benchmark would have to include both implementations and run
            | them through all possible inputs to know for sure.
            | 
            | It may clearly show that memoization is slower, until you
            | plop it into a benchmark that inlines the division explicitly
            | without using a given hardware acceleration.
            | 
            | Division by a constant can be made faster still, so there can
            | be optimizations on top of memoization that would beat raw
            | division on most processors/runtime environments.
        
           | wongarsu wrote:
           | Chances are only a small portion of the lookup table is used.
           | It's conceivable that this is almost entirely L1 access.
           | 
           | But I'm also very skeptical that this is actually faster.
           | Would have been nice to see some numbers
        
             | Palomides wrote:
              | I think an L2 cache hit is about the same number of cycles
              | as a single byte divide; might as well save the cache for
              | something else.
             | 
             | but yeah, this is just my speculation
        
         | wat10000 wrote:
         | It also depends a lot on your CPU. Operations can be made
         | faster by dedicating more resources to them. Most CPUs go for a
         | slowish division operation that isn't too expensive in terms of
         | resources, or lower-end ones don't provide it at all and make
         | you do it in software. Apple seems to have decided that integer
         | division is pretty important, so they dedicated a lot of
         | resources to it and the M1 does a divide in something like two
         | cycles. Even thinking about loading from memory is probably
         | slower than that.
        
         | UniverseHacker wrote:
          | Instead of compiling and running code, just replace all of your
          | code with composable functions that are each small enough to
          | look up the results for in a precomputed hardware dictionary.
          | In this way you have a Turing-complete CPU with only one
          | instruction.
          | 
          | I started this comment as a tongue-in-cheek satire of yours,
          | but now I'm honestly wondering if it could be a viable idea for
          | a radically simplified CPU (I'm not a computer engineer). I
          | suppose the lookup tables rapidly become too large, possibly
          | before they are useful?
        
           | wongarsu wrote:
            | For an 8-bit CPU you could maybe do it. For a 64-bit CPU, the
            | lookup table for addition alone would be massive. (You can of
            | course do addition in smaller increments, like adding one
            | digit at a time and keeping track of the carry, but then you
            | just have a normal full adder.)
            | 
            | The biggest issue is that this CPU would be conceptually
            | simple but very difficult to make fast. Memory is slow, and
            | accessing a 64K lookup table uses more transistors than just
            | doing the addition.
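            | 
            | A toy illustration of that scaling problem (hypothetical,
            | not a real design): an 8-bit adder as a 64K-entry table,
            | chained through a carry to get 16 bits, at which point it is
            | just a ripple-carry adder again.
            | 
            |   #include <array>
            |   #include <cstdint>
            | 
            |   // table[(a << 8) | b] = 9-bit sum of bytes a and b.
            |   static const std::array<uint16_t, 65536> kAdd = [] {
            |       std::array<uint16_t, 65536> t{};
            |       for (int a = 0; a < 256; ++a)
            |           for (int b = 0; b < 256; ++b)
            |               t[(a << 8) | b] = uint16_t(a + b);
            |       return t;
            |   }();
            | 
            |   uint16_t add16(uint16_t x, uint16_t y) {
            |       uint16_t lo = kAdd[((x & 0xFF) << 8) | (y & 0xFF)];
            |       uint16_t hi = kAdd[((x >> 8) << 8) | (y >> 8)]
            |                     + (lo >> 8);  // propagate the carry
            |       return uint16_t(((hi & 0xFF) << 8) | (lo & 0xFF));
            |   }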
        
             | UniverseHacker wrote:
             | Again, I'm no expert here but find this stuff fascinating.
             | Could the simplicity possibly allow some radically
             | different CPU design that makes the lookups alone nearly
             | instantaneous? I could imagine some types of
             | optical/photonic physical 3D hash table structures - even
             | ones where the same physical structure could support a
             | large number of read operations in parallel if pre-arranged
              | by a compiler to not physically interfere. I imagine a CPU
              | with only one instruction could be physically minuscule,
              | and therefore pack a lot of cores into a small space.
             | 
             | Hypothetically, if it were orders of magnitude faster than
             | a normal CPU, one could still perform rapid computation on
             | larger numbers as needed while keeping the cpu physically 8
             | bit- yet be able to revert to even higher performance when
             | less precision is needed.
        
               | spencerflem wrote:
               | If you're interested in One Instruction Computers, check
               | out: https://en.m.wikipedia.org/wiki/One-
               | instruction_set_computer
        
               | UniverseHacker wrote:
               | Thanks! I half expected to be told one instruction
               | computers were impossible.
        
             | WorldMaker wrote:
             | Just build 4 of them and chain them together to get
             | 64-bits.
             | 
             | (Beyond just being a joke, it is often how things like this
             | get scaled. 1-Bit Adders become 2-Bits with a Carry Line
             | between, then 3-bits with another Carry Line, and so forth,
             | ad infinitum and beyond. The real joke is the complexity
             | involved in however you "just" chain them together.)
        
           | spookie wrote:
            | CPUs do use a ton of LUTs under the hood.
        
             | UniverseHacker wrote:
             | I wonder if this was inspired by this discussion, but this
             | just made the front page. I didn't realize the famous
              | Pentium FDIV bug was due to an error in a lookup table used
             | for division: https://news.ycombinator.com/item?id=42391079
        
           | spencerflem wrote:
           | I've been (very slowly) working on this for a 4-bit CPU. Not
           | for any practical reason, just for fun.
        
             | UniverseHacker wrote:
             | Any more details you could share? Is this a physical cpu
             | design, or a VM/assembler?
        
           | fifilura wrote:
            | There are packages for those things on npm, but then you'll
            | have to start using JavaScript...
            | 
            | https://www.npmjs.com/package/@samuelmarina/is-even
            | 
            | The JavaScript crowd. They are always one step ahead!
        
             | UniverseHacker wrote:
              | Haha, a 100 MB library. It doesn't say so, but is this
              | really a dictionary of numbers with even/odd as values? I
              | love that they have an entirely separate is-odd package.
              | 
              | //edit: looked at the code, and it's literally 100 MB of
              | if-else statements, and a bunch of them are wrong! The
              | GitHub pull requests to add additional integers are
              | hilarious.
        
         | bluGill wrote:
          | With all the discussion of lookup tables here, it is worthwhile
          | to remember the Pentium FDIV bug from the mid-1990s:
          | https://en.wikipedia.org/wiki/Pentium_FDIV_bug
        
           | UniverseHacker wrote:
            | I was a little kid, about the age my son is now, when I went
            | with my dad to buy a 486 CPU and motherboard... he said he
            | wasn't getting the Pentium because he did scientific
            | computing and was worried about this bug. It's funny that
            | only now, thanks to this thread, do I understand the
            | significance of that.
        
         | berkut wrote:
          | A 256-entry float32 LUT for 8-bit sRGB -> linear conversion is
          | definitely still faster than doing the division live (I re-
          | benchmarked it on Zen4 and Apple M3 last month). However,
          | floating-point division on the newer microarchitectures is not
          | as slow as it was on processors 10 years or so ago, so I can
          | imagine a much larger LUT is not worth it.
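          | 
          | For context, a sketch of that kind of table (standard sRGB
          | transfer function; my own code, not the benchmark's):
          | 
          |   #include <array>
          |   #include <cmath>
          |   #include <cstdint>
          | 
          |   // 256-entry 8-bit sRGB -> linear float table.
          |   static const std::array<float, 256> kSrgbToLinear = [] {
          |       std::array<float, 256> t{};
          |       for (int i = 0; i < 256; ++i) {
          |           float c = i / 255.0f;
          |           t[i] = (c <= 0.04045f)
          |                      ? c / 12.92f
          |                      : std::pow((c + 0.055f) / 1.055f, 2.4f);
          |       }
          |       return t;
          |   }();
          | 
          |   inline float srgbToLinear(uint8_t v) {
          |       return kSrgbToLinear[v];
          |   }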
        
       | Cthulhu_ wrote:
        | Love seeing how the optimization parameters were different back
        | then; that is, size constraints were more important than loading
        | speeds, even though both drives and CPUs were much slower.
        | 
        | Ideally, game companies like this would keep all the original
        | assets and make things like the image format a build switch, for
        | when the parameters change in the future. That said, back then
        | they released on DVD (I'm reading it would've taken 12 CDs
        | otherwise), and I don't believe any higher-capacity storage
        | devices were in the pipeline yet at that point. Then again, hard
        | drives back then were around the 100 GB mark, so a multi-DVD
        | release would've been doable.
        | 
        | Ironically, some games nowadays (like the FFVII Remakes) are on
        | two discs again, an install disc and a play disc, despite those
        | discs holding 50 or 100 GB each.
        
         | tobr wrote:
         | It was released on 2 DVDs as far as I remember.
        
         | Uvix wrote:
         | It was already on two discs. You could choose whether to copy
         | one or both to the hard drive.
         | 
         | The game ran on machines as old as late 1999, so the typical
         | disk size would've been more in the 10-20 GB range. Even the
         | existing 3.4 GB minimum requirement (installing just one disc
         | and streaming content off the other) was a pretty hefty ask.
        
         | hirako2000 wrote:
          | I remember on PS1 it sold with 3* CDs. It was beautiful, as
          | they had to make a special double-edged holding box. It stood
          | nicely aligned with all the other games but could be spotted
          | from afar.
          | 
          | We see physical media as a burden today; that's sad, as they
          | used to be pieces of art.
          | 
          | Edit: 3, not 4.
        
       | mananaysiempre wrote:
       | I've just read both parts of the article and I still feel like
       | I'm left with more questions than answers.
       | 
       | The game is bottlenecked on memcpy so hard it takes two seconds
       | to load each time? On a modern machine with double-digit GB/s RAM
       | bandwidth and single-digit GB/s SSD bandwidth, when the game was
        | released on two DVDs and thus can't have more than a couple dozen
       | GB of assets total[1]. How? OK, they're doing a memcpy per image
       | row, that's not nice and can probably cost you an order of
       | magnitude or so, and the assets are JPEG-compressed so it's
       | another order of magnitude to copy around uncompressed pixels,
       | but still, _how?_
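        | 
        | For illustration (not the game's actual code): a per-row copy is
        | only needed when the row pitches differ; when they match, the
        | whole image can go through one memcpy, which removes the per-row
        | call overhead entirely.
        | 
        |   #include <cstddef>
        |   #include <cstdint>
        |   #include <cstring>
        | 
        |   void copyImage(uint8_t* dst, size_t dstPitch,
        |                  const uint8_t* src, size_t srcPitch,
        |                  size_t rowBytes, size_t rows) {
        |       if (dstPitch == rowBytes && srcPitch == rowBytes) {
        |           std::memcpy(dst, src, rowBytes * rows); // one bulk copy
        |           return;
        |       }
        |       for (size_t y = 0; y < rows; ++y)           // per-row path
        |           std::memcpy(dst + y * dstPitch,
        |                       src + y * srcPitch, rowBytes);
        |   }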
       | 
       | Furthermore, if it really is bottlenecked on memcpy, why does
       | running on a modern machine not improve things? I almost want to
       | think there's a fixed amount of per-frame work hardcoded
       | somewhere, and loading DDS is just accounted for incorrectly.
       | 
       | [1] In fact, a screenshot in part 1 shows data.m4b taking up
       | 1.4GB, and the rest of the files shown are either video, sound,
       | or small.
        
         | mkesper wrote:
          | There is a hardcoded value of 550 ms, but otherwise you're
          | right. I guess it's still bottlenecked because it runs in a
          | 32-bit environment.
        
         | iforgotpassword wrote:
          | It's what the profiling hinted at, at least, but I don't know
         | how much overhead that tool adds per function call, so if you
         | profile a lot of very small/fast functions you basically just
         | measure which function gets called most.
         | 
         | But you should not underestimate the impact of unnecessarily
         | shoving data around in memory even with fast ram. Cpu speed has
         | improved _much much more_ than memory speed over the past
          | decades. If your data layout sucks and you hit L3 or, even
          | worse, actual memory, it's slow as heck relative to L1, or,
          | better still, to no copy at all. And then there's the overhead
          | of the plain
         | function call itself. As this is a 3rd party library you're
         | guaranteed that each call to this wrapped memcpy is an actual
         | call and not inlined.
         | 
         | But in addition to that I'm pretty sure the decoding library
         | used originally isn't nearly as fast as mango.
        
           | comex wrote:
           | If the profiler used is a sampling profiler (and it seems to
           | be), then unlike with instrumentation-based profilers, it
           | doesn't add any function call overhead. It just pauses the
           | program every few ms and records what the call stack is at
           | that point. While this makes the data noisier compared to
           | instrumenting all calls, it also makes the data an unbiased
           | approximation of how the program behaves when not being
           | profiled.
           | 
           | But sampling profilers do still tend to "basically just
           | measure which function gets called most". They can tell a
           | program spent a lot of time in a particular function, but
           | they can't count how many times that function was called - so
           | they can't determine whether it's a slow function, or a fast
           | function that was called many times.
        
       | jobbr wrote:
       | This guy Mysts.
        
       | SideQuark wrote:
        | Unfortunately, the author and the paper he links apply the alpha
        | premultiply to the gamma-compressed image. To be correct, this
        | should be done in a linear colorspace. His solution will give
        | some color edge combinations halos.
        | 
        | Basically, alpha in all formats I've seen is stored linear, but
        | colors are gamma-compressed (sRGB, HDR stuff, etc.). If you apply
        | the alpha premultiply and then linearize, you've misapplied the
        | alpha. If you skip linearizing (as even this author shows), you
        | get immediate black halos, since your blend is effectively
        | multiplying colors, not adding them.
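        | 
        | In code terms, the ordering being argued for looks roughly like
        | this (my own sketch using the standard sRGB transfer functions,
        | not the author's or the paper's code):
        | 
        |   #include <cmath>
        |   #include <cstdint>
        | 
        |   float srgbToLinear(float c) {
        |       return (c <= 0.04045f)
        |                  ? c / 12.92f
        |                  : std::pow((c + 0.055f) / 1.055f, 2.4f);
        |   }
        | 
        |   float linearToSrgb(float c) {
        |       return (c <= 0.0031308f)
        |                  ? c * 12.92f
        |                  : 1.055f * std::pow(c, 1.0f / 2.4f) - 0.055f;
        |   }
        | 
        |   // Premultiply one color channel by its (linearly stored) alpha.
        |   uint8_t premultiplyChannel(uint8_t value, uint8_t alpha) {
        |       float linear = srgbToLinear(value / 255.0f); // decode first
        |       float premul = linear * (alpha / 255.0f);    // then alpha
        |       return uint8_t(linearToSrgb(premul) * 255.0f + 0.5f);
        |   }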
        
       | ada1981 wrote:
       | This was very impressive to read and while I don't have the
        | technical knowledge to do this, it reminded me of "fixing" my
        | mental health when Stanford psychiatrists diagnosed me and said
       | I'd be on pills the rest of my life, incurable.
       | 
        | Years later, after rebuilding my psyche from scratch, I'm happy
        | to report they were wrong.
       | 
       | But striking similarities, where the "professionals" just didn't
       | bother to solve a solvable problem.
        
         | 01HNNWZ0MV43FF wrote:
         | You might feel that your mind is a computer program you can
         | simply rewrite, but I feel that mine is a horse that does not
         | listen to me
         | 
         | Pills work for a lot of things (ADHD for instance) and they're
         | a lot faster than years
        
       | withinrafael wrote:
       | Tangentially related, previous releases of the game were also hit
       | by a DirectInput device enumeration regression that Microsoft
       | (behind the scenes) refused to fix. (I haven't checked the latest
       | re-release.)
       | 
       | https://github.com/riverar/IndirectInput
        
       | brcmthrowaway wrote:
       | What is this weird game?
        
         | evanmoran wrote:
          | Myst is a very old game where you walk by tapping through a
          | series of pre-rendered 3D images. This was back when each image
          | would take minutes to hours to render (like Pixar did with Toy
          | Story 1), so graphically it looked amazing for its time, even
          | though it came out while the Super Nintendo was popular (the
          | 1990s).
        
           | MrDOS wrote:
           | The first game was unimaginably popular after its release in
           | 1993. It quickly became the best-selling computer game of all
           | time, and stayed there for _nine years_ until The Sims
            | unseated it in 2002. It's _still_ the 25th highest-selling
           | computer game ever[0]. Kids these days!
           | 
           | [0]: https://en.wikipedia.org/wiki/List_of_best-
           | selling_PC_games
        
       ___________________________________________________________________
       (page generated 2024-12-13 23:01 UTC)