[HN Gopher] Fixing the Loading in Myst IV: Revelation
___________________________________________________________________
Fixing the Loading in Myst IV: Revelation
Author : davikr
Score : 248 points
Date : 2024-12-10 01:13 UTC (3 days ago)
(HTM) web link (medium.com)
(TXT) w3m dump (medium.com)
| tomcam wrote:
| > the author explains they used a tool called Luke Stackwalker to
| profile the game
|
| Can anyone confirm my memory that Microsoft had a tool called
| Luke Heapwalker in the mid-1980s, and that Lucasfilm demanded
| they change the name?
| keyle wrote:
| I can't find an exact story, but this PDF [1] has a footnote
| about the name being changed (search for Luke).
|
| [1]
| https://libros.metabiblioteca.org/server/api/core/bitstreams...
| [pdf]
| tomcam wrote:
| Many thanks. I love the whole book.
| EDEdDNEdDYFaN wrote:
| Very good read! I love detailed explanations of the "bad"
| original code and the steps taken toward improving it. A lot of
| it comes down to personal preference, and the author did a good
| job of respecting what might have been intentional design
| decisions by making all of their optimizations configurable.
| zetafunction wrote:
| Great read! Though there is an unnecessary double map lookup in
| part 2:
| https://github.com/tomysshadow/M4Revolution/blob/094764c87aa...
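|
| For anyone unfamiliar with the pattern, here is a generic
| illustration (not the code from the linked file) of the double
| lookup versus a single find():
|
|     #include <string>
|     #include <unordered_map>
|
|     // Double lookup: count() walks the map once, operator[] again.
|     int lookup_twice(std::unordered_map<std::string, int>& cache,
|                      const std::string& key) {
|         if (cache.count(key)) {   // first lookup
|             return cache[key];    // second lookup
|         }
|         return -1;
|     }
|
|     // Single lookup: find() returns an iterator we can reuse.
|     int lookup_once(const std::unordered_map<std::string, int>& cache,
|                     const std::string& key) {
|         auto it = cache.find(key);
|         return it != cache.end() ? it->second : -1;
|     }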
| rkagerer wrote:
| This is awesome (and very impressive)!
|
| Two questions:
|
| 1. What tool was used to generate that "Ange Albertini-inspired
| file format diagram"?
|
| 2. Is there an emulator that would make this easy to play under
| Android?
| kubb wrote:
| My manager takes one look at this and asks: so, in the end the
| effort was unsuccessful? No impact? That's OK, let's get you
| refocused on something productive :)
| throwaway284534 wrote:
| I really enjoyed the author's technical deep-dive and approach to
| debugging performance issues. Mild spoilers for anyone who hasn't
| played Riven, but the method for fixing Gehn's faulty linking
| books is a perfect analogy for the author's more counterintuitive
| performance optimizations.
|
| While I don't have a write-up as detailed as this one, I spent a
| month on a similar journey optimizing an animated ASCII art
| rasterizer. What started as an excuse to learn more about browser
| performance became a deep dive into image processing, WebGL, and
| the intricacies of the Canvas API. I'm proud of the results but
| I've annotated the source for a greater mind to squeeze another 5
| or 10 FPS out of the browser.
|
| Maybe it's time to brush up on those WebGL docs again...
|
| - [1] https://asciify.sister.software/
|
| - [2] https://github.com/sister-
| software/asciify/blob/main/Asciify...
| iforgotpassword wrote:
| Great writeup. A typical "this shouldn't be too hard" story with
| yet another surprise around every corner. Seems familiar... :)
|
| One thing I wondered: with that optimized loader library, is it
| even still necessary to do the DXT conversion at all? Sounds
| like mango and pixman could be fast enough already...
| lbj wrote:
| My hat is off to this; I really appreciate how he documented
| every step he took. It's lengthy but definitely worth the read.
| tomovo wrote:
| stb_image didn't get used in the end because some other library
| was faster, but I think the author missed the possibility of
| #defining their own allocator using STBI_MALLOC (which could
| just return a pointer to an existing memory block).
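|
| The hook looks roughly like this. STBI_MALLOC/STBI_REALLOC/
| STBI_FREE are real stb_image configuration macros; the little
| pool below is only an illustration (no bounds checking):
|
|     #include <cstddef>
|     #include <cstring>
|
|     // Illustrative pool handing out slices of one preallocated
|     // block. Real code needs bounds checks and a reset per load.
|     alignas(16) static unsigned char g_block[64 * 1024 * 1024];
|     static std::size_t g_used = 0;
|
|     static void* pool_alloc(std::size_t size) {
|         unsigned char* p = g_block + g_used;
|         g_used += (size + 16 + 15) & ~std::size_t(15);
|         std::memcpy(p, &size, sizeof size);  // stash size in header
|         return p + 16;                       // 16-byte header
|     }
|     static void* pool_realloc(void* old, std::size_t size) {
|         void* p = pool_alloc(size);
|         if (old) {
|             std::size_t old_size = 0;
|             std::memcpy(&old_size,
|                         static_cast<unsigned char*>(old) - 16,
|                         sizeof old_size);
|             std::memcpy(p, old, old_size < size ? old_size : size);
|         }
|         return p;
|     }
|     static void pool_free(void*) {}  // pool is reset elsewhere
|
|     #define STBI_MALLOC(sz)        pool_alloc(sz)
|     #define STBI_REALLOC(p, newsz) pool_realloc(p, newsz)
|     #define STBI_FREE(p)           pool_free(p)
|
|     #define STB_IMAGE_IMPLEMENTATION
|     #include "stb_image.h"  // decode buffers now come from the pool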
| feintruled wrote:
| I think any software engineer can identify with the feeling you
| get on the first run of a solution you're 100% sure has to fix
| the problem, only to find that nothing has changed.
| finnh wrote:
| This has happened to me so many times. Especially in the
| distributed database I work on ... "hmm maybe I need to let the
| experiment run for longer, this data is noisy so it probably
| needs more time to show a trend line".
| ramon156 wrote:
| This reminds me of when I was trying to do Minecraft-style
| chunking in Bevy. Instead of doing the not-so-obvious fix, I
| threw parallelization, compiler optimization, caching, release
| flags, etc. at my project, and nothing made it go faster. I
| could not figure out why it was so slow. Turns out what I was
| doing was so unoptimized that I might as well have loaded the
| whole world every frame.
|
| You live and you learn :)
| danudey wrote:
| Corollary: the relief/anguish when you discover that the reason
| none of your fixes have worked, nor your debugging print
| statements produced output, is that you were editing a different
| copy of the file than the one getting built and run, because you
| moved or renamed something and your editor didn't notice.
| 01HNNWZ0MV43FF wrote:
| Running the Windows build but editing the WSL files
| shiomiru wrote:
| > As any good programmer knows, division is slow, it's a
| serializing instruction, and we want to avoid it as much as
| possible. The favourite programmer tricks to avoid division are
| to use bit shifts (for division by powers of two) or flip it
| into a multiplication -- for example, to multiply by 0.333
| instead of dividing by 3. In this case though, we are dividing by
| the alpha, so we can't know what number we will be dividing by in
| advance.
|
| > However, because the channels are 8-bit, we will only ever be
| dividing numbers from 1 to 255 (yes, some values will be zero --
| but we sure won't be dividing by them then!) That means there are
| only about 65K possible combinations, so we can use another
| classic solution: a lookup table! This is a perfect place to use
| constexpr to bake the array directly into the compiled result.
|
| Interestingly, when I benchmarked this same problem, three
| integer divisions would easily beat the LUT on my computer.
| Maybe because it's easier on the cache? (Or I did something
| wrong.)
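|
| For context, the table the article describes would look roughly
| like this (my own reconstruction of the idea, not the mod's
| actual code; needs C++17, and some compilers may need their
| constexpr step limit raised):
|
|     #include <array>
|     #include <cstdint>
|
|     // Un-premultiply via a baked (alpha, value) -> result table:
|     // 256*256 entries, 64 KiB, built entirely at compile time.
|     constexpr std::array<std::uint8_t, 256 * 256> make_table() {
|         std::array<std::uint8_t, 256 * 256> t{};
|         for (int alpha = 1; alpha < 256; ++alpha) {
|             for (int value = 0; value < 256; ++value) {
|                 int v = value * 255 / alpha;  // the baked division
|                 t[alpha * 256 + value] =
|                     static_cast<std::uint8_t>(v > 255 ? 255 : v);
|             }
|         }
|         return t;  // the alpha == 0 row stays zero
|     }
|
|     constexpr auto kUnpremultiply = make_table();
|
|     // Runtime use: one table load instead of a divide per channel.
|     inline std::uint8_t unpremultiply(std::uint8_t value,
|                                       std::uint8_t alpha) {
|         return kUnpremultiply[alpha * 256 + value];
|     }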
| Palomides wrote:
| I think, on a modern processor, a lookup table of 65K is way
| too big to be worthwhile for something as simple as a division.
|
| Memory access is relatively slow; OP probably should have
| benchmarked it.
| Jerrrry wrote:
| Depends on the use and context. Benchmarking can hint at the
| "truth" that it is faster to just divide, but ultimately the
| benchmark would have to run both implementations through all
| possible inputs to know for sure.
|
| It may clearly show that memoization is slower, until you plop
| it into a benchmark that inlines the division explicitly without
| using the available hardware acceleration.
|
| Division by a constant can be made faster still, so there can be
| optimizations on top of memoization that would beat raw division
| on most processors/runtime environments.
| wongarsu wrote:
| Chances are only a small portion of the lookup table is used.
| It's conceivable that this is almost entirely L1 access.
|
| But I'm also very skeptical that this is actually faster. It
| would have been nice to see some numbers.
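|
| Something along these lines would settle it on a given machine
| (a rough sketch, not a rigorous benchmark; results depend
| heavily on the CPU and on how hot the table stays in cache):
|
|     #include <chrono>
|     #include <cstddef>
|     #include <cstdint>
|     #include <cstdio>
|     #include <vector>
|
|     int main() {
|         // 64 KiB (alpha, value) -> value*255/alpha table.
|         std::vector<std::uint8_t> table(256 * 256);
|         for (int a = 1; a < 256; ++a)
|             for (int v = 0; v < 256; ++v)
|                 table[a * 256 + v] =
|                     static_cast<std::uint8_t>(v * 255 / a);
|
|         // Pseudo-random pixels; alpha is forced odd (never zero).
|         std::vector<std::uint8_t> px(1 << 24), al(1 << 24);
|         for (std::size_t i = 0; i < px.size(); ++i) {
|             px[i] = static_cast<std::uint8_t>(i * 37);
|             al[i] = static_cast<std::uint8_t>(i * 101) | 1;
|         }
|
|         auto run = [&](const char* name, auto&& body) {
|             auto t0 = std::chrono::steady_clock::now();
|             std::uint64_t sink = 0;  // stops the loop being elided
|             for (std::size_t i = 0; i < px.size(); ++i)
|                 sink += body(i);
|             auto t1 = std::chrono::steady_clock::now();
|             auto ms = std::chrono::duration_cast<
|                 std::chrono::milliseconds>(t1 - t0).count();
|             std::printf("%s: %lld ms (sink %llu)\n", name,
|                         (long long)ms, (unsigned long long)sink);
|         };
|
|         run("divide", [&](std::size_t i) {
|             return (px[i] * 255 / al[i]) & 0xff;  // match 8-bit table
|         });
|         run("lookup", [&](std::size_t i) {
|             return table[al[i] * 256 + px[i]];
|         });
|     }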
| Palomides wrote:
| I think an L2 cache hit is about the same number of cycles as
| a single byte divide, so you might as well save the cache for
| something else.
|
| But yeah, this is just my speculation.
| wat10000 wrote:
| It also depends a lot on your CPU. Operations can be made
| faster by dedicating more resources to them. Most CPUs go for a
| slowish division operation that isn't too expensive in terms of
| resources, or lower-end ones don't provide it at all and make
| you do it in software. Apple seems to have decided that integer
| division is pretty important, so they dedicated a lot of
| resources to it and the M1 does a divide in something like two
| cycles. Even thinking about loading from memory is probably
| slower than that.
| UniverseHacker wrote:
| Instead of compiling and running code, just replace all of your
| code with composable functions that are each small enough to
| look up the results for in a precomputed hardware dictionary.
| In this way you have a Turing-complete CPU with only one
| instruction.
|
| I started this comment as a tongue-in-cheek satire of yours,
| but now I'm honestly wondering if it could be a viable idea for
| a radically simplified CPU (I'm not a computer engineer). I
| suppose the lookup tables rapidly become too large, possibly
| before they are useful?
| wongarsu wrote:
| For an 8-bit CPU you could maybe do it. For a 64-bit CPU, the
| lookup table for addition alone would be massive. (You can of
| course do addition in smaller increments, like adding one digit
| at a time and keeping track of the carry, but then you just have
| a normal full adder.)
|
| The biggest issue is that this CPU would be conceptually simple
| but very difficult to make fast. Memory is slow, and accessing a
| 64K lookup table uses more transistors than just doing the
| addition.
| UniverseHacker wrote:
| Again, I'm no expert here, but I find this stuff fascinating.
| Could the simplicity possibly allow some radically different
| CPU design that makes the lookups alone nearly instantaneous?
| I could imagine some kind of optical/photonic physical 3D hash
| table structure, even one where the same physical structure
| could support a large number of read operations in parallel if
| pre-arranged by a compiler to not physically interfere. I
| imagine a CPU with only one instruction could be physically
| minuscule, and therefore pack a lot of cores into a small
| space.
|
| Hypothetically, if it were orders of magnitude faster than a
| normal CPU, one could still perform rapid computation on larger
| numbers as needed while keeping the CPU physically 8-bit, yet
| revert to even higher performance when less precision is
| needed.
| spencerflem wrote:
| If you're interested in One Instruction Computers, check
| out: https://en.m.wikipedia.org/wiki/One-
| instruction_set_computer
| UniverseHacker wrote:
| Thanks! I half expected to be told one instruction
| computers were impossible.
| WorldMaker wrote:
| Just build 4 of them and chain them together to get
| 64-bits.
|
| (Beyond just being a joke, it is often how things like this
| get scaled. 1-Bit Adders become 2-Bits with a Carry Line
| between, then 3-bits with another Carry Line, and so forth,
| ad infinitum and beyond. The real joke is the complexity
| involved in however you "just" chain them together.)
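|
| As a toy illustration of the chaining (my own sketch, nothing
| from the article): an 8-bit add-with-carry done purely by table
| lookup, rippled four times to make a 32-bit adder.
|
|     #include <array>
|     #include <cstdint>
|     #include <cstdio>
|
|     // (carry_in, a, b) -> 9-bit sum; 256*256*2 entries, 256 KiB.
|     static std::array<std::uint16_t, 256 * 256 * 2> make_add_table() {
|         std::array<std::uint16_t, 256 * 256 * 2> t{};
|         for (int cin = 0; cin < 2; ++cin)
|             for (int a = 0; a < 256; ++a)
|                 for (int b = 0; b < 256; ++b)
|                     t[(cin << 16) | (a << 8) | b] =
|                         static_cast<std::uint16_t>(a + b + cin);
|         return t;
|     }
|
|     static const auto kAdd = make_add_table();
|
|     std::uint32_t add32_by_lookup(std::uint32_t x, std::uint32_t y) {
|         std::uint32_t result = 0, carry = 0;
|         for (int byte = 0; byte < 4; ++byte) {
|             std::uint32_t a = (x >> (8 * byte)) & 0xff;
|             std::uint32_t b = (y >> (8 * byte)) & 0xff;
|             std::uint16_t s = kAdd[(carry << 16) | (a << 8) | b];
|             result |= std::uint32_t(s & 0xff) << (8 * byte);
|             carry = s >> 8;  // the "carry line" to the next stage
|         }
|         return result;
|     }
|
|     int main() {
|         std::printf("%u\n", add32_by_lookup(3000000000u, 1000000000u));
|     }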
| spookie wrote:
| CPUs do use a ton of LUTs under the hood.
| UniverseHacker wrote:
| I wonder if it was inspired by this discussion, but this just
| made the front page: I didn't realize the famous Pentium FDIV
| bug was due to an error in a lookup table used for division.
| https://news.ycombinator.com/item?id=42391079
| spencerflem wrote:
| I've been (very slowly) working on this for a 4-bit CPU. Not
| for any practical reason, just for fun.
| UniverseHacker wrote:
| Any more details you could share? Is this a physical cpu
| design, or a VM/assembler?
| fifilura wrote:
| There are npm packages for those things, but then you'll have
| to start using JavaScript...
|
| https://www.npmjs.com/package/@samuelmarina/is-even
|
| The JavaScript crowd. They are always one step ahead!
| UniverseHacker wrote:
| Haha, a 100 MB library. It doesn't say so, but is this really
| a dictionary of numbers with even/odd as values? I love that
| they have an entirely separate is-odd package.
|
| //edit: looked at the code; it's literally 100 MB of if/else
| statements, and a bunch of them are wrong! The GitHub pull
| requests to add additional integers are hilarious.
| bluGill wrote:
| With all the discussion of lookup tables here, it's worth
| remembering https://en.wikipedia.org/wiki/Pentium_FDIV_bug from
| the mid-1990s.
| UniverseHacker wrote:
| I was a little kid, about the age my son is now, when I went
| with my dad to buy a 486 CPU and motherboard... he said he
| wasn't getting the Pentium because he did scientific computing
| and was worried about this bug. It's funny that only now,
| thanks to this thread, do I understand the significance of
| that.
| berkut wrote:
| A 256-entry float32 LUT for 8-bit sRGB -> linear conversion is
| definitely still faster than doing the division live (I re-
| benchmarked it on Zen4 and Apple M3 last month). However,
| floating-point division on the newer microarchitectures is not
| as slow as it was on processors 10 or so years ago, so I can
| imagine a much larger LUT not being worth it.
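|
| For reference, the kind of table I mean (standard sRGB decode
| curve; this is just my sketch, not code from the article):
|
|     #include <array>
|     #include <cmath>
|
|     // 256 floats, built once; indexed directly by the 8-bit value.
|     static std::array<float, 256> make_srgb_to_linear() {
|         std::array<float, 256> t{};
|         for (int i = 0; i < 256; ++i) {
|             float c = i / 255.0f;
|             t[i] = c <= 0.04045f
|                        ? c / 12.92f
|                        : std::pow((c + 0.055f) / 1.055f, 2.4f);
|         }
|         return t;
|     }
|
|     static const auto kSrgbToLinear = make_srgb_to_linear();
|
|     inline float srgb_to_linear(unsigned char v) {
|         return kSrgbToLinear[v];
|     }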
| Cthulhu_ wrote:
| I love seeing how the optimization parameters were different
| back then, that is, size constraints mattered more than loading
| speed, even though both drives and CPUs were much slower.
|
| Ideally, game companies like this would keep all the original
| assets and make things like the image format a build switch,
| for when the parameters change in the future. That said, back
| then they released on DVD (I'm reading it would've taken 12 CDs
| otherwise), and I don't believe any higher-capacity storage
| devices were in the pipeline yet at that point. Then again,
| hard drives at the time were around the 100 GB mark, so a
| multi-DVD release would've been doable.
|
| Ironically, some games nowadays (like the FFVII Remakes) are on
| two discs again, an install disc and a run disc, despite discs
| now holding 50 or 100 GB.
| tobr wrote:
| It was released on 2 DVDs as far as I remember.
| Uvix wrote:
| It was already on two discs. You could choose whether to copy
| one or both to the hard drive.
|
| The game ran on machines as old as late 1999, so the typical
| disk size would've been more in the 10-20 GB range. Even the
| existing 3.4 GB minimum requirement (installing just one disc
| and streaming content off the other) was a pretty hefty ask.
| hirako2000 wrote:
| I remember on PS1 it sold with 3* CDs. It was beautiful: they
| had to make a special double-thickness case to hold them. It
| stood nicely aligned with all the other games but could be
| spotted from afar.
|
| We see physical media as a burden today, which is sad, as it
| used to be a piece of art.
|
| Edit: 3, not 4.
| mananaysiempre wrote:
| I've just read both parts of the article and I still feel like
| I'm left with more questions than answers.
|
| The game is bottlenecked on memcpy so hard that it takes two
| seconds to load each time? On a modern machine with double-digit
| GB/s RAM bandwidth and single-digit GB/s SSD bandwidth, when the
| game was released on two DVDs and thus can't have more than a
| couple dozen GB of assets total[1]. How? OK, they're doing a
| memcpy per image row, which isn't nice and can probably cost you
| an order of magnitude or so, and the assets are JPEG-compressed,
| so it's another order of magnitude to copy around uncompressed
| pixels, but still, _how?_
|
| Furthermore, if it really is bottlenecked on memcpy, why does
| running on a modern machine not improve things? I almost want to
| think there's a fixed amount of per-frame work hardcoded
| somewhere, and loading DDS is just accounted for incorrectly.
|
| [1] In fact, a screenshot in part 1 shows data.m4b taking up
| 1.4GB, and the rest of the files shown are either video, sound,
| or small.
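|
| For anyone wondering what "a memcpy per image row" looks like,
| here's a generic illustration (not the game's actual code): when
| the source and destination pitches match the row size, the whole
| image can go in a single copy instead.
|
|     #include <cstddef>
|     #include <cstring>
|
|     void copy_image(unsigned char* dst, std::size_t dst_pitch,
|                     const unsigned char* src, std::size_t src_pitch,
|                     std::size_t row_bytes, std::size_t rows) {
|         if (dst_pitch == row_bytes && src_pitch == row_bytes) {
|             std::memcpy(dst, src, row_bytes * rows);  // one big copy
|             return;
|         }
|         for (std::size_t y = 0; y < rows; ++y)  // padded rows
|             std::memcpy(dst + y * dst_pitch,
|                         src + y * src_pitch, row_bytes);
|     }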
| mkesper wrote:
| There is a hardcoded value of 550 ms, but otherwise you're
| right. I guess it's still bottlenecked because it runs in a
| 32-bit environment.
| iforgotpassword wrote:
| It's what the profiling hinted at, at least, but I don't know
| how much overhead that tool adds per function call; if you
| profile a lot of very small/fast functions, you basically just
| measure which function gets called most.
|
| But you should not underestimate the impact of unnecessarily
| shoving data around in memory, even with fast RAM. CPU speed
| has improved _much much more_ than memory speed over the past
| decades. If your data layout sucks and you hit L3 or, even
| worse, actual memory, it's slow as heck relative to L1, or
| better yet no copy at all. And then there's the overhead of the
| plain function call itself: as this is a third-party library,
| you're guaranteed that each call to this wrapped memcpy is an
| actual call and not inlined.
|
| But in addition to that, I'm pretty sure the decoding library
| used originally isn't nearly as fast as mango.
| comex wrote:
| If the profiler used is a sampling profiler (and it seems to
| be), then unlike with instrumentation-based profilers, it
| doesn't add any function call overhead. It just pauses the
| program every few ms and records what the call stack is at
| that point. While this makes the data noisier compared to
| instrumenting all calls, it also makes the data an unbiased
| approximation of how the program behaves when not being
| profiled.
|
| But sampling profilers do still tend to "basically just
| measure which function gets called most". They can tell a
| program spent a lot of time in a particular function, but
| they can't count how many times that function was called - so
| they can't determine whether it's a slow function, or a fast
| function that was called many times.
| jobbr wrote:
| This guy Mysts.
| SideQuark wrote:
| Unfortunately, the author and the paper he links apply the alpha
| premultiply to the gamma-compressed image. To be correct, this
| should be done in a linear colorspace. His solution will give
| some color/edge combinations halos.
|
| Basically, alpha in all formats I've seen is stored linear, but
| colors are gamma-compressed (sRGB, HDR formats, etc.). If you
| apply the alpha premultiply and then linearize, you've
| misapplied alpha. If you skip linearizing (as even this author
| shows), you get immediate black halos, since your blend is
| effectively multiplying colors, not adding them.
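|
| One way to keep the multiply in linear space while still storing
| 8-bit sRGB (standard sRGB transfer functions; this is just my
| sketch, not code from the article):
|
|     #include <cmath>
|
|     static float srgb_to_linear(float c) {
|         return c <= 0.04045f
|                    ? c / 12.92f
|                    : std::pow((c + 0.055f) / 1.055f, 2.4f);
|     }
|     static float linear_to_srgb(float c) {
|         return c <= 0.0031308f
|                    ? c * 12.92f
|                    : 1.055f * std::pow(c, 1.0f / 2.4f) - 0.055f;
|     }
|
|     // 8-bit sRGB channel + 8-bit (linear) alpha -> premultiplied
|     // 8-bit sRGB channel. The multiply happens on linear values.
|     unsigned char premultiply(unsigned char srgb, unsigned char a) {
|         float lin = srgb_to_linear(srgb / 255.0f);
|         float pre = lin * (a / 255.0f);  // alpha is already linear
|         return static_cast<unsigned char>(
|             linear_to_srgb(pre) * 255.0f + 0.5f);
|     }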
| ada1981 wrote:
| This was very impressive to read, and while I don't have the
| technical knowledge to do this, it reminded me of "fixing" my
| mental health after Stanford psychiatrists diagnosed me and said
| I'd be on pills for the rest of my life, incurable.
|
| Years later, after rebuilding my psyche from scratch, happy to
| report they were wrong.
|
| But striking similarities, where the "professionals" just didn't
| bother to solve a solvable problem.
| 01HNNWZ0MV43FF wrote:
| You might feel that your mind is a computer program you can
| simply rewrite, but I feel that mine is a horse that does not
| listen to me
|
| Pills work for a lot of things (ADHD for instance) and they're
| a lot faster than years
| withinrafael wrote:
| Tangentially related, previous releases of the game were also hit
| by a DirectInput device enumeration regression that Microsoft
| (behind the scenes) refused to fix. (I haven't checked the latest
| re-release.)
|
| https://github.com/riverar/IndirectInput
| brcmthrowaway wrote:
| What is this weird game?
| evanmoran wrote:
| Myst is a very old game where you walk by tapping through a
| series of pre-rendered 3D images. This was back when each image
| would take minutes to hours to render (like Pixar did with Toy
| Story 1), so graphically it looks amazing for its time, but it
| actually existed at the same time the Super Nintendo was popular
| (the 1990s).
| MrDOS wrote:
| The first game was unimaginably popular after its release in
| 1993. It quickly became the best-selling computer game of all
| time, and stayed there for _nine years_ until The Sims
| unseated it in 2002. It's _still_ the 25th highest-selling
| computer game ever[0]. Kids these days!
|
| [0]: https://en.wikipedia.org/wiki/List_of_best-
| selling_PC_games
___________________________________________________________________
(page generated 2024-12-13 23:01 UTC)