https://fabiensanglard.net/fastdoom/index.html
FABIEN SANGLARD'S WEBSITE
---------------------------------------------------------------------
Mar 04, 2025
Why fastDOOM is fast
---------------------------------------------------------------------
During the winter of 2024, I restored an IBM PS/1 486-DX2 66MHz,
"Mini-Tower", model 2168. It was the computer I always wanted as a
teenager but could never afford. Words cannot do justice to the joy I
felt while working on this machine.
[2168]
As soon as I got something able to boot, I benchmarked the one piece
of software I wanted to run.
C:\DOOM>doom.exe -timedemo demo1
timed 1710 gametics in 2783 realtics
Doom doesn't give the fps right away. You have to do a bit of math to
get the framerate. In this instance, that's 1710/2783*35 = 21.5 fps.
An honorable performance for the best machine money could
(reasonably) buy in Dec 1993 (specs, chipset, video, disk1, disk2,
speedsys).
I was resigned to playing under Ibuprofen until I heard of fastDOOM.
I am usually not a fan of ports because they tend to add features
without cohesion (except for the dreamy Chocolate DOOM) but I gave it
a try out of curiosity.
C:\DOOM>fdoom.exe -timedemo demo1
Timed 1710 gametics in 1988 realtics. FPS: 30.1
40% faster without cutting any features^[1]! On a demanding map like
doom2's demo1, the gain is even higher, from 16.8 fps to 24.9 fps.
That is 48% faster!
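The arithmetic behind these numbers can be sketched in a few lines. This
is a minimal illustration, not code from DOOM or fastDOOM: the function
names are made up, and it assumes (as the formula above implies) that a
timedemo renders one frame per game tic while realtics counts elapsed
time in 1/35 s units.

```python
TICRATE = 35  # DOOM game tics per second


def timedemo_fps(gametics: int, realtics: int) -> float:
    """Average frames per second of a -timedemo run."""
    return gametics / realtics * TICRATE


def speedup_percent(old_fps: float, new_fps: float) -> float:
    """How much faster new_fps is than old_fps, in percent."""
    return (new_fps / old_fps - 1.0) * 100.0


print(f"DOOM.EXE  demo1: {timedemo_fps(1710, 2783):.1f} fps")
print(f"FDOOM.EXE demo1: {timedemo_fps(1710, 1988):.1f} fps")
print(f"doom2 demo1    : {speedup_percent(16.8, 24.9):.0f}% faster")
```

The same speedup_percent helper can be applied to any pair of fps
figures measured on the same demo.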
I did not suspect that DOOM had left that much on the table.
Obviously shipping within one year left little time to optimize. I
had to understand how this magic trick happened.
A byte of history
---------------------------------------------------------------------
Before digging into fastDOOM, let's understand where the code comes
from. DOOM was originally developed on NeXT workstations. The game was
structured to be easy to port with most of the code in a core
surrounded by small sub-systems performing I/O.
[doom_arch]
Source: Game Engine Black Book: DOOM
During development, the DOS I/O layer was written by id Software. This became
the commercial release of DOOM. But that version could not be open
sourced in 1997 because it relied on a proprietary sound library
called DMX.
What ended up being open sourced was the Linux version, cleaned up by
Bernd Kreimeier when he was working on a book project to explain the
engine.
A DOS version of DOOM was reconstructed by using the Linux version's
core, Heretic I/O, and APODMX (Apogee Sound wrapper) to emulate DMX.
Because Heretic used video mode 13h while DOOM used video mode Y, the
graphic I/O (i_ibm.c) was reverse-engineered from DOOM.EXE
disassembly. That is how the community got PCDOOM v2^[2].
fastDOOM's starting point was PCDOOM v2.
                  +---------------+
                  | NeXTStep DOOM |
                  +-----+----+----+
                        |    |
                        |    |
                        |    |
 +------------+         |    |  +------+      +---------+
 | Linux DOOM |<--------+    +->| DOOM +----->| Heretic |
 +------+-----+                 +------+      +----+----+
        |                          [?]             |
        |                           V              |
        |                     +----------+         |
        +-------------------->| PCDOOMv2 |<--------+
                              +-----+----+
                                    V
                              +----------+
                              | fastDOOM |
 fastDoom genealogy           +----------+
 ------------------
The big performance picture
---------------------------------------------------------------------
Victor "Viti95" Nieto wrote release notes to describe the
performance improvement of each version but he seemed more interested
in making FDOOM.EXE awesome than detailing how he did it.
To get the big picture of performance evolution over time, I
downloaded all 52 releases of fastDOOM, PCDOOMv2, and the original
DOOM.EXE, wrote a Go program to generate a RUN.BAT running -timedemo
demo1 on all of them, and mounted it all with mTCP's NETDRIVE.
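A batch-file generator of this kind can be very small. The sketch below
is not the original Go program (which is not shown here); it is a Python
illustration in which the drive letter, the directory names, and the
make_run_bat helper are all hypothetical. Only the -timedemo demo1
invocation comes from the runs shown earlier. How the results are
collected is left out.

```python
def make_run_bat(builds, demo="demo1"):
    """Return the text of a DOS batch file that timedemos every build.

    builds is a list of (directory, executable) pairs. Both are
    placeholders chosen by the caller, not real fastDOOM paths.
    """
    lines = ["@ECHO OFF"]
    for directory, exe in builds:
        lines.append(f"CD {directory}")
        lines.append(f"{exe} -timedemo {demo}")
    # DOS expects CR/LF line endings.
    return "\r\n".join(lines) + "\r\n"


# Hypothetical layout on a network drive N:
builds = [
    (r"N:\DOOM", "DOOM.EXE"),
    (r"N:\FD01", "FDOOM.EXE"),
    (r"N:\FD06", "FDOOM.EXE"),
]
print(make_run_bat(builds), end="")
```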
I chose to timedemo DOOM.WAD with sound on and screen size = 10
(fullscreen with status bar). After several hours of shotguns and
imps agony, I had run the whole suite five times and graphed the
average fps with chart.js.
The first thing this graph rules out is that fastDOOM's improvements
were mostly due to using a modern compiler. PCDOOMv2 is built with
OpenWatcom 2 but only gets a marginal improvement over DOOM.EXE.
git archeology
---------------------------------------------------------------------
On top of releasing often, Viti95 displayed outstanding git
discipline, where one commit does one thing and each release is
tagged. fastDOOM's git history is made of 3,042 commits, which makes
it possible to benchmark each feature.
I wrote another Go program to build every single commit. I will pass
on the gory details of handling the many build system changes
(especially from DOS to Linux). After an hour I had the ugliest
program I ever wrote and 3,042 DOOM.EXE. I was pleased to see the
build was almost never broken.
Graphing the file sizes shows that the early effort was to be lean by
cleaning and deleting code. There are major drops with bf0e983 (build
239 where sound recording was removed), 5f38323 (build 340 where
error code strings were deleted), and 8b9cac5 (build 1105 where TASM
was replaced with NASM).
Going deep
---------------------------------------------------------------------
Timedemoing all builds would have taken a very long time (3,042 builds
x 1.5 min / 60 / 24 x 3 passes = about 9.5 days) so I focused on the
releases where most of the speed was gained. I wrote yet another Go
program to generate a .BAT file running timedemo for all commits in
v0.1, v0.6, v0.8, v0.9.2, and v0.9.7. I mounted 1.4 GiB of FDOOM.EXE
with mTCP and ran it. It took a while: a release with 200+ commits
needed about 8 hours per pass.
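The back-of-the-envelope estimate for a full sweep can be reproduced as
follows. This is only a sketch: full_sweep_days is a made-up helper, and
it assumes the 1.5 figure is minutes per timedemo run, which is what the
division by 60 and 24 implies.

```python
def full_sweep_days(builds: int, minutes_per_run: float, passes: int) -> float:
    """Days needed to timedemo every build, for a given number of passes."""
    hours = builds * minutes_per_run / 60
    return hours / 24 * passes


print(f"{full_sweep_days(3042, 1.5, 3):.1f} days")
```

That comes out to a bit over nine days of uninterrupted benchmarking,
hence the decision to sample only a handful of releases.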
fastDOOM v0.1
---------------------------------------------------------------------
This release featured 220 commits.
$ git log --reverse --oneline "0.1" | wc -l
220
The MVP patch of v0.1 is without a doubt build 36 (e16bab8). The
"Crispy optimization" turns status bar percentage rendering into a
no-op when the values have not changed. This avoids rendering to a
scrap buffer and blitting to the screen, for a total boost of 2 fps.
At first I could not believe it. I assumed my toolchain had a bug. But
cherry-picking this patch on PCDOOMv2 confirmed the tremendous speed
gain.
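The idea of skipping a redraw when nothing changed can be sketched as
below. This is not the code from commit e16bab8; PercentWidget and its
methods are hypothetical names, and _draw merely counts calls where the
real engine would render to a scrap buffer and blit to the screen.

```python
class PercentWidget:
    """Hypothetical status bar widget that remembers what it last drew."""

    def __init__(self):
        self.last_drawn = None  # nothing on screen yet
        self.draw_calls = 0     # how often the expensive path ran

    def _draw(self, value):
        # Stand-in for rendering to a scrap buffer and blitting it.
        self.draw_calls += 1

    def update(self, value):
        """Redraw only if the value differs from what is on screen."""
        if value == self.last_drawn:
            return False        # no-op: the screen is already correct
        self._draw(value)
        self.last_drawn = value
        return True


health = PercentWidget()
# 1710 tics during which the (made-up) health value changes twice.
for tic in range(1710):
    value = 100 if tic < 500 else 85 if tic < 1200 else 60
    health.update(value)
print(health.draw_calls)  # 3 draws instead of 1710
```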
Next is build 167 (a9359d5) which inlines FixedDiv via macro.
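For readers unfamiliar with FixedDiv, here is a sketch of the arithmetic
it performs on DOOM's 16.16 fixed-point numbers. It illustrates the
operation only: Python has no macros, so it cannot show the inlining
itself, and it is not the fastDOOM macro. The overflow guard is an
assumption modeled on the check in the open-sourced DOOM code, and
fixed_div is a made-up name.

```python
FRACBITS = 16
FRACUNIT = 1 << FRACBITS   # 1.0 in 16.16 fixed point
MAXINT = 0x7FFFFFFF
MININT = -0x80000000


def fixed_div(a: int, b: int) -> int:
    """Divide two 16.16 fixed-point values, saturating on overflow."""
    # Assumed guard: if the quotient cannot fit in 32 bits (this also
    # covers b == 0), saturate instead of dividing.
    if (abs(a) >> 14) >= abs(b):
        return MININT if (a ^ b) < 0 else MAXINT
    # Pre-shift the dividend so the quotient keeps 16 fractional bits,
    # truncating toward zero like C integer division.
    quotient = (abs(a) << FRACBITS) // abs(b)
    return -quotient if (a ^ b) < 0 else quotient


print(fixed_div(3 * FRACUNIT, 2 * FRACUNIT) / FRACUNIT)  # 1.5
```

In C, turning such a small function into a macro removes the call
overhead at each call site, which matters for a routine used this often
by the renderer.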
Near the end, we see a series of optimizations granting 0.5 fps.
Build 207 (9bd3f20): A PSX Doom optimization which improves the way
the BSP is traversed.
Build 212 (dc0f48e): "Inlined R_MakeSpans", which renders horizontal
surfaces.
Overall this version saw a lot of code being deleted (about 45% of
commits were deletions) which probably helped to cuddle the 486
cachelines of my machine.
$ git log --reverse --oneline "0.1" | grep -i -E "remove|delete" | wc -l
100
Somehow, one of my patches made it to fastDOOM. Probably when I was
writing the Black Book? I have zero recollection of writing this!
fastDOOM v0.6
---------------------------------------------------------------------
This release featured 33 commits.
$ git log --reverse --oneline "0.5"^.."0.6" | wc -l
33
Among many small optimizations (hello GbaDOOM in build 341) are the
MVP ones.
Build 342 (22819fd): Skips rendering unneeded visplanes.
Build 359 (40e0d4b): Removes a level of player pointer indirection.
Build 360 (ccd296f): Doubles down on indirection removal.
Build 369 (f29e665): Inlines the screenspace line splitter.
fastDOOM v0.8
---------------------------------------------------------------------
This release featured 282 commits.
$ git log --reverse --oneline "0.7"^.."0.8" | wc -l
282
The sound system was a bit unstable so I had to timedemo without
sound and then normalize the fps. Moreover, the focus of v0.8 seems to
have been the text-mode renderers, so two regressions happened at
Build 670 (a92c67f) and Build 730 (c3f5f50) where the Crispy
optimization went away.
MVPs:
Build 792 (f279b7d): One executable per renderer (FDOOM.EXE,
FDOOM13H.EXE, and so on).
Build 793 (1874ee8): Disable debugging for compiler.
Build 796 (6aae724): Bring back Crispy opt.
Build 794 (1366ebf): Compile less code whenever possible.
fastDOOM v0.9.2
---------------------------------------------------------------------
This release featured 110 commits.
$ git log --reverse --oneline "0.9.1"^.."0.9.2" | wc -l
110
MVPs:
Build 1639 (ae2a951): Optimize skyflatnum comparison.
Build 1645 (0730cdc): Optimize R_DrawColumn for Mode Y.
Build 1646 (17c9e83): Cleanup R_DrawSpan code.
fastDOOM v0.9.7
---------------------------------------------------------------------
This release featured 293 commits.
$ git log --reverse --oneline "0.9.6"^.."0.9.7" | wc -l
294
Despite running the benchmark several times, I was unable to reduce
the noise of this release.
MVPs:
Build 1941 (0688235): Testing x86 ASM changes.
Build 1943 (f326e73): Add CPU selection + CR2 optimization for 386SX.
Build 1944 (a836abb): Add ESP optimization for R_DrawSpan386SX.
Build 2000 (3432590): Add basecode for rendering fuzz columns in ASM.
Build 2031 (0edab46): Remove a CMP comparison each loop (Ken
Silverman's optimization?).
Mode 13h vs Mode Y
---------------------------------------------------------------------
fastDOOM explored many ways to make things faster, for a broad range
of CPUs (386, 486, Pentium, Cyrix) and video buses (ISA, VLB, PCI).
One optimization that did not work on my machine was to use video
mode 13h instead of mode Y.
In mode 13h, dispatching data toward the four VRAM banks of the VGA
is done in hardware. To the CPU, the VRAM appears as a single linear
320x200 framebuffer. The inconvenience is that you can't double
buffer in VRAM so you have to do it in RAM, which means bytes are
written twice: first into the framebuffer in RAM, then a second time
when sent to VRAM. Also, the engine must block on VSYNC.
Mode 13h
--------          RAM                  VRAM (VGA card)                 SCREEN
         +-------------------+      +-------------------+       +-------------------+
         | +---------------+ |      |                   |       |                   |
         | | framebuffer 1 | |      |                   |       |                   |
         | +---------------+ |      |                   |       |                   |
         | +---------------+ |      | +---------------+ |       |                   |
CPU ---->| | framebuffer 2 | +----> | |framebuffer(fb)| +------>|                   |
         | +---------------+ |      | +---------------+ |       |                   |
         | +---------------+ |      |                   |       |                   |
         | | framebuffer 3 | |      |                   |       |                   |
         | +---------------+ |      |                   |       |                   |
         +-------------------+      +-------------------+       +-------------------+
Mode Y lets programmers access the VGA banks individually. This
allows triple-buffering in VRAM. Moreover, it has the advantage of
writing bytes once, directly into VRAM. The target bank must be
manually selected by the developer via very slow OUT instructions, but
that allows duplicating pixels horizontally (which gives low-detail
mode for free) by writing to two VGA banks at once via latches^[3].
Another inconvenience is that it makes drawing the invisible Spectre
much slower since it requires reading back from VRAM.
Mode Y
-------                                VRAM (VGA card)                 SCREEN
                                    +-------------------+       +-------------------+
                                    | +---------------+ |       |                   |
                                    | |fb1 | fb2 | fb3| |       |                   |
                                    | +---------------+ |       |                   |
                                    | +---------------+ |       |                   |
                                    | |fb1 | fb2 | fb3| |       |                   |
CPU ------------------------------> | +---------------+ +------>|                   |
                                    | +---------------+ |       |                   |
                                    | |fb1 | fb2 | fb3| |       |                   |
                                    | +---------------+ |       |                   |
                                    | +---------------+ |       |                   |
                                    | |fb1 | fb2 | fb3| |       |                   |
                                    | +---------------+ |       |                   |
                                    +-------------------+       +-------------------+
For machines with a fast CPU and bus (100+ MHz Pentium and VLB/PCI),
where video cards are less likely to handle OUT instructions well,
mode 13h is better. For "slow CPUs", it is faster to write data once
to VRAM via mode Y.
Anyway, Doom used mode Y.
DOOM uses 320*200*256 VGA mode, which is slightly different from
MCGA mode (it would NOT run on an MCGA equiped machine). I access
the frame buffer in an interleaved planar mode similar to Michael
Abrash's "Mode X", but still at 200 scan lines instead of 240
(less pixels == faster update rate).
DOOM cycles between three display pages. If only two were used,
it would have to sync to the VBL to avoid possible display
flicker. If you look carefully at a HOM effect, you should see
three distinct images being cycled between.
- John Carmack^[4] (mirror)
Another reason John gave me for using Mode-Y back in the day is that
the tools used by the graphic team (Deluxe Paint) only supported
320x200 (whereas Mode-X is 320x240).
e...@agora.rdrop.com (Ed Hurtley) wrote:
>Check, please... In case you haven't hit ESC ever, the Options menu
>has a Low/High resolution toggle... Low is 320x200, High is
>640x400, with the border graphics (the score bar, menu, etc...) are
>still 320x200... (Just the same graphics files)
Low detail is 160*200 in the view screen. This is done by setting
two bits in the mapmask register whenever the texturing functions
are writing to video memory, causing two pixels to be set for
each byte written.
ui...@freenet.Victoria.BC.CA (Ben Morris) wrote:
>John,
>You're using a planar graphics system for a bitmapped game that
>updates the entire screen at a respectable framrate on a 486/66?
Its planar, but not bit planar (THAT would stink). Pixels 0,4,8
are in plane 0, pixels 1,5,9 are in plane 1, etc.
>That's pretty incredible. I would have thought all the overhead
>for programming the VGA registers would kill that possibility.
The registers don't need to be programed all that much. The map
mask register only needs to be set once for each vertical column,
and four times for each horizontal row (I step by four pixels in
the inner loop to stay on the same plane, then increment the
start pixel and move to the next plane).
It is still a lot of grief, and it polutes the program quite a
bit, but texture mapping directly to the video memory gives you a
fair amount of extra speed (10% - 15%) on most video cards
because the video writes are interleaved with main memory
accesses and texture calculations, giving the write time to
complete without stalling.
Going to that trouble also gets a perfect page flip, rather than
the tearing you get with main memory buffering.
- John Carmack^[5] (mirror)
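The plane layout described in these posts can be sketched as follows.
This is an illustration only, not code from DOOM: the function names are
made up, and the low-detail mask (two adjacent planes selected per
160-wide column) is an assumption consistent with "setting two bits in
the mapmask register".

```python
SCREEN_WIDTH = 320
PLANES = 4
BYTES_PER_ROW = SCREEN_WIDTH // PLANES  # 80 bytes per scan line per plane


def plane_and_offset(x: int, y: int):
    """Plane index and byte offset within that plane for pixel (x, y)."""
    return x & 3, y * BYTES_PER_ROW + (x >> 2)


def map_mask(x: int) -> int:
    """Map mask value selecting the single plane that holds column x."""
    return 1 << (x & 3)


def low_detail_map_mask(x: int) -> int:
    """Assumed low-detail mask: column x of a 160-wide view selects two
    adjacent planes, so one byte written lights two pixels."""
    return 0b0011 << ((x & 1) * 2)


for x in (0, 1, 4, 5, 8, 9):
    print(x, plane_and_offset(x, 0))
print(bin(low_detail_map_mask(0)), bin(low_detail_map_mask(1)))
```

Pixels 0, 4 and 8 land in plane 0 at offsets 0, 1 and 2, while pixels
1, 5 and 9 land in plane 1, matching the description above. Drawing a
vertical column only needs one map mask write because every pixel of
the column lives in the same plane.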
Heretic was released in 1994. Hardware had evolved to make mode
13h^[6]^[7] more appealing so Raven modified the DOOM engine to this
effect. PCDOOM v2 used Heretic I/O but re-implemented the video I/O
with mode Y. Finally, fastDOOM gives users the choice by providing
several executables: FDOOM.EXE, FDOOM13H.EXE, and FDOOMVBD.EXE.
The DOOM press release beta (October '93) used Mode 13h, so I
assume they switched to Mode Y to improve performance on slower
machines (low-detail). I wonder why they didn't also implement
the so-called "potato mode", which writes four pixels with a
single 8-bit write to VRAM.
In FastDoom, I reintroduced Mode 13h because Heretic/Hexen had
better-optimized ASM rendering code for this mode. Later, I was
able to partially port this approach to column rendering in Mode
Y, which resulted in a 5% to 7% performance improvement.
Based on my testing, the best mode for 486 CPUs is the VESA
direct mode (FDOOMVBD.EXE for 320x200). This mode combines the
advantages of Mode Y with the optimized rendering code from
Heretic while avoiding any OUT instructions--except for one to
switch buffers, which executes only once per rendered frame. The
only downside is that it requires a VLB or PCI graphics card with
LFB enabled and has slower performance in low-detail and
potato-detail modes.
- Conversation with Viti95
Viti95 elaborated further on fastDOOM's mode 13h during proof-reading.
In FastDoom, Mode 13h uses a single framebuffer in RAM, which is
copied to VRAM after the entire scene is rendered. Vsync is not
enforced, which may result in flickering. There are two methods
for copying the backbuffer to VRAM, optimized for different bus
speeds. For slow buses (8-bit ISA), a differential copy method is
used, transferring only modified pixels.
This approach involves many branches but is faster overall
because branching is less expensive than excessive bus transfers.
For faster buses (16-bit ISA, VLB, PCI, etc.), a full backbuffer
copy is performed using REP MOVS instructions, which is efficient
when the bus bandwidth is sufficient.
- Conversation with Viti95
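The two copy strategies can be contrasted with a small sketch. It is not
fastDOOM's code (which is x86 assembly and C): the function names are
made up, and keeping a shadow copy of the previous frame in RAM to
detect modified pixels is an assumption, since the quote does not say
how changes are detected.

```python
def full_copy(backbuffer: bytearray, vram: bytearray) -> int:
    """Copy everything, like one big REP MOVS. Returns bytes sent."""
    vram[:] = backbuffer
    return len(backbuffer)


def differential_copy(backbuffer: bytearray, shadow: bytearray,
                      vram: bytearray) -> int:
    """Send only bytes that differ from the previous frame.

    shadow is an assumed RAM copy of what VRAM currently holds, so
    the comparison never reads from the slow bus. Returns bytes sent.
    """
    sent = 0
    for i, pixel in enumerate(backbuffer):
        if pixel != shadow[i]:   # one branch per pixel
            vram[i] = pixel      # bus transfer only when needed
            shadow[i] = pixel
            sent += 1
    return sent


size = 320 * 200
backbuffer, shadow, vram = bytearray(size), bytearray(size), bytearray(size)
for i in range(0, 1000, 10):     # a frame where only 100 pixels change
    backbuffer[i] = 255
print(differential_copy(backbuffer, shadow, vram), "vs", size)
```

The differential version trades a comparison and a branch per pixel for
fewer bus writes, which pays off only when each write is expensive, as
on an 8-bit ISA card.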
More optimizations which did not work
---------------------------------------------------------------------
Another avenue I appreciated seeing explored is OpenWatcom's
processor-specific flags (4r/4s vs 3r/3s)^[8]. Both wcc386's 386 and
486 flags were attempted but ultimately discontinued because the 386
version always seemed faster.
One of my goals for FastDoom is to switch the compiler from
OpenWatcom v2 to DJGPP (GCC), which has been shown to produce
faster code with the same source. Alternatively, it would be
great if someone could improve OpenWatcom v2 to close the
performance gap.
- Conversation with Viti95
Overall impression
---------------------------------------------------------------------
What splendid work by Victor Nieto! If software can die from a
thousand cuts, Viti95 made fastDOOM awesome with three thousand
optimizations! Not only did he leverage existing improvements (Crispy,
PSX, GBA, Lee Killough), he also came up with many new ones and
generated so much hype that even Ken Silverman (author of the Duke3D
Build engine) came to participate^[9].
I tip my beret to you Victor!
References
---------------------------------------------------------------------
^ [1] Note from Viti95: Joystick and network gameplay support have
been removed, so it's not a completely feature-intact port ^^
(People are still trying to convince me to bring network
gameplay back).
^ [2] DOOM engine: gamesrc-ver-recreation
^ [3] Game Engine Black Book: Wolfenstein 3D
^ [4] Doom graphics modes usenet
^ [5] Doom graphics modes usenet
^ [6] Doom vs Heretic VGA performance difference
^ [7] Doom in DOS: Original vs Source Ports
^ [8] OpenWatcom documentation
^ [9] Note from Viti95: Some of Ken Silverman's ideas and code made
their way into the rendering functions for UMC Green CPUs,
resulting in a significant speed boost on that hardware.