[HN Gopher] An optimized 2D game engine can render 200k sprites ...
___________________________________________________________________
An optimized 2D game engine can render 200k sprites at 200fps
[video]
Author : farzher
Score : 30 points
Date : 2022-05-01 17:54 UTC (2 days ago)
(HTM) web link (www.youtube.com)
(TXT) w3m dump (www.youtube.com)
| _aavaa_ wrote:
| By the looks of it, this is in Jonathan Blow's Jai language.
|
| How are you finding working with it? Have you done a similar
| thing in C++ to compare the results and the process of writing
| it?
|
| 200k at 200fps on an 8700k with a 1070 seems like a lot of
| rabbits. Are there similar benchmarks to compare against in other
| languages?
| farzher wrote:
| it's a lot of fun! jai is my intro to systems programming. so i
| haven't tried this in C++ (actually i have tried a few times
| over the past few years but never successfully).
|
| this is just a test of opengl, C++ should give the exact same
| performance considering my cpu usage is only 7% while gpu usage
| is 80%. but the process of writing it is infinitely better than
| C++, since i never got C++ to compile a hardware-accelerated
| bunnymark.
|
| the only bunnymarks i'm aware of are slow
| https://www.reddit.com/r/Kha/comments/8hjupc/how_the_heck_is...
|
| which is why i wrote this, to see how fast it could go.
| DantesKite wrote:
| I thought Jai wasn't released yet. Are you a beta user or did
| he release it already?
| adamrezich wrote:
| the official rendering modules are a bit all over the place
| atm... did you use Simp, Render, GL, or handle the rendering
| yourself?
| xaedes wrote:
| Nice demo! We need more of this approach.
|
| You really can achieve amazing stuff with just plain OpenGL,
| optimized for your rendering needs. With today's GPU acceleration
| capabilities we could have town-building games with huge map
| resolutions and millions of entities. Instead it's mostly just
| used to make fancy graphics.
|
| Actually I am currently trying to build something like that [1].
| A big, big world with hundreds of millions of sprites is
| achievable and runs smoothly; video RAM is the limit. Admittedly
| it is not optimized to display those hundreds of millions of
| sprites all at once, maybe just a few million. It would be a bit
| too chaotic for a game anyway, I guess.
|
| [1] https://www.youtube.com/watch?v=6ADWXIr_IUc
| p1necone wrote:
| Is this not done because of technical limitations, or just
| because a town-building game with millions of entities would
| not be fun/manageable for the player?
|
| Although, there are a few space 4X games that try this
| "everything is simulated" kind of approach and succeed.
| Allowing AI control of everything the player doesn't want to
| manage themselves is one nice way of dealing with it. See:
| https://store.steampowered.com/app/261470/Distant_Worlds_Uni...
| bob1029 wrote:
| > We need more of this approach.
|
| 1000% agree.
|
| I recently took it upon myself to see just how far I can push
| modern hardware with some very tight constraints. I've been
| playing around with a 100% custom 3D rasterizer which operates
| purely on the CPU. For reasonable scenes (<10k triangles) and
| resolutions (720~1080p), I have been able to push over 30fps
| with a _single_ thread. On a 5950x, I was able to support over
| 10 clients simultaneously without any issues. The GPU in my
| workstation is just moving the final content to the display
| device via whatever means necessary. The machine generating the
| frames doesn't even need a graphics device installed at all...
|
| To be clear, this is exceptionally primitive graphics
| capability, but there are many styles of interactive experience
| that do _not_ demand 4k textures, global illumination, etc. I
| am also not fully extracting the capabilities of my CPU. There
| are many optimizations (e.g. SIMD) that could be applied to get
| even more uplift.
|
| One fun thing I discovered is just how low latency a pure CPU
| rasterizer can be compared to a full CPU-GPU pipeline. I have
| CPU-only user-interactive experiences that can go from input
| event to final output frame in under 2 milliseconds. I don't
| think even games like Overwatch can react to user input that
| quickly.
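|
| For a sense of what the core of such a pure-CPU approach looks
| like, here is a toy triangle fill using edge functions over a
| bounding box (a generic C++ sketch, not this renderer; the
| framebuffer format and names are made up):
|
|     #include <algorithm>
|     #include <cmath>
|     #include <cstdint>
|     #include <vector>
|
|     struct Vec2 { float x, y; };
|
|     // Signed area test: >= 0 means p is on the inside edge for a
|     // counter-clockwise triangle.
|     static float edge(Vec2 a, Vec2 b, Vec2 p) {
|         return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
|     }
|
|     // Fill one CCW triangle into a 32-bit framebuffer, single-threaded.
|     void fill_triangle(std::vector<uint32_t>& fb, int w, int h,
|                        Vec2 v0, Vec2 v1, Vec2 v2, uint32_t color) {
|         int minx = std::max(0,     (int)std::floor(std::min({v0.x, v1.x, v2.x})));
|         int maxx = std::min(w - 1, (int)std::ceil (std::max({v0.x, v1.x, v2.x})));
|         int miny = std::max(0,     (int)std::floor(std::min({v0.y, v1.y, v2.y})));
|         int maxy = std::min(h - 1, (int)std::ceil (std::max({v0.y, v1.y, v2.y})));
|         for (int y = miny; y <= maxy; ++y)
|             for (int x = minx; x <= maxx; ++x) {
|                 Vec2 p = { x + 0.5f, y + 0.5f };   // sample at pixel center
|                 if (edge(v1, v2, p) >= 0 &&
|                     edge(v2, v0, p) >= 0 &&
|                     edge(v0, v1, p) >= 0)
|                     fb[y * w + x] = color;         // no depth or interpolation
|             }
|     }
|
| A real renderer adds a z-buffer, attribute interpolation, and
| (as mentioned) SIMD, but the inner loop stays this simple.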
| kingcharles wrote:
| Just to be clear - you're writing a "software-based" 3D
| renderer, right? This is the sort of thing I excelled at back
| in the late 80s, early 90s, before the first 3D accelerators
| turned up around 1995 I think.
|
| What features does your renderer support in terms of shading
| and texturing? Are you writing this all in a high-level
| language, e.g. C, or assembler? If assembler, what CPUs and
| features are you targeting?
|
| And of course, why?
| syntheweave wrote:
| The upper rendering limit generally isn't explored deeply by
| games because, as soon as you add simulation behaviors, they
| impose new bottlenecks. And the design space of "large scale"
| is often restricted by what is necessary to implement it; many
| of Minecraft's bugs, for example, are edge cases of streaming
| in the world data in chunks.
|
| Thus games that ship to a schedule are hugely incentivized to
| favor making smaller play spaces with more authored detail,
| since that controls all the outcomes and reduces the technical
| dependencies of how scenes are authored.
|
| There is a more philosophical reason to go in that direction
| too: Simulation building is essentially the art of building
| Plato's cave, and spending all your time on making the cave
| very large and the puppets extremely elaborate is a rather
| dubious idea.
| SemanticStrengh wrote:
| Yes, although the performance is probably largely due to
| occlusion? Also, the sprites do not collide with their
| environment.
| chmod775 wrote:
| Bit of a tangent and useless thought experiment, but I think you
| could render an infinite number of such bunnies, or as many as
| you can fit in RAM/simulate. On the CPU, for each frame, iterate
| over all bunnies. Do your simulation for each bunny and, at the
| pixel corresponding to its position, store its information in a
| texture at that pixel if it sits above the bunny currently
| stored there (just its logical position, don't write it into
| all the pixels of its sprite!). Then on the GPU have a pixel
| shader look up (in surrounding pixels) the topmost bunny for the
| current pixel and draw it (or just draw all the overlaps using
| the z-buffer). For your source texture, use 0 for no bunny, and
| other values to indicate the bunny's z-position.
|
| The CPU work would be O(n) and the rendering/GPU work O(m*k),
| where n is the number of bunnies, m is the display resolution and
| k is the size of our bunny sprite.
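|
| A rough sketch of the CPU half of that idea (the struct fields
| and the single-channel texture layout are just for illustration):
|
|     #include <algorithm>
|     #include <vector>
|
|     struct Bunny { float x, y, z; };  // logical position plus depth
|
|     // O(n) pass: per screen pixel, keep only the topmost bunny whose
|     // logical position lands on that pixel. 0.0 means "no bunny here".
|     void build_bunny_texture(const std::vector<Bunny>& bunnies,
|                              std::vector<float>& tex, int w, int h) {
|         std::fill(tex.begin(), tex.end(), 0.0f);
|         for (const Bunny& b : bunnies) {
|             int px = (int)b.x, py = (int)b.y;
|             if (px < 0 || px >= w || py < 0 || py >= h)
|                 continue;                    // off-screen: discarded early
|             float& cell = tex[py * w + px];
|             if (b.z > cell) cell = b.z;      // keep the topmost bunny only
|         }
|         // Upload 'tex' (e.g. glTexSubImage2D); the pixel shader then scans
|         // the k x k neighbourhood of each output pixel for a stored bunny
|         // that covers it and draws the corresponding sprite texel.
|     }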
|
| The advantage of this (in real applications utterly useless[1])
| method is that CPU work only increases linearly with the number
| of bunnies, you get to discard bunnies you don't care about
| really early in the process, and GPU work is constant regardless
| of how many bunnies you add.
|
| It's conceptually similar to rendering voxels, except you're not
| tracing rays deep, but instead sweeping wide.
|
| As long as your GPU is fine with sampling that many surrounding
| pixels, you're exploiting the capabilities of both your CPU and
| GPU quite well. Also the CPU work can be parallelized: Each
| thread operates on a subset of the bunnies and on its own
| texture, and only in the final step are the textures combined
| into one (which can also be done in parallel!). I wouldn't be
| surprised if modern CPUs could handle millions of bunnies while
| modern GPUs would just shrug as long as the sprite is small.
|
| [1] In reality you don't have sprites of a constant size, and
| this method can't properly deal with transparency of any kind.
| The size of your sprites will be directly limited by how many
| surrounding pixels your shader looks up during rendering, even if
| you add support for multiple sprites/sprite sizes using other
| channels on your textures.
| sqrt_1 wrote:
| I assume each sprite is moved on the CPU and the position data is
| passed to the GPU for rendering.
|
| Curious how you are passing the data to the GPU - are you using
| a single dynamic vertex buffer that is uploaded each frame?
|
| Is the vertex data a single position and the GPU is generating
| the quad from this?
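|
| For reference, one common way to do that with plain OpenGL is a
| single streamed per-instance buffer plus an instanced draw,
| something like this (untested sketch, not necessarily what the
| video does; assumes a GL 3.3+ context, a bound VAO and shader):
|
|     #include <glad/glad.h>  // or any other GL function loader
|
|     void upload_and_draw(GLuint instance_vbo,
|                          const float* positions,  // x,y per sprite
|                          int sprite_count) {
|         // Re-upload every sprite position each frame, orphaning the
|         // old storage so the driver doesn't stall.
|         glBindBuffer(GL_ARRAY_BUFFER, instance_vbo);
|         glBufferData(GL_ARRAY_BUFFER, sprite_count * 2 * sizeof(float),
|                      positions, GL_STREAM_DRAW);
|
|         // Attribute 1 = per-instance position, advanced once per instance.
|         glEnableVertexAttribArray(1);
|         glVertexAttribPointer(1, 2, GL_FLOAT, GL_FALSE, 0, (void*)0);
|         glVertexAttribDivisor(1, 1);
|
|         // 4 strip vertices per quad, one instance per sprite. The vertex
|         // shader can build the corner from gl_VertexID, e.g.
|         //   vec2 corner = vec2(gl_VertexID & 1, gl_VertexID >> 1);
|         //   gl_Position = proj * vec4(pos + corner * size, 0.0, 1.0);
|         glDrawArraysInstanced(GL_TRIANGLE_STRIP, 0, 4, sprite_count);
|     }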
| farzher wrote:
| i finally got around to writing an opengl "bunnymark" to check
| how fast computers are.
|
| i got 200k sprites at 200fps on a 1070 (while recording). i'm not
| sure anyone could survive that many vampires.
| nick__m wrote:
| that many rabbits, it's frightening!
|
| Do you have the code somewhere? I would like to see how it's
| made.
| juancn wrote:
| Neat. Isn't this akin to 400k triangles on a GPU? So as long as
| you do instancing it doesn't seem too difficult (performance-
| wise) in itself. Even if there are many sprites, texture mapping
| should take care of getting the pixels onto the screen.
|
| My guess is that the rendering is not the hardest part, although
| it's kinda cool.
| moffkalast wrote:
| 200k sprites is roughly a mesh with 400k triangles, assuming
| each sprite is a quad and it's all instanced/batched into one
| draw call, as it should be. It's quite a bit, but most modern
| GPUs should be able to handle that easily.
|
| It's moving the individual quads around that can be kinda
| tricky. Draw calls are still the most limiting thing, I think,
| but a good ballpark for those was around 1k max per scene last I
| checked, so merging an entire scene into one geometry isn't
| something that usually needs to be done in practice. This is
| premature optimization at its best.
___________________________________________________________________
(page generated 2022-05-03 23:00 UTC)