This is what 1,000,000 particles looks like.
What's most useful about this is that it definitively answers my previous concerns about particle updates on the CPU versus particle updates on the GPU. In this case, the CPU is faster, but only by 3 fps at this kind of count. At slightly less extreme counts, like 100k particles, it pulls ahead by a little more (obviously feeling less fillrate pressure), but as we start to get down to more typical in-game scene counts the two become more even again.
What this means is that I can now drop one of these two options, and the one that I've chosen to drop is the CPU updater. Why drop the one that's faster? Simple, really - there are oodles of room for further optimization of the GPU drawer, but none whatsoever for the CPU drawer - it's already going pretty much as fast as it can. By not having to support the two in the same codebase I can start exploiting that optimization potential.
Update
So let it be written, so let it be done.
The GPU particle drawer is now a little faster than the CPU one was with the 1,000,000 particle case, and almost twice as fast with the 100k case.
Wednesday, June 20, 2012
Death by Particles
Posted by mhquake at 10:18 PM
8 comments:
Great work, awesome results! Can't wait for 2.0 beta source release to test this diamond. ;)
Also, could you tell us what you have changed in your GPU particle code that made it so much faster? I really like your technical posts. :)
Currently the CPU handles initial generation of particles, adding them to the list, copying them to a vertex buffer and removing them from the list when they die.
Each particle is a single 64-byte vertex containing everything needed to draw it. This is slightly larger than the four 12-byte vertexes (48 bytes in total) used for the old CPU code, but it aligns better.
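To make the size comparison concrete, here is a rough sketch of what such a vertex could look like. The field names and their order are purely hypothetical; all that is actually stated above is the 64-byte total and the 4 x 12 bytes of the old layout.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical field layout: only the 64-byte total comes from the post,
// the individual fields are assumptions made for illustration.
struct ParticleVertex {
    float position[3];    // spawn position
    float time;           // seconds since spawn
    float velocity[3];    // initial velocity
    float gravity;        // gravity scale
    float dvel[3];        // delta-velocity term
    float scale;          // billboard size
    float rampTime;       // colour ramp start (assumed)
    float rampSpeed;      // colour ramp rate (assumed)
    std::uint32_t color;  // packed RGBA (assumed)
    std::uint32_t flags;  // spare bits (assumed)
};

// Old CPU path: four position-only vertexes of 12 bytes each.
constexpr std::size_t kOldBytesPerParticle = 4 * 12;

static_assert(sizeof(ParticleVertex) == 64,
              "one particle should occupy exactly 64 bytes");
static_assert(kOldBytesPerParticle == 48,
              "old layout was 48 bytes per particle");
```

With every member being 4 bytes wide there is no padding, so the struct packs to exactly 64 bytes on the usual platforms.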
Instancing is used to expand that to a quad, and up to 64k particles are drawn per call. Geometry shaders are just too slow for this.
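As a rough illustration of the batching arithmetic, the sketch below works out how many draw calls a given particle count needs and which quad corner each of the four per-instance vertex ids maps to. It assumes "64k" means 65536 and that the quad is drawn as a 4-vertex triangle strip; neither detail is confirmed above, and the names are made up for the example.

```cpp
#include <cstdint>

// Assumption: "64k" is taken to mean 65536 particles per draw call.
constexpr std::uint32_t kMaxParticlesPerCall = 64u * 1024u;

// Number of instanced draw calls needed for n particles (rounds up).
constexpr std::uint32_t drawCallsNeeded(std::uint32_t n) {
    return (n + kMaxParticlesPerCall - 1u) / kMaxParticlesPerCall;
}

struct Corner { float x, y; };

// Maps vertex id 0..3 of a triangle strip to a quad corner in [-1, 1].
// Bit 0 selects left/right, bit 1 selects bottom/top.
constexpr Corner quadCorner(std::uint32_t vertexId) {
    return Corner{ (vertexId & 1u) ? 1.0f : -1.0f,
                   (vertexId & 2u) ? 1.0f : -1.0f };
}
```

Under those assumptions the 1,000,000 particle case works out to 16 draw calls and the 100k case to 2.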
The GPU does everything else. It takes the position, velocity, gravity, delta-velocity and time passed for each particle, computes its new position, expands it out and billboards it, and interpolates colour ramps and some alpha fade (this is really subtle).
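For a feel of what the per-vertex work looks like, here is a CPU-side C++ sketch of the same idea. It is not the actual shader: it assumes a simple constant-acceleration model (delta-velocity as acceleration, gravity subtracted on the z axis), and all names are hypothetical.

```cpp
#include <cassert>

struct Vec3 { float x, y, z; };

inline Vec3 add(Vec3 a, Vec3 b) { return Vec3{a.x + b.x, a.y + b.y, a.z + b.z}; }
inline Vec3 mul(Vec3 a, float s) { return Vec3{a.x * s, a.y * s, a.z * s}; }

// Assumed model: p(t) = origin + velocity * t + 0.5 * accel * t^2,
// with accel = dvel and gravity pulling down along z.
inline Vec3 particlePosition(Vec3 origin, Vec3 velocity, Vec3 dvel,
                             float gravity, float t) {
    assert(t >= 0.0f);  // time since spawn cannot be negative
    const Vec3 accel{dvel.x, dvel.y, dvel.z - gravity};
    return add(origin, add(mul(velocity, t), mul(accel, 0.5f * t * t)));
}

// Billboarding: push the centre out along the camera's right and up
// vectors by the corner offsets (cx, cy in [-1, 1]) times the scale.
inline Vec3 billboardCorner(Vec3 center, Vec3 right, Vec3 up,
                            float cx, float cy, float scale) {
    return add(center, add(mul(right, cx * scale), mul(up, cy * scale)));
}
```

Because the position depends only on the spawn state and the elapsed time, nothing has to be written back per frame, which is what lets the whole thing live in a vertex shader.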
The only per-particle property that is updated on the CPU is time passed since it was spawned, and - because I sort my particles into emitters - that just needs to be calculated once per emitter (because all particles belonging to an emitter are spawned at the same time) then stored to particles within that emitter.
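The per-emitter time update could be sketched roughly like this. The `Emitter` and `Particle` types and the idea that an emitter owns a contiguous range of the particle array are assumptions for the example, not a description of the real data structures.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Particle { float time; };  // other fields omitted for brevity

// Hypothetical emitter: owns particles [first, first + count).
struct Emitter {
    float spawnTime;
    std::size_t first;
    std::size_t count;
};

// Compute the age once per emitter, then store it to each of its particles.
inline void updateParticleTimes(const std::vector<Emitter>& emitters,
                                std::vector<Particle>& particles,
                                float now) {
    for (const Emitter& e : emitters) {
        assert(e.first + e.count <= particles.size());
        const float age = now - e.spawnTime;  // one subtraction per emitter
        for (std::size_t i = 0; i < e.count; ++i) {
            particles[e.first + i].time = age;
        }
    }
}
```

The inner loop is just a store, so the CPU cost per particle stays close to that of the copy it has to do anyway.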
The new position calculation is fairly gnarly and it needs to be done 4 times per particle (once for each corner of the quad), but the GPU, even doing it 4 times, is much faster than the CPU just doing it once. It's also been possible to pre-calculate some of the info needed and store that in a constant buffer, updated once per frame.
It should be possible to go a little further by reducing (or eliminating) the copy-to-vertex-buffer stage, but there are tradeoffs in terms of draw calls and constant buffer updates there. It's a fairly fine balancing act.
...and it may even be possible to exploit the SV_VertexID and SV_InstanceID builtin semantics and completely remove the spawning code from the CPU-side, moving it all to a different vertex shader per-emitter type. That has a tradeoff in terms of shader changes though, and is probably a case where reverting back to a GS might be preferable.
"So let it be written, so let it be done." ... I'm sent here by the chosen one... Hetfield. You a fan?
I have been known to partake, yes.