rand () is slow!
I recently constructed a benchmark that just chucks lots of stuff on the screen in order to stress-test various parts of my code and help identify potential bottlenecks. As part of that I needed to call rand 1.2 million times per frame. Replacing this with an implementation of fastrand (the non-SSE version) doubled my framerates in the test.
Yeah, it's an artificial test that will never be encountered in a for-real in-game scene, but all the same - this is one of those insidious creeping bottlenecks where 99% of the time you never notice any impact from it, but in the 1% of cases where you do, it can totally bring your performance crashing down.
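For reference, here's a minimal sketch of the kind of thing I mean by a non-SSE fastrand. It's the plain linear congruential generator that commonly circulates under that name; I'm assuming that variant here, and the function names are just illustrative. It's nowhere near good enough for anything statistical or security-related, but for scattering junk around the screen it's fine.

```c
#include <stdint.h>

/* Generator state; a single global, so this sketch is not thread-safe. */
static uint32_t g_seed = 1;

/* Seed the generator (the name fast_srand is illustrative). */
void fast_srand(uint32_t seed)
{
    g_seed = seed;
}

/* One multiply, one add, one shift and one mask per call.
   Returns a value in the range [0, 32767]. */
int fastrand(void)
{
    g_seed = 214013u * g_seed + 2531011u;   /* wraps modulo 2^32 */
    return (int)((g_seed >> 16) & 0x7FFF);  /* keep 15 of the upper bits */
}
```

Usage is the same as rand: call fast_srand once, then fastrand wherever you'd otherwise call rand. Whether it beats your C runtime's rand by much will depend on the runtime, so measure before assuming.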
CPU-side matrix operations can also be slow. If you find yourself doing a lot of these, they can drag you down further than you think. In the same test, I constructed a matrix on-the-fly in my vertex shader and did two matrix multiplies there instead of my previous one, with all of this happening per-vertex instead of as a single CPU-side setup per-object (which is SSE-optimized). That was worth maybe a 50-odd percent improvement.
I already knew that from my adventures with IQMs last year, but despite that, it was quite an eye-opener to see the vertex shader take on well over an order of magnitude more work and still give such a dramatic improvement over just doing it once per-object on the CPU.
I wouldn't quite call that a general principle that applies to everything, but it is interesting to think how it may apply to some things.
Again, this was a stupidly heavy stress test that you won't see anything even remotely like in a real in-game scene so it won't translate into real-world improvements on the same scale.
IASetVertexBuffers is faster if you only change the stride and offset, but leave the buffer the same. I went back to packing all MDLs into a single set of buffers (rather than using a separate set of buffers for each MDL) and got some very useful improvements from that. It's probably worthwhile investigating potential for different objects to share the same buffer(s) sometime - I could probably usefully put the per-instance data for GUI stuff, MDLs, sprites and particles into a common buffer, for example.
There's a caveat to that which is that - at least with some drivers - input layouts may fail to compile if the buffer slots they're using are not strictly sequential. Right now I have a nice setup where I never actually need to change vertex buffers at runtime - it works well and is fast, and I suspect that using a shared buffer may break this - or at least make it become intolerably messy.
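To make the packing idea concrete, here's a rough sketch of the bookkeeping side of it: reserve a byte range for each model inside one big buffer and remember its stride and offset, so that at draw time only those two values need to change in the IASetVertexBuffers call. The struct and function names are hypothetical, and the 16-byte alignment of each range is just a conservative assumption on my part, not something the API demands as far as I know. No actual D3D calls are made here.

```c
/* Tracks how much of one large shared vertex buffer has been handed out. */
typedef struct {
    unsigned int capacity;  /* total size of the shared buffer, in bytes */
    unsigned int used;      /* bytes reserved so far */
} shared_vb_t;

/* What each model remembers: the values that would change per draw. */
typedef struct {
    unsigned int stride;    /* bytes per vertex */
    unsigned int offset;    /* byte offset of the model's first vertex */
} vb_range_t;

/* Reserve room for num_verts vertices of the given stride.
   Returns 1 on success and fills *out, or 0 if the buffer is full. */
int shared_vb_reserve(shared_vb_t *vb, unsigned int stride,
                      unsigned int num_verts, vb_range_t *out)
{
    /* Round the start up to a 16-byte boundary (an assumed alignment). */
    unsigned long long start = ((unsigned long long)vb->used + 15u) & ~15ull;
    unsigned long long size = (unsigned long long)stride * num_verts;

    if (start + size > vb->capacity)
        return 0;           /* doesn't fit; leave the buffer untouched */

    out->stride = stride;
    out->offset = (unsigned int)start;
    vb->used = (unsigned int)(start + size);
    return 1;
}
```

At load time each MDL would call shared_vb_reserve once and upload its vertices at the returned offset; at draw time the same buffer pointer is passed every time and only the stored stride and offset differ.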
Wednesday, June 20, 2012
Posted by mhquake at 9:14 PM