Over the course of the last year or so I've done quite a lot of work on lightmaps, so I'm taking the opportunity to round up the experience and knowledge I've gained to date, which also serves to explain some of the reasoning behind where I am now and where I may be going in future.
The one thing about lightmaps in Quake technology that everybody knows is that they're slow. The reasons why they're slow are not so well handled in a lot of cases, however.
The two main causes are inappropriate texture formats and pipeline stalling. Of these, the latter is the worst by far, but the former is also important. There's also some interaction between them.
For texture formats, the traditional GL_RGB/GL_UNSIGNED_BYTE combination is totally useless - it's the slowest available. On the RMQ SVN I have a test program that will attempt multiple uploads using different combinations, and it turns out that GL_BGRA/GL_UNSIGNED_INT_8_8_8_8_REV is optimal on all hardware. This requires two OpenGL extensions which were made core in 1.2 (BGRA and packed pixels), so they're totally safe to use on anything that anyone might have these days.
More about BGRA later on as it has bearing on overbrights.
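As a rough illustration of what that format means CPU-side (a sketch, not code from the RMQ engine): with GL_BGRA and GL_UNSIGNED_INT_8_8_8_8_REV each texel is one 32-bit value with blue in the low byte and alpha in the high byte. The helper names `pack_bgra8` and `rgb_to_bgra` are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack one texel for format GL_BGRA + type GL_UNSIGNED_INT_8_8_8_8_REV.
   The 32-bit value reads 0xAARRGGBB: blue in bits 0-7, alpha in bits 24-31. */
static uint32_t pack_bgra8(uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
    return ((uint32_t)a << 24) | ((uint32_t)r << 16) |
           ((uint32_t)g << 8)  |  (uint32_t)b;
}

/* Convert tightly packed RGB bytes (the traditional GL_RGB/GL_UNSIGNED_BYTE
   layout) into packed texels, setting alpha to 255 (hypothetical helper). */
static void rgb_to_bgra(const uint8_t *rgb, uint32_t *out, size_t texels)
{
    assert(rgb != NULL && out != NULL);
    for (size_t i = 0; i < texels; i++)
        out[i] = pack_bgra8(rgb[i * 3], rgb[i * 3 + 1], rgb[i * 3 + 2], 255);
}

/* The upload itself would then look something like:
   glTexSubImage2D(GL_TEXTURE_2D, 0, x, y, w, h,
                   GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, texels); */
```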
Pipeline stalling is more insidious. When you update a texture you need to transfer data from the CPU to the GPU, and if the texture currently being updated is in use for drawing the GPU must flush all pending commands, stall while the update is happening, then resume. Obviously you want to get in and out of that update as fast as possible (BGRA/etc helps here too) but the stalls will still kill you.
The only way to alleviate this in a traditional CPU to GPU model is to batch up your updates, so instead of doing one per surface (which may run into hundreds or thousands per frame) you just do a single update per texture. A slightly larger texture rectangle gets updated, but you get far fewer stalls. They still happen though, and in situations where there's lots of heavy lightmap updating going on they do pile up.
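The batching described above can be sketched as a per-texture dirty rectangle: every surface that changes grows the rectangle, and one upload per texture happens just before drawing. This is a minimal sketch under my own naming (`dirty_rect_t`, `dirty_rect_add` and `dirty_rect_flush` are hypothetical), not the actual engine code.

```c
#include <assert.h>
#include <stdbool.h>

/* One of these per lightmap texture. The rectangle is half-open:
   it covers [x0, x1) horizontally and [y0, y1) vertically. */
typedef struct {
    bool dirty;
    int x0, y0, x1, y1;
} dirty_rect_t;

/* Grow the dirty rectangle to include one modified surface's texels. */
static void dirty_rect_add(dirty_rect_t *d, int x, int y, int w, int h)
{
    assert(w > 0 && h > 0);
    if (!d->dirty) {
        d->dirty = true;
        d->x0 = x; d->y0 = y; d->x1 = x + w; d->y1 = y + h;
        return;
    }
    if (x < d->x0) d->x0 = x;
    if (y < d->y0) d->y0 = y;
    if (x + w > d->x1) d->x1 = x + w;
    if (y + h > d->y1) d->y1 = y + h;
}

/* Report the accumulated rectangle and reset it. Returns false if nothing
   changed, in which case no upload (and no stall) is needed at all.
   A single glTexSubImage2D call for the returned rectangle would go here. */
static bool dirty_rect_flush(dirty_rect_t *d, int *x, int *y, int *w, int *h)
{
    if (!d->dirty) return false;
    *x = d->x0; *y = d->y0;
    *w = d->x1 - d->x0; *h = d->y1 - d->y0;
    d->dirty = false;
    return true;
}
```

The merged rectangle may cover texels that didn't change, which is the "slightly larger texture rectangle" cost; the win is one stall per texture instead of one per surface.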
Before I move from here I want to backtrack a little to that BGRA thing I mentioned. It seems to be imagined in certain quarters that using GL_RGB (or GL_BGR) will somehow save memory - this is not the case. No current hardware has 24-bit texture support; it's all powers of two - 8, 16, 32, 64 or 128. So for a coloured lightmap that gives me 8 spare bits to play with, which I use to encode some HDR information. This lets me get a full overbright range (and beyond) with less precision loss than traditional 2x or 4x modulation, but does mean that shaders are absolutely required.
The HDR encoding is interesting here. I tried a number of different approaches, and they're worth mentioning.
Traditional HDR uses an exponential factor, but for Quake lightmaps this didn't work out well. There was extra cost involved in encoding them dynamically at runtime, and decoding was also quite expensive. There were also some visual artefacts that manifested as a slightly "crinkly" look when moving between different exponential factors.
An additive factor didn't work at all - it can't handle cases where one channel may be more than 255 higher than another.
Multiplicative looked promising and I tried a number of different approaches, but it was difficult to balance a base scale, was subject to clamping, and also had the same "crinkly" artefacts as exponential. Being subject to floating point precision loss on some target hardware counted further against it.
In the end I went with division and this worked very well. Divide each of R, G and B by the value stored in alpha and it all comes out right, precision remains high, and the drop-off is a nice curve as values go up, with more precision being retained at common ranges. Yes, division is slower, but it's not that much slower and it was a fair tradeoff.
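One way a divide-by-alpha scheme could work is sketched below. The exact encoding isn't spelled out here, so everything in this block is an assumption: light values are taken as integers where 255 is the normal maximum, alpha stores a scale chosen from the brightest channel, and the function names `hdr_encode` and `hdr_decode` are made up. The decode step mirrors what a shader would do with `rgb / alpha`.

```c
#include <assert.h>
#include <stdint.h>

/* Encode possibly overbright RGB (values may exceed 255) into 8-bit RGB
   plus an 8-bit alpha divisor. Assumed scheme, for illustration only. */
static void hdr_encode(const uint32_t rgb[3], uint8_t out[4])
{
    uint32_t peak = rgb[0];
    if (rgb[1] > peak) peak = rgb[1];
    if (rgb[2] > peak) peak = rgb[2];

    uint32_t a = 255;                  /* 255 means "divide by 1.0" */
    if (peak > 255) {
        a = (255u * 255u) / peak;      /* smaller alpha for brighter texels */
        if (a < 1) a = 1;              /* never store a zero divisor */
    }
    for (int i = 0; i < 3; i++) {
        uint64_t v = (uint64_t)rgb[i] * a / 255u;
        out[i] = (uint8_t)(v > 255 ? 255 : v);
    }
    out[3] = (uint8_t)a;
}

/* Decode: divide each channel by alpha (both treated as 0..1 fractions),
   then scale back to the 0..255-based integer range used by hdr_encode. */
static void hdr_decode(const uint8_t in[4], uint32_t rgb[3])
{
    assert(in[3] != 0);
    for (int i = 0; i < 3; i++)
        rgb[i] = (uint32_t)in[i] * 255u / in[3];
}
```

In this sketch texels at or below 255 round-trip exactly, while brighter ones lose a little precision that grows with brightness, which matches the "more precision being retained at common ranges" behaviour described above.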
Back to pipeline stalling.
The next obvious step here was to investigate moving lightmap updates from the CPU to the GPU. For animating styles this was quite easy, but the devil is in the details - or in this case the dynamic lights.
Dynamic lights on the GPU are hideously slow: slower than the CPU, with even a simple scene containing only one or two lights dropping performance by over half. The shader code required is all looping and branching, which GPUs are not good at, and the instruction counts soar.
For the RMQ engine I compromised by only adding dynamic lights per-vertex, using the reasoning that these lights are short-lived so quality loss would be more acceptable. There are still places in maps where bad per-vertex interpolation shows up, so it's not a perfect compromise, but it was necessary owing to a combination of lower hardware requirements and heavy use of animating styles.
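To give an idea of what a per-vertex pass involves, here is a minimal sketch. It is not the RMQ code: the `dlight_t` layout, the `vertex_dlight` name and the falloff curve (1 - d²/r², picked here because it needs no square root) are all assumptions for illustration.

```c
#include <assert.h>

/* Hypothetical dynamic light: position, radius and colour. */
typedef struct {
    float origin[3];
    float radius;
    float color[3];
} dlight_t;

/* Sum the contribution of every dynamic light at one vertex position.
   Falloff is 1 - d^2/r^2 inside the radius and zero outside (assumed curve). */
static void vertex_dlight(const float pos[3], const dlight_t *lights,
                          int count, float out[3])
{
    assert(count >= 0);
    out[0] = out[1] = out[2] = 0.0f;
    for (int i = 0; i < count; i++) {
        const dlight_t *l = &lights[i];
        float dx = pos[0] - l->origin[0];
        float dy = pos[1] - l->origin[1];
        float dz = pos[2] - l->origin[2];
        float dist2 = dx * dx + dy * dy + dz * dz;
        float rad2 = l->radius * l->radius;
        if (dist2 >= rad2) continue;          /* vertex is out of range */
        float w = 1.0f - dist2 / rad2;
        out[0] += l->color[0] * w;
        out[1] += l->color[1] * w;
        out[2] += l->color[2] * w;
    }
}
```

The result gets interpolated across each polygon by the hardware, which is where the bad interpolation shows up: a light sitting in the middle of a large polygon may be out of range of every one of its vertices.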
For an experimental D3D11 build I went for the full GPU per-pixel implementation; it looked beautiful but had the huge performance falloff I've already mentioned.
A hybrid CPU/GPU implementation is certainly possible, by using traditional CPU updates for dynamic lights but GPU updates for animating styles. But then you're back in pipeline stalling country, and if you update dynamics on the CPU you may as well go the whole hog and update everything on it - the extra overhead of styles is minuscule by comparison (particularly when you factor in the required stalling).
I thought about using attenuation maps but they would need a 64-bit render target to avoid clamping, and they break multitexturing. End result would most likely be a good deal slower and not a worthwhile tradeoff.
Future experiments in this area are likely to involve compute shaders (if only the documentation and examples didn't suck) or some form of render to texture.
Some other things I tried on and off.
A 10/10/10/2 texture format was useful for getting extra lighting range, but still needed clamping (at 1023 instead of 255 - but two rocket or grenade explosions in close proximity and in a bright enough area can easily exceed even this extra range, and that happens in id Quake demo1).
64-bit lightmaps gave extra range too; you could totally avoid the clamp and write directly to a locked texture rectangle (on D3D at least) which cut down on much CPU-side data copying.
Both of these formats were slower to sample GPU-side from than traditional 32-bit textures, and the 64-bit format was slower to upload to the GPU too, as well as having 16 spare bits doing nothing for you.
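The 10/10/10/2 clamp mentioned above can be sketched like this. The bit layout assumed is the one used by GL_RGBA with GL_UNSIGNED_INT_2_10_10_10_REV (red in the low 10 bits); the function names and the idea of summing per-light contributions as plain integers are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp one channel to the 10-bit maximum. */
static uint32_t clamp10(uint32_t v)
{
    return v > 1023u ? 1023u : v;
}

/* Pack for GL_RGBA + GL_UNSIGNED_INT_2_10_10_10_REV: red in bits 0-9,
   green in bits 10-19, blue in bits 20-29, 2-bit alpha (set to 3) on top. */
static uint32_t pack_rgb10a2(uint32_t r, uint32_t g, uint32_t b)
{
    return (3u << 30) | (clamp10(b) << 20) | (clamp10(g) << 10) | clamp10(r);
}

/* Sum several light contributions (count triples of r, g, b) and pack the
   result; anything past 1023 in a channel is lost to the clamp. */
static uint32_t accumulate_and_pack(const uint32_t *contrib, int count)
{
    assert(count >= 0);
    uint32_t r = 0, g = 0, b = 0;
    for (int i = 0; i < count; i++) {
        r += contrib[i * 3];
        g += contrib[i * 3 + 1];
        b += contrib[i * 3 + 2];
    }
    return pack_rgb10a2(r, g, b);
}
```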
A test I haven't yet tried is using 3 16-bit luminance textures - one for each colour channel - and writing directly into them. It should have the same CPU-side savings as a full 64-bit texture and it doesn't have the unused 16 bits, but it needs 3 texture samples on the GPU instead of one. I think it'll balance out as not worth doing, but it tickles my interest nonetheless.
A PBO (OpenGL) or dynamic texture (D3D) could completely avoid the stall by using buffer orphaning (OpenGL) or a discard lock (D3D), but requires the full texture rectangle to be uploaded rather than just the modified portion; I haven't explored this too much but I suspect it'll balance out even. This may however yet turn out to be the way to go if it hits the sweet spot of "not too much slower in the worst case", so it's probably worth doing more work with.
One thing that did work very well was using a texture array instead of discrete textures for lightmaps. It resulted in one texture bind up-front when drawing surfaces and no complex sorting or state-changes in mid texture chain, allowing for larger batch sizes and fewer draw calls. That's unfortunately not possible for RMQ as the shader extensions I need to use for it don't support texture arrays (and updating to GLSL is a non-runner as that in turn is not supported on some of the target hardware). I could write a second rendering path but I really really really don't want to do that as everything I do I'd need to do twice, and the time investment required for keeping both paths consistent wouldn't justify it (the renderer isn't the bottleneck in RMQ anyway - the QC is, and by a very large margin).
So that brings everything mostly up to date. I might add more in comments if anything else relevant comes to mind, but for now this summarizes things as I remember them.
What's important here is that if I'm enthusing about a particular approach at any given moment in time, it needs to be remembered that not all details about it may have fallen out yet, and as further info does come to light the situation is liable to change.
Thursday, March 22, 2012
Lightmaps Roundup
Posted by mhquake at 10:51 PM
4 comments:
ot:
I don't know if you're still entertaining the possibility of a deferred shading approach, but there's been some research into DX11 tile based forward rendering with some promising results, allowing for very high light counts without the fat buffers penalty.
http://aras-p.info/blog/2012/03/27/tiled-forward-shading-links/
I've looked at this, checked out the AMD demo, and it's just too slow for the kind of hardware I'm aiming at.
I just found out how badly dynamic lights run on Intel hardware the other day. I got QMB compiling using MinGW on an 11-inch laptop with an Intel i5 ULV; frame rates are good, 60-ish, until dynamic lights turn up, then bam, 1 or 2 fps.
I never realised it was so bad.
Yeah, I suspect that the Intel driver is actually pulling the texture data back to system memory in order to do the update (under certain conditions).
It can be made fast by switching to an R_BlendLightmaps style of update (i.e. update each texture in bulk before drawing rather than a small piece at a time), and using internalformat GL_RGBA8, format GL_BGRA and type GL_UNSIGNED_INT_8_8_8_8_REV. I've measured that as about 40 times faster than the more traditional GL_RGB/GL_UNSIGNED_BYTE with Intels.
Interestingly, Intel actually becomes faster at texture updates than AMD with that kind of setup, but AMD's better performance elsewhere still pulls it ahead by a large margin.