Over the course of the last year or so I've done quite a lot of work on lightmaps, so I'm taking the opportunity to round up the experience and knowledge I've gained to date, which also serves to explain some of the reasoning behind where I am now and where I may be going in future.
The one thing about lightmaps in Quake technology that everybody knows is that they're slow. The reasons why they're slow are often not so well understood, however.
The two main causes are inappropriate texture formats and pipeline stalling. Of these, the latter is by far the worse, but the former is also important. There's also some interaction between them.
For texture formats, the traditional GL_RGB/GL_UNSIGNED_BYTE combination is totally useless - it's the slowest available. On the RMQ SVN I have a test program that will attempt multiple uploads using different combinations, and it turns out that GL_BGRA/GL_UNSIGNED_INT_8_8_8_8_REV is optimal on all hardware. This requires two OpenGL extensions which were made core in 1.2 (BGRA and packed pixels), so they're totally safe to use on anything that anyone might have these days.
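To make the format concrete, here is a minimal sketch of packing lightmap texels into the 32-bit words that GL_BGRA/GL_UNSIGNED_INT_8_8_8_8_REV reads. The function names are made up for illustration and are not taken from the RMQ source; the GL call is only shown in a comment because it needs a live context.

```c
#include <stdint.h>

/* Pack one texel into the word layout read by
   GL_BGRA + GL_UNSIGNED_INT_8_8_8_8_REV: blue in the low byte,
   then green, then red, with alpha in the high byte.
   (Hypothetical helper name, for illustration only.) */
static uint32_t pack_bgra8(uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
    return ((uint32_t)a << 24) | ((uint32_t)r << 16) |
           ((uint32_t)g << 8)  |  (uint32_t)b;
}

/* Convert 'count' tightly packed RGB byte triplets into 32-bit texels,
   setting alpha to 255. */
static void pack_lightmap_block(const uint8_t *rgb, uint32_t *dst, int count)
{
    for (int i = 0; i < count; i++, rgb += 3)
        dst[i] = pack_bgra8(rgb[0], rgb[1], rgb[2], 255);
}

/* With a GL context current, the upload would then look something like:
   glTexSubImage2D(GL_TEXTURE_2D, 0, x, y, w, h,
                   GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, dst); */
```

Working in whole 32-bit words on the CPU side also means each texel is written with a single store, which may be part of why this combination tends to upload well.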
More about BGRA later on as it has bearing on overbrights.
Pipeline stalling is more insidious. When you update a texture you need to transfer data from the CPU to the GPU, and if the texture currently being updated is in use for drawing the GPU must flush all pending commands, stall while the update is happening, then resume. Obviously you want to get in and out of that update as fast as possible (BGRA/etc helps here too) but the stalls will still kill you.
The only way to alleviate this in a traditional CPU to GPU model is to batch up your updates, so instead of doing one per surface (which may run into hundreds or thousands per frame) you just do a single update per texture. A slightly larger texture rectangle gets updated, but you get far fewer stalls. They still happen though, and in situations where there's lots of heavy lightmap updating going on they do pile up.
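One way to picture the batching is a dirty rectangle per lightmap texture: each modified surface grows the rectangle, and a single upload covers it at the end. This is only a sketch under that assumption; the struct and function names are hypothetical.

```c
/* Hypothetical per-texture dirty rectangle; right and bottom are exclusive. */
typedef struct {
    int left, top, right, bottom;
    int dirty;
} dirtyrect_t;

/* Reset the rectangle at the start of a frame (or after uploading). */
static void DirtyRect_Clear(dirtyrect_t *d)
{
    d->left = d->top = d->right = d->bottom = 0;
    d->dirty = 0;
}

/* Grow the rectangle so that it also covers one surface's lightmap block. */
static void DirtyRect_Add(dirtyrect_t *d, int x, int y, int w, int h)
{
    if (!d->dirty) {
        d->left = x;
        d->top = y;
        d->right = x + w;
        d->bottom = y + h;
        d->dirty = 1;
        return;
    }
    if (x < d->left) d->left = x;
    if (y < d->top) d->top = y;
    if (x + w > d->right) d->right = x + w;
    if (y + h > d->bottom) d->bottom = y + h;
}
```

After all surfaces for the frame have been processed, any texture whose rectangle is dirty gets exactly one sub-image upload covering (left, top) to (right, bottom), however many surfaces touched it.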
Before I move on from here I want to backtrack a little to that BGRA thing I mentioned. It seems to be imagined in certain quarters that using GL_RGB (or GL_BGR) will somehow save memory - this is not the case. No current hardware has 24-bit texture support; texel sizes are all powers of two - 8, 16, 32, 64 or 128 bits. So for a coloured lightmap that gives me 8 spare bits to play with, which I use to encode some HDR information. This lets me get a full overbright range (and beyond) with less precision loss than traditional 2x or 4x modulation, but does mean that shaders are absolutely required.
The HDR encoding is interesting here. I tried a number of different approaches, and they're worth mentioning.
Traditional HDR uses an exponential factor, but for Quake lightmaps this didn't work out well. There was extra cost involved in encoding them dynamically at runtime, and decoding was also quite expensive. There were also some visual artefacts that manifested as a slightly "crinkly" look when moving between different exponential factors.
An additive factor didn't work at all - it can't handle cases where one channel may be more than 255 higher than another.
Multiplicative looked promising and I tried a number of different approaches, but it was difficult to balance a base scale, was subject to clamping, and also had the same "crinkly" artefacts as exponential. Being subject to floating point precision loss on some target hardware counted further against it.
In the end I went with division and this worked very well. Divide each of R, G and B by the value stored in alpha and it all comes out right, precision remains high, and the drop-off is a nice curve as values go up, with more precision being retained at common ranges. Yes, division is slower, but it's not that much slower and it was a fair tradeoff.
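A rough CPU-side sketch of how a division-based encoding could work is below. The exact scheme isn't spelled out above, so the choice of alpha here (255 meaning "divide by 1.0", smaller values for brighter texels) and all of the names are assumptions made for illustration.

```c
#include <stdint.h>

typedef struct { uint8_t r, g, b, a; } hdr_texel_t;

/* Largest light value this sketch can represent (alpha bottoms out at 1). */
#define HDR_MAX_LIGHT (255 * 255)

static int hdr_clamp(int v)
{
    return v < 0 ? 0 : (v > HDR_MAX_LIGHT ? HDR_MAX_LIGHT : v);
}

/* Encode light values (255 = normal full bright, higher = overbright).
   Alpha is chosen so that the brightest channel still fits in 8 bits. */
static hdr_texel_t hdr_encode(int r, int g, int b)
{
    hdr_texel_t t;
    int maxc, alpha = 255;

    r = hdr_clamp(r);
    g = hdr_clamp(g);
    b = hdr_clamp(b);

    maxc = r > g ? r : g;
    if (b > maxc) maxc = b;
    if (maxc > 255) alpha = HDR_MAX_LIGHT / maxc; /* always >= 1 after clamping */

    t.r = (uint8_t)(r * alpha / 255);
    t.g = (uint8_t)(g * alpha / 255);
    t.b = (uint8_t)(b * alpha / 255);
    t.a = (uint8_t)alpha;
    return t;
}

/* Decode one channel: the CPU equivalent of "rgb / alpha" in the shader.
   alpha is never 0 for texels produced by hdr_encode. */
static float hdr_decode(uint8_t channel, uint8_t alpha)
{
    return channel * 255.0f / alpha;
}
```

Texels that never exceed 255 keep alpha at 255 and lose nothing, which fits the point about precision being retained at common ranges; only the brighter texels trade precision for range.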
Back to pipeline stalling.
The next obvious step here was to investigate moving lightmap updates from the CPU to the GPU. For animating styles this was quite easy, but the devil is in the details - or in this case the dynamic lights.
Dynamic lights on the GPU are hideously slow - slower than on the CPU, with even a simple scene containing only one or two lights dropping performance by over half. The shader code required is all looping, branching stuff, which GPUs are not good at, and the instruction counts soar.
For the RMQ engine I compromised by only adding dynamic lights per-vertex, using the reasoning that these lights are short-lived so quality loss would be more acceptable. There are still places in maps where bad per-vertex interpolation shows up, so it's not a perfect compromise, but it was necessary owing to a combination of lower hardware requirements and heavy use of animating styles.
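For illustration only, a per-vertex accumulation might look something like the sketch below. The quadratic falloff is an assumption picked to keep the example short (it avoids a square root) and is not Quake's actual dynamic light formula; the type and function names are likewise hypothetical.

```c
/* Hypothetical dynamic light: a position and a radius of effect. */
typedef struct {
    float origin[3];
    float radius;
} dlight_sketch_t;

/* Sum the contribution of every light at a single vertex.
   Each light contributes 1 - (d*d)/(r*r) inside its radius and nothing
   outside it; the result would then be interpolated across the surface. */
static float VertexDynamicLight(const float v[3],
                                const dlight_sketch_t *lights, int numlights)
{
    float total = 0.0f;

    for (int i = 0; i < numlights; i++) {
        float dx = v[0] - lights[i].origin[0];
        float dy = v[1] - lights[i].origin[1];
        float dz = v[2] - lights[i].origin[2];
        float dist2 = dx * dx + dy * dy + dz * dz;
        float rad2 = lights[i].radius * lights[i].radius;

        if (dist2 < rad2)
            total += 1.0f - dist2 / rad2;
    }
    return total;
}
```

The loop and the branch are the same shape as the per-pixel version; running them once per vertex rather than once per pixel is where the saving comes from, and a light that falls between vertices on a large surface is where the interpolation artefacts come from.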
For an experimental D3D11 build I went for the full GPU per-pixel implementation; it looked beautiful but had the huge performance falloff I've already mentioned.
A hybrid CPU/GPU implementation is certainly possible, by using traditional CPU updates for dynamic lights but GPU updates for animating styles. But then you're back in pipeline stalling country, and if you update dynamics on the CPU you may as well go the whole hog and update everything on it - the extra overhead of styles is minuscule by comparison (particularly when you factor in the required stalling).
I thought about using attenuation maps but they would need a 64-bit render target to avoid clamping, and they break multitexturing. End result would most likely be a good deal slower and not a worthwhile tradeoff.
Future experiments in this area are likely to involve compute shaders (if only the documentation and examples didn't suck) or some form of render to texture.
Some other things I tried on and off.
A 10/10/10/2 texture format was useful for getting extra lighting range, but still needed clamping, at 1023 instead of 255. Two rocket or grenade explosions in close proximity and in a bright enough area can easily exceed even this extra range, and that happens in id Quake demo1.
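As a sketch of where that clamp sits, here is one way to pack a texel into a 10/10/10/2 word. The bit layout (red in the low 10 bits, alpha in the top 2) is assumed to match DXGI_FORMAT_R10G10B10A2_UNORM, and the function names are hypothetical.

```c
#include <stdint.h>

/* Clamp a light value to the 10-bit range. */
static uint32_t clamp10(int v)
{
    return (uint32_t)(v < 0 ? 0 : (v > 1023 ? 1023 : v));
}

/* Pack three light values into one 10/10/10/2 word.
   Assumed layout: red in bits 0-9, green in 10-19, blue in 20-29,
   and the 2-bit alpha (set to its maximum, 3) in bits 30-31. */
static uint32_t Pack1010102(int r, int g, int b)
{
    return (3u << 30) | (clamp10(b) << 20) | (clamp10(g) << 10) | clamp10(r);
}
```

Anything above 1023 is simply lost at this point, which is the overlapping-explosions case described above.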
64-bit lightmaps gave extra range too; you could totally avoid the clamp and write directly to a locked texture rectangle (on D3D at least) which cut down on much CPU-side data copying.
Both of these formats were slower to sample GPU-side from than traditional 32-bit textures, and the 64-bit format was slower to upload to the GPU too, as well as having 16 spare bits doing nothing for you.
A test I haven't yet tried is using three 16-bit luminance textures - one for each colour channel - and writing directly into them. It should have the same CPU-side savings as a full 64-bit texture and it doesn't have the unused 16 bits, but it needs 3 texture samples on the GPU instead of one. I think it'll balance out as not worth doing, but it tickles my interest nonetheless.
A PBO (OpenGL) or dynamic texture (D3D) could completely avoid the stall by using buffer orphaning (OpenGL) or a discard lock (D3D), but requires the full texture rectangle to be uploaded rather than just the modified portion; I haven't explored this too much but I suspect it'll balance out even. This may however yet turn out to be the way to go if it hits the sweet spot of "not too much slower in the worst case", so it's probably worth doing more work on.
One thing that did work very well was using a texture array instead of discrete textures for lightmaps. It resulted in one texture bind up-front when drawing surfaces and no complex sorting or state-changes in mid texture chain, allowing for larger batch sizes and fewer draw calls. That's unfortunately not possible for RMQ as the shader extensions I need to use for it don't support texture arrays (and updating to GLSL is a non-runner as that in turn is not supported on some of the target hardware). I could write a second rendering path but I really really really don't want to do that as everything I do I'd need to do twice, and the time investment required for keeping both paths consistent wouldn't justify it (the renderer isn't the bottleneck in RMQ anyway - the QC is, and by a very large margin).
So that brings everything mostly up to date. I might add more in comments if anything else relevant comes to mind, but for now this summarizes things as I remember them.
What's important here is that if I'm enthusing about a particular approach at any given moment in time, it needs to be remembered that not all details about it may have fallen out yet, and as further info does come to light the situation is liable to change.
Thursday, March 22, 2012
Lightmaps Roundup
Posted by mhquake at 10:51 PM (4 comments)
Wednesday, March 21, 2012
Wednesday 21st March 2012
I'm overdue an update so here we go.
I've been spending parts of the last week or so working on a GL 3.x port of Quake 2. This was not something I'd intended to do, and it's probably not something I'll release, but it is functionally complete and I've done a full run through the game using it. The main purpose behind it was to get some practice in a relatively safe environment with some of the newer OpenGL features.
I still think that much of OpenGL is quite insane, but at this level most of the more objectionable parts go away - even dynamic vertex buffers actually become usable and well-performing rather than an ill-conceived pile of junk. I'll probably continue on and off with some more GL 3+ features (VAOs, separate shader objects, explicit attribute binding locations) as it's good exercise to shake oneself out of one's comfort zone a little and to get a different perspective on things. It also gives me a large block of working code if I ever decide to lift some of it over to RMQ as options.
A large portion of time went into the Dreaded Clipping Bug. I now have a fix that works in maybe 99% of cases using some code stolen from Quake 2 - a grand total of 3 characters! It involves a subtle but definitely noticeable physics/movement change, however, so I need to work out some further details on it before I can call it satisfactory.
The clipping bug itself comes from a false positive detection of the case where the physics code needs to check for a player going up a step; I am now absolutely certain of that. What I'm also certain of is that it's a bug from id's original Quake code as I've reproduced it in maybe 5 or 6 other engines (and have also reproduced my fix for it in those too).
The fix? Find the call to ClipVelocity in SV_FlyMove and change the last parameter from 1 to 1.01 - that's all. But like I said, this changes general movement in a manner that feels closer to Quake 2 than it does to Quake, so it needs some further tuning before I can feel totally happy with it.
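To show what that last parameter actually does, here is a standalone sketch modelled on ClipVelocity from id's GPL Quake source, rewritten with plain float arrays so it compiles on its own. The lower-case name marks it as an illustration rather than a drop-in replacement for the engine function.

```c
#define STOP_EPSILON 0.1f

/* Slide a velocity along a plane. 'overbounce' scales how much of the
   into-plane component is removed: 1.0 removes exactly that component,
   1.01 removes slightly more, leaving a small push away from the plane.
   Returns 1 if the plane is a floor, 2 if it is a vertical wall (step). */
static int clip_velocity(const float in[3], const float normal[3],
                         float out[3], float overbounce)
{
    int blocked = 0;
    float backoff;

    if (normal[2] > 0.0f)
        blocked |= 1; /* floor */
    if (normal[2] == 0.0f)
        blocked |= 2; /* step */

    backoff = (in[0] * normal[0] + in[1] * normal[1] + in[2] * normal[2])
              * overbounce;

    for (int i = 0; i < 3; i++) {
        out[i] = in[i] - normal[i] * backoff;
        /* snap tiny components to zero */
        if (out[i] > -STOP_EPSILON && out[i] < STOP_EPSILON)
            out[i] = 0.0f;
    }
    return blocked;
}
```

With a velocity of (100, 0, -100) hitting a flat floor, an overbounce of 1 leaves (100, 0, 0), while 1.01 leaves roughly (100, 0, 1). That small push away from the surface is the kind of change in movement feel described above.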
I've also done some work on reducing the CPU load of RMQ's QC, and have found a case where a SV_HullPointContents call can be replaced with a faster path. As this is a fairly common case it translates to a good performance increase. That's just a matter of detecting if we're checking hull 0 and num 0 and if so sending it through Mod_PointInLeaf instead (as hull 0 is the drawing hull replicated as a clipping hull it should work out as the same result, but runs a good deal faster).
That's about all for this batch of work.
Posted by mhquake at 11:14 PM (5 comments)
Thursday, March 15, 2012
Real Life...
It's good Real Life though, but it does mean that I'm detained/distracted/diverted.
As and when.
Posted by mhquake at 2:13 AM (3 comments)
Saturday, March 10, 2012
The Dreaded Clipping Bug
I posted about this in a comment a while back, but I think it may have gotten lost in the flurry of recent updates.
I'm reasonably certain that I know what is causing this to happen, but I need to get some info from anyone who's experiencing it in order to confirm.
So: if you're getting the clipping bug, could you post a reply here letting me know what your host_maxfps value is and what fps you're getting (just set scr_showfps 1 for this) at the time it happens. Also - does lowering the value of host_maxfps make it go away (try 100)?
Ta.
Posted by mhquake at 2:32 AM (22 comments)
Tuesday, March 6, 2012
1258 fps

This raises an interesting difference between NVIDIA and AMD hardware - NVIDIA seem to have a much more efficient anisotropic filtering implementation. On AMD on the other hand, even going to 4x filtering will cut your framerates by 25%.
Of course at this kind of speed it's essentially meaningless, but it is also totally consistent across everything I've tested so far.
Intel on the other hand are the strange child in the family - on one older machine I saw a small but significant performance increase going to 4x filtering (which was the highest it could take).
Resolved an odd timer glitch today; it transpires that at lower values of host_maxfps my old code was sometimes running server frames (and sending client data to the server) at half the rate it should - every 27.888 milliseconds (instead of every 13.999) - even worse, it was oscillating between the two at a quite irregular rate that was hardware-dependent. Ouch. Didn't see that one coming. Things are better now.
Posted by mhquake at 11:47 PM (9 comments)
Sunday, March 4, 2012
A Question of Fog
I'm starting to think about how I'm going to handle fog in DirectQ now that most everything else has been ported to D3D11. Previously I'd built a second copy of all shaders with fog toggled on in it, but with the enhanced capabilities of the new API there are a few more options available.
An obvious one is to just use some branching in a single copy of all my shaders. There is a slight extra overhead in the case where there is no fog, but things get a lot simpler as a result.
A second option is to capture the depth buffer as a texture and blend that over the scene, using a fog formula on it. It appeals as it means there is no overdraw case whatsoever (only pixels that are in the final scene get fogged) but falls down on account of sky being awkward to handle and some objects that should be fogged not writing to the depth buffer (shadows, particles, certain alpha surfaces).
Finally I could just say "fog is always on" and rely on a density of 0 to not show any in cases where it isn't.
I'm thinking of going with the last option ("fog is always on") initially and seeing how that works out; if it causes any problems I can fall back first to branching and then to a second copy. The appeal of this is that there is a clear and easy reversion path with minimal changes between each step. Ultimately I want to get away from having a second copy of all shaders, but I'll do it if I have to - it's just the final resort. I'll take the performance hit if it's small enough.
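The reason a density of 0 works as "off" falls straight out of the usual fog maths. Assuming exponential-squared fog (an assumption; the formula actually used isn't stated here), with density $\rho$, eye-space distance $z$, scene colour $c_{\text{scene}}$ and fog colour $c_{\text{fog}}$:

```latex
f = e^{-(\rho z)^2}, \qquad
c_{\text{out}} = f\,c_{\text{scene}} + (1 - f)\,c_{\text{fog}}
\\[4pt]
\rho = 0 \;\Rightarrow\; f = e^{0} = 1 \;\Rightarrow\; c_{\text{out}} = c_{\text{scene}}
```

So with the density at 0 the shader still pays for the exponential and the blend on every pixel, but the output is unchanged. That fixed cost is the performance hit being weighed up.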
The depth buffer approach has massive appeal to me but there are enough corner cases where it's going to break that it's not worth pursuing.
In other news, I recently got to test on some D3D10 (not 10.1, just 10) class hardware and the good news is that I'm now able to drop the entry level back to D3D10. It's still using the D3D11 API but is able to use feature levels to run on D3D10 hardware, which includes older NVIDIA and AMD/ATI cards as well as a huge range of Intels dating from the original introduction of Windows Vista - over 5 years ago, in other words. I think saying that "your graphics hardware must be not more than 5 years old" is not unreasonable.
Posted by mhquake at 2:58 PM (2 comments)
Friday, March 2, 2012
Another Milestone
Today I completed the last major part of the renderer (bounding boxes) and removed most of the old D3D9 interfaces from the source code. This also put me in a position where I could remove the D3D9 headers and libs from the code too, and get a clean compile without them. The engine is now fully D3D11-only.
This has also brought performance up another little bit as previously I was sending much of the drawing code through both sets of functions, but with the D3D9 stuff commented out or replaced with stub functions. This only really amounted to a coupla frames in heavy scenes, but it's nice to be cleaning things up.
The only real thing remaining in the renderer is fog. This should be quite trivial to handle, but there are a few options I'm chewing over on how to do so. I also need to add external texture loading to 2D GUI textures and review external textures in general.
Once that's done I'm going to be taking another few passes over the renderer code. Some of it is very clean indeed now (brush models, MDLs, particles) and some of it evolved in parallel with the old code and so needs a tidy-up (video startup is one example of this).
There are also a few features that disappeared from the renderer in recent versions that I'd like to add back. Software Quake mipmapping is one that's already been done, the enhanced particle system is another.
Following that I get to work more on an overall tune up, and maybe do some sound code. Exciting times ahead there.
Posted by mhquake at 1:54 AM (3 comments)
Thursday, March 1, 2012
1000 FPS

No, there's no special tricksiness happening here. It really is that fast.
timedemo bigass1 783 fps - although bear in mind that I haven't done playerskins yet so that can be expected to come down a little. (Update: did playerskins, 778).
Borderless windowed modes are going to have to go. This is nothing to do with personal choice - DXGI doesn't let you have them so you can't have them.
Posted by mhquake at 8:04 PM (0 comments)
Thursday 1st March 2012
Today I just got coronas ported. It was very easy and painless, but I had to step a bit carefully as I wanted to ensure that coronas could share as much code and objects with particles as was possible (with the setup I've got coronas are just a special case of particles).
It's starting to get close to the point where I can remove the D3D9 headers and libs from the project, which is an important milestone. Once I reach that I'll need to take another pass or two over the code, clean out legacy junk, restructure where required, and evaluate the next steps.
There are some important things to consider for moving forward from the simple port.
I don't make much use of D3D11 features. There are geometry shaders for sure, and one or two other things I couldn't have done in D3D9, but there is ample scope for more. I especially want to investigate the possibility of making brush surfaces into fully static data - currently I have a static vertex buffer but a dynamic index buffer. That's the only major chunk of geometry data that remains dynamic in the engine (aside from particles, sprites and other transient objects) and it constitutes such a large part of the draw for each frame that I feel it's worth shooting for.
Another major chunk of dynamic data is lightmaps (will I ever be finished agonizing over these?). One option I didn't mention the other day is to use a compute shader for updating dynamic lightmaps. That is at a very embryonic stage at the moment (I'm aware of the possibility but haven't even begun to work out the details; nor have I even written the "hello world" of compute shaders yet). It's interesting enough though.
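For anyone unfamiliar with the static vertex buffer plus dynamic index buffer arrangement, the sketch below shows the general idea: the vertices for every surface sit in one buffer that never changes, and each frame only the indexes for the visible surfaces are written out. The struct and function names are hypothetical, and treating each surface as a triangle fan is an assumption.

```c
#include <stdint.h>

/* Hypothetical surface record: where its vertices live in the static
   vertex buffer, and how many of them there are. */
typedef struct {
    int firstvertex;
    int numverts;
} surf_sketch_t;

/* Append triangle-list indexes for one polygon (treated as a fan around
   its first vertex) to the dynamic index data. Returns the number of
   indexes written, which is 3 * (numverts - 2) for a valid polygon. */
static int EmitSurfaceIndexes(const surf_sketch_t *s, uint32_t *out)
{
    int n = 0;

    for (int i = 2; i < s->numverts; i++) {
        out[n++] = (uint32_t)s->firstvertex;
        out[n++] = (uint32_t)(s->firstvertex + i - 1);
        out[n++] = (uint32_t)(s->firstvertex + i);
    }
    return n;
}
```

Making the data fully static would mean precomputing these indexes too and picking ranges to draw rather than rewriting them each frame.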
More things are beginning to take shape with other parts of the engine too, especially the long-promised sound code work, which I'll talk about later. I also want to talk some about the research I'm doing on interpolation code, and how and why the original QER interpolation tutorials are so badly broken.
Another day.
Posted by mhquake at 1:42 AM (7 comments)