It's looking as though the next release of DirectQ is going to be the last one that can fully guarantee support for Windows XP.
The bottom line is that I'm soon going to be in a position where I just no longer have access to a Windows XP machine for testing. Without one I'm not really going to be able to fire up XP, run tests, debug, etc., and as a result I can't stand over DirectQ on XP any more.
What I do still have is an XP VM that I can use, but obviously - because it's a VM - it won't be in any way representative of a real physical machine. It does mean that I can and will continue to ensure that DirectQ will at least work on XP for as long as is reasonably possible, but I won't be able to guarantee matters such as performance.
Farewell, Old Shep.
Thursday, September 29, 2011
Posted by mhquake at 6:47 PM
Wednesday, September 28, 2011
I've been playing a little with the vsync code this evening.
One thing that annoys me slightly is that NVIDIA and Intel don't honour the D3DPRESENT_DONOTWAIT flag. This is a nuisance because if I could use it I'd be able to run the client and server independently of the renderer and retain smooth input response with vsync enabled.
Some fooling around with IDirect3DSwapChain9::GetRasterStatus suggests an alternative solution which I quickly hacked up, and blow me down but it seemed to work. Use it to check if it's appropriate to run a renderer frame yet, and if not, just don't bother running the renderer but let everything else proceed as normal.
The results are insane. DirectQ can easily hit over 30,000 frames per second (but most of those "frames" are ones where the renderer doesn't run) using this and everything is ultra smooth. It's kinda cheaty to add these to the FPS counter display though.
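Stripped of all the D3D plumbing, the decoupled loop works something like the sketch below. The raster-status check is stubbed out as a callback (in the real engine it would query IDirect3DSwapChain9::GetRasterStatus), and all the names here are my own illustration, not DirectQ's actual code:

```cpp
#include <functional>

// Hypothetical sketch of the decoupled frame loop: client and server frames
// always run, keeping input smooth, but a renderer frame only runs when the
// (stubbed) vsync/raster check says it's worth drawing one.
struct FrameCounters { int client = 0; int renderer = 0; };

FrameCounters RunFrames(int numHostFrames, const std::function<bool()>& rendererCanRun)
{
    FrameCounters c;
    for (int i = 0; i < numHostFrames; i++)
    {
        // client and server always proceed so input response stays smooth
        c.client++;

        // only run the renderer when the raster check allows it
        if (rendererCanRun())
            c.renderer++;
    }
    return c;
}
```

The point of the structure is that the "frames per second" count is really host frames, most of which never touch the renderer at all.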
One great thing about this is that it rather cruelly exposes timing problems in the renderer (and also confirms where I got things right). Particles for example are updated in the renderer, but the update is dependent on the amount of client frametime that has passed. If the renderer is running so much slower than the client code, then it seems as though particles don't even move at all. Pickup, damage, etc flashes also suffer from a similar problem. On the other hand water and sky animations, entity frames, general movement all seem spot-on.
This means that some further timer decoupling is still needed, but otherwise the initial results are promising. Don't expect to see it until I fully work through all of the timing stuff, and even if something else does happen to prevent this from ever being completed, it will be good to have sorted out the timing anyway.
An interesting fact: Presenting through IDirect3DSwapChain9 is somewhat faster than Presenting through IDirect3DDevice9 - I guess that IDirect3DDevice9::Present needs to do some additional work (like getting and releasing the implicit swap chain) each frame, whereas taking the more direct route avoids that.
I've also just realised that GPU lightmaps could be achieved with only 3 textures, not 4. Each texture contains the red, green or blue part of the lightmap, and stores the lightmap value for each of the four styles in each of its colour channels. The calculation then becomes a dot product of the texture lookup with the surface styles array for each destination colour channel. Plugging in the necessary shader code (I haven't done the lightmap setup yet) gives an impressive performance boost; still not as fast as the software lightmap updating code, but it's getting closer.
Posted by mhquake at 8:00 PM
I've been cleaning up some of the MDL code and managed to get a few extra percent speed out of it. Previously each MDL lived in its own individual vertex buffer; now they all live together in a single big buffer (there are actually two separate buffers - one for position/normal and the other for texcoords). A previous attempt at this resulted in some glitches with Nehahra (what else?) but after some further testing all seems well now.
All rendering has switched over to vertex buffers. A few releases back I had switched some of the rendering to system memory streams owing to a few bugs, but that's now resolved too.
The dreaded anti-aliasing.
I've enabled Direct3D multisampling. It works, as far as I've taken it so far, in both windowed and fullscreen modes. There are a few gotchas present - I need to go through parts of my code where I create additional render targets and make certain that their multisampling settings match what's used for the main backbuffer. The only big deal here is with the underwater warp texture, where I'm also going to need to switch to a new depth/stencil surface. It'll be interesting to see how that affects performance.
I have NO IDEA WHATSOEVER how this is going to interact with any AA settings you may force in your driver's control panel. If you do so, it's entirely at your own risk. All that I'll guarantee is that it will work if you leave your driver at "application controlled".
Video mode changing and game changing have gotten faster. The big delay here was reloading shaders; I guess that I was a mite paranoid and reloaded them fully on a change, but it turned out that there was no real need to.
I'll probably rant a bit more about multisampling tomorrow or the day after.
Posted by mhquake at 1:37 AM
Sunday, September 25, 2011
It occurs to me that I've talked a lot about the new HDR lighting mode, but I haven't actually shown you anything. Let's change that.
Here's a few shots of a pretty standard Quake rocket explosion. Nothing fancy, the kind of thing that happens all the time in regular ID1 Quake. I've tinted the explosion orange so that you can see the results better, especially in the second case.
First up we have overbrighting disabled. This is what GLQuake looks like. The light saturates to white all the way to the edges, where there are some small fringes.
Next we have standard 2x overbrighting; this is what most modified engines use (either as a default or available as an option). It's getting better, but notice the way the light starts to turn green towards the center. (We also lose a bit of precision.)
Finally we have HDR lighting. The green is gone and we have the full colour range all the way throughout the light radius. We could achieve similar with 4x overbrighting, but then we'd be losing two bits of precision.
The HDR method completely avoids any precision loss up to a value of 287, beyond which it scales down gracefully with no abrupt transitions, retaining better precision than either of the other two methods up to around 546 (beyond the point at which 2x overbrighting would have clamped).
Posted by mhquake at 4:14 PM
As originally posted by GB last year: http://kneedeepinthedoomed.wordpress.com/2010/11/07/orange-light-green-walls-far-out-experiences-with-colored-light/
Further work and experimentation has positively identified the criminal in this case as clamping.
Now, encoding the lightmap as a HDR texture format and decoding it in your fragment shader goes a long way towards resolving this engine-side - lights that get additively combined no longer have an upper limit and Mr Green is gone. Precision does fall off if you get really silly, but by then we're talking about colossal white glares on the wall anyway. The common case starts off the same as regular GLQuake's downshift by 7, with graceful linear falloff from there.
There's one other place that light gets clamped, and that's in the lighting tool. If a light has been clamped in the lighting tool then nothing engine-side can fix that - once you clamp a value to 255 you have no way of finding out if the original value was 256, 384, or 1,000,000,000.
This plays into the "LIT2" format I mentioned previously. No final decision has been made as to whether or not I'll even do a "LIT2" format - I haven't even discussed it with the RMQ guys yet - I'm just throwing the idea out publicly.
The idea is that "LIT2" won't clamp. There are two ways to go about this - use a higher-capacity format or encode the result to allow the original to be reconstructed.
If using a higher capacity format the obvious choice would be shorts. You could even use signed shorts and have negative light contributions from combined lightstyles (you'd need to watch your lower bound when building the lightmap in-engine though). The LIT file size would double.
Encoding involves another HDR format in the LIT file. It would go from 3 channel to 4 channel, with the 4th being used to decode the first 3 (the LIT file size increases by one third). That could be done at load time or at run time (load time is less memory efficient but faster, run time the other way around). There would again be some minor precision loss, but things can be done in the light tool to moderate this (using floating point instead of integer math, rounding to nearest, etc) that can't be done engine-side owing to perf loss pile-up.
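To make the encoding idea concrete, here's one hypothetical way that fourth channel could work - a shared per-sample scale that lets out-of-range values be approximately reconstructed. The struct name, rounding and scale quantization are all my own invention, not a LIT2 spec:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical "LIT2"-style sample: three colour bytes plus a fourth byte
// holding a shared integer scale. In-range values (max component <= 255) get
// a scale of 1 and round-trip exactly; brighter values trade a little
// precision for unclamped range (up to 255 * 255 here).
struct Lit2Sample { uint8_t r, g, b, scale; };

Lit2Sample EncodeLit2(int r, int g, int b)
{
    int maxc = std::max({r, g, b});
    // smallest integer scale that brings the largest component into a byte
    int scale = (maxc <= 255) ? 1 : (maxc + 254) / 255;
    auto q = [scale](int c) { return (uint8_t)((c + scale / 2) / scale); };
    return { q(r), q(g), q(b), (uint8_t)scale };
}

void DecodeLit2(const Lit2Sample& s, int& r, int& g, int& b)
{
    r = s.r * s.scale;
    g = s.g * s.scale;
    b = s.b * s.scale;
}
```

The decode could happen at load time (baking the scale back in) or at run time in the shader, exactly as per the memory/speed tradeoff above.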
An interesting side-effect would be that the way you light maps might need to change a little. Previously you could put a hugely bright light in the center of a room, confident that it would clamp, and light the room relatively evenly. Take away the effects of clamping and that changes - surfaces nearer the light would now get the huge brightness. A proper decoupling of light intensity from light radius may be required.
Whatever, this is still entirely in the realms of speculation and might not even happen at all.
Posted by mhquake at 1:31 PM
It feels slightly bad to be pulling back on something I'd said I was doing, but I guess it's better to do it now than to wait until after a release.
Lightmaps have changed yet again.
I'd brought everything to the stage where all I had to do was just add the menu options, but then realised that the whole thing had become a horrible mess. There were slightly weird bugs cropping up, performance was falling below my comfort zone, and to cap it all I wasn't really noticing any quality improvement to justify it.
I got the HDR texture format working too. I'd made the classic mistake of starting out by assuming that it was going to be more complex than it actually turned out to be, and was coding that level of complexity into it. I'd also been using floating point for many of the calculations, which was contributing to the speed losses I'd mentioned. Plus I was testing in quite an extreme case - an unvised map with heavy use of animating lightstyles. All 3 of those together soured me on the experience, and a quick rip-apart of the code, a rewrite using integer math, and some more representative tests quickly convinced me that it was good.
The HDR format is now used if r_overbright is 1, otherwise a standard 32-bit texture is used. This is necessary because even standard 2x overbrighting can be overflowed in very common cases in the basic game - a rocket explosion will overflow it. 4x overbrighting is an option, but you're starting to lose too much quality there. Using a HDR format instead preserves as much quality as possible throughout the entire range, with an essentially unlimited dynamic range.
I'd also been doing a lot of micro-optimizations in the lightmap updating code which were of dubious benefit - the real bottleneck is: does it need to stall the pipeline in order to update a texture? Once you can answer "no" to that the lightmap updating code as it was in GLQuake is actually quite fine. A few things needed changing for sure (some surfaces still get marked for a dynamic light but don't get their lightmaps updated at all by it, so there's viable perf gains there; one or two other things).
Those micro-optimizations had made the job a lot more difficult than it needed to be, so I took them down to the canal, put them in a sack with a few bricks, and waved goodbye to them.
I also back-loaded the flip from RGB to BGR to the very last moment - a shader swizzle. D3D has no native 32-bit RGB formats so BGR is required, but doing it this way means that I can just assume RGB is used throughout the lightmap creation and updating pipeline. This is much cleaner as it keeps all array indexing consistent, and keeps brush surfaces consistent with MDLs.
Things are much better with it all now.
All of this has gotten me semi-interested in the idea of creating a "LIT2" format for RMQ. I know that the light tool needs to clamp the lighting range too, and now that the HDR stuff is working it may be the case that the time is right to use the same concept for addressing this clamping. This is basically the first time I've given the idea any thought, so it's just purely speculative right now.
Posted by mhquake at 12:14 AM
Thursday, September 22, 2011
I've been tweaking the lightmaps a little more (nothing is ever truly "concluded"), and some changes are coming through.
The ability to switch off overbrights came back. Nothing really happened to make me back off on my prior hardline stance, I just decided to put it back on a whim. While being a purist in terms of your goals can be a good thing, it's also nice to give people options.
I did in fact think of one case where having the ability to switch off overbrights can be reasonably justified (on grounds other than the mapper screwing up and not testing their map in software Quake or a modern engine) and that's where the player might want higher precision at the lower end of the scale, but the better texture formats aren't available on their hardware.
To most of us, losing that bit of precision in standard R8G8B8 texture formats is no big deal - we can't even see it. But different things drive different people nuts, and I'm willing to bet that there are people out there for whom this is important. It's all about being nice to the player.
I've switched the default format to the 10-bit per channel format I mentioned previously. Some further testing indicates that 64-bit lightmaps drop performance slightly below my comfort level. 64-bit support is still there, but is now the second choice if the 10-bit per channel format isn't available.
I've added two new lightmap formats - 16-bit greyscale and 8-bit greyscale. These are final fallbacks if no 32-bit or 64-bit format can be found. It's not necessary to use them if you're one of those people who wants coloured light disabled (r_coloredlight 0 will greyscale RGB lightmaps automatically for you, no map-reload needed, very fast) but they're there and they might run a little faster if absolute speed is your aim.
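For the curious, the r_coloredlight 0 conversion could look something like the snippet below. The exact luma weights DirectQ uses are an assumption on my part - this is just the standard integer approximation:

```cpp
#include <cstdint>

// Possible per-texel greyscale conversion for r_coloredlight 0: an integer
// approximation of the standard luma weights (0.299 / 0.587 / 0.114),
// scaled so the three coefficients sum to exactly 256.
uint8_t GreyscaleLightmapTexel(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint8_t)((r * 77 + g * 151 + b * 28) >> 8);
}
```

Being a pure per-texel integer operation, it's cheap enough to run over the whole lightmap on the fly, which fits the "no map-reload needed, very fast" behaviour.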
I'm going to let the player choose which lightmap format to use. This will be selectable through the video options menu (or the r_lightmapformat cvar) and won't require a restart or map reload. Only lightmap formats which are supported by your hardware (detected at startup) will be available. Or you could set it to "unknown" and let DirectQ choose for you.
I've also been doing some work on sky (details to follow) which I haven't quite cracked yet, and tightening up a handful of other things.
I've bought a lightweight ultra-portable laptop (not quite a netbook but not too far off) for use when travelling. The cool thing about this is that it's got ATI graphics, so now I'm finally in a position to be able to cross-check OpenGL stuff on ATI myself rather than handing it over to the ATI folks on the RMQ team and hoping for the best (and occasionally having to get involved in long-distance troubleshooting - yuck).
After uninstalling all the shovelware that came with it (is it just me or is this getting worse?) I couldn't resist giving DirectQ a quick spin. After all, this was a fresh-out-of-the-box, totally clean machine, no extra software to rely on, and a great test of how well I've set DirectQ up for this kind of thing.
A quick fresh install of Quake (no configs) and DirectX upgrade (I'll still require that) and away I went!
It came up perfectly at 1366x768, widescreen-aspect corrected and all. Shader Model 3, all texture formats, hardware instancing, all available to be maxed out. Everything looked good, nice and crisp and high quality, no glaring bugs or glitches, beautiful image quality, everything worked the way it should.
The ATI Mobility Radeon thing obviously can't compete with the NVIDIA in my primary system, but it was more than adequate for good gameplay. timedemo demo1 got just under 300 FPS and framerates peaked around the 600 mark.
(I've commented before about how highly I rate the Intel 945, and it's interesting to note that performance of the 945 was very competitive with the ATI - maybe only 10% to 20% slower, and despite it being a much much older software T&L chip. Polycount and fillrate seem to be bottlenecks on the ATI and I suspect that in maps like Marcher the 945 would even beat it.)
The DirectX update was all that was needed. A happy result.
So I'm going to install a full development system on it. The cramped keyboard makes any serious work or play an experience that's something less than fun, but then again it's not meant to be used for that kind of thing. It will be nice to have an option like this for when I'm away from home and need my fix though.
Posted by mhquake at 11:31 PM
Wednesday, September 21, 2011
The culmination of my recent work on the lightmapping code has been reached, and it goes something like this.
The experimental HDR texture formats are gone. Too much messing, performance issues and quality tradeoffs involved in trying to get a better range out of 8/8/8/8 textures.
64-bit lightmaps are back in DirectQ. These are now the default mode and used if available. There is some speed loss incurred from using them, but on balance having an effectively unbounded range more than compensates. It also gives me some motivation to try to claw back the performance from elsewhere.
If 64-bit textures are not available on your hardware DirectQ will look for 10/10/10/2 texture support. This 32-bit mode is very widely supported, has roughly performance parity with standard textures, and gives 2 extra bits of precision to play with per-component (the alpha component is not used). What this means is that it can achieve 4x the dynamic range of standard GLQuake lighting without compromise (in practice there are a few places even in ID1 maps that exceed even this range, so we clamp if required).
If this format is not available (unlikely) the final fallback is to standard 8/8/8/8 textures (D3D only has one 8/8/8 format available and hardware support for it is poor so I don't even bother checking).
DirectQ is now overbrights only, and on all the time. I had thought about doing this a few times in the past, and the new texture formats made my mind up for me. The rationale for this is that software Quake was also overbrights-only (as a quick check of the colormap.lmp file will confirm), Quake II had overbrights, Q3A had overbrights. Why is GLQuake so lonely in the middle without overbrights? I'm contending here that overbrights are ID's intended look for all of these engines, and that GLQuake was nothing more than an experimental stepping-stone to Quake II. On account of that, any oddities in its behaviour - especially when they differ from software Quake - should be discarded.
Would someone say that heaving and buckling skies are part of Quake's intended look? What about tearing/shearing water surfaces? Or huge polygon cracks you could sail a Star Destroyer through when underwater? These are all things that were abandoned when ID finished GLQuake's development, and my argument is that lack of overbrights belong in that list too.
Some might say that if another few revisions of GLQuake had happened then these would all have been fixed. Others might say that these revisions did happen - it was called Quake II.
I'm retaining the ability to switch fullbrights off as there are too many released maps where the texture artist has screwed up. Yeah, I know that there are also places where the same could be said of overbrights, but that's something I'm going to take a stand on now.
Moving on, the ability to switch coloured light on or off has been restored. The old r_coloredlight cvar which has always been hanging around the engine now works again.
RMQ is - naturally - going to pick up on some of this stuff. I don't think I'll go with the 64-bit lightmap support (glTexSubImage2D is just too slow compared to D3D's LockRect/UnlockRect/AddDirtyRect stuff) but I will be implementing the 10/10/10/2 texture format, pending confirmation of some hardware support. I'll probably also leave the non-overbright mode available there.
Posted by mhquake at 7:01 PM
Tuesday, September 20, 2011
I've backed out the new lighting code.
Firstly, it exhibited some serious performance problems in scenes with lots of dynamic lights. What used to be 2 or 3 milliseconds per frame shot up to 20 or so milliseconds per frame, all in the renderer and on account of additional CPU load. That's not nice. I did some patented "quick 'n' dirty" workarounds (e.g. reverting to the old style when the light range was low enough) but overall I realised that I was patching up a flaky edifice and the time to stop was now.
Secondly, and more seriously, we encountered a driver/hardware bug on an older GeForce when running it under OpenGL. That's a big "ouch", especially because it was on someone else's hardware and that someone else was in a completely different country. Resolving that would have involved building test programs and shooting them back and forth over a period of up to a week. Now, I've done that before where I felt that the feature was important enough (cubemapped skyboxes not working on an ATI) but this time the benefit was dubious, all the more so because of the first problem.
Despite all of that I did have fun experimenting with it, and learned a lot in the process. Essentially "inventing" a HDR texture format (although I suspect that if you did enough research you'd find it was used before), working around bugs and edge cases and coming up with something that worked (reasonably) well (on most hardware) was a very interesting experience.
I think I may derive something longer term from this work for use in DirectQ, but - owing to the OpenGL GeForce bug - use in the RMQ engine is right out. It will probably end up being a simplified variant that just handles specific ranges.
Posted by mhquake at 11:18 PM
The new overbright lighting model can handle up to 9 times the dynamic range of GLQuake.
It preserves the same level of detail; we don't lose that pesky bit of precision (I could actually tweak it to gain an extra bit or two, but the extra dynamic range would go down to 5 or 3 times GLQuake).
It runs a little faster.
I think that's what we call a result.
Posted by mhquake at 12:07 AM
Monday, September 19, 2011
OK, I tried something with that extra 8 bits and it didn't work out.
Here's how it goes. I don't like the fact that overbright lighting loses 1 bit of precision compared to GLQuake's flattened lighting. I want to get that bit back. Now, it may be the case that it's only one bit, but the difference is there, it's quite subtle, and I can set up a test case that proves it.
In the past I've experimented with other methods (and rejected them for various reasons - lack of widespread hardware support, too slow, too awkward to work with, etc), but today it was using those extra 8 bits to store a "scaling factor". The theory goes something like this: if the blocklights value is less than 32768, bitshift by 7 and give it a scaling factor of 1; if it's less than 65536, bitshift by 8 and give it a scaling factor of 2; and so on. You can even go all the way down the scale to the other end too. In practice, because we're storing the scaling factor in a single byte, it's more convenient to make the baseline something like 64 (which will translate to roughly 0.25 in the driver) and then multiply it by 4.
So, in the shader when we're reading back the lightmap we have the scaling factor stored in the alpha channel, so all we need to do is multiply RGB by that scale and - in theory - out comes the lightmap at the correct brightness. (Like I said, we multiply by an extra 4 too, as we're using a baseline of 0.25ish rather than 1.)
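A CPU-side sketch of the scheme, with the baseline trick omitted for clarity (names and thresholds are my own illustration, not DirectQ's actual code):

```cpp
#include <cstdint>
#include <utility>

// Sketch of the scaling-factor experiment: pick the smallest shift that
// brings the blocklights value into byte range (starting from GLQuake's
// standard downshift of 7), and store the matching scale for the alpha
// channel. Returns { colour byte, scale }.
std::pair<uint8_t, uint8_t> EncodeScaled(int blocklights)
{
    int shift = 7;
    int scale = 1;
    while ((blocklights >> shift) > 255) { shift++; scale <<= 1; }
    return { (uint8_t)(blocklights >> shift), (uint8_t)scale };
}

// The shader-side reconstruction: colour * scale, shifted back up to the
// blocklights range for comparison purposes.
int DecodeScaled(uint8_t colour, uint8_t scale)
{
    return (int)colour * scale << 7;
}
```

Note that the quantization step doubles every time the scale factor goes up, which is exactly where abrupt transitions between neighbouring lightmap values can appear.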
That's where the test case that proves the difference lies. What actually happens is that you get some pretty awful banding on lightmaps where they transition between one scaling factor and another. It's very noticeable and very abrupt.
Aside from that it worked beautifully - it was quite breathtaking to see the full dynamic range of a rocket explosion going off.
An alternative that's probably worth pursuing is to use the extra byte as an additive factor instead. The way this would work is: if all of the colour components are above 255 (before clamping), subtract 255 from them and store 255 in the alpha channel. Then recomposite the lightmap in the shader. This could actually go one further again by storing a lower base value in the alpha channel and multiplying it before adding to RGB. Worth experimenting? I think it probably is.
Posted by mhquake at 10:26 PM
Earlier on today I implemented detail textures and water caustics in DirectQ. This is not something that's going to make it into a release, but it was an interesting exercise nonetheless.
Why not release it? There was a certain cost involved with my current setup that was unacceptable. I'll come to that later, but for now I'll just say that if not for that cost I likely would be releasing it, and that I'm going to keep the code anyway as it's possible that I may resolve it at some future time.
Detail textures were simple enough; just a noise texture modulated with the main texture. I played with using an RGB noise texture versus a greyscale one; the RGB texture - while it sounds horrible - actually turned out to look a great deal better at certain texture scales. I also did some experimenting with replicating some TexGen modes in the vertex shader to avoid having to send extra sets of texcoords - it worked reasonably well, but could have been better.
Caustics were interesting to do. I didn't go for shiny sparkly caustics like you sometimes see in techdemos, but instead did a moody shifting light effect; well suited to Quake's atmosphere. If I were to come back to this I'd probably do something with shifting the intensity of the actual lightmap textures rather than using a separate caustic texture.
The cost I mentioned was that I had to build separate combinations of shaders for caustics/fog/detail on and off, and startup (and vid_restart) time became prohibitive. In one extreme case DirectQ took a few minutes to start up - ouch. I experimented with lazy-loading the appropriate shader depending on the requirements of the current scene (and reusing it from a cache if already loaded) but that caused stalls of a few seconds the first time you went into water, for example.
One solution to this might be to precompile my shaders and ship binary versions of them with the engine, but that would require some changes to my build system and would make things awkward for debugging and iterative testing. There are probably other solutions.
In any event, having backtracked I decided to keep the lazy-load/shader cache system but instead of loading on demand I just load the two combinations I currently need up-front. It still causes a slight stall on startup and vid_restart (if you've ever wondered why DirectQ's vid_restart and game changing was a mite slow - that's the reason: shader loading is slow) but it's within acceptable tolerances.
One last thing I did some preliminary work on is lightmaps. It occurs to me that because I'm using 32-bit lightmaps (D3D textures are strictly 8, 16, 32, etc bit) I have 8 bits spare that I could do something useful or interesting with. I have one specific idea for that, so let's see what happens...
Posted by mhquake at 9:09 PM
I've bitten the bullet and begun some work on the new DirectSound port. Right now it's pretty basic; it can load a sound, play it (without any consideration of volume, attenuation, entity channels, etc), stop it when done, and unload it during shutdown.
The really nice thing about the choice of DirectSound is that it can painlessly coexist with the old sound system during changeover. All I have to do is check if a buffer has been allocated for the sound in question, and - if so - play from that instead of using the old system. Reverting back to the old system is just a matter of commenting out one line of code.
I can even do a release with a partial implementation. Just comment out that one line, recompile and release.
This makes the transition phase far cleaner, easier and smoother than it would have been otherwise.
I've used the phrase "march of the weenies" in RMQ-related discussions to describe a concept that may very well be technically (or ideologically) superior but is a total pain-in-the-butt to have to work with. Switching to any other audio API instead of DirectSound would have been a "march of the weenies" decision.
There is another "march of the weenies" thing coming up in DirectSound: it uses hundredths of a decibel for its volume scale. Because this is a logarithmic scale I need to do conversion from Quake's linear scale before I can do volume setting. I'm certain that someone at Microsoft took huge pride in being so technically correct over that, without it even entering their mind for one second that it might actually be awkward to use.
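For reference, DirectSound buffer volume runs from DSBVOLUME_MAX (0, full volume) down to DSBVOLUME_MIN (-10000, i.e. -100 dB, effectively silence), so the conversion goes something like this sketch (the clamping choices are mine):

```cpp
#include <algorithm>
#include <cmath>

// Convert Quake's linear 0..1 volume to DirectSound's hundredths-of-a-dB
// attenuation scale, suitable for IDirectSoundBuffer::SetVolume.
long LinearToDsVolume(float linear)
{
    if (linear <= 0.0f) return -10000;  // DSBVOLUME_MIN: silence
    if (linear >= 1.0f) return 0;       // DSBVOLUME_MAX: full volume

    // 20 * log10(v) gives decibels of attenuation; x100 for hundredths
    long db100 = (long)(2000.0f * log10f(linear));
    return std::max(db100, -10000L);
}
```

So linear 0.5 comes out at roughly -602 (about -6 dB), which is the "half as loud in voltage terms" you'd expect from a log scale.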
Oh well, on with the march...
Posted by mhquake at 12:26 AM
Sunday, September 18, 2011
I've dropped support for 8-bit sound from DirectQ; sounds are now upsampled to 16-bit at load time and all further sound processing is 16-bit all the way. I've also removed the upper limit (was 512) on the number of sounds that may be loaded.
This is something of a prelude to a more general cleanup of the sound code overall.
So, a few things have gotten cleaner as a result of this, but there was a (very) slight drop in performance and increase in memory usage. That's something I'm prepared to accept; I hope to address the performance drop over time, and memory is a cheap and plentiful resource that is there to be used.
When I say "increase in memory usage" by the way, I don't mean in the order of 10s or 100s of MB. It'll be a few MB, absolute max.
The 8-bit sound mode really only existed in Quake for low memory machines (and we're talking "low memory" relative to 1996 specs here) which - in some cases - would even resample 16-bit sounds down to 8-bit. Nasty. In 2011 we don't have to worry about this kind of thing.
What's interesting is that the overall quality of sound seems to have improved a little. Sounds are a little crisper, but without loss of any crucial low frequency ooomph.
This naturally brings me on to 44k sound, etc. DirectQ actually mixes sound at 44k, but the source sounds are still at the lower original rate. -sndspeed, -sspeed, etc are all present and you can use them to upsample the source sound rate, but the important thing to note with these options is that they actually decrease your sound quality quite considerably. You're not actually getting higher frequencies, you're getting smeared and pitch-shifted sound instead.
When I do move to a fully native DirectSound implementation I expect this to blow up a little. The entire mixer will be fully and natively running at 44k (or sometimes 48k) but an 11k source sound will remain that - 11k. Because with digital audio once you remove something that's it - it's gone forever. Those higher frequencies just cannot be brought back.
That should be both interesting and fun...
Posted by mhquake at 3:10 PM
Friday, September 16, 2011
I've now implemented the indexed triangle strip (with static vertex buffer and dynamic index buffer) option for hardware T&L devices. This caused me much pain and suffering to get working right, so here are some "notes from the field" on the entire topic.
As a general rule, hardware T&L devices are much much fussier than software T&L devices. Software T&L lets you get away with a lot of crazy stuff that under D3D causes hardware T&L to crash horribly (under OpenGL I assume it would punt you back to software emulation instead). That's the price you pay for speed.
A fast CPU just cannot make up the difference: I have an i7 and hardware T&L still comes out over 50% faster. There are possible optimizations you can make to software T&L - SIMD-friendly layouts, etc - but why bother doing so much extra work when hardware T&L is just so much faster?
The Intel 945 is still an astonishing performer; it easily matches my NVIDIA in software T&L. That's pretty impressive stuff. It doesn't do hardware T&L so I can't make a comparison there. The Intel 965 is still a piece of utter junk. The 945 can go up to 3 or 4 times faster than it, despite the 965 having hardware T&L.
A hardware T&L device really really really does not like it when you have unused regions in your index buffer. Everything turns into crazy-assed Picasso triangles if so. Make sure that all of your indexes are tightly packed. Mandatory.
In order to stay under my target of 64k vertexes (and therefore be able to use 16-bit indexes - some of my target hardware doesn't support 32-bit indexes in hardware: D3D crashes, OpenGL drops back to software emulation) I needed to partition the vertex buffer. I still kept one single big vertex buffer but updated the stream offset any time it needed to change. This lost a little performance, but not a great deal, and wouldn't be necessary in the individual draw calls case. Without partitioning it would be impossible to correctly draw a big map in most of the other cases, however.
Software T&L will happily let you overflow your index buffer without errors (even in PIX or the debug runtimes), and will even draw correctly. Hardware T&L will not do either. Unless you're counting the actual number of indexes you're going to need correctly, you run the risk of complete and utter havoc. Testing on both types of device is crucial, and D3D's "crash and burn horribly" is infinitely better than OpenGL's "punt you back to software emulation" behaviour in cases like this (the argument being that it helps you catch bugs earlier).
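The counting itself is simple enough to sketch. Quake world surfaces are convex polygons, and a convex polygon triangulates to a fan, so the arithmetic is fixed per surface. These are hypothetical helpers for illustration, not actual DirectQ code:

```cpp
#include <cassert>
#include <vector>

// A convex n-gon triangulated as a fan makes (n - 2) triangles,
// so it needs 3 * (n - 2) indexes as a triangle list.
int IndexesForSurface (int numverts)
{
    return 3 * (numverts - 2);
}

// Count the exact number of indexes a batch of surfaces will need
// BEFORE locking the dynamic index buffer - software T&L will let an
// overflow slide, hardware T&L will not.
int IndexesForBatch (const std::vector<int> &surfvertcounts)
{
    int total = 0;
    for (int n : surfvertcounts) total += IndexesForSurface (n);
    return total;
}
```

Run this over the batch first, size (or flush) the buffer accordingly, and the havoc never happens.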
The D3D documentation for the NumVertices parameter to DrawIndexedPrimitive is seriously misleading. It's not the "number of vertices used during this call"; it's actually the number of verts in the range from lowest used to highest used, inclusive (similar to OpenGL's glDrawRangeElements, in other words, but specifying the size of the range rather than the last vertex).
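In other words, computing the correct values yourself looks something like this (GetVertexRange is just an illustrative helper of mine, not a real D3D call):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// NumVertices for DrawIndexedPrimitive is the size of the used range,
// inclusive: (highest index used - lowest index used + 1). It is NOT
// the number of indexes and NOT the number of distinct vertexes.
struct vertexrange_t
{
    unsigned MinIndex;
    unsigned NumVertices;
};

vertexrange_t GetVertexRange (const std::vector<unsigned short> &indexes)
{
    unsigned lo = *std::min_element (indexes.begin (), indexes.end ());
    unsigned hi = *std::max_element (indexes.begin (), indexes.end ());
    return {lo, hi - lo + 1};
}
```

Pass MinIndex as MinVertexIndex and NumVertices as-is and both types of device stay happy.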
Hopping around a static vertex buffer at random on a software T&L device will kill performance. Even specifying the range is not good; a dynamic vertex buffer that avoids hopping is much much better.
One individual draw call per surface is OK on some hardware. It will still run slower but not by too much. On other hardware it's horrific, even if in a static vertex buffer. Definitely one to consider if your target hardware is right; it will save you a lot of work if so.
Joining unindexed triangle strips by adding degenerate tris is a decent performer but not the fastest. On some hardware individual draw calls can actually outperform it. Either way, it's most definitely not worth bothering with (it might be a good option to get some batching going in OpenGL's immediate mode though).
For indexed triangle strips versus indexed triangle lists, lists come out on top on all hardware I've tested, in both software and hardware T&L modes, irrespective of whether the vertex buffer was static or dynamic (the index buffer was always dynamic). In some cases an indexed list was up to 15% faster than an indexed strip. Even though an indexed list normally has more indexes (and consequently uses more bandwidth) and even though indexed strips are more cache-friendly, we can assume that there is a triangle setup cost with strips (I'm specifically thinking about flipping the winding order on alternate strips).
The conclusion is that on reasonably modern D3D9 or better class hardware, strips joined with degenerate tris are just not worth the extra effort. Even on an Intel integrated a simple indexed list setup will outperform them any time, and on some hardware even individual draw calls will outperform them. They may be a useful option on certain hardware, but they should definitely be benchmarked against lists before any decision is made.
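For reference, the index-based degenerate join described above can be sketched like so (JoinStrips is a hypothetical helper; the two repeated indexes between strips are where the degenerate triangles come from):

```cpp
#include <cassert>
#include <vector>

// Join strips into one indexed triangle strip by repeating the last
// index of the previous strip and the first index of the next one;
// the duplicated indexes generate zero-area (degenerate) triangles
// that the hardware rejects at setup.
std::vector<unsigned short> JoinStrips (const std::vector<std::vector<unsigned short>> &strips)
{
    std::vector<unsigned short> out;

    for (const auto &strip : strips)
    {
        if (!out.empty ())
        {
            out.push_back (out.back ());    // degenerate: repeat last of previous
            out.push_back (strip.front ()); // degenerate: repeat first of next
        }
        out.insert (out.end (), strip.begin (), strip.end ());
    }

    return out;
}
```

A quad is 4 indexes as a strip versus 6 as a list, but the 2 join indexes between surfaces claw most of that saving back - and on everything I tested the list still measured faster anyway.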
Posted by mhquake at 11:29 PM
It's a Quake BSP. The format is BSP2 but that's just standard Q1 BSP with some shorts changed to ints.
It's the Quake engine. It's a souped-up Quake engine designed to run this kind of thing fast, but at its heart there is still the same old Quake engine stuff.
See that little red banner just to the right of bottom-center? That's maybe 2 to 3 times the size of the player. That's the sense of scale going on here.
It's still unfinished of course, and the visual quality will get better as polish is added, but check out the completely insane detail. And scale. Did I mention scale?
And did I ever mention that our mappers are a bunch of utter lunatics? I have now.
Posted by mhquake at 8:37 PM
Thursday, September 15, 2011
I'm currently porting DirectQ's new surface refresh code to RMQ; in the end I decided that it was the right thing to do. While - as I've said - OpenGL performance has never been as good as D3D in my tests (and DirectQ pulls off a few extra tricks - like using hardware instancing - that aren't practical for RMQ), overall RMQ performance is now starting to pull close.
There are a few scenes in certain maps that I use as "test scenes" for this kind of thing. These involve huge polycounts with insane complexity that stresses various parts of the engine to quite some degree. In these I'm measuring RMQ performance of about 85% of what DirectQ is now getting (that's actually faster than the current release of DirectQ).
By way of comparison, prior to this DirectQ could go over twice as fast as RMQ in some of these scenes - at 1024x768 compared to the 800x600 I used to use for RMQ (I now use the same resolution for both).
In ID1 maps and timedemos they're now neck and neck; sometimes DirectQ goes faster, sometimes RMQ does. Previously RMQ performance was more like 75% or so of DirectQ.
I'm also experimenting somewhat with DirectQ's code as I go. The new setup makes it incredibly easy to plug in alternative surface refreshes (it's just one new function) and try out ideas to see what happens without disrupting much of anything else.
Currently I have 5 different refresh methods implemented. These give a good range of options that I can test to see which is the best for all cases (or which two are the best for software T&L and hardware T&L, even).
These are panning out something like:
Unbatched - one individual draw call per surface, using a static vertex buffer. This is nearly always the slowest, but on D3D10 class hardware it can almost match or sometimes even exceed the others. On some drivers big scenes may drop you to single digit framerates. Because it's so variable I don't think I'm going to be including this in a release.
Batched - static vertex buffer plus dynamic index buffer. On hardware T&L cards this is currently the fastest; with software T&L cards - because the driver has to do quite a lot of jumping around in the vertex buffer - performance can drop off a good bit.
Batched - dynamic vertex buffer plus dynamic index buffer. This might be a good choice for software T&L - there's no vertex buffer jumping involved and performance is good overall.
Batched - triangle strip with degenerate triangles plus dynamic vertex buffer. This one is curious; performance seems a lot higher than I had suspected it might be (I'd thought the extra vertex overhead would pull it down) and it might be comparable to the above for some hardware.
Batched - indexed triangle strip with degenerate triangles plus dynamic vertex buffer. As above but with degenerate triangles added via indexes instead of via vertexes; this one currently seems the fastest for software T&L cards.
I'm going to be able to allow a user-selectable mode here, so that you can tune it for your own hardware; there are slight complexities in that the vertex buffer needs to be set up slightly differently for software T&L and hardware T&L, so not all modes will be available for each (the first two are for hardware, the rest for software).
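As a rough sketch of what "it's just one new function" means in practice (all names here are made up for illustration, not actual DirectQ code), the selectable mode boils down to a simple dispatch:

```cpp
#include <cassert>
#include <cstring>

// hypothetical: each refresh method is one function with a common signature
typedef const char *(*surfrefresh_t) (void);

const char *R_RefreshUnbatched (void) { return "unbatched"; }
const char *R_RefreshStaticVB (void)  { return "static vb + dynamic ib"; }
const char *R_RefreshDynamicVB (void) { return "dynamic vb + dynamic ib"; }

// selected by a cvar; out-of-range values clamp to a safe default
surfrefresh_t R_SelectRefresh (int mode)
{
    static const surfrefresh_t modes[] = {R_RefreshUnbatched, R_RefreshStaticVB, R_RefreshDynamicVB};

    if (mode < 0 || mode > 2) mode = 1;
    return modes[mode];
}
```

Swapping refreshes is then just a matter of writing one more function and adding it to the table.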
Why all the fussing over software T&L? Surely this is 2011? Quite simple - a large part of DirectQ's original target hardware (integrated Intels) is software T&L only. I still want to run on these (I even test on one quite regularly) so it's important to have a good setup for them.
Posted by mhquake at 11:06 PM
Wednesday, September 14, 2011
You could be forgiven for thinking it, but I'm not making this up.
DirectQ just got a fairly significant speed boost in big maps/big scenes; something in the order of up to 50%. This came about from the new surface refresh code I mentioned before, which is now maturing nicely.
As with all other such boosts, this one is only noticeable when things get under stress. If you're the kind of person who only ever plays "Jesus Christ DM3 yesterday, today, tomorrow and FOREVERRRRRRRR" you are not likely to get any gain. Likewise benchmarking with ID1 timedemos won't show an appreciable difference. Run something like Warpspasm, ne_tower or Marcher in it and things really pick up.
I'm not certain how much of this would be translatable to RMQ; differences between how D3D and OpenGL handle draw calls and vertex buffers make it maybe 50/50 in my estimation as to whether it would have any effect at all. Performance-wise D3D clearly has the advantage and really pulls ahead once you get things well structured.
I'm thinking I'll port it anyway as it seems the right thing to do, but I remain wary as previous such porting efforts have in the past resulted in little or no gains (and occasionally even drops) in performance.
Posted by mhquake at 4:55 PM
Tuesday, September 13, 2011
RMQ's new BSP format is now complete in its first revision. We have working versions of QBSP, Light and Vis that output a BSP in the new format, and we have an engine that is capable of loading and running the format.
We're calling it "BSP2". This shouldn't be viewed as some kind of usurpation or arrogance on anyone's part - the name just fell out naturally in the course of informal discussion. An older working name was "BSP 29a" but I guess that "BSP2" is a little more snappy.
I've written up a "BSP2 for Engine Coders" document, which I guess I'll release publicly sometime soon, but I don't want to give any specifications just yet as the format has only been quite lightly tested (I've compiled ID1 e3m3 - because it's so damn awesome - quite a few times with it, and one of our mappers has compiled a map with over 64k verts with it), so it may be subject to minor future revision.
The fundamentals are in place and it seems quite solid though.
Here's an extract from the document containing some contextual info:
The objective of BSP2 is to address some limits in the stock BSP29 format. While many engines have boosted limits, there remain some cases where the on-disk format enforces a hard limit of 64k. During the course of RMQ development these cases were hit on a number of occasions; it was clear that we were exceeding the capabilities of the Quake formats and that a more robust and general solution was needed.
Alternatives such as Half-Life BSP, Quake II BSP, Quake III BSP or others were considered but rejected on grounds of legal issues, not actually doing anything to fix this problem, enforcing too many changes in the content creation pipeline (such as changing your map editor) and enforcing too many (and too complex) changes in engine code.
BSP2 will coexist peacefully with BSP29 so far as is possible; minimal changes are required to engine code (the largest by far being in the loader), the .map format remains exactly the same, and - while new map compiling tools are needed - the changes to these tools are also extremely minimal and can be made in the order of 30 minutes or so per tool.
BSP2 is not a radical revision of the format. You won't find RGB lightmaps, 32-bit textures or any such major overhaul here; these are left to other, already standardised ways of providing such content. Its sole objective is to overcome limits that prevent content creators from fully realising their visions. By choosing this approach it is hoped that the path to implementing BSP2 support is eased as much as possible for any engine whose author chooses to add it, and that dependencies on other features (such as RGB lightmaps) are not an obstacle.
The key points here are overcoming limits, minimal changes, easy and fast to implement, no hidden dependencies on other engine features, and peaceful coexistence. I think we achieved all of these, and I'm happy with the result (within the bounds of being happier if the problem had never had to be dealt with, of course).
Posted by mhquake at 5:19 PM
Monday, September 12, 2011
Two interesting updates today.
Firstly, RMQ's new BSP format is coming along. The QBSP compiler is almost done (just one item - unrelated to the format, actually - remains to be teased out) and Light and Vis are next. Engine support is complete.
The format is just a simple extension of Quake's BSP 29 to use ints instead of shorts in certain places, and the objective is to remove hard limits that are being constantly hit by the mappers. Creeping featurism is not being considered; specific targets with specific action taken to achieve them is the name of this game.
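To give a flavour of the kind of change involved (this is illustrative only - the authoritative layout is in the "BSP2 for Engine Coders" document - and the struct names just follow the public Quake source conventions), here's what widening an edge lump looks like:

```cpp
#include <cassert>

// The change is just widening 16-bit fields to 32 bits in the places
// where the on-disk format enforced a hard 64k cap.
typedef struct dedge29_s
{
    unsigned short v[2]; // vertex numbers; hard-capped at 64k verts
} dedge29_t;

typedef struct dedge2_s
{
    unsigned int v[2];   // vertex numbers; now good for ~4 billion
} dedge2_t;
```

The loader reads one struct or the other depending on the version in the header, converts to the same in-memory representation, and the rest of the engine never knows the difference.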
I've mentioned before that this wasn't a decision lightly arrived at, but when three maps in quick succession hit those limits it's clear that something needs to be done.
I've also mentioned other options that were considered and rejected - such as HL BSP (legal issues, does nothing for the limits, change your map editor), Q2 BSP (some nice features we actually need, does nothing for the limits, much engine work needed, change your map editor), Q3 BSP (nice features, much engine work, change your map editor), etc. One alternative I didn't mention was to extend the format by adding more BSP lumps.
I don't believe that's viable because it doesn't actually address the problem of what happens when you try to run such a map in an engine that doesn't support the extra lumps. The engine will just read the old, now-invalid data, and you get weirdness like not being able to move at all. To a player that's a bug, and it's most likely seen as an engine bug, which will cause trouble for other people. Not nice. At least a big old dirty "Invalid BSP version" error gives the player a clearer indication of what's really going on.
Overall there is no perfect solution, and I accept that other people may have their own personal preferences here. But I do need to stress once again that it wasn't arrived at lightly; I didn't just wake up one morning and decide to change the format for the heck of it. This was the product of long and thoughtful discussion among the team, with retention of one's preferred map editor, minimal engine changes, use of the same map format, and minimal disruption to the content creation pipeline all being given a high priority. In the end I think it's the "best of a bad lot" solution to an unfortunate problem.
The second item relates to DirectQ. I've been playing around with its surface refresh code on and off lately, with the objective of simplifying what had become something of a mess. It's not finalised yet, but the shape it's taking is interesting.
When I first started work on what became release 1.8.0 I made some changes to the MDL format (in memory, the on disk format is of course unchanged) and these saw my framerates shoot up dramatically. The changes resulted in my being able to draw each MDL with a single DrawIndexedPrimitive call, instead of using many many individual DrawPrimitive calls, and I thought "wow, I wonder if I can do something similar with brush models".
A lot of work since then has gone into doing just that, but recently I took a copy of the code, ripped it all apart, put all brush surfaces into one single big vertex buffer, and drew them with individual DrawPrimitive calls (RMQ does something similar these days).
It ran faster. Sigh.
OK, received wisdom is that "DrawPrimitive is slow, too many draw calls are bad, you need to batch stuff up for better performance, yadda yadda yadda", but for a Quake-like setup it's more of a balancing act. Yeah, you can batch stuff up, but there was lots of copying surfaces off into intermediate arrays, sorting, filling dynamic buffers, checking states and flushing, etc involved. This caused its own overhead, which looking back actually ended up more than wiping out the performance gains achieved by batching.
There's also the case that a lot of received wisdom is actually horribly outdated: it relates to earlier versions of technologies used on earlier versions of hardware, is overdue being updated, and may have been formed by prejudice, FUD or misunderstanding in the first place, yet it gets quoted as if it were Truth Down From The Top Of The Mountain by some people (who one gets the feeling have never actually sat down, written the code, and found things out for themselves). We need less of that.
So roundabout now I have two possible options. The new "unbatched" code turned out to be fairly simple to convert over to a "batched" version, which gave performance approximately equal to before but slightly slower than "unbatched" (it retains the big static vertex buffer but uses a dynamic index buffer and needs to do lots of state change checking and flushing). However, I've only tested both on one machine so far, and it may be the case that things will work out differently elsewhere. That's OK; it's also easy to convert back if needed, and it should even be possible to support both in the one codebase (I'd prefer not to - a single path keeps things cleaner).
Curious stuff indeed.
Posted by mhquake at 2:15 PM
With the Doom 3 source release hopefully coming closer, I've thought about it a little more and it's now looking likely that I will be doing something with the code. I haven't yet decided if this will be for public release or private use but it may make for some interesting discussion. There are a few things I want to try out in a more modern rendering environment than Quake allows, and Doom 3 seems like it might provide a nice platform (if nothing else the amount of work required to get it up to basic spec should be lessened).
Assuming that I do make the final decision, here's how I see it working out.
It will be Direct 3D. I haven't yet decided whether the version will be 9 or 11, as each has factors that would recommend it over the other (I know 9 better but there are some things in 11 that I'd like to play with).
Line for equivalent line Direct3D has always come out about 25% or so faster than OpenGL in any work I've ever done, so that should be good for integrated graphics chips (I have one that I played Doom 3 on once - it was a horrible experience).
Doom 3 uses OpenGL shaders, but the version it uses is the older ARB Assembly shading language. It should be relatively easy to write a simple parser to convert these to HLSL (the resulting code won't be pretty but it should work OK) and that's the first thing I want to try out. That necessarily limits use of the result to the original game (I can't recall what RoE used) but that's OK - it won't be a design goal to be all things to all people.
Because of the technology level it was aimed at it uses multiple passes to draw lit surfaces. There's one potential avenue for optimization - seeing how many of these we can collapse into a single pass. This should be easier if I normalize in a shader instead of using a cubemap, so I might also just hard-code the default game shaders into the engine. That rules out the shader parser/converter of course, so I'll probably end up flipping a coin on this one.
From my explorations so far of GL Intercept logs and shader code included with the game it seems as though there is a lot of work being done on the CPU that could be usefully moved to the GPU.
I want to explore some shadow mapping and try to get soft shadows into the engine - it's always been something of an ugly visual clash that some of the pre-baked shadows are soft-edged whereas the dynamic stencil shadows are hard-edged (which I loathe to begin with). Don't know how that one will turn out, and there are obvious performance tradeoffs to consider. I've never really done any serious work with shadows and it would be nice to get something good for Quake out of that too.
Seeing if anything can be done to improve load times is interesting too. They're just too long. I know that DDS textures are available with the game, and making some creative use of these could be fun.
Some general visual improvements and nips and tucks on things that annoy me about the game seem in order. Ideas for these will probably evolve organically.
So that's a bunch of stuff that seems cool and fun to explore, but no promises, no due dates, and it will have to take a lower priority to other work.
This should be in 10-foot high letters on everyone's wall ("anatomy of a mis-feature"):
...Ah well, someone might use the feature for something, and it's already finished, so no harm done, right? Wrong. [...] The addition of a feature early on caused other (more important) features to not be developed. [...] The important point is that the cost of adding a feature isn't just the time it takes to code it. The cost also includes the addition of an obstacle to future expansion.

It's also the reason why I sometimes remove features from DirectQ, or sometimes resist the addition of new ones.
Sure, any given feature list can be implemented, given enough coding time. But in addition to coming out late, you will usually wind up with a codebase that is so fragile that new ideas that should be dead-simple wind up taking longer and longer to work into the tangled existing web.
The trick is to pick the features that don't fight each other. The problem is that the feature that you pass on will always be SOMEONE's pet feature, and they will think you are cruel and uncaring, and say nasty things about you.
Posted by mhquake at 2:25 AM
Saturday, September 10, 2011
I've completed the slight diversion into GPU lightmap updating (I actually completed it a week ago but had to disappear for a short while) with some quite interesting results.
The overall verdict is that it's not going to be immediately viable but will definitely be something that it will be worth revisiting at some point in the future. It was also worth doing now to satisfy my own curiosity on the matter.
Animated lightstyles. The speed hit was initially a little too much but I was able to claw some back with use of some shader branching, and it now runs at some 80% to 90% of the speed of the old way in ID1 timedemos. On DirectQ's current target hardware you would need to do quite a bit of sorting and multiple passes to avoid the need for shader branching, which might (I don't know and right now I don't think it's worth investing the extra time to find out) wipe out any potential performance gains from not having to re-upload textures.
An interesting potential tradeoff emerged from this - if I was to drop use of coloured light then performance shoots up. In this case a lightmap animation is just one texture lookup and a dotproduct (with coloured light it's up to 4 texture lookups, shader branching, and some multiplies and adds).
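The monochrome trick is easy to sketch on the CPU: pack the 4 styles' lightmaps into the 4 channels of a single texture, put the 4 current style intensities in a vector, and the whole animation becomes one lookup plus a dot product. This is plain C++ standing in for the shader, with hypothetical names:

```cpp
#include <cassert>

// one texel with up to 4 monochrome lightmaps packed into its RGBA channels
struct texel4_t
{
    float s[4];
};

// the equivalent of the shader's single dot product: one texture lookup,
// then dot (texel, style intensities) - no branching, no extra lookups
float LightForTexel (const texel4_t &texel, const float styles[4])
{
    float l = 0.0f;
    for (int i = 0; i < 4; i++) l += texel.s[i] * styles[i];
    return l;
}
```

With coloured light each of those 4 channels becomes its own RGB texture, which is where the extra lookups and the branching come back in.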
Overbright lighting just came out naturally from it (in fact disabling overbrighting would reduce performance due to extra shader instructions), but it didn't do much to reduce the "stair step" effect.
For dynamic lights I just went for a fairly crude and brute force approach - assuming that every surface is potentially dynamically lit and using a "default dynamic light" with radius 0 for cases where a surface is not lit in reality. The intention here was to gauge the effect rather than to measure a speed gain - doing it right would involve more code restructuring than I was prepared to do right now.
Two options were available - evaluating dynamic lights per-vertex or per-pixel. Per-vertex was much faster for sure - almost "for free" in fact - but the quality drop off (even with smooth shading used) was just unacceptable. Per-pixel dropped performance to about one-third (expected but not to that degree) and looked beautiful, but then a strange thing happened.
A beautiful perfectly circular dynamic light just created a huge visual discontinuity with Quake's original jaggy low-res lightmaps. Kinda like when you mix high-res replacement textures with low-res originals - it just doesn't look right. And ultimately visual consistency is more important than increased prettiness in one area only, so the overall effect is actually detrimental to the look and feel of the game.
Regarding performance there are quite obvious things that can be done to bring dynamic lights up, but like I said before the code restructuring required was something that I was unwilling to undertake right now. If I ever come back to this idea that will probably change, but for now I'm happy enough to park it.
There are a few other things I need to follow up on, so more later.
Posted by mhquake at 11:52 PM
Friday, September 2, 2011
The DirectQ update will be happening later on today (maybe 5/6/7 hours time) - keep yer peepers peeled!
Following that I'm going to fork the code and do some experimental work on dynamic light updates. The idea of updating lightmaps entirely on the GPU has been nagging me for over a year now, and the only things holding me back have been an unwillingness to mess too much with working code, and some doubts about how to handle dynamic lights from muzzleflashes, explosions, etc.
I'm feeling that the time is right to tackle it.
I already know how to handle animating lightstyles, and there are some interesting tradeoffs involved. There's going to be a 4x increase in texture memory usage, 3 extra texture lookups per surface, and the "no animating lightstyles" case is going to have no performance difference whatsoever from the "lots of animating lightstyles" case. The size of a surface vertex will also get slightly (4 bytes) bigger. On the other hand it completely avoids needing to update lightmap textures at runtime, CPU/GPU sync issues go away, a lot of horrible CPU-side work goes away, and codepaths become cleaner.
It also enables mappers to go nuts with lots of animating lightstyles without having to worry about "is this going to slow down the scene?"
It's going to be a balancing act of course; will the gains balance the losses? Will there be a general increase or not? If there is a decrease will it be worth accepting it (even if it's only on code-cleanliness grounds - a small enough decrease might tip the balance there)? That's why it's a fork of the code rather than the main codebase, and that's why I'm not making any promises that it's ever going to make it into a released version. It is an itch I have to scratch though.
Regarding muzzleflashes/explosions, I have some ideas brewing that might or might not take shape. One of them involves dropping the max number of these lights to something like 64, stating that each surface can only be affected by the 4 closest (I might be able to get away with 2, but I think 1 is right out) and doing some shader voodoo from there. Too early to talk much more about these.
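The "4 closest" part is straightforward enough to sketch (hypothetical helper, not actual DirectQ code): sort the active lights by squared distance to the surface and keep the first few, so the shader can loop over a fixed-size light array.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct dlight_t
{
    float origin[3];
};

// keep only the maxlights dynamic lights closest to a surface;
// returns indexes into the lights array, nearest first
std::vector<int> ClosestLights (const dlight_t *lights, int numlights, const float surforigin[3], int maxlights)
{
    std::vector<int> order (numlights);
    for (int i = 0; i < numlights; i++) order[i] = i;

    // squared distance is fine for ordering - no need for a sqrt
    auto distsq = [&] (int i)
    {
        float d = 0.0f;
        for (int j = 0; j < 3; j++)
        {
            float delta = lights[i].origin[j] - surforigin[j];
            d += delta * delta;
        }
        return d;
    };

    std::sort (order.begin (), order.end (), [&] (int a, int b) { return distsq (a) < distsq (b); });

    if ((int) order.size () > maxlights) order.resize (maxlights);
    return order;
}
```

The shader voodoo on top of that is the part that's too early to talk about.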
I've mentioned before that RMQ maps make heavy use of animating lightstyles, so all of these are options for RMQ too, especially now that we're on a shaders-only path. I suspect that in the RMQ case we're going to see some performance improvement, although it's hard to tell at present as a lot of the content is suboptimal to begin with (un-vised maps, etc).
Posted by mhquake at 2:21 PM
Thursday, September 1, 2011
The DirectQ patch release will be out incredibly soon. I have been waiting on confirmation regarding a bugfix for lightmap corruption that one person reported, but it's not coming in a timely enough manner and nobody else seems to be experiencing this bug (or at least to have reported it) so I'm going to publish and be damned.
This is just intended as a patch release for the recent 1.8.8 so it will contain little in the way of major new features. It does contain some of the things I've been writing about recently (like the improved timer), and RMQ has also been getting these features (as well as some more of its own) so it's - once again - been great to cross-check in another engine.
It's interesting to watch how the same basic idea evolves in two different directions in the two engines; even things like elements of coding style and variable naming differ. I might write up something more detailed on that one sometime.
There are a few other things regarding lighting and entity handling that I want to talk a bit more about too, but I'll save them for later when I can go into the detail I think they need.
Posted by mhquake at 1:00 AM