Keep the comments coming in; they're providing a great source of information on what does or doesn't work. I don't always get the chance to reply directly to each, but all feedback and bug reports are definitely welcome and everything will be considered.
I've been playing a little with D3D11 instancing lately, and thinking over how it compares with both D3D9 and OpenGL. My general opinion is that D3D's separation of vertex layout from buffer and offset specification (which is present in both 9 and 11) is superior to OpenGL, but instancing is one case where the OpenGL method is cleaner and clearer.
My current thinking is that I'm going to use an instanced path for all MDLs and Sprites. I already have instancing written and working for sprites, but just drawing one sprite per-instance, so the batching element remains to be done.
Of course instancing has its own overhead, but this is a balancing act. You can switch between instanced and non-instanced drawing, but then you get the overhead of this switching to handle as well. You also need to consider where to set the cutoff point beyond which you switch to instancing. Keeping everything on the instanced path and hoping that the runtime and drivers handle it sensibly seems a better approach overall.
This may change of course depending on how benchmarks work out. :)
Particles are something else that may benefit from instancing, but I'm leaning more towards keeping my current geometry shader approach. There is a certain amount of extra calculation that needs to be done per-particle, and not using a GS means that it would need to be done 4 times (once per-vertex) rather than just once. It's worth noting that I use instancing for particles in my GL3.3 Quake II engine, but that still performs all of the gravity/velocity calculations on the CPU (DirectQ 2.0.0 does them on the GPU, which is where most of the extra calculation I mentioned comes from). DirectQ 1.9.x and earlier also used a form of instancing for particles, but it was a bit hacky (mostly to still support SM2 hardware).
I'm removing geometry shaders from the 2D GUI stuff and from Sprites (the latter already done). It's unfortunate that I'd let myself get so far down this route, particularly with the 2D stuff, as there is quite a job of work now involved in their removal, but it's the right decision. The initial use of geometry shaders for these was one of those "hey, I wonder if you can do this..." moments rather than being based on sound technical reasoning.
The filesystem code is likely to make a transition to using memory-mapped files, which should result in much improved loading times. I currently have a naive implementation of this written and working, which just creates a memory mapping, maps the file, allocates a buffer, copies the file into the buffer, then takes everything down. It's neither faster nor slower than using classic file I/O, so that's a good indication that improvements will come when I do it right. In particular I intend to leave PAK file handles open and memory mappings active for the lifetime of the engine (no, this doesn't use any additional memory; that only happens when you map a view) so a lot of extra setup and tear-down work involved in loading a file will go away.
These advantages will only apply to PAK files - PK3s are a little more complex owing to the need to go through unzipping code, and it's unrealistic to leave file handles open for raw filesystem files.
Some framerate-dependencies in some areas have been identified and removed. Generally these occurred in cases where a delta frametime was accumulated rather than an absolute difference between current time and a start time being used. One slightly embarrassing consequence of that was that bonus and damage flashes lasted a LOT shorter than they should have when you were running fast (and a lot longer if running slow). All fixed now.
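As a rough illustration of the "absolute difference" approach, here is a minimal C++ sketch. The `Flash` struct and `flash_remaining` function are hypothetical names made up for this example, not DirectQ's actual code; the point is only that the result depends on the current time and the start time, not on how many frames were rendered in between.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical names for illustration only; not taken from DirectQ.
struct Flash {
    double start_time;  // absolute time (seconds) at which the flash began
    double duration;    // how long the flash should last, in seconds
};

// Fraction of the flash still remaining at absolute time `now`, in [0, 1].
// Only (now - start_time) matters, so the answer is the same whether the
// engine rendered 10 frames or 10000 frames since the flash started.
double flash_remaining(const Flash& f, double now) {
    assert(f.duration > 0.0);
    const double elapsed = now - f.start_time;
    const double remaining = 1.0 - elapsed / f.duration;
    return std::max(0.0, std::min(1.0, remaining));
}
```

A caller would store `start_time` once when the bonus or damage flash is triggered and then call `flash_remaining` with the current engine time each frame.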
I'm considering removing the host_maxfps cvar. This is just a thought that occurred to me rather than something I'm set on doing, but a lot of things would get simpler if it went away. What would happen instead is that the engine would by default run flat-out, and various other methods of throttling framerate (if you don't want to run your GPU too fast, for example, or if you were running on battery) would be available.
These would include enabling vsync or using the new sys_sleep cvar to specify an amount of time to sleep for each frame. I've tested both of these on a netbook and they work great, with the engine still feeling incredibly smooth and responsive (and generating next to no extra heat!)
(Aside: one of the really nice things about D3D11 is that vsync is just a parameter to the Present call, so it doesn't require a mode change to enable/disable, and can be easily and automatically switched off when you're running a timedemo.)
By contrast, host_maxfps 72 will just spin in a tight loop, so it does little to save on either CPU usage or battery, and has nowhere near the same smoothness. Even vsync enabled with a refresh rate of 60 feels just as smooth as running at 1000fps.
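To make the difference concrete, here is a minimal C++ sketch of the sleep-per-frame idea. The `run_frames` function and its `sys_sleep_ms` parameter are hypothetical stand-ins for the cvar, written with the standard library rather than whatever the engine really calls; the only thing it shows is that the thread is handed back to the OS each frame instead of spinning.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical stand-in for a frame loop throttled by a sys_sleep-style
// setting. Runs `frames` dummy frames, sleeping `sys_sleep_ms` milliseconds
// in each one, and returns the total elapsed time in milliseconds.
long long run_frames(int frames, int sys_sleep_ms) {
    assert(frames >= 0 && sys_sleep_ms >= 0);
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < frames; ++i) {
        // ... game update and rendering would go here ...
        if (sys_sleep_ms > 0) {
            // Yield the CPU to the OS for this long rather than busy-waiting.
            std::this_thread::sleep_for(std::chrono::milliseconds(sys_sleep_ms));
        }
    }
    const auto elapsed = clock::now() - start;
    return std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count();
}
```

The actual sleep length depends on the OS timer resolution, so a request for 1ms may well sleep for longer; that is an assumption worth measuring on the target machine.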
Like I said, I'm still not totally set on this approach, so I'm open to comments and suggestions here. Any arguments in favour of host_maxfps versus vsync and/or sys_sleep?
More later.
Monday, May 7, 2012
Another Roundup
Posted by mhquake at 1:47 PM
9 comments:
If you do remove maxfps cvars, make sure you don't start flooding the network and spamming clients off the internet.
All QuakeWorld servers do all of their idling in the select() system/winsock call. This allows it to reply to clients faster, reducing latency with other players' movements.
With NQ, they only move when the server says, so there's no real advantage.
Either way, beware of drivers that implement vsync with a busyloop. One of them performance/power tradeoffs...
mmapping gets more problematic when you're mmapping 400mb at a time in a 32bit address space, where you have a 2gb cap, but for 64bit builds sure, why not, though I kinda suspect that you won't see a huge performance increase if only because quake reads the entire file in one go anyway (except for pak files where you'll get lots of random pagefaults instead, depending on which parts the system purged for more ram for that virus scanner or whatever).
Well, v-sync + 1000 fps smoothness sounds great to me! I don't see any need for my GPU to go into squeaky-noise-mode trying to render 4000+ fps while my screen can only display 60/85/100/120 whatever frames/sec ;)
But I would LOVE to take advantage of the 1000Hz mouse-polling rate to get the best possible responsiveness, so if you found a way to make that work, then WOOOT! :D
I'm sure you don't need this (Pushing the Limits of Windows: Virtual Memory by Mark Russinovich), but I'm just tossing it out there, as I personally found it quite enlightening, so maybe others reading this blog will as well..
I already only send at a max of 72fps so flooding the net isn't a concern. Been doing that for ages now.
sys_sleep is intended to work in conjunction with vsync; if you're capable of running frames in 1ms, but you're vsyncing to 60fps, you may as well sleep for some of the extra time. I've pushed it as high as 5ms sleep time per frame on the netbook without any noticeable degradation in smoothness.
On Windows memory mapping is done in two steps: CreateFileMapping just sets things up initially but doesn't use any of the process address space, MapViewOfFile actually does the memory mapping and does use address space. (UnmapViewOfFile gives back the address space when done, which will be the equivalent of closing the file.)
The intention is that the initial CreateFile and CreateFileMapping calls are called once only at startup, so there's no address space pressure from there. MapViewOfFile is called per-file, and only as much as is needed is mapped into the address space (rounded to page size granularity at either end of the mapping range). What that means is that a worst case of no more than 128k above the file size will be used. I don't think we need to worry about a temporary extra usage of 128k address space.
My own suspicion is that the primary bottleneck comes from not keeping the PAK files permanently open (which was what id Quake did and which was a step backwards on my part) so that's where most performance increase is going to come from. Memory mapping is just a convenience to prevent having to copy the file into a buffer from there.
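The rounding described above can be sketched in portable C++. Everything here is hypothetical (the `MapRange` struct, `compute_map_range`, and the idea of passing the granularity in as a parameter); on Windows the real value would come from GetSystemInfo, and 64k is only assumed in the usage example because it is what makes the 128k worst case work out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper types for illustration; not DirectQ's actual code.
struct MapRange {
    std::uint64_t view_offset;  // aligned offset at which the view starts
    std::uint64_t view_size;    // number of bytes covered by the view
    std::uint64_t data_delta;   // where the wanted file begins inside the view
};

// Given a file stored at `file_offset` inside a PAK with length `file_size`,
// round the start down and the end up to multiples of `granularity`.
// The extra address space used is always less than 2 * granularity.
MapRange compute_map_range(std::uint64_t file_offset,
                           std::uint64_t file_size,
                           std::uint64_t granularity) {
    assert(granularity > 0);
    const std::uint64_t start = (file_offset / granularity) * granularity;
    const std::uint64_t end = file_offset + file_size;
    const std::uint64_t end_rounded =
        ((end + granularity - 1) / granularity) * granularity;
    return {start, end_rounded - start, file_offset - start};
}
```

For example, with an assumed granularity of 65536, a 1000 byte file at offset 70000 gives a view starting at 65536 and covering 65536 bytes, with the file's data 4464 bytes into the view.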
Found some other possible bugs (I deleted previous comment because I discovered another one):
1) "Previous weapon" key does not work
2) Quick saves are named "quick_.sav"
3) Normally, when you pick up a power up, a countdown will appear on the status bar. But when you save and load the game, countdown is gone.
Good stuff, thanks.
No. 3 won't be fixable without changing the save game format I'm afraid, as the info required to restore the counter isn't stored in game save files.
But the game knows how much of the power up you have left upon loading a game, doesn't it? But if you say it isn't possible, I believe you. :)
I noticed another thing. When the countdown reaches zero, there is still a second left of the power up.
The QC knows it but the information isn't available to the client (and even SP games in Quake use the full client/server architecture). The workaround/hack I used to get that info on the client relies on other information that is available to the client but that isn't saved. So it can be fixed, but doing so would break compatibility with everything.
This is unrelated to your sleep+vsync thing, but have you heard of nVidia's new "Adaptive VSync" feature?
Here's an overview.
I imagine this wouldn't be an issue in DQ, since any new nVidia cards would be running DQ at 1000s of FPS, so vsync would never be disabled.