Tuesday, April 24, 2012

Particles

I came down on OpenGL a little heavier than on D3D in the last post, so to redress the balance I have to say that OpenGL instancing is very nice indeed; much nicer than D3D.

Here's some code:

void R_DrawParticles (void)
{
 particle_t *p;

 if (!r_newrefdef.num_particles) return;

 GL_UseProgram (gl_particleprog);

 glUniformMatrix4fv (u_partmatrix, 1, GL_FALSE, r_mvpmatrix.m[0]);
 glUniform3fv (u_partrorigin, 1, r_origin);
 glUniform3fv (u_partvpn, 1, vpn);
 glUniform3fv (u_partvright, 1, vright);
 glUniform3fv (u_partvup, 1, vup);

 GL_Enable ((BLEND_BIT | DEPTHTEST_BIT) | (gl_cull->value ? CULLFACE_BIT : 0));

 GL_EnableVertexAttribArrays (VAA0 | VAA1 | VAA2);
 glBindBuffer (GL_ARRAY_BUFFER, gl_particledynamicvbo);

 // out of room in the buffer: orphan it (fresh storage, same size) and wrap back to the start
 if (r_firstparticle + r_newrefdef.num_particles >= MAX_PARTICLES)
 {
  glBufferData (GL_ARRAY_BUFFER, MAX_PARTICLES * sizeof (particle_t), NULL, GL_STREAM_DRAW);
  r_firstparticle = 0;
 }

 glVertexAttribPointer (0, 3, GL_FLOAT, GL_FALSE, sizeof (particle_t), (void *) (r_firstparticle * sizeof (particle_t)));
 glVertexAttribPointer (1, 4, GL_UNSIGNED_BYTE, GL_TRUE, sizeof (particle_t), (void *) (r_firstparticle * sizeof (particle_t) + 12));

 if ((p = glMapBufferRange (GL_ARRAY_BUFFER, r_firstparticle * sizeof (particle_t), r_newrefdef.num_particles * sizeof (particle_t), BUFFER_MAP_BITS)) == NULL)
 {
  ri.Con_Printf (PRINT_ALL, "R_DrawParticles : glMapBufferRange failed\n");
  return;
 }

 memcpy (p, r_newrefdef.particles, r_newrefdef.num_particles * sizeof (particle_t));
 glUnmapBuffer (GL_ARRAY_BUFFER);

 glBindBuffer (GL_ARRAY_BUFFER, gl_particletexcoordvbo);
 glVertexAttribPointer (2, 2, GL_FLOAT, GL_FALSE, 0, (void *) 0);

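 // attribs 0 and 1 (the particle data) advance once per instance; the texcoords in attrib 2 stay per-vertex
 glVertexAttribDivisor (0, 1);
 glVertexAttribDivisor (1, 1);

 // one 4-vertex fan per particle
 glDrawArraysInstanced (GL_TRIANGLE_FAN, 0, 4, r_newrefdef.num_particles);

 // restore per-vertex stepping so that other draw calls are unaffected
 glVertexAttribDivisor (0, 0);
 glVertexAttribDivisor (1, 0);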

 r_firstparticle += r_newrefdef.num_particles;
}
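
The r_firstparticle handling above is the usual discard/no-overwrite streaming pattern: append to the buffer while there is room, and orphan it and start again from zero when there isn't. Pulled out on its own, with no GL calls, the bookkeeping might look something like the sketch below. stream_t and Stream_Reserve are made-up names for illustration, not anything in the engine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not from the engine: tracks the write cursor of a
   streaming vertex buffer that holds `capacity` elements. */
typedef struct
{
 size_t capacity; /* total elements in the buffer (MAX_PARTICLES above) */
 size_t first;    /* next free element (r_firstparticle above) */
} stream_t;

/* Reserves `count` elements and returns the index of the first one.
   *discard is set to 1 when the caller should orphan the buffer
   (glBufferData with NULL) before mapping, and to 0 when it can append. */
size_t Stream_Reserve (stream_t *s, size_t count, int *discard)
{
 size_t offset;

 assert (count <= s->capacity);

 if (s->first + count >= s->capacity)
 {
  /* not enough room left: start over at the beginning of fresh storage */
  s->first = 0;
  *discard = 1;
 }
 else *discard = 0;

 offset = s->first;
 s->first += count;

 return offset;
}
```

The caller would multiply the returned index by the element size to get the byte offset for glMapBufferRange and glVertexAttribPointer, as R_DrawParticles does.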

Monday, April 23, 2012

API Insanity

I've just done multisampling in DirectQ 2.0.0, and after some brief hiccups, it works. However, the debug layer (D3D has a proper debug layer - isn't that awesome?) warns me that I'm trying to read from a multisampled texture via a standard Texture2D object in my render-to-texture shaders. That sucks - D3D9 was able to handle this automatically.

Some further investigation reveals that I need to use a Texture2DMS object instead (the "MS" stands for "multisampled", I guess), which needs the number of samples explicitly specified. Hold on a moment. Does that mean that I need to create a new version of each such shader for each multisample level?

Madness.

OpenGL Uniform Buffer Objects were added to my Quake II engine. The intention was to do much the same as for DirectQ constant buffers - upload a chunk of common values (the current MVP matrix, the current Ortho matrix for 2D stuff, player positioning info, etc) at the start of each frame and have them available to all shaders.

They run slower than plain uniforms; the OpenGL buffer object specification strikes again. No amount of glBufferData (...NULL...) or GL_MAP_UNSYNCHRONIZED_BIT can prevent them from stalling the pipeline. This is a pure API/Driver problem - it doesn't happen with constant buffers in D3D11 because D3D11_MAP_WRITE_DISCARD works as advertised; the hardware is clearly capable. Rip them out and go back to standard uniforms.
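
For reference, the kind of per-frame block I had in mind could be mirrored on the C side roughly as below. This is only a sketch: the field names are invented for illustration, and it assumes the GLSL block is declared with layout(std140), where a vec3 is aligned to 16 bytes and a float can sit in the slot left over after it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-frame block; field names are illustrative only.
   Assumed matching GLSL members, in a layout(std140) uniform block:
     mat4 mvpmatrix; mat4 orthomatrix; vec3 vieworigin; float time;
     vec3 viewforward; vec3 viewright; vec3 viewup; */
typedef struct
{
 float mvpmatrix[16];   /* offset 0: mat4, four 16-byte columns */
 float orthomatrix[16]; /* offset 64 */
 float vieworigin[3];   /* offset 128: vec3 is aligned to 16 bytes */
 float time;            /* offset 140: fills the spare slot after the vec3 */
 float viewforward[3];  /* offset 144 */
 float pad0;            /* explicit padding up to the next 16-byte boundary */
 float viewright[3];    /* offset 160 */
 float pad1;
 float viewup[3];       /* offset 176 */
 float pad2;            /* rounds the total size up to 192 bytes */
} perframe_t;

/* Returns 1 if the C struct matches the std140 offsets listed above. */
int PerFrame_LayoutMatchesStd140 (void)
{
 assert (sizeof (float) == 4);

 return offsetof (perframe_t, orthomatrix) == 64 &&
  offsetof (perframe_t, vieworigin) == 128 &&
  offsetof (perframe_t, time) == 140 &&
  offsetof (perframe_t, viewforward) == 144 &&
  offsetof (perframe_t, viewright) == 160 &&
  offsetof (perframe_t, viewup) == 176 &&
  sizeof (perframe_t) == 192;
}
```

With a struct like this the whole block goes up in one upload per frame, which is the attraction; the stall described above is a separate matter from getting the layout right.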

I can do "layout(location = 0) in vec4 position;" but I can't do "layout(location = 0) uniform mat4 localmatrix;"? Does the same person on the ARB who loves bind-to-modify also love glGetUniformLocation? Or are there two of them? (Yes, this was fixed in GL 4.3. Why was it not in GL 2.0? Sigh.)

Sunday, April 22, 2012

DirectQ Update - 22nd April 2012

Two important features have just made it into DirectQ.

The first is gamma-corrected mipmapping. This helps prevent lower mipmap levels from turning to sludge, and is applied to both native and external textures, with the exception of DDS files, where it is assumed that the texture author will provide a full mipchain in the DDS using a high-quality mipmapping algorithm.
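
To illustrate the idea (this is not the DirectQ code), here is a minimal sketch of a gamma-aware 2x2 box filter. It assumes a gamma of 2.0, so that decoding is a square and encoding is a square root, which keeps it free of the math library; a real implementation might well use 2.2 or the sRGB curve instead. All of the names are made up for the example.

```c
#include <assert.h>

/* Integer square root (floor), so the sketch needs no math library. */
static unsigned Mip_ISqrt (unsigned v)
{
 unsigned r = 0;

 while ((r + 1) * (r + 1) <= v) r++;

 return r;
}

/* Averages four 8-bit channel values in (approximately) linear space.
   Assumes gamma 2.0: decode is x * x, encode is sqrt (x). */
unsigned char Mip_GammaAverage (unsigned char a, unsigned char b, unsigned char c, unsigned char d)
{
 unsigned sum = (unsigned) a * a + (unsigned) b * b + (unsigned) c * c + (unsigned) d * d;

 return (unsigned char) Mip_ISqrt (sum / 4);
}

/* Halves a tightly packed RGBA image in both dimensions. Alpha is
   treated as linear and averaged directly with rounding. */
void Mip_Halve (const unsigned char *src, unsigned char *dst, int width, int height)
{
 int x, y, ch;
 int outw = width / 2, outh = height / 2;

 assert (width > 0 && height > 0 && !(width & 1) && !(height & 1));

 for (y = 0; y < outh; y++)
 {
  for (x = 0; x < outw; x++)
  {
   const unsigned char *p0 = src + ((y * 2) * width + x * 2) * 4; /* top row of the 2x2 block */
   const unsigned char *p1 = p0 + width * 4;                      /* bottom row */
   unsigned char *out = dst + (y * outw + x) * 4;

   for (ch = 0; ch < 3; ch++)
    out[ch] = Mip_GammaAverage (p0[ch], p0[ch + 4], p1[ch], p1[ch + 4]);

   out[3] = (unsigned char) ((p0[3] + p0[7] + p1[3] + p1[7] + 2) / 4);
  }
 }
}
```

Averaging two black and two white texels this way gives 180 rather than the 127 a plain average would, which is why the lower mip levels stop darkening into sludge.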

The second is in relation to skyboxes. I've dumped the cubemapping, gone back to six separate skybox textures and have written my own cubemap lookup function to sample from these. It's on the GPU rather than the CPU, of course, so it plays nice with static vertex buffers and all the rest.

The reason for the latter is connected to the way Quake content is occasionally provided. Sometimes skyboxes have missing components or different resolution components - an example might be where the mapper or modder provided a lower-resolution bottom face, or doesn't provide one at all, on the assumption that this face of the skybox will never be visible.

Previously with the cubemap it was necessary to enforce that all face textures be present, be square, and be the same size. I handled this by first loading the 6 face textures into temporary storage, then resizing them as necessary, flipping/rotating/mirroring them to orient properly for a cubemap, then building a cubemap properly from that. It worked but it was somewhat nasty.

That's all gone now and skyboxes are handled with full flexibility instead. It's still a cubemap, but the cubemapping is now done manually and none of the old restrictions apply.
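
The real lookup lives in a shader, but the logic is easy enough to show on the CPU. The sketch below picks a face from the major axis of the direction vector and produces 0..1 coordinates within it. It uses the standard OpenGL cubemap face order and orientation, which is an assumption for illustration only; Quake's skybox faces are named and oriented differently, and skylookup_t and Sky_Lookup are invented names.

```c
#include <assert.h>

/* Face order follows the OpenGL cubemap convention (+X, -X, +Y, -Y, +Z, -Z);
   this is an assumption, Quake's skybox naming and orientation differ. */
typedef struct
{
 int face;   /* 0..5, index into six separately loaded textures */
 float s, t; /* texture coordinates within that face, 0..1 */
} skylookup_t;

static float Sky_Abs (float f)
{
 return f < 0 ? -f : f;
}

/* Maps a non-zero direction vector to a face and coordinates on that face. */
skylookup_t Sky_Lookup (float x, float y, float z)
{
 skylookup_t r;
 float ax = Sky_Abs (x), ay = Sky_Abs (y), az = Sky_Abs (z);
 float sc, tc, ma;

 assert (ax > 0 || ay > 0 || az > 0);

 if (ax >= ay && ax >= az)
 {
  /* X is the major axis */
  r.face = x > 0 ? 0 : 1;
  sc = x > 0 ? -z : z;
  tc = -y;
  ma = ax;
 }
 else if (ay >= az)
 {
  /* Y is the major axis */
  r.face = y > 0 ? 2 : 3;
  sc = x;
  tc = y > 0 ? z : -z;
  ma = ay;
 }
 else
 {
  /* Z is the major axis */
  r.face = z > 0 ? 4 : 5;
  sc = z > 0 ? x : -x;
  tc = -y;
  ma = az;
 }

 /* project onto the face and remap from -1..1 to 0..1 */
 r.s = (sc / ma + 1.0f) * 0.5f;
 r.t = (tc / ma + 1.0f) * 0.5f;

 return r;
}
```

Because the result is a face index plus normalized coordinates, each face texture can be any size, and a missing face can be handled by whatever samples the result.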

The downside is that skybox drawing is now a mite slower than before. In practice this means that the big Marcher Fortress scene runs at about 870 fps rather than 890, so it's not too big a deal but it is there. In an ideal world all code would run fast, look good and be flexible. In the real world you don't always get to have all three, and this was one case where sacrificing some speed was worthwhile.
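
Frame rates at this level exaggerate small costs, so it helps to put the drop in terms of time per frame. A quick sketch of the arithmetic (the function names are just for illustration):

```c
#include <assert.h>

/* Converts a frame rate to the time spent on one frame, in milliseconds. */
double FrameTimeMs (double fps)
{
 assert (fps > 0.0);

 return 1000.0 / fps;
}

/* Per-frame cost, in milliseconds, of dropping from fps_before to fps_after. */
double FrameCostMs (double fps_before, double fps_after)
{
 return FrameTimeMs (fps_after) - FrameTimeMs (fps_before);
}
```

FrameCostMs (890.0, 870.0) comes out at roughly 0.026 milliseconds per frame.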

If it were possible in Quake to just draw the skybox as a large cube at infinite distance this wouldn't be much of an issue. I'd just use fairly standard texture lookups and everything would be clean and fast. Unfortunately there are places in id1 maps where sky is used to occlude other geometry, and combining that with the need to support sky surfaces on brush models means that a more robust and general purpose solution was needed.

One other thing that's been fixed up is video mode selection. D3D11 gives a different video mode list than D3D9 did, so unfortunately that means that the mode list changes again. The upside is that it's now fully working, with the ability to select refresh rates, centered or stretched modes, and different scanline ordering. Of course, if your hardware doesn't expose some specific modes then you won't get them, so don't go looking for me to add stretched modes if there are none in the list. Your hardware just doesn't support them, and I can no more give you such a mode from the engine than I can upgrade your RAM from it.

Not yet done is multisampling. In order to provide this in D3D11 I'll need to destroy and recreate the swap chain. Seems simple enough, but I'm playing it a little on the cautious side.

Friday, April 20, 2012

OpenGL 3.3 - The Rant

As might be known, I have an experimental OpenGL 3.3 Quake 2 engine that I've been doing some work on recently. Part of the motivation for that was to scratch an itch I had about how modern(ish) OpenGL compares to other APIs, and part of it was some fairly intense frustration at the older OpenGL API being used by RMQ.

This is basically an OpenGL 3.3 core profile, meaning that absolutely everything is being done with shaders and VBOs, and no deprecated functionality is used.

Regarding the API itself, it's quite nice to work with. I'd place it as better than D3D9 but not as good as D3D11 in my personal rankings; a lot of the insane stuff that was brought in with GL 1.5, 2.0 and 2.1 is just gone, and in its place are some good, common-sense features. Unfortunately some insane stuff remains, but I guess you can't have everything.

What's good is glMapBufferRange, which makes dynamic vertex buffers actually usable with the discard/no-overwrite pattern that D3D has had for the past 12-odd years (it's actually marginally nicer to use than D3D11's Map). Explicit vertex attribute binding locations are something that should have been in the first version of GLSL. Vertex Array Objects look good on paper, but I'm either misreading something in the spec or hitting a driver bug as I can't seem to get them working right. Texture arrays make lightmap handling an absolute breeze, allow for larger draw batch sizes, and remove a lot of messy CPU-side sorting. Primitive restart is nice to keep index buffer sizes down, but it's maybe a 50/50 wash with converting to triangles. I haven't done much in the way of geometry shader work yet, but I'm hoping to get something going with particles shortly.
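
On the primitive restart point, the index counts are easy to work out. A minimal sketch, assuming each surface is a convex polygon drawn either as one strip or fan followed by a single restart index, or as an indexed triangle list (the function names are invented for the example):

```c
#include <assert.h>

/* Indices needed to draw one convex polygon of numverts vertices as a
   strip or fan followed by a single restart index. */
int Indices_WithRestart (int numverts)
{
 assert (numverts >= 3);

 return numverts + 1;
}

/* Indices needed to draw the same polygon as an indexed triangle list. */
int Indices_AsTriangles (int numverts)
{
 assert (numverts >= 3);

 return (numverts - 2) * 3;
}
```

A lone triangle costs 4 indices with restart against 3 without, a quad is 5 against 6, and only larger polygons pull clearly ahead (7 against 12 for a hexagon). Index count is only one factor in draw performance, so treat this as the size side of the trade-off and nothing more.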

What's not so nice are the things that weren't fixed, and the things that are obviously first-cuts of a design-by-committee spec.

Whoever on the ARB is so in love with bind-to-modify needs to be taken out and shot. It's astonishing that this hasn't been removed from the API yet. GL 3.0 was the perfect opportunity, and they missed it.

Uniform Buffer Objects share one nasty characteristic with D3D9 occlusion queries - they look like they were designed by a pack of monkeys on LSD. I'm sure that there's a sane, usable API in there somewhere, but it's not coming through from reading the spec and looking at code samples.

Lack of explicit uniform binding locations is another missed opportunity. If UBOs were sane this wouldn't be such a big deal - I'd just use UBOs (as I do with constant buffers in D3D11) and be done with it. This is a step backwards from the old ARB assembly programs, and compares unfavourably with HLSL, which has always supported it (but not required it) in all of its incarnations.

Not being able to use the default depth buffer with FBOs - that just sucks. It wouldn't have hurt to include it as an option.

Despite all that, it is mostly fun to use, and I'm feeling quite productive writing code rather than feeling like I'm fighting against stupidity most of the time, which overall ranks it reasonably high. A good result.

Monday, April 16, 2012

One For The Ultra-Trads

There's something noble about using D3D11 to replicate 8-bit software rendering effects from 1996. This was really fast and easy to do, so why not?

Wednesday, April 11, 2012

Meet The Engine Crusher

What with one thing or another I haven't got much done recently, so here's another performance milestone that's worth noting.

[Screenshot: the ne_tower test scene]

The scene above (from ne_tower) is my standard "Engine Crusher" - chosen because it places huge stress on both MDL and BSP drawing, with about 6,000 brush surfaces and over 20,000 MDL triangles being drawn in it, and a substantial amount of overdraw. As good a test as any for determining how efficient you're going to be.

Again, this one demonstrates the leap forward that the D3D11 move is going to be. Here we're running it at just over 600 FPS on reasonably low-powered laptop hardware. By comparison, a more traditional GLQuake-style renderer barely scrapes 60 FPS, and the old D3D9 code could manage close to 300 on the same hardware. It's worth noting that the D3D9 code used some extra tricks - in particular hardware instancing - which are actually not used by the D3D11 code.

There are no dynamic lights or lightstyles in this scene, so what we're seeing here is pure polygon and pixel pushing power. An odd thing about the performance characteristics of this code is that it really likes it when you sneak up on it with something like this. More typical scenes certainly won't give you a 10x performance improvement over GLQuake-style code (or even 2x over the old D3D9 versions of DirectQ - although I have measured this magnitude of improvement in some other test scenes). This is the way things should be - the heavy guns come out when the going is tough.

I'm not certain at the moment if that kind of jump is specific to D3D11 itself or is down to the new code structure I was able to write (which wouldn't have been possible with D3D9) - most likely a combination of both.

I've a few other test scenes that I regularly use for benchmarking performance under stress, so I'll probably post some figures from them later.