It's still not working out for me, so I'm going to bite the bullet and accept a performance penalty. This has delayed further work for far too long, and I've spent too much time on reading and research with very little to show for it.
The current plan is to do it in hardware. I have two ideas: one is traditional hardware occlusion queries; the other is decidedly non-traditional. I'm interested in comparing the performance penalty of each, as I think the non-traditional approach leaves room to drastically reduce the impact, and potentially to outperform occlusion queries.
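For reference, the traditional approach would look roughly like this. This is a minimal sketch using the OpenGL 1.5 / ARB_occlusion_query API; drawBoundingBox() is a hypothetical helper that renders the entity's bounding volume.

```c
/* Minimal sketch of a traditional hardware occlusion query
   (OpenGL 1.5 / ARB_occlusion_query). drawBoundingBox() is a
   hypothetical helper that renders the entity's bounds. */
GLuint query;
GLuint sampleCount;

glGenQueries(1, &query);

/* Don't write the proxy geometry to the colour or depth buffers;
   we only want to count fragments that would pass the depth test. */
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glDepthMask(GL_FALSE);

glBeginQuery(GL_SAMPLES_PASSED, query);
drawBoundingBox(entity);
glEndQuery(GL_SAMPLES_PASSED);

glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_TRUE);

/* Retrieving the result stalls until the query completes;
   this latency is the cost I want to measure. */
glGetQueryObjectuiv(query, GL_QUERY_RESULT, &sampleCount);
if (sampleCount == 0) {
    /* Entity is occluded: skip drawing it. */
}
```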
I'm going to outline my non-traditional approach here, as it essentially uses the hardware to accomplish what I had originally intended to do in software.
- Render the entire scene into a small viewport. I think I can get away with something extremely small here, on the order of 128 x 128.
- This render is done with texturing, colour writes, and everything else that can be disabled switched off; all we're interested in is the depth information (see the first sketch after this list).
- Use glReadPixels to get the depth buffer back into software. This will implicitly flush the pipeline and stall during the read, but I'm fairly confident that at the point I do the readback the flush will have minimal impact, and the small viewport keeps the amount of data transferred low. I'm also interested in comparing multiple glReadPixels calls (e.g. reading back only the portions of the viewport each entity actually needs) against a single glReadPixels call for the whole viewport. The pipeline flush is going to happen either way, but the former method may shorten the stalls.
- Everything else happens in software: computing entity bounding boxes in view space and comparing them against the read-back depth information. I already have this code written; a rough illustration of the test follows below.
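Here's a minimal sketch of the depth pass and readback described above, assuming a fixed-function OpenGL context. renderScene() is a hypothetical function that issues the scene's draw calls; the 128 x 128 dimensions are the figure from the first step.

```c
/* Sketch of the small depth-only pass and the readback.
   renderScene() is a hypothetical function that issues the
   scene's draw calls; DEPTH_W/DEPTH_H are the small viewport. */
#define DEPTH_W 128
#define DEPTH_H 128

static GLfloat depthBuffer[DEPTH_W * DEPTH_H];

void depthPrePass(void)
{
    glViewport(0, 0, DEPTH_W, DEPTH_H);
    glClear(GL_DEPTH_BUFFER_BIT);

    /* Disable everything that doesn't affect depth output. */
    glDisable(GL_TEXTURE_2D);
    glDisable(GL_LIGHTING);
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    glEnable(GL_DEPTH_TEST);

    renderScene();

    /* Read the whole small depth buffer back; this flushes the
       pipeline and stalls until the data arrives. */
    glReadPixels(0, 0, DEPTH_W, DEPTH_H,
                 GL_DEPTH_COMPONENT, GL_FLOAT, depthBuffer);

    /* Restore state for the real render. */
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
}
```

The per-entity variant from the readback step would instead call glReadPixels once per entity with that entity's screen rectangle: the same flush, but less data per call.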
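And since I already have the software side written, I'll just show the rough shape of the test rather than the real thing. This is an illustrative sketch only, reusing depthBuffer and DEPTH_W from the sketch above; ScreenRect and the projection of the box into the small viewport's pixel space are assumed to exist.

```c
/* Rough illustration of the software visibility test. Assumes the
   entity's bounding box has already been projected into the small
   viewport's pixel space, giving a clamped screen rectangle plus
   the box's nearest depth (normalised 0..1, smaller = nearer,
   matching the glReadPixels output above). ScreenRect and the
   projection step are hypothetical. */
typedef struct {
    int x0, y0, x1, y1;   /* pixel bounds, clamped to the viewport */
    GLfloat minDepth;     /* nearest depth of the box's corners */
} ScreenRect;

int isVisible(const ScreenRect *r)
{
    int x, y;
    for (y = r->y0; y <= r->y1; y++) {
        for (x = r->x0; x <= r->x1; x++) {
            /* If the box's nearest point is at or in front of the
               scene at any covered pixel, the entity may be visible. */
            if (r->minDepth <= depthBuffer[y * DEPTH_W + x])
                return 1;
        }
    }
    return 0;   /* the scene is nearer at every covered pixel: occluded */
}
```

Using the box's single nearest depth at every covered pixel makes the test conservative: it can report a hidden entity as visible, but never the reverse, which is the safe direction for culling.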