Friday, November 24, 2017

Two-phase occlusion culling (part 1)

I really love rendering performance techniques that work with existing asset pipelines and don't need artists to tweak and configure scene objects. Occlusion culling is necessary for every engine that has to render interior scenes or outdoor scenes with a lot of foliage. Some existing techniques require the user to mark occluder objects explicitly, which is not sufficient for outdoor scenes where occludees are occluders too; other techniques are fast, but introduce sync points between CPU and GPU and therefore latency, and with it popping artifacts.

This post very briefly describes a super nice algorithm: two-phase occlusion culling. The technique has some strong advantages: it is blazingly fast, because it's GPU-only. Culling tens of thousands of objects, and even single instances, is possible. It introduces no latency, so no popping artifacts. And it needs no hand-tweaking and no distinction between occluders and occludees - every object can be both. The disadvantage is that the technique requires a fairly new GPU with support for indirect rendering.

The algorithm

We need some buffers first:
A source command buffer, containing n commands.
A target command buffer of size n.
A visibility buffer, containing n ints, an entry for each source command.
An atomic counter buffer.

All scene objects produce draw commands that are put into a buffer. Take a look at my other posts to see how to implement a lock-free multithreaded engine with unsynchronized buffers. The entity data is put into a buffer as well. Part of this data is an AABB, calculated on the CPU side. The CPU only needs to calculate the AABB and put the latest data into a buffer.
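That CPU-side AABB calculation is just a min/max reduction over an entity's vertex positions. Here's a minimal stand-in sketch (class and method names are mine, not the engine's actual code):

```java
import java.util.Arrays;

// Minimal AABB helper (hypothetical): computes the min/max corners
// over packed x,y,z vertex positions.
class Aabb {
    final float[] min = new float[3];
    final float[] max = new float[3];

    static Aabb fromVertices(float[] positions) { // packed x,y,z triplets
        Aabb box = new Aabb();
        Arrays.fill(box.min, Float.POSITIVE_INFINITY);
        Arrays.fill(box.max, Float.NEGATIVE_INFINITY);
        for (int i = 0; i < positions.length; i += 3) {
            for (int axis = 0; axis < 3; axis++) {
                box.min[axis] = Math.min(box.min[axis], positions[i + axis]);
                box.max[axis] = Math.max(box.max[axis], positions[i + axis]);
            }
        }
        return box;
    }
}
```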
The GPU side now performs the following steps for rendering: one thread per draw command is executed. The corresponding entity data is fetched from the entity buffer, and its AABB is culled against the hierarchical depth buffer of the last frame (or an empty one on the very first frame, which doesn't matter). If the AABB is potentially visible, the thread writes visibilityBuffer[index] = 1. Now another shader is executed with n threads. Every thread looks at its visibilityBuffer entry; if it contains a 1, the entity is visible, so the source command is fetched and written to the target command buffer, at an index obtained by an atomicAdd on the atomic counter buffer. Afterwards the atomic counter buffer contains the draw count, while the target command buffer contains exactly the commands we need to render. Rendering is done with indirect rendering, so there is no readback to the CPU at all. Then the hi-z buffer is updated, and the procedure is repeated, but this time only the objects marked invisible in the first phase are tested against the updated depth buffer. Objects that now turn out visible were falsely culled and have to be rendered in a second pass. So all the algorithm needs is two indirect draw calls, plus the culling and buffer-compaction steps.
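The command-compaction pass can be illustrated with a CPU-side sketch. It stands in for the n GPU threads with a sequential loop (so there is no real parallelism here), and the names are mine, not the engine's:

```java
import java.util.concurrent.atomic.AtomicInteger;

// CPU-side sketch of the GPU compaction pass: visible source commands
// are packed densely into the target buffer via an atomic counter.
class CullingSketch {
    // visibility[] is assumed to have been written by the culling pass.
    // Returns the draw count that the indirect draw call would use.
    static int compact(int[] sourceCommands, int[] visibility, int[] targetCommands) {
        AtomicInteger drawCount = new AtomicInteger(0); // the atomic counter buffer
        for (int i = 0; i < sourceCommands.length; i++) { // one "thread" per command
            if (visibility[i] == 1) {
                int indexAfter = drawCount.getAndIncrement(); // atomicAdd
                targetCommands[indexAfter] = sourceCommands[i];
            }
        }
        return drawCount.get();
    }
}
```

On the GPU the pack order is nondeterministic because the threads race on the counter, but that's fine: only the set of surviving commands matters, not their order.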

Here's an example of a scene where 7-8 cars, each with roughly a million vertices, are rendered. When they are behind a plane, they are occluded and therefore no draw commands are submitted, and the framerate goes through the roof.




Implementation

Normally, you are already rendering depth somehow, at least in the regular depth buffer. Here's the most generic compute shader way to create a highz buffer with OpenGL:

 public static void renderHighZMap(int baseDepthTexture, int baseWidth, int baseHeight, int highZTexture, ComputeShaderProgram highZProgram) {
     highZProgram.use();
     int lastWidth = baseWidth;
     int lastHeight = baseHeight;
     int currentWidth = lastWidth / 2;
     int currentHeight = lastHeight / 2;
     int mipMapCount = Util.calculateMipMapCount(currentWidth, currentHeight);
     for (int mipmapTarget = 0; mipmapTarget < mipMapCount; mipmapTarget++) {
         highZProgram.setUniform("width", currentWidth);
         highZProgram.setUniform("height", currentHeight);
         highZProgram.setUniform("lastWidth", lastWidth);
         highZProgram.setUniform("lastHeight", lastHeight);
         highZProgram.setUniform("mipmapTarget", mipmapTarget);
         // mip 0 of the hi-z chain is reduced from the regular depth buffer,
         // every following mip is reduced from the previous hi-z mip
         if (mipmapTarget == 0) {
             GraphicsContext.getInstance().bindTexture(0, TEXTURE_2D, baseDepthTexture);
         } else {
             GraphicsContext.getInstance().bindTexture(0, TEXTURE_2D, highZTexture);
         }
         GraphicsContext.getInstance().bindImageTexture(1, highZTexture, mipmapTarget, false, 0, GL15.GL_READ_WRITE, HIGHZ_FORMAT);
         int num_groups_x = Math.max(1, (currentWidth + 7) / 8);
         int num_groups_y = Math.max(1, (currentHeight + 7) / 8);
         highZProgram.dispatchCompute(num_groups_x, num_groups_y, 1);
         lastWidth = currentWidth;
         lastHeight = currentHeight;
         currentWidth /= 2;
         currentHeight /= 2;
         glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);
     }
 }
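Util.calculateMipMapCount isn't shown here; presumably it is the usual full-mip-chain count, i.e. floor(log2(max(width, height))) + 1. A sketch of that assumption (the helper name mirrors the call above, but this is my guess, not the engine's code):

```java
// Hedged guess at Util.calculateMipMapCount: the number of mip levels
// in a full chain for a texture of the given size.
class MipUtil {
    static int calculateMipMapCount(int width, int height) {
        int largest = Math.max(width, height);
        // floor(log2(largest)) + 1, via integer bit counting (avoids float log rounding)
        return 32 - Integer.numberOfLeadingZeros(largest);
    }
}
```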

And the corresponding compute shader:

 layout(binding = 0) uniform sampler2D sourceTexture;
 layout(binding = 1, r32f) uniform image2D targetImage;

 #define WORK_GROUP_SIZE 8
 layout(local_size_x = WORK_GROUP_SIZE, local_size_y = WORK_GROUP_SIZE) in;

 uniform int width = 0;
 uniform int height = 0;
 uniform int lastWidth = 0;
 uniform int lastHeight = 0;
 uniform int mipmapTarget = 0;

 // fetches a 2x2 footprint from the red channel of the given mip level
 vec4 getMaxR(sampler2D sampler, ivec2 baseCoords, int mipLevelToSampleFrom) {
     vec4 fineZ;
     fineZ.x = texelFetch(sampler, baseCoords, mipLevelToSampleFrom).r;
     fineZ.y = texelFetch(sampler, baseCoords + ivec2(1,0), mipLevelToSampleFrom).r;
     fineZ.z = texelFetch(sampler, baseCoords + ivec2(1,1), mipLevelToSampleFrom).r;
     fineZ.w = texelFetch(sampler, baseCoords + ivec2(0,1), mipLevelToSampleFrom).r;
     return fineZ;
 }
 // same, but from the green channel (the base depth texture stores depth there)
 vec4 getMaxG(sampler2D sampler, ivec2 baseCoords, int mipLevelToSampleFrom) {
     vec4 fineZ;
     fineZ.x = texelFetch(sampler, baseCoords, mipLevelToSampleFrom).g;
     fineZ.y = texelFetch(sampler, baseCoords + ivec2(1,0), mipLevelToSampleFrom).g;
     fineZ.z = texelFetch(sampler, baseCoords + ivec2(1,1), mipLevelToSampleFrom).g;
     fineZ.w = texelFetch(sampler, baseCoords + ivec2(0,1), mipLevelToSampleFrom).g;
     return fineZ;
 }
 void main(){
     ivec2 pixelPos = ivec2(gl_GlobalInvocationID.xy);
     ivec2 samplePos = 2*pixelPos;//+1; // TODO: Fix this
     // for target mip 0 the source is the base depth texture at level 0
     int mipmapSource = max(mipmapTarget - 1, 0);
     vec4 total;
     if(mipmapTarget == 0) {
         total = getMaxG(sourceTexture, samplePos, 0);
     } else {
         total = getMaxR(sourceTexture, samplePos, mipmapSource);
     }
     float maximum = max(max(total.x, total.y), max(total.z, total.w));

     // Thank you!
     // http://rastergrid.com/blog/2010/10/hierarchical-z-map-based-occlusion-culling/
     vec3 extra;
     // if we are reducing an odd-width texture then fetch the edge texels
     if ( ( (lastWidth % 2) != 0 ) && ( pixelPos.x == lastWidth-3 ) ) {
         // if both edges are odd, fetch the top-left corner texel
         if ( ( (lastHeight % 2) != 0 ) && ( pixelPos.y == lastHeight-3 ) ) {
             extra.z = texelFetch(sourceTexture, samplePos + ivec2( 1, 1), mipmapSource).x;
             maximum = max( maximum, extra.z );
         }
         extra.x = texelFetch( sourceTexture, samplePos + ivec2( 1, 0), mipmapSource).x;
         extra.y = texelFetch( sourceTexture, samplePos + ivec2( 1,-1), mipmapSource).x;
         maximum = max( maximum, max( extra.x, extra.y ) );
     } else if ( ( (lastHeight % 2) != 0 ) && ( pixelPos.y == lastHeight-3 ) ) {
         // if we are reducing an odd-height texture then fetch the edge texels
         extra.x = texelFetch( sourceTexture, samplePos + ivec2( 0, 1), mipmapSource).x;
         extra.y = texelFetch( sourceTexture, samplePos + ivec2(-1, 1), mipmapSource).x;
         maximum = max( maximum, max( extra.x, extra.y ) );
     }
     imageStore(targetImage, pixelPos, vec4(maximum));
 }

The culling shader code can be found on rastergrid, where you can also find a lot of information about hi-z culling in general. I experienced some strange bugs with compute shaders that seemed to be executed twice on my machine. Since I wasn't able to get rid of them, I used a vertex shader trick to execute arbitrary kernels on the GPU: bind some dummy vertex buffer (required by the specification) and call glDrawArrays with (commands.size + 2) / 3 * 3 vertices (we need a multiple of 3 here). No fragment shader is needed for the shader program. In the vertex shader, gl_VertexID can be used as the invocation index. The following shader code copies draw commands from visible entities to the target buffer. It's just an extract, but you get the idea:

 [...]  
 layout(std430, binding=9) buffer _visibility {  
      int visibility[1000];  
 };  
 uniform int maxDrawCommands;  
 void main()  
 {  
   int indexBefore = gl_VertexID; // gl_VertexID is a signed int
   if(indexBefore < maxDrawCommands) {
     DrawCommand sourceCommand = drawCommandsSource[indexBefore];  
     bool visible = visibility[indexBefore] == 1;  
     if(visible)  
     {  
       uint indexAfter = atomicAdd(drawCount, 1);  
       drawCommandsTarget[indexAfter] = sourceCommand;  
     }  
   }  
 }  
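The (commands.size + 2) / 3 * 3 expression used for the glDrawArrays count simply rounds the kernel count up to the next multiple of 3 via integer division, so the dummy draw can be issued as complete triangles:

```java
// Round an invocation count up to a multiple of 3 (integer division
// truncates, so adding 2 first rounds up instead of down).
class KernelDispatch {
    static int vertexCountFor(int commandCount) {
        return (commandCount + 2) / 3 * 3;
    }
}
```

The few excess invocations simply fail the indexBefore < maxDrawCommands check and do nothing.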

There is one aspect still missing: with instanced rendering, one has to introduce an offset buffer, where every entry gives the offset into an instanced entity buffer for a draw command. My current implementation has an AABB for an instance cluster, grouping several instances into a draw command that can be culled. The next post will hopefully show how this technique can be extended to cull single instances, in order to have perfect culling, for example, for thousands of foliage objects.
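Building that offset buffer amounts to an exclusive prefix sum over the per-command instance counts. A hypothetical CPU-side sketch (in practice it would be filled while building the command buffer, or computed on the GPU):

```java
// Hypothetical sketch: for each draw command, the offset of its first
// instance in the shared instanced-entity buffer (exclusive prefix sum).
class InstanceOffsets {
    static int[] buildOffsets(int[] instanceCountPerCommand) {
        int[] offsets = new int[instanceCountPerCommand.length];
        int running = 0;
        for (int i = 0; i < instanceCountPerCommand.length; i++) {
            offsets[i] = running;            // first instance of command i
            running += instanceCountPerCommand[i];
        }
        return offsets;
    }
}
```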

Costs of Kotlin Delegation

If there is one thing I really, really like about Kotlin, it's the first-class citizenship of delegation throughout the whole language. It provides a simple and concise replacement for inheritance. The nice thing here is that abstract properties can be used in an interface while the state remains in the implementing class, with a very low amount of boilerplate, magic and chance of bullshit happening. Or an interface can be implemented by giving a delegating instance, and that's it. Here's a short snippet to show what I am talking about (reduced example):

class Delegation(val myImpl: MyInterfaceImpl) : MyInterface by myImpl

Since this can be done in Java as well, with default methods in interfaces and IDE code generation, what really sells it is the reduced amount of boilerplate needed to achieve it. And it's only worth it if it can be expressed in readable, efficient code.
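For comparison, here is roughly the boilerplate that single Kotlin line replaces, written out by hand in Java (class and method names are just for illustration, not my actual benchmark classes):

```java
interface MyInterface {
    String getString();
}

class MyInterfaceImpl implements MyInterface {
    public String getString() { return "hello"; }
}

// Manual delegation: exactly the forwarding method that
// "class Delegation(val myImpl: MyInterfaceImpl) : MyInterface by myImpl"
// generates for you in Kotlin.
class DelegationJava implements MyInterface {
    private final MyInterfaceImpl impl;

    DelegationJava(MyInterfaceImpl impl) { this.impl = impl; }

    public String getString() { return impl.getString(); }
}
```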

And the last one is a very important criterion: efficiency. I don't want to make my JVM code slower than it already is. We're talking about games - if another language is slower, then the code I can produce with Java is the better compromise for me. If I imagine using delegation over inheritance with Kotlin throughout my codebase, in every hierarchy... will it slow down my engine noticeably? I ran benchmarks of a directly implemented field, an inherited field and a delegated field, in both Java and Kotlin. I expected a field to be faster than a property in general, and a delegated field to be slower altogether. I used JMH and a blackhole, so everything should be set up just fine, but I got these unexpected results:

 Benchmark                       Mode  Cnt          Score         Error  Units
 getStringJavaDelegation        thrpt  200  220181331,464 ± 2144028,358  ops/s
 getStringJavaImplementation    thrpt  200  171078263,764 ±  889605,110  ops/s
 getStringJavaInheritance       thrpt  200  170878616,220 ±  818848,070  ops/s
 getStringKotlinDelegation      thrpt  200  225753956,507 ± 1740352,057  ops/s
 getStringKotlinImplementation  thrpt  200  168879795,813 ± 2728455,723  ops/s
 getStringKotlinInheritance     thrpt  200  170414757,249 ± 1515476,325  ops/s

Turns out the delegation to a delegate field is faster than the other two versions... Okay, I had the instantiation included in the benchmarked code, and even though I expected delegation to be slower in that case too, I removed it from the benchmark - so now a single getString() call is measured. Results:

 Benchmark                       Mode  Cnt          Score         Error  Units
 getStringJavaDelegation        thrpt  200  301713586,642 ± 8160921,344  ops/s
 getStringJavaImplementation    thrpt  200  225820433,449 ± 3676854,362  ops/s
 getStringJavaInheritance       thrpt  200  234833613,665 ±  561919,892  ops/s
 getStringKotlinDelegation      thrpt  200  320742908,021 ± 1406189,583  ops/s
 getStringKotlinImplementation  thrpt  200  230377534,877 ± 3347435,643  ops/s
 getStringKotlinInheritance     thrpt  200  230821924,187 ± 1159446,814  ops/s

No chance, same results. And additionally, Kotlin's delegation seems to be even faster than the hand-written implementation in Java. I decided to take a closer look at the bytecode.

 DelegationJava Bytecode  
  public getString()Ljava/lang/String;  
   L0  
   LINENUMBER 14 L0  
   ALOAD 0  
   GETFIELD de/hanno/playground/DelegationJava.impl : Lde/hanno/playground/MyInterfaceImplementation;  
   INVOKEVIRTUAL de/hanno/playground/MyInterfaceImplementation.getString ()Ljava/lang/String;  
   ARETURN  
   L1  
   LOCALVARIABLE this Lde/hanno/playground/DelegationJava; L0 L1 0  
   MAXSTACK = 1  
   MAXLOCALS = 1  


 DelegationKotlin Bytecode  
  public getString()Ljava/lang/String;  
   L0  
   ALOAD 0  
   GETFIELD de/hanno/playground/DelegationKotlin.myInterfaceImplementation : Lde/hanno/playground/MyInterfaceImplementation;  
   INVOKEVIRTUAL de/hanno/playground/MyInterfaceImplementation.getString ()Ljava/lang/String;  
   ARETURN  
   L1  
   LOCALVARIABLE this Lde/hanno/playground/DelegationKotlin; L0 L1 0  
   MAXSTACK = 1  
   MAXLOCALS = 1  
 

 ImplementationJava Bytecode
  public getString()Ljava/lang/String;  
   L0  
   LINENUMBER 14 L0  
   ALOAD 0  
   GETFIELD de/hanno/playground/ImplementationJava.myStringField : Ljava/lang/String;  
   ARETURN  
   L1  
   LOCALVARIABLE this Lde/hanno/playground/ImplementationJava; L0 L1 0  
   MAXSTACK = 1  
   MAXLOCALS = 1  

Not much difference here, just a LINENUMBER instruction. So I have to admit I'm very satisfied with this result, even though I can't fully explain it. Of course I know that bytecode isn't the whole story... but I don't feel like investing more time here, because I have more interesting things to implement :) If anybody has further ideas, I would like to hear them.