Friday, December 15, 2017

Delegation and Ad-hoc polymorphism in Kotlin

CAUTION: This post got somewhat long :/ TLDR: You can effectively use Kotlin's delegation if you want to mimic ad-hoc polymorphism.

I was always fascinated by ad-hoc polymorphism. The clear separation between data and behavior that can be achieved with tools like Scala's type classes is especially useful when combined with data-oriented design, which matters for game development, for example.

I realized that the usefulness depends heavily on how well this feature can be integrated into and supported by the programming language you work with.

For example in Go, you don't have to implement an interface explicitly - if your 'class' (Go doesn't have classes) satisfies the contract of the interface, it implicitly implements it. That means you can safely pass an instance of your class into a method that expects the interface. I like that this is called static duck typing. It is a form of structural typing, because your instance is typed by parts of its definition. Structural typing has many advantages, but also the disadvantage that code can break or fail at runtime when said interfaces change, visibly or invisibly. Also, there's the potential for naming clashes, for example if two interfaces have methods with the same name/signature. Although I don't know anything about the internal implementation of Go's structural typing, I suspect that the whole system works without creating wrappers under the hood.

In languages like Java and Scala, we don't have this luxury, or rather we have to solve slightly different problems here. I find Scala's solution a very good one - but I don't know too many comparable others :) Type classes combined with implicits are a very good example of how to integrate this feature into a statically typed language (this is a very nice post that answers pretty much all questions about how type classes and implicits in Scala work).

And here's the big but (and I can not lie..): the feature is very infamous among people who don't use Scala, and even among some people who do use it. My humble opinion is that it boils down to the old problem: as a programmer, you read more code than you write. That implies that the ability to read code easily is more important than the ability to write code easily. I would like to correct this statement slightly: it's more important to be able to easily read and understand what's happening. Very concise code is nice to read because it hides a large amount of information - but there is a thin red line that should not be crossed when hiding information. The problem is that this is subjective to a degree. One of the biggest criticisms of Scala is its complex implicit resolution, which people really struggle to understand and which makes compile times suffer drastically. This weakness is so clearly a problem - not only for mediocre programmers and non-library developers, but also for the code quality of a project in general - that the Kotlin creators decided to leave ad-hoc polymorphism out of the language until now.

At the time of writing, blog posts similar to my thoughts were published here and here. And in fact, even after reading the whole language improvement proposal for ad-hoc polymorphism, I'm not quite sure whether ad-hoc polymorphism without something comparable to Scala's implicits makes sense at all. Here's a small example of a situation where ad-hoc polymorphism would be appropriate: you have an interface and some implementations of this interface. The interface is consumed by some methods, one of them being someFrameWorkFunction:


interface MyInterface {  
    fun myInterfaceMethod()  
}  
class MiOwnImplementation : MyInterface {  
    override fun myInterfaceMethod() = println("This is my own implementation")  
}  
fun someFrameWorkFunction(parameter: MyInterface) = parameter.myInterfaceMethod()  
fun main(args: Array<String>) {  
    someFrameWorkFunction(MiOwnImplementation())  
}  

Everything is fine until you have to integrate code that's outside of your control, such as the following class, which may already implement the functionality you need, but with the wrong method signature and without your interface (because where would its authors get your interface from...):


class ForeignImplementation {  
    fun printSomething() = println("This is a foreign implementation")  
}  

No matter which solution you choose to integrate this foreign implementation, one thing is for sure: you need a description of how the needed functions are implemented, hence a mapping from interface methods to implementation methods.

This could do the job in Kotlin, nothing fancy.


class MyFacade(private val impl: ForeignImplementation) : MyInterface {  
   override fun myInterfaceMethod() = impl.printSomething()  
}

Afterwards, we can use our implementation like this. Alternatively, we can use Kotlin's equivalent of an anonymous class, an object expression:


fun main(args: Array<String>) {
   someFrameWorkFunction(MiOwnImplementation())
   someFrameWorkFunction(object : MyInterface {
     private val impl = ForeignImplementation()
     override fun myInterfaceMethod() = impl.printSomething()
   })
}

With an implicit resolution algorithm, one could write


someFrameWorkFunction(ForeignImplementation())  

instead of explicitly instantiating one's own facade.

The question here is: is the implicit resolution worth the hassle of now having to search for the conversion definition? Is the example with the explicit conversion really that much harder to read? In my opinion, so far there's no reason not to state the conversion explicitly. Only when the feature is taken further does it make sense to have it - for example when you are used to contextual objects as implicit parameters because they are embedded in the language, like in Scala. There it's possible, for example, to provide default implementations that can be imported by the user of your library. Without such a feature, one would have to provide overloads of a method that take those types as parameters. Kotlin wouldn't benefit from it; instead it would be an artificial feature that is limited to one situation.
Swift, on the other hand, allows extensions that can make an existing class implement an interface, with some limitations I don't know in detail. Extensions are automatically available through module import. So they are somewhat comparable to what Scala can do, because your module can contain all the default extensions and make them available to the user.

In general, the separation of state and behavior could become the default option, and ad-hoc polymorphism could become the default over parametric polymorphism. But on the JVM you will have to pay for the wrapping class, so this is probably not the wisest idea. Extension functions, on the other hand, seem to extend a class's interface, but do so via syntactic sugar and static dispatch. That means the extension can be imported on the callee's side, whereas class extensions or type classes are chosen by the caller.
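To illustrate the static dispatch point with a minimal sketch (prettyDescribe is a made-up name, not part of the example above): an extension function is defined somewhere, imported at the use site like any other symbol, and resolved at compile time against the declared type - no wrapper object is created.


fun MyInterface.prettyDescribe() = println("Pretty: $this")

fun main(args: Array<String>) {
    // Resolved statically against the declared type MyInterface at compile time -
    // no wrapper object and no virtual dispatch is involved for the extension call itself.
    val item: MyInterface = MiOwnImplementation()
    item.prettyDescribe()
}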

So let's think about a more complex scenario. The interface our framework method accepts is a more complex one that extends the List interface and provides a getLast() method that doesn't throw an exception if the list is empty, but returns a nullable reference. This can be a default implementation that uses an abstract property. Additionally, a real interface method is declared for pretty-printing, which classes have to implement.

interface MyList<T> : List<T> {  
   val list: List<T>  
   fun getLast() : T? = if (!list.isEmpty()) list.last() else null  
   fun prettyPrint()
}

Since one can only delegate to constructor parameters and new instances in Kotlin, there is no way to have MyList provide default implementations for all List methods just by delegating to the abstract property list. This limitation prohibits using a simple inline object declaration in a method that consumes a MyList interface because you would have to implement all methods of the List interface or create a local list outside of the statement.
A solution would be to create a simple implementation (as one typically could and should with heavy use of delegation) that itself implements the List interface with delegation, like:


class MyListImpl<T>(override val list : List<T>) : MyList<T>, List<T> by list {  
   override fun prettyPrint() = println("Pretty: " + super.toString())  
}

So the framework method can be called like


someFrameWorkFunction(MyListImpl(emptyList()))

While this is not ad-hoc polymorphism by definition, under the hood it's equivalent to comparable solutions. Delegation can help overcome many of the pain points when the need to satisfy interfaces with existing classes emerges. As for a general purpose usage of type classes, one has to keep in mind that delegation has a small runtime overhead, as it creates objects and delegates method calls.

Another very clean solution would be to use scoped extension methods to achieve the same thing. Using with can bring extensions defined on an interface into scope. Using a singleton prevents us from creating objects, but obviously one has to pass the extension provider explicitly as a parameter.


interface ExtensionProvider<T> {
    fun T.customExtensionMethod()
}
object IntExtensionProvider : ExtensionProvider<Int> {
    override fun Int.customExtensionMethod() = println("Custom method prints: " + this)
}
fun <T> someFrameWorkFunction(list : List<T>, extensionProvider: ExtensionProvider<T>) {
    with(extensionProvider) {
        list.forEach { it.customExtensionMethod() }
    }
}

fun main(args: Array<String>) {
    someFrameWorkFunction(listOf(1,2,3), IntExtensionProvider)
}

And now I hope that I didn't confuse the wording of all these things :)

Monday, December 11, 2017

Rendering thousands of unique animated instances

Some weeks ago, I implemented GPU skinning in my engine. As mentioned in my post about multithreading and high performance stuff, you face a problem with animated objects: depending on the maximum number of weights per vertex you want to support (usually 4 or 8), you now have a different vertex layout to support. Additionally, your entity contains a bone hierarchy that can be arbitrarily shaped. I implemented a second entity data buffer in addition to the main entity buffer that contains the model matrix, material index etc. This has the advantage that we now have a buffer that contains only equally sized nodes of all scene objects' bone hierarchies - which means we can freely do indexed access into this global array from all shaders.

So in the vertex shader for animated objects, the corresponding bone (matrix) can be fetched from the structured buffer by index. The retrieved data structure contains its parent node index if present, or -1 if the bone is the top of the hierarchy. So we have a hierarchical data structure traversable by the GPU.

Per entity, it is now necessary to define a maximum count of animations that can run at the same time. Since vec4s are nicely aligned by the GPU, I chose 4. So each entity's data now contains an additional vec4 with 4 float values, indicating the weights of the 4 active animations. I wanted to avoid an additional indirection here, so I didn't put the animation data in its own buffer. On the CPU side of things, an animation controller can play an animation and only has to update a single value in the entity data buffer (which is lock-free and unsynchronized, so extremely fast - read my post about multithreaded engines).
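To sketch what that per-entity data roughly looks like on the CPU side (field names and sizes are illustrative, not my engine's exact layout):


// Illustrative sketch of the per-entity data written into the entity buffer.
class EntityGpuData(
    val modelMatrix: FloatArray = FloatArray(16),     // 4x4 model matrix
    var materialIndex: Int = 0,
    var baseBoneIndex: Int = 0,                       // first node of this entity's bone hierarchy in the global bone buffer
    val animationWeights: FloatArray = FloatArray(4)  // weights of the (at most) 4 active animations, maps to one vec4
)

// The animation controller only has to touch a single float per update:
fun setAnimationWeight(entity: EntityGpuData, animationSlot: Int, weight: Float) {
    entity.animationWeights[animationSlot] = weight
}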

The vertex shader already has the actual entity data structure and can now do the animation blending directly on the GPU.

Recap: the CPU updates the animation controller. The result is one float value per animation - the current weight of the animation. Since bones are precalculated on model import, everything is ready for the GPU now. The GPU only has to do some buffer fetches and some matrix multiplications. Combined with instanced rendering, where the animation controller is part of the per-instance data, one can have thousands of independent animations, for example to simulate crowds of people. Or many Hellknights:



Keep in mind that there's no culling right now - neither per instance cluster nor per instance.

Friday, November 24, 2017

Two-phase occlusion culling (part 1)

I really love rendering performance enhancements that work with existing asset pipelines and don't need artists to tweak and configure scene objects. Occlusion culling is a technique that is necessary for every engine that should be able to render interior scenes or outdoor scenes with a lot of foliage. Some existing techniques require the user to mark occluder objects explicitly - which is not sufficient for outdoor scenes where occludees are occluders too; some techniques are fast but introduce sync points between CPU and GPU and therefore latency, and with it popping artifacts.

This paper describes a super nice algorithm very briefly: two-phase occlusion culling. This technique has some advantages: it is blazingly fast, because it's GPU-only. Culling tens of thousands of objects and even single instances is possible. It introduces no latency, so no popping artifacts. And it doesn't need hand-tweaking; there is no need to differentiate between occluders and occludees - every object can be both. The disadvantage is that the technique requires a fairly new GPU that supports indirect rendering.

The algorithm

We need some buffers first:
  • A source command buffer, containing n commands.
  • A target command buffer of size n.
  • A visibility buffer, containing n ints, an entry for each source command.
  • An atomic counter buffer.

All scene objects produce draw commands that are put into a buffer. Take a look at my other posts to see how to implement a lock-free multithreading engine with unsynchronized buffers. The entity data is put into a buffer as well. Part of this data is an AABB, calculated on the CPU side. The CPU only needs to calculate the AABB and put the latest data into a buffer.
The GPU side now performs the following steps for rendering: a thread per draw command is executed. The corresponding entity data is fetched from the other buffer. The AABB is culled against the hierarchical depth buffer of the last frame (or an empty one if the current frame is the first frame, doesn't matter). If the AABB is not occluded, the shader writes visibilityBuffer[index] = 1. Now another shader is executed with n threads. Every thread takes a look at its visibilityBuffer index. If it contains a 1, the entity is visible, and the sourceCommand is fetched and written to the targetCommand buffer. The target index is the result of an atomicAdd on the atomicCounterBuffer. Now the atomicCounterBuffer contains the drawCount, while the targetCommand buffer contains the commands we need to render. Rendering is done with indirect rendering, so no readback to the CPU at all. Afterwards, the hi-z buffer is updated. Now the procedure is repeated, but only the objects invisible in the first phase are tested against the updated depth buffer. All objects now marked visible were falsely culled and have to be rendered now. So all the algorithm needs is two indirect draw calls, plus the culling and buffer copying steps.
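As a rough CPU-side sketch of how a frame is orchestrated (everything here is a placeholder abstraction, not my engine's actual API):


// Sketch of the two-phase frame, with placeholder steps instead of real GL calls.
interface GpuStep { fun run() }

class TwoPhaseCulling(
    private val cullAll: GpuStep,                 // cull every AABB against the current hi-z buffer
    private val cullPreviouslyInvisible: GpuStep, // re-test only objects that were culled in phase 1
    private val copyVisibleCommands: GpuStep,     // atomicAdd the drawCount, copy visible commands to the target buffer
    private val drawIndirect: GpuStep,            // indirect draw using the drawCount from the atomic counter buffer
    private val buildHighZ: GpuStep               // rebuild the hierarchical depth buffer
) {
    fun renderFrame() {
        // Phase 1: test against last frame's hi-z buffer and render everything that passes.
        cullAll.run(); copyVisibleCommands.run(); drawIndirect.run(); buildHighZ.run()
        // Phase 2: re-test the previously culled objects against the fresh hi-z buffer;
        // whatever passes now was falsely culled in phase 1 and is rendered on top.
        cullPreviouslyInvisible.run(); copyVisibleCommands.run(); drawIndirect.run(); buildHighZ.run()
    }
}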

Here's an example of a scene where 7-8 cars with roughly a million vertices are rendered. When behind a plane, they are occluded, and therefore no draw commands are committed and the framerate goes through the roof.




Implementation

Normally, you are already rendering depth somehow, at least in the regular depth buffer. Here's the most generic compute shader way to create a highz buffer with OpenGL:

 public static void renderHighZMap(int baseDepthTexture, int baseWidth, int baseHeight, int highZTexture, ComputeShaderProgram highZProgram) {  
     highZProgram.use();  
     int lastWidth = baseWidth;  
     int lastHeight = baseHeight;  
     int currentWidth = lastWidth /2;  
     int currentHeight = lastHeight/2;  
     int mipMapCount = Util.calculateMipMapCount(currentWidth, currentHeight);  
     for(int mipmapTarget = 0; mipmapTarget < mipMapCount; mipmapTarget++ ) {  
       highZProgram.setUniform("width", currentWidth);  
       highZProgram.setUniform("height", currentHeight);  
       highZProgram.setUniform("lastWidth", lastWidth);  
       highZProgram.setUniform("lastHeight", lastHeight);  
       highZProgram.setUniform("mipmapTarget", mipmapTarget);  
       if(mipmapTarget == 0) {  
         GraphicsContext.getInstance().bindTexture(0, TEXTURE_2D, baseDepthTexture);  
       } else {  
         GraphicsContext.getInstance().bindTexture(0, TEXTURE_2D, highZTexture);  
       }  
       GraphicsContext.getInstance().bindImageTexture(1, highZTexture, mipmapTarget, false, 0, GL15.GL_READ_WRITE, HIGHZ_FORMAT);  
       int num_groups_x = Math.max(1, (currentWidth + 7) / 8);  
       int num_groups_y = Math.max(1, (currentHeight + 7) / 8);  
       highZProgram.dispatchCompute(num_groups_x, num_groups_y, 1);  
       lastWidth = currentWidth;  
       lastHeight = currentHeight;  
       currentWidth /= 2;  
       currentHeight /= 2;  
       glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);  
     }  
   }  

And the corresponding compute shader:

 layout(binding = 0) uniform sampler2D sourceTexture;  
 layout(binding = 1, r32f) uniform image2D targetImage;  
 #define WORK_GROUP_SIZE 8  
 layout(local_size_x = WORK_GROUP_SIZE, local_size_y = WORK_GROUP_SIZE) in;  
 uniform int width = 0;  
 uniform int height = 0;  
 uniform int lastWidth = 0;  
 uniform int lastHeight = 0;  
 uniform int mipmapTarget = 0;  
 vec4 getMaxR(sampler2D sampler, ivec2 baseCoords, int mipLevelToSampleFrom) {  
      vec4 fineZ;  
      fineZ.x = (texelFetch(sampler, baseCoords, mipLevelToSampleFrom).r);  
      fineZ.y = (texelFetch(sampler, baseCoords + ivec2(1,0), mipLevelToSampleFrom).r);  
      fineZ.z = (texelFetch(sampler, baseCoords + ivec2(1,1), mipLevelToSampleFrom).r);  
      fineZ.w = (texelFetch(sampler, baseCoords + ivec2(0,1), mipLevelToSampleFrom).r);  
      return fineZ;  
 }  
 vec4 getMaxG(sampler2D sampler, ivec2 baseCoords, int mipLevelToSampleFrom) {  
      vec4 fineZ;  
      fineZ.x = (texelFetch(sampler, baseCoords, mipLevelToSampleFrom).g);  
      fineZ.y = (texelFetch(sampler, baseCoords + ivec2(1,0), mipLevelToSampleFrom).g);  
      fineZ.z = (texelFetch(sampler, baseCoords + ivec2(1,1), mipLevelToSampleFrom).g);  
      fineZ.w = (texelFetch(sampler, baseCoords + ivec2(0,1), mipLevelToSampleFrom).g);  
      return fineZ;  
 }  
 void main(){  
      ivec2 pixelPos = ivec2(gl_GlobalInvocationID.xy);  
      ivec2 samplePos = 2*pixelPos;//+1; // TODO: Fix this  
      int mipmapSource = mipmapTarget-1;  
   vec4 total;  
   if(mipmapTarget == 0) {  
     total = getMaxG(sourceTexture, samplePos, 0);  
   } else {  
     total = getMaxR(sourceTexture, samplePos, mipmapSource);  
   }  
   float maximum = max(max(total.x, total.y), max(total.z, total.w));  
 //Thank you!  
 //http://rastergrid.com/blog/2010/10/hierarchical-z-map-based-occlusion-culling/  
  vec3 extra;  
  // if we are reducing an odd-width texture then fetch the edge texels  
  if ( ( (lastWidth % 2) != 0 ) && ( pixelPos.x == lastWidth-3 ) ) {  
   // if both edges are odd, fetch the top-left corner texel  
   if ( ( (lastHeight % 2) != 0 ) && ( pixelPos.y == lastHeight-3 ) ) {  
    extra.z = texelFetch(sourceTexture, samplePos + ivec2( 1, 1), mipmapSource).x;  
    maximum = max( maximum, extra.z );  
   }  
   extra.x = texelFetch( sourceTexture, samplePos + ivec2( 1, 0), mipmapSource).x;  
   extra.y = texelFetch( sourceTexture, samplePos + ivec2( 1,-1), mipmapSource).x;  
   maximum = max( maximum, max( extra.x, extra.y ) );  
  } else if ( ( (lastHeight % 2) != 0 ) && ( pixelPos.y == lastHeight-3 ) ) {  
   // if we are reducing an odd-height texture then fetch the edge texels  
   extra.x = texelFetch( sourceTexture, samplePos + ivec2( 0, 1), mipmapSource).x;  
   extra.y = texelFetch( sourceTexture, samplePos + ivec2(-1, 1), mipmapSource).x;  
   maximum = max( maximum, max( extra.x, extra.y ) );  
  }  
      imageStore(targetImage, pixelPos, vec4(maximum));  
 }  

The culling shader code can be found on rastergrid, where you can find a lot of information about hi-z culling in general. I experienced some strange bugs with compute shaders that seemed to be executed twice on my machine. Since I wasn't able to get rid of them, I used a vertex shader trick to execute arbitrary kernels on the GPU: have some dummy vertex buffer bound (required by the specification) and use glDrawArrays with (commands.size + 2) / 3 * 3 vertices (we need a multiple of 3 here). No fragment shader is needed for the shader program. In the vertex shader, gl_VertexID can be used as the invocation index. The following shader code copies draw commands from visible entities to the target buffer. It's just an extract, but you get the idea:

 [...]  
 layout(std430, binding=9) buffer _visibility {  
      int visibility[1000];  
 };  
 uniform int maxDrawCommands;  
 void main()  
 {  
   uint indexBefore = gl_VertexID;  
   if(indexBefore < maxDrawCommands) {  
     DrawCommand sourceCommand = drawCommandsSource[indexBefore];  
     bool visible = visibility[indexBefore] == 1;  
     if(visible)  
     {  
       uint indexAfter = atomicAdd(drawCount, 1);  
       drawCommandsTarget[indexAfter] = sourceCommand;  
     }  
   }  
 }  

There is one aspect missing: with instanced rendering, one has to introduce an offset buffer, where every entry gives a draw command's offset into an instanced entity buffer (see the sketch below). My current implementation has an AABB per instance cluster, grouping several instances into a draw command that can be culled. The next post will hopefully show how this technique can be extended to cull single instances, in order to have perfect culling, for example, for thousands of foliage objects.
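To make the offset buffer a bit more concrete, here's a small CPU-side sketch (DrawCommandSpec is a hypothetical name): every entry stores the index of a command's first instance in the global instanced entity buffer, so a shader can compute entityIndex = offsets[commandIndex] + instanceIndex.


// Sketch only: building the per-command offset buffer on the CPU.
data class DrawCommandSpec(val instanceCount: Int)

fun buildInstanceOffsets(commands: List<DrawCommandSpec>): IntArray {
    val offsets = IntArray(commands.size)
    var firstInstance = 0
    for ((i, command) in commands.withIndex()) {
        offsets[i] = firstInstance           // index of this command's first entity in the instanced entity buffer
        firstInstance += command.instanceCount
    }
    return offsets
}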

Costs of Kotlin Delegation

If there is one thing I really, really like about Kotlin, it's the first-class citizenship of delegation throughout the whole language. It provides a simple and concise replacement for inheritance. The nice thing here is that abstract properties can be used in an interface while the state remains in the implementing class, with a very low amount of boilerplate, magic and chance of bullshit happening. Or an interface can be implemented by giving a delegating instance, and that's it. Here's a short snippet to show what I am talking about (reduced example):

class Delegation(val myImpl: MyInterfaceImpl) : MyInterface by myImpl

Since this can be done in Java as well, with default methods in interfaces and IDE code generation, it really is the reduced amount of boilerplate that makes the difference. It's only worth something if it can be expressed in readable, efficient code.

And the last one is a very important criterion: efficiency. I don't want to make my JVM code slower than it already is. We're talking about games - if another language is slower, then the code I can produce with Java is the better compromise for me. If I imagine using delegation over inheritance with Kotlin throughout my codebase, in every hierarchy... will it slow down my engine remarkably? I ran benchmarks of implementations of a directly implemented field, an inherited field and a delegated field with Java and Kotlin. I expected a field to be faster than a property in general, and a delegated field to be slower altogether. I used JMH and a blackhole, so everything should be implemented just fine, but I got these unexpected results:

 Benchmark                     Mode Cnt     Score     Error Units  
 getStringJavaDelegation      thrpt 200 220181331,464 ± 2144028,358 ops/s  
 getStringJavaImplementation  thrpt 200 171078263,764 ± 889605,110 ops/s  
 getStringJavaInheritance      thrpt 200 170878616,220 ± 818848,070 ops/s  
 getStringKotlinDelegation     thrpt 200 225753956,507 ± 1740352,057 ops/s  
 getStringKotlinImplementation thrpt 200 168879795,813 ± 2728455,723 ops/s  
 getStringKotlinInheritance     thrpt 200 170414757,249 ± 1515476,325 ops/s  

Turns out the delegation to a delegate field is faster than the other two versions... Okay, I had the instantiation included in the benchmarked code, and even though I expected delegation to be slower in that case as well, I removed it from the benchmark - so now a single getString() call is measured. Results:

 Benchmark                     Mode Cnt     Score     Error Units  
 getStringJavaDelegation      thrpt 200 301713586,642 ± 8160921,344 ops/s  
 getStringJavaImplementation  thrpt 200 225820433,449 ± 3676854,362 ops/s  
 getStringJavaInheritance      thrpt 200 234833613,665 ± 561919,892 ops/s  
 getStringKotlinDelegation     thrpt 200 320742908,021 ± 1406189,583 ops/s  
 getStringKotlinImplementation thrpt 200 230377534,877 ± 3347435,643 ops/s  
 getStringKotlinInheritance     thrpt 200 230821924,187 ± 1159446,814 ops/s  

No chance, same results. And additionally, Kotlin's delegation seems to be even faster than the bare-handed implementation in Java. I decided to take a closer look at the bytecode.

 DelegationJava Bytecode  
  public getString()Ljava/lang/String;  
   L0  
   LINENUMBER 14 L0  
   ALOAD 0  
   GETFIELD de/hanno/playground/DelegationJava.impl : Lde/hanno/playground/MyInterfaceImplementation;  
   INVOKEVIRTUAL de/hanno/playground/MyInterfaceImplementation.getString ()Ljava/lang/String;  
   ARETURN  
   L1  
   LOCALVARIABLE this Lde/hanno/playground/DelegationJava; L0 L1 0  
   MAXSTACK = 1  
   MAXLOCALS = 1  


 DelegationKotlin Bytecode  
  public getString()Ljava/lang/String;  
   L0  
   ALOAD 0  
   GETFIELD de/hanno/playground/DelegationKotlin.myInterfaceImplementation : Lde/hanno/playground/MyInterfaceImplementation;  
   INVOKEVIRTUAL de/hanno/playground/MyInterfaceImplementation.getString ()Ljava/lang/String;  
   ARETURN  
   L1  
   LOCALVARIABLE this Lde/hanno/playground/DelegationKotlin; L0 L1 0  
   MAXSTACK = 1  
   MAXLOCALS = 1  
 

ImplementationJava ByteCode  
  public getString()Ljava/lang/String;  
   L0  
   LINENUMBER 14 L0  
   ALOAD 0  
   GETFIELD de/hanno/playground/ImplementationJava.myStringField : Ljava/lang/String;  
   ARETURN  
   L1  
   LOCALVARIABLE this Lde/hanno/playground/ImplementationJava; L0 L1 0  
   MAXSTACK = 1  
   MAXLOCALS = 1  

Not much difference here, just a LINENUMBER instruction. So I have to admit I'm very satisfied with this result, even though I can't explain it. Of course I know that bytecode alone isn't everything... but I don't feel like investing more time here because I have more interesting things to implement :) If anybody has further ideas here, I would like to hear them.

Monday, August 14, 2017

Multithreaded game engines and rendering with OpenGL

There has been this one question since the beginning of multicore processors: how do you multithread your game engine? I can't answer this question in general, but I can answer the question: how do you multithread the rendering part of your game engine?

Problem

The most simple case is a single-threaded game engine that does this:

 while(true) {  
   simulateWorld();  
   renderWorld();  
 }  

The update step produces a consistent state of the world. After each update, the rendering is done. After the rendering is done, the next cycle is processed. No synchronization needed, everything works. The bad thing: simulateWorld is a CPU-heavy task, renderWorld is a GPU-heavy task. So while the world is updated, the GPU idles; while the GPU renders, the CPU idles (more or less). And the worst thing: your framerate is limited by both the CPU and the GPU. Your frametime is timeOf(simulateWorld) + timeOf(renderWorld). Our target frametime is timeOf(renderWorld), and the goal is to keep this time as low as possible.

Foundations

Okay, I lied: the above mentioned problem is only the overall, global problem we solve. There's a shitload of other problems that mostly depend on the language, platform, graphics API, hardware, your skillset and so on. Since our lifetime is limited, and I want to give practical, real-world-relevant help, I make some assumptions. One of these is that you use OpenGL, and a pretty new version of it - let's say 4.3. Multithreaded rendering could probably be done better and faster with DX 12 or Vulkan, but let's stay cross-platform and simpler with OpenGL.

I assume that you already have at least a little bit of knowledge about how game engines work, timesteps, scene representations and so on.

Multiple GPU contexts

In order to issue commands to OpenGL, you have to use a thread that has an OpenGL context bound to it. Architecture-wise, there is the CPU, the OpenGL driver and the GPU. The driver holds a queue where the commands from the CPU are buffered somehow (implementation dependent). Then again, there is a queue for the GPU that buffers commands that came from the driver (implementation dependent). First of all, all commands the CPU issues are asynchronous, which means there's no actual work done on the GPU yet, but the command is queued by the driver. The order is preserved, that's what you can count on. The implementation can decide how many commands are queued up until the CPU thread blocks on command issuing. And the implementation can decide how many commands are buffered on the GPU side of things.

Now, having said that, let's think about multithreading. What we can control is the CPU side (unless you write your own driver, which I doubt you want to do). Multiple CPU threads would require multiple GPU contexts - one for each thread. While this is theoretically possible with OpenGL, it is a very dumb idea in 19 out of 20 cases. As with traditional multithreading, context switching comes with a very high overhead - so you have to ensure that the context switching overhead doesn't eat up the performance you gained from using multiple threads. And here's what's wrong: it seems that all OpenGL drivers (where you issue your commands) except the one for iOS are implemented single-threaded. That means you can issue commands from multiple CPU threads, but on the driver thread those commands are processed sequentially, with a context switch in between. You can find a more detailed explanation here, so I don't have to lose many further words, but: don't use multiple contexts for anything other than asynchronous resource streaming.

So I assume we only use one OpenGL context. That means all OpenGL calls must happen on a single thread. The construct used for this is the command pattern. A few hints: you should issue as few commands to OpenGL as possible, and they should be as lightweight as possible. Obviously, using only one thread for command execution limits the amount of work that can be done by this (CPU) thread. Nonetheless, you want a single, infinitely running worker thread that owns your OpenGL context and takes commands out of a non-blocking command queue. For convenience, you can separate commands that should return a result (hence block) from those that can just be fired and forgotten. It's not necessary that all commands are properly implemented by your context wrapper/worker thread class, even though it makes your application more predictable in terms of performance. I implemented a non-blocking, generic command queue in Java here that uses Runnables and Callables as commands. The principle is also explained in more detail here.
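A minimal Kotlin sketch of such a command queue (not the linked implementation, just the idea): one dedicated thread owns the context and drains a non-blocking queue, fire-and-forget commands are plain lambdas, blocking commands return their result via a future.


import java.util.concurrent.CompletableFuture
import java.util.concurrent.ConcurrentLinkedQueue

class GpuCommandQueue {
    private val commands = ConcurrentLinkedQueue<() -> Unit>()

    // The one and only thread that is allowed to touch the OpenGL context.
    private val worker = Thread {
        // in a real engine the OpenGL context would be made current here
        while (true) {
            commands.poll()?.invoke()   // busy-waits for brevity; a real queue would park the thread
        }
    }.apply { isDaemon = true; start() }

    // Fire and forget: enqueue and return immediately.
    fun execute(command: () -> Unit) { commands.add(command) }

    // Blocking variant: enqueue and wait for the result on the calling thread (error handling omitted).
    fun <T> call(command: () -> T): T {
        val future = CompletableFuture<T>()
        commands.add { future.complete(command()) }
        return future.get()
    }
}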

Extractor pattern

Extractor pattern is a term one never finds on the internet when searching for information about rendering. The first time I heard it was when a (professional graphics programming) colleague of mine explained the basics of renderer architectures to me many, many years ago. Fancy name for a simple thing: if you share data between two parties (not exactly threads in this case), you either have to synchronize the access somehow, or - in this case - you extract a copy of the data to pass it to the renderer. That implies that you need a synchronized/immutable, command-like data structure that represents all the information you need to push a render command. That plays nicely with the chapter above, you know, command queue etc. In languages other than Java, where you have true value types, the implementation could be easier - although having very large renderstate objects copied over and over might be a drawback there. The chapter below shows another way of handling this problem.

Multibuffering

Let's assume that the GPU only consumes renderstate, but doesn't alter it. That means if the GPU calculates any fancy physics, or results that should be passed to the next render cycle, then that isn't seen as part of our traditional renderstate for now.
That means the only producer of renderable state is the CPU. Since we don't want to render a scene where the first half of the objects is in timestep n and the second half is still in n-1, we have to be able to access a state of our game state that is consistent somehow. This is easily achievable with a task-based architecture that basically has only one update thread, which sequentially crunches down what you want to do in parallel. For example, if you have 10 game objects, the update thread pushes 10 update commands into a pool of worker threads, and after all of them have returned a result, we have a consistent world state we can use - and we have used all the threads our system offers. I don't think there's an alternative approach to multithreading that can be used in a general purpose game engine, but I may be wrong. Instead of creating a new renderstate object and pushing a command to the render queue, we use something that's been known in multithreading environments for ages: multibuffering.
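Before getting to the multibuffering, here's a tiny sketch of that fan-out/fan-in update, using a plain executor as a stand-in for the engine's real task system:


import java.util.concurrent.Callable
import java.util.concurrent.Executors

class GameObject { fun update(deltaSeconds: Float) { /* simulate this single object */ } }

val workerPool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors())

fun updateWorld(objects: List<GameObject>, deltaSeconds: Float) {
    val tasks = objects.map { Callable { it.update(deltaSeconds) } }
    // invokeAll blocks until every task has finished - afterwards the world state is consistent again
    workerPool.invokeAll(tasks)
}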

Multibuffering should not be confused with the double or triple buffering that is used to display images on your monitor, although the principle behind it is the same.

I'll skip explaining why double buffering isn't enough to satisfy our needs. Just use triple buffering and be happy, if you can afford the additional memory consumption. It works the following way.

You have three instances of your renderstate and a wrapper that encapsulates these three instances. Instance A is the currently readable state. The renderer uses this state for rendering, so while rendering is in progress, this instance mustn't be touched by anyone other than the renderer. Instance B is the current staging copy of the renderstate. It is the next state the renderer will use for the next frame, after the current frame is finished. Copy C is the instance the update is currently applied to.
Now while this sounds simple, the important thing is how to swap those instances correctly. Our main purpose is to keep the GPU busy and to be able to render maximum fps, although one could theoretically (and practically) limit the framerate somehow (vsync, fps cap etc.).

I realized this with an ever-running thread that constantly pushes and waits for a render-scene draw call. After the frame is finished, the read copy is swapped with the staging copy. It's important that the renderer gets a fresh copy on every swap. If the update thread isn't fast enough to provide a newly updated renderstate copy, the renderer would render a state that is older than the one just drawn. This would cause flickering objects. I solved this with an ongoing counter, and I prevent swapping to a state with a lower number than the current state has. Let's assume your engine and the machine are capable of pushing 60 updates per second and 60 fps.
Let's face the other swap: the update thread updates the write copy C constantly, as fast as possible. When it is finished, the write state becomes the new staging state B, so we have to swap them. We can't do this if the renderer is currently swapping its read and staging states. So we need a lock for swap(A,B) and a lock for swap(B,C). If rendering can be done at okayish framerates (which we assume), the update thread can wait (block) until swap(A,B) is finished. If you are using only coherent memory access, then you're done here. But you wouldn't be searching for state-of-the-art rendering if you didn't use incoherent access. So let's get to the final ingredient we need for our super fast renderer.
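Here's a compact sketch of that triple buffer wrapper - a simplification of what I described above, with a single lock guarding both swaps, and RenderState standing for whatever extracted state the renderer consumes:


class RenderState { var version: Long = 0 /* plus the actual extracted scene data */ }

class TripleBuffer(
    private var read: RenderState = RenderState(),    // A: currently read by the renderer
    private var staging: RenderState = RenderState(), // B: the next frame's state
    private var write: RenderState = RenderState()    // C: currently written by the update thread
) {
    private val swapLock = Any()

    // Called by the render thread after a frame is finished.
    fun swapReadAndStaging(): RenderState = synchronized(swapLock) {
        if (staging.version > read.version) {  // never swap back to an older state - prevents flickering
            val tmp = read; read = staging; staging = tmp
        }
        read
    }

    // Called by the update thread after it finished a full update of the write copy.
    fun swapStagingAndWrite(newVersion: Long): RenderState = synchronized(swapLock) {
        write.version = newVersion
        val tmp = staging; staging = write; write = tmp
        write
    }
}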

Lowering draw calls and synchronizing persistent mapped buffers

At the very beginning, there is the need to reduce the amount of work the GPU has to do in order to render your scene. There are paths through the OpenGL API that are more expensive than others, and there are paths with very, very small overhead, if you can afford the loss of flexibility that comes with using them. This can give you a rough overview of how costly API calls are. I suggest we skip a long explanation and you read this. Summed up: you want unsynchronized buffers for all your data and handle synchronization by yourself. This results in no driver or GPU work for the synchronization, which is crucial for maxing out your performance. Remember that Vulkan was designed exactly to kill the driver overhead of OpenGL. Furthermore, we reduce draw calls and state changes with bindless resources and one of the most important things: indirect and instanced rendering.
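For reference, a minimal sketch of allocating such a buffer, assuming LWJGL 3 style bindings and a coherent persistent mapping for simplicity (my engine's actual buffer code looks different):


import org.lwjgl.opengl.GL15.glBindBuffer
import org.lwjgl.opengl.GL15.glGenBuffers
import org.lwjgl.opengl.GL30.GL_MAP_WRITE_BIT
import org.lwjgl.opengl.GL30.glMapBufferRange
import org.lwjgl.opengl.GL43.GL_SHADER_STORAGE_BUFFER
import org.lwjgl.opengl.GL44.GL_MAP_COHERENT_BIT
import org.lwjgl.opengl.GL44.GL_MAP_PERSISTENT_BIT
import org.lwjgl.opengl.GL44.glBufferStorage
import java.nio.ByteBuffer

// Creates a persistently mapped shader storage buffer the CPU can write to at any time
// while the GPU reads from it - synchronizing the two is then our own job (see below).
fun createPersistentMappedBuffer(sizeInBytes: Long): ByteBuffer {
    val flags = GL_MAP_WRITE_BIT or GL_MAP_PERSISTENT_BIT or GL_MAP_COHERENT_BIT
    val buffer = glGenBuffers()
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, buffer)
    glBufferStorage(GL_SHADER_STORAGE_BUFFER, sizeInBytes, flags) // immutable storage, needs GL 4.4
    return glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0, sizeInBytes, flags)!!
}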

Having all these things implemented, there's one last thing to add to our triple buffering: synchronization. Since the driver works asynchronously, and the GPU does too, there's no guarantee that the GPU isn't still using the buffer of our current write copy that the update thread currently writes to. To address this, one has to insert a fence sync object into the pipeline and associate it with the state whenever a state becomes the current write state. When the render thread has used this state and all render commands of the current frame have been pushed, the fence is inserted. Whenever this signal is processed by the GPU, the update thread can write to the associated state without any problems. The update thread can prepare the renderstate update and wait until the prepared data can be applied. This effectively blocks the update thread if the renderer is more than 2 frames behind the update thread (see the sketch below).
Special care has to be taken if you have any timestep-dependent calculations in your rendering step: since rendering is decoupled from the update step, you have to calculate the timestep yourself somehow in the render thread, based on the last tick and the time passed since then.
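A sketch of that fence with plain OpenGL sync objects, again assuming LWJGL 3 style bindings (in the real engine the fence is associated with a renderstate instance, and the wait itself has to be issued from a thread that owns a GL context, e.g. through the command queue from above):


import org.lwjgl.opengl.GL32.GL_ALREADY_SIGNALED
import org.lwjgl.opengl.GL32.GL_CONDITION_SATISFIED
import org.lwjgl.opengl.GL32.GL_SYNC_FLUSH_COMMANDS_BIT
import org.lwjgl.opengl.GL32.GL_SYNC_GPU_COMMANDS_COMPLETE
import org.lwjgl.opengl.GL32.glClientWaitSync
import org.lwjgl.opengl.GL32.glDeleteSync
import org.lwjgl.opengl.GL32.glFenceSync

// Inserted after all commands reading a given renderstate were pushed;
// waited on before the update thread is allowed to overwrite that state's buffers.
class GpuFence {
    private val handle = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0)

    fun waitUntilSignaled(timeoutNanos: Long = 1_000_000) {
        while (true) {
            val result = glClientWaitSync(handle, GL_SYNC_FLUSH_COMMANDS_BIT, timeoutNanos)
            if (result == GL_ALREADY_SIGNALED || result == GL_CONDITION_SATISFIED) break
        }
        glDeleteSync(handle)
    }
}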

That's it

  • Use the fastest possible paths through the OpenGL API. You need shader storage buffers and indirect (and instanced) rendering, probably combined with bindless textures
  • Model a renderstate class that can be your complete draw command
  • Run a render thread permanently that constantly pushes render commands

Saturday, April 15, 2017

Voxel Cone Tracing Global Illumination in realtime

Quite a while ago, I implemented Voxel Cone Tracing global illumination for my own custom engine. This technique can be quite impressive. I used Java and OpenGL, running on a GTX 770 desktop GPU.



Here's another video of the famous Cornell box, where I added a different diffuse tracing calculation. Same engine (newer version), running on a GTX 1060 mobile GPU.


And another run with a point light only configuration on the mobile gpu.

Thursday, March 16, 2017

Rendering 4 million vertices with 340,000 cubes in Java

Someone has to fight the battle against the "Java is so slow, all Java games have low fps" myth. Maybe Java is not the best language for game development, because it lacks value types and easy, zero-overhead native code integration... but the results one can achieve by using a zero-driver-overhead path, as modern graphics APIs have recommended for so many years, can be on par with what you can achieve in C or C++ in Unreal, CryEngine or Unity. The secret is: bindless resources, indirect and instanced rendering, persistent mapped buffers, direct state access and of course good old multithreading.

So, the most important things first: use indirect rendering (to minimize the calls the CPU has to issue!) and a shared global vertex buffer (to reduce state changes). Use large uniform or shader storage buffers for your objects' properties and material properties. For each object, push a command into the buffer. Massive object counts cause massive command counts and large command buffers, so one can use instancing to further reduce the command count. With a clever structure (an offset buffer, for example), one can easily have unique properties per object instance (for example dedicated textures per instance!) sourced from a large uniform buffer, when it's okay to share common geometry (vertices). Et voilà, 2 ms CPU time to fire a render command that draws 340,000 instanced cubes with 4 million triangles at 60 fps in Java. Each object can have its own textures, properties etc.
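The command layout used for this is the standard OpenGL DrawElementsIndirectCommand struct (five uints). As a sketch, filling a command buffer on the CPU could look like this - the surrounding helper is made up, only the field order is fixed by the spec:


import java.nio.ByteBuffer
import java.nio.ByteOrder

data class DrawElementsIndirectCommand(
    val count: Int,         // index count of the shared geometry
    val instanceCount: Int, // how many instances this command draws
    val firstIndex: Int,    // offset into the shared index buffer
    val baseVertex: Int,    // offset into the shared vertex buffer
    val baseInstance: Int   // used to source per-instance properties from the big entity buffer
)

fun writeCommands(commands: List<DrawElementsIndirectCommand>): ByteBuffer {
    val buffer = ByteBuffer.allocateDirect(commands.size * 5 * 4).order(ByteOrder.nativeOrder())
    for (c in commands) {
        buffer.putInt(c.count).putInt(c.instanceCount).putInt(c.firstIndex)
              .putInt(c.baseVertex).putInt(c.baseInstance)
    }
    buffer.flip()
    return buffer
}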


Why simple automatic lightmap coords (usually) don't work

Some time ago, I experimented with new algorithms for dynamic global illumination in realtime after I was not too satisfied with my Voxel Cone Tracing. A friend of mine invented a method I am probably not allowed to talk about, but I was really impressed by it. It is related to this one to a certain degree: Real Time Point Based Global Illumination. The idea of having a global scene parameterization (for example a lightmap texture) is very compelling, especially if it can be used for global illumination calculation. Inspired by the method I just linked, the idea was: use a moderately sized lightmap (n*m or n*n) for the entire scene, make the lightmap texture a "deferred buffer" and store a texture with world position, normals, color and maybe other material properties. Afterwards, use brute force in the form of a compute shader to solve each texel's (which is effectively a scene surfel) illumination from incoming light (all other texels of the fat lightmap). Split the n*n computation into 4 or more parts and update the final texture every 4th or so frame. Evaluate the global illumination as a fullscreen effect in your deferred pipeline -> finished. The nice thing is that dynamic objects can not only receive but also contribute global illumination by updating the position (and other) textures as well. Multiple bounces can be implemented with simple iteration and ping-ponging...

While this sounds very, very nice - which it is - I tested it successfully with the famous Cornell box. This scene is small enough to have a very small lightmap. Also, the geometry is simple enough that the lightmap coords can be packed very tightly. Using an emissive box for dynamic global illumination contribution worked fine as well. But now back to the real world:

The first thing is that I'm not an artist, and most test models I use aren't lightmapped. Because I don't like to depend on someone else, my method should work completely automatically, maybe with some precomputation. Fine, let's automatically calculate lightmap coordinates. There are so many different algorithms on the internet that I'll keep it short: most automatic methods can't handle all possible cases of geometry. Many people on the internet (and who knows better :P) tend to say that simple planar projection is the only fast enough, simple enough all-round algorithm in existence. I implemented planar projection of triangles. Each (immutable) scene has lightmap coords precalculated. For this, all triangles are ordered by height, so that no triangle is higher than the one before it. Every triangle has to cover at least one texel, or it won't be rendered when rendering the lightmap. I used the triangles' world sizes to span a lightmap space, so for example all of my scene's faces cover an area of 1022*2344 or so. Afterwards, I determine how big my lightmap has to be - the scaling factor is applied to the worldspace lightmap coords at runtime. Everything fine so far. Worked like a charm. Here's a debug picture of a "simple" scene with some tens of thousands of triangles.



One can already see what the problem is: many small triangles.

It took me until this point in my implementation to realize that I hadn't thought it through. A box-like geometry makes a mesh use maybe 12 triangles. But already a small sphere can use 400 triangles and still only be 1 m³ in size. Without simplifying the meshes, I had to reserve at least one texel per triangle - even more when not tightly packed. My scene had 250k triangles, that's 500*500 with absolutely perfect packing. With padding and garbage, and the fact that larger triangles occupy more texels, I finally had to allocate a lightmap of 24000*1024... which obviously isn't possible to handle at all, even with brute force compute shaders.

So, I really wonder if there exists an efficient way of automatic lightmap coord generation without mesh simplification or complex segmentation algorithms. Goodbye, almighty realtime global illumination with fat lightmaps, back to Voxel Cone Tracing :)