Dienstag, 13. Februar 2018

Koin vs Vanilla Kotlin

Here we go again: I'm searching for a performant, convenient solution for dependency injection in Kotlin. Koin got my attention, because it seems to be simple and straightforward. Since it uses a lot of small inline functions and seems to have only a few hotspots where performance suckers could lurk, it seemed very promising. But I wouldn't want to use it in my game engine project until I can be sure that the performance impact would be negligible. So I did a ... probably totally flawed microbenchmark, that doesn't show anything, but I want to dump it here nonetheless.

So we have a simple service class and another service class that depends on the first service. The main module depends on the seconds service, so the chain has to be fulfilled.


class MainModuleKoin : Module() {
    override fun context(): Context = applicationContext {
        provide { ServiceA() }
        provide { ServiceB(get()) }
    }
}
class MainModuleVanilla(val serviceA: ServiceA, val serviceB: ServiceB)

class MainComponentKoin : KoinComponent {
    val bla by inject<ServiceB>()
}
class MainComponentVanilla(val bla: ServiceB)

class ServiceA
class ServiceB(val serviceA: ServiceA) {
}  

Using Koin, one can simply write a few lines and everything is wired together automatically. Note that the Koin context has to be started and stopped, which has to be excluded from the benchmark later.


@JvmStatic fun benchmarkKoin(): ServiceB {
    return MainComponentKoin().bla
}

@JvmStatic fun stopKoin() {
    closeKoin()
}

@JvmStatic fun startKoin() {
    startKoin(listOf(MainModuleKoin()))
}

@JvmStatic fun benchmarkVanilla(): ServiceB {
    return MainComponentVanilla(ServiceB(ServiceA())).bla
}

The benchmark is executed as follows, to ensure all object creation happens and context creation is done outside of the benchmarked code:

@State(Scope.Thread)
public static class MyState {

    @Setup(Level.Trial)
    public void doSetup() {
        KoinBenchmarkRunner.startKoin();
    }

    @TearDown(Level.Trial)
    public void doTearDown() {
        KoinBenchmarkRunner.stopKoin();
    }

}
@Benchmark
public void benchmarkKoin(Blackhole hole, MyState state) {
    hole.consume(KoinBenchmarkRunner.benchmarkKoin());
    hole.consume(state);
}
@Benchmark
public void benchmarkVanilla(Blackhole hole, MyState state) {
    hole.consume(KoinBenchmarkRunner.benchmarkVanilla());
    hole.consume(state);
}

 The result is somehow sobering

Benchmark                          Mode  Cnt          Score         Error  Units
BenchmarkRunner.benchmarkKoin     thrpt  200    1425585.082 ±   31179.345  ops/s
BenchmarkRunner.benchmarkVanilla  thrpt  200  106484919.110 ± 1121927.712  ops/s   

Even though I'm aware that this is an artificial benchmark that may be flawed, it's pretty much clear that using Koin will have a huge impact on performance, that could make program infrastrucutre slower by a factor of 100. Of course, we're talking about dependency injection at object creation time, which should be a rare case in a game engine. Nonetheless, not too good from my sight.

Mittwoch, 7. Februar 2018

Kotlin's scoped extensions micro-benchmarked

My last post was about an approach to use Kotlin's scoped extension methods to implement an application with data oriented design paradigm. Yes, I'm still coding that game engine, that's why I had to do a simple benchmark, just to get a feeling how performance could get better or worse. See it as a brain dump. Very unprofessional benchmark with the println statement, but I wanted to get the relation between the simple baseline implementation and the extension method version, like this:


import org.openjdk.jmh.annotations.Benchmark

interface Extension<ELEMENT>{
    fun ELEMENT.prettyPrint() { println("Default pretty " + this) }
}

object StringListExtension : Extension<String>

fun <T> someFrameWorkFunction(parameter : T, extensionProvider: Extension<T>) {
    with(extensionProvider) {
        parameter.prettyPrint()
    }
}

@Benchmark
fun extension() {
    someFrameWorkFunction("asd", StringListExtension)
}

fun String.prettyPrint() { println("Default pretty " + this) }

@Benchmark
fun baseline() {
    "asd".prettyPrint()
}

Surprising results again:


Benchmark                            Mode  Cnt       Score       Error  Units
BenchmarkRunner.benchmarkBaseline   thrpt  200  269087.160 ± 17915.393  ops/s
BenchmarkRunner.benchmarkExtension  thrpt  200  329648.131 ± 19646.005  ops/s

Once again, the opposite of my expectations.

Two-phase occlusion culling (part 2) - instance aware culling

I described the basics about the two-phase occlusion culling technique in the first part post, but there is one thing, that doesn't work well with this simple setup: Instancing or sometimes viewed as batching.


The problem is very simple, the solution however is not. Only a few engines managed to implement the technique, but it is unavoidable if one needs to render large batches or large amounts of instances. Quick recap, under the assumption that we compromise flexibility in favor of speed and use OpenGL 4.5 level graphics APIs:

Avoid buffer changes one, two different vertex layouts (for example for static and animated meshed)

Avoid program changes well defined firstpass program (deferred rendering) or subroutines in a global attribute buffer

Minimize drawcalls Usage of instanced and indirect rendering


The usage of draw commands and indirect rendering is nice enough. The amount of drawcommands is reduced by using instanced rendering for entities that have different uniform attributes but the same vertices and indices. This results in the command and entitiy buffers of the following form:


Entities
uint entityId 678
mat4 transform 000...
uint materialIndex 18
uint entityId 1566
mat4 transform 000...
uint materialIndex 2

Commands
vertexOffset 0
indexOffset 0
indexCount 99
instanceCount 2
vertexOffset 1200
indexOffset 99
indexCount 33
instanceCount 3

Firing indirect rendering call is now done with a count of two, because we have two commands. 5 objects will be drawn. The shaders are invoked several times and the entity data can be accessed with an index and fetched from the entities buffer. The resulting indices in the shader will look like

Shader Invocations
DrawId InstanceId Entity buffer index
0 0 0
0 1 1
1 0 ???
1 1 ???
1 2 ???

As mentioned in the previous blog post, there's no chance for us to know the entity index when the second command is executed, because the current offset is based on the amount of instances the previous commands draw. The solution is to use an additional offset buffer with size of the draw command buffer. For every command, we set the offset to the appropriate value when creating the commands on the cpu side. With instance based culling, this problem intensifies, because the cpu doesn't know the offset anymore. The calculation has to be done on the GPU now. My solution is still based on vertex shader kernels, but I will tell you later why this is probably problematic. First how it is done, because conceptionally, it will be the same for compute shaders. The layout now looks like this:

Shader Invocations
DrawId InstanceId Entity buffer index (command offset + InstanceId) Offset of the command
0 0 0 0
0 1 1 0
1 0 2 2
1 1 3 2
1 2 4 2


The first step is to determine the visibility for every single instance. Vertex shader kernels have an advantage here, because they can be arbitrarily large (compute shader groups are limited to 1024 or so). A single draw call can determine visibility for dozens of thousands of instances or entities. Combined with instancing, we can use DrawId and InstanceId to index into the command buffer and offset buffer (DrawId) and into the entity buffer (offset + InstanceId). Since the same kernel sizes are applied to every command, invocations are wasted if few commands have many more instances as others. So you might want to launch draw calls without instancing, one per command, which could be faster here.
The visibility is again stored into the visibility buffer, which has to be as large as the entity buffer now. With non-instanced culling, size of the command buffer was sufficient. One important thing has to be done now: Every visible entity thread has to increase a counter that is associated with the corresponding command - since this is done per invocation, atomic operations are needed. So we need another int buffer (just like the offset buffer), that holds the visible instances count per command.

The reason why this is so important is, that this is the bit of information that is used in a second step to append entity data and draw commands in parallel. This is done by an algorithm similar/equal to parallel prefix sum - you can google it. Tldr: Each command has to know how many visible instances all previous commands produce in total, so that it can append its own instances there.

I call the seconds step appending step, while the first one is the visibility computation step, that actually calculates the visibility of an instance. For a massive amount of instanced commands, one would probably want to launch a two-dimensional kernel of the size n*m where n is the amount of commands and m is max(maxInstanceCount, someArbitraryValue) or something. Again, with vastly different instanceCounts per command, multiple shader calls could give a benefit.

So now we need a target buffer that holds the entity data, a target buffer for the commands and a target buffer for the entity data offsets. Additionally, we need an atomic counter for the target command index and then an atomic counter per command, that is the current index of the entities per command. Each shader invocation has its own command index in the DrawID built-in and the instance index in the InstanceID  built-in. So we can calculate all information that is needed on the fly:

struct oldCommand = read from the bound command buffer with DrawID
uint  oldCommandEntityOffset = read from the oldOffsetBuffer with DrawID
uint oldEntityIndex = oldCommandEntityOffset + InstanceID
uint visibilityBufferIndex = oldEntityIndex
uint targetCommandIndex = atomicAdd(commandConuter, 1)
uint targetInstanceIndex = atomicAdd(commandCounters[targetCommandIndex], 1)
uint targetEntityIndex =  sumOfVisibleInstancesOfAllCommandsBeforeCurrentCommand

The current entity data can then be written to the targetEntityDataBuffer, as well as the offset for the instance/command. The command can be written by the first active thread (InstanceID == 0). The resulting buffers contain only the visible instances, offsets of the visible instances and drawcommands that can be used to only draw exactly those instances.

 Here's a video demonstrating occlusion and frustum culling with this technique:



Bonus: For the visibility computation step, it's nice to launch a vertex shader kernel, since the kernel can be arbitrarily large in two dimensions at max. Since there's no shared memory involved, this will probably perform better than the compute shader equivalent, maybe at the same speed, but it shouldn't be slower. The appending step needs an atomic operation per invocation, because every invocation increments the counter if the corresponding instance is visible (which the other invocations couldn't know). Instead of "global" shared memory, one could use shared memory of a compute shader group. Shared memory in a group is much faster than atomic operations between a gazillion vertex shader invocations, so I will implement this at some time and may be able to show a performance comparison.