Monday, September 17, 2018

Kind-of-structs on the JVM using Kotlin's delegated properties

Project Valhalla is on everyone's lips nowadays, but the problem is: it has been for years now, and there is no concrete schedule for when we can expect value types to be part of the JVM.

In games, there is a desperate need for value types, or at least for control over the object layout. Why? Because one has to share memory with the native side. For example, OpenGL lets you use a persistent mapped buffer, which - combined with multibuffering and your own synchronization - gives you a blazing fast multithreading approach for your engine. But OpenGL doesn't want to read your Java objects' headers, which is why you can't use regular serialization mechanisms and instead have to put your objects into a ByteBuffer float by float or int by int.

Using standard Java/JVM heap objects, one has to update all the objects and afterwards extract them into a ByteBuffer. This means two iterations. It would be better to have objects that use a ByteBuffer directly, in order to skip the buffer extraction completely.

Now there's Kotlin with its delegated properties. All the basic examples show how to use a hash map instance as backing storage for arbitrary properties of an object (https://kotlinlang.org/docs/reference/delegated-properties.html). This gave me the idea to use delegated properties to access a ByteBuffer object as backing storage for objects and structures of objects - just like structs do it in C.

interface Struct {
    val byteOffset: Int
    val buffer: ByteBuffer
}

// some missing magic here for property registration and local offset calculation
class FloatProperty(val localOffset: Int) {
    operator fun getValue(thisRef: Struct, property: KProperty<*>): Float =
        thisRef.buffer.getFloat(thisRef.byteOffset + localOffset)
}

class MyStruct : Struct {
    // .. missing magic here
    val myFloat by FloatProperty()
    val mySecondFloat by FloatProperty()
}

This results in a flat memory layout for a MyStruct instance that can be used directly by native APIs. So what I'm interested in is: how well does this perform compared to a vanilla Java approach? How much overhead do all those methods and delegate instances add? I tested several different implementations that differ in convenience for the user and in overall performance. One surprise for me was that none of my implementations (neither ByteBuffer nor Unsafe backed) is as fast as vanilla Java. There are some benchmarks on the internet that tell a different story, for example this one. I can't really tell you how that was achieved - just that I wasn't able to achieve similar results.
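To make the idea concrete, here is a minimal, runnable version of the sketch with hand-written local offsets instead of the registration magic (the names and the hardcoded layout are mine, so treat this as an illustration, not the engine's actual implementation):

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder
import kotlin.reflect.KProperty

interface Struct {
    val byteOffset: Int
    val buffer: ByteBuffer
}

// Delegate that reads/writes a float at a fixed local offset within the struct
class FloatProperty(val localOffset: Int) {
    operator fun getValue(thisRef: Struct, property: KProperty<*>): Float =
        thisRef.buffer.getFloat(thisRef.byteOffset + localOffset)

    operator fun setValue(thisRef: Struct, property: KProperty<*>, value: Float) {
        thisRef.buffer.putFloat(thisRef.byteOffset + localOffset, value)
    }
}

class MyStruct(override val buffer: ByteBuffer, override val byteOffset: Int = 0) : Struct {
    var myFloat by FloatProperty(0)
    var mySecondFloat by FloatProperty(4)
}

fun main() {
    val buffer = ByteBuffer.allocateDirect(8).order(ByteOrder.nativeOrder())
    val struct = MyStruct(buffer)
    struct.myFloat = 1.5f
    struct.mySecondFloat = 2.5f
    // The values land directly in the (native-order, direct) buffer
    println(struct.myFloat)     // 1.5
    println(buffer.getFloat(4)) // 2.5
}
```

A direct, native-order buffer is what a native API would consume; the delegate never touches object fields, only absolute positions in the buffer.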



Benchmark                                                          Mode  Cnt      Score     Error  Units
iterAndMutBufferDirect                                            thrpt   12  90626,796 ± 303,407  ops/s
iterAndMutKotlinDelegatedPropertySlidingWindowBuffer              thrpt   12  23695,594 ±  82,291  ops/s
iterAndMutKotlinDelegatedPropertyUnsafeSimpleSlidingWindowBuffer  thrpt   12  27906,315 ±  52,382  ops/s
iterAndMutKotlinDelegatedPropertyUnsafeSlidingWindowBuffer        thrpt   12  25736,322 ± 904,017  ops/s
iterAndMutKotlinSimpleSlidingWindowBuffer                         thrpt   12  27416,212 ± 959,016  ops/s
iterAndMutResizableStruct                                         thrpt   12  10204,870 ± 189,237  ops/s
iterAndMutSimpleSlidingWindowBuffer                               thrpt   12  27627,217 ± 122,119  ops/s
iterAndMutStructArray                                             thrpt   12  12714,642 ±  51,275  ops/s
iterAndMutStructArrayIndexed                                      thrpt   12  11110,882 ±  26,910  ops/s
iterAndMutVanilla                                                 thrpt   12  27111,335 ± 661,822  ops/s
iterStruct                                                        thrpt   12  13240,723 ±  40,612  ops/s
iterVanilla                                                       thrpt   12  21452,188 ±  46,380  ops/s


All benchmarks iterate over a collection of 5000 Vector3f instances. iterAndMutVanilla is just a regular ArrayList iteration with forEach, setting the three components of each vector. iterAndMutStructArray is my current implementation of a tight StructArray of Vector3fs with a sliding window iteration.

Vanilla Java iteration with mutation yields the baseline result of 27k operations. It's very interesting that a non-abstracted simple version with a direct ByteBuffer is three times as fast as the baseline, reaching 90k operations. Simple non-abstracted implementations with Kotlin's delegates bring us back down to baseline performance. My struct abstraction in its current form, a struct array class implementation, only reaches 50% of the baseline - quite a difference from the simple delegate approach, and only roughly a sixth of the simple direct ByteBuffer approach's performance.

I have to figure out why my abstractions degrade performance by such amounts - the generated bytecode looks pretty similar for all the versions. At the time of writing, Kotlin's inline classes are not stable enough to be used for delegates, so the delegates cause some class overhead here.

But even though there are some performance differences in this very micro benchmark, that doesn't necessarily mean other use cases show such dramatic differences as well. Additionally, the largest benefit my struct-alike implementation offers is that large and complex data structures can now be memcopied like this:


class MyStruct : Struct {
    // .. missing magic here
    var myFloat by FloatProperty()
    var mySecondFloat by FloatProperty()
}

val source = MyStruct().apply { myFloat = 5f }
val target = MyStruct()
source.copyTo(target) // simple extension method that copies a ByteBuffer
println(target.myFloat) // prints 5.0
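The copyTo extension used above could be as simple as a raw byte copy between the two backing buffers (a sketch; the sizeInBytes parameter and the byte-wise loop are my simplification - a real implementation would presumably track the struct size itself and use a bulk buffer copy):

```kotlin
import java.nio.ByteBuffer

// The Struct interface from the sketch above, repeated so the snippet is self-contained
interface Struct {
    val byteOffset: Int
    val buffer: ByteBuffer
}

// Copies this struct's backing bytes into the target struct's backing buffer.
fun Struct.copyTo(target: Struct, sizeInBytes: Int) {
    for (i in 0 until sizeInBytes) {
        target.buffer.put(target.byteOffset + i, buffer.get(byteOffset + i))
    }
}
```

Because the copy operates on plain bytes, it works for arbitrarily nested struct layouts without any per-field logic.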

This means no iteration over nested arrays, no complex copy constructors, and no even more complex nested invocations of them. Super handy for render state constructs in game engines - your whole render state instance can be mapped to an OpenGL struct and bound as a shader storage buffer :)

Tuesday, March 27, 2018

Programmatically compile Kotlin code

Here's a small braindump again, because I don't want to forget how I managed to compile Kotlin files programmatically. There are some very small discussions online, for example this one, but I find the existing usages too complex, and there were too many things that didn't work for me on the first few tries.

So there's a very convenient artifact that seems to contain everything one needs in order to call the Kotlin compiler from code, and it's published for every Kotlin release. Use it with


    compile group: 'org.jetbrains.kotlin', name: 'kotlin-compiler-embeddable', version: "$kotlin_version"

The following snippet instantiates a compiler, passes a file and some other properties to it and executes compilation. The resulting class file can be found in the given output directory.


val output = File("/home/myuser/out")

K2JVMCompiler().run {
    val args = K2JVMCompilerArguments().apply {
        freeArgs = listOf(File("/home/myuser/KotlinFile.kt").absolutePath)
        loadBuiltInsFromDependencies = true
        destination = output.absolutePath
        classpath = System.getProperty("java.class.path")
                .split(System.getProperty("path.separator"))
                .filter {
                    File(it).exists() && File(it).canRead()
                }.joinToString(":")
        noStdlib = true
        noReflect = true
        skipRuntimeVersionCheck = true
        reportPerf = true
    }
//        output.deleteOnExit()
    execImpl(
            PrintingMessageCollector(
                    System.out,
                    MessageRenderer.WITHOUT_PATHS, true),
            Services.EMPTY,
            args)
}

The usage depends on the project where it is used; for example, it could be necessary to add a different classpath, or the classpath plus a stdlib, or something else.
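Once the compiler has run, the class files in the output directory can be loaded at runtime. A minimal way to do that (the directory and class name below are placeholders - for a top-level file KotlinFile.kt, Kotlin generates a facade class typically named KotlinFileKt):

```kotlin
import java.io.File
import java.net.URLClassLoader

// Loads a class by name from a directory containing compiled .class files.
// The loader's parent defaults to the system class loader, so classes on the
// regular classpath remain visible too.
fun loadCompiledClass(outputDir: File, className: String): Class<*> {
    val loader = URLClassLoader(arrayOf(outputDir.toURI().toURL()))
    return loader.loadClass(className)
}
```

Usage would be something like `loadCompiledClass(File("/home/myuser/out"), "KotlinFileKt")`, followed by reflective instantiation or method lookup.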

EDIT: I'm sure I found this snippet, or a similar one, on the internet, but I can't find it anymore. If you find it, let me know and I will link it as the source.

Tuesday, February 13, 2018

Koin vs Vanilla Kotlin

Here we go again: I'm searching for a performant, convenient solution for dependency injection in Kotlin. Koin got my attention because it seems simple and straightforward. Since it uses a lot of small inline functions and seems to have only a few hotspots where performance eaters could lurk, it looked very promising. But I wouldn't want to use it in my game engine project until I can be sure that the performance impact is negligible. So I did a... probably totally flawed microbenchmark that doesn't prove anything, but I want to dump it here nonetheless.

So we have a simple service class and another service class that depends on the first. The main module depends on the second service, so the whole chain has to be fulfilled.


class MainModuleKoin : Module() {
    override fun context(): Context = applicationContext {
        provide { ServiceA() }
        provide { ServiceB(get()) }
    }
}
class MainModuleVanilla(val serviceA: ServiceA, val serviceB: ServiceB)

class MainComponentKoin : KoinComponent {
    val bla by inject<ServiceB>()
}
class MainComponentVanilla(val bla: ServiceB)

class ServiceA
class ServiceB(val serviceA: ServiceA)

Using Koin, one can simply write a few lines and everything is wired together automatically. Note that the Koin context has to be started and stopped, which has to be excluded from the benchmark later.


@JvmStatic fun benchmarkKoin(): ServiceB {
    return MainComponentKoin().bla
}

@JvmStatic fun stopKoin() {
    closeKoin()
}

@JvmStatic fun startKoin() {
    startKoin(listOf(MainModuleKoin()))
}

@JvmStatic fun benchmarkVanilla(): ServiceB {
    return MainComponentVanilla(ServiceB(ServiceA())).bla
}

The benchmark is executed as follows, to ensure all object creation happens and context creation is done outside of the benchmarked code:

@State(Scope.Thread)
public static class MyState {

    @Setup(Level.Trial)
    public void doSetup() {
        KoinBenchmarkRunner.startKoin();
    }

    @TearDown(Level.Trial)
    public void doTearDown() {
        KoinBenchmarkRunner.stopKoin();
    }

}
@Benchmark
public void benchmarkKoin(Blackhole hole, MyState state) {
    hole.consume(KoinBenchmarkRunner.benchmarkKoin());
    hole.consume(state);
}
@Benchmark
public void benchmarkVanilla(Blackhole hole, MyState state) {
    hole.consume(KoinBenchmarkRunner.benchmarkVanilla());
    hole.consume(state);
}

The result is somewhat sobering:

Benchmark                          Mode  Cnt          Score         Error  Units
BenchmarkRunner.benchmarkKoin     thrpt  200    1425585.082 ±   31179.345  ops/s
BenchmarkRunner.benchmarkVanilla  thrpt  200  106484919.110 ± 1121927.712  ops/s   

Even though I'm aware that this is an artificial benchmark that may be flawed, it's pretty clear that using Koin would have a huge impact on performance - it could make program infrastructure slower by a factor of 100. Of course, we're talking about dependency injection at object creation time, which should be rare in a game engine. Nonetheless, not too good from my point of view.

Wednesday, February 7, 2018

Kotlin's scoped extensions micro-benchmarked

My last post was about an approach that uses Kotlin's scoped extension methods to implement an application in a data oriented design paradigm. Yes, I'm still coding that game engine, and that's why I had to do a simple benchmark, just to get a feeling for how performance could get better or worse. See it as a brain dump. It's a very unprofessional benchmark because of the println statement, but I wanted to see the relation between the simple baseline implementation and the extension method version, like this:


import org.openjdk.jmh.annotations.Benchmark

interface Extension<ELEMENT>{
    fun ELEMENT.prettyPrint() { println("Default pretty " + this) }
}

object StringListExtension : Extension<String>

fun <T> someFrameWorkFunction(parameter : T, extensionProvider: Extension<T>) {
    with(extensionProvider) {
        parameter.prettyPrint()
    }
}

@Benchmark
fun extension() {
    someFrameWorkFunction("asd", StringListExtension)
}

fun String.prettyPrint() { println("Default pretty " + this) }

@Benchmark
fun baseline() {
    "asd".prettyPrint()
}

Surprising results again:


Benchmark                            Mode  Cnt       Score       Error  Units
BenchmarkRunner.benchmarkBaseline   thrpt  200  269087.160 ± 17915.393  ops/s
BenchmarkRunner.benchmarkExtension  thrpt  200  329648.131 ± 19646.005  ops/s

Once again, the opposite of my expectations.
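For what it's worth, the println distortion could be avoided by returning the string instead of printing it - this is my own reformulation of the same comparison, not what I actually benchmarked:

```kotlin
interface Extension<ELEMENT> {
    // Default implementation lives in the interface, as in the benchmark above
    fun ELEMENT.prettyPrint(): String = "Default pretty $this"
}

object StringExtension : Extension<String>

// The scoped-extension call path: the provider brings the member extension into scope
fun <T> prettyPrint(parameter: T, extensionProvider: Extension<T>): String =
    with(extensionProvider) { parameter.prettyPrint() }

fun main() {
    println(prettyPrint("asd", StringExtension)) // Default pretty asd
}
```

Returning a value and consuming it with a JMH Blackhole would measure the dispatch cost itself rather than console I/O.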

Two-phase occlusion culling (part 2) - instance aware culling

I described the basics of the two-phase occlusion culling technique in the first post, but there is one thing that doesn't work well with this simple setup: instancing, or, viewed differently, batching.


The problem is very simple; the solution, however, is not. Only a few engines have managed to implement the technique, but it is unavoidable if one needs to render large batches or large amounts of instances. A quick recap, under the assumption that we compromise flexibility in favor of speed and use OpenGL 4.5 level graphics APIs:

Avoid buffer changes: one or two different vertex layouts (for example for static and animated meshes)

Avoid program changes: a well-defined first-pass program (deferred rendering), or subroutines in a global attribute buffer

Minimize draw calls: usage of instanced and indirect rendering


The usage of draw commands and indirect rendering is nice enough. The amount of draw commands is reduced by using instanced rendering for entities that have different uniform attributes but share vertices and indices. This results in command and entity buffers of the following form:


Entities buffer
  entityId = 678    transform = 000...  materialIndex = 18
  entityId = 1566   transform = 000...  materialIndex = 2

Commands buffer
  vertexOffset = 0     indexOffset = 0   indexCount = 99  instanceCount = 2
  vertexOffset = 1200  indexOffset = 99  indexCount = 33  instanceCount = 3

The indirect rendering call is now fired with a count of two, because we have two commands; five objects will be drawn. The shaders are invoked several times, and the entity data can be accessed with an index and fetched from the entity buffer. The resulting indices in the shader will look like this:

Shader Invocations
  DrawId  InstanceId  Entity buffer index
  0       0           0
  0       1           1
  1       0           ???
  1       1           ???
  1       2           ???

As mentioned in the previous blog post, there's no way for us to know the entity index when the second command is executed, because the current offset is based on the amount of instances the previous commands draw. The solution is to use an additional offset buffer the size of the draw command buffer. For every command, we set the offset to the appropriate value when creating the commands on the CPU side. With instance-based culling, this problem intensifies, because the CPU doesn't know the offsets anymore - the calculation has to be done on the GPU now. My solution is still based on vertex shader kernels, but I will tell you later why this is probably problematic. First, how it is done, because conceptually it will be the same for compute shaders. The layout now looks like this:

Shader Invocations
  DrawId  InstanceId  Entity buffer index (command offset + InstanceId)  Command offset
  0       0           0                                                  0
  0       1           1                                                  0
  1       0           2                                                  2
  1       1           3                                                  2
  1       2           4                                                  2


The first step is to determine the visibility of every single instance. Vertex shader kernels have an advantage here, because they can be arbitrarily large (compute shader groups are limited to 1024 invocations or so). A single draw call can determine visibility for tens of thousands of instances or entities. Combined with instancing, we can use DrawId and InstanceId to index into the command buffer and offset buffer (DrawId) and into the entity buffer (offset + InstanceId). Since the same kernel size is applied to every command, invocations are wasted if a few commands have many more instances than others. So you might want to launch draw calls without instancing, one per command, which could be faster here.
The visibility is again stored in the visibility buffer, which now has to be as large as the entity buffer. With non-instanced culling, the size of the command buffer was sufficient. One important thing has to be done now: every visible entity's thread has to increase a counter associated with the corresponding command - since this happens per invocation, atomic operations are needed. So we need another int buffer (just like the offset buffer) that holds the visible instance count per command.

The reason why this is so important is that this is the bit of information used in the second step to append entity data and draw commands in parallel. This is done with an algorithm similar or equal to a parallel prefix sum - you can google it. Tl;dr: each command has to know how many visible instances all previous commands produce in total, so that it can append its own instances after them.
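Sequentially, the idea is an exclusive prefix sum over the per-command visible-instance counters; a CPU sketch of what the shader computes in parallel:

```kotlin
// Given the number of visible instances per command, computes where each
// command's first visible instance lands in the compacted entity buffer:
// the sum of all previous commands' visible instances.
fun exclusivePrefixSum(visibleCounts: IntArray): IntArray {
    val offsets = IntArray(visibleCounts.size)
    var running = 0
    for (i in visibleCounts.indices) {
        offsets[i] = running
        running += visibleCounts[i]
    }
    return offsets
}

fun main() {
    // Commands with 2, 3 and 1 visible instances start at offsets 0, 2 and 5
    println(exclusivePrefixSum(intArrayOf(2, 3, 1)).toList()) // [0, 2, 5]
}
```

On the GPU the same result is produced with a logarithmic-depth parallel scan instead of this linear loop.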

I call the second step the appending step, while the first one is the visibility computation step, which actually calculates the visibility of an instance. For a massive amount of instanced commands, one would probably want to launch a two-dimensional kernel of size n*m, where n is the amount of commands and m is max(maxInstanceCount, someArbitraryValue) or something like that. Again, with vastly different instance counts per command, multiple shader calls could be beneficial.

So now we need a target buffer that holds the entity data, a target buffer for the commands, and a target buffer for the entity data offsets. Additionally, we need an atomic counter for the target command index, and an atomic counter per command that serves as the current entity index within that command. Each shader invocation has its own command index in the DrawID built-in and its instance index in the InstanceID built-in, so we can calculate all the information we need on the fly:

struct oldCommand           = read from the bound command buffer with DrawID
uint oldCommandEntityOffset = read from the old offset buffer with DrawID
uint oldEntityIndex         = oldCommandEntityOffset + InstanceID
uint visibilityBufferIndex  = oldEntityIndex
uint targetCommandIndex     = atomicAdd(commandCounter, 1)
uint targetInstanceIndex    = atomicAdd(commandCounters[targetCommandIndex], 1)
uint targetEntityIndex      = sumOfVisibleInstancesOfAllCommandsBeforeCurrentCommand

The current entity data can then be written to the targetEntityDataBuffer, as well as the offset for the instance/command. The command itself can be written by the first active thread (InstanceID == 0). The resulting buffers contain only the visible instances, the offsets of the visible instances, and draw commands that draw exactly those instances.
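Put together, the appending step is a stream compaction. A sequential CPU sketch of what the shader invocations compute cooperatively (entity data reduced to a plain index, and the names are mine, not the engine's):

```kotlin
// A draw command: where its entities start in the entity buffer, and how many there are.
data class Command(val firstEntity: Int, val instanceCount: Int)

// Keeps only visible instances, rebuilding the commands so that each one draws
// exactly its visible instances from the compacted entity buffer.
fun compact(commands: List<Command>, visibility: BooleanArray): Pair<List<Int>, List<Command>> {
    val targetEntities = mutableListOf<Int>()
    val targetCommands = mutableListOf<Command>()
    for (command in commands) {
        val base = targetEntities.size // prefix sum of previous commands' visible counts
        for (instance in 0 until command.instanceCount) {
            val entityIndex = command.firstEntity + instance
            if (visibility[entityIndex]) targetEntities.add(entityIndex)
        }
        val visibleCount = targetEntities.size - base
        if (visibleCount > 0) targetCommands.add(Command(base, visibleCount))
    }
    return targetEntities to targetCommands
}
```

On the GPU, `base` is exactly the prefix-sum value per command and the inner loop becomes one invocation per instance, appending via the atomic counters described above.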

 Here's a video demonstrating occlusion and frustum culling with this technique:



Bonus: For the visibility computation step, it's nice to launch a vertex shader kernel, since the kernel can be arbitrarily large in up to two dimensions. Since there's no shared memory involved, this will probably perform better than the compute shader equivalent - maybe the same speed, but it shouldn't be slower. The appending step needs an atomic operation per invocation, because every invocation increments the counter if the corresponding instance is visible (which the other invocations can't know). Instead of "global" atomics, one could use the shared memory of a compute shader group. Shared memory within a group is much faster than atomic operations between a gazillion vertex shader invocations, so I will implement this at some point and may be able to show a performance comparison.

Thursday, January 11, 2018

Trying to use Kotlin's companion objects to mimic constructor type constraints

Some problems just stick in my mind forever. I searched for a nice way to enforce the presence of a static method via an interface. C# has type constraints for this need, but I find the syntax rather ugly. For example, one could require that a static factory method is present on the class of a given instance - a problem regularly solved with factories or factory methods, or even factory methods passed as parameters.
In Kotlin, interfaces can have companions. However, they are not part of the contract of the interface - the companion is not abstract, and you don't have to (and can't) override it. Methods defined on the companion can't be overridden either, so the following code doesn't do what one might expect:

class SomeClass {

    interface SomeInterface {
        companion object {
            fun xxx() {
                println("Base method by SomeInterface")
            }
        }
    }

    class MyImpl : SomeInterface {
        fun SomeInterface.Companion.xxx() {
            println("Overriden static method by MyImpl")
        }
    }

    companion object {
        fun test(impl: SomeInterface) {
            with(impl) {
                SomeInterface.xxx() // surprise here!
            }
        }
    }
}

This prints
...
Base method by SomeInterface
...
The corresponding bytecode piece in the test function is

GETSTATIC SomeClass$SomeInterface.Companion : LSomeClass$SomeInterface$Companion;
INVOKEVIRTUAL SomeClass$SomeInterface$Companion.xxx ()V   

Since the goal is to enforce the presence of a static method, it's likely that there is no base implementation at all, and therefore no default implementation in the interface. In this case, we can define a companion object in the interface without any methods. It's then possible to declare interface methods for the companion object, like fun SomeInterface.Companion.myMethod(). This is similar to including extension functions in an interface's contract, as described in my last blog post. Implementing classes then need to implement this method. Here's a complete example of how to do this.


class SomeClass {

    interface SomeInterface {
        companion object
        fun SomeInterface.Companion.xxx()
    }

    class MyImpl : SomeInterface {
        override fun SomeInterface.Companion.xxx() {
            println("Overriden static method by MyImpl")
        }
    }

    class MyOtherImpl : SomeInterface {
        override fun SomeInterface.Companion.xxx() {
            println("Overriden static method by MyOtherImpl")
        }
    }

    companion object {
        fun test(impl: SomeClass.SomeInterface) {
            with(impl) {
                SomeClass.SomeInterface.xxx()
            }
        }
    }
}
fun main(args: Array<String>) {
    test(SomeClass.MyImpl())
    test(SomeClass.MyOtherImpl())
}

It will print
...
Overriden static method by MyImpl
Overriden static method by MyOtherImpl
...

The corresponding bytecode of the test function now looks like this:


GETSTATIC SomeClass$SomeInterface.Companion : LSomeClass$SomeInterface$Companion;
INVOKEINTERFACE SomeClass$SomeInterface.xxx (LSomeClass$SomeInterface$Companion;)V

So the old invokevirtual became an invokeinterface. This most likely comes with a small performance overhead. But hey, can't have everything.

This construct allows for factory-like constructor functions. Sadly, operator functions are not allowed to be declared this way on interfaces. That means the constructor method has to be named create or something similar, leading effectively to an interface function signature of

fun SomeInterface.Companion.create() : T

and a usage like

with(impl) {
    println("Created ${SomeInterface.create().javaClass}")
}
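Spelled out with the mechanism from the complete example above, the create-based variant looks like this (a sketch; createAnother is just my name for the call site):

```kotlin
interface SomeInterface {
    companion object
    // Every implementation must provide this "static" factory method
    fun SomeInterface.Companion.create(): SomeInterface
}

class MyImpl : SomeInterface {
    override fun SomeInterface.Companion.create(): SomeInterface = MyImpl()
}

// The given instance brings its create implementation into scope
fun createAnother(impl: SomeInterface): SomeInterface =
    with(impl) { SomeInterface.create() }

fun main() {
    println(createAnother(MyImpl()).javaClass.simpleName) // MyImpl
}
```

Note the irony: an instance is already required just to reach the factory, which is exactly the complaint that follows.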

This defeats the whole purpose for me, because I would then rather type

println("Created ${impl.create().javaClass}")

The only syntax that would be worth it to me would be this - based on an invoke operator function defined in the interface, usable through the interface's companion object:

println("Created ${SomeInterface().javaClass}") //Not valid

Okay, at least I tried :)