Montag, 17. September 2018

Kind-of-structs on the JVM using Kotlin's delegated properties

Project Valhalla is on everyone's lips nowadays, but the problem is: It is for years now and there's no concrete schedule when we can expect value types to be part of the JVM.

In games, there is a desperate need for value types or at least control about the object layout. Why? Because one has to share memory with the native side. For example OpenGL lets you use a persistent mapped buffer - combined with multibuffering and your own synchronization gives you a blazing fast multithreading approach for your engine. But OpenGL doesn't want to read your Java object's headers, that's why you can't use regular serialzation mechanisms and instead you have to put your objects into a ByteBuffer float by float or int by int.

Using standard Java/JVM heap objects, one has to update all the objects and afterwards extract them to a ByteBuffer. This means two iterations. Better would be to have objects that use a ByteBuffer directly, in order to be able to skip the buffer extraction completely.

Now there's Kotlin with its delegated properties. All the basic examples show how to use a hash map instance as a backing storage for arbitrary properties of an object (https://kotlinlang.org/docs/reference/delegated-properties.html). This led me to the idea to use delegated properties to access a ByteBuffer object as a backing storage for objects and structures of objects - just like structs in C do it.

interface Struct { byteOffset: Int buffer: ByteBuffer } // some missing magic here for property registration and local offset calculation class FloatProperty(val localOffset) { inline operator fun getValue(thisRef: Struct, KProperty<*,*>): Float { thisRef.buffer.getFloat(thisRef.byteOffset+localOffset) } } class MyStruct: Struct { // .. missing magic here val myFloat by FloatProperty() val mySecondFloat by FloatProperty() } This will result a flat Layout for a MyStruct instance. This can be used directly by native APIs. So what I'm interested in is, how well does this perform compared to Vanilla Java approach? How much overhead is there for all those methods and delegate instances? I tested several different implementations that differ in convenience for the user and overall performance. One surprise for me was, that none of my implementations (neither ByteBuffer nor Unsafe backed) is as fast as vanilla Java. There are some benchmarks on the internet that tell a different story, for example this one. I can't really tell you how it was achieved. Just that I wasn't able to achieve similar results.



Benchmark                                                       
Mode  Cnt      Score     Error  Units
iterAndMutBufferDirect                                         
thrpt   12  90626,796 ± 303,407  ops/s
iterAndMutKotlinDelegatedPropertySlidingWindowBuffer           
thrpt   12  23695,594 ±  82,291  ops/s
iterAndMutKotlinDelegatedPropertyUnsafeSimpleSlidingWindowBuffer
thrpt   12  27906,315 ±  52,382  ops/s
iterAndMutKotlinDelegatedPropertyUnsafeSlidingWindowBuffer     
thrpt   12  25736,322 ± 904,017  ops/s
iterAndMutKotlinSimpleSlidingWindowBuffer                       
thrpt   12  27416,212 ± 959,016  ops/s
iterAndMutResizableStruct                                       
thrpt   12  10204,870 ± 189,237  ops/s
iterAndMutSimpleSlidingWindowBuffer                             
thrpt   12  27627,217 ± 122,119  ops/s
iterAndMutStructArray                                           
thrpt   12  12714,642 ±  51,275  ops/s
iterAndMutStructArrayIndexed                                   
thrpt   12  11110,882 ±  26,910  ops/s
iterAndMutVanilla                                               
thrpt   12  27111,335 ± 661,822  ops/s
iterStruct                                                     
thrpt   12  13240,723 ±  40,612  ops/s
iterVanilla                                                     
thrpt   12  21452,188 ±  46,380  ops/s


All benchmarks iterate over a collection of 5000 Vector3f instances. iterAndMutVanilla is just a regular ArrayList iteration with forEach, setting the three components of each vector. iterAndMutStruct is my current implementation of a tight StructArray of Vector3fs with a sliding window iteration.

Vanilla Java iteration with mutation yields the baseline results with 27k operations. It's very intersting, that a non-abstracted simple version with a direct bytebuffer is three times as fast as the baseline, reaching 90k operations. Simple non-abstracted implementations with Kotlin's delegates brings us down to the baseline performance again. My struct abstraction in the current implementation with a struct array class implementation can only reach 50% of the baseline - quite a difference between the simple delegate approach and only a rough sixth of the simple direct bytebuffer approache's performance.

I have to figure out why my abstractions degrade performance by such amounts - the generated bytecode looks pretty similar for all the versions. At the time of writing, Kotlin's inline classes are not stable enough for delegate usage, so delegates cause some class overhead here.

But even though there are some performance differences in this very micro benchmark, it doesn't necessarily mean that other use cases show such dramatic differences as well. Additionally, the largest benefit my struct-alike implementation offers is, that now large and complex datastructures can be memcopied like this:


class MyStruct: Struct { // .. missing magic here val myFloat by FloatProperty() val mySecondFloat by FloatProperty() } val source = MyStruct().apply { myFloat = 5 } val target = MyStruct() source.copyTo(target) // Simple extension method that copies a bytebuffer println(target.myFloat) // prints 5

This means no iteration over nested arrays, complex copy constructors and even more complex nested invocation of them. Super handy for renderstate constructs in game engines - your whole renderstate instance can be mapped to a OpenGL struct and mapped as a shader storage buffer :)