In games, there is a desperate need for value types, or at least for control over object layout. Why? Because memory has to be shared with the native side. For example, OpenGL lets you use a persistently mapped buffer; combined with multibuffering and your own synchronization, this gives you a blazing fast multithreading approach for your engine. But OpenGL doesn't want to read your Java objects' headers, which is why you can't use regular serialization mechanisms and instead have to put your objects into a ByteBuffer float by float or int by int.
With standard Java/JVM heap objects, one has to update all the objects and afterwards extract them into a ByteBuffer. That means two iterations. It would be better to have objects that use a ByteBuffer as their storage directly, so the extraction step can be skipped completely.
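To make the difference concrete, here is a minimal sketch of both variants. The Vector3f class, the function names and the update values are assumptions made up for illustration; they are not taken from any engine code.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Hypothetical heap-allocated vector, standing in for any regular JVM object.
class Vector3f(var x: Float = 0f, var y: Float = 0f, var z: Float = 0f)

// Iteration 1: mutate the heap objects.
fun update(vectors: List<Vector3f>) {
    for (v in vectors) {
        v.x += 1f
        v.y += 2f
        v.z += 3f
    }
}

// Iteration 2: extract every component into the buffer, float by float.
fun extract(vectors: List<Vector3f>, target: ByteBuffer) {
    target.clear()
    for (v in vectors) {
        target.putFloat(v.x).putFloat(v.y).putFloat(v.z)
    }
    target.flip()
}

// Single iteration: the buffer itself is the storage, so nothing has to be extracted.
fun updateInPlace(buffer: ByteBuffer, count: Int) {
    for (i in 0 until count) {
        val base = i * 3 * Float.SIZE_BYTES
        buffer.putFloat(base, buffer.getFloat(base) + 1f)
        buffer.putFloat(base + 4, buffer.getFloat(base + 4) + 2f)
        buffer.putFloat(base + 8, buffer.getFloat(base + 8) + 3f)
    }
}

fun main() {
    val count = 4
    val bytes = count * 3 * Float.SIZE_BYTES

    // Two iterations: heap objects first, then the copy into a staging buffer.
    val vectors = List(count) { Vector3f() }
    val staging = ByteBuffer.allocateDirect(bytes).order(ByteOrder.nativeOrder())
    update(vectors)
    extract(vectors, staging)

    // One iteration: work on the (zero-initialized) buffer directly.
    val direct = ByteBuffer.allocateDirect(bytes).order(ByteOrder.nativeOrder())
    updateInPlace(direct, count)

    // Both buffers now hold the same floats.
    println(staging.getFloat(8) == direct.getFloat(8))
}
```

The second variant is fast but unpleasant to write by hand, which is the motivation for hiding the offset arithmetic behind properties.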
Now there's Kotlin with its delegated properties. All the basic examples show how to use a hash map instance as backing storage for arbitrary properties of an object (https://kotlinlang.org/docs/reference/delegated-properties.html). This gave me the idea of using delegated properties to access a ByteBuffer as backing storage for objects and structures of objects, just like structs in C.
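import java.nio.ByteBuffer
import kotlin.reflect.KProperty

interface Struct {
    val byteOffset: Int     // where this struct starts inside the buffer
    val buffer: ByteBuffer  // shared backing storage
}
// some missing magic here for property registration and local offset calculation
class FloatProperty(val localOffset: Int) {
    // reads the float at the struct's offset plus this property's local offset
    inline operator fun getValue(thisRef: Struct, property: KProperty<*>): Float =
        thisRef.buffer.getFloat(thisRef.byteOffset + localOffset)
    // writes the float back into the same slot
    inline operator fun setValue(thisRef: Struct, property: KProperty<*>, value: Float) {
        thisRef.buffer.putFloat(thisRef.byteOffset + localOffset, value)
    }
}
class MyStruct: Struct {
    // .. missing magic here (it also supplies localOffset to each FloatProperty)
    var myFloat by FloatProperty()
    var mySecondFloat by FloatProperty()
}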
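The "missing magic" is not shown here. One way it could work, purely as an assumption for illustration, is a base class that hands out local offsets as properties register themselves. BaseStruct, register, float() and sizeInBytes are hypothetical names, not the real implementation.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder
import kotlin.reflect.KProperty

// Hypothetical base class: tracks how many bytes the registered properties occupy.
abstract class BaseStruct {
    var byteOffset: Int = 0          // start of this struct inside the buffer
    lateinit var buffer: ByteBuffer  // backing storage, assigned after construction
    var sizeInBytes: Int = 0         // grows while the properties register
        private set

    // Reserves `bytes` bytes and returns the local offset of the reserved slot.
    protected fun register(bytes: Int): Int {
        val localOffset = sizeInBytes
        sizeInBytes += bytes
        return localOffset
    }

    // Factory that plays the role of the argument-less FloatProperty() in the text.
    protected fun float() = FloatProperty(register(Float.SIZE_BYTES))
}

class FloatProperty(private val localOffset: Int) {
    operator fun getValue(thisRef: BaseStruct, property: KProperty<*>): Float =
        thisRef.buffer.getFloat(thisRef.byteOffset + localOffset)

    operator fun setValue(thisRef: BaseStruct, property: KProperty<*>, value: Float) {
        thisRef.buffer.putFloat(thisRef.byteOffset + localOffset, value)
    }
}

class MyStruct : BaseStruct() {
    var myFloat by float()        // registered first, local offset 0
    var mySecondFloat by float()  // registered second, local offset 4
}

fun main() {
    val struct = MyStruct()
    struct.buffer = ByteBuffer.allocateDirect(struct.sizeInBytes).order(ByteOrder.nativeOrder())
    struct.myFloat = 1.5f
    struct.mySecondFloat = 2.5f
    // Read the raw buffer to confirm the second property lives 4 bytes in.
    println(struct.buffer.getFloat(4)) // prints 2.5
}
```

Because delegates are created in declaration order, the declaration order of the properties defines the memory layout, much like field order in a C struct (ignoring padding and alignment rules, which this sketch does not handle).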
This results in a flat layout for a MyStruct instance, which native APIs can use directly. So what I'm interested in is: how well does this perform compared to the vanilla Java approach? How much overhead do all those methods and delegate instances add?
I tested several implementations that differ in convenience for the user and in overall performance. One surprise for me was that none of my struct implementations (neither ByteBuffer- nor Unsafe-backed) is faster than vanilla Java; at best they roughly match it. There are some benchmarks on the internet that tell a different story, for example this one. I can't really tell you how that was achieved, just that I wasn't able to achieve similar results.
Benchmark                                                          Mode  Cnt      Score     Error  Units
iterAndMutBufferDirect                                            thrpt   12  90626,796 ± 303,407  ops/s
iterAndMutKotlinDelegatedPropertySlidingWindowBuffer              thrpt   12  23695,594 ±  82,291  ops/s
iterAndMutKotlinDelegatedPropertyUnsafeSimpleSlidingWindowBuffer  thrpt   12  27906,315 ±  52,382  ops/s
iterAndMutKotlinDelegatedPropertyUnsafeSlidingWindowBuffer        thrpt   12  25736,322 ± 904,017  ops/s
iterAndMutKotlinSimpleSlidingWindowBuffer                         thrpt   12  27416,212 ± 959,016  ops/s
iterAndMutResizableStruct                                         thrpt   12  10204,870 ± 189,237  ops/s
iterAndMutSimpleSlidingWindowBuffer                               thrpt   12  27627,217 ± 122,119  ops/s
iterAndMutStructArray                                             thrpt   12  12714,642 ±  51,275  ops/s
iterAndMutStructArrayIndexed                                      thrpt   12  11110,882 ±  26,910  ops/s
iterAndMutVanilla                                                 thrpt   12  27111,335 ± 661,822  ops/s
iterStruct                                                        thrpt   12  13240,723 ±  40,612  ops/s
iterVanilla                                                       thrpt   12  21452,188 ±  46,380  ops/s
All benchmarks iterate over a collection of 5000 Vector3f instances. iterAndMutVanilla is just a regular ArrayList iteration with forEach, setting the three components of each vector. iterAndMutStructArray is my current implementation of a tight StructArray of Vector3fs with a sliding window iteration.
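For readers unfamiliar with the term, a sliding window iteration reuses a single view object and only moves its offset from element to element, instead of allocating one object per element. The following is a rough sketch of that idea under my own assumptions; Vector3fView and forEachVector are hypothetical names and this is not the benchmarked StructArray code.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Hypothetical view over one Vector3f (three floats) inside a shared buffer.
class Vector3fView(private val buffer: ByteBuffer) {
    // Moved by the iteration; this is the "sliding window".
    var byteOffset = 0

    var x: Float
        get() = buffer.getFloat(byteOffset)
        set(value) { buffer.putFloat(byteOffset, value) }
    var y: Float
        get() = buffer.getFloat(byteOffset + 4)
        set(value) { buffer.putFloat(byteOffset + 4, value) }
    var z: Float
        get() = buffer.getFloat(byteOffset + 8)
        set(value) { buffer.putFloat(byteOffset + 8, value) }

    companion object {
        const val SIZE_IN_BYTES = 3 * Float.SIZE_BYTES
    }
}

// Calls `action` once per element, reusing one view instead of allocating one per element.
inline fun forEachVector(buffer: ByteBuffer, count: Int, action: (Vector3fView) -> Unit) {
    val window = Vector3fView(buffer)
    for (i in 0 until count) {
        window.byteOffset = i * Vector3fView.SIZE_IN_BYTES
        action(window)
    }
}

fun main() {
    val count = 5000
    val buffer = ByteBuffer
        .allocateDirect(count * Vector3fView.SIZE_IN_BYTES)
        .order(ByteOrder.nativeOrder())

    // Same workload as the benchmarks: set the three components of every vector.
    forEachVector(buffer, count) { v ->
        v.x = 1f
        v.y = 2f
        v.z = 3f
    }

    // z component of the last vector, read straight from the buffer
    println(buffer.getFloat((count - 1) * Vector3fView.SIZE_IN_BYTES + 8)) // prints 3.0
}
```

The catch of this pattern is that the view must not be stored or leaked out of the loop body, because its contents change on the next iteration.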
Vanilla Java iteration with mutation yields the baseline of about 27k ops/s. Interestingly, a simple non-abstracted version working on a direct ByteBuffer is more than three times as fast as the baseline, reaching about 90k ops/s. Simple non-abstracted implementations with Kotlin's delegates bring us back down to baseline performance (roughly 24k to 28k ops/s). My struct abstraction in its current implementation, with a struct array class, only reaches about half of the baseline (10k to 13k ops/s): quite a difference from the simple delegate approach, and only about a seventh of the simple direct ByteBuffer approach's performance.
I still have to figure out why my abstractions degrade performance this much, since the generated bytecode looks pretty similar for all the versions. At the time of writing, Kotlin's inline classes are not stable enough for delegate usage, so the delegates cause some class overhead here.
But even though there are performance differences in this micro benchmark, that doesn't necessarily mean other use cases show such dramatic differences as well. Additionally, the largest benefit of my struct-like implementation is that large and complex data structures can now be memcopied like this:
class MyStruct: Struct {
    // .. missing magic here
    var myFloat by FloatProperty()  // var, because the value is assigned below
    var mySecondFloat by FloatProperty()
}
val source = MyStruct().apply {
    myFloat = 5f
}
val target = MyStruct()
source.copyTo(target) // Simple extension method that copies a bytebuffer
println(target.myFloat) // prints 5.0
This means no iteration over nested arrays, no complex copy constructors and no even more complex nested invocations of them. Super handy for render state constructs in game engines: your whole render state instance can be mapped to an OpenGL struct and mapped as a shader storage buffer :)
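The copyTo extension itself is not shown above. A minimal sketch of what such a method could look like, assuming each struct owns its own buffer of a known size, follows. FlatStruct and sizeInBytes are hypothetical stand-ins, not the real Struct interface.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Hypothetical minimal struct: a fixed-size block of bytes in its own direct buffer.
class FlatStruct(val sizeInBytes: Int) {
    val buffer: ByteBuffer =
        ByteBuffer.allocateDirect(sizeInBytes).order(ByteOrder.nativeOrder())
}

// Copies the raw bytes of this struct into `target` without touching individual fields.
fun FlatStruct.copyTo(target: FlatStruct) {
    require(target.sizeInBytes >= sizeInBytes) { "target is too small" }
    // duplicate() shares the content but has its own position and limit,
    // so the cursor state of the original buffers is left alone.
    val src = buffer.duplicate()
    src.clear()
    val dst = target.buffer.duplicate()
    dst.clear()
    dst.put(src) // bulk copy of all bytes
}

fun main() {
    val source = FlatStruct(8).apply { buffer.putFloat(0, 5f) }
    val target = FlatStruct(8)
    source.copyTo(target)
    println(target.buffer.getFloat(0)) // prints 5.0
}
```

However deeply nested the struct is, the copy stays a single bulk put, because nesting only changes offsets inside the same block of bytes.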