Here we go again: I'm searching for a performant, convenient dependency injection solution for Kotlin. Koin caught my attention because it seems simple and straightforward. Since it uses a lot of small inline functions and seems to have only a few hotspots where performance problems could lurk, it looked very promising. But I wouldn't want to use it in my game engine project until I can be sure that the performance impact is negligible. So I did a ... probably totally flawed microbenchmark that doesn't prove anything, but I want to dump it here nonetheless.
So we have a simple service class and another service class that depends on the first one. The main module depends on the second service, so the whole chain has to be fulfilled.
Using Koin, one can simply write a few lines and everything is wired together automatically. Note that the Koin context has to be started and stopped, which has to be excluded from the benchmark later.
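A minimal sketch of what that wiring could look like, assuming hypothetical ServiceA and ServiceB classes and a current Koin API (the exact definitions in the original code may differ):

```kotlin
import org.koin.core.context.startKoin
import org.koin.core.context.stopKoin
import org.koin.dsl.module

class ServiceA
class ServiceB(val serviceA: ServiceA)

// One module wires the whole chain; get() resolves constructor dependencies.
val appModule = module {
    single { ServiceA() }
    single { ServiceB(get()) }
}

fun main() {
    // Starting and stopping the Koin context is excluded from the measurement.
    val koin = startKoin { modules(appModule) }.koin
    val serviceB: ServiceB = koin.get() // this resolution is what gets benchmarked
    stopKoin()
}
```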
Even though I'm aware that this is an artificial benchmark that may be flawed, it's pretty clear that using Koin will have a huge performance impact, one that could make program infrastructure slower by a factor of 100. Of course, we're talking about dependency injection at object creation time, which should be a rare event in a game engine. Nonetheless, not too good from my point of view.
My last post was about an approach that uses Kotlin's scoped extension methods to implement an application in a data-oriented design paradigm. Yes, I'm still coding that game engine; that's why I had to do a simple benchmark, just to get a feeling for how performance could get better or worse. See it as a brain dump. It's a very unprofessional benchmark with its println statement, but I wanted to get the relation between the simple baseline implementation and the extension method version.
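A minimal sketch of what such a println-based comparison could look like; updateBaseline and updateWithExtensions are hypothetical placeholders for the two implementations, not names from the original code:

```kotlin
import kotlin.system.measureNanoTime

// Placeholder implementations: a plain baseline version versus a version
// built on scoped extension methods.
fun updateBaseline() { /* ... */ }
fun updateWithExtensions() { /* ... */ }

fun main() {
    val iterations = 1_000_000
    val baseline = measureNanoTime { repeat(iterations) { updateBaseline() } }
    val extensions = measureNanoTime { repeat(iterations) { updateWithExtensions() } }
    // the infamous println "reporting"
    println("baseline: $baseline ns, extensions: $extensions ns")
}
```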
I described the basics of the two-phase occlusion culling technique in the first post of this series, but there is one thing that doesn't work well with this simple setup: instancing, sometimes viewed as batching. The problem is very simple; the solution, however, is not. Only a few engines have managed to implement the technique, but it is unavoidable if one needs to render large batches or large amounts of instances. A quick recap, under the assumption that we compromise flexibility in favor of speed and use OpenGL 4.5 level graphics APIs:
- Avoid buffer changes: one or two different vertex layouts (for example for static and animated meshes)
- Avoid program changes: a well-defined first-pass program (deferred rendering), or subroutines in a global attribute buffer
- Minimize draw calls: usage of instanced and indirect rendering
The usage of draw commands and indirect rendering is nice enough. The amount of draw commands is reduced by using instanced rendering for entities that have different uniform attributes but share the same vertices and indices. This results in command and entity buffers of the following form:
Entities

| Field | Entity 0 | Entity 1 |
|---|---|---|
| uint entityId | 678 | 1566 |
| mat4 transform | 000... | 000... |
| uint materialIndex | 18 | 2 |

Commands

| Field | Command 0 | Command 1 |
|---|---|---|
| vertexOffset | 0 | 1200 |
| indexOffset | 0 | 99 |
| indexCount | 99 | 33 |
| instanceCount | 2 | 3 |
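As a CPU-side illustration, the two buffers above could be mirrored with hypothetical Kotlin types like these (field order follows the tables, not necessarily the engine's actual memory layout):

```kotlin
// Hypothetical CPU-side mirrors of the entity and command buffers above.
class Entity(val entityId: Int, val transform: FloatArray, val materialIndex: Int)

data class DrawCommand(
    val vertexOffset: Int,
    val indexOffset: Int,
    val indexCount: Int,
    val instanceCount: Int,
)

// The two commands from the table: 2 + 3 = 5 instances in total.
val commands = listOf(
    DrawCommand(vertexOffset = 0, indexOffset = 0, indexCount = 99, instanceCount = 2),
    DrawCommand(vertexOffset = 1200, indexOffset = 99, indexCount = 33, instanceCount = 3),
)
```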
The indirect rendering call is now fired with a count of two, because we have two commands; five objects will be drawn in total. The shaders are invoked several times, and the entity data can be accessed with an index and fetched from the entity buffer. The resulting indices in the shader will look like this:
Shader Invocations

| DrawId | InstanceId | Entity buffer index |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | ??? |
| 1 | 1 | ??? |
| 1 | 2 | ??? |
As mentioned in the previous blog post, there's no way for the shader to know the entity index when the second command is executed, because the correct offset depends on how many instances the previous commands draw. The solution is an additional offset buffer with the same size as the draw command buffer. For every command, we set the offset to the appropriate value when creating the commands on the CPU side, as shown in the sketch below. With instance-based culling, this problem intensifies, because the CPU doesn't know the offsets anymore; the calculation now has to be done on the GPU.
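For the non-instanced case, the CPU-side offset creation is just a running sum over the instance counts; a minimal sketch, reusing the hypothetical DrawCommand class from above:

```kotlin
// One offset per command: the entity-buffer index of the command's first
// instance, i.e. the sum of all previous commands' instance counts.
fun computeEntityOffsets(commands: List<DrawCommand>): IntArray {
    val offsets = IntArray(commands.size)
    var runningTotal = 0
    for (i in commands.indices) {
        offsets[i] = runningTotal
        runningTotal += commands[i].instanceCount
    }
    return offsets
}
```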
My solution is still based on vertex shader kernels, but I will tell you later why this is probably problematic. First, how it is done, because conceptually it will be the same for compute shaders. The layout now looks like this:
Shader Invocations

| DrawId | InstanceId | Entity buffer index (command offset + InstanceId) | Offset of the command |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 0 |
| 1 | 0 | 2 | 2 |
| 1 | 1 | 3 | 2 |
| 1 | 2 | 4 | 2 |
The first step is to determine the visibility of every single instance. Vertex shader kernels have an advantage here, because they can be arbitrarily large (compute shader work groups are limited to roughly 1024 invocations). A single draw call can determine visibility for tens of thousands of instances or entities. Combined with instancing, we can use DrawId and InstanceId to index into the command buffer and the offset buffer (DrawId) and into the entity buffer (offset + InstanceId). Since the same kernel size is applied to every command, invocations are wasted if a few commands have many more instances than others. So you might want to launch draw calls without instancing, one per command, which could be faster here.
The visibility is again stored in the visibility buffer, which now has to be as large as the entity buffer; with non-instanced culling, the size of the command buffer was sufficient. One important thing has to be done now: every thread whose entity is visible has to increment a counter associated with the corresponding command. Since this is done per invocation, atomic operations are needed. So we need another int buffer (just like the offset buffer) that holds the visible instance count per command.
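A sequential CPU reference of what each (DrawId, InstanceId) invocation does in this step might look like the following sketch; isVisible stands in for the actual frustum/occlusion test, and all names are illustrative assumptions:

```kotlin
// Reference of the visibility computation step, written sequentially.
fun visibilityStep(
    commands: List<DrawCommand>,
    offsets: IntArray,          // per-command entity offsets, as above
    entities: List<Entity>,
    visibility: IntArray,       // one flag per entity
    visibleCounts: IntArray,    // one counter per command (atomic on the GPU)
    isVisible: (Entity) -> Boolean,
) {
    for (drawId in commands.indices) {                                // DrawId
        for (instanceId in 0 until commands[drawId].instanceCount) {  // InstanceId
            val entityIndex = offsets[drawId] + instanceId
            if (isVisible(entities[entityIndex])) {
                visibility[entityIndex] = 1
                visibleCounts[drawId]++   // atomicAdd on the GPU
            }
        }
    }
}
```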
The reason this is so important: it is exactly the information a second step uses to append entity data and draw commands in parallel. This is done with an algorithm similar or equal to a parallel prefix sum - you can google it. Tl;dr: each command has to know how many visible instances all previous commands produce in total, so that it can append its own instances after them.
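Sequentially, that is a plain exclusive prefix sum over the per-command counters from the visibility step; on the GPU it runs in parallel, but the result is the same. A sketch:

```kotlin
// Entry i is the total number of visible instances of all commands before
// command i, i.e. where command i may append its own visible instances.
fun baseOffsets(visibleCounts: IntArray): IntArray {
    val result = IntArray(visibleCounts.size)
    for (i in 1 until visibleCounts.size) {
        result[i] = result[i - 1] + visibleCounts[i - 1]
    }
    return result
}
```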
I call the second step the appending step, while the first one is the visibility computation step, which actually calculates the visibility of an instance. For a massive amount of instanced commands, one would probably want to launch a two-dimensional kernel of size n*m, where n is the amount of commands and m is max(maxInstanceCount, someArbitraryValue) or something like that. Again, with vastly different instance counts per command, multiple shader calls could be beneficial.
So now we need a target buffer that holds the entity data, a target buffer for the commands, and a target buffer for the entity data offsets. Additionally, we need an atomic counter for the target command index, and one atomic counter per command that is the current index into that command's entities. Each shader invocation has its command index in the DrawID built-in and its instance index in the InstanceID built-in, so we can calculate all the information we need on the fly:
// the source command this invocation belongs to
Command oldCommand = commandBuffer[DrawID];
// the precomputed entity offset of that command
uint oldCommandEntityOffset = oldOffsetBuffer[DrawID];
uint oldEntityIndex = oldCommandEntityOffset + InstanceID;
uint visibilityBufferIndex = oldEntityIndex;
// reserve a slot in the target command buffer
uint targetCommandIndex = atomicAdd(commandCounter, 1);
// reserve a slot among this command's visible instances
uint targetInstanceIndex = atomicAdd(commandCounters[targetCommandIndex], 1);
// note the added targetInstanceIndex: each invocation writes its own slot
uint targetEntityIndex = sumOfVisibleInstancesOfAllCommandsBeforeCurrentCommand + targetInstanceIndex;
The current entity data can then be written to the targetEntityDataBuffer, as well as the offset for the instance/command. The command can be written by the first active thread (InstanceID == 0). The resulting buffers contain only the visible instances, the offsets of the visible instances, and draw commands that draw exactly those instances.
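Put together, a sequential CPU reference of the appending step could look like the following sketch; on the GPU, the loops are shader invocations and the plain counters are the atomics described above:

```kotlin
// Compacts visible instances and emits one draw command per source command
// that still has visible instances, mirroring the GPU appending step.
fun appendStep(
    commands: List<DrawCommand>,
    offsets: IntArray,
    entities: List<Entity>,
    visibility: IntArray,
    visibleCounts: IntArray,
): Triple<List<Entity>, List<DrawCommand>, IntArray> {
    val base = baseOffsets(visibleCounts)
    val targetEntities = arrayOfNulls<Entity>(visibleCounts.sum())
    val targetCommands = mutableListOf<DrawCommand>()
    val targetOffsets = mutableListOf<Int>()
    for (drawId in commands.indices) {
        if (visibleCounts[drawId] == 0) continue
        var next = 0   // per-command atomic counter on the GPU
        for (instanceId in 0 until commands[drawId].instanceCount) {
            val sourceIndex = offsets[drawId] + instanceId
            if (visibility[sourceIndex] == 1) {
                targetEntities[base[drawId] + next] = entities[sourceIndex]
                next++
            }
        }
        // on the GPU this is written by the first active thread (InstanceID == 0)
        targetCommands += commands[drawId].copy(instanceCount = visibleCounts[drawId])
        targetOffsets += base[drawId]
    }
    return Triple(targetEntities.filterNotNull(), targetCommands, targetOffsets.toIntArray())
}
```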
Here's a video demonstrating occlusion and frustum culling with this technique:
Bonus: For the visibility computation step, it's nice to launch a vertex shader kernel, since the kernel can be arbitrarily large in up to two dimensions. Since there's no shared memory involved, this will probably perform better than the compute shader equivalent, or at least at the same speed; it shouldn't be slower. The appending step needs an atomic operation per invocation, because every invocation increments the counter if the corresponding instance is visible (which the other invocations can't know). Instead of those "global" atomics, one could use the shared memory of a compute shader group. Shared memory within a group is much faster than atomic operations across a gazillion vertex shader invocations, so I will implement this at some point and may be able to show a performance comparison.