Thursday, December 26, 2019

Smart property delegation with Kotlin

Every now and then I have a use case where a data holder object is wrapped by some other object. With Kotlin's interface delegation, it's very easy to implement an interface by delegating to the implementing property. However, sometimes you don't have interfaces, and I realized that there's one thing missing from the Kotlin standard lib: having a property delegate to another property. Or at least I wasn't able to find it, and it's kind of difficult to google. So here's the solution:
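
A minimal sketch of such a delegate (the DelegatedProperty, delegateTo, DataHolder and Wrapper names are just made up for illustration):

import kotlin.reflect.KMutableProperty0
import kotlin.reflect.KProperty

// A delegate that simply forwards get and set to another property,
// addressed via a bound property reference.
class DelegatedProperty<T>(private val target: KMutableProperty0<T>) {
    operator fun getValue(thisRef: Any?, property: KProperty<*>): T = target.get()
    operator fun setValue(thisRef: Any?, property: KProperty<*>, value: T) = target.set(value)
}

fun <T> delegateTo(target: KMutableProperty0<T>) = DelegatedProperty(target)

data class DataHolder(var name: String)

class Wrapper(private val holder: DataHolder) {
    // Reads and writes go straight to holder.name.
    var name: String by delegateTo(holder::name)
}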



I'm using a property reference (you need the reflection dependency for that) for generic property access (get and set). This is kind of a lens, known from functional programming, I guess.

Nice things can be done with that; for example, typesafe UI components can be generated like this:
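
Here's a sketch of the idea, binding a text field to any mutable String property via its reference (assuming JavaFX; the textFieldFor helper is only illustrative):

import javafx.scene.control.TextField
import kotlin.reflect.KMutableProperty0

// Builds a text field that starts with the property's current value
// and writes every edit back to the property.
fun textFieldFor(property: KMutableProperty0<String>): TextField =
    TextField(property.get()).apply {
        textProperty().addListener { _, _, newValue -> property.set(newValue) }
    }

// Usage: textFieldFor(dataHolder::name)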

Sunday, December 1, 2019

Programmable vertex pulling

So I finally managed to invest some time to implement programmable vertex pulling in my engine. I can really recommend implementing an abstraction over persistently mapped buffers that lets you declare structured buffers of generic structs and then use them on the CPU and the GPU side as a simple array of things.

Nothing comes for free: I find it quite difficult to handle any layout other than std430, because that matches what your C or C++ code is doing, as long as you restrict yourself to 16-byte-aligned members, I think. My struct framework doesn't do any alignment, so I just added dummy members where appropriate in order to match the layout requirements. Afterwards, the struct definitions in GLSL have to match your structs on the CPU side, and the only thing left for your vertices is:


struct VertexPacked {
    vec4 position;
    vec4 texCoord;
    vec4 normal;
};
layout(std430, binding=7) buffer _vertices {
    VertexPacked vertices[];
};

...


int vertexIndex = gl_VertexID;
VertexPacked vertex = vertices[vertexIndex];
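
On the CPU side, the counterpart can be as dumb as writing tightly packed vec4 members into the mapped buffer - here's a rough Kotlin sketch of the idea (not my actual struct framework; names are illustrative):

import java.nio.ByteBuffer

// Mirrors the GLSL VertexPacked above: three vec4 members, 48 bytes per vertex,
// which is already a multiple of 16 bytes, so std430 needs no extra padding.
class VertexPacked(val position: FloatArray, val texCoord: FloatArray, val normal: FloatArray)

const val VERTEX_BYTES = 3 * 4 * Float.SIZE_BYTES // 48

// 'mapped' is assumed to be the ByteBuffer obtained from mapping the persistently
// mapped buffer (GL_MAP_PERSISTENT_BIT | GL_MAP_WRITE_BIT), in native byte order.
fun writeVertex(mapped: ByteBuffer, index: Int, vertex: VertexPacked) {
    var offset = index * VERTEX_BYTES
    for (component in vertex.position + vertex.texCoord + vertex.normal) {
        mapped.putFloat(offset, component)
        offset += Float.SIZE_BYTES
    }
}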


Combined with persistent mapping, you can get rid of any layout fiddling, synchronization, buffering, mapping... and it just works.

Regarding performance: I am using an array-of-structs approach because it is the simplest to use. The performance in my test scenes (for example Sponza) is completely identical to the traditional approach. No performance difference on an Intel UHD Graphics 620.

Having free indexed access to vertices in your shaders can be beneficial in other situations as well. For example, you can implement a kd-tree-accelerated ray tracer in a compute shader that uses indices into your regular vertex array.

Thursday, November 14, 2019

Pixel perfect picking with deferred rendering

There are several ways to accomplish pixel perfect picking in one's engine. Some tutorials mention an object hierarchy as a scene representation in order to trace rays for picking. This is often done on the CPU, where information about the object is already available when the ray hit callback is invoked during tracing. On the GPU, this could be done with a compute shader or a vertex shader that writes to an output buffer, which can then be read back on the CPU.

With deferred rendering however, I have a simpler approach that doesn't involve any tracing at all. Since I need object IDs for later passes to fetch instance data out of a big buffer, I write them to one output texture in my deferred rendering buffer. The output texture can be of type int or uint, depending on the number of objects you have to handle. One can pack the index into the bits of a regular 8-bit RGBA texture, into a floating point texture, or whatever texture has some bits of space left in your gbuffer. After the gbuffer pass of deferred rendering is done, one can use glReadPixels (with the read buffer set) or glGetTexImage and the current mouse position to get the index of the clicked object back to the CPU side of things. Besides the format handling and the bit mangling, this is rather trivial, so I won't post code here, but here's a nice video of the usage in my new ribbon based editor :)

Sunday, September 8, 2019

Sparse voxelization on the CPU

The various adventures with Voxel Cone Tracing showed me that asynchronous and partly done voxelization on the GPU can become really, really tricky, because the data structures involved are very hard to implement.

So after ditching clipmaps because they won't allow for enough caching of static things, I gave sparse voxelization another try. Like - I think - CryEngine, I wanted to implement voxelization on the CPU, in order to be able to perform it asynchronously and partly on demand. The voxelization itself didn't take too much time - I decided to go for the "brick" approach, which allocates either none or all eight children of a given node, regardless of whether some of them are empty or not. The advantage here is that working with offsets is much easier, as the first child is always located at the brick pointer start and the next seven children are contiguous in memory, so one can just increment the pointer to get the next child. Using a pointer based approach enables the required asynchronicity and streaming, because the memory layout doesn't have to fulfill everything we need on the GPU later on for tracing. Super nice Kotlin coroutines make it very easy to use 100% of the CPU with a fork-join approach and just launching thousands of coroutines, as sketched below. Depth of the tree is 11, size is 1000³. I think in order to have this resolution, one would need a 3D texture with a resolution of 512³ or 1024³, which usually is too costly for voxel cone tracing approaches.
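
The fork-join part boils down to something like this (a simplified sketch with made-up names, not the engine's actual code; hasGeometry stands in for the real triangle/box intersection test):

import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking

class Node { var children: Array<Node>? = null }

// Brick approach: either no children at all, or all eight allocated contiguously.
suspend fun voxelize(node: Node, depth: Int, maxDepth: Int, hasGeometry: (Node, Int) -> Boolean): Unit =
    coroutineScope {
        if (depth == maxDepth || !hasGeometry(node, depth)) return@coroutineScope
        val brick = Array(8) { Node() }
        node.children = brick
        for (child in brick) {
            // Fork: every child gets its own coroutine on the default (CPU-bound) dispatcher,
            // which keeps all cores busy; coroutineScope joins them all before returning.
            launch(Dispatchers.Default) { voxelize(child, depth + 1, maxDepth, hasGeometry) }
        }
    }

// Small depth just for the demo - the real tree uses depth 11.
fun main() = runBlocking { voxelize(Node(), 0, 3) { _, _ -> true } }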



I didn't squeeze out the last bit of performance here, so yes, maybe this can be implemented much faster. And no excuses, but recording the video itself eats up a lot of the fps I had - it ran at about 30 fps, pretty much limited by my laptop CPU.

And this video could be the last thing I can show about voxelization, because the tracing part just didn't work out as expected. In order to work with empty leaf nodes, one has to get creative: if you intersect a ray with a box, but the hit would actually be with a box that is an empty leaf, you have to backtrack the tree (hence you need parent node pointers in all nodes, but that's okay) until you get somewhere where some sibling nodes are left to be traced...

My implementation is too bad to be anywhere near realtime, and debugging or reasoning about what's going on is very, very tedious... after having my computer completely frozen half a dozen times because I made another stupid mistake on the GPU, I decided that this approach is just not doable in the amount of time I can afford :)

Friday, August 16, 2019

Companion vals

I totally forgot to make a braindump about another feature I stumbled upon lately: companion vals.

There's this really interesting feature proposal for the Kotlin language.

In essence, the proposed feature allows writing the following code:
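
(Roughly reconstructed from the proposal's description rather than quoted from it; all names are made up, and of course this only compiles with the proposed feature in place.)

class Nameable(val name: String)

class Person(name: String) {
    companion val nameable = Nameable(name)
}

fun demo() {
    // nameable's members are exposed on Person directly,
    // so no person.nameable.name chain is needed:
    println(Person("Jane").name)
}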


and the following code:
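
(Again only my reconstruction of the idea, using the AdressPrinter example mentioned below.)

class Adress(val street: String)

object AdressPrinter {
    fun printAdress(adress: Adress) = println(adress.street)
}

class Registry(companion val printer: AdressPrinter) {
    // printer acts as an implicit receiver in the class body,
    // so no with(AdressPrinter) { } block is needed:
    fun print(adress: Adress) = printAdress(adress)
}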

From my point of view the feature has two aspects.

The first one is making members of companion properties part of the surrounding instance. That means we can have properties whose members are automatically exposed, hence you don't need to access them with dot notation.

The second aspect is that other scopes are treated as well: if something is marked as companion, it is automatically available as a receiver in the corresponding scope. The proposal only talks about class properties, which are then available in the class body automatically. This enables having Kotlin's scoped extension functions available without the need to use with(AdressPrinter) {} everywhere.

I extended the Kotlin compiler with these two features and widened the application of the second aspect to all (?) possible scopes. This means if the companion val is top level, it's automatically available in the whole file. If a function parameter is marked as companion, the argument is going to be available as a receiver in the function body and so on. The implementation can be found here and examples can be found in the working tests.

Since the compiler has no simple, stable API, I also implemented an annotation processor that fulfils the first aspect. The repository can be found here. This would make the above code compile (with the right imports). It works by generating extension functions for all members, just as you would do by hand in Kotlin anyway if you wanted this functionality.
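
To illustrate with made-up names, the generated code is morally equivalent to hand-written extensions like these:

class DataHolder(val importantValue: Int) {
    fun printValue() = println(importantValue)
}

class Wrapper(val holder: DataHolder)

// Forwarding extensions so holder's members are reachable on Wrapper
// without dot-chaining - roughly what the processor spares you from writing:
val Wrapper.importantValue: Int get() = holder.importantValue
fun Wrapper.printValue() = holder.printValue()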

Sunday, February 24, 2019

Multivolume voxel cone tracing

Another feature experiment I nearly forgot to show you is voxel cone tracing with multiple voxel volumes. There are different approaches to support large scenes with voxel cone tracing:

* Use a single volume texture and scale it to the scene's bounds. This increases a single voxel's world extents, which leads to coarser lighting and even more light leaking.
* Use sparse octree voxelization with a buffer instead of a volume texture. This is a way more complicated implementation. Additionally, the performance hit for lighting evaluation is quite big, compared to hardware-filtered volume texel fetching.
* Use cascaded voxel cone tracing, an approach similar to cascaded shadow maps. Revoxelization of the whole scene (or what's visible for the player) is quite demanding - implementations that only revoxelize objects on the border of the cascades are way more complex than the traditional, non-cascaded approach. Not using such an approach and revoxelizing everything every frame leads to flickering in the voxelization step, which can't be eliminated completely due to the "binary" nature of voxels (or at least I didn't manage to achieve it).

My implementation goes a different way that doesn't suffer from the above problems, by introducing world space voxel volumes. Instead of a single big one, there are many smaller ones. There are many advantages now:

* Not all voxel volumes have to be updated every frame - one can update the nearest n volumes per frame, depending on the given hardware specs.
* There can be higher resolution where needed and lower resolution where coarse illumination is sufficient.
* Since everything is in world space, there's no flickering on revoxelization - at least when materials change. For dynamic objects, one still has to do some tricks or use temporal filtering with multiple bounces or something.
* Theoretically, the voxel data could be precalculated and streamed in.


I put a sized list of VoxelGrid entries into a generic buffer that my evaluation compute shaders can read. My VoxelGrid data structure is as simple as the following:
struct VoxelGrid {
    int albedoGrid;
    int normalGrid;
    int grid;
    int grid2;
    int resolution;
    int resolutionHalf;
    int dummy2;
    int dummy3;
    mat4 projectionMatrix;
    vec3 position;
    float scale;
    uvec2 albedoGridHandle;
    uvec2 normalGridHandle;
    uvec2 gridHandle;
    uvec2 grid2Handle;
};

As mentioned in an earlier post, my volumes use a kind of deferred rendering to cache geometry properties and only recalculate lighting information when necessary, hence the need for four texture attachments - one for albedo, one for normals, and two for multiple bounces of GI.

The resolution (and the helper resolutionHalf) determines the resolution of the volume texture. This is needed because the size of a volume can be arbitrary while the resolution is fixed at some point, leading to arbitrary world space sizes for a single voxel.

Besides a little bit of padding, I also save the projection matrix that is used to voxelize objects into this volume. This isn't needed during evaluation, but I wanted to use a single data structure for both steps of the pipeline.

Since texture ids don't give you much any more when using multiple volumes (you don't want to bind anything anymore...), those could be removed by now. What I use instead is bindless handles for everything, hence the handles for the said four textures, passed in as uvec2 data types.
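
Obtaining such a handle with LWJGL-style bindings looks roughly like this (a sketch only; how the two uints end up in the uvec2 depends on how you write the handle into the buffer):

import org.lwjgl.opengl.ARBBindlessTexture.glGetTextureHandleARB
import org.lwjgl.opengl.ARBBindlessTexture.glMakeTextureHandleResidentARB

// Turn a regular texture id into a 64-bit bindless handle. The handle has to be
// made resident before any shader is allowed to sample through it.
fun bindlessHandleFor(textureId: Int): Long {
    val handle = glGetTextureHandleARB(textureId)
    glMakeTextureHandleResidentARB(handle)
    return handle
}

// The uvec2 in the VoxelGrid struct is typically just the handle split into two uints.
fun splitHandle(handle: Long): IntArray =
    intArrayOf((handle and 0xFFFFFFFFL).toInt(), (handle ushr 32).toInt())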

Tracing

Now the interesting part. When only a few volumes are used, let's say 5-10 or something, the tracing can easily be implemented as brute force iteration over an array. I don't think more volumes are practical, as each volume needs a lot of memory, and there comes the point where classic sparse voxel octrees are simply more efficient.

When implementing the tracing, I realized that I want to favour higher resolution volumes when volumes overlap. Besides that, the tracing is quite simple: take the gbuffer's world space position and trace diffuse and/or specular lighting in as many directions as you like. The sampling diameter increases with distance and determines the mipmap level to sample from.


    vec4 accum = vec4(0.0);
    float alpha = 0.0;
    float dist = 0.0;
    vec3 samplePos = origin;// + dir;

    while (dist <= maxDist && alpha < 1.0)
    {
        float minScale = 100000.0;
        int candidateIndex = -1;
        VoxelGrid voxelGrid;
        for(int voxelGridIndex = 0; voxelGridIndex < voxelGridArray.size; voxelGridIndex++) {
            VoxelGrid candidate = voxelGridArray.voxelGrids[voxelGridIndex];
            if(isInsideVoxelGrid(samplePos, candidate) && candidate.scale < minScale) {
                candidateIndex = voxelGridIndex;
                minScale = candidate.scale;
                voxelGrid = candidate;
            }
        }

        float minVoxelDiameter = 0.25f*voxelGrid.scale;
        float minVoxelDiameterInv = 1.0/minVoxelDiameter;
        vec4 ambientLightColor = vec4(0.);
        float diameter = max(minVoxelDiameter, 2 * coneRatio * (1+dist));
        float increment = diameter;

        if(candidateIndex != -1) {
            sampler3D grid;
            // sample grid here
        }

        dist += increment;
        samplePos = origin + dir * dist;
        increment *= 1.25f;
    }
    return vec4(accum.rgb, alpha);

The results are quite nice, with mixed resolutions and sizes of volumes. Here's an example of a transition between a fine and a coarse volume:

coarse and fine voxel volume side by side
Unfortunately, the performance of my tracing is not that good. It's remarkably slower than the single volume tracing, and I'm not too certain why this is the case. My test scene contained 4 volumes and performance dropped below 30 fps on my GTX 1060, so it's not capable of being realtime anymore. And I'm talking about evaluation only - no voxelization done here.

This again leads me to the conclusion that voxel cone tracing is just too heavy on resources to be practical. I got the idea of using binary voxelization, only using voxels for occlusion and soft shadows, and evaluating global illumination from precomputed volumes instead. No synchronization between voxelization threads, and in general just a lot cheaper.

Saturday, February 23, 2019

Multibounce voxel cone tracing

It has been quite a while since I implemented a derivative of classic voxel cone tracing with volume textures in my graphics engine, one that kind of applies deferred rendering to voxels in order to achieve multi bounce global illumination. The idea is to voxelize the whole scene into a voxel texture and save all parameters that are needed for lighting. Similar to the regular gbuffer in deferred rendering, positions (implicitly given by the world space voxels), normals and albedo can be sufficient. Additionally, my renderer writes a flag indicating whether the object is dynamic or static, in order to be able to cache voxel data for static objects, which massively speeds up the voxelization process and brings the whole thing closer to realtime capable.
Decoupling lighting from voxelization also frees enough frame time to implement multiple light bounces. Therefore, for n bounces, I added n light accumulation voxel texture targets that are traced against during voxel lighting (see the sketch below). Although a second bounce can significantly enhance the scene's overall lighting, I struggled to get this to work with ping-ponging textures. I also struggled with parameters like the number of samples on the hemisphere, tracing distance, cone aperture, etc., because in the voxel world my default parameters for gbuffer tracing didn't lead to great results. Nonetheless, I wanted to share my results with you strangers, although I tend to discard this feature, because it doesn't make voxel cone tracing's light leaking problem any less apparent...
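
The bookkeeping for the bounce targets is simple; something along these lines (illustrative names only, not my engine's code):

// n light accumulation volume texture ids. The voxel lighting pass writes into
// 'current' while the cone tracing samples 'previous' for the earlier bounce; swap per pass.
class BounceTargets(private val textureIds: List<Int>) {
    private var index = 0
    val current: Int get() = textureIds[index]
    val previous: Int get() = textureIds[(index + 1) % textureIds.size]
    fun swap() { index = (index + 1) % textureIds.size }
}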

two bounces (first), one bounce (second)