A simple path tracer on the GPU

There comes a time in any graphics programmer's life when the siren song becomes too loud to endure:

"You should write a raytracer"

For me, the moment came after writing a fragment-shader raytracer in Futureproof: it was a terrible raytracer, but a lot of fun to write.

After a few weeks down the rabbit hole, I'm proud to present rayray: a tiny, slightly less-terrible GPU raytracer.

To start, here are renderings of two classic images: a Cornell Box and the test scene from Ray Tracing in One Weekend.

Cornell box Ray tracing in one weekend

(Click for high resolution, and keep scrolling for more pretty pictures)


rayray is a traditional forward path tracer. It casts rays from each pixel in the image, which scatter until they hit a light or exceed some max number of bounces. Averaging over thousands of samples creates the final image.

Compare the orb scene rendered with 100 vs. 1000 samples:

Orb scene, relatively noisy
Orb scene, noticeably smoother

The architecture is inspired by Ray Tracing in One Weekend, but unlike the reference implementation, it runs completely on the GPU!

This requires a data-driven design (rather than inheritance), and leads to dramatically faster rendering.

All you need is trace(...)

The core of my raytracer is a function with the signature

bool trace(inout uint seed, inout vec3 pos, inout vec3 dir, inout vec4 color);

This function casts a ray to the next (nearest) object in the scene, starting at pos and traveling in direction dir. After finding this object, the ray scatters depending on material, modifying pos, dir, and color. It returns a flag indicating whether to terminate (if the ray hit a light or escaped the world).

This makes the actual raytracing function incredibly clean:

#define BOUNCES 6
vec3 bounce(vec3 pos, vec3 dir, inout uint seed, vec4 color) {
    for (int i=0; i < BOUNCES; ++i) {
        // Walk to the next object in the scene, updating the system state
        // using a set of inout variables
        if (trace(seed, pos, dir, color)) {
            return color.xyz;
        }
    }
    // Rays that never hit a light contribute no energy
    return vec3(0);
}

The grid of images below shows pos, dir, and color after one call to trace(...), then the result of a single bounce(...) sample.

First hit First direction

First color First sample

Pixels which were lucky enough to hit the light are colored based on their paths, and everything else is black. Over a few thousand samples, things add up to the correct image!
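
The accumulation itself is just a running mean. Here's a Python sketch of the arithmetic (rayray's real accumulation runs on the GPU, but the update rule is the same):

```python
# Running-mean accumulation: after n samples, `mean` holds their average,
# so the image converges without storing every individual sample.
def accumulate(mean, sample, n):
    """Update the mean of n samples to include one more."""
    return mean + (sample - mean) / (n + 1)

mean = 0.0
for n, s in enumerate([1.0, 0.0, 1.0, 0.0]):
    mean = accumulate(mean, s, n)
print(mean)  # roughly 0.5, the average of the four samples
```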

Rendering pipelines

Of course, we haven't yet specified how trace(...) is actually implemented!

While a scene is being edited, we use a preview implementation of trace(...), which is compiled once (at startup). This implementation renders an encoded scene from a storage buffer. This means that updates to the scene are cheap: they're just a buffer write operation.

The scene is packed into a vec4 array using a custom binary format:

// Header
[0] number of shapes | 0 | 0 | 0

// Shapes
[1] shape type | shape data offset | material data offset | material type
[2] shape type | shape data offset | material data offset | material type
[3] shape type | shape data offset | material data offset | material type

// Data section
[.] material data
[.] shape data

For example, encoding a scene with two spheres – one a diffuse red material, and the other a white light – takes 7 slots in the array, or 112 bytes:

// Header
[0] 2 | 0 | 0 | 0 // There are 2 shapes

// Shapes
[1] sphere | 5 | 3 | diffuse // Shape data at [5], material data at [3]
[2] sphere | 6 | 4 | light   // Shape data at [6], material data at [4]

// Data section
[3] 1.0 | 0.5 | 0.5 | 0.0 // Red (for diffuse material)
[4] 1.0 | 1.0 | 1.0 | 0.0 // White (for light)
[5] 0.0 | 0.0 | 0.0 | 0.5 // Sphere [x, y, z], r
[6] 0.5 | 0.5 | 0.0 | 0.3 // Sphere [x, y, z], r

The preview shader implements trace(...) as a tiny scene interpreter. It iterates over each shape in the scene, finding the first hit along the current ray and updating the ray's pos. Then, that shape's material is used to update the ray's dir and color.
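
The hit loop at the heart of the interpreter can be sketched in a few lines. This Python version is a simplification (spheres only, and the `SHAPE_SPHERE` tag and tuple layout are stand-ins for rayray's actual encoding), but it shows the shape of the loop:

```python
import math

# Hypothetical sketch of the interpreter's hit loop: scan every encoded
# shape, keep the nearest intersection along the ray.
SHAPE_SPHERE = 1

def hit_sphere(pos, dir, center, r):
    """Smallest positive t with |pos + t*dir - center| = r, or None."""
    oc = [p - c for p, c in zip(pos, center)]
    b = sum(o * d for o, d in zip(oc, dir))  # dir is assumed normalized
    c = sum(o * o for o in oc) - r * r
    disc = b * b - c
    if disc < 0:
        return None
    t = -b - math.sqrt(disc)
    return t if t > 1e-8 else None

def nearest_hit(scene, pos, dir):
    """Index and distance of the closest shape hit by the ray."""
    best = (None, math.inf)
    for i, (tag, center, r) in enumerate(scene):
        if tag == SHAPE_SPHERE:
            t = hit_sphere(pos, dir, center, r)
            if t is not None and t < best[1]:
                best = (i, t)
    return best

scene = [(SHAPE_SPHERE, (0, 0, 5), 1.0),
         (SHAPE_SPHERE, (0, 0, 10), 1.0)]
print(nearest_hit(scene, (0, 0, 0), (0, 0, 1)))  # (0, 4.0): nearer sphere wins
```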

After the scene has stabilized (defined as no user interaction for one second), the application builds an optimized kernel, which encodes the scene data directly. This takes longer to build, because it's building a full compute pipeline, but is much faster to evaluate.

The generated shader unrolls the shape-hit loop, then packs normal and material calculations directly into a series of switch statements. Compare the interpreter against one example of the generated trace(...).
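
One way to picture the specialization step is as string-level code generation: each shape's constants are baked into emitted source instead of being read from a buffer at runtime. This toy Python sketch only emits a fragment (rayray's actual generator builds a complete GLSL compute shader; `hit_sphere` is an assumed helper name):

```python
# Toy sketch of scene-specialized code generation: emit one unrolled,
# constant-folded intersection test per shape.
def specialize(spheres):
    lines = []
    for i, (cx, cy, cz, r) in enumerate(spheres):
        lines.append(
            f"t{i} = hit_sphere(pos, dir, vec3({cx}, {cy}, {cz}), {r});")
    return "\n".join(lines)

src = specialize([(0, 0, 5, 1.0), (0, 0, 10, 1.0)])
print(src)
```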

Building a scene-specific shader makes a big difference! Testing across various scenes, I see a 2-6x speedup between the interpreter and the scene-optimized pipeline.

Focal blur

Focal blur is accomplished by jittering the ray at the camera plane, while maintaining focus at a particular distance:

This is another case where accumulating random samples just works: by picking a random jitter for each sample, we can get a lovely depth-of-field effect.

Hex grid of spheres with no focal blur
Hex grid of spheres with focal blur

Antialiasing works the same way: every pixel is jittered within a 1-pixel radius from its nominal position, which smooths out edges.
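
Here's a sketch of the camera-side jitter for focal blur (the geometry is an assumption for illustration, not rayray's exact camera code): the ray's origin is offset on an aperture disk, but the ray still passes through the same point on the focal plane, so geometry at the focal distance stays sharp while everything else blurs. The sub-pixel antialiasing jitter applies the same accumulate-random-samples idea to the ray's target instead of its origin.

```python
import math, random

def jittered_ray(pixel_dir, focal_dist, aperture, rng):
    # Point this pixel's ray targets on the focal plane
    focus = [d * focal_dist for d in pixel_dir]
    # Rejection-sample a random offset on the unit disk
    while True:
        jx, jy = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if jx * jx + jy * jy <= 1:
            break
    # Jittered origin on the aperture, aimed back at the focus point
    origin = [jx * aperture, jy * aperture, 0.0]
    dir = [f - o for f, o in zip(focus, origin)]
    norm = math.sqrt(sum(d * d for d in dir))
    return origin, [d / norm for d in dir]

rng = random.Random(0)
origin, dir = jittered_ray([0.0, 0.0, 1.0], 5.0, 0.1, rng)
# Stepping the jittered ray out to the focal plane recovers (0, 0, 5),
# so geometry at the focal distance stays sharp
t = (5.0 - origin[2]) / dir[2]
hit = [o + t * d for o, d in zip(origin, dir)]
```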

Spectral rendering

The final chapter of Ray Tracing: The Rest of Your Life suggests

If you want to do hard-core physically based renderers, convert your renderer from RGB to spectral. I am a big fan of each ray having a random wavelength and almost all the RGBs in your program turning into floats. It sounds inefficient, but it isn't!

This sounded like fun, so I built a spectral mode into rayray (invoked by the -s command-line argument).

On each frame, the GPU uniformly selects a wavelength between 400 and 700 nm, converts it to an RGB value, then renders as before, with the wavelength stored in the w coordinate of the vec4 color.

(This is why bounce(...) takes a vec4 color but only returns a vec3)

In spectral mode, glass materials use the w coordinate to tweak their index of refraction, to imitate real-world dispersion.
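
A common model for this kind of wavelength-dependent index of refraction is Cauchy's equation, n(λ) = A + B/λ². The coefficients below are illustrative (roughly crown-glass-like), and rayray's actual dispersion model may differ:

```python
# Wavelength-dependent index of refraction via Cauchy's equation,
# n(lam) = A + B / lam^2, with illustrative coefficients (B in nm^2).
def cauchy_ior(lam_nm, A=1.5046, B=4200.0):
    return A + B / (lam_nm * lam_nm)

n_blue = cauchy_ior(400.0)
n_red = cauchy_ior(700.0)
print(n_blue, n_red)  # blue bends more than red, so a prism fans light out
```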

This means we can make a prism:


(Aside: Rendering a 2D scene with a 3D raytracer was challenging, as rays scattered randomly off the rear plane were unlikely to hit the narrow light. I fixed this using a special material which scatters light perpendicular to its normal; this means after the first bounce, all of the rays are scattered in XY and are more likely to hit the prism and light)

One expected-but-neat side effect is chromatic aberration. Compare the glass sphere with normal and spectral rendering:

In the spectral image, the refracted blue sphere has blue and red distortions around the edges, and the caustic shows a faint rainbow.

Editor GUI

Like all good graphics tools, rayray integrates Dear ImGui for a debug UI.

The UI is integrated into the raytracer's scene representation, so changes are instantly reflected in the image:

Color editor GUI

At the time of implementation, there was no canonical WebGPU backend for Dear ImGui, so I wrote one (there's now an experimental implementation in the main repository).

The editor is extremely rough – it's mostly useful for tweaking parameters of scenes defined in code, then copying those parameters back into the original source. Still, being able to drag colors around in real-time is entertaining!

Importing .mol files

In search of more interesting models, I discovered MolView, an online database of chemicals. Each chemical can be downloaded as a .mol file, so I wrote a small importer which converts them into rayray scenes.

For example, caffeine:


Aside: RNG on the GPU

A raytracer needs a good random number generator (RNG) to model ray scatter off a diffuse material (among other things).

As it turns out, RNGs on the GPU are a fun and contentious topic.

You'll often see the one-liner

float rand(vec2 co) {
    return fract(sin(dot(co.xy, vec2(12.9898, 78.233))) * 43758.5453);
}

I have philosophical objections to this function, because it only works once, and only in a 2D grid. For raytracing, each ray requires a stream of random values as it bounces around the scene. You'd need some way to adjust co.xy each time rand(...) is called, and it would be easy to accidentally add a pattern to the noise.

Ideally, we'd have an RNG which takes a seed and generates a continuous stream of random values, ad infinitum. With that in mind, I decided to use a hash-based approach, where the seed is a uint transformed repeatedly by some hash function.

These two blog posts informed my final strategy:

Here's my slow hash function, used to generate the initial seed:

// Jenkins hash function, specialized for a uint key
uint hash(uint key) {
    uint h = 0;
    for (int i=0; i < 4; ++i) {
        h += (key >> (i * 8)) & 0xFF;
        h += h << 10;
        h ^= h >> 6;
    }
    h += h << 3;
    h ^= h >> 11;
    h += h << 15;
    return h;
}

(Note that this is slightly different from the "Jenkins' OAT algorithm" in the second blog, which doesn't handle each byte; mine is faithful to the canonical definition)

Applying this hash function to values 0 through 65535, we see the following:

Jenkins hash

This seems suitably random as a per-pixel seed value!

My fast hash is based on a linear congruential generator using a particular magic number from the literature: it's simply seed = 0xadb4a92d * seed + 1.

Here's what repeated iterations look like, starting from seed = 1 and mapping the top 24 bits to RGB values:

Fast hash result

As you can see, this is pretty good: there aren't any obvious visual patterns or repetition.

It's important to use the high bits, because the low bits show more obvious patterns. Here's what the bottom 24 bits look like:

Fast hash low bits

There's a clear cycle in the lowest 8 bits, visible in the blue channel!
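
This cycle is easy to demonstrate: the LCG's low k bits form their own little LCG mod 2^k, so the bottom 8 bits repeat with period 256 even though the full 32-bit state has period 2^32. A quick Python check:

```python
# The LCG from the text, reduced to 32 bits. Its lowest byte cycles with
# period 256, which is exactly the pattern visible in the blue channel.
def step(seed):
    return (0xadb4a92d * seed + 1) & 0xFFFFFFFF

seed = 1
low_bytes = []
for _ in range(512):
    seed = step(seed)
    low_bytes.append(seed & 0xFF)
print(low_bytes[:256] == low_bytes[256:])  # True: the low byte repeats every 256 steps
```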

Here's the complete rand(...) function, including bithacking to convert the highest 23 bits of the hashed seed directly into a float mantissa:

// Returns a pseudorandom value between -1 and 1
float rand(inout uint seed) {
    // 32-bit LCG Multiplier from
    // "Computationally Easy, Spectrally Good Multipliers for
    //  Congruential Pseudorandom Number Generators" [Steele + Vigna]
    seed = 0xadb4a92d * seed + 1;

    // Low bits have less randomness [L'ECUYER '99], so we'll shift the high
    // bits into the mantissa position of an IEEE float32, then mask with
    // the bit-pattern for 2.0
    uint m = (seed >> 9) | 0x40000000u;

    float f = uintBitsToFloat(m);   // Range [2:4]
    return f - 3.0;                 // Range [-1:1]
}

This function is tuned to return values in the range -1 to 1, to save an extra scaling step.
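
Here's the same bit trick ported to Python, as a sanity check on the ranges (the real implementation is the GLSL above): OR-ing 0x40000000 over the top of 23 random mantissa bits produces an IEEE float32 in [2, 4), and subtracting 3.0 shifts that to [-1, 1).

```python
import struct

# Python port of rand(...): advance the LCG, then bit-cast the high 23
# bits into the mantissa of a float with exponent bits set for [2, 4).
def rand(seed):
    seed = (0xadb4a92d * seed + 1) & 0xFFFFFFFF
    m = (seed >> 9) | 0x40000000
    f = struct.unpack('<f', struct.pack('<I', m))[0]
    return seed, f - 3.0

seed, lo, hi = 1, 1.0, -1.0
for _ in range(10000):
    seed, r = rand(seed)
    lo, hi = min(lo, r), max(hi, r)
print(lo, hi)  # both stay within [-1, 1)
```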

I came up with one neat trick myself: by specifying seed as an inout value, the function returns random floats while mutating the seed automatically, so you can call rand(seed) repeatedly to get your stream of values. For example, generating a random vec3 is simply

vec3 rand3(inout uint seed) {
    return vec3(rand(seed), rand(seed), rand(seed));
}

Finally, we generate our initial seed based on the sample count (which changes every frame) and the invocation ID (which changes for each pixel):

void main() {
    // Set up our random seed based on the frame and pixel position
    uint seed = hash(hash(u.samples) ^ hash(gl_GlobalInvocationID.x));
    // ...
}


Performance data was captured on a 2017 MacBook Pro, with a Radeon Pro 560 GPU and a 2.9 GHz Intel Core i7 CPU.

Two figures are presented: steady-state absolute performance (in mega-rays per second), and equivalent speed when rendering a 1200 x 1200 image.

Scene                        Absolute speed   1200 x 1200 image
Cornell Box                  304 Mray/s       211 frames/s
Ray Tracing in One Weekend   19.1 Mray/s      13.3 frames/s
the orb                      33 Mray/s        22 frames/s
Golden sphere grid           86 Mray/s        60 frames/s
Prism                        955 Mray/s       663 frames/s
Caffeine molecule            193 Mray/s       134 frames/s

To compare, the reference implementation for Ray Tracing in One Weekend takes about 577 seconds to render a 1200 x 1200 image at 500 samples per pixel (max depth of 6 bounces, compiled with -O3, output disabled).

rayray takes 43.4 seconds to render the same image, which is a 13x speedup (even including startup and shader compilation time).

Of course, the reference implementation isn't optimized for performance – but then again, neither is rayray!


Like Futureproof, rayray is built on a stack of (unnecessarily) modern tools and technologies, including Zig for the application code and WebGPU (via wgpu-native) for GPU rendering and compute.

While building the scene-specific pipeline, I ran into a pathological complexity explosion in SPIRV-Cross. Coincidentally, it had been fixed a few days earlier, so I worked to push that fix all the way upstream to wgpu-native (and also wgpu-rs, to be polite).

This was a very positive experience: the gfx-rs and wgpu folks were consistently responsive and helpful, so props to that community.

(I also found a bug in cbindgen along the way, which was promptly fixed!)

Future plans

Like Futureproof, this was a toy project / proof of concept, and I'm not planning to maintain it into the future.

As always, the code is on GitHub, and I'd be happy to link any fork which achieves critical momentum.

Now that naga is released, it may be time to transition back to Rust for graphics work – my motivation for using Zig was seamless interoperability with C libraries (mainly shaderc), and that's no longer needed in wgpu 0.7.0.
