The Solid-State Register Allocator

I recently found myself needing a very fast register allocator for very simple code. There's a huge amount of academic research on the subject, dating back decades, and just as many published implementations; my specific implementation – while not fundamentally novel – landed in an interesting-enough part of configuration space that it seemed worth writing up.

The system has one big limitation: it only works on straight-line code, without any jumps or branches. On the upside, it has a handful of desirable properties:

(Post-publication note: Through parallel evolution, it turns out that I've ended up with a naive version of the LuaJIT register allocator; if you'd like to learn more about their implementation, see Mike Pall's IP disclosure or their source code)

My specific implementation of register allocation is written in Rust with very few moving parts, so I've nicknamed it the "solid-state register allocator" (or "SSRA" for short).

Before we get started, you can take a peek at the finished system with this demo, configured for a fictional machine with only two registers. Edit the text on the left; it's interactive!



What is register allocation?

Let's start at the beginning.

Register allocation is a step in compilation where we go from abstract variables (infinite in supply) to physical machine registers (sadly non-infinite). In my case, I'm working with straight-line code (i.e. without any jumps or branches) in single static assignment form.

Imagine a toy language with only two operations:

A simple chunk of code looks something like this:

$0 = BAR()
$1 = BAR()
$2 = FOO($0, $1)
$3 = FOO($2, $1)

Incoming variables are prefixed with the dollar sign $; SSA form means that operations are only allowed to use variables that have already been defined and cannot write an existing variable.

Now, let's say we're working on a machine with two physical registers: r0 and r1.

Just by eyeballing it, the code above could be compiled to

r0 = BAR()
r1 = BAR()
r0 = FOO(r0, r1)
r0 = FOO(r0, r1)

Unfortunately, it's not always that easy. Let's add one more operation:

$0 = BAR()
$1 = BAR()
$2 = FOO($0, $1)
$3 = FOO($2, $1)
$4 = FOO($0, $3)

Intuitively, $0 now needs to live longer; we can no longer overwrite it with the first FOO operation. Instead, we need to use LOAD and STORE operations to "spill" it into memory:

r0 = BAR()
r1 = BAR()
STORE(r0, mem + 0)
r0 = FOO(r0, r1)
r0 = FOO(r0, r1)
r1 = LOAD(mem + 0)
r0 = FOO(r0, r1)

Our new program uses both registers r0 and r1, and one word of memory (indicated as mem + 0). $0 is moved from r0 into memory before r0 is overwritten, then moved back in time for the final FOO operation.

Liveness ranges

When doing register allocation, we'll often talk about "liveness" or "live ranges" of variables. The live range of a variable is simply the time between when it's first defined and when it's used for the last time.

Here's what that looks like for our previous program:

                    $0  $1  $2  $3  $4
$0 = BAR()           ↑
$1 = BAR()           |   ↑
$2 = FOO($0, $1)     |   |   ↑
$3 = FOO($2, $1)     |   ↓   ↓   ↑
$4 = FOO($0, $3)     ↓           ↓   ↑
                                     ↓

Interestingly, this gives us an intuition for why we can't fit the program into two registers: on the third operation, we have three live values.

In conventional linear scan allocation (PDF), these liveness ranges are critical! Here's the pseudocode from that paper:

Pseudocode from the "Linear Scan Register Allocation" paper, Fig. 1

You don't need to follow the whole thing, but note how much effort goes into bookkeeping of live ranges:

The One Weird Trick

The "one weird trick" to avoid all this bookkeeping is simple:

Do register allocation in reverse.

You may ask, "How does that help?"

When walking through the code in reverse, you get liveness ranges for free, without needing an extra pass:

(Read the parenthesized words to get the "forward" perspective)

It turns out that this isn't a new idea; as I mentioned above, Mike Pall – author of LuaJIT – disclosed this trick back in 2009. Here's a good blog post summarizing how LuaJIT uses this strategy.

A worked example

Register allocation in reverse is extremely counterintuitive. For example, when we reach an operation, we must already know the location of its output (which is either a register or memory location).

This means that we start by assigning the final output variable to a register, typically r0 (for simplicity).

# $0 = BAR()
# $1 = BAR()
# $2 = FOO($0, $1)
# $3 = FOO($2, $1)
# $4 = FOO($0, $3)
                   <- You are here; r0 = $4

Let's proceed through our list of operations, and see what we encounter:

# $0 = BAR()
# $1 = BAR()
# $2 = FOO($0, $1)
# $3 = FOO($2, $1)
$4 = FOO($0, $3)   <- r0 = $4

At this point, neither input is assigned. Remember that the live span of a variable ends when it's written to (thinking about the tape in reverse); this means that we can reuse r0 immediately, remapping it from $4 to $0:

# $0 = BAR()
# $1 = BAR()
# $2 = FOO($0, $1)
# $3 = FOO($2, $1)
r0 = FOO(r0, r1)   <- r0 = $0, r1 = $3

Okay, let's keep going:

# $0 = BAR()
# $1 = BAR()
# $2 = FOO($0, $1)
$3 = FOO($2, $1)   <- r0 = $0, r1 = $3
r0 = FOO(r0, r1)

Now, we are at an impasse. We can reuse r1 as $3 goes out of scope, but cannot yet reuse r0, since $0 is still alive. After partially allocating this instruction, we're stuck here:

# $0 = BAR()
# $1 = BAR()
# $2 = FOO($0, $1)
r1 = FOO(r1, $1)   <- r0 = $0, r1 = $2
r0 = FOO(r0, r1)

We must now evict r0 to make space for $1. This takes the form of a LOAD operation after the operation of interest; we'll move it to mem + 0

# $0 = BAR()
# $1 = BAR()
# $2 = FOO($0, $1)
r1 = FOO(r1, r0)    <- r0 = $1, r1 = $2, mem + 0 = $0
LOAD(r0, mem + 0)   <- this is added!
r0 = FOO(r0, r1)

Think through how this works for a moment: from the forward perspective, r0 was bound to $0 after the target operation, so the LOAD operation recovers this binding during forward execution.

Continuing up, we run into a new issue: we're now trying to do an operation on a variable from memory, instead of a register!

# $0 = BAR()
# $1 = BAR()
r1 = FOO($0, r0)   <- r0 = $1, mem + 0 = $0
r1 = FOO(r1, r0)
LOAD(r0, mem + 0)
r0 = FOO(r0, r1)

At this point, we know that r1 is newly available. To use it, we add a STORE operation before the target, moving the variable from r1 into memory (when looking at the program from a forward perspective):

# $0 = BAR()
# $1 = BAR()
STORE(r1, mem + 0)   <- r0 = $1, r1 = $0
r1 = FOO(r1, r0)
r1 = FOO(r1, r0)
LOAD(r0, mem + 0)
r0 = FOO(r0, r1)

The remaining two lines are easy, because their outputs are already in registers:

r1 = BAR()
r0 = BAR()
STORE(r1, mem + 0)
r1 = FOO(r1, r0)
r1 = FOO(r1, r0)
LOAD(r0, mem + 0)
r0 = FOO(r0, r1)

There you have it: a fully allocated program in a single pass!

The big picture

Taking a more rigorous look, when we arrive at an operation, the output is located in either a register or memory; each input is either in a register, memory, or unassigned.

We need to make sure that all inputs and outputs are assigned to registers at the moment of the operation. To do so, we may need to evict existing register assignments to make room.

All of this boils down to very carefully placed LOAD and STORE operations around the target operation; at one point, I wrote out all 18 possible states for a two-input operation by hand to make sure I got them right.

(There are also special cases if an input is repeated!)

We'll dive into the code shortly, but first, let's take a little detour.

A cheap LRU cache

Picking which register to evict is an easy decision in a two-register machine. Real-world machines, fortunately, have more than two registers. For systems with more than two registers, how do we quickly decide which one to evict?

We expect a value that has been used recently to be used again soon; this is called "temporal locality".

Temporal locality suggests that we should avoid evicting values that have been used recently. I implemented a dead simple least-recently-used (LRU) cache using an array-backed doubly-linked list:

/// Dead-simple LRU cache, implemented as a doubly-linked list with a static
/// backing array.
/// 
///              <-- prev next -->
///      -------      -------      -------      --------
///      |  a  | <--> |  b  | <--> |  c  | <--> | head | <--|
///      -------      -------      -------      --------    |
///         ^                       oldest       newest     |
///         |-----------------------------------------------|
pub struct Lru<const N: usize> {
    data: [LruNode; N],
    head: usize,
}

#[derive(Copy, Clone)]
struct LruNode {
    prev: usize,
    next: usize,
}

(who says Rust can't handle linked lists?)

Keeping the LRU cache in stack-allocated arrays – instead of nodes allocated in the heap, stitched together by pointers – improves its (CPU) cache friendliness.

Each of the N registers has a slot in the data array, with indexes 0 through N - 1. The prev and next indexes, when followed as indexes into self.data, form a circular linked list; the index head is the newest node.

At this point, we can bring all of the usual linked-list algorithms to bear. We'll initialize the list with self.head = 0, and implement node removing and insertion:

impl<const N: usize> Lru<N> {
    pub fn new() -> Self {
        let mut out = Self {
            head: 0,
            data: [LruNode { prev: 0, next: 0 }; N],
        };
        for i in 0..N {
            out.data[i].prev = i.checked_sub(1).unwrap_or(N - 1);
            out.data[i].next = (i + 1) % N;
        }

        out
    }

    /// Remove the given node from the list
    fn remove(&mut self, i: usize) {
        self.data[self.data[i].prev].next = self.data[i].next;
        self.data[self.data[i].next].prev = self.data[i].prev;
    }

    /// Inserts node `i` before location `next`
    fn insert_before(&mut self, i: usize, next: usize) {
        let prev = self.data[next].prev;
        self.data[prev].next = i;
        self.data[next].prev = i;
        self.data[i] = LruNode { next, prev };
    }

To mark a node as recently-used, we implement poke(...), which moves it to the head of the list.

impl<const N: usize> Lru<N> {
    /// Mark the given node as newest
    pub fn poke(&mut self, i: usize) {
        let prev_newest = self.head;
        if i == prev_newest {
            return;
        } else if self.data[prev_newest].prev != i {
            // If this wasn't the oldest node, then remove it and
            // reinsert itright before the head of the list.
            self.remove(i);
            self.insert_before(i, self.head);
        }
        self.head = i; // rotate
    }
}

To get the oldest item, we look to head.prev, then rotate the list, which makes that node the newest.

impl<const N: usize> Lru<N> {
    /// Look up the oldest node in the list, marking it as newest
    pub fn pop(&mut self) -> usize {
        let out = self.data[self.head].prev;
        self.head = out; // rotate so that oldest becomes newest
        out
    }
}

These two functions are all we need for our register allocator: Whenever a register is used, we poke it in the LRU cache; whenever we need to pick a register to evict, we pop the oldest one (which has the side effect of marking it as fresh).

The RegisterAllocator itself

Now, let's take a look at the RegisterAllocator itself. I'm not going to drag you through every single line of code, but looking at the data structure should clarify the plan.

First, our variable type from the SSA input:

#[derive(Copy, Clone, Debug, Eq, PartialEq)]
struct SsaVariable(usize);
impl SsaVariable {
    pub const INVALID: Self = Self(usize::MAX);
}

Next, strongly-typed values representing various kinds of allocations

#[derive(Copy, Clone, Debug, Eq, PartialEq)]
struct Register(usize);

#[derive(Copy, Clone, Debug, Eq, PartialEq)]
struct Memory(usize);

#[derive(Copy, Clone, Debug, Eq, PartialEq)]
enum Allocation {
    Register(Register),
    Memory(Memory),
    Unassigned,
}

Finally, the RegisterAllocator itself:

pub struct RegisterAllocator<const N: usize> {
    /// Map from a `SsaVariable` in the original tape to the
    /// relevant allocation
    allocations: Vec<Allocation>,

    /// Map from a particular register to the index in the original tape
    /// that's using that register, or `SsaVariable::INVALID` if the
    /// register is unused.
    registers: [SsaVariable; N],

    /// Our least-recently-used cache of registers
    register_lru: Lru<N>,

    /// Available registers
    spare_registers: ArrayVec<Register, N>,

    /// Available memory slots
    spare_memory: Vec<Memory>,

    /// Total allocated slots (registers and memory)
    total_slots: usize,

    /// Output slots, assembled in reverse order
    out: Vec<AsmOp>,
}

Let's look at just one of the functions within RegisterAllocator: the function which is responsible for handling a unary operation:

    /// Lowers an operation that uses a single register into an
    /// `AsmOp`, pushing it to the internal tape.
    ///
    /// This may also push `Load` or `Store` instructions to the internal
    /// tape, if there aren't enough spare registers.
    fn op_reg(&mut self, out: SsaVariable, arg: SsaVariable, name: &str) {
        let out = self.get_out_reg(out);
        match *self.get_allocation(arg) {
            Allocation::Register(lhs) => {
                self.out.push(AsmOp::Unary { out, lhs, name });
                self.release_reg(out);
            }
            Allocation::Memory(m_y) => {
                self.out.push(AsmOp::Unary { out, lhs: out, name });
                self.rebind_register(arg, out);

                self.out.push(AsmOp::Store { src: out, dst: m_y });
                self.release_mem(m_y);
            }
            Allocation::Unassigned => {
                self.out.push(AsmOp::Unary { out, lhs: out, name });
                self.bind_initial_register(arg, out);
            }
        }
    }

There are a bunch of unexplained helper functions with descriptive names (get_out_reg, get_allocation, etc), but the broad strokes should be clear:

For all of the finer details, I'll direct you to the code, which has extensive docstrings explaining what's going on.

Before we move on, let's count the places that could perform allocation:

It's worth noting that the RegisterAllocator itself could be reused, which amortizes the allocation cost down even further!

A small showdown with Clang

I've been testing on a file with 6362 instructions, which begins

$14 = INPUT0()
$13 = MUL_IMM($14)
$6225 = ADD_IMM($13)
$197 = ADD_IMM($13)
$6361 = NEG($197)
$6360 = MAX($6225, $6361)
$6224 = NEG($6225)
$6357 = SQUARE($6224)
# ...and continues in this vein for some time

Running in a tight loop, I see roughly the following timing:

Doing this 100×, the average round-trip time is 3.9 ms.

This is pretty great: our register allocation is already one of the fastest parts of the system, second only to reading the file!

(The parser is a total hack job, so I'm not surprised to see it underperforming)

The output is a fully-allocated script, written into out.txt:

# less out.txt
r12 = INPUT0()
r2 = MUL_IMM(r12)
r22 = ADD_IMM(r2)
r0 = ADD_IMM(r2)
STORE(r0, mem + 36)
r0 = NEG(r0)
r0 = MAX(r22, r0)
r22 = NEG(r22)
r13 = SQUARE(r22)
# ...and so on

You can see that instructions are in the same order as the input script, but use registers instead of SSA variables; a single STORE operation is in evidence, and both r22 and r0 are reused a few times even in this short block of code.

Obviously, Clang cannot accept our homebrew scripting language. However, we can trivially convert it into a C program:

// Forward declarations of all our operations
float INPUT0(void);
float MUL_IMM(float);
float ADD_IMM(float);
float NEG(float);
// ... etc

float run() {
    const float v14 = INPUT0();
    const float v13 = MUL_IMM(v14);
    const float v6225 = ADD_IMM(v13);
    const float v197 = ADD_IMM(v13);
    const float v6361 = NEG(v197);
    // ...
    return v0;
}

Using external functions forces the compiler to call them, meaning it can't get too clever with optimization. Since we've included parse + output time in our benchmarking of SSRA, this is a fair-ish comparison: the compiler also has to parse the input, perform register allocation, and write an output file.

(Of course, it's not totally fair, because the compiler is also doing codegen, but that should be pretty simple for this code. The goal here is a rough comparison!)

Compiling this to an object file, the disassembly looks familiar:

# cc -c -O3 run.c -o out.o && otool -XtvV out.o > out.s && less out.s
_run:
        stp     d15, d14, [sp, #-0x60]!
        stp     d13, d12, [sp, #0x10]
        stp     d11, d10, [sp, #0x20]
        stp     d9, d8, [sp, #0x30]
        stp     x28, x27, [sp, #0x40]
        stp     x29, x30, [sp, #0x50]
        add     x29, sp, #0x50
        sub     sp, sp, #0x1e0
        bl      _INPUT0
        stur    s0, [x29, #-0xb0]
        bl      _MUL_IMM
        fmov    s9, s0
        bl      _ADD_IMM
        fmov    s8, s0
        fmov    s0, s9
        bl      _ADD_IMM
        str     s0, [sp, #0xa8]
        bl      _NEG
        fmov    s1, s0
        fmov    s0, s8
        bl      _MAX
        fmov    s10, s0
        fmov    s0, s8
        bl      _NEG
        stur    s0, [x29, #-0x78]
        bl      _SQUARE
        # ...etc

(I'm running on an M1 Macbook, so all of the assembly in this section is AArch64)

We see a function prelude (the stp instructions), followed by calls to our dummy operations (beginning with bl _INPUT0). In addition, we see one register being moved to the stack:

        stur    s0, [x29, #-0xb0]

It's more verbose than the pseudocode produced by SSRA – particularly because arguments to our operations must be in s0 and s1 – but definitely has a similar flavor.

How long did the compilation take?

$ time cc -c -O3 run.c -o out.o

real    0m0.369s
user    0m0.312s
sys     0m0.045s

By sheer coincidence, SSRA is almost exactly 100× faster (95×, to be precise).

With the disassembly, we can do a few other comparisons.

First, functions are called with bl or b instructions. Let's make sure we have the right number of function calls, counting them with wc -l:

$ grep "\tb" out.s|head
        bl      _INPUT0
        bl      _MUL_IMM
        bl      _ADD_IMM
        bl      _ADD_IMM
        bl      _NEG
        bl      _MAX
        bl      _NEG
        bl      _SQUARE
        bl      _INPUT1
        bl      _MUL_IMM
$ grep "\tb" out.s|wc -l
    6362

Sure enough, this matches the number of operations in our input!

Now, let's see how many memory operations were generated in the Clang code. Memory operations use the ldr, ldur, str, and stur instructions:

$ grep -E "ldu?r" out.s|wc -l
     1115
$ grep -E "stu?r" out.s|wc -l
     431

We see 1115 loads and 431 stores.

Compare that with our generated code:

$ grep LOAD out.txt|wc -l
     582
$ grep STORE out.txt|wc -l
     582

Amazingly, our homebrew register allocator produces fewer memory operations in total: 1163, versus 1546 for Clang.

Or does it?

Remember, we performed register allocation assuming we had 24 registers available. This is based on the A64 calling convention, using floating-point registers 8 through 32. In this case, we're doing single-precision floating-point operations, so the registers are indicated with s, e.g. s8.

Now, which registers did Clang use? We can find out from the disassembly:

$ grep -Eo "s[0-9]+" out.s|sort --version-sort -u
s0
s1
s8
s9
s10
s11
s12
s13
s14
s15

It turns out that Clang only used 10 registers:

This was surprising!

I attempted to convince Clang that it could use all the registers by annotating all of my dummy functions with __attribute__((preserve_all)).

This promptly crashed Clang:

$ cc -c -O3 run.c
fatal error: error in backend: Unsupported calling convention.
clang: error: clang frontend command failed with exit code 70 (use -v to see invocation)
Apple clang version 14.0.0 (clang-1400.0.29.102)
Target: arm64-apple-darwin21.6.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
clang: note: diagnostic msg:
********************

PLEASE ATTACH THE FOLLOWING FILES TO THE BUG REPORT:
Preprocessed source(s) and associated run script(s) are located at:
clang: note: diagnostic msg: /var/folders/r7/0hh1cx9925d9l33bwbv712fh0000gn/T/run-65434d.c
clang: note: diagnostic msg: /var/folders/r7/0hh1cx9925d9l33bwbv712fh0000gn/T/run-65434d.sh
clang: note: diagnostic msg: Crash backtrace is located in
clang: note: diagnostic msg: /Users/mkeeter/Library/Logs/DiagnosticReports/clang_<YYYY-MM-DD-HHMMSS>_<hostname>.crash
clang: note: diagnostic msg: (choose the .crash file that corresponds to your crash)
clang: note: diagnostic msg:

********************

(try it for yourself on Compiler Explorer)

At this point, despite having spent too much time on this writeup, I turned to gcc.

GCC 12 runs in about 1.2 seconds, between 3-4× slower than Clang. However, it uses many more registers!

$ time gcc-12 -O3 run.c -c -o out.gcc.o

real    0m1.179s
user    0m1.099s
sys     0m0.062s
$ otool -XtvV out.gcc.o > out.gcc.s
$ grep -Eo "s[0-9]+" out.gcc.s|sort --version-sort -u
s0
s1
s2
s3
s4
s5
s6
s7
s8
s9
s10
s11
s12
s13
s14
s15
s16
s17
s18
s19
s20
s21
s22
s23
s24
s25
s26
s27
s28
s29
s30

(This is without __attribute__((preserve_all)), which has no effect)

Like Clang, GCC uses s8-15. Unlike Clang, GCC uses s16-30 (but not s31, oddly enough); it also uses s2-7, which are defined as function arguments in our calling convention.

So, does it successfully use those registers to minimize loads and stores?

$ grep -E "ldu?r" out.gcc.s|wc -l
    1268
$ grep -E "stu?r" out.gcc.s|wc -l
     569

Nope! Despite using 29 (non-argument) registers to Clang's 8, GCC ends up doing more loads and stores!

Performance summary

I collected a little more data and made a summary chart:

ClangGCCSSRA[24]SSRA[10]SSRA[8]
Runtime369 ms1.2 sec3.9 ms3.8 ms3.8 ms
Registers82924108
Load1115126858211511424
Store43156158211511424
Memory words139117102116118

Notes

Commentary

The asymmetry between load and store counts puzzled me at first, until I realized what the compilers are doing: they're smart enough to leave a value in memory and load it multiple times; SSRA will only allow a value to be in one place at a time (register or memory).

Clang seems pretty impressive: given its self-imposed limitation of 2 argument + 8 scratch registers, it manages to use an impressively small number of loads and stores, at a slight penalty to stack space.

GCC is less impressive: it's allowing itself more registers than any other system, but is still generating a bunch of memory operations.

SSRA, in my humble opinion, is also impressive: it runs two orders of magnitude faster, and doesn't dramatically underperform the more powerful compilers. Even when limited to 8 registers, it's within a factor of 2x of Clang's load/store count (to say nothing of GCC).

Take all of this with a grain of salt: I'm not a compiler engineer, and there are probably good reasons for the choices that they're making. The real test would be benchmarking the generated code, which I'll classify as "future work".

(In particular, I'm curious whether loads are cheaper than stores, which would explain why the compilers prefer to keep copies of values on the stack; Performance Speed Limits suggests that's the case in steady-state loops. If you've got any insights, send me an email or ping me on Twitter).

Register pressure

Just for fun, here's a quick graph of register count against load/store operations:

Register pressure, starting at 8000 memory ops with 2 registers

This is running on the same 6362-operation program; at 24 registers, about 20% of the generated code is memory operations.

To keep every value in registers, we would need 126 of them!

Demo

Here's a second demo editor, with an extra slider to adjust register count.



24 registers

Conclusions and notes about this writeup

I went into this exercise unsure of whether I could improve on state-of-the-art compilers. It turns out that – for this very specific sub-problem – it's definitely possible!

The solid-state register allocator is a means to an end, and exists as a subcomponent in a larger (unreleased) system. If you'd like to hear when I publish more, subscribe via Atom or follow me on Twitter.

The code for this writeup is all published on Github. The main branch is what's running on this page; the bench branch includes benchmarking in the test suite.

The repo is just sufficient to run the demos and benchmarks in this write-up – and isn't suitable for other applications – so I'm not releasing it on crates.io. If you want to do something similar, I encourage you to adapt the code for your own applications!

Benchmarking was done with native code, but the demos on this page use wasm-pack to run the same code in the browser. Very casual benchmarking reveals the overhead to be under 2× (the large sample program runs in 6.1 ms, versus 3.8 ms for the native build).

Text highlighting is done with highlight.js and a custom syntax file for the pseudo-assembly language. The text editor with syntax highlighting was originally based on this CSS-tricks writeup, but evolved significantly since then; look at the source for details if you dare (run.js, style.css).