Elf Assembly JIT Compiler

Overview

On Days 16, 19, and 21 of Advent of Code 2018, the problems are based on a fictional virtual machine and assembly language.

The machine has a small number of registers and opcodes, and is programmed in a textual assembly language. Unusually, there are no jump instructions. Instead, the instruction pointer is assigned to one of the registers, and modified with normal math operations.

  Instructions          Registers
  /----------------\   /-----------------------\
  | seti 123 0   3 |   | 0 | 1 | 2 | 3 | 4 | 5 |
  | bani 3   456 3 |   \-----------------------/
  | eqri 3   72  3 |                 ^ip
  | addr 3   4   4 |
  | ...            |  The running system looks
  \----------------/    something like this!

(not to be confused with the ELF file format)

The Challenge

Day 21 is particularly interesting: We're given a particular program and asked to solve for values in register 0 that causes the program to terminate as soon as possible, then as late as possible (without looping forever).

After digging into the code, it becomes clear that at a certain point, we compare registers 0 and 3, and terminate if they're equal.

By reading the value in register 3 when this comparison is made, we can solve for the two objectives. The first comparison value, if loaded into register 0, would cause the program to terminate as soon as possible; the last non-repeating comparison would cause the program to run as long as possible.

We reach the first comparison relatively quickly, after evaluating 1,845 instructions.

This can be run on a fairly simple interpreter, which operates on the instruction list and terminates when we hit that target instruction.

I implemented such an interpreter in two pieces. First, the op-code evaluator:

    fn eval(&self, s: &mut Registers) {
        s.0[self.c] = match self.op {
            addr => s.0[self.a] + s.0[self.b],
            addi => s.0[self.a] + self.b,
            mulr => s.0[self.a] * s.0[self.b],
            muli => s.0[self.a] * self.b,
            banr => s.0[self.a] & s.0[self.b],
            bani => s.0[self.a] & self.b,
            borr => s.0[self.a] | s.0[self.b],
            bori => s.0[self.a] | self.b,
            setr => s.0[self.a],
            seti => self.a,
            gtir => (self.a > s.0[self.b]) as usize,
            gtri => (s.0[self.a] > self.b) as usize,
            gtrr => (s.0[self.a] > s.0[self.b]) as usize,
            eqir => (self.a == s.0[self.b]) as usize,
            eqri => (s.0[self.a] == self.b) as usize,
            eqrr => (s.0[self.a] == s.0[self.b]) as usize,
        };
    }

Second, the run loop, which terminates at instruction 28:

    let mut state = Registers([0, 0, 0, 0, 0, 0]);
    let mut ip = 0;
    loop {
        if ip >= tape.len() {
            break;
        }
        state.0[ip_reg] = ip;
        tape[ip].eval(&mut state);
        ip = state.0[ip_reg] + 1;
        if ip == 28 {
            println!("Part 1: {}", state.0[3]);
            break;
        }
    }

This works great to find the first answer, which, remember, is found after 1,845 instructions.

The final comparison, on the other hand, is all the way at 2,893,422,946 instructions.

On my first pass through, I didn't know that value. I started running my interpreter, but was impatient, and killed it after about 10 seconds. To speed things up, I analyzed the assembly and rewrote it in Rust, preserving the computation but allowing for much faster execution.

It ended up running incredibly fast, but required a lot of grunt-work to churn through the assembly.

After getting the solution, I got curious – just how slow would the interpreter be? It turned out to be faster than I expected: it found the second solution after about 16 seconds, i.e. just slightly above my impatience threshold.

After exploring these two routes, I decided to take a third path: Using LLVM to compile the Elf assembly language into native code, so it would could run fast without needing a manual rewrite.

The code is on Github. The sections below give a walkthrough, but you can also skip to the Results to see performance details.

Opcode evaluation

After a cursory search, I decided to use the Inkwell LLVM bindings.

There's a lot of boilerplate, but the core instruction evaluation is fairly simple.

I start by allocating an array with enough memory for six uint64, which will be our six registers. To be conservative, I plan to load + store data to and from those registers at every operation.

We iterate over every instruction, which have already been parsed into opcodes and argument descriptions.

for (i, line) in tape.iter().enumerate() {
    builder.position_at_end(&instruction_blocks[i]);

Based on whether arguments are immediate or from registers, load values for the left and right arguments:

    let a = match line.op.1 {
        Source::Immediate => i64_type.const_int(line.a as u64, false),
        Source::Register  => *builder.build_load(reg[line.a], "a")
                                     .as_int_value()
    };
    /* Do the same for b */

There are very limited number of actual operations:

    let value = match line.op.0 {
        Add => builder.build_int_add(a, b, ""),
        Mul => builder.build_int_mul(a, b, ""),
        And => builder.build_and(a, b, ""),
        Or => builder.build_or(a, b, ""),
        Set => a,
        Gt => builder.build_int_z_extend(
                builder.build_int_compare(IntPredicate::UGT, a, b, ""),
                i64_type, ""),
        Eq => builder.build_int_z_extend(
                builder.build_int_compare(IntPredicate::EQ, a, b, ""),
                i64_type, ""),
    };
    builder.build_store(reg[line.c], value);

Handling the instruction pointer

The instruction pointer is tricky, because it's stored in a register and modified through normal arithmetic.

For any instruction that doesn't modify that register, it's easy: jump straight to the next instruction.

Instructions that modify that register require a bit more thought.

I started by deploying a conditional chain after each instruction that could modify the instruction pointer register. This chain checks for each possible new value of the instruction pointer and branches to the appropriate block.

In LLVM IR, this looks something like

i14j0:                                            ; preds = %i14
  %cmp_14_0 = icmp eq i64 %ip14, 0
  br i1 %cmp_14_0, label %i0, label %i14j1

i14j1:                                            ; preds = %i14j0
  %cmp_14_1 = icmp eq i64 %ip14, 1
  br i1 %cmp_14_1, label %i1, label %i14j2

i14j2:                                            ; preds = %i14j1
  %cmp_14_2 = icmp eq i64 %ip14, 2
  br i1 %cmp_14_2, label %i2, label %i14j3

i14j3:                                            ; preds = %i14j2
  %cmp_14_3 = icmp eq i64 %ip14, 3
  br i1 %cmp_14_3, label %i3, label %i14j4

  ...

Each block compares the instruction pointer against a constant value, branches to that instruction block if it matches, and moves to the next branch-table block otherwise.

This works, but ends up being fairly slow; for some reason, LLVM doesn't turn this into a jump table.

To optimize, I recognized three patterns in the assembly:

seti (Set Immediate) is an absolute, unconditional branch
addi (Add Immediate) is a relative unconditional branch
addr (Add Register) is a relative conditional branch

I also noticed that addr usually appeared after a conditional operator, which means it only jumps 1 or 2 slots (because the conditional returns either 0 or 1).

Armed with these patterns, I deployed a more sophisticated system to decide which branches to check, and in what order. The two unconditional branches only have a single target location; the relative conditional branch could end up anywhere, but checks the next two locations first:

    if line.op.0 == Set && line.op.1 == Immediate {
        target_list.push(line.a + 1);
    } else if line.op.0 == Add && line.op.1 == Immediate {
        target_list.push(i + line.a + 1);
    } else if line.op.0 == Add {
        target_list.push(i + 1);
        target_list.push(i + 2);
        for j in 0..tape.len() {
            if !target_list.contains(&j) {
                target_list.push(j);
            }
        }
    }

That's the last significant chunk; you can see a full set of generated IR on Compiler Explorer.

This code worked okay: it was about an order of magnitude faster than my interpreter, but an order of magnitude slower than my hand-ported code.

Branch safety checking

After discussing this on Reddit, I've found a few possible improvements. It was particularly helpful to compare against the LLVM IR from another user's auto-ported code.

Here's what I noticed:
My code is very heavy on load and store operations, because of paranoia about the instruction pointer register. In contrast, the auto-ported code's IR has very few explicit load / store operations, and instead uses the phi operator to deal with branching.

This helps to explain why my code is slow: explicitly specifying load + store in each block constrains the compiler, making it less able to reason about and optimize data dependencies.

To start improving the situation, I implemented a safety check which detects whether every jump was of a known, safe form (corresponding to the heuristics above).

For example, consider the instructions

7: eqri 3 72 3  ; check whether register 3 == 72
8: addr 3 4 4   ; jump either one or two instructions

(where the instruction pointer is register 4).

If the rest of the program can't jump to instruction 8, then this pair of instructions can either jump to instruction 9 or instruction 10. In other words, instruction 8 is an unsafe landing zone, and instructions 9 and 10 are jump targets. If, through the whole program, there are no jumps that could land in unsafe landing zones, then we can be much more aggressive about IR generation.

Here's the Rust code to accumulate this information about the whole program:

let mut unsafe_landing_zones = HashSet::new();
let mut jumps = HashMap::new();

for (i, line) in tape.iter().enumerate() {
    if line.c == ip_reg {
        let t: &mut HashSet<_> = jumps.entry(i).or_insert(HashSet::new());

        // seti is an absolute jump
        if line.op.0 == Set && line.op.1 == Immediate {
            println!("Found jump from {} to {}", i, line.a + 1);
            t.insert(line.a + 1);

        // addi is a relative jump
        } else if line.op.0 == Add && line.op.2 == Immediate {
            println!("Found jump from {} to {}", i, i + line.a + 1);
            t.insert(i + line.a + 1);

        // Detect a comparison or equality operation followed by
        // an addition that uses that same register
        } else if line.op.0 == Add && line.op.2 == Register && i > 0 &&
            (tape[i - 1].op.0 == Eq || tape[i - 1].op.0 == Gt) &&
            ((line.b == ip_reg && line.a == tape[i - 1].c) ||
             (line.a == ip_reg && line.b == tape[i - 1].c))
        {
            println!("Found jump from {} to {} or {}", i, i + 1, i + 2);
            t.insert(i + 1);
            t.insert(i + 2);
            unsafe_landing_zones.insert(i);
        } else {
            println!("Found unconstrained jump");
            for j in 0..tape.len() {
                t.insert(j);
            }
        }
    }
}
let jump_targets = jumps.values().flat_map(|i| i).cloned().collect();
let proved_safe =
    unsafe_landing_zones.intersection(&jump_targets).count() == 0;

My code falls back to the slow, conservative path if we couldn't prove the jumps safe; otherwise, it uses a more optimized IR generator. This is overkill for the Advent of Code challenge, which only uses safe jumps, but safety is important.

Optimized IR generation

The optimized IR generator doesn't try to work around SSA form with load / store operations; instead, we fully embrace its constraints and quirks.

The implementation is similar to the IR generator described above. I'll go through key details below, but refer to the full source for the actual implementation.

We start by defining our own Block struct, which represents an LLVM BasicBlock plus data about the inputs, outputs, and phi values.

struct Block {
    phi: Vec<PhiValue>,
    input: Vec<IntValue>,
    output: Vec<IntValue>,
    block: BasicBlock,
};

We build a block with empty inputs and phi nodes, with the exception of the first block, which gets zeros for its registers when coming from the setup block:

let mut block = Block {
    phi: Vec::new(),
    input: Vec::new(),
    output: Vec::new(),
    block: b
};
for j in 0..6 {
    if j == ip_reg {
        block.input.push(i64_type.const_int(i as u64, false));
    } else {
        let phi = builder.build_phi(i64_type, &format!("r{}_{}", j, i));
        if i == 0 {
            phi.add_incoming(&[(&i64_type.const_int(0, false), &setup_block)]);
        }
        block.phi.push(phi);
        block.input.push(phi.as_basic_value().into_int_value());
    }
}
block.output = block.input.clone();
instruction_blocks.push(block);

Within each block, at most one output variable is modified based on the opcode; this is very similar to the earlier implementation:

let name = format!("r{}_{}_", line.c, i);
let value = match line.op.0 {
    Add => builder.build_int_add(a, b, &name),
    Mul => builder.build_int_mul(a, b, &name),
    And => builder.build_and(a, b, &name),
    Or => builder.build_or(a, b, &name),
    Set => a,
    Gt => builder.build_int_z_extend(
            builder.build_int_compare(IntPredicate::UGT, a, b, ""),
            i64_type, &name),
    Eq => builder.build_int_z_extend(
            builder.build_int_compare(IntPredicate::EQ, a, b, ""),
            i64_type, &name),
};
instruction_blocks[i].output[line.c] = value;

Then, jumps install themselves into the phi blocks of their target locations:

let get = |t: usize, prev: &BasicBlock| -> &BasicBlock {
    let (phi, block) = instruction_blocks.get(t)
        .map(|b| (&b.phi, &b.block))
        .unwrap_or((&exit_phi, &exit_block));

        let mut k = 0;
        for j in 0..6 {
            if j != ip_reg {
                phi[k].add_incoming(&[(&instruction_blocks[i].output[j], prev)]);
                k += 1;
            }
        }
        block
};

Callbacks are similar as before, but we need to populate variables before executing the call:

for j in 0..6 {
    builder.build_store(reg[j], instruction_blocks[i].output[j]);
}

let cb_result = builder
    .build_call(cb_func, &[reg_array.into()], "cb_call")
    .try_as_basic_value()
    .left()
    .unwrap();

(this is the only place that we emit explicit store instructions!)

You can see the full generated IR on Compiler Explorer; it's definitely worth comparing to the previous IR, and contrasting the size of the resulting assembly (453 vs 85 lines!).

Results

Here's the time taken by various strategies, all running on my laptop:

Implementation	Time (sec)	Instructions/sec
Interpreter	15.3	189M
JIT (original)	1.7	1.7B
JIT (improved)	0.27	11B
Hand-ported	0.13	N/A

My fastest JIT is about 56x faster than my interpreter!

It's only 2x slower than the hand-ported code, including the time it takes to compile; the actual evaluation runs in about 236 ms.

I find this result perfectly acceptable: in hand-porting the code, I used a few abstractions that aren't available in the original assembly language, so it's understandably faster.

Thanks to the folks on Reddit who offered feedback on my original writeup, and to Eric Wastl for running this madcap adventure, year after year.