Optimizing the KETJU Integrator

Two supermassive black holes spiralling together in a galactic nucleus

I made a research-grade black hole simulation code about 20% faster, without changing a single digit of its physics output.

When two galaxies collide, the supermassive black holes at their centres sink inward and eventually pair up into a tightly bound binary that grinds down over millions of years before merging in a burst of gravitational waves. Simulating that process is brutally expensive, and the code that does it has to be both fast and extraordinarily careful with its arithmetic. This is the story of how I sped up the gravitational core of one such code, and the discipline required to prove I hadn’t broken the science in the process.

What we are actually simulating

The code I worked on is the integrator at the heart of KETJU / MSTAR (the name means “chain” in Finnish). Its job is to follow the motion of stars and supermassive black holes in the dense, chaotic centre of a galaxy, where Newton’s law of gravity alone isn’t accurate enough. Near a black hole, objects move at a noticeable fraction of the speed of light, so the code layers on post-Newtonian corrections: extra force terms borrowed from general relativity that capture effects like the slow inspiral that eventually produces a gravitational-wave signal.

Black hole with stars orbiting around it (artistic representation)

This is genuinely hard physics to compute, for one main reason: dynamic range. A black hole binary can be a million times tighter than the star cluster surrounding it. A naive integrator would either take impossibly tiny timesteps to resolve the binary, or smear it out entirely. KETJU gets around this with a technique called algorithmic regularization, a clever change of variables that lets the integrator sail smoothly through violent close encounters that would otherwise blow up the equations. Wrapped around that is a Gragg–Bulirsch–Stoer extrapolation scheme, which repeatedly halves the step size and extrapolates toward the “infinitely accurate” answer.

The upshot is a code where accuracy is the entire point. And that turns out to matter enormously the moment you try to make it faster.

The cost of gravity

Strip away the relativity and the regularization, and the beating heart of any N-body code is the same humble loop: to know how every object accelerates, you have to add up the gravitational pull of every other object. With N particles, that’s roughly N²/2 pairwise interactions, every single timestep. This is the famous O(N²) force loop, and in this code it looks like this:

for (int i = istart; i < istop; i++) {
    for (int j = jstart; j < jstop; j++) {
        const double r2 = separation_and_r2(/* ... */);   // distance between i and j
        const double invr = 1. / sqrt(r2);
        const double Ginvr3 = G * invr * invr * invr;
        // accumulate the force on both particles...
    }
}

Profiling confirmed that this is where essentially all the time goes. I used gprof, the same trusty tool I reach for in most of my optimization work, and the flat profile was blunt about it:

  %   cumulative   self              self
 time   seconds   seconds    calls            name
 51.75     2.95     2.95     21750            comp_U_acc_loop_standard
 25.88     4.42     1.48  114712770           separation_and_r2
  7.19     4.83     0.41   38273772           ketju_check_relative_proximity_ND_2

Two functions, the force loop and the distance helper it calls, accounted for about 76% of the entire runtime. That helper, separation_and_r2, was being called one hundred and fourteen million times. If I wanted a faster code, this was the only neighbourhood worth visiting.

The golden rule: don’t touch the physics

Before changing anything, I want to explain the constraint that shaped every decision, because it’s what separates “optimizing a simulation” from “optimizing a web server.”

Gravitational systems are chaotic. Two simulations that differ by a single bit in the last decimal place will, after enough steps, diverge into completely different futures. That sounds like a disaster, but physicists handle it by demanding that the code conserve certain quantities, above all energy, to a known tolerance. KETJU’s whole architecture exists to keep that floating-point error under control.

Here is the part that surprised me most when I read the code. That innocent-looking separation_and_r2 function does not always compute the distance between two particles in the obvious way. It has two branches:

if (/* the two particles are close together in the tree */) {
    // Sum up small relative-position vectors along a "minimum spanning tree"
    for (int l = 0; l < *d; l++) { dr[k] += sign[l] * edge_rel_pos[path[l]][k]; }
} else {
    // The ordinary way: subtract one absolute position from the other
    dr[k] = pos_j[k] - pos_i[k];
}

Why the complexity? Because subtracting two large, nearly-equal numbers is the cardinal sin of floating-point arithmetic. If two stars are both about 10,000 units from the origin but only 0.001 units apart, computing pos_j - pos_i throws away almost all your precision. This is called catastrophic cancellation.

KETJU’s answer is the minimum spanning tree (MST) coordinate system: instead of storing every position relative to a distant origin, it stores positions relative to neighbours, chained along a tree. For nearby particles, the separation is then built by adding up a handful of tiny, already-precise edge vectors, no cancellation, no lost digits.

The consequence for me was a hard rule: whichever branch a given pair takes, and the exact order in which those numbers are added, must never change. Flip a pair from the tree branch to the subtraction branch and you’ve altered its round-off, which alters the chaotic trajectory, which means you are now simulating a different universe. So my target wasn’t “compute the same thing approximately faster.” It was “compute the bit-for-bit identical result faster.” My pass/fail test was a single number printed at the end of every run, the relative energy error:

relative_energy_error: -3.408942e-07

If that number changed in any digit, the optimization was wrong, full stop.

With the rules set, I went looking for wasted work rather than different work. The first culprit was a subtle one that trips up almost every C programmer eventually: pointer aliasing.

Inside the loop, the code reads a lot of read-only data, particle positions, the tree structure, masses, and writes the results (accelerations and energy) back through a big nested struct, roughly R->internal_state->…. The problem is that the compiler can’t prove those two things live in different places in memory. As far as it knows, writing a new acceleration value might secretly overwrite the tree data it’s about to read next. To stay safe, it does the paranoid thing: it re-reads the entire chain of pointers on every one of the 114 million iterations, even though they never actually change.

The fix is to make a promise to the compiler. I hoisted those unchanging pointers into local variables marked with the restrict keyword, which is C’s way of saying “trust me, these don’t overlap with anything else”:

const struct graph_vertex *const restrict vertex = is->mst->vertex;
double (*const restrict pos)[3]                   = ps->pos;
double (*const restrict acc)[3]                   = is->acc;

With that single guarantee, the compiler relaxes. It loads each pointer once, keeps it in a register for the whole loop, and stops second-guessing itself. Same arithmetic, same results, just without the redundant memory traffic.

Inlining the hot path

The next target was that third profiler entry, ketju_check_relative_proximity_ND_2, the function that decides which of the two distance branches a pair should take. It was being called 38 million times, and it lived in a separate source file from the loop that used it. That meant every call paid the full overhead of a function call, and the compiler couldn’t see inside it to optimize across the boundary.

Worse, the function recomputed a value the caller had already computed. The loop checks how far apart two particles are in the tree to decide whether to even bother; the function then computed that exact same difference again from scratch.

The fix was straightforward: I moved these small, hot decision functions into the shared header as static inline functions, right next to where they’re used. Now the compiler can paste their bodies directly into the loop, delete the duplicated work, and erase 38 million function calls. Again, identical logic, identical decisions, just no overhead around them.

The biggest win: keeping accumulators in registers

The single most effective change came from thinking about memory like a physicist thinks about a conserved quantity, something you want to hold onto, not keep putting down and picking back up.

In the inner loop, each pair updates the acceleration of both particles involved: particle i and particle j. But across the entire inner loop, i is fixed, only j changes. So particle i’s acceleration was being read out of memory, nudged, and written back to memory on every single iteration, millions of times for the same three numbers.

Ideally the compiler would keep i’s acceleration in a CPU register for the duration. But it couldn’t, for the same aliasing reason as before: it couldn’t rule out that a write to particle j might secretly be a write to particle i. So I did it by hand, accumulating into three plain local variables and only writing back once, after the loop finishes:

double ai0 = acc[i][0], ai1 = acc[i][1], ai2 = acc[i][2];   // pull into registers
 
for (int j = jstart; j < jstop; j++) {
    // ...
    const double t0 = dr[0] * Ginvr3, t1 = dr[1] * Ginvr3, t2 = dr[2] * Ginvr3;
    ai0 += mj * t0;  ai1 += mj * t1;  ai2 += mj * t2;          // i: cheap register adds
    acc[j][0] += -mi * t0;  acc[j][1] += -mi * t1;  acc[j][2] += -mi * t2;  // j: still memory
}
 
acc[i][0] = ai0;  acc[i][1] = ai1;  acc[i][2] = ai2;          // write back once

This removed three memory reads and three memory writes per iteration, replacing them with register arithmetic that the CPU does almost for free. And because the additions happen in exactly the same order on exactly the same values, the result is still bit-identical. This one change was worth about a 13% speedup in the force loop on its own.

Knowing when to stop: measuring the ceiling

There was one more idea I was tempted by. Of those 38 million calls to the branch-decision function, the overwhelming majority answer “no, these particles aren’t close in the tree.” I could imagine a cheaper pre-filter to reject most of them faster.

But before writing a line of it, I ran an experiment to find the ceiling: I temporarily ripped the decision out entirely (producing deliberately wrong physics, just to time it) to see the absolute most I could ever save. The answer:

proximity check, full:    ~0.477 s
proximity check, removed: ~0.468 s   (≈ 1.9% of the loop)

Less than 2%, and that’s the best case, with a magic zero-cost filter that doesn’t exist. A real pre-filter would have its own cost plus extra bookkeeping every time the tree is rebuilt, almost certainly netting out to zero or worse. So I didn’t build it. Measuring first saved me from spending a day making the code more complicated and no faster. Knowing which optimizations not to chase is, I think, as valuable as knowing which ones to.

A humbling plot twist

Now for the part that taught me the most.

I had benchmarked everything carefully, the optimized code was consistently ~20% faster, energy error unchanged. Then a comparison against a pristine copy of the original code came back saying my “optimized” version ran in 2.6 seconds versus the original’s 0.76 seconds. According to that, I’d made it three times slower.

My stomach dropped. But because I’d been religious about measuring, I didn’t trust the headline, I went and reproduced it from scratch. The culprit was sitting in the build configuration:

CMAKE_BUILD_TYPE:STRING=Relase     # ← "Release" is misspelled

A single missing “e”. The build system selects optimization flags by matching the build-type name exactly. Relase matches nothing, so the compiler silently dropped to no optimization at all, no -O3, building a debug-grade binary that happened to be lying around. I was comparing an unoptimized build of my code against a properly optimized build of the original. Once I fixed the typo and rebuilt, the picture snapped back to reality:

original  (optimized build):  ~0.82 s
mine      (optimized build):  ~0.66 s

The lesson is one I’ll carry into any research setting: a surprising performance number is far more often a measurement bug than a real result. The methodology, interleaving runs to cancel out machine noise, fixing every flag, checking a known invariant after every change, is not bureaucracy. It’s the only thing standing between you and a confidently-stated wrong conclusion. In a chaotic simulation where you can’t eyeball whether the answer is right, that discipline is the whole game.

Results

Stacking up the changes, here is the final scorecard on the gravitational force loop, the part that dominates the runtime:

original                                   ~0.531 s   (force loop)
+ restrict / hoisting / inlining           ~0.477 s   (~10% faster)
+ register-accumulated acceleration        ~0.414 s   (~13% faster)

End to end, that worked out to roughly a 21% speedup in the force kernel and about 19–20% off the total runtime, verified by running the old and new binaries back-to-back, dozens of times, and taking the comparison pair by pair to beat down the noise.

And the number that actually mattered, the one I checked after every single edit:

relative_energy_error: -3.408942e-07     # before
relative_energy_error: -3.408942e-07     # after

Identical to the last digit. The full unit-test suite, including the tree-construction and proximity tests, passed unchanged. The physics is exactly what it was. It just gets there faster.

Why a fifth of a second matters

It would be fair to ask why anyone should care about shaving 20% off a toy run that takes under a second.

The answer is that real KETJU simulations don’t take a second. They run on supercomputers for weeks, following galaxy mergers with millions of particles across millions of timesteps, and they’re run not once but in large suites to map out how the answer depends on the initial conditions. In that world, a 20% reduction is 20% fewer core-hours, which is real money, real energy, and real waiting time for a research group, multiplied across every run they ever do. It’s also 20% more science you can afford with the same allocation: a slightly larger cluster, a slightly longer integration, one more parameter you get to vary.

But honestly, the part I enjoyed most wasn’t the speedup. It was the constraint. Working inside a code where correctness is non-negotiable and chaos punishes the smallest mistake forced a kind of rigor I find genuinely exciting, the discipline of proving, not hoping, that a change is safe. That intersection, where careful low-level engineering meets the demands of real physics, is exactly where I want to spend the next several years.

You can find the repo on my Gitlab:

Thanks for reading!
Until next time =)