In my last post I made the gravitational force loop of the KETJU integrator about 20% faster without changing a digit of its physics output. The whole thing rested on a single profile: 102 particles, one MPI rank, where the force loop and its distance helper together ate 76% of the runtime. When the heat is that concentrated, the job is easy. You go where the profiler points.
A 30% hole in the timing
I noticed because the numbers didn’t close. KETJU already has a handful of performance sub-timers, and once I pushed the task count up, they only added up to about 70% of the total:
time_total: (everything)
time_force + time_force_comm + time_coord + time_mst
+ time_gbscomm + time_gbsextr ≈ 70% of time_total
^^^ so where's the other 30%?Looking at what those six timers actually wrapped, the answer was obvious in hindsight. They covered the force loop, the coordinate transforms, the MST build, and the GBS comm and extrapolation, i.e. exactly the things that look expensive on one rank. The replicated serial work ran on every task and was timed by nothing. About 30% of the runtime simply wasn’t being measured.
Making the missing work visible
I added three counters to the performance struct, splitting the unmeasured work into chunks that meant something:
struct ketju_performance_counters {
/* ...existing timers... */
double time_drift_kick; // leapfrog arithmetic + compute_T
double time_backup; // state / GBS data copies
double time_init; // per-call GBS/MST setup + teardown
};time_drift_kick is the leapfrog drift/kick arithmetic pluscompute_T .compute_T is the real offender here: a full O(N) loop with no load-splitting, run on every task, once per drift. Nothing at one rank, a wall at a thousand.time_backup is the state and GBS data copies,store_state_backup ,store_gbs_data , and the memcpy half of folding a GBS result back into the system.time_init is the per-call GBS and MST setup and teardown, the initial timestep estimate, and the reinit that fires on an order change or a black-hole merger.
A couple of things mattered while wiring these in. I kept every timer at leaf level and never wrapped a region that already had a counter inside it, otherwise the same nanoseconds get charged twice and the total stops meaning anything. And I had
time_unaccounted = time_total - (sum of all sub-timers)I care about that residual line more than the new counters. It’s a tripwire. If it ever drifts back up, it means someone added expensive work that nothing is timing, and now the output says so instead of leaving future-me to rediscover the 30% gap from scratch.
Wrapping up
The first post was the fun part: find the hot loop, make it faster. This one is the less glamorous half, which is making sure the performance metrics you’re staring are actually accurate. A single-rank profile is a narrow view, and the only way to trust it is to check that everything it leaves out adds up to nothing. Here it didn’t, and that gap was the more interesting result.
You can find the repo on my Gitlab:
Thanks for reading!
Until next time =)
