Making ries-rs 4–21× faster without breaking reproducibility

ries-rs is my Rust reimplementation of Robert P. Munafo’s RIES: you hand it a number and it searches for compact algebraic equations that have that number as a solution. The first writeup was about the rewrite itself. This one is about a specific question I kept avoiding: why is a hard search slow, and what can I honestly do about it?

At a glance: Rust search engine · profile-guided optimization · top-k ranking repair · deterministic-vs-throughput contract split · opt-in parallel matching · 4.1× to 20.8× faster level-3 search on an 8-core Apple M1.

The short version: a level-3 search that took 16 to 70 seconds now finishes in a few seconds on a normal laptop. But the interesting part is not the number. It is the order in which I learned things, including the parts where I was wrong.

Where we started

RIES runs a pipeline: generate candidate equations, match each by refining LHS(x) = RHS to high precision, and rank the survivors into a bounded pool. A level sets how complex the equations are allowed to get; this post measures level 3.

A level-3 search for an “ugly” target — a number with no clean closed form — looked like this on my 8-core machine:

Generation:  ~1.0 s
Search:      ~14.8 s
  Newton calls:   22,817,186
  Pool insertions: 17,586,629   (final pool size: 16)

Two things jumped out.

First, generation was not the problem. I had assumed that enumerating millions of expressions was the cost. It was about one second of sixteen. The real cost was the matching phase: twenty-three million Newton-Raphson refinements, each trying to solve LHS(x) = RHS to high precision.

Second, the result pool was thrashing. To return sixteen matches it performed seventeen million insert-and-evict cycles. The pool was doing enormous work to keep almost nothing.

The existing --parallel mode did not help here — a surprise until I read my own code and realized it only parallelizes generation. The dominant matching phase stayed single-threaded on purpose, to preserve a guarantee I will come back to. So on a matching-bound workload, parallel mode was essentially break-even.

The first lesson: measure, then cut

My initial instinct was to optimize a function in the generation hot loop that allocated two small vectors on every call. I rewrote it to allocate nothing. It is a better function, and I kept it.

It made no measurable difference to the benchmark, because generation was never the bottleneck.

An allocation you can see is not the same as a cost that matters. The profile said matching and refinement dominated. I should have started there. The clean-but-irrelevant rewrite was a tax I paid for guessing.

Fixing the ranking model (which also fixed quality)

Once I looked at the pool, the thrashing made sense — and it pointed at a ranking bug, not just a speed bug.

The pool ranked matches first by exactness, then by absolute error. The problem: once a match is “exact” within floating-point tolerance, its remaining error is noise, down at the 10⁻¹⁶ level. Ranking exact matches against each other by that noise meant a complicated equation with a slightly smaller rounding residual would outrank a simple, beautiful one.

Concretely, for a target of 2π, a dense complexity-123 expression could beat plain x = 2π because its residual happened to be a few bits smaller. That is backwards. RIES is supposed to prefer simplicity.

The fix was to treat all sub-tolerance matches as equally exact and let complexity decide between them. This is more faithful to the original RIES intent and it stabilizes the pool. As I would discover, it is also what later made parallelism possible.

It also unlocked a clean pruning rule. Once the pool is full of exact matches, no expression more complex than the simplest one already in the pool can ever earn a place. Since the search walks candidates in increasing complexity, it can stop early the moment that ceiling is reached. This is provably safe: it only skips work that could never change the answer. For targets with a simple closed form it is dramatic: a search that used to grind for a second now finishes in a handful of refinements.

The honest wall: a contract worth keeping

The obvious way to attack twenty-three million independent Newton calls is to spread them across cores. It should be embarrassingly parallel.

Except ries-rs makes a promise, pinned by a test: parallel search returns the exact same result set as serial search, byte for byte. That promise is why --parallel only touches generation. It is valuable — it means a reproducible run is reproducible regardless of how many cores you have.

A naive parallel matcher breaks that promise. Split the candidates across workers, give each its own bounded pool, merge the results, and the marginal ranks diverge: each worker’s pool tightens its acceptance thresholds independently, so different workers keep slightly different sets. I tried it. It broke nine tests, and worse, in one early version it dropped the best match entirely. I reverted it.

The right move was not to weaken the existing guarantee. It was to add a new mode with a different, explicit contract, and leave the old one untouched.

Turbo mode, and the contract that makes it sound

--turbo parallelizes the matching phase across every core. It opts out of byte-identical reproducibility on purpose. The question is: what can it still honestly promise?

Here the earlier ranking fix paid off. Because exact matches now rank by complexity instead of by residual noise, an exact match is discovery-order-independent — whoever finds it scores it the same way. And an exact match’s right-hand side always sits well inside any worker’s search window, so no worker can miss one that the serial path would have found.

That gives a guarantee I can actually prove rather than hope for:

Turbo returns the same single best match as serial search. The global best is rank one within whichever worker owns it, so it is never evicted; and a worker that sees a subset of the candidates uses a search radius at least as wide as serial’s, so it cannot miss what serial found. After merging, the best result sorts to the top.

What turbo does not promise is the lower-ranked tail. When more matches qualify than you asked for, each worker keeps only its own top-k, so the merged tail is a valid top-k but not necessarily the same one, and it can vary with thread count. It also uses more memory, because it materializes the full candidate set instead of streaming it.

I want to flag one moment where I was wrong, because it is the most useful part. My first draft of this contract claimed turbo returned the same exact-match set as serial. Then I found a target where it didn’t: under classic ranking there can be hundreds of equally-simple exact forms, and the parallel workers each kept a different sixteen of them. The honest, provable guarantee is narrower — the single best match — so that is what the documentation and the tests now say. A guarantee you can prove beats a guarantee you wish were true.

The final numbers

Level 3, sixteen requested matches, 8 cores, serial deterministic versus turbo (full exhaustive search, no early stopping). Apple M1, 16 GB RAM, MacBook Air, macOS, native release build; times are the median of three runs:

Target	Serial	Turbo	Speedup
`2.506314` (no clean form — worst case)	16.4 s	4.0 s	4.1×
`π/2`	10.4 s	1.9 s	5.5×
`3.301`	30.7 s	4.3 s	7.1×
`2π`	71.3 s	3.4 s	20.8×

Reproduce with ries-rs 2.0.2 (commit 0904c3e), median of three exhaustive runs: ries-rs 2.506314 -l3 -n16 --deterministic vs. --turbo.

A few caveats. The speedup grows with search size: 2π is the largest workload (about 71 s serial) and parallelizes best (20.8×), while the smaller 2.506314 search (about 16 s) reaches 4.1× — the spread is load-balancing overhead on small workloads, not varying core counts, and 2.506314 (no clean closed form, a genuine grind) is the fairest single number. Turbo’s best match is identical to serial’s on every target here; that rank-1 parity is pinned by a regression test. Turbo costs memory for the speed: about 420 MiB at level 3 versus roughly 260 MiB for serial’s memory-bounded streaming path. It is opt-in, and --deterministic disables it.

The in-browser WASM build is unchanged by all of this except that it inherits the better ranking. Turbo needs a real thread pool, so the browser keeps using the sequential path; the web demo on the project page returns the same — now slightly nicer — results.

Where it can still go

I am not done. The clearest thing the profile left me with is the shape of the remaining cost: the search is exhaustive and error-first by design — it refines every plausible candidate to high precision, then ranks. Most of what follows is a consequence of that.

Load balancing. Turbo splits work into complexity-ordered bands, but the high-complexity bands carry far more work than the low ones, so work-stealing leaves smaller searches under-parallelized — which is why 2.506314 reaches only 4.1× while 2π reaches 20.8×. Finer-grained scheduling would lift the small-workload cases toward the large ones.

The real algorithmic win is upstream of all this. The original RIES is fast partly because it searches in complexity order and stops early. The complexity-ceiling pruning I added is a first step in that direction, but a search that is complexity-ordered from the ground up — rather than error-first with pruning bolted on — would cut the work before it ever reaches Newton’s method. That is a design change, not a tuning pass, so it is its own project.

The streaming path is still serial. At very high levels the engine streams to bound memory, and turbo can’t help there yet because there is no materialized candidate set to divide. A streaming parallel matcher is harder but possible.

Requesting more matches fills the pool with noise. Asking for sixteen results when a number only has two simple forms means the pool pads the rest with complex junk, which keeps the pruning ceiling high and defeats early exit. Smarter handling of “this number simply doesn’t have sixteen good answers” would help the common interactive case.

None of these are blocked. They are just the next layers, and I would rather ship a proven 4–21× with an honest contract than a bigger number I can’t stand behind.

You can try the solver on the project page, run it in the browser demo, or read the search model and turbo contract in the repository.