Skip to content

Performance evidence

Flagship benchmark page

This page is the committed benchmark bundle for the runtime questions that matter most in review: workload-class scaling, PDE cost versus accuracy, digital-payoff remedies, and the end-to-end surface-to-PDE stage budget.

Read it as a benchmark case study rather than a leaderboard. The point is to show what actually matters, where extra runtime buys something real, and which stages deserve optimization effort before the quieter ones do.

Vectorized IV scaling Runtime versus error Stage-budget attribution

Read after the proof path

Benchmark overview

This opening proof object does the same job the other flagship pages do: land one bounded review surface first, then let the page zoom into interpretation. It compresses the benchmark families that matter most in review before the narrative narrows to runtime priorities and supporting checks.

Benchmark overview: committed IV scaling, PDE runtime-versus-error, macro stage budget, and benchmark callouts derived from published artifacts. Benchmark overview: committed IV scaling, PDE runtime-versus-error, macro stage budget, and benchmark callouts derived from published artifacts.

One view across the benchmark families that matter most in review: vectorized IV scaling, runtime/error tradeoffs, digital remedies, and integrated stage-budget attribution.

What actually matters

  • The strongest review signal here is not one absolute timing number. It is whether the bundle separates workload-class wins, cost-versus-accuracy tradeoffs, and genuine end-to-end bottlenecks.
  • In the committed macro run, surface fitting plus PDE repricing already consume 97.5% to 99.1% of total runtime, while quote mesh + handoff + local-vol stay between 17.78 ms and 26.16 ms combined.
  • That is why this page is best read as a default-setting case study: it justifies method choice, escalation guidance, remedy selection, and optimization order, not machine-independent latency promises.

Runtime priorities

This is the flagship benchmark moment on the page because it answers the optimization question directly: where does the workflow actually spend time, and which stages are already cheap enough that polishing them first would be the wrong judgment?

Editorial flagship figure

Macro pipeline benchmark: authored stage-budget view for the synthetic surface-to-local-vol-to-PDE workflow, with emphasis on optimization priority rather than raw totals. Macro pipeline benchmark: authored stage-budget view for the synthetic surface-to-local-vol-to-PDE workflow, with emphasis on optimization priority rather than raw totals.

The integrated stage-budget family is the clearest optimization-priority proof object: surface fitting and PDE repricing dominate, while the handoff stages stay cheap.

Optimization judgment

  • Surface fitting is the main scaling driver in every published scenario. It rises from 534.09 ms to 1.68 s, which is why optimization effort belongs there before smaller orchestration stages.
  • PDE repricing is the second budget line, not a rounding error. It grows from 163.70 ms to 715.95 ms and stays materially larger than the handoff or local-vol construction steps.
  • Quote mesh generation, smooth handoff, and local-vol extraction remain the low-leverage stages. Combined, they stay at 17.78 ms to 22.65 ms, so the benchmark bundle argues against optimizing them first.

What the committed snapshot establishes

The table is the compact claim set for the benchmark bundle: each row pairs a workload, a reference, and the reason the published result matters. It is strongest when read as a bounded justification set, not as a promise that every workload should be pushed to the finest or fastest-looking point.

Family Conditions Reference What the committed snapshot shows
Implied-vol inversion Scalar BS inversion and Black-76 slice inversion on 21-801 strikes Scalar loop baseline The vectorized slice path reaches 578x speedup at 801 strikes while preserving zero measured vol difference versus the scalar loop. Scalar single-option inversions stay around 1.50 ms to 2.18 ms.
Vanilla PDE Black-Scholes European call, Nx=Nt from 101 to 801 Analytic Black-Scholes price Absolute error falls from about 1.74e-02 at 101x101 to about 2.28e-04 at 801x801. Runtime rises from about 91.0 ms to about 2.7 s.
Digital PDE remedies Digital call, Nx=Nt in Analytic digital price The untreated path keeps an ~1.18e-02 refinement spread, while Rannacher + cell average cuts it to 3.00e-05 and Rannacher + L2 projection cuts it to 1.43e-05.
End-to-end macro pipeline Synthetic SVI quote mesh -> fitted surface -> handoff probe -> local-vol surface -> representative local-vol PDE Stage-level timing only Surface fitting dominates the measured stage budget (534.1 ms, 850.0 ms, 1.7 s for small, medium, large), with PDE repricing second (163.7 ms, 460.0 ms, 716.0 ms).

Core benchmark families

Implied-vol inversion

This is the workload-class proof: the library is built for real smiles and surfaces rather than just isolated single-option demos.

IV scaling figure: vectorized slice inversion versus scalar-loop inversion, plus scalar single-option latency scenarios. IV scaling figure: vectorized slice inversion versus scalar-loop inversion, plus scalar single-option latency scenarios.

Treat this figure as a workload-class result rather than a single-option latency contest. Use the vectorized slice inverter whenever the workload is a smile or surface rather than an isolated option. At 801 strikes the committed snapshot records about 2.43 ms for the vectorized slice path versus about 1.41 s for the scalar-loop baseline.

PDE runtime versus error

The numerical question here is not whether a finer grid is better in principle. It is how much extra runtime buys how much visible error reduction, and when the finest published point stops being the most sensible default.

PDE runtime versus error tradeoff for Black-Scholes vanilla PDE and local-vol PDE refinement. PDE runtime versus error tradeoff for Black-Scholes vanilla PDE and local-vol PDE refinement.

The vanilla PDE curve shows the expected refinement tradeoff against an analytic benchmark. From 201x201 to 801x801, runtime rises from about 260.57 ms to 2.66 s while absolute error only falls from 9.41e-04 to 2.28e-04. That supports medium grids as the practical default starting point unless the error budget says otherwise. The local-vol curve uses a finer-grid local-vol PDE solve as its reference because there is no closed-form target for that path. In the committed snapshot, the published local-vol tradeoff runs through 401x401 against a 601x601 reference solve.

Digital-payoff remedies

This is the clearest benchmark family for engineering judgment rather than raw speed, because the remedy choice changes whether refinement is actually trustworthy.

Digital payoff remedy tradeoff: runtime versus error, plus grid-refinement price spread across remedy choices. Digital payoff remedy tradeoff: runtime versus error, plus grid-refinement price spread across remedy choices.

Why this matters

Leaving the discontinuity untreated keeps runtime modest, but it also leaves materially larger error and much wider grid-to-grid drift. The cell-average and L2-projection remedies both stabilize the refinement path without changing the runtime order of magnitude. This is the benchmark family that most clearly shows domain-aware numerical maturity.

Supporting checks

These are useful supporting checks, but they are not the main benchmark story. They help place the local-vol and tree paths in the broader method lineup without asking every plot to carry hero weight.

Local-vol extraction

The published local-vol extraction run compares strike_coordinate="K" and strike_coordinate="logK" on the same smooth synthetic SVI-driven call grids. Both stayed at 0% interior invalid share through the largest tested grid.

Coordinate Largest tested grid Median runtime Interior invalid share
K 121 x 241 3.51 ms 0.0%
logK 121 x 241 4.15 ms 0.0%

The full figure is available at localvol_scaling.png.

Tree scaling

The CRR benchmark remains useful for convergence discussion, but the published numbers make its placement clear: it is an interpretable reference path, not the preferred engine for repeated European-vanilla work at large step counts.

n_steps Median runtime Absolute error vs Black-Scholes
200 47.2 ms 9.90e-03
1000 1.7 s 1.98e-03
5000 40.4 s 3.96e-04

The full figure is available at tree_scaling.png.

Environment and reproducibility

  • Snapshot environment: Python 3.12.0, NumPy 2.4.0rc1, SciPy 1.16.3, pandas 2.2.3, 16 logical CPUs.
  • Treat the absolute timings as machine-specific. The defensible signal is the relative slope of each curve, the runtime/error tradeoff, and which workflow dominates the stage budget.
  • The machine-readable snapshot used for this page is committed under benchmarks/artifacts/, with stable summary metadata in benchmarks/artifacts/performance_summary.json.

To reproduce the published bundle from a standard project environment:

python -m pip install -c scripts/ci-constraints.txt -e ".[dev,docs,plot]"
RUN_BENCHMARKS=1 pytest benchmarks -q --benchmark-only --benchmark-json benchmarks/artifacts/pytest-benchmark.json --benchmark-verbose
python scripts/build_benchmark_artifacts.py --pytest-benchmark-json benchmarks/artifacts/pytest-benchmark.json
python scripts/render_performance_page.py

The pytest-benchmark JSON is kept as the raw timing record. The publishing script turns that raw run plus direct error and reference computations into committed CSV, JSON, and figure artifacts for docs use.

What These Benchmarks Justify

The benchmark bundle is strongest when it supports a design decision rather than advertising a runtime in isolation.

Main takeaway

The published evidence supports concrete defaults:

  • prefer the vectorized implied-vol slice path for smile and surface workloads
  • treat medium PDE grids as the practical starting point and escalate when the error budget says otherwise
  • discuss digital payoffs together with the remedy choice, not just the runtime line
  • spend optimization effort on surface fitting or PDE repricing before touching the already-cheap handoff stages

What this bundle does not prove

It does not establish machine-independent latencies, universal optimal grids, or blanket claims about every payoff family. It justifies relative workload choice, bounded numerical escalation, remedy necessity for digital payoffs, and optimization order inside the committed publication scenarios.