Comparison Performance¶
This page documents the runtime cost of comparing real, large shared libraries, the bottlenecks that were found and fixed, and the tooling that guards against regressions in CI.
TL;DR¶
- Dump scales fine. Snapshotting
libonedal_core.so(~10,550 exported functions) takes ~5 s. - Compare used to blow up. On the same library,
comparedid not finish within 60 s. The cost was entirely in the post-processing detectors, not the core symbol diff. A profiling sweep found six super-linear paths — several quadratic, one effectively cubic — all now fixed (see What was fixed). - A synthetic scaling harness (
scripts/benchmark_scaling.py) reproduces each path without a real binary, compiler, or castxml, and aslowregression test guards the realistic hot path.
What was fixed¶
Every fix preserves detector behaviour (the full unit suite, the FP-rate gate, and the metamorphic/oracle detector tests all stay green); they only change how the work is organised.
| # | Path | Was | Fix | Result |
|---|---|---|---|---|
| 1 | Public-surface scoping (surface.classify_change_surface) |
Recomputed four old∪new set unions per finding → O(findings × surface). Made every comparison quadratic. | Compute the unions once per pass (surface_unions) and reuse. |
add_remove 4000: 9.1 s → 0.32 s (linear) |
| 2 | Namespace detection (diff_namespaces) |
demangle_batch called one symbol at a time → one c++filt subprocess per symbol. |
Batch-demangle each snapshot once (_batch_demangle_public) and thread the map through. |
elf_namespace 4000: 5.2 s → 0.33 s (linear) |
| 3 | Variable / symbol diffing | Quadratic via the same per-finding surface unions (#1). | Fixed by #1. | var_churn 4000: 2.1 s → 0.06 s (linear) |
| 4 | Batch-rename heuristic (diff_symbols._find_rename_pairs) |
O(removed × added) suffix scan. | Reversed-name index + binary search (endswith → reversed prefix lookup). |
folded into add_remove win |
| 5 | Type-spelling fallback (diff_type_spellings) |
Rebuilt a set(...) inside a comprehension → O(n²). |
Hoist the set once. | folded into add_remove win |
| 6 | Affected-symbol enrichment / ancestor closure (diff_filtering) |
Transitive ancestor function lists accumulated duplicates, then re-sorted per change → effectively cubic on nested type graphs. | Use sets (dedup on union); sort once. | nested_types n=200: >60 s → 0.16 s |
| 7 | ELF-only rename matching (binary_fingerprint, diff_symbols._plausible_rename) |
O(removed × added) name-similarity scan; the name predicate re-demangled both names per pair. | Scan only the size-tolerance window via the existing size index; cache the per-name parse; cap the heuristic pass for mass-rename inputs. | rename_churn n=1000: 13.2 s → 2.1 s, larger inputs bounded |
| 8 | Affected-symbol enrichment type↔function/field mapping (diff_filtering._build_type_to_funcs, _build_type_embed_index) |
any(tname in ft ...) nested inside the type loop and the function/field loop → O(types × functions × refs); quadratic when many distinct types churn (a header refactor or versioned upgrade). The original perf sweep only sampled these scenarios at n=500, so the exponent was never computed and the table mislabelled them "linear". |
One Aho-Corasick _SubstringMatcher over the affected type names, built once and shared; each ref/field is matched in O(len) with identical substring semantics. |
typedef_churn n=4000: 6.3 s → 0.73 s; union_churn 9.1 s → 1.10 s; vtable_churn 7.8 s → 1.03 s; enum_churn ≈1.8 → 1.0; opaque_filter ≈1.7 → 1.2 (all now linear) |
| 9 | Opaque-handle pointer-only / factory check (diff_filtering._is_pointer_only_type, _has_public_pointer_factory via _filter_opaque_size_changes) |
Each opaque candidate rescanned every public function/variable with a word-boundary regex → O(candidates × functions), a regex per pair (type_churn n=4000: ~3.2 M searches). |
One indexed pass per snapshot (_opaque_usage_index): an Aho-Corasick prefilter narrows each type string to the candidates present, then the same regex oracle decides — so the verdict is unchanged (verified by a fuzz test vs the per-candidate functions). |
type_churn n=4000: 1.13 s → 0.38 s (≈1.6 → linear) |
With fixes #8 and #9 the compare pipeline has no remaining quadratic path at
the tracked sizes — every scenario is linear (tail exponent ≈1.0–1.3) except the
inherently deep nested_types chain. The opaque-handle pointer-only check
(_is_pointer_only_type) used to be O(candidates × functions) with a
word-boundary regex per pair (type_churn n=4000: ~3.2 M regex searches, ≈1.6);
fix #9 replaced the per-candidate rescan with one indexed pass.
How to reproduce¶
No real binary, compiler, or castxml required — the harness synthesises
AbiSnapshot pairs that exercise each path:
# Sweep all scenarios and print a table with a scaling exponent per scenario.
python scripts/benchmark_scaling.py
# Focus one path and emit machine-readable JSON.
python scripts/benchmark_scaling.py --scenario type_churn \
--sizes 1000 2000 4000 --json-out reports/perf/scaling.json
Scenarios (add_remove is the linear control; the rest target a specific
path). The first group exercises compare() (the original focus); the second
group, added later, extends coverage beyond compare() to the suppression and
reporting stages — see Coverage beyond compare():
| Scenario | Measures | Stresses |
|---|---|---|
add_remove |
compare() |
Core symbol diff + surface scoping (control) |
type_churn |
compare() |
Affected-symbol enrichment, opaque filtering (structs) |
enum_churn |
compare() |
Enum diffing (diff_types._diff_enums) |
typedef_churn |
compare() |
Typedef base-change diffing (_diff_typedefs) |
union_churn |
compare() |
Union member diffing |
wide_struct |
compare() |
Per-field diffing within large records |
vtable_churn |
compare() |
Vtable / virtual-layout diffing |
elf_namespace |
compare() |
Namespace detection + demangling (stripped lib) |
pe_churn |
compare() |
PE/COFF export diffing (diff_platform PE arm) |
macho_churn |
compare() |
Mach-O export diffing (diff_platform Mach-O arm) |
var_churn |
compare() |
Public-surface classification |
rename_churn |
compare() |
ELF-only fingerprint rename matching — the reject path (disjoint names, no match emitted) |
fuzzy_rename_churn |
compare() |
ELF-only fingerprint rename matching — the accept path (every symbol genuinely renamed → one func_likely_renamed per pair). The ICU/LLVM cost driver (P11: rename detection, not symbol count) |
version_node_churn |
compare() |
Version-node migration fan-out — every export moves LIB_1.0 → LIB_2.0 → n symbol_moved_version_node findings (the LLVM 17→18 36,991-finding shape) |
versioned_rename_churn |
compare() (collapse on) |
Versioned-symbol-scheme detection and collapse over 2×n churn (ICU/OpenSSL u_*_NN) |
nested_types |
compare() |
Transitive type-ancestor closure |
opaque_filter |
compare() |
Opaque-handle size filter (the known O(candidates × functions) residual) |
suppression_audit |
SuppressionList.audit() |
Rule-vs-finding matching (O(rules × findings)) |
severity |
categorize_changes() |
Severity categorization of findings |
serialize |
snapshot_to_json → from_dict |
Snapshot serialize/load round-trip (dump-pipeline proxy) |
report_html |
generate_html_report() |
HTML document assembly |
report_sarif |
to_sarif_str() |
SARIF JSON assembly |
report_junit |
to_junit_xml() |
JUnit XML assembly |
Peak memory¶
Every measurement also records the peak tracked heap (peak_mb, via
tracemalloc) of the timed call. The inputs are built outside the traced
window, so the figure attributes only the call's own allocations. The memory
pass also runs cold: process-wide caches warmed by the timing loop (e.g. the
functools.lru_cache demanglers) are cleared first, so input-scaled cache
growth is counted rather than hidden behind a warm cache. A flat per-item time
alongside a rising peak_mb flags an intermediate O(n²) space blow-up that a
wall-clock-only gate would miss. Disable with --no-memory (timing only); gate
with --max-memory-mb <budget>.
Alongside it, each point records the process peak RSS (rss_mb, via
resource.getrusage). Unlike peak_mb, which only sees Python-heap
allocations, RSS also counts native memory — pyelftools parse buffers and
c++filt subprocess pages — which is what dominates real libraries (the field
eval observed ~330 MiB RSS at LLVM scale, invisible to tracemalloc).
ru_maxrss is a process high-water mark, so it is monotonic across sizes and
the largest/last value is the true peak (it overstates a single call's own
footprint, since inputs are built in-process); gate the peak with
--max-rss-mb <budget>. RSS is unavailable on Windows (resource is
Unix-only), where the column is simply absent.
Coverage beyond compare()¶
The original sweep (PR #331) only covered compare() post-processing. A
follow-up gap analysis extended it to the two other stages that build the
largest data structures from the finding set:
- Suppression audit (
suppression.py,SuppressionList.audit) tests every rule against every change — O(rules × findings). Thesuppression_auditscenario holds the rule count fixed (a project's ruleset is roughly fixed while its library grows) and scales findings, so it stays linear in findings; a regression that makes per-finding matching itself super-linear (e.g. recompiling a pattern per change) shows up as a rising exponent. - Reporting —
to_markdown/to_jsonwere already guarded byslowtests;report_htmlandreport_sarifextend that to the HTML and SARIF renderers, which assemble the largest output documents. Both are linear.
Measured scaling (after fixes)¶
Most scenarios are linear at the sizes a real library reaches (per-change cost
roughly flat); type_churn and enum_churn are mildly super-linear (~1.7) but
bounded and tracked:
Figures are indicative local timings (absolute seconds vary with runner speed —
the tail exponent is the portable signal). The first group times
compare(); the second group, added in PR #336, times the suppression and
reporting stages (see Coverage beyond compare()).
| Scenario | time @ size | tail exponent |
|---|---|---|
add_remove |
0.32 s @ n=4000 | ~0.9 (linear) |
var_churn |
0.06 s @ n=4000 | ~1.0 (linear) |
elf_namespace |
0.33 s @ n=4000 | ~1.1 (linear) |
pe_churn / macho_churn |
<0.05 s @ n=500 | ~1.0 (linear) |
wide_struct |
0.1–0.2 s @ n=500 | ~1.0 (linear) |
typedef_churn / union_churn / vtable_churn |
0.7–1.1 s @ n=4000 | ~1.0 (linear, after fix #8) |
enum_churn |
1.0 s @ n=4000 | ~1.0 (linear, after fix #8 — was ≈1.8) |
type_churn |
0.38 s @ n=4000 | ~1.0 (linear, after fix #9 — was ≈1.6) |
opaque_filter |
0.45 s @ n=1000 | ~1.2 (linear at tracked sizes after fix #8) |
rename_churn |
2.1 s @ n=1000, capped above | bounded |
fuzzy_rename_churn |
0.39 s @ n=4000 | ~1.0 (linear) |
version_node_churn |
0.86 s @ n=10000 (10 k moves) | ~1.0 (linear) |
versioned_rename_churn |
0.87 s @ n=8000 (16 k changes) | ~1.1–1.2 (mild) |
nested_types |
0.70 s @ n=400 | inherent for deep chains |
suppression_audit |
0.09 s @ n=2000 (fixed 40-rule set) | ~1.0 (linear in findings) |
severity |
<0.01 s @ n=1000 | ~1.0 (linear) |
serialize |
0.12 s @ n=1000 | ~1.0 (linear) |
report_html / report_sarif / report_junit |
≤0.04 s @ n≤2000 | ~1.0 (linear) |
CI integration¶
.github/workflows/performance.yml
runs the scaling benchmark and the slow performance tests. Now that every
compare() scenario is linear, the lane is gating:
- Triggers: weekly schedule, manual
workflow_dispatch(with size / budget inputs), and automatically on any PR that changes the detector core (abicheck/diff_*.py,checker.py,post_processing.py,demangle.py,binary_fingerprint.py,surface.py, the benchmark script, or the perf test). Adding theperformancelabel re-triggers the lane; for a PR that does not touch the detector core, run it on demand withworkflow_dispatch. - Armed budgets: the scaling step runs with
--max-exponent 1.4(the tail, largest-two-size slope) and--max-rss-mb 2048; theregressionjob blocks on a >50 % PR-vs-base slowdown (--regress-tolerance 0.5).continue-on-erroris dropped on both, so a catastrophic regression fails the lane. The thresholds are CLI flags so the budget lives in the workflow, not the script — loosen a threshold rather than re-addingcontinue-on-errorif normal drift ever flakes a lane. - The
--max-exponentgate is per-scenario opt-out:nested_typesis an inherently super-linear embedding chain, so it carriesgate_exponent=Falseand is exempted (its tail slope is still printed for visibility, just not gated). Every other scenario is gated. - Publishes the scaling table to the job summary and uploads the JSON.
slow regression guards also live in
tests/test_performance.py
— TestTypeChurnScaling (compare back to genuine O(n²)),
TestSuppressionAuditScaling (audit stays linear in findings), and the
HTML/SARIF cases in TestReporterScaling. They run in the existing slow lane
with generous thresholds, so a catastrophic regression fails fast without
flaking on normal drift.
Coverage gap analysis & remaining gaps¶
A second pass (continuation of PR #331) audited the whole pipeline for scaling risk and extended the harness to the highest-value uncovered paths plus per-call peak-memory tracking and PR-vs-base drift detection. Current status:
| Path | Status | Notes |
|---|---|---|
compare() post-processing |
✅ covered | Original PR #331 scenarios. |
| Suppression audit | ✅ covered | suppression_audit scenario + slow test. O(rules × findings); linear in findings for a fixed ruleset. |
| HTML / SARIF / JUnit reporting | ✅ covered | report_html / report_sarif / report_junit scenarios + slow tests; all linear. (to_markdown/to_json already guarded.) |
| Enum / typedef / union / wide-struct / vtable diffing | ✅ covered | enum_churn, typedef_churn, union_churn, wide_struct, vtable_churn. Sweeping typedef/union/vtable/enum across sizes (the original table only sampled n=500, so no exponent was ever computed) exposed a genuine ≈O(n²) in the affected-symbol enrichment — see fix #8; all four are linear after it, and opaque_filter dropped from ≈1.7 to ≈1.2 as a side effect (its cost was the enrichment, not _filter_opaque_size_changes). |
| PE/COFF & Mach-O diff arms | ✅ covered | pe_churn / macho_churn build pe=/macho= snapshots so diff_platform's PE/Mach-O detectors run. |
| Opaque-handle pointer-only check | ✅ covered | Was the O(candidates × functions) residual (_is_pointer_only_type, regex per pair, surfaced by type_churn ≈1.6); fix #9 linearized it via _opaque_usage_index (one indexed pass). Both type_churn and opaque_filter are now linear. |
| Versioned-symbol-scheme collapse (ICU/OpenSSL) | ✅ covered | versioned_rename_churn reproduces the field-eval P08 ICU 75→78 shape (16 k removed/added churn findings + the scheme-collapse pass). Profiling it surfaced a per-finding name re-tokenization in the namespace detectors (diff_namespaces._segments), now fast-pathed for plain names. ~1.1–1.2 tail exponent; the residual is the post-processing detector fan-out, not the scheme recogniser. |
| Severity categorization | ✅ covered | severity scenario over categorize_changes; linear. |
| Fuzzy rename matching (accept path) | ✅ covered | fuzzy_rename_churn — every symbol genuinely renamed → one func_likely_renamed per pair, the cost driver P11-refined identified (ICU 2134 renames = 94.5 s; rename detection, not symbol count, dominates). The pre-existing rename_churn only exercised the reject path (disjoint names, zero matches). Linear at ICU scale (≤8 k). |
| Version-node migration fan-out (LLVM bump) | ✅ covered | version_node_churn — every export moves LIB_1.0 → LIB_2.0, reproducing the LLVM 17→18 36,991-symbol_moved_version_node shape and the post-processing fan-out over it. Linear to 50 k. |
| Peak memory (all scenarios) | ✅ covered | tracemalloc peak_mb column + --max-memory-mb gate (cold-cache pass), plus process rss_mb (resource.getrusage) + --max-rss-mb gate — RSS catches native (pyelftools / c++filt) allocations tracemalloc cannot see (the ~330 MiB LLVM-scale figure). |
| Historical / PR-vs-base regression | ✅ covered (now gating) | --baseline/--regress-tolerance + the regression workflow job measure the base branch and PR head on the same runner and flag scenarios that got slower by more than the tolerance — catching gradual drift the per-run exponent misses. continue-on-error is dropped, so it now blocks. See Baseline regression. |
| Dump / snapshot creation (DWARF/PE/PDB) | ⚠️ partial | The synthetic harness can't run the real parsers. The ELF symbol-table parse and the DWARF debug-info parse (-g build) are now guarded by tests/test_perf_dump_scaling.py (integration, gcc-only) — DWARF being the dominant real-library dump cost (ICU 18.6 MB snapshot, openblas 23 MB / 9.5 s). The serialize scenario proxies the rest of the pipeline. PE/COFF + PDB parsing remains unbenchmarked — those need a committed binary or a synthetic byte-stream generator (no Linux-only toolchain produces them). |
| Appcompat HTML / stack analysis / appcompat filtering | ⚠️ not benchmarked | stack_checker runs one compare() per dependency (inherent). Appcompat filtering uses set-membership lookups (appcompat.py — O(1) per change, likely already fine) and appcompat_html.py is linear by inspection; neither is timed. |
| Bundle / multi-library & environment-matrix compare | ⚠️ not benchmarked | O(libraries) compares; per-library cost is covered, cross-library orchestration is not. |
Recommended next steps (in priority order)¶
- ~~Wire a budget gate~~ — done: the lane now runs
--max-exponent 1.4(nested_typesexempt viagate_exponent=False) and--max-rss-mb 2048, andcontinue-on-erroris dropped on both the scaling andregressionjobs. The--regress-tolerance 0.5PR-vs-base check also blocks now; loosen a threshold rather than re-addingcontinue-on-errorif runner variance flakes a lane. - Extend the dump/parse guard to PE/PDB — the ELF symbol-table and DWARF
parses are now covered (
tests/test_perf_dump_scaling.py,integration, gcc +-g); the PE/COFF and PDB parsers still need a committed binary or a synthetic byte-stream generator behind theintegrationmarker (no Linux-only toolchain emits them). - Benchmark the cross-library orchestration — bundle / environment-matrix compares are O(libraries) over an already-covered per-library cost, but the orchestration layer (and appcompat/stack fan-out) is still untimed.
- ~~Optimise the super-linear residuals~~ — done: fix #8 linearized the
enrichment (typedef/union/vtable/enum/opaque), fix #9 the opaque pointer-only
check (
type_churn). No quadraticcompare()path remains at tracked sizes.
Baseline regression¶
The per-run scaling exponent catches catastrophic blow-ups but not a gradual 20–30 % slowdown. To catch drift, the harness can compare against a baseline:
# On the base branch / a prior commit, capture a baseline:
python scripts/benchmark_scaling.py --json-out base.json
# On the PR head, measure and compare (fails if any shared scenario is >50% slower):
python scripts/benchmark_scaling.py --baseline base.json --regress-tolerance 0.5
Only scenarios present on both sides are compared (a scenario new in the PR
has no baseline and is skipped), and baseline times below a 50 ms noise floor are
ignored. The regression
workflow job automates this on PRs: it installs the base branch and the PR head
into separate venvs on the same runner, runs both, and prints the regressions to
the job summary. It now gates (a >50 % slowdown fails the job) — loosen
--regress-tolerance rather than re-adding continue-on-error if runner variance
proves noisy.
Scan level cost model: one cliff at L4¶
A real scan-level sweep on two UXL libraries (oneTBB v2021.12→.13, C++;
UMF v0.10→v0.11, C; raw data in validation/data/uxl_scan_results_2026-06.json)
shows the cost has one cliff, at the L4 AST-replay boundary, and the cheap
tier below it is dominated by the binary dump + always-on pattern scan, not by
the source layer:
| Level | Reaches | oneTBB (C++, 40 TUs) | UMF (C, 50 TUs) |
|---|---|---|---|
s0 diff classifier |
— (L0/L1 + pattern) | ~29 s | ~17 s |
s1 compile-DB |
+L3 | ~29 s | ~17 s |
s3 lexical |
(pattern only) | ~29 s | ~17 s |
s4 symbol/graph index |
+L3 +L5 | ~29 s | ~17 s |
s5 targeted AST |
+L4 (changed TUs) +L5 | ~222 s | ~22 s |
s6 full AST |
+L4 (all TUs) | ~215 s | ~21 s |
Rules of thumb:
- The cliff height is a C++ phenomenon. L4 cost = clang per-TU AST replay; it
scales with C++ template/STL instantiation depth, not
.soor TU count. Heavy C++ (oneTBB) jumps ~7× (29→222 s); plain C (UMF) barely moves (~1.3×, 17→21 s). Budget L4 by how templated the source is. - The cheap tier (s0–s4) is one price. All four cost the same — the floor is
the DWARF dump + lexical scan of the tree. Pick by coverage you need, not
cost:
s0≈s3(L0/L1 + pattern only),s1adds L3,s4adds the L5 reachability graph without paying for L4.s4is the structure sweet spot. s5is only cheaper thans6with a diff seed. Without--since/--changed-paththe changed-TU set is empty ands5replays every TU — same cost ass6. With a one-file seed, oneTBBs5dropped from 222 s to 11.5 s (~19×) for the identical verdict. This scoping applies only to thesource-changedcollect mode — i.e.s5and--mode pr. The other AST modes replay full scope regardless of any seed:--mode pr-deepresolves tograph-full, and--mode baseline/s6to full (source_replay.CI_MODE_TO_SCOPE:source-changed→changed,graph-full→full), so pinning those in CI will not produce the scoped speedup.auditcosts the same as the baseline modes — the wall-clock is L4/L5 collection of the new side, not the baseline diff.
The verdict was identical across all levels on both libraries: the authoritative L0/L1 binary diff sets the gate; L3–L5 add coverage/localization, not a different pass/fail. For a CI gate, the cheap tier suffices; spend on L4 only when you want source-body semantics or PR localization for humans.
L4 source-replay (dump-side) performance¶
The scaling harness above is pure-Python and times the compare pipeline. The
dump-side L4 source ABI replay (clang per-TU AST extraction) is a separate
cost, timed by eval/scaling.py
on real source trees (it needs clang + a built tree, so it is manual, not in CI).
Knobs and the reasoning behind them (abicheck/buildsource/source_replay.py):
ABICHECK_L4_JOBS— worker count for the per-TU extract pool. Auto =min(TUs, cpu_count, 8). An explicit override is now clamped tomax(8, 2×cpu_count)(logged when it fires) so a stray=64can't oversubscribe a host into thrash (eval/SCALING.mdalready saw jobs=8 on 4 CPUs regress). Set=1to force serial (determinism).ABICHECK_L4_EXECUTOR(threaddefault /process) — after clang returns, the extractor parses clang's large JSON AST dump and builds structural fingerprints: pure-Python, GIL-bound work. A thread pool parallelizes only the clang subprocess wait, so that post-processing serializes on the GIL — part of the ~60–83 % "serial fraction" ineval/SCALING.md.processruns the extract phase in aProcessPoolExecutor, parallelizing the AST work too (at the cost of pickling eachSourceAbiTuand per-process spawn). It is opt-in pending a measured win — compare the curves withpython eval/scaling.py --jobs 1,2,4 --executor processvsthread. The driver falls back to serial if a process pool can't start (sandbox, spawn import error), so it never aborts L4.ABICHECK_L4_CACHE_DIR— persists the per-TU cache (SourceAbiCache, content-addressed + per-included-file dependency-hash invalidation) acrossdump --sourcesruns. Previously the inline path passed no cache, so every dump re-extracted every TU; wiring this dir makes a cold run (evalE4: zstd 48.6 s) reuse the warm cache (3.4 s). Point it at a CI cache directory restored viaactions/cacheto start every CI run warm. The cache validation phase is serial, so the dependency digest is memoized per replay pass — a public header included by N TUs is hashed once, not N times.
Why not precompiled headers (PCH) / modules?¶
A natural idea to cut the repeated per-TU header parse is a PCH over the public
headers. It does not apply here: clang -Xclang -ast-dump=json does not
re-emit declarations that came from a PCH, so loading one would silently drop the
very header surface L4 exists to capture — a correctness bug, not a speedup. The
right levers for repeated-parse cost are therefore the per-TU cache and the
replay scope (changed/target), both already in place.