
Part 8 — Detecting Breaks: Evidence, Tools, and Why One Method Is Never Enough

Series navigation: 0. Product Contract · 1. Foundations · 2. Symbol Contracts · 3. Type Layout · 4. C++ ABI · 5. Linker & ELF · 6. Transitive Breaks · 7. Designing for Stability · 8. Detecting Breaks

Parts 0–7 explained the mechanisms: what the compiler bakes into a binary, and which changes corrupt that contract. This part turns the telescope around and asks the engineering question: how do you actually catch each of those breaks before you ship?

Three things matter, and this page covers all three:

  1. The general approaches to ABI/API tracking — and the failure mode each one has when used alone.
  2. What evidence each break family requires — matching every family from the break-families table to the minimum input that makes it visible, with the example cases that prove it.
  3. Why classic single-method checkers (libabigail's abidiff, ABICC) are not sufficient — and, just as honestly, where any static tool stops, including abicheck.

Tool-track companion pages: this page teaches the concepts; the precise per-source capability matrix lives in Evidence & Detectability, measured accuracy numbers in Tool Comparison & Benchmarks, and the boundary of static checking in Limitations.


1. The general approaches to ABI/API tracking

Every team tracks compatibility somehow, even if only by hope. The approaches below are ordered roughly by how much they observe; each catches something the previous ones cannot, and each has a blind spot that motivates the next.

# | Approach | What it observes | Catches | Blind spot
1 | Process discipline: SemVer policy, review checklists, "don't touch public headers" rules | Human judgement | Anything a reviewer happens to notice | Everything a reviewer doesn't notice: layout shifts from an "internal" change, transitive leaks, toolchain flips. Unverifiable by construction.
2 | Runtime swap testing: build an app against v1, run it against v2 | One consumer's actual usage | Real crashes in the paths the app exercises | Surface the test app doesn't call (usually most of it); silent corruption that doesn't crash; needs a representative app per consumer.
3 | Symbol-table diffing: nm/readelf diff, or any tool run on stripped binaries (L0) | Exported symbol names, versions, SONAME | Removed/renamed symbols, C++ mangled-signature changes, linker metadata drift | Everything that doesn't change a symbol name: struct layout, enum values, vtable order, C parameter types.
4 | Debug-info diffing: DWARF/PDB-based tools (L1) | Type layout as compiled: sizes, offsets, enum values, vtables | The whole layout family from Part 3 and most of Part 4 | Requires -g artifacts (release builds are usually stripped); largely blind to source-level API facts (access control, default arguments, explicit, hidden friends), which DWARF doesn't record or tools don't model.
5 | Header/AST diffing: compiling public headers and comparing the AST (L2) | The declared source contract | Source-only API breaks, plus scoping: knowing which types are actually public | Blind to binary truth: what was actually exported and with which SONAME/versions, and what flags the shipped binary was really built with.
6 | Build- and source-aware overlay (L3/L4) | Compile flags, default-argument values, inline/template bodies, uninstantiated templates | Facts that never reach any shipped artifact: the source-only tail | Highest setup cost; meaningless without the artifact layers underneath it to anchor the shipped-ABI verdict.

The pattern: each approach is a projection of the library onto one kind of evidence. None of the projections is the library. A checker is only complete to the extent that it overlays several projections and lets the strongest evidence win — which is exactly the five-layer evidence model abicheck implements, and why runtime testing (approach 2) still belongs in your release pipeline next to static checking: it is the only approach that observes behaviour.
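A minimal sketch of the overlay idea, assuming "strongest" simply means the highest-numbered layer that observed a given subject. That is a simplification: a real checker also has to respect binary facts that only L0 carries. The Finding type, the subjects, and the verdicts assigned to them are hypothetical illustrations, not abicheck's data model.

```python
from dataclasses import dataclass

# Evidence layers from weakest to strongest, using this page's L0..L4 names.
LAYER_RANK = {"L0": 0, "L1": 1, "L2": 2, "L3": 3, "L4": 4}


@dataclass(frozen=True)
class Finding:
    subject: str  # symbol or type the finding is about (hypothetical names)
    layer: str    # evidence layer that produced the finding
    verdict: str


def overlay(findings):
    """Keep, per subject, the finding from the strongest layer that saw it."""
    best = {}
    for finding in findings:
        current = best.get(finding.subject)
        if current is None or LAYER_RANK[finding.layer] > LAYER_RANK[current.layer]:
            best[finding.subject] = finding
    return best


findings = [
    # No symbol changed, so the symbol table alone reports nothing.
    Finding("Config", "L0", "NO_CHANGE"),
    # Debug info shows a field offset moved.
    Finding("Config", "L1", "BREAKING"),
    # Layout is intact, but the header AST shows a default argument was removed.
    Finding("set_mode", "L1", "COMPATIBLE"),
    Finding("set_mode", "L2", "API_BREAK"),
    # A removed export is visible to every layer; only L0 reported it here.
    Finding("lib_open", "L0", "BREAKING"),
]

merged = overlay(findings)
for subject, finding in sorted(merged.items()):
    print(subject, finding.layer, finding.verdict)
```

Each projection alone would have produced a different answer for Config and set_mode; the merged view is only as complete as the set of layers fed into it.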

1a. The hidden prerequisite of header/AST diffing: the compile context

Approach 5 (L2 header/AST) has a subtlety the table glosses over: a header is not a self-contained fact, it is source code. To turn it into an AST the frontend must parse it the way your compiler does: with the same include roots for resolving its #include directives, the C++ standard it assumes (-std), and the -D feature macros that gate which declarations even exist. Get that context wrong and L2 does not fail loudly; it produces a different, plausible AST. Two consequences matter for compatibility:

  • L2 is what decides "public." The public/internal boundary, and therefore whether a removed symbol is a compatible internal cleanup or a breaking API removal, comes from the header AST. If L2 cannot be built, the scan only has the binary, so it must treat the export table as the surface, and by that narrower rule it correctly flags internal removals as BREAKING. This "scope divergence" is a missing-context artifact, not a real break: with L2 those removals are demoted to COMPATIBLE. A field run on oneTBB / oneDNN / oneDAL hit exactly this: dnnl::impl::* and bundled DGETRF/SGETRF removals were reported as breaking purely because the headers could not be parsed.
  • The wrong context manufactures phantom diffs. Parse a library built at -std=c++20 as -std=c++17 and concepts and char8_t vanish while inline-namespace versions can shift (drop below C++17 and noexcept also stops being part of the function type); L2 then shows add/remove churn that no consumer would ever observe. Likewise a mismatched -D (a feature macro, or libstdc++'s _GLIBCXX_USE_CXX11_ABI dual-ABI switch) changes which declarations are visible at all.
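The scope-divergence effect from the first bullet can be sketched in a few lines. This is an illustration under stated assumptions, not abicheck's implementation: classify_removals, the symbol names, and the rule "every export is public when headers are unavailable" are hypothetical stand-ins for the behaviour described above.

```python
def classify_removals(removed_symbols, public_names=None):
    """Assign a verdict to each removed symbol.

    public_names is the set of names declared in the public headers (L2).
    None models the case where the headers could not be parsed, so the
    export table is the only available definition of "the surface".
    """
    verdicts = {}
    for symbol in removed_symbols:
        if public_names is None or symbol in public_names:
            verdicts[symbol] = "BREAKING"
        else:
            # Exported, but never declared publicly: an internal cleanup.
            verdicts[symbol] = "COMPATIBLE"
    return verdicts


# Hypothetical symbols removed between two releases.
removed = ["mylib::open", "mylib::impl::scratch_buffer"]

without_headers = classify_removals(removed)
with_headers = classify_removals(removed, public_names={"mylib::open", "mylib::close"})

print(without_headers)
print(with_headers)
```

The same binary diff yields two different reports; only the second one reflects what a consumer of the public headers could actually observe.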

This is why the source of the compile context matters as much as the frontend choice, and why the two frontends are only interchangeable when fed the same context:

Scan source | What it supplies to L2 | What it cannot supply alone
castxml (--ast-frontend castxml) | runs your real g++/MSVC, so system includes + predefined macros + the compiler's default dialect come for free | your project's own -I roots, -D, and the exact -std (still pass these)
clang (--ast-frontend clang) | the alternative for clang-only hosts; now auto-probes the host GNU compiler for system includes so libstdc++ resolves like castxml | same as above: auto-detection is system-headers only
-I / --gcc-options (CLI) | per-run include roots, -std, -D | reproducibility: a human/CI must retype them each run
.abicheck.yml compile: block | the project's stable, reviewed include roots / std / defines | per-invocation cross-compile specifics (those stay CLI)
compile database (compile_commands.json) | the authoritative per-TU -I/-std/-D the library was actually built with | (threading it into L2 is a planned step; today it feeds L3–L5)

The practical takeaway for abicheck scan: auto-detection makes the common case (finding the C++ standard library) work with no flags, but the project-specific context (include roots, dialect, feature macros) must come from a compile DB, the config compile: block, or explicit flags. L2, and the public/internal scoping that depends on it, is only as good as the context it is handed.
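To see what project-specific context a build actually used, the compile database can be inspected directly. The sketch below reads the standard JSON compilation database layout (entries with either a "command" string or an "arguments" list) and collects the -I, -D, and -std= values. The sample entries and the compile_context helper are hypothetical, and how the result is passed on to a checker is deliberately left out.

```python
import json
import shlex


def _add(values, value):
    """Append value once, preserving first-seen order."""
    if value and value not in values:
        values.append(value)


def compile_context(compdb_text):
    """Collect include roots, defines and -std values from a compilation database."""
    includes, defines, stds = [], [], []
    for entry in json.loads(compdb_text):
        # An entry carries either an "arguments" list or one "command" string.
        args = entry.get("arguments") or shlex.split(entry.get("command", ""))
        i = 0
        while i < len(args):
            arg = args[i]
            if arg in ("-I", "-D") and i + 1 < len(args):
                # Separated form: "-I", "path"
                _add(includes if arg == "-I" else defines, args[i + 1])
                i += 2
                continue
            if arg.startswith("-I"):
                _add(includes, arg[2:])
            elif arg.startswith("-D"):
                _add(defines, arg[2:])
            elif arg.startswith("-std="):
                _add(stds, arg[len("-std="):])
            i += 1
    return {"include_dirs": includes, "defines": defines, "std": stds}


# Hypothetical two-entry database, one entry per translation unit.
SAMPLE = json.dumps([
    {"directory": "/build", "file": "src/a.cpp",
     "command": "g++ -Iinclude -DMYLIB_FEATURE_X=1 -std=c++20 -c src/a.cpp"},
    {"directory": "/build", "file": "src/b.cpp",
     "arguments": ["g++", "-I", "include", "-I", "third_party",
                   "-std=c++20", "-c", "src/b.cpp"]},
])

context = compile_context(SAMPLE)
print(context)
```

More than one value in the "std" list would mean the translation units were built with mixed dialects, which is worth resolving before trusting any header parse.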


2. What it takes to find each break family

The table below extends the break-families table with the detection dimension: the minimum evidence that makes the family visible (L0 binary · L1 +debug info · L2 +headers · L3 +build data · L4 +sources), and whether a symbol-level or debug-info-level checker can see it at all. Per-case minimums are machine-readable in examples/ground_truth.json (min_evidence field) and measured in Benchmarking by evidence tier.

Break family Min evidence Symbol-only (L0) sees it? DWARF tools (L1) see it? Why — and representative cases
Symbol/function/variable removal L0 The symbol vanishes from .dynsym — every tool's home turf (case01, case12)
C++ signature/qualifier changes L1 ⚠️ partial Itanium mangling encodes parameters, const, static — so even a stripped binary shows a symbol vanished and a new one appeared. But classifying it as a qualifier change on the same method (rather than an unrelated removal + addition) takes debug info or headers (case21, case22 are measured at L1)
C signature changes L1/L2 C symbols are just the function name: foo(int) → foo(long) keeps the identical symbol. Needs DWARF or headers (case02, case10)
Struct/class layout, packing, alignment L1/L2 No symbol changes when a field moves; layout lives in debug info and headers (case07, case40, case56)
Enum value reassignment L1/L2 Constants are compiled into callers; the library's symbols are untouched (case08, case20)
Vtable reordering L1/L2 Every symbol still exists — only the slot indexes moved (case09)
Source-only API breaks: access narrowed, explicit added, default argument removed, hidden friends L2 mostly ❌ DWARF doesn't reliably model these; they live in the declared AST (case34, case106, case123, case96)
ELF/linker metadata: SONAME, visibility, symbol versions, RPATH L0 Binary-only facts — which means header-only checkers (ABICC's XML mode) are the blind ones here (case05, case65)
Toolchain/build-flag drift: -std floor, ABI version, flag changes L1/L3 partly Compilers record their flags in DW_AT_producer, so a -g build exposes some drift; the rest needs the compile DB (case103). The libstdc++ dual-ABI flip is the notable exception: it renames mangled symbols (std::__cxx11::), so even a stripped binary betrays it at L0 (case104)
Header const/constexpr constant values L2 The value lives in the declared AST, not the binary — header comparison sees it (case124). Plain #define macros are not part of the AST — see the next row
Plain #define macro values, inline/template bodies, uninstantiated templates L4 These never reach the shipped binary or the header AST — only source/preprocessor evidence sees them. case122 is deliberately a no-change case: it marks the boundary of what even source analysis can prove about templates that were never instantiated
Multi-library release skew (bundle SONAME/dependency drift) release model Not a property of any single binary diff — needs a bundle-level comparison (multi-binary guide, bundle cases 84/90–93 in examples/)
Internal-only changes (should be NO_CHANGE) L2 FP ⚠️ FP ⚠️ The inverse problem: without header scoping, tools flag private detail:: churn as breaking. Evidence here removes false positives (case118–120)
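Two rows of the table can be made concrete with a small sketch. The first half encodes the Itanium mangling rule for the simplest case (a free, non-template function in the global namespace taking builtin types), which is why a C++ parameter change renames the symbol while the same change in C does not. The second half uses ctypes to lay out two hypothetical versions of a struct and shows an offset moving without any symbol being involved. The function and struct names are illustrative only.

```python
import ctypes

# Itanium ABI codes for a few builtin parameter types.
_CODES = {"void": "v", "int": "i", "long": "l", "double": "d", "char": "c"}


def cxx_symbol(name, params):
    """Mangled name of a free, non-template, global-namespace C++ function."""
    encoded = "".join(_CODES[p] for p in params) or "v"
    return f"_Z{len(name)}{name}{encoded}"


def c_symbol(name, params):
    """A C function exports its bare name; parameter types are not encoded."""
    return name


# C++: the parameter change is visible in the symbol table.
print(cxx_symbol("foo", ["int"]), "->", cxx_symbol("foo", ["long"]))
# C: the same change leaves the symbol table untouched.
print(c_symbol("foo", ["int"]), "->", c_symbol("foo", ["long"]))


class ConfigV1(ctypes.Structure):
    _fields_ = [("id", ctypes.c_int32), ("flags", ctypes.c_int32)]


class ConfigV2(ctypes.Structure):
    # A field inserted in the middle shifts everything after it.
    _fields_ = [("id", ctypes.c_int32), ("level", ctypes.c_int32),
                ("flags", ctypes.c_int32)]


print("flags offset:", ConfigV1.flags.offset, "->", ConfigV2.flags.offset)
print("sizeof:", ctypes.sizeof(ConfigV1), "->", ctypes.sizeof(ConfigV2))
```

A caller compiled against ConfigV1 keeps reading flags at the old offset, and nothing in the export table records that the layout changed.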

Two lessons hide in this table:

  • Evidence runs in both directions. More input doesn't just find more breaks — it dismisses false alarms. Header scoping is what lets a checker say "that struct changed, but it was never part of the public surface."
  • The staircase is real and measurable. Over the example catalog, a stripped binary alone reaches the correct verdict for about a third of cases; adding debug info takes it to ~81%; headers to ~99%; build/source data closes the rest (current numbers in the evidence-tier table).
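The staircase is a cumulative count over per-case minimum evidence, which can be sketched as follows. The record shape (a min_evidence key holding a single tier name) is an assumption about ground_truth.json, and the ten records are toy data chosen for readable arithmetic, not the catalog's measured numbers.

```python
from collections import Counter

TIERS = ["L0", "L1", "L2", "L3", "L4"]


def staircase(cases):
    """Cumulative share of cases whose minimum evidence is met at each tier."""
    counts = Counter(case["min_evidence"] for case in cases)
    total = len(cases)
    seen = 0
    shares = {}
    for tier in TIERS:
        seen += counts.get(tier, 0)
        shares[tier] = seen / total
    return shares


# Toy records: 3 cases need only the binary, 4 need debug info,
# 2 need headers, 1 needs sources.
toy_cases = (
    [{"id": f"toy-l0-{n}", "min_evidence": "L0"} for n in range(3)]
    + [{"id": f"toy-l1-{n}", "min_evidence": "L1"} for n in range(4)]
    + [{"id": f"toy-l2-{n}", "min_evidence": "L2"} for n in range(2)]
    + [{"id": "toy-l4-0", "min_evidence": "L4"}]
)

shares = staircase(toy_cases)
for tier in TIERS:
    print(tier, f"{shares[tier]:.0%}")
```

Read the other way around, the first tier at which the share reaches your target tells you the least evidence your pipeline has to supply.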

3. Why an abidiff- or ABICC-class checker is not sufficient

This is a structural argument, not tool-bashing — both tools are good at what their evidence lets them see (details and per-case results in the Tool Comparison):

  1. Each is capped at one rung of the staircase. abidiff is DWARF-first (L0+L1): hand it the stripped release binary you actually ship and it degrades toward symbol-only; the source-only API family — access changes, default arguments, explicit, noexcept semantics — stays invisible even with debug info, because a header directory acts as a symbol filter there, not a full AST. ABICC leans the other way: its header/XML workflow sees the declared contract but not the binary truth (exports, SONAME, symbol versions), and its abi-dumper workflow inherits the DWARF ceiling. Neither overlays all the layers, so each one misses families the other catches — and both miss the L3/L4 tail (flag drift, inline bodies, uninstantiated templates).

  2. No public-surface scoping. Without resolving what is public, every internal detail:: struct edit shows up as a break. In practice that noise, not missed breaks, is what makes teams turn checkers off. The scoped-internal cases (118–120) exist precisely to test that a checker can stay silent correctly.

  3. A binary verdict is not a release decision. "Compatible / incompatible" collapses distinctions that Part 0 showed are policy-relevant: a source-level API_BREAK ships fine for prebuilt binaries but breaks rebuilders; a COMPATIBLE_WITH_RISK noexcept change is fine unless a consumer relied on it. The 5-tier verdict and policy profiles exist because real release gates need that resolution — as do bundle-level comparison, application-scoped checks, and suppression workflows.
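The third point can be sketched as a mapping from verdicts to a release decision that depends on who consumes the library. The verdict names are the ones used on this page; the profile names and the choice of which verdicts block under each profile are hypothetical, not abicheck's shipped policy profiles.

```python
# Which verdicts block a release under each (hypothetical) policy profile.
BLOCKING = {
    # Consumers only swap in prebuilt binaries: source-level breaks do not hurt them.
    "prebuilt_consumers": {"BREAKING"},
    # Consumers rebuild from source: API breaks stop their builds.
    "source_consumers": {"BREAKING", "API_BREAK"},
    # Conservative gate: even risky-but-compatible changes need sign-off.
    "strict": {"BREAKING", "API_BREAK", "COMPATIBLE_WITH_RISK"},
}


def release_gate(verdicts, profile):
    """Return ("ship" or "block", sorted list of blocking verdicts)."""
    blockers = sorted(v for v in set(verdicts) if v in BLOCKING[profile])
    return ("block" if blockers else "ship", blockers)


# One hypothetical release diff, judged under three profiles.
found = ["NO_CHANGE", "COMPATIBLE", "COMPATIBLE_WITH_RISK", "API_BREAK"]

for profile in BLOCKING:
    print(profile, release_gate(found, profile))
```

A two-valued "compatible / incompatible" result cannot express the difference between the first and second profile, which is the distinction the paragraph above argues release gates need.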

And where everything stops: no static tool — abicheck included — can prove behaviour. A function that keeps its signature and layout but starts returning different values is invisible to every approach in §1 except runtime testing. The honest boundary is documented in Limitations and What ABI tools cannot prove; treat static ABI checking as the part of release safety you can automate exhaustively, not as all of it.
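The boundary can be illustrated by analogy in a few lines. Comparing Python signatures here stands in for a static comparison of declarations; it is not how an ABI checker works, and the function names are hypothetical. The point is only that the declared contract is identical while the behaviour is not.

```python
import inspect


def scale_v1(value: int) -> int:
    return value * 2


def scale_v2(value: int) -> int:
    # Same name, parameters and return type; different result.
    return value * 2 + 1


def same_declared_contract(old, new):
    """Static view: compare only what the declarations say."""
    return inspect.signature(old) == inspect.signature(new)


def swap_test(implementation):
    """Runtime view: a consumer expectation written against v1."""
    return implementation(21) == 42


print("declarations match:", same_declared_contract(scale_v1, scale_v2))
print("v1 passes swap test:", swap_test(scale_v1))
print("v2 passes swap test:", swap_test(scale_v2))
```

Only the swap test notices the change, and only because it happens to exercise the affected path.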


4. Using the encyclopedia as a detection atlas

Every capability claim in this series is backed by a runnable fixture, and the mapping is maintained mechanically — CI checks that every ChangeKind is produced by a detector, documented, and (for the catalog) carries a verified verdict and minimum evidence tier:

  • Capability → meaning: the Change Kind Reference lists every detectable change kind with its classification.
  • Capability → proof: each example page names the change kinds it triggers, its verdict, and includes a Real Failure Demo; the expected results live in ground_truth.json, which the benchmark gates on.
  • Capability → required input: the min_evidence field per case, aggregated in the evidence-tier benchmark, tells you exactly which input you must provide before that break becomes visible — which is the practical answer to "what do I need to feed the checker in my CI?"
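A mechanical consistency check of this kind can be sketched with plain set lookups. Everything named here is hypothetical: the change-kind names, the catalog record shape, and the mapping_gaps helper illustrate the idea of the check, not the project's actual CI script.

```python
def mapping_gaps(change_kinds, detected_kinds, documented_kinds, catalog):
    """List every place where the capability mapping is incomplete."""
    problems = []
    for kind in sorted(change_kinds):
        if kind not in detected_kinds:
            problems.append(f"{kind}: no detector")
        if kind not in documented_kinds:
            problems.append(f"{kind}: undocumented")
    for case_id, record in sorted(catalog.items()):
        for field in ("verdict", "min_evidence"):
            if not record.get(field):
                problems.append(f"{case_id}: missing {field}")
    return problems


# Hypothetical inventory with two deliberate gaps.
kinds = {"FIELD_OFFSET_CHANGED", "SYMBOL_REMOVED", "ENUM_VALUE_CHANGED"}
detected = {"FIELD_OFFSET_CHANGED", "SYMBOL_REMOVED", "ENUM_VALUE_CHANGED"}
documented = {"FIELD_OFFSET_CHANGED", "SYMBOL_REMOVED"}
catalog = {
    "toy_case_a": {"verdict": "BREAKING", "min_evidence": "L0"},
    "toy_case_b": {"verdict": "BREAKING"},
}

gaps = mapping_gaps(kinds, detected, documented, catalog)
for gap in gaps:
    print(gap)
```

A CI job would fail whenever the returned list is non-empty, which keeps the documentation, the detectors, and the fixtures from drifting apart.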

Where to go next