ADR-033: CI Rollout, Performance, Caching, and Validation Strategy
Date: 2026-06-09
Status: Accepted / Implemented (2026-06-12). Amended 2026-06-12 (ADR-028 source-tree model); see Amendment below. Implementation status below.
Decision maker: Nikolay Petrov
Context
Source/build-aware analysis (ADR-028..032) can easily become too expensive for normal CI if treated as all-or-nothing. abicheck's target audience includes projects that want ABI/API checks in day-to-day development. The rollout must prioritize:
- fast post-build scans;
- no mandatory instrumented rebuild;
- deterministic caching;
- useful partial analysis in PRs;
- full/deep analysis only for baselines, nightly jobs, or release gates;
- honest coverage reporting when evidence is partial.
Decision
D1. Implement as a complexity ladder
| Phase | Name | Required inputs | Output | CI suitability |
|---|---|---|---|---|
| 0 | Current artifact compare | binary + optional debug + headers | ABI snapshot/report | Existing default |
| 1 | Build context capture | compile DB and/or CMake/Ninja/Bazel query output | build evidence, build-option diff | PR/default feasible |
| 2 | Target ownership and source localization | build graph + header/debug locations | finding-to-target/source/build-file mapping | PR/default feasible |
| 3 | Source ABI replay | sources + compile contexts + public headers | source/API findings | PR changed mode, baseline target mode |
| 4 | Graph summary | `BuildEvidence` + L2/L4 facts | graph-to-graph summary diff | PR if scoped; nightly preferred |
| 5 | External graph backend | Kythe/CodeQL/full graph DB | deep impact/call/reference queries | Nightly/release only |
| 6 | Instrumented compiler/build plugins | compiler flags/passes/wrappers | rich compiler-emitted graphs | Opt-in specialist mode |
Each phase must be usable independently. Adopting Phase 1 must not require Phase 3 or Phase 5.
D2. CI modes
abicheck:
  evidence:
    mode: off | build | source-changed | source-target | graph-summary | graph-full
    strict: false
    collect_raw: false
    redact: true
| Mode | Behavior | Use |
|---|---|---|
| `off` | Current abicheck only | Existing users, fastest path |
| `build` | Collect build context and compare ABI-relevant flags | Default extension MVP |
| `source-changed` | Build context + source ABI replay only for changed public headers/TUs | PR mode |
| `source-target` | Source ABI replay for the targets producing compared binaries | Release baseline |
| `graph-summary` | Compact graph facts and source/binary mapping | Nightly, or PR on smaller projects |
| `graph-full` | External graph backend | Deep/nightly/release investigation |
These CI modes and the ADR-030 D7 source-replay scopes are two different knobs: the CI mode selects which evidence layers run, and internally sets the replay scope. The mapping is:
| CI evidence mode | Layers engaged | ADR-030 replay scope |
|---|---|---|
| `off` | none | `off` |
| `build` | L3 | `off` |
| `source-changed` | L3 + L4 | `changed` |
| `source-target` | L3 + L4 | `target` |
| `graph-summary` | L3 + L4 + L5 summary | `changed` (PR) or `target` (baseline) |
| `graph-full` | L3 + L4 + L5 full | `target` or `full` |
The remaining ADR-030 scopes (headers-only, full) have no dedicated CI
mode; they stay reachable through explicit replay configuration for users
who need them.
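The mapping table can be read as a small lookup. The sketch below is illustrative only: `EvidencePlan`, `plan_for_mode`, and the `is_baseline` / `full_replay` switches are hypothetical names introduced here, not abicheck's API. The table does not say what selects between the two scopes listed for `graph-summary` and `graph-full`, so those switches are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidencePlan:
    layers: tuple       # evidence layers engaged, e.g. ("L3", "L4")
    replay_scope: str   # ADR-030 replay scope: off | changed | target | full


def plan_for_mode(mode, is_baseline=False, full_replay=False):
    """Translate a CI evidence mode into layers and a replay scope (hypothetical helper)."""
    if mode == "off":
        return EvidencePlan((), "off")
    if mode == "build":
        return EvidencePlan(("L3",), "off")
    if mode == "source-changed":
        return EvidencePlan(("L3", "L4"), "changed")
    if mode == "source-target":
        return EvidencePlan(("L3", "L4"), "target")
    if mode == "graph-summary":
        # The table lists two scopes: changed for PRs, target for baselines.
        scope = "target" if is_baseline else "changed"
        return EvidencePlan(("L3", "L4", "L5-summary"), scope)
    if mode == "graph-full":
        # "target or full": the selector is an assumption of this sketch.
        scope = "full" if full_replay else "target"
        return EvidencePlan(("L3", "L4", "L5-full"), scope)
    raise ValueError(f"unknown evidence mode: {mode!r}")
```

Rejecting unknown modes with an exception, rather than silently falling back to `off`, keeps a typo in CI configuration from disabling evidence collection unnoticed.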
D3. PR mode is trigger/localizer, not authority
In PR workflows (ADR-025's model, extended with evidence):
- Use changed paths to decide whether to run or scope evidence collection.
- Always fail open: if classification is uncertain, run the full artifact comparison.
- Build-file-only changes trigger at least Phase 1 build-context comparison.
- Source localization maps artifact findings to changed hunks when provenance exists.
- Source-only prechecks may produce early `API_BREAK`/risk signals, but artifact comparison remains the authoritative gate for shipped ABI (ADR-028 D3).
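A minimal sketch of the fail-open rule follows. The file-name and suffix sets, `classify`, `recommend`, and `Recommendation` are hypothetical and chosen only for illustration; they are not abicheck's classifier. The point shown is that any unclassifiable path (or an empty change list) forces the full artifact comparison, while known paths only choose how much evidence to collect.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Hypothetical classification sets; a real project would configure its own.
BUILD_NAMES = {"CMakeLists.txt", "BUILD", "BUILD.bazel", "WORKSPACE", "Makefile", "build.ninja"}
BUILD_SUFFIXES = {".cmake", ".bzl"}
SOURCE_SUFFIXES = {".h", ".hh", ".hpp", ".c", ".cc", ".cpp", ".cxx"}
DOC_SUFFIXES = {".md", ".rst"}


@dataclass(frozen=True)
class Recommendation:
    evidence_mode: str           # one of the D2 modes
    full_artifact_compare: bool  # True when the classifier failed open


def classify(path):
    p = PurePosixPath(path)
    # Build files are checked first so that CMakeLists.txt is not treated as text.
    if p.name in BUILD_NAMES or p.suffix in BUILD_SUFFIXES:
        return "build"
    if p.suffix in SOURCE_SUFFIXES:
        return "source"
    if p.suffix in DOC_SUFFIXES:
        return "docs"
    return "unknown"


def recommend(changed_paths):
    kinds = {classify(p) for p in changed_paths}
    # Fail open: an uncertain classification must never narrow the check.
    uncertain = not kinds or "unknown" in kinds
    if "source" in kinds:
        mode = "source-changed"
    elif "build" in kinds:
        mode = "build"   # build-file-only changes get at least Phase 1
    else:
        mode = "off"
    return Recommendation(mode, uncertain)
```

For example, `recommend(["CMakeLists.txt"])` selects `build`, while `recommend(["tools/gen.py"])` sets `full_artifact_compare` because the path matches none of the known sets.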
D4. Baseline registry stores evidence packs optionally
Extend baseline registry entries (ADR-022):
{
  "snapshot": "libfoo.abi.json.zst",
  "evidence_pack": "libfoo.evidence.tar.zst",
  "evidence_hash": "sha256:...",
  "coverage": {
    "build_context": true,
    "source_abi": "target",
    "graph": "summary"
  }
}
Storage policy:
| Artifact | Store by default? | Notes |
|---|---|---|
| ABI snapshot | yes | Existing behavior |
| `BuildEvidence` normalized JSON | yes, when collected | Small enough for baselines |
| Source ABI linked surface | yes, when collected | Store compressed |
| Per-TU source ABI dumps | optional/cache | Can be large; useful for incremental diff |
| Graph summary | yes, when collected | Compact by design |
| Full graph DB / `.kzip` / CodeQL DB | no by default | Large; store only in audit/deep mode |
| Raw artifacts | no by default in public CI | Enable in audit or private baseline mode (ADR-032 D7/D9) |
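As an illustration of the `evidence_hash` field, the sketch below computes and verifies a `sha256:`-prefixed digest. It assumes the hash covers the raw bytes of the stored evidence pack file, which the entry format suggests but this ADR does not state; `evidence_hash` and `verify_entry` are hypothetical helper names.

```python
import hashlib


def evidence_hash(pack_bytes):
    """Digest of the evidence pack bytes (assumed hashing scope), in 'sha256:<hex>' form."""
    return "sha256:" + hashlib.sha256(pack_bytes).hexdigest()


def verify_entry(entry, pack_bytes):
    """Return True only when the stored hash matches the pack that was fetched."""
    expected = entry.get("evidence_hash")
    # An entry without a hash is treated as unverifiable rather than trusted.
    return expected is not None and expected == evidence_hash(pack_bytes)
```

Checking the digest before reusing a pack from the registry keeps a truncated or swapped pack from feeding a comparison.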
D5. Deterministic caching is required
| Cache | Key includes | Output |
|---|---|---|
| `BuildEvidence` cache | build-system raw inputs, compile DB hash, adapter version | normalized build evidence |
| Header/castxml cache | header content, transitive include metadata, build context | existing L2 output |
| `SourceAbiTu` cache | source/header content, compile context, extractor version | per-TU source ABI dump |
| Source ABI linked cache | TU dump hashes, binary exported symbols, public header set | linked source ABI surface |
| Graph summary cache | `BuildEvidence` hash, L2/L4 hashes, extractor version | graph summary |
Cache invalidation must prefer false misses over false hits. A stale cache can produce incorrect ABI decisions.
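One way to get this property is a content-addressed key that folds in every input listed in the table, so that any change to any input produces a miss. The sketch below is a generic illustration, not abicheck's cache code; `content_digest`, `cache_key`, and the field names in the example are hypothetical.

```python
import hashlib
import json


def content_digest(data):
    """Hash file content so the key depends on bytes, not on paths or mtimes."""
    return hashlib.sha256(data).hexdigest()


def cache_key(kind, inputs, tool_version):
    """Deterministic key over everything that can influence the cached output."""
    payload = json.dumps(
        {"kind": kind, "version": tool_version, "inputs": inputs},
        sort_keys=True,            # dict ordering must not change the key
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Example: a per-TU key built from source content, compile context, and version.
key = cache_key(
    "source-abi-tu",
    {
        "source": content_digest(b"int f();\n"),
        "compile_args": ["-std=c++17", "-DNDEBUG"],  # list order is preserved on purpose
    },
    tool_version="1",
)
```

Keeping the compile arguments in their original order is deliberately conservative: reordering flags yields a different key and therefore a miss, which matches the preference for false misses over false hits.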
D6. Performance targets are tiered, not absolute
Project size varies; define relative expectations:
| Operation | Expected cost profile |
|---|---|
| Phase 1 build-context normalization | low; mostly JSON/proto/text parsing |
| Build option diff | very low |
| CMake/Ninja query adapters | low when the build tree exists |
| Bazel `cquery`/`aquery` | medium; depends on workspace analysis cost |
| Source replay `changed` | medium, bounded by changed files/TUs |
| Source replay `target`/`full` | high without cache; acceptable for baseline/nightly |
| Graph summary | medium; bounded by the selected target |
| Full Kythe/CodeQL | high; not a PR default |
Commands must print coverage and timing summaries so users can tune mode selection.
D7. Evidence-aware policy controls
Extend policy profiles (ADR-010 / policy files):
policy:
  source_only_findings: warn      # ignore | warn | fail-api | fail-release
  build_context_drift: warn       # ignore | warn | fail-on-abi-relevant
  graph_risk_findings: warn       # ignore | warn | fail
  require_evidence:
    build_context: false
    source_abi: false
    graph_summary: false
Default policy:
- artifact-proven ABI breaks keep current behavior (verdicts and exit codes unchanged, ADR-009);
- a source-only API break is reported with an `API_BREAK` verdict, and policy decides whether it fails a release;
- build-context drift is a `COMPATIBLE_WITH_RISK` signal unless artifact changes confirm a break;
- graph-only risks are informational/warnings by default.
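The `require_evidence` block can be checked against a coverage block such as the one in D4. The sketch below is a hedged illustration: the function name, the mapping of the policy key `graph_summary` to the coverage key `graph`, and the set of values treated as "absent" are all assumptions made here, not documented abicheck behavior.

```python
# Hypothetical mapping from policy keys to coverage-block keys.
POLICY_TO_COVERAGE = {
    "build_context": "build_context",
    "source_abi": "source_abi",
    "graph_summary": "graph",
}

# Values assumed to mean "this evidence was not collected".
ABSENT = (None, False, "", "off", "none")


def missing_required_evidence(coverage, require_evidence):
    """List the policy keys that are required but not covered by the run."""
    missing = []
    for policy_key, required in require_evidence.items():
        if not required:
            continue
        coverage_key = POLICY_TO_COVERAGE.get(policy_key, policy_key)
        if coverage.get(coverage_key) in ABSENT:
            missing.append(policy_key)
    return missing
```

With the D4 example coverage (`build_context: true`, `source_abi: "target"`, `graph: "summary"`) and every requirement enabled, the list is empty; dropping `source_abi` from the coverage block would return `["source_abi"]`, which a caller could then surface as a missing-evidence finding.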
D8. Validation strategy
Validation must prove two things:
- evidence collection improves correctness and false-positive handling;
- evidence collection never hides real artifact-backed breaks silently.
The second property extends the existing FP-rate gate philosophy (labelled corpus, zero-FP/zero-FN baselines) to evidence-assisted runs.
Test suites:
| Suite | Purpose |
|---|---|
| Build flag drift corpus | -D, -std, packing, visibility, sysroot, stdlib ABI toggles |
| Public/private surface corpus | surface ledger and leak guard from ADR-024 |
| Source-only API corpus | macros, default args, inline/template/constexpr changes (extends ADR-026 fixtures) |
| Generated file corpus | generated headers and missing dependencies |
| Cross-tool parity | compare selected outcomes with libabigail, ABI Dumper, Android header checker where feasible (ADR-019) |
| Cross-build-system fixtures | CMake/Ninja/Bazel/Make minimal projects |
| Large-project performance fixtures | cache hit/miss, changed-only scan, target scan |
| Security/redaction tests | command-line secrets and path redaction |
D9. Metrics
Track in CI and internal benchmarks:
coverage.build_context.present
coverage.source_abi.mode
coverage.graph.mode
extractor.duration_seconds
extractor.cache_hit_rate
findings.artifact_backed.count
findings.source_only.count
findings.build_context_drift.count
findings.demoted_by_surface.count
findings.suppressed_with_reason.count
false_positive_delta_vs_baseline
The most important product metric is not the number of findings; it is reviewable signal quality: fewer noisy non-public/private/build-context artifacts, more explainable true breaks.
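Two of the metrics above, `extractor.duration_seconds` and `extractor.cache_hit_rate`, can be gathered with a small accumulator. The sketch below is illustrative; `RunMetrics` and its methods are hypothetical names, and reporting the hit rate as `None` when no lookups occurred is a choice made here rather than something this ADR specifies.

```python
import time
from contextlib import contextmanager


class RunMetrics:
    """Accumulates timing and cache statistics for one extractor run (hypothetical)."""

    def __init__(self):
        self.duration_seconds = 0.0
        self.cache_hits = 0
        self.cache_misses = 0

    @contextmanager
    def timed(self):
        # perf_counter is monotonic, so wall-clock adjustments do not skew timings.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.duration_seconds += time.perf_counter() - start

    def record_lookup(self, hit):
        if hit:
            self.cache_hits += 1
        else:
            self.cache_misses += 1

    @property
    def cache_hit_rate(self):
        total = self.cache_hits + self.cache_misses
        return self.cache_hits / total if total else None

    def summary(self):
        return {
            "extractor.duration_seconds": round(self.duration_seconds, 3),
            "extractor.cache_hit_rate": self.cache_hit_rate,
        }
```

A caller would wrap each extraction step in `with metrics.timed():`, call `record_lookup` on every cache probe, and emit `summary()` at the end of the run.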
D10. Documentation and UX rollout
Documentation presents this as optional depth levels:
Level 0: binary only
Level 1: binary + debug
Level 2: binary + debug + public headers
Level 3: + build context
Level 4: + source ABI replay
Level 5: + graph summary / external graph backend
After every run, users should be able to answer:
- What did abicheck compare?
- What evidence was missing?
- Which findings are artifact-backed?
- Which findings are source-only or build-context-only?
- Did any suppression or surface scoping demote a finding?
- Which build option or source file likely caused a change?
Consequences
Positive
- Lets projects adopt the extension gradually.
- Keeps PR scans fast and predictable.
- Makes baseline/deep modes available without burdening every run.
- Creates measurable validation gates before any source/build feature becomes a default.
- Preserves existing CI semantics unless users enable new evidence modes.
Negative / risks
- Multiple modes increase documentation and support burden.
- Partial evidence can confuse users unless reports are clear.
- Full graph/source modes may be expensive on monorepos.
- Baseline storage can grow if raw artifacts and per-TU dumps are retained.
Implementation plan
| Phase | Scope | Output |
|---|---|---|
| 1 | Add `--collect-mode build` and timing/coverage summary | MVP CI mode |
| 2 | Baseline registry pack storage | Evidence reusable across comparisons |
| 3 | Changed-only source replay mode | PR source/API signals |
| 4 | Policy controls for source/build/graph risks | Tunable fail behavior |
| 5 | Graph summary mode | Explanation and graph diff |
| 6 | External backend/nightly examples | Kythe/CodeQL documentation |
| 7 | Performance and false-positive benchmark reports | Confidence before defaults change |
References
- ADR-017 — GitHub Action Design (017-github-action.md)
- ADR-022 — Baseline Registry and Snapshot Distribution (022-baseline-registry.md)
- ADR-025 — PR-Diff-Aware ABI Evaluation (025-pr-diff-source-evaluation.md)
- ADR-028 — Evidence Pack Architecture (028-source-build-evidence-pack.md)
- ADR-029 — Build Graph and Toolchain Context Capture (029-build-graph-toolchain-context-capture.md)
- ADR-030 — Source ABI Replay (030-source-abi-replay-and-linked-source-surface.md)
- ADR-031 — Source Graph Augmentation (031-source-implementation-graph-augmentation.md)
Amendment (2026-06-12): merge flow and `collect` demotion (see ADR-028)
`abicheck merge a.json b.json -o out.json` combines independently produced build-side and source-side dumps into one baseline, enabling parallel baseline preparation (the embedded single-artifact storage, ADR-028 D8, exists for this). `collect` is demoted to an advanced command; the CI evidence modes (D2) select the inputs/scopes for `dump --sources`/`--build-info` internally.
Implementation status (2026-06-12)
| Decision | Status | Where |
|---|---|---|
| D1 complexity ladder | done | buildsource/ L0–L5 layers |
| D2 CI modes | done | dump --collect-mode (build = L3-only, source/graph = L3+L4+L5); pre-captured packs filtered to the layer set; collection_for_ci_mode() |
| D3 PR localizer | done | recommend_collect_mode() + abicheck recommend-collect-mode (build-file ⇒ build, source/header ⇒ source-changed); artifact compare stays authoritative |
| D4 baseline coverage block | done | BaselineMetadata.evidence_coverage persists {build_context, source_abi, graph} |
| D5 deterministic caching | done | per-TU SourceAbiCache (L4, the dominant cost) + content-addressed BuildEvidenceCache (L3); both false-miss-preferring. The L5 graph fold is recomputed by design rather than cached (D6 rates it medium, bounded by the selected target) |
| D6 timing summary | done | compare evidence_metrics (stderr + JSON) |
| D7 evidence policy | done | evidence_policy block: source_only_findings/build_context_drift/graph_risk_findings/require_evidence + EVIDENCE_REQUIRED_MISSING kind |
| D8 validation | done | FP-rate gate exposes D9 deltas; corpus tests per layer |
| D9 metrics | done | compare buckets (post-suppression), cache_hit_rate (collect), false_positive_delta_vs_baseline (FP gate) |
| D10 docs | done | docs/concepts/build-source-data.md, docs/user-guide/policies.md |
| Phases 1–7 | done | incl. scripts/evidence_benchmark.py (Phase 7 perf + FP report) |