ADR-003: Data Source Architecture — Checks, Instruments, and Binary Types
Date: 2026-03-17
Status: Accepted — implemented
Decision maker: Nikolay Petrov
Context
abicheck collects ABI data from three independent layers:
| Layer | Source | Tool | What it provides |
|---|---|---|---|
| L0: Binary metadata | ELF/PE/Mach-O | pyelftools, pefile, macholib | Exported symbols, versioning, SONAME, DT_NEEDED, binding, visibility |
| L1: Debug info | DWARF (.debug_info) | pyelftools | Struct layouts (size, field offsets, alignment), enum values, calling conventions |
| L2: Header AST | C/C++ headers | castxml | Function signatures, parameter types, return types, typedefs, #define constants |
Today, L2 (headers + castxml) is the primary type source. L1 (DWARF) is used only
for cross-checking L2 data (struct sizes, field offsets). Without headers, abicheck
falls back to elf_only_mode — only L0 symbol-level checks run, and DWARF data is
largely unused for type comparison.
This creates a gap: a binary with full DWARF but no headers gets only symbol-level analysis. DWARF contains complete type information (struct definitions, function prototypes, enum values, inheritance, vtables) that is currently ignored.
Current detector → data source mapping
30 detectors in compare()
├── 24 AST detectors (L2) → old.functions, old.types, old.enums, old.typedefs, old.constants
│   └── Only fire when elf_only_mode=False (headers were provided)
├── 3 binary metadata detectors (L0) → old.elf, old.pe, old.macho
│   └── Always fire (unconditional)
├── 2 DWARF detectors (L1) → old.dwarf, old.dwarf_advanced
│   └── Cross-check only: filters to types already known from L2
└── 1 fallback detector → old.elf.symbols
    └── Fires when elf_only_mode=True (no headers)
Problem: In elf_only_mode, 24 out of 30 detectors are skipped. The DWARF
detector also skips because it filters against the (empty) L2 type list.
Decision
1. Promote DWARF to a primary data source (L1 → L1+L2 capable)
Create abicheck/dwarf_snapshot.py — a DwarfSnapshotBuilder that constructs
a full AbiSnapshot from DWARF .debug_info alone:
def build_snapshot_from_dwarf(elf_path: str, elf_meta: ElfMetadata) -> AbiSnapshot:
    """Build a complete AbiSnapshot from DWARF, no headers required."""
This populates the same AbiSnapshot fields as castxml: functions, variables,
types, enums, typedefs. The comparison engine doesn't know or care whether
the snapshot came from castxml or DWARF.
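To make the idea concrete, here is a minimal sketch of how a walk over DWARF entries could fill snapshot fields. It is not abicheck's actual DwarfSnapshotBuilder: the `Die` and `Snapshot` classes are hypothetical stand-ins (a real implementation would read entries through pyelftools and fill the real AbiSnapshot), type references are simplified to names, and only structs, enums, and typedefs are handled.

```python
from dataclasses import dataclass, field


@dataclass
class Die:
    """Hypothetical stand-in for one DWARF debugging information entry."""
    tag: str
    attrs: dict
    children: list = field(default_factory=list)


@dataclass
class Snapshot:
    """Hypothetical, reduced stand-in for AbiSnapshot."""
    types: dict = field(default_factory=dict)     # struct name -> size and field offsets
    enums: dict = field(default_factory=dict)     # enum name -> {member: value}
    typedefs: dict = field(default_factory=dict)  # alias -> target type name


def build_snapshot(root: Die) -> Snapshot:
    """Walk the DIE tree and collect named structs, enums, and typedefs."""
    snap = Snapshot()
    stack = [root]
    while stack:
        die = stack.pop()
        name = die.attrs.get("DW_AT_name")
        if die.tag == "DW_TAG_structure_type" and name:
            fields = {
                c.attrs["DW_AT_name"]: c.attrs["DW_AT_data_member_location"]
                for c in die.children if c.tag == "DW_TAG_member"
            }
            snap.types[name] = {"size": die.attrs.get("DW_AT_byte_size"), "fields": fields}
        elif die.tag == "DW_TAG_enumeration_type" and name:
            snap.enums[name] = {
                c.attrs["DW_AT_name"]: c.attrs["DW_AT_const_value"]
                for c in die.children if c.tag == "DW_TAG_enumerator"
            }
        elif die.tag == "DW_TAG_typedef" and name:
            # Real DWARF stores DW_AT_type as a reference to another DIE;
            # this sketch simplifies it to the target's name.
            snap.typedefs[name] = die.attrs.get("DW_AT_type")
        stack.extend(die.children)
    return snap


cu = Die("DW_TAG_compile_unit", {}, [
    Die("DW_TAG_structure_type", {"DW_AT_name": "point", "DW_AT_byte_size": 8}, [
        Die("DW_TAG_member", {"DW_AT_name": "x", "DW_AT_data_member_location": 0}),
        Die("DW_TAG_member", {"DW_AT_name": "y", "DW_AT_data_member_location": 4}),
    ]),
    Die("DW_TAG_enumeration_type", {"DW_AT_name": "color"}, [
        Die("DW_TAG_enumerator", {"DW_AT_name": "RED", "DW_AT_const_value": 0}),
        Die("DW_TAG_enumerator", {"DW_AT_name": "GREEN", "DW_AT_const_value": 1}),
    ]),
    Die("DW_TAG_typedef", {"DW_AT_name": "point_t", "DW_AT_type": "point"}),
])
snapshot = build_snapshot(cu)
```

Because the result has the same shape regardless of producer, a comparison step written against `Snapshot` would not need to know that the data came from debug info rather than headers.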
2. Updated fallback chain in dumper.py
dump(binary_path, headers=None):
│
├── L0: Binary metadata (always)
│   ELF    → parse_elf_metadata()   → ElfMetadata
│   PE     → parse_pe_metadata()    → PeMetadata
│   Mach-O → parse_macho_metadata() → MachoMetadata
│
├── L1: Debug info (when present)
│   DWARF → parse_dwarf() → DwarfMetadata + AdvancedDwarfMetadata
│   BTF   → parse_btf()   → BtfMetadata (future, ADR-007)
│   PDB   → parse_pdb()   → PdbMetadata (PE only)
│
├── L2: Header AST (when headers provided)
│   castxml → _castxml_dump() → functions, types, enums, typedefs, constants
│
└── Snapshot assembly:
    headers provided?
    ├── YES → L2 primary + L1 cross-check (current behavior)
    │         elf_only_mode = False
    │         All 30 detectors fire
    │
    └── NO → L1 available?
        ├── YES → DwarfSnapshotBuilder (NEW)
        │         elf_only_mode = False
        │         All 24 AST detectors fire (DWARF-derived types)
        │         L1 cross-check skipped (same source)
        │         Warning: #define constants and default params unavailable
        │
        └── NO → L0 only
                 elf_only_mode = True
                 Only L0 detectors fire (symbol-level)
                 Warning: type info unavailable
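The snapshot-assembly decision at the bottom of the chain reduces to two boolean inputs. The sketch below shows that decision in isolation; `SnapshotPlan`, `plan_snapshot`, and the mode labels are hypothetical names chosen for illustration and are not taken from dumper.py.

```python
from dataclasses import dataclass, field


@dataclass
class SnapshotPlan:
    mode: str               # hypothetical label: "headers", "dwarf-only", or "symbols-only"
    elf_only_mode: bool
    warnings: list = field(default_factory=list)


def plan_snapshot(headers_provided: bool, debug_info_present: bool) -> SnapshotPlan:
    """Pick the richest available type source: L2, then L1, then L0 only."""
    if headers_provided:
        # L2 is primary; L1, if present, is used for cross-checking.
        return SnapshotPlan("headers", elf_only_mode=False)
    if debug_info_present:
        return SnapshotPlan(
            "dwarf-only",
            elf_only_mode=False,
            warnings=["#define constants and default params unavailable"],
        )
    return SnapshotPlan("symbols-only", elf_only_mode=True, warnings=["type info unavailable"])


plan = plan_snapshot(headers_provided=False, debug_info_present=True)
```

Keeping the decision in one pure function makes the three outcomes easy to unit-test without building any binaries.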
3. Detector data source matrix
This table defines which checks come from which data source, and what's available in each mode:
| Detector | Data Source | Headers mode | DWARF-only mode | Symbols-only mode |
|---|---|---|---|---|
| functions (added/removed/changed) | L2 (castxml) or L1 (DWARF) | Yes | Yes | — |
| variables (added/removed/type changed) | L2 or L1 | Yes | Yes | — |
| types (size, fields, bases, vtable) | L2 or L1 | Yes | Yes | — |
| enums (members, values) | L2 or L1 | Yes | Yes | — |
| typedefs | L2 or L1 | Yes | Yes | — |
| method_qualifiers (const, static, access) | L2 or L1 | Yes | Yes | — |
| unions (field changes) | L2 or L1 | Yes | Yes | — |
| param_defaults | L2 only | Yes | — | — |
| constants (#define values) | L2 only | Yes | — | — |
| template_inner_types | L2 or L1 (partial) | Yes | Partial | — |
| elf (soname, needed, versions, symbols) | L0 | Yes | Yes | Yes |
| pe (exports, imports, machine) | L0 | Yes | Yes | Yes |
| macho (exports, compat_version, deps) | L0 | Yes | Yes | Yes |
| dwarf (struct layout cross-check) | L1 | Yes (cross-check) | — (same source) | — |
| advanced_dwarf (calling conv, packing) | L1 | Yes | Yes | — |
| elf_deleted_fallback | L0 | — | — | Yes |
| reserved_fields | L2 or L1 | Yes | Yes | — |
| field_renames / enum_renames | L2 or L1 | Yes | Yes | — |
| pointer_levels / param_restrict | L2 or L1 | Yes | Yes | — |
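One way to keep this matrix checkable is to hold it as data and derive the per-mode detector lists from it. The sketch below transcribes a handful of rows from the table above; the dictionary layout, the mode labels, and both helper functions are illustrative assumptions, not abicheck's internal representation.

```python
# Availability per mode for a few rows of the table above.
# A missing mode key means "not available" (the table's dash).
MATRIX = {
    "functions":            {"headers": "yes", "dwarf-only": "yes"},
    "param_defaults":       {"headers": "yes"},
    "constants":            {"headers": "yes"},
    "template_inner_types": {"headers": "yes", "dwarf-only": "partial"},
    "elf":                  {"headers": "yes", "dwarf-only": "yes", "symbols-only": "yes"},
    "dwarf":                {"headers": "yes"},
    "advanced_dwarf":       {"headers": "yes", "dwarf-only": "yes"},
    "elf_deleted_fallback": {"symbols-only": "yes"},
}


def active_detectors(mode):
    """Detectors that run (fully or partially) in the given mode."""
    return sorted(name for name, modes in MATRIX.items() if mode in modes)


def missing_in(mode, baseline="headers"):
    """Detectors that run in the baseline mode but not in the given mode."""
    return sorted(set(active_detectors(baseline)) - set(active_detectors(mode)))


lost_without_headers = missing_in("dwarf-only")
```

For the rows transcribed here, the detectors lost when moving from headers mode to DWARF-only mode are exactly the two L2-only rows plus the L1 cross-check.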
4. DWARF type extraction — what DWARF provides and what it doesn't
| ABI element | DWARF availability | Notes |
|---|---|---|
| Function signatures | DW_TAG_subprogram + DW_TAG_formal_parameter | Full: name, return type, param types |
| Struct/class layout | DW_TAG_structure_type + DW_TAG_member | Full: size, field offsets, alignment |
| Enum definitions | DW_TAG_enumeration_type + DW_TAG_enumerator | Full: names, values, underlying type |
| Variables | DW_TAG_variable with DW_AT_external | Full: name, type, linkage |
| Typedefs | DW_TAG_typedef | Full: name → base type |
| Inheritance | DW_TAG_inheritance | Full: base classes, access, virtuality |
| Vtable entries | DW_AT_vtable_elem_location | Partial: depends on compiler |
| Templates | DW_TAG_template_type_parameter | Full: template parameter types |
| #define constants | NOT IN DWARF | Preprocessor (headers only) |
| Default param values | NOT IN DWARF | C++ frontend (headers only) |
| Inline function bodies | No exported symbol | Out of scope |
5. Visibility filtering in DWARF-only mode
DWARF contains all types and functions (including static/internal). We must filter to only ABI-relevant items:
# Intersection: DWARF functions × ELF exported symbols
exported = {s.name for s in elf_meta.symbols if s.binding in ('GLOBAL', 'WEAK') and s.defined}
for func in dwarf_functions:
    if func.linkage_name in exported or func.name in exported:
        func.visibility = Visibility.PUBLIC
    else:
        continue  # skip internal functions
The same applies to variables: only include DW_TAG_variable entries with
DW_AT_external=True that also appear in the ELF dynamic symbol table.
For types: include types reachable from exported function signatures and exported variable types. Transitively follow type references.
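The transitive step for types is a plain graph reachability problem. The sketch below assumes the type references have already been extracted into a name-to-names mapping; `reachable_types`, `type_refs`, and the example type names are hypothetical.

```python
from collections import deque


def reachable_types(roots, type_refs):
    """Return every type reachable from the root types.

    roots: type names used directly by exported functions and variables.
    type_refs: type name -> names of types it refers to (fields, bases, pointees).
    """
    seen = set()
    queue = deque(roots)
    while queue:
        name = queue.popleft()
        if name in seen:
            continue  # already visited; also guards against reference cycles
        seen.add(name)
        queue.extend(type_refs.get(name, ()))
    return seen


type_refs = {
    "foo_ctx": ["foo_opts", "foo_ctx"],  # self-reference, e.g. a linked-list pointer
    "foo_opts": ["foo_mode"],
    "internal_cache": ["foo_opts"],      # never referenced from an exported signature
}
public_types = reachable_types(["foo_ctx"], type_refs)
```

Here `internal_cache` is excluded even though it refers to a public type, because nothing exported refers to it.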
6. CLI changes
# Current behavior unchanged:
abicheck dump libfoo.so -H /usr/include/foo/ # Headers mode (castxml primary)
abicheck dump libfoo.so # Auto-detect: DWARF if present, else symbols-only
# New explicit flags:
abicheck dump libfoo.so --dwarf-only # Force DWARF even when headers available
abicheck compare old.so new.so --dwarf-only # Compare using DWARF-derived snapshots
# Diagnostic:
abicheck dump libfoo.so --show-data-sources # Print which layers are available
--show-data-sources output example:
Data sources for libfoo.so:
L0 Binary metadata: ELF (x86_64, SONAME=libfoo.so.1, 47 exported symbols)
L1 Debug info: DWARF 4 (142 types, 89 functions, 23 enums)
L2 Header AST: not available (no -H provided)
Using: DWARF-only mode (24/30 detectors active)
Missing: #define constants, default parameter values
7. Snapshot interchangeability
DWARF-derived and castxml-derived snapshots share an identical JSON schema.
This means:
- abicheck dump lib.so > dwarf.json and abicheck dump lib.so -H inc/ > ast.json
are both valid inputs to abicheck compare
- You can compare a DWARF snapshot against an AST snapshot (cross-mode comparison)
- The schema_version field remains the same
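A cross-mode comparison then only needs to check the schema, not the producer. The sketch below assumes a simplified JSON layout in which each snapshot is an object with `schema_version` and the field names listed in section 1; the exact layout, the `source` key, and both helper functions are assumptions made for illustration.

```python
import json

# Assumed minimal key set; the real schema is defined by AbiSnapshot.
REQUIRED_KEYS = {"schema_version", "functions", "types", "enums", "typedefs"}


def load_snapshot(text):
    """Parse a snapshot and verify it carries the expected top-level keys."""
    snap = json.loads(text)
    missing = REQUIRED_KEYS - snap.keys()
    if missing:
        raise ValueError(f"snapshot is missing keys: {sorted(missing)}")
    return snap


def removed_functions(old, new):
    """Names present in old but not in new, regardless of which layer produced them."""
    if old["schema_version"] != new["schema_version"]:
        raise ValueError("schema_version mismatch")
    return sorted(set(old["functions"]) - set(new["functions"]))


dwarf_json = json.dumps({
    "schema_version": 1, "source": "dwarf",  # "source" is a hypothetical key
    "functions": {"foo_create": {}, "foo_destroy": {}},
    "types": {}, "enums": {}, "typedefs": {},
})
ast_json = json.dumps({
    "schema_version": 1, "source": "castxml",
    "functions": {"foo_create": {}},
    "types": {}, "enums": {}, "typedefs": {},
})
removed = removed_functions(load_snapshot(dwarf_json), load_snapshot(ast_json))
```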
8. Per-platform data source availability
| Platform | L0 (binary) | L1 (debug) | L2 (headers) | Typical mode |
|---|---|---|---|---|
| Linux ELF | pyelftools | DWARF (pyelftools) | castxml | All three |
| Linux ELF (stripped) | pyelftools | — | castxml | L0+L2 |
| Linux ELF (no headers) | pyelftools | DWARF | — | L0+L1 (NEW) |
| Linux ELF (bare) | pyelftools | — | — | L0 only |
| Windows PE | pefile | PDB (partial) | castxml | L0+L2 |
| macOS Mach-O | macholib | DWARF (pyelftools) | castxml | All three |
| Kernel modules | pyelftools | BTF/DWARF | — | L0+L1 (future) |
Consequences
Positive
- 24 detectors become available for header-less binaries (vs 6 today)
- Removes castxml requirement for the majority of use cases
- Same AbiSnapshot model: zero changes to checker, reporter, suppression
- Interchangeable JSON snapshots: DWARF ↔ castxml ↔ mixed comparisons work
- Clear mental model: L0/L1/L2 layers with documented coverage per detector
Negative
- DWARF parsing for full type extraction is slower than castxml (~5-20×)
- Two code paths to build AbiSnapshot: need validation that they produce equivalent results
- #define constants and default params are L2-only (warn user)
- pyelftools DWARF 5 has gaps (string offsets, macro info)
Implementation Plan
| Phase | Scope | Effort |
|---|---|---|
| 1 | DwarfSnapshotBuilder: structs, enums, typedefs | 3-5 days |
| 2 | Function/variable signatures with full type resolution | 3-5 days |
| 3 | C++ features: inheritance, vtable, templates from DWARF | 3-5 days |
| 4 | Visibility filtering (DWARF × ELF symbol intersection) | 1-2 days |
| 5 | CLI: auto-detection, --dwarf-only, --show-data-sources | 1-2 days |
| 6 | Validation: DWARF vs castxml snapshot equivalence on test suite | 2-3 days |
Extension: Binary Fingerprint Rename Detection (Exploratory)
Date: 2026-03-23
Status: Exploratory prototype implemented
Context
In elf_only_mode (L0 only, no DWARF or headers), symbol renames appear as
"removed + added" pairs: noisy churn that obscures real ABI changes. When a
library renames libfoo_v1_create() → libfoo_create() without changing the
code, the diff engine reports one BREAKING removal and one COMPATIBLE addition
instead of a single "renamed" signal.
Approach
Lightweight binary fingerprinting using data already available in L0:
- Function size fingerprinting: Use st_size from .dynsym to match removed/added symbol pairs with identical code sizes.
- Code hash fingerprinting: When the binary file is available (not just a serialized snapshot), read the function's code bytes from .text and compute SHA-256 for exact matching.
- Section-level triage: Compare .text/.rodata/.data section hashes for a coarse "did the binary change significantly" signal.
Implementation
- abicheck/binary_fingerprint.py: standalone module with:
  - compute_function_fingerprints(binary_path) → code-hash fingerprints
  - match_renamed_functions(old_fps, new_fps) → 3-pass matching (exact, size-only, fuzzy within 5% tolerance)
  - compute_section_summary(binary_path) → section-level triage
- fingerprint_renames detector registered in diff_symbols.py; fires only in elf_only_mode when both snapshots have ELF metadata.
- New FUNC_LIKELY_RENAMED change kind (verdict: COMPATIBLE_WITH_RISK).
Scope boundaries (not in scope)
- Full disassembly or CFG extraction
- BinDiff/Ghidra integration
- Instruction-level analysis
- Architecture-specific knowledge
Next steps
If the prototype shows value (measurable reduction in false removed/added
pairs on real-world libraries), write a full ADR and integrate into the
post-processing pipeline to suppress redundant FUNC_REMOVED + FUNC_ADDED
pairs when a FUNC_LIKELY_RENAMED exists for the same symbol pair.
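The three-pass matching named for the fingerprint extension (exact, size-only, fuzzy within 5% tolerance) can be sketched as follows. This is an illustration of the idea, not the code in binary_fingerprint.py: the `Fingerprint` class, the `match_renames_sketch` name, and the rule of accepting a pair only when exactly one candidate exists are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fingerprint:
    size: int        # st_size from .dynsym
    code_hash: str   # hex digest of the function bytes, "" when unavailable


def match_renames_sketch(removed, added, tolerance=0.05):
    """Pair removed and added symbols in three passes of decreasing strictness.

    removed/added: dict of symbol name -> Fingerprint for symbols that exist
    on one side only. Returns a list of (old_name, new_name, how) tuples.
    """
    matches = []
    free_old, free_new = dict(removed), dict(added)

    def run_pass(how, same):
        for old_name in sorted(free_old):  # sorted() copies, so deleting below is safe
            old_fp = free_old[old_name]
            candidates = [n for n in sorted(free_new) if same(old_fp, free_new[n])]
            if len(candidates) == 1:  # assumed policy: skip ambiguous matches
                new_name = candidates[0]
                matches.append((old_name, new_name, how))
                del free_old[old_name]
                del free_new[new_name]

    run_pass("exact", lambda a, b: a.code_hash != "" and a.code_hash == b.code_hash
             and a.size == b.size)
    run_pass("size", lambda a, b: a.size > 0 and a.size == b.size)
    run_pass("fuzzy", lambda a, b: a.size > 0 and abs(a.size - b.size) <= tolerance * a.size)
    return matches


removed = {
    "libfoo_v1_create": Fingerprint(120, "aa"),
    "libfoo_v1_free": Fingerprint(64, ""),
    "libfoo_v1_poll": Fingerprint(200, ""),
}
added = {
    "libfoo_create": Fingerprint(120, "aa"),
    "libfoo_free": Fingerprint(64, ""),
    "libfoo_poll": Fingerprint(206, ""),
}
renames = match_renames_sketch(removed, added)
```

Running the strict passes first means a weaker signal can never steal a symbol that a stronger one would have matched.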
Extension: clang as an alternative L2 frontend (implemented)
Date: 2026-06-15
Status: Implemented — first slice landed. Surfaced by the UXL field run
(validation/uxl-scan-levels-timing-2026-06.md, P1).
Context
L2 (Header AST) has a single producer: castxml (_castxml_dump →
dumper_castxml.py). Two field-observed problems:
- castxml-only blocks clang-only hosts. dump --headers / scan -H hard-fail with "castxml not found" on the many dev/CI images that ship clang but not castxml. With no L2 there is no header-aware public-surface scoping, and all four ADR-035 D4 cross-source checks skip (no public-header provenance): a whole feature class silently off.
- castxml's bundled clang lags the system toolchain and aborts on current-standard stdlib headers (e.g. the C++23 bf16 literal in recent libstdc++; see the case80/case89 known_gaps), before any detector runs.
ADR-001 originally rejected LLVM/clang tooling (Option C) as too heavy
(~500 MB). That rationale is now stale: clang is already a dependency —
the L4 source-ABI replay extractor uses clang -ast-dump=json
(ADR-030 D3, buildsource/source_extractors/), behind the ADR-032 extractor
interface.
Direction
Add a clang L2 backend that produces the same AbiSnapshot fields as
castxml (functions, types, enums, typedefs, constants) from
clang -ast-dump=json (or libclang) over the public headers, selected by a
backend knob (auto: prefer the available/most-capable frontend, mirroring the
L4 --source-abi-extractor auto choice). This revisits ADR-001 Option C
for L2 only, scoped to a header parse — not a full LLVM rewrite.
Snapshot interchangeability (§7) is the constraint: a clang-derived L2 snapshot must be schema-equivalent to a castxml-derived one, so the two are a parity oracle for each other (same pattern as the DWARF↔castxml and libabigail/ABICC parity gates). Using both also maximises host coverage: clang where castxml is absent or chokes; castxml where the clang JSON-AST schema (which drifts across clang releases) is unsupported.
Scope boundaries (not in scope)
- Replacing castxml: it stays the default and the schema reference.
- A new AbiSnapshot field or detector: this is a new producer of existing fields only.
What landed (first slice)
A clang -ast-dump=json → AbiSnapshot parser, abicheck/dumper_clang.py
(_ClangAstParser), a sibling to dumper_castxml._CastxmlParser that exposes
the identical parse_functions/parse_variables/parse_types/parse_enums/
parse_typedefs/parse_constants surface. It is selected by a backend knob:
- dumper.dump(..., header_backend=...) and service.run_dump/resolve_input thread a header_backend of auto|castxml|clang;
- the CLI exposes --ast-frontend on dump and compare;
- the ABICHECK_AST_FRONTEND env var sets the global default (so a clang-only CI image flips the default with no flag), consulted by dumper._resolve_header_backend;
- auto prefers castxml (the schema reference), then falls back to clang when only clang is on PATH, closing the P1 hard-fail on clang-only hosts.
dumper._header_ast_parser is the single factory both frontends sit behind, so
the per-format _dump_elf/_dump_macho/_dump_pe builders consume either
parser uniformly. The clang JSON cache lives beside the castxml XML cache
(~/.cache/abi_check/clang/*.json), keyed on the backend so the two never
collide.
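The selection rules described in this section (explicit knob, environment default, then `auto` preferring castxml) can be sketched as one small function. This is not dumper._resolve_header_backend: the function name, its parameters, and the precedence of an explicit value over the environment variable are assumptions, and the `which` parameter is injected only so the example runs without either tool installed.

```python
import shutil

VALID_BACKENDS = ("auto", "castxml", "clang")


def resolve_backend_sketch(explicit=None, env=None, which=shutil.which):
    """Resolve auto|castxml|clang to a concrete frontend name.

    explicit: value of a CLI flag or API argument, if any (assumed to win).
    env: mapping to read ABICHECK_AST_FRONTEND from (pass os.environ in real use).
    which: lookup function returning a path or None, like shutil.which.
    """
    choice = explicit or (env or {}).get("ABICHECK_AST_FRONTEND") or "auto"
    if choice not in VALID_BACKENDS:
        raise ValueError(f"unknown header backend: {choice!r}")
    if choice != "auto":
        return choice
    # auto: castxml is the schema reference, clang is the fallback.
    for tool in ("castxml", "clang"):
        if which(tool):
            return tool
    raise RuntimeError("no header AST frontend found on PATH (tried castxml, clang)")


def clang_only(tool):
    """Simulate a CI image that ships clang but not castxml."""
    return "/usr/bin/clang" if tool == "clang" else None


chosen = resolve_backend_sketch(env={}, which=clang_only)
```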
Coverage trade-off. clang's JSON AST is syntactic — it does not compute
record layout — so a clang-derived RecordType carries field names/types,
bases, and access but not size_bits/offset_bits/vtable slots (left
None/empty; the layout detectors skip an unknown-vs-unknown comparison and
DWARF (L1) stays the layout authority). Everything the source-API and
public-surface-scoping detectors need — signatures, noexcept/const/
explicit, enum values, typedef targets, public constant values — is produced,
which is exactly what the D4 cross-source checks consume.
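The "skip an unknown-vs-unknown comparison" rule for layout fields can be illustrated with a tiny comparison helper. The function name and result labels are hypothetical, and the handling of the one-sided case (layout known on only one side) is an assumption, since the text above only specifies the case where both sides are unknown.

```python
from typing import Optional


def compare_size_bits(old_bits: Optional[int], new_bits: Optional[int]) -> str:
    """Compare a record's size when either side may lack layout data."""
    if old_bits is None and new_bits is None:
        return "skipped"        # unknown vs unknown: nothing to compare
    if old_bits is None or new_bits is None:
        return "inconclusive"   # assumed outcome when only one side has layout
    return "unchanged" if old_bits == new_bits else "size_changed"


# Two clang-derived records carry no layout, so the check is skipped
# and layout remains a question for DWARF (L1).
clang_vs_clang = compare_size_bits(None, None)
```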
Next steps
Deepen layout parity (a -fdump-record-layouts pass to recover
size_bits/offset_bits/vtable order) and grow the clang↔castxml
snapshot-equivalence check (test_clang_header_backend_integration.py) into a
gate over the full example suite.