Skip to content

ADR-015: Snapshot Serialization and Schema Versioning

Date: 2026-03-18 Status: Accepted — implemented Decision maker: Nikolay Petrov


Context

abicheck dump produces a JSON snapshot file (.abi.json) that captures the complete ABI surface of a library. These snapshots serve multiple purposes:

  • Offline comparison: abicheck compare old.abi.json new.abi.json without needing the original binaries or headers
  • Baseline storage: Check snapshots into version control as ABI baselines
  • Cross-mode comparison: A DWARF-derived snapshot can be compared against a castxml-derived snapshot (ADR-003)
  • CI caching: Generate once, compare many times

The snapshot format is a user-facing contract. Changes to the format can break stored baselines and downstream tooling.


Decision

1. AbiSnapshot as the canonical interchange model

All pipeline stages — dumper, checker, reporter — operate on the same AbiSnapshot dataclass. Serialization converts this dataclass to/from JSON.

@dataclass
class AbiSnapshot:
    library: str
    version: str
    functions: list[Function]
    variables: list[Variable]
    types: list[RecordType]
    enums: list[EnumType]
    typedefs: dict[str, str]        # default: {}
    constants: dict[str, str]       # default: {} (populated from header #defines)
    elf: ElfMetadata | None
    pe: PeMetadata | None
    macho: MachoMetadata | None
    dwarf: DwarfMetadata | None
    dwarf_advanced: AdvancedDwarfMetadata | None
    platform: str | None         # "elf" | "pe" | "macho"
    language_profile: str | None # "c" | "cpp" | "sycl"
    elf_only_mode: bool
    dependency_info: DependencyInfo | None

2. Integer schema versioning

SCHEMA_VERSION: int = 6

Version history:

Version Change PR
1 Initial format (no schema_version field)
2 schema_version field added PR #89
3 pe and macho metadata fields added (multi-format support)
4 Provenance metadata (git_commit, git_tag, created_at, build_id)
5 build_mode capture (compiler/stdlib/std normalization)
6 Declaration provenance: source_header + origin on functions/variables/types/enums

Integer versioning was chosen over semver because:

  • Snapshot format changes are always backward-incompatible (new fields change the meaning of existing data)
  • There is no concept of "minor" or "patch" format changes — either the schema is compatible or it isn't
  • Monotonic integers are simpler to compare (if version < 3: migrate(...))

3. Backward compatibility rules

Reading old snapshots: Snapshots without a schema_version field are treated as v1. The deserializer handles missing fields by using dataclass defaults (empty lists, None values).

Reading future snapshots: If schema_version > SCHEMA_VERSION, emit a warning suggesting the user upgrade abicheck. The deserializer attempts to read the snapshot anyway — forward compatibility is best-effort.

Writing: Always writes the current SCHEMA_VERSION. There is no option to write in an older format.

3a. Declaration provenance fields (v6)

Function, Variable, RecordType, and EnumType each carry two provenance fields:

  • source_header — the defining header path, derived from the existing source_location with any trailing :line / :line:col stripped. Always populated when a source location is available; it is descriptive metadata.
  • origin — a ScopeOrigin classification of source_header against the user-provided public-header set. This is the Origin axis of ADR-024's two-axis Linkage × Origin surface model:
Value Meaning
public_header Header matches a --public-header / --public-header-dir input
private_header A project header outside the public set
system_header A toolchain/system header (/usr/include, MSVC, Xcode SDK, …)
generated A machine-generated header (moc_*, *.pb.h, generated/, …)
export_only Exported by the binary but absent from any header (no provenance)
unknown No public set was provided, or no source location was available

Classification is opt-in (decision D4): without --public-header / --public-header-dir, every origin is unknown and downstream behaviour is unchanged. Matching is done on path segments (suffix / basename / directory containment) so absolute build-tree prefixes that never appear on the command line (e.g. /build/abc/src/include/api.h) still resolve against include/api.h (decision D3).

4. Serialization mechanics

Serialization (snapshot_to_dict()): 1. dataclasses.asdict() converts the snapshot tree to a plain dict 2. _sets_to_lists() recursively converts sets to sorted lists (JSON has no set type) 3. Enum values are converted to their string representation 4. Internal cache fields (_func_by_mangled, _var_by_mangled, _type_by_name) are reset to None before serialization 5. schema_version is embedded at the top level

Deserialization (snapshot_from_dict()): 1. Inspect schema_version (default to 1 if absent) 2. Reconstruct typed objects: Function, Variable, RecordType, EnumType, etc. 3. Reconstruct enum instances (SymbolBinding, SymbolType, Visibility, etc.) from string values 4. Platform-specific metadata reconstructed via _elf_from_dict(), _pe_from_dict(), _macho_from_dict(), _dwarf_from_dict(), _dwarf_advanced_from_dict()

5. JSON determinism

To ensure reproducible snapshots (important for diffing baselines in version control):

  • Sets are converted to sorted lists
  • Dict keys are naturally ordered by json.dumps(sort_keys=True)
  • Floating-point values are avoided in the schema

6. Cross-mode snapshot equivalence

A snapshot produced from DWARF data (--dwarf-only) and a snapshot produced from castxml headers produce the same JSON schema. The checker.compare() function treats them identically. This enables:

# Generate snapshots from different sources
abicheck dump lib.so --dwarf-only > dwarf.abi.json
abicheck dump lib.so -H include/  > ast.abi.json

# Cross-compare works
abicheck compare dwarf.abi.json ast.abi.json

Fields that only one source can populate differ in their empty representation based on their type:

  • Optional fields (T | None, e.g., elf, pe, macho, dwarf_advanced): null in JSON when the source doesn't provide them
  • Collection fields (dict / list, e.g., constants, typedefs, functions): empty {} or [] when the source doesn't populate them

For example, a DWARF-only snapshot has constants: {} (empty dict — no header parsing to extract #define values), pe: null (wrong platform), and dwarf_advanced populated with DWARF-specific data. Consumers should handle both null and empty-collection cases.


Consequences

Positive

  • Offline comparison without original binaries or headers
  • Baselines can be checked into version control
  • Cross-mode comparison (DWARF vs castxml) works transparently
  • Deterministic JSON enables meaningful diffs of snapshot files
  • Simple integer versioning avoids semver complexity

Negative

  • Schema version bumps break stored baselines (users must regenerate)
  • Forward compatibility is best-effort — new fields may be silently ignored
  • dataclasses.asdict() with post-processing is slower than custom serialization (acceptable for file sizes in practice)
  • No compression — snapshots for large libraries can be several MB

References

  • abicheck/serialization.pySCHEMA_VERSION, snapshot_to_dict(), snapshot_from_dict()
  • abicheck/model.pyAbiSnapshot dataclass
  • ADR-003 — Data source architecture (DWARF vs castxml snapshot equivalence)