Architecture

A Declarative, Content-Driven Architecture for CCDA to OMOP Transformation: Design Rationale and Advantages over Traditional ETL Pipelines

Abstract

Consolidated Clinical Document Architecture (CCDA) files are widely used for electronic health record exchange, but their hierarchical and variable structure makes them hard to transform into relational formats such as the Observational Medical Outcomes Partnership (OMOP) Common Data Model. Traditional ETL pipelines rely on sequential SQL transformations and staging tables, which often obscure provenance and complicate iteration over heterogeneous data. We describe a lightweight, Python-based, rule-driven engine that uses parse configurations, content-driven extraction, fallback logic, transformation rules, and content-based hashing to emit OMOP rows. Compared with conventional pipelines, this architecture simplifies maintenance, accelerates iteration, and improves robustness while preserving hierarchical context.

1. Introduction

Mapping CCDA documents to OMOP tables requires handling nested structures, optional elements, multiple template variations, and vendor-specific quirks. Traditional pipeline approaches, dominated by SQL and batch transformations, are familiar to data analysts but often introduce complexity, flatten hierarchical data prematurely, and slow iterative development. In contrast, software engineering principles suggest separating engine logic from data-driven mapping specifications to create maintainable, flexible systems.

This paper presents a declarative, lightweight engine for CCDA parsing, highlights its design rationale, and contrasts it with conventional ETL pipeline approaches.

2. System Architecture

The architecture is composed of three core layers: mapping configurations, a parsing engine, and row emission with linking.

2.1 Mapping Configurations

Each OMOP domain (e.g., condition_occurrence, drug_exposure) has one or more parse configurations. Each parse configuration defines:

CCDA sections to parse
Locales or structural variants within sections
XPath extraction rules for each locale
Fallback options (coalescing) for optional or inconsistent elements
Transformation rules, such as date normalization or terminology mapping

This declarative specification captures the mapping logic in a compact, readable, and maintainable format.
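To make the shape of such a specification concrete, the following is a minimal sketch of one parse configuration written as plain Python data. The key names (sections, locales, root, fields, sources, transform), the transform names, and the XPath strings are illustrative assumptions, not the engine's actual schema. Each field lists its fallback sources in priority order as (relative path, attribute) pairs.

```python
# Hypothetical parse configuration for one OMOP domain (all key names are assumed).
CONDITION_CONFIG = {
    "domain": "condition_occurrence",
    "sections": [
        {
            "section_code": "11450-4",  # LOINC code of the problem list section
            "locales": [
                {
                    "name": "problem_observation",
                    # XPath, relative to the section, selecting one node per output row.
                    "root": ".//hl7:observation",
                    "fields": {
                        "condition_source_value": {
                            # Fallbacks, tried in order: value/@code, then code/@code.
                            "sources": [("hl7:value", "code"), ("hl7:code", "code")],
                            "transform": None,
                        },
                        "condition_start_date": {
                            "sources": [
                                ("hl7:effectiveTime/hl7:low", "value"),
                                ("hl7:effectiveTime", "value"),
                            ],
                            "transform": "hl7_date",
                        },
                    },
                },
            ],
        },
    ],
}

# Transform names a hypothetical engine would recognize.
KNOWN_TRANSFORMS = {None, "hl7_date", "concept_lookup"}


def validate_config(config):
    """Return a list of problems found in a parse configuration (empty if none)."""
    problems = []
    for section in config["sections"]:
        for locale in section["locales"]:
            for name, rule in locale["fields"].items():
                if not rule["sources"]:
                    problems.append(f"{name}: no extraction sources")
                if rule.get("transform") not in KNOWN_TRANSFORMS:
                    problems.append(f"{name}: unknown transform {rule['transform']!r}")
    return problems


print(validate_config(CONDITION_CONFIG))
```

Because the mapping is ordinary data, it can be checked before any document is parsed, and adding a vendor-specific variant amounts to adding another locale entry rather than editing code.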

2.2 Parsing Engine

The engine iterates over sections and locales in the CCDA document:

Executes XPath queries for each rule
Applies fallback logic to select the first non-null value
Applies transformations, including date normalization and code translation
Emits rows for the target OMOP domain
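The steps above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not the engine itself: the locale layout, the function names (hl7_date, first_non_null, parse_locale), and the sample document are all hypothetical, and the sample omits most of a real CCDA document's structure. Because xml.etree.ElementTree supports only a subset of XPath, each fallback source is given as an element path plus an attribute name.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

NS = {"hl7": "urn:hl7-org:v3"}  # the CDA XML namespace


def hl7_date(value):
    """Normalize an HL7 timestamp to ISO YYYY-MM-DD.

    Assumes at least day precision (YYYYMMDD...); partial dates are not handled.
    """
    return datetime.strptime(value[:8], "%Y%m%d").date().isoformat()


TRANSFORMS = {"hl7_date": hl7_date}


def first_non_null(node, sources):
    """Fallback logic: return the first source that yields a non-empty value."""
    for path, attr in sources:
        found = node.find(path, NS)
        if found is None:
            continue
        value = found.get(attr) if attr else (found.text or "").strip()
        if value:
            return value
    return None


def parse_locale(root, locale):
    """Yield one row dict per matching node, with fallbacks and transforms applied."""
    for node in root.iterfind(locale["root"], NS):
        row = {}
        for name, rule in locale["fields"].items():
            value = first_non_null(node, rule["sources"])
            transform = TRANSFORMS.get(rule.get("transform"))
            row[name] = transform(value) if (value is not None and transform) else value
        yield row


# Hypothetical locale: where to find rows and how to fill each field.
LOCALE = {
    "root": ".//hl7:observation",
    "fields": {
        "condition_source_value": {
            "sources": [("hl7:value", "code"), ("hl7:code", "code")],
        },
        "condition_start_date": {
            "sources": [("hl7:effectiveTime/hl7:low", "value"),
                        ("hl7:effectiveTime", "value")],
            "transform": "hl7_date",
        },
    },
}

# Simplified sample document with two structural variants of the same entry.
DOC = """
<ClinicalDocument xmlns="urn:hl7-org:v3">
  <component><section>
    <entry><observation>
      <code code="55607006"/>
      <value code="38341003"/>
      <effectiveTime><low value="20230115103000"/></effectiveTime>
    </observation></entry>
    <entry><observation>
      <code code="44054006"/>
      <effectiveTime value="20221201"/>
    </observation></entry>
  </section></component>
</ClinicalDocument>
"""

rows = list(parse_locale(ET.fromstring(DOC), LOCALE))
for row in rows:
    print(row)
```

The second observation has no value element and no effectiveTime/low, so both of its fields are filled from the second fallback source. The engine code does not change when a new variant appears; only the list of sources does.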

2.3 Row Emission and Linking

To maintain referential integrity without requiring intermediate tables or temporary IDs, the engine uses content-based hashing:

Each row’s primary key is derived from a hash of its content
Foreign keys are computed by hashing the same content that produces the referenced row’s primary key
This ensures reproducibility and deterministic linkage across multiple locales in a document
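A minimal sketch of content-based keys follows. The function name content_key, the choice of SHA-256, the JSON serialization, and the truncation to 63 bits (so that the key fits a signed 64-bit integer column) are assumptions made for illustration; the source does not specify the hash or the key width, and truncating a digest raises the collision probability relative to the full hash.

```python
import hashlib
import json


def content_key(domain, fields, bits=63):
    """Derive a deterministic integer key from a row's identifying content."""
    # Sorted keys and fixed separators make the serialization independent of dict order.
    payload = json.dumps({"domain": domain, "fields": fields},
                         sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(payload.encode("utf-8")).digest()
    # Keep the top `bits` bits of the first 8 bytes of the digest.
    return int.from_bytes(digest[:8], "big") >> (64 - bits)


# Primary key of the referenced row, computed from its content (sample values).
person = {"person_source_value": "MRN-001", "birth_date": "1970-02-03"}
person_id = content_key("person", person)

# A row emitted elsewhere in the document recomputes the same person_id as its
# foreign key, so no lookup table or temporary ID is needed to link the two.
condition = {
    "person_id": content_key("person", person),
    "condition_source_value": "38341003",
    "condition_start_date": "2023-01-15",
}
condition_occurrence_id = content_key("condition_occurrence", condition)

print(person_id, condition_occurrence_id)
```

One consequence of this design is that two rows with identical content receive the same key, so which fields count as identifying content is itself a mapping decision.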

3. Advantages Over Traditional SQL Pipeline Thinking

Aspect	Traditional ETL Pipeline	Declarative Parsing Engine
Visibility	Logic spread across multiple staging tables	Mapping rules kept separate from engine code and readable on their own
Iteration Speed	Slow; must rebuild or rerun batches	Fast; single-document execution and debugging
Hierarchical Context	Flattened early; often loses provenance	Maintains CCDA structure until row emission
Handling Variability	Complex joins and CASE statements	Coalesce/fallback logic declaratively handles multiple templates
Maintenance	Changes require procedural edits	Changes often involve configuration edits only
Scalability	Batch-oriented; table-first	Document-at-a-time; parallelizable without changing engine
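The scalability row can be illustrated with a short sketch. Here process_document is a hypothetical stand-in for the engine's per-document entry point, and the thread pool is one possible choice; for CPU-bound XML parsing a process pool may be the better fit. The point is only that a function which takes one document and shares no state can be mapped over a batch without modification.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

NS = {"hl7": "urn:hl7-org:v3"}


def process_document(xml_text):
    """Stand-in for the engine: parse one document and return its rows.

    It reads no shared state, so documents can be processed independently.
    """
    root = ET.fromstring(xml_text)
    return [{"condition_source_value": node.get("code")}
            for node in root.iterfind(".//hl7:observation/hl7:value", NS)]


def process_batch(documents, max_workers=4):
    """Map the unchanged per-document function over many documents."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order, whatever order the tasks finish in.
        return list(pool.map(process_document, documents))


def make_doc(code):
    """Build a tiny sample document containing a single coded observation."""
    return (f'<ClinicalDocument xmlns="urn:hl7-org:v3"><observation>'
            f'<value code="{code}"/></observation></ClinicalDocument>')


batch = process_batch([make_doc("38341003"), make_doc("44054006")])
print(batch)
```

The same process_document call serves single-document debugging and batch runs, which is what keeps iteration fast.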

4. Discussion

This architecture embodies several software engineering principles:

Data-driven design: rules and mappings encode variation, minimizing repeated code
Declarative mapping: separates “what to extract” from “how to extract it”
Iteration-friendly: enables rapid testing and debugging on single documents
Content-addressed linking: avoids fragile IDs and maintains integrity
Transparency: a compact configuration shows the full mapping at a glance, with no need to inspect engine internals

While SQL pipelines are familiar to analysts and align with warehouse practices, they can obscure complexity, particularly when input data is hierarchical and heterogeneous. The engine approach better matches the document structure of CCDA and allows for incremental adaptation to vendor differences.

5. Conclusion

The CCDA → OMOP mapping engine provides a lightweight, declarative framework that balances maintainability, robustness, and speed of iteration. Unlike traditional SQL pipelines, it preserves hierarchical context until row emission, centralizes mapping rules in configuration, and handles variable document structures through declarative fallback logic. Its content-based primary and foreign keys yield deterministic row relationships without staging tables or arbitrary IDs. The same pattern should carry over to other clinical ETL work and offers a template for other hierarchical-to-relational transformations in healthcare informatics.

References

Pike, R. Notes on Programming in C (on emphasizing data structures over repeated code patterns).
HL7 Consolidated Clinical Document Architecture (CCDA) Standard. Health Level Seven International, 2015.
Observational Medical Outcomes Partnership (OMOP) Common Data Model. OHDSI, 2021.
Inmon, W. H. Building the Data Warehouse, 4th Edition. Wiley, 2005.
Kimball, R., Ross, M. The Data Warehouse Toolkit, 3rd Edition. Wiley, 2013.
Cormen, T., Leiserson, C., Rivest, R., Stein, C. Introduction to Algorithms, 4th Edition. MIT Press, 2022 (for graph and hashing principles).