Architecture

A Declarative, Content-Driven Architecture for CCDA to OMOP Transformation: Design Rationale and Advantages over Traditional ETL Pipelines

Abstract

Clinical Document Architecture (CCDA) files are widely used for electronic health record exchange, yet their hierarchical and variable structure presents challenges for transformation into relational formats such as the Observational Medical Outcomes Partnership (OMOP) Common Data Model. Traditional ETL pipelines rely on sequential SQL transformations and table staging, which often obscure provenance and complicate iteration over heterogeneous data. We describe a lightweight, Python-based, rule-driven engine that leverages parse configurations, content-driven extraction, fallback logic, transformation rules, and content-based hashing to emit OMOP rows. This architecture simplifies maintenance, accelerates iteration, and improves robustness while preserving hierarchical context, in contrast to conventional pipelines.

1. Introduction

Mapping CCDA documents to OMOP tables requires handling nested structures, optional elements, multiple template variations, and vendor-specific quirks. Traditional pipeline approaches, dominated by SQL and batch transformations, are familiar to data analysts but often introduce complexity, flatten hierarchical data prematurely, and slow iterative development. In contrast, software engineering principles suggest separating engine logic from data-driven mapping specifications to create maintainable, flexible systems.

This paper presents a declarative, lightweight engine for CCDA parsing, highlights its design rationale, and contrasts it with conventional ETL pipeline approaches.

2. System Architecture

The architecture is composed of three core layers.

2.1 Mapping Configurations

Each OMOP domain (e.g., condition_occurrence, drug_exposure) has one or more parse configurations. Each parse configuration defines:

  • CCDA sections to parse

  • Locales or structural variants within sections

  • XPath extraction rules for each locale

  • Fallback options (coalescing) for optional or inconsistent elements

  • Transformation rules, such as date normalization or terminology mapping

This declarative specification captures the mapping logic in a compact, readable, and maintainable format.

2.2 Parsing Engine

The engine iterates over sections and locales in the CCDA document:

  1. Executes XPath queries for each rule

  2. Applies fallback logic to select the first non-null value

  3. Applies transformations, including date normalization and code translation

  4. Emits rows for the target OMOP domain

2.3 Row Emission and Linking

To maintain referential integrity without requiring intermediate tables or temporary IDs, the engine uses content-based hashing:

  • Each row’s primary key is derived from a hash of its content

  • Foreign keys are derived consistently from related content hashes

  • This ensures reproducibility and deterministic linkage across multiple locales in a document

3. Advantages Over Traditional SQL Pipeline Thinking

Aspect

Traditional ETL Pipeline

Declarative Parsing Engine

Visibility

Logic spread across multiple staging tables

Mapping rules and engine clearly separate; mappings readable

Iteration Speed

Slow; must rebuild or rerun batches

Fast; single-document execution and debugging

Hierarchical Context

Flattened early; often loses provenance

Maintains CCDA structure until row emission

Handling Variability

Complex joins and CASE statements

Coalesce/fallback logic declaratively handles multiple templates

Maintenance

Changes require procedural edits

Changes often involve configuration edits only

Scalability

Batch-oriented; table-first

Document-at-a-time; parallelizable without changing engine

4. Discussion

This architecture embodies several software engineering principles:

  1. Data-driven design: rules and mappings encode variation, minimizing repeated code

  2. Declarative mapping: separates “what to extract” from “how to extract it”

  3. Iteration-friendly: enables rapid testing and debugging on single documents

  4. Content-addressed linking: avoids fragile IDs and maintains integrity

  5. Transparency: dense configuration provides an immediate view of the mapping without inspecting engine internals

While SQL pipelines are familiar to analysts and align with warehouse practices, they can obscure complexity, particularly when input data is hierarchical and heterogeneous. The engine approach better matches the document structure of CCDA and allows for incremental adaptation to vendor differences.

5. Conclusion

The CCDA → OMOP mapping engine provides a lightweight, declarative framework that balances maintainability, robustness, and speed of iteration. It contrasts with traditional SQL pipeline thinking by preserving hierarchical context, centralizing mapping rules, and handling variable document structures elegantly. Its content-driven PK/FK linking ensures deterministic row relationships without requiring staging tables or arbitrary IDs. This architecture is broadly applicable to clinical ETL work and provides a template for other hierarchical-to-relational transformations in healthcare informatics.

References

  1. Rob Pike. Notes on Programming in Go — on emphasizing data structures over repeated code patterns.

  2. HL7 Clinical Document Architecture (CCDA) Standard. Health Level Seven International, 2015.

  3. Observational Medical Outcomes Partnership (OMOP) Common Data Model. OHDSI, 2021.

  4. Inmon, W. H. Building the Data Warehouse, 4th Edition. Wiley, 2005.

  5. Kimball, R., Ross, M. The Data Warehouse Toolkit, 3rd Edition. Wiley, 2013.

  6. Cormen, T., Leiserson, C., Rivest, R., Stein, C. Introduction to Algorithms, 4th Edition. MIT Press, 2022 — for graph and hashing principles.