Architecture
============

*A Declarative, Content-Driven Architecture for CCDA to OMOP Transformation: Design Rationale and Advantages over Traditional ETL Pipelines*

Abstract
--------

Consolidated Clinical Document Architecture (CCDA) files are widely used for electronic health record exchange, yet their hierarchical and variable structure presents challenges for transformation into relational formats such as the Observational Medical Outcomes Partnership (OMOP) Common Data Model. Traditional ETL pipelines rely on sequential SQL transformations and table staging, which often obscure provenance and complicate iteration over heterogeneous data. We describe a lightweight, Python-based, rule-driven engine that leverages parse configurations, content-driven extraction, fallback logic, transformation rules, and content-based hashing to emit OMOP rows. Compared with conventional pipelines, this architecture simplifies maintenance, accelerates iteration, and improves robustness while preserving hierarchical context.

1. Introduction
---------------

Mapping CCDA documents to OMOP tables requires handling nested structures, optional elements, multiple template variations, and vendor-specific quirks. Traditional pipeline approaches, dominated by SQL and batch transformations, are familiar to data analysts but often introduce complexity, flatten hierarchical data prematurely, and slow iterative development. In contrast, software engineering principles suggest separating engine logic from data-driven mapping specifications to create maintainable, flexible systems. This paper presents a declarative, lightweight engine for CCDA parsing, highlights its design rationale, and contrasts it with conventional ETL pipeline approaches.

2. System Architecture
----------------------

The architecture is composed of three core layers.

2.1 Mapping Configurations
~~~~~~~~~~~~~~~~~~~~~~~~~~

Each OMOP domain (e.g., ``condition_occurrence``, ``drug_exposure``) has one or more parse configurations.
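
For illustration only, a parse configuration and a minimal engine loop might look like the sketch below. The dictionary layout, the field names, the ``coalesce`` and ``parse_domain`` helpers, and the sample document are all assumptions made for this example, not the project's actual format. It uses the standard library's ``xml.etree.ElementTree``, which supports only a subset of XPath, so each rule is written as a (path, attribute) pair.

```python
import xml.etree.ElementTree as ET

NS = {"hl7": "urn:hl7-org:v3"}

# Hypothetical parse configuration for one OMOP domain. Each field lists
# (path, attribute) candidates in priority order; the first hit wins.
CONDITION_CONFIG = {
    "domain": "condition_occurrence",
    "root": ".//hl7:section/hl7:entry/hl7:observation",
    "fields": {
        "condition_source_value": [
            ("hl7:value", "code"),
            ("hl7:code", "code"),
        ],
        "condition_start_date": [
            ("hl7:effectiveTime/hl7:low", "value"),
            ("hl7:effectiveTime", "value"),
        ],
    },
    "transforms": {"condition_start_date": "hl7_date"},
}


def normalize_hl7_date(raw):
    """Turn an HL7 timestamp such as 20230415103000 into 2023-04-15."""
    return f"{raw[0:4]}-{raw[4:6]}-{raw[6:8]}"


TRANSFORMS = {"hl7_date": normalize_hl7_date}


def coalesce(node, candidates):
    """Return the first non-empty attribute value among the candidates."""
    for path, attribute in candidates:
        found = node.find(path, NS)
        if found is not None and found.get(attribute):
            return found.get(attribute)
    return None


def parse_domain(xml_text, config):
    """Apply one parse configuration to a document and return row dicts."""
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.findall(config["root"], NS):
        row = {}
        for field, candidates in config["fields"].items():
            value = coalesce(node, candidates)
            transform = config["transforms"].get(field)
            if value is not None and transform:
                value = TRANSFORMS[transform](value)
            row[field] = value
        rows.append(row)
    return rows


# Made-up sample: the second entry lacks <value> and <low>, so the
# fallback candidates are used for both fields.
SAMPLE = """<ClinicalDocument xmlns="urn:hl7-org:v3">
  <component><section>
    <entry><observation>
      <code code="55607006"/>
      <value code="38341003"/>
      <effectiveTime><low value="20230415103000"/></effectiveTime>
    </observation></entry>
    <entry><observation>
      <code code="195967001"/>
      <effectiveTime value="20220101"/>
    </observation></entry>
  </section></component>
</ClinicalDocument>"""

rows = parse_domain(SAMPLE, CONDITION_CONFIG)
```

In this sketch, supporting a new template variant would mean adding another candidate to a field's list rather than editing ``parse_domain``, which is the separation between mapping and engine that the architecture aims for.
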
Each parse configuration defines:

- CCDA sections to parse
- Locales or structural variants within sections
- XPath extraction rules for each locale
- Fallback options (coalescing) for optional or inconsistent elements
- Transformation rules, such as date normalization or terminology mapping

This declarative specification captures the mapping logic in a compact, readable, and maintainable format.

2.2 Parsing Engine
~~~~~~~~~~~~~~~~~~

The engine iterates over sections and locales in the CCDA document:

1. Executes XPath queries for each rule
2. Applies fallback logic to select the first non-null value
3. Applies transformations, including date normalization and code translation
4. Emits rows for the target OMOP domain

2.3 Row Emission and Linking
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To maintain referential integrity without requiring intermediate tables or temporary IDs, the engine uses content-based hashing:

- Each row's primary key is derived from a hash of its content
- Foreign keys are derived consistently from related content hashes
- This ensures reproducibility and deterministic linkage across multiple locales in a document

3. Advantages Over Traditional SQL Pipeline Thinking
-----------------------------------------------------

.. list-table::
   :header-rows: 1
   :widths: 20 40 40

   * - Aspect
     - Traditional ETL Pipeline
     - Declarative Parsing Engine
   * - Visibility
     - Logic spread across multiple staging tables
     - Mapping rules and engine clearly separate; mappings readable
   * - Iteration Speed
     - Slow; must rebuild or rerun batches
     - Fast; single-document execution and debugging
   * - Hierarchical Context
     - Flattened early; often loses provenance
     - Maintains CCDA structure until row emission
   * - Handling Variability
     - Complex joins and CASE statements
     - Coalesce/fallback logic declaratively handles multiple templates
   * - Maintenance
     - Changes require procedural edits
     - Changes often involve configuration edits only
   * - Scalability
     - Batch-oriented; table-first
     - Document-at-a-time; parallelizable without changing engine

4. Discussion
-------------

This architecture embodies several software engineering principles:

1. **Data-driven design**: rules and mappings encode variation, minimizing repeated code
2. **Declarative mapping**: separates "what to extract" from "how to extract it"
3. **Iteration-friendly**: enables rapid testing and debugging on single documents
4. **Content-addressed linking**: avoids fragile IDs and maintains integrity
5. **Transparency**: dense configuration provides an immediate view of the mapping without inspecting engine internals

While SQL pipelines are familiar to analysts and align with warehouse practices, they can obscure complexity, particularly when input data is hierarchical and heterogeneous. The engine approach better matches the document structure of CCDA and allows for incremental adaptation to vendor differences.

5. Conclusion
-------------

The CCDA → OMOP mapping engine provides a lightweight, declarative framework that balances maintainability, robustness, and speed of iteration. It contrasts with traditional SQL pipeline thinking by preserving hierarchical context, centralizing mapping rules, and handling variable document structures elegantly.
Its content-driven PK/FK linking ensures deterministic row relationships without requiring staging tables or arbitrary IDs. This architecture is broadly applicable to clinical ETL work and provides a template for other hierarchical-to-relational transformations in healthcare informatics.

References
----------

1. Pike, R. *Notes on Programming in C* (on emphasizing data structures over repeated code patterns).
2. HL7 Consolidated Clinical Document Architecture (CCDA) Standard. Health Level Seven International, 2015.
3. Observational Medical Outcomes Partnership (OMOP) Common Data Model. OHDSI, 2021.
4. Inmon, W. H. *Building the Data Warehouse*, 4th Edition. Wiley, 2005.
5. Kimball, R., Ross, M. *The Data Warehouse Toolkit*, 3rd Edition. Wiley, 2013.
6. Cormen, T., Leiserson, C., Rivest, R., Stein, C. *Introduction to Algorithms*, 4th Edition. MIT Press, 2022 (for graph and hashing principles).
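
Appendix: Content-Based Hashing Sketch
--------------------------------------

The content-based linking described in Section 2.3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the engine's actual scheme: the choice of SHA-256, the truncation to 63 bits (so the ID fits a signed 64-bit integer column), the key field lists, and the ``content_hash_id`` and ``emit_condition`` helpers are all hypothetical.

```python
import hashlib
import json


def content_hash_id(row, key_fields):
    """Derive a deterministic 63-bit integer ID from selected row fields."""
    # sort_keys makes the serialized form independent of dict ordering.
    payload = json.dumps({f: row.get(f) for f in key_fields}, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).digest()
    # Take the first 8 bytes and drop one bit to stay within signed 64-bit range.
    return int.from_bytes(digest[:8], "big") >> 1


# Hypothetical choice of identifying fields for each table.
PERSON_KEY = ["person_source_value", "birth_date"]
CONDITION_KEY = ["person_id", "condition_source_value", "condition_start_date"]


def emit_condition(person_fields, source_value, start_date):
    """Build a condition row whose FK is recomputed from person content."""
    row = {
        # The foreign key is derived from the same content as the person
        # row's primary key, so no lookup table or temporary ID is needed.
        "person_id": content_hash_id(person_fields, PERSON_KEY),
        "condition_source_value": source_value,
        "condition_start_date": start_date,
    }
    row["condition_occurrence_id"] = content_hash_id(row, CONDITION_KEY)
    return row


# Made-up patient data for the example.
person = {"person_source_value": "MRN-001", "birth_date": "1970-02-03"}
person["person_id"] = content_hash_id(person, PERSON_KEY)

first = emit_condition(person, "38341003", "2023-04-15")
again = emit_condition(person, "38341003", "2023-04-15")
other = emit_condition(person, "195967001", "2022-01-01")
```

Here ``first`` and ``again`` receive the same ``condition_occurrence_id`` because their content is identical, and every condition row carries a ``person_id`` equal to the person row's key. One trade-off of truncating the hash is a small, nonzero collision probability, which a real implementation would need to weigh against column width.
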