ccda_to_omop.data_driven_parse module
Table-Driven ElementTree parsing in Python
This version puts the paths into a data structure and explores using one function driven by the data.
- The mapping_dict is hard-coded here. A next step would be to read it in from a file.
- Value transformation is stubbed out, waiting for vocabularies to be loaded and for a decision on how to use them once they are.
Chris Roeder
- Call Graph:
process_file
- parse_doc
  - parse_config_from_xml_file
    - parse_config_for_single_root
      - do_none_fields
      - do_constant_fields
      - do_basic_fields
      - do_derived_fields
      - do_domain_fields
      - do_hash_fields
      - do_priority_fields
Config dictionary structure: dict[str, dict[str, dict[str, str]]]
metadata = {
    config_dict = {
        field_details_dict = {
            attribute: value
        }
    }
}
So there are many config_dicts, each roughly corresponding to a domain. You may have more than one per domain when a domain's data appears in more than one location in the document. Each config_dict is made up of many fields for the OMOP table it creates. There are also non-output fields used as input to derived fields, like the vocabulary and code used to find the concept_id. Each field spec (field_details_dict) has multiple attributes driving that field's retrieval or derivation.
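As a rough illustration of the nesting described above, the sketch below builds a tiny metadata dictionary and drives one generic function from it. The config name, the field names, the function name parse_fields, and the attribute keys `element`, `attribute` and `constant_value` are hypothetical and chosen only for illustration; the real attribute names used by this module may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata: config name -> field name -> attribute -> value.
metadata = {
    "Measurement": {
        "measurement_source_value": {
            "config_type": "FIELD",
            "element": "code",        # assumed key: path below the root element
            "attribute": "code",      # assumed key: XML attribute to read
            "order": 1,
        },
        # Non-output helper field (no 'order'), available to derived fields.
        "vocabulary": {
            "config_type": "FIELD",
            "element": "code",
            "attribute": "codeSystem",
        },
        "measurement_type": {
            "config_type": "CONSTANT",
            "constant_value": "EHR",  # assumed key for the constant's value
            "order": 2,
        },
    }
}


def parse_fields(root_element, config_dict):
    """One function, driven entirely by the field_details_dict entries."""
    output_dict = {}
    for field_name, details in config_dict.items():
        if details["config_type"] == "CONSTANT":
            output_dict[field_name] = details["constant_value"]
        elif details["config_type"] == "FIELD":
            element = root_element.find(details["element"])
            # Compare against None: childless Elements are falsy.
            output_dict[field_name] = (
                element.get(details["attribute"]) if element is not None else None
            )
    return output_dict


doc = ET.fromstring(
    '<observation><code code="1234-5" codeSystem="LOINC"/></observation>'
)
row = parse_fields(doc, metadata["Measurement"])
```

Adding a new output field in this style means adding a dictionary entry rather than writing new parsing code, which is the point of the table-driven approach.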
PK_dict : dict[str, Any]. The key is the field_name and the value is that field's value, which can be a string, int, None, or a list of those.
output_dict : dict[str, Any]
omop_dict : dict[str, list[Any]]. For each config you have a list of records.
XML terms used specifically:
- element is a thing in a document inside angle brackets, like <code code="1234-5" codeSystem="LOINC"/>
- attributes are code and codeSystem in the above example
- text is present when an element has both start and end parts, like <text>foobar</text>. "foobar" is the text in an element that has a tag = 'text'
- tag: see above
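These terms map directly onto the ElementTree API. The short sketch below uses the standard-library xml.etree.ElementTree on a made-up snippet; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

snippet = (
    '<observation>'
    '<code code="1234-5" codeSystem="LOINC"/>'
    '<text>foobar</text>'
    '</observation>'
)
root = ET.fromstring(snippet)

code_element = root.find("code")   # an element
text_element = root.find("text")

code_tag = code_element.tag                 # the tag: "code"
code_attributes = dict(code_element.attrib) # the attributes: code and codeSystem
inner_text = text_element.text              # the text: "foobar"
```

An element written in the self-closing form, like the code element here, has no text, so its .text is None.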
- ccda_to_omop.data_driven_parse.create_hash(input_string) int64 | None[source]
Matches common SQL code when that code also truncates to 13 characters.
SQL: cast(conv(substr(md5(test_string), 1, 15), 16, 10) as bigint) as hashed_value
32 bit
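For comparison, the SQL expression quoted above can be approximated in Python as follows. This is a sketch of that SQL only, not the body of create_hash(); the function name md5_prefix_hash and its hex_chars parameter are hypothetical. It follows the 15-character prefix used in the SQL, and the prefix length can be changed through hex_chars.

```python
import hashlib


def md5_prefix_hash(input_string, hex_chars=15):
    """Hash a string to an integer from a prefix of its MD5 hex digest.

    Mirrors: cast(conv(substr(md5(s), 1, 15), 16, 10) as bigint)
    """
    if input_string is None:
        return None
    digest = hashlib.md5(input_string.encode("utf-8")).hexdigest()
    # 15 hex characters are 60 bits, which fits in a signed 64-bit integer.
    return int(digest[:hex_chars], 16)
```

A 15-character prefix keeps the value below 2**60, so it fits a SQL bigint without overflow.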
- ccda_to_omop.data_driven_parse.create_hash_too_long(input_string)[source]
A 64-bit value is 16 hex characters; the output of this function is much longer.
- ccda_to_omop.data_driven_parse.do_basic_fields(output_dict: dict[str, None | str | float | int | int32 | int64 | datetime | date], root_element, root_path, config_name, config_dict: dict[str, dict[str, str | None]], error_fields_set: set[str], pk_dict: dict[str, list[Any]])[source]
Extract FIELD and PK values from the XML element and write them into output_dict.
PK values are also appended to pk_dict so downstream FK fields can reference them. String values are whitespace-normalized and truncated to the configured max length.
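The string clean-up described above might look like the sketch below. The helper name normalize_and_truncate is hypothetical, and the exact normalization rules and the place where the maximum length is configured are assumptions, since neither is shown here.

```python
def normalize_and_truncate(value, max_length):
    """Collapse runs of whitespace and cut strings to max_length characters."""
    if not isinstance(value, str):
        return value  # non-string values pass through untouched
    normalized = " ".join(value.split())
    return normalized[:max_length]
```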
- ccda_to_omop.data_driven_parse.do_constant_fields(output_dict: dict[str, None | str | float | int | int32 | int64 | datetime | date], root_element, root_path, config_name, config_dict: dict[str, dict[str, str | None]], error_fields_set: set[str])[source]
Write CONSTANT fields from config into output_dict, truncating strings to the allowed length.
- ccda_to_omop.data_driven_parse.do_derived2_fields(output_dict: dict[str, None | str | float | int | int32 | int64 | datetime | date], root_element, root_path, config_name, config_dict: dict[str, dict[str, str | None | list]], error_fields_set: set[str])[source]
Compute DERIVED2 fields using functions that receive the full output_dict and argument_list.
Unlike DERIVED, the called function is responsible for fetching its own values from output_dict using the key_list in argument_list, allowing a variable number of inputs.
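A DERIVED2-style function could be sketched as below. The function concat_fields, the `FUNCTION` key, and the assumption that argument_list is a dictionary holding a `key_list` entry are all hypothetical; only the idea that the function pulls its own inputs from output_dict comes from the description above.

```python
def concat_fields(output_dict, argument_list):
    """Join the non-None values of the fields named in key_list."""
    key_list = argument_list["key_list"]
    parts = [
        str(output_dict[key])
        for key in key_list
        if output_dict.get(key) is not None
    ]
    return "|".join(parts)


# Hypothetical field_details_dict for a DERIVED2 field.
field_details = {
    "config_type": "DERIVED2",
    "FUNCTION": concat_fields,
    "argument_list": {"key_list": ["vocabulary", "code", "missing_field"]},
}

output_dict = {"vocabulary": "LOINC", "code": "1234-5"}
combined = field_details["FUNCTION"](output_dict, field_details["argument_list"])
```

Because the function reads key_list itself, the same function can serve configurations with two inputs or ten without any change to the caller.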
- ccda_to_omop.data_driven_parse.do_derived_fields(output_dict: dict[str, None | str | float | int | int32 | int64 | datetime | date], root_element, root_path, config_name, config_dict: dict[str, dict[str, str | None]], error_fields_set: set[str], pk_dict: dict[str, list[Any]])[source]
Compute derived values now that their inputs should be available in the output_dict. Each argument's value names the field to look up in the output_dict, except for a special argument named 'default', whose value is used as-is.
This set-up is for functions that expect explicit named arguments. The code here adds values for those arguments to the dictionary passed to the function. It's tempting to want to pass a list of arguments, but that's not how this function works.
A derived value is also treated as a PK.
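The argument handling described above can be sketched as follows. The keys `FUNCTION` and `argument_names`, the helper compute_derived, the lookup function, and the concept id 12345 are all made up for illustration; the real configuration keys may be named differently.

```python
def lookup_concept_id(args_dict):
    """Toy lookup that receives its named arguments in one dictionary."""
    table = {("LOINC", "1234-5"): 12345}  # made-up concept id
    key = (args_dict["vocabulary_oid"], args_dict["concept_code"])
    return table.get(key, args_dict["default"])


# Hypothetical field_details_dict for a DERIVED field.
field_details = {
    "config_type": "DERIVED",
    "FUNCTION": lookup_concept_id,
    "argument_names": {
        "vocabulary_oid": "vocabulary",  # name of a field in output_dict
        "concept_code": "code",          # name of a field in output_dict
        "default": 0,                    # used literally, not looked up
    },
}


def compute_derived(output_dict, details):
    """Resolve each named argument, then call the configured function."""
    args_dict = {}
    for arg_name, source in details["argument_names"].items():
        if arg_name == "default":
            args_dict[arg_name] = source
        else:
            args_dict[arg_name] = output_dict.get(source)
    return details["FUNCTION"](args_dict)
```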
- ccda_to_omop.data_driven_parse.do_filename_fields(output_dict: dict[str, None | str | float | int | int32 | int64 | datetime | date], root_element, root_path, config_name, config_dict: dict[str, dict[str, str | None]], error_fields_set: set[str], filename: str)[source]
Write the source filename into any FILENAME-typed fields in output_dict.
- ccda_to_omop.data_driven_parse.do_foreign_key_fields(output_dict: dict[str, None | str | float | int | int32 | int64 | datetime | date], root_element, root_path, config_name, config_dict: dict[str, dict[str, str | None]], error_fields_set: set[str], pk_dict: dict[str, list[Any]])[source]
When a configuration has an FK field, it uses the tag in that configuration to find corresponding values from PK fields. This mechanism is intended for PKs uniquely identified in a CCDA document header for any places in the sections it would be used as an FK. This is typically true for person_id and visit_occurrence_id, but there are exceptions. In particular, some documents have multiple encounters, so you can’t just naively choose the only visit_id because there are many.
Choosing the visit is more complicated because it requires a join (on date ranges) between the domain table and the encounters table, or the portion of the header that has encompassingEncounters in it. The do_foreign_key_fields() function operates in too narrow a context for that join. These functions are scoped down to processing a single config entry for a particular OMOP domain. The output_dict parameter is just for that one domain, so it wouldn't include the encounters. For example, the measurement_results.py file has a configuration for parsing OMOP measurement rows out of an XML file. The visit.py configuration would have been processed previously and its rows stashed away elsewhere in the parse_doc() function, whose scope is large enough to consider all the configurations. So the visit choice/reconciliation must happen from there.
TL;DR: not all foreign keys are resolved here. In particular, domain FK references to visit_occurrence_id, in cases where more than one encounter has previously been parsed, are not and cannot be resolved here. See the parse_doc() function for how they are handled there.
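To make the date-range join concrete, here is a minimal sketch of matching one domain row to one of several visits. It is not the logic in parse_doc(); the function name choose_visit_id, the row layout, and the rule of returning None when the match is missing or ambiguous are assumptions.

```python
from datetime import date


def choose_visit_id(event_date, visits):
    """Return the id of the single visit whose date range contains event_date."""
    matches = [
        visit["visit_occurrence_id"]
        for visit in visits
        if visit["visit_start_date"] <= event_date <= visit["visit_end_date"]
    ]
    # Assumed policy: leave the FK unresolved unless exactly one visit matches.
    return matches[0] if len(matches) == 1 else None


visits = [
    {
        "visit_occurrence_id": 1,
        "visit_start_date": date(2020, 1, 1),
        "visit_end_date": date(2020, 1, 5),
    },
    {
        "visit_occurrence_id": 2,
        "visit_start_date": date(2020, 3, 10),
        "visit_end_date": date(2020, 3, 10),
    },
]
```

This needs both the domain row and the full list of visits in scope at once, which is why it cannot live in a function that only sees one config's output_dict.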
- ccda_to_omop.data_driven_parse.do_hash_fields(output_dict: dict[str, None | str | float | int | int32 | int64 | datetime | date], root_element, root_path, config_name, config_dict: dict[str, dict[str, str | None]], error_fields_set: set[str], pk_dict: dict[str, list[Any]])[source]
Compute HASH fields by hashing a list of named input fields into a single ID.
Similar to DERIVED but takes a list of field names rather than individually named arguments. The resulting hash is also a PK: it is stored in pk_dict so it can be used as a PK/FK reference. Note: hash IDs are 64-bit but OMOP integer columns are typically 32-bit, so use with care. See the code for the data_type-based conversion logic, where a different kind of hash is beaten into an integer.
- ccda_to_omop.data_driven_parse.do_none_fields(output_dict: dict[str, None | str | float | int | int32 | int64 | datetime | date], root_element, root_path, config_name, config_dict: dict[str, dict[str, str | None]], error_fields_set: set[str])[source]
Set fields whose config_type is None to None in output_dict.
- ccda_to_omop.data_driven_parse.do_priority_fields(output_dict: dict[str, None | str | float | int | int32 | int64 | datetime | date], root_element, root_path, config_name, config_dict: dict[str, dict[str, str | None]], error_fields_set: set[str], pk_dict: dict[str, list[Any]]) dict[str, list][source]
- ARGS expected in config:
'config_type': 'PRIORITY',
'default': 0, in case there is no non-null value in the priority chain and we don't want a null value in the end,
'order': 17
Returns the list of priority_names so the chosen one (first non-null) can be added to output fields. Also, adds this field to the PK list? This is basically what SQL calls a coalesce.
Within the config_dict, find all fields tagged with priority and group them by their priority names in a dictionary keyed by that name. Ex. { 'person_id': [ ('person_id_ssn', 1), ('person_id_unknown', 2) ] }. Sort them, and choose the first one that is not None.
NB: there is now a separate config_type PRIORITY to complement the priority attribute. So you might have person_id_npi, person_id_ssn and person_id_hash tagged with priority attributes to create a field person_id, but then also another field, just plain person_id. The point of it is to have a unique place to put that field's order attribute. The code here (and in the ordering code later) must be aware of that field in the config_dict (where it isn't used) and not clobber it. It's an issue over in the sorting/ordering.
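The grouping and coalescing described above can be sketched like this. The function name coalesce_priority_fields and the shape of the priority attribute, a (priority_name, rank) pair, are assumptions made for illustration.

```python
def coalesce_priority_fields(output_dict, config_dict):
    """Pick, per priority name, the first non-None candidate by rank."""
    groups = {}
    for field_name, details in config_dict.items():
        priority = details.get("priority")
        if priority is not None:
            priority_name, rank = priority
            groups.setdefault(priority_name, []).append((field_name, rank))

    chosen = {}
    for priority_name, candidates in groups.items():
        # The plain PRIORITY entry only supplies 'default' and 'order';
        # it has no priority attribute, so it is never a candidate itself.
        chosen[priority_name] = config_dict.get(priority_name, {}).get("default")
        for field_name, _rank in sorted(candidates, key=lambda pair: pair[1]):
            if output_dict.get(field_name) is not None:
                chosen[priority_name] = output_dict[field_name]
                break
    return chosen


config_dict = {
    "person_id_ssn": {"config_type": "FIELD", "priority": ("person_id", 1)},
    "person_id_unknown": {"config_type": "FIELD", "priority": ("person_id", 2)},
    "person_id": {"config_type": "PRIORITY", "default": 0, "order": 17},
}
```

With `output_dict = {"person_id_ssn": None, "person_id_unknown": "u-1"}` the second candidate wins; when every candidate is None the configured default is used instead.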
- ccda_to_omop.data_driven_parse.get_extract_order_fn(dict)[source]
Return a sort-key function that reads the ‘order’ attribute from a config dict entry.
Fields without an ‘order’ attribute sort last (sys.maxsize). Intended for use with sorted() when ordering output fields.
- ccda_to_omop.data_driven_parse.get_filter_fn(dict)[source]
Return a predicate function that is True only for fields that have a non-None ‘order’ attribute.
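The two helpers above follow a common closure pattern: build a small function that remembers the config dictionary, then hand it to sorted() or filter(). The sketch below uses hypothetical names (make_order_key, make_has_order) and assumes the returned functions take a field name; the real helpers may be keyed differently.

```python
import sys


def make_order_key(config_dict):
    """Return a sort key that reads 'order'; fields without one sort last."""
    def order_key(field_name):
        order = config_dict.get(field_name, {}).get("order")
        return order if order is not None else sys.maxsize
    return order_key


def make_has_order(config_dict):
    """Return a predicate that is True only for fields with a non-None 'order'."""
    def has_order(field_name):
        return config_dict.get(field_name, {}).get("order") is not None
    return has_order


config = {"b": {"order": 2}, "a": {"order": 1}, "helper": {}}
ordered_all = sorted(config, key=make_order_key(config))
ordered_output = sorted(filter(make_has_order(config), config), key=make_order_key(config))
```

Filtering before sorting drops helper fields that exist only to feed derived values, leaving just the output columns in their configured order.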
- ccda_to_omop.data_driven_parse.make_distinct(rows)[source]
rows is a list of records/dictionaries. Returns another such list with duplicates removed.
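One way to remove duplicate row dictionaries while keeping first-seen order is sketched below. It is an illustration, not the body of make_distinct(); the name distinct_rows and the use of repr() to build a hashable key are assumptions.

```python
def distinct_rows(rows):
    """Return rows with exact duplicates removed, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        # repr() makes unhashable values such as lists usable in the key.
        key = tuple((name, repr(row[name])) for name in sorted(row))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```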
- ccda_to_omop.data_driven_parse.parse_config_for_single_root(root_element, root_path, config_name, config_dict: dict[str, dict[str, str | None]], error_fields_set: set[str], pk_dict: dict[str, list[Any]], filename: str) dict[str, None | str | float | int | int32 | int64 | datetime | date] | None[source]
- Parses each field in the metadata for a config out of the root_element passed in.
You may have more than one such root element, each making for a row in the output.
If the configuration includes a field of config_type DOMAIN, the value it generates will be compared to the domain specified in the config in expected_domain_id. If they are different, None is returned. This is how OMOP "domain routing" is implemented here.
Returns output_dict, a record, a single row for the domain involved.
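The domain-routing check can be pictured with the sketch below. The function name route_by_domain is hypothetical, and placing expected_domain_id on the DOMAIN field's own entry is an assumption; it may live elsewhere in the real configuration.

```python
def route_by_domain(output_dict, config_dict):
    """Return the row, or None when its domain does not match the config."""
    for field_name, details in config_dict.items():
        if details.get("config_type") == "DOMAIN":
            if output_dict.get(field_name) != details.get("expected_domain_id"):
                return None  # this row belongs to a different OMOP table
    return output_dict


config_dict = {
    "domain_id": {"config_type": "DOMAIN", "expected_domain_id": "Measurement"},
    "measurement_source_value": {"config_type": "FIELD", "order": 1},
}
```

The same XML element can then be offered to several configs, and only the config whose expected domain matches keeps the row.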
- ccda_to_omop.data_driven_parse.parse_config_from_xml_file(tree, config_name, config_dict: dict[str, dict[str, str | None]], filename, pk_dict: dict[str, list[Any]]) list[dict[str, None | str | float | int | int32 | int64 | datetime | date] | None] | None[source]
Basically returns a list of rows for one domain that a parse configuration, config_name, creates.
The main logic is here. Given a tree from ElementTree representing a CCDA document (ClinicalDocument, not just file), parse the different domains out of it (1 config each), linking PK and FKs between them.
Returns a list, output_list, of dictionaries, output_dict, keyed by field name, each holding a pair of the value and the path to it:
[ { field_1: (value, path), field_2: (value, path) },
  { field_1: (value, path), field_2: (value, path) } ]
It's a list because you might have more than one instance of the root path, like when you get many observations.
- arg: tree, this is the lxml.etree parse of the XML file
- arg: config_name, this is a key into the first level of the metadata, and often an OMOP domain name
- arg: config_dict, this is the value of that key in the dict
- arg: filename, the name of the XML file, for logging
- arg: pk_dict, a dictionary for Primary Keys. The keys here are field names and their values are those fields' values. It's a sort of global space for carrying PKs to other parts of processing where they will be used as FKs. This is useful for things like the main person_id that is part of the context the document creates.
- ccda_to_omop.data_driven_parse.parse_doc(file_path, metadata: dict[str, dict[str, dict[str, str]]], parse_config: str) dict[str, list[dict[str, None | str | float | int | int32 | int64 | datetime | date] | None] | None][source]
Parses many meta configs from a single file and collects them in omop_dict.
- file_path
- metadata
- parse_config: the name of a single config to run, all if None.
Returns omop_dict, a dict keyed by configuration names, each a list of record/row dictionaries.
- ccda_to_omop.data_driven_parse.parse_field_from_dict(field_details_dict: dict[str, str], root_element, config_name, field_tag, root_path) None | str | float | int | int32 | int64 | datetime | date | list[source]
Retrieves a value for the field described in field_details_dict that lies below the root_element. config_name and field_tag are here for error messages.
- ccda_to_omop.data_driven_parse.parse_string(ccda_string, file_path, metadata: dict[str, dict[str, dict[str, str]]]) dict[str, list[dict[str, None | str | float | int | int32 | int64 | datetime | date] | None] | None][source]
Parses many meta configs from a string instead of a single file, collects them in omop_dict.
Returns omop_dict, a dict keyed by configuration names, each a list of record/row dictionaries.
- ccda_to_omop.data_driven_parse.print_omop_structure(omop: dict[str, list[dict[str, None | str | float | int | int32 | int64 | datetime | date]]], metadata: dict[str, dict[str, dict[str, str]]])[source]
Print the full parsed OMOP structure returned by parse_doc() or parse_domain_from_dict().
- ccda_to_omop.data_driven_parse.process_and_save_file(input_file_path, output_file_path, print_output)[source]
- ccda_to_omop.data_driven_parse.process_file(filepath: str, print_output: bool, parse_config: str)[source]
Parse one CCDA XML file and optionally print the resulting OMOP structure.
parse_config is the top-level metadata key to use (e.g. ‘OBSERVATION-from-Procedure’). For production use that returns DataFrames, see layer_datasets.py instead.
- ccda_to_omop.data_driven_parse.process_single_file_single_config(input_file_path, output_file_path, print_output, key)[source]
- ccda_to_omop.data_driven_parse.sort_output_and_omit_dict(output_dict: dict[str, None | str | float | int | int32 | int64 | datetime | date], config_dict: dict[str, dict[str, str | None]], config_name)[source]
Sorts the output_dict by the value of the 'order' attribute in the associated config_dict. Fields without an 'order' value, or without an entry in config_dict, used to come last; now they are omitted.
- ccda_to_omop.data_driven_parse.validate_ccda_document(file_path, tree) list[str][source]
Validate that a parsed lxml tree looks like a conformant CCDA document.
Checks performed:
- Root element is ClinicalDocument in the HL7 v3 namespace
- Document contains at least one structuredBody/component section
- Parameters:
file_path – Path to the source file (used in error messages only).
tree – lxml ElementTree returned by ET.parse().
- Returns:
List of human-readable error strings. Empty list means the document passed all checks.
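The two checks listed above could be written roughly as follows. This sketch uses the standard-library xml.etree.ElementTree rather than lxml so that it runs without extra dependencies, and the function name check_ccda, the exact search path, and the wording of the error strings are assumptions.

```python
import xml.etree.ElementTree as ET

HL7_V3_NAMESPACE = "urn:hl7-org:v3"


def check_ccda(file_path, tree):
    """Return a list of error strings; an empty list means both checks passed."""
    errors = []
    root = tree.getroot()

    expected_tag = f"{{{HL7_V3_NAMESPACE}}}ClinicalDocument"
    if root.tag != expected_tag:
        errors.append(f"{file_path}: root element is {root.tag}, expected {expected_tag}")

    namespaces = {"hl7": HL7_V3_NAMESPACE}
    if root.find(".//hl7:structuredBody/hl7:component", namespaces) is None:
        errors.append(f"{file_path}: no structuredBody/component section found")

    return errors


good_tree = ET.ElementTree(ET.fromstring(
    '<ClinicalDocument xmlns="urn:hl7-org:v3">'
    '<component><structuredBody><component><section/></component></structuredBody></component>'
    '</ClinicalDocument>'
))
bad_tree = ET.ElementTree(ET.fromstring("<note><body/></note>"))
```

Returning a list of messages instead of raising lets a caller log every problem in a file and decide for itself whether to skip it.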