Overview

What is CDE Harmonization?

The CDE Harmonization project addresses a critical challenge in clinical research: the lack of interoperability among Common Data Elements (CDEs) used across different studies and repositories.

The Problem

Clinical datasets hold enormous potential for advancing biomedical research, but their value is undermined by:

Semantic heterogeneity: The same concept (e.g., “heart attack” vs. “myocardial infarction”) is expressed differently across datasets
Structural variability: Equivalent data elements appear in incompatible formats (free text, categorical, numeric)
CDE proliferation: Multiple CDEs often exist for the same concept, each with different units, contexts, or measurement types
Non-CDE variables: Countless datasets capture variables without using formal CDEs, creating additional harmonization challenges
Fragmented CDE repositories: More than a dozen repositories (NIH CDE Repository, PhenX, caDSR, etc.) hold overlapping but inconsistent CDEs
Lack of computability: Most CDEs are free text without formal schemas, ontology bindings, or machine-readable mappings

Without systematic harmonization, researchers cannot reliably integrate data across studies or leverage datasets for AI/ML applications.

Our Approach

We are developing a comprehensive framework to make CDEs and study variables interoperable:

Integration and AI-Assisted Curation

Ingest CDEs from major repositories (NIH CDE Repository, PhenX, caDSR, HEAL, RADx, etc.) and study-specific variables
Convert to LinkML schemas: Transform heterogeneous CDEs into a unified, computable format called Condor microschemas
Build Common Value Sets: Standardize permissible values (e.g., “Yes/No”, Likert scales) across CDEs
AI-assisted mapping: Use large language models and semantic embeddings to generate mappings between CDEs and to ontologies (LOINC, HPO, Mondo, etc.)
Human-in-the-loop curation: Expert curators validate AI-generated mappings via GitHub-based workflows using SSSOM (Simple Standard for Sharing Ontological Mappings)
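
The mapping and curation steps above can be pictured with a minimal Python sketch. It is not the project's pipeline: a bag-of-words cosine similarity stands in for the language-model embeddings, and the CDE identifiers, term identifiers, labels, and the 0.5 threshold are all hypothetical. The column names follow SSSOM mapping slots (`subject_id`, `predicate_id`, `object_id`, `mapping_justification`, `confidence`), so the output is the kind of table a curator could review in a pull request.

```python
import csv
import io
import math
import re
from collections import Counter

# Hypothetical CDEs and ontology terms, for illustration only.
cdes = {
    "EX:cde_001": "History of myocardial infarction",
    "EX:cde_002": "Body weight in kilograms",
}
terms = {
    "EX:term_mi": "myocardial infarction",
    "EX:term_bw": "body weight",
}

def tokens(text):
    """Lowercased word counts; a crude stand-in for an embedding vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def propose_mappings(cdes, terms, threshold=0.5):
    """Return candidate mappings as SSSOM-style rows for curator review."""
    rows = []
    for cde_id, cde_label in cdes.items():
        for term_id, term_label in terms.items():
            score = cosine(tokens(cde_label), tokens(term_label))
            if score >= threshold:
                rows.append({
                    "subject_id": cde_id,
                    "subject_label": cde_label,
                    "predicate_id": "skos:closeMatch",
                    "object_id": term_id,
                    "object_label": term_label,
                    # Lexical, because this sketch compares words, not embeddings.
                    "mapping_justification": "semapv:LexicalMatching",
                    "confidence": round(score, 2),
                })
    return rows

def to_tsv(rows):
    """Serialize candidate rows as tab-separated text."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]),
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

print(to_tsv(propose_mappings(cdes, terms)))
```

In a human-in-the-loop setting, a curator who accepts a candidate row would typically tighten the predicate (for example to `skos:exactMatch`) and record the manual review in the justification column.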

Community Tools and Integration

Semantic clustering: Use embeddings to group similar CDEs, enabling discovery and recommendation
API and widgets: Provide tools for semantic search and CDE selection, integrated into REDCap, CEDAR, and CDE repositories
Prospective harmonization: Generate data collection forms directly from schematized CDEs to ensure new data is “born interoperable”
Retrospective harmonization: Enable alignment of existing heterogeneous datasets through crosswalk tables and schema mappings
Collaboration with repositories: Work with governance committees to integrate our framework into standard CDE development workflows
FAIR data release: Publish harmonized CDEs at stable URLs with versioned snapshots
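
As an illustration of retrospective harmonization, the sketch below applies a small crosswalk table to records from two studies. Everything in it is an assumption made for the example: the study names, variable names, target field names, and value codes are hypothetical, and the real crosswalk format may differ. It shows the two common cases, recoding categorical values into a shared value set and converting units to a shared numeric field.

```python
# Hypothetical crosswalk: (study, source variable) -> harmonization rule.
CROSSWALK = {
    ("study_a", "mi_hx"): {
        "target": "myocardial_infarction_history",
        "values": {"Y": "Yes", "N": "No"},
    },
    ("study_b", "heart_attack"): {
        "target": "myocardial_infarction_history",
        "values": {"1": "Yes", "0": "No"},
    },
    # Pounds to kilograms.
    ("study_a", "wt_lb"): {"target": "body_weight_kg", "scale": 0.45359237},
    ("study_b", "weight_kg"): {"target": "body_weight_kg", "scale": 1.0},
}

def harmonize(study, record):
    """Map one source record onto the shared target fields."""
    harmonized = {}
    for variable, raw in record.items():
        rule = CROSSWALK.get((study, variable))
        if rule is None:
            continue  # Unmapped variables are dropped in this sketch.
        if "values" in rule:
            # Categorical: recode the source value into the shared value set.
            harmonized[rule["target"]] = rule["values"][str(raw)]
        else:
            # Numeric: convert units with a scale factor.
            harmonized[rule["target"]] = round(float(raw) * rule["scale"], 1)
    return harmonized

record_a = harmonize("study_a", {"mi_hx": "Y", "wt_lb": 154})
record_b = harmonize("study_b", {"heart_attack": 1, "weight_kg": 70, "site": "X"})
print(record_a)
print(record_b)
```

After harmonization both records use the same field names and value encodings, so they can be combined in a single analysis table.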

Repository Contents

Raw CDEs: Collected from 10+ repositories, preserving native formats with full provenance
Condor microschemas: LinkML representations of CDEs with ontology bindings and cross-CDE mappings
Common Value Sets: Reusable library of standardized permissible values
Crosswalk tables: Schema mappings for retrospective data harmonization
Curation workflows: AI-assisted pipelines for generating and validating mappings
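
To give a rough sense of how a schematized CDE and a reusable value set fit together, here is a small Python sketch. The actual layout of a Condor microschema is not described in this overview, so the structure below is an assumption: it borrows general LinkML element names (`enums`, `permissible_values`, `meaning`, `slots`, `range`, `required`, `exact_mappings`) and expresses them as a Python dictionary rather than YAML. All identifiers and URLs are hypothetical, and the `validate` helper is a toy check, not the LinkML validator.

```python
# Hypothetical reusable value set, shared by any CDE with a Yes/No answer.
YES_NO_ENUM = {
    "permissible_values": {
        "Yes": {"meaning": "EX:yes"},
        "No": {"meaning": "EX:no"},
    }
}

# Hypothetical microschema for a single CDE, using LinkML-style element names.
microschema = {
    "id": "https://example.org/cde/mi_history",
    "name": "mi_history",
    "enums": {"YesNoEnum": YES_NO_ENUM},
    "slots": {
        "myocardial_infarction_history": {
            "description": "Whether the participant has ever had a myocardial infarction",
            "range": "YesNoEnum",
            "required": True,
            "exact_mappings": ["EX:term_mi"],  # ontology binding (placeholder ID)
        }
    },
}

def validate(record, schema):
    """Return a list of problems found in a record; empty means it passed."""
    errors = []
    for name, slot in schema["slots"].items():
        if name not in record:
            if slot.get("required"):
                errors.append(f"missing required slot: {name}")
            continue
        enum = schema["enums"].get(slot.get("range"))
        if enum is not None and record[name] not in enum["permissible_values"]:
            errors.append(f"{name}: {record[name]!r} is not a permissible value")
    return errors

print(validate({"myocardial_infarction_history": "Yes"}, microschema))
print(validate({"myocardial_infarction_history": "maybe"}, microschema))
```

Because the value set lives in one place, any number of CDE schemas can point at it, which is what allows "Yes/No" answers from different CDEs to be compared directly.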

Large files are stored in the Monarch Google Cloud bucket.

Impact

This work will create a virtuous cycle of CDE interoperability, enabling:

Cross-study integration: Harmonized datasets that can be jointly analyzed across initiatives
AI-readiness: Semantically annotated, schema-validated data suitable for machine learning
Prospective standardization: Tools that help researchers select and use harmonized CDEs from the start
Retrospective harmonization: Methods to align existing heterogeneous datasets
Computational modeling: Connections between clinical data and physiological simulation models

By making CDEs computable and FAIR (Findable, Accessible, Interoperable, Reusable), we will unlock the full value of clinical research data for discovery, validation, and translation.