Collecting and Converting CDEs
This guide shows how to collect common data elements (CDEs) from major repositories and convert them into computable LinkML schemas.
Overview
The CDE harmonization project integrates CDEs from multiple repositories:
NIH CDE Repository: NIH/NLM Common Data Elements
PhenX Toolkit: Phenotype and exposure assessment measures
caDSR: NCI Cancer Data Standards Registry and Repository
RADx-UP: COVID-19 data elements for underserved populations
HEAL: Pain and opioid research data elements
The cde2linkml tool converts these heterogeneous formats into unified LinkML schemas that capture:
Identifiers and provenance
Permissible values and datatypes
Metadata and context
Relationships to other CDEs
Converting NIH/NLM CDEs
Input Format
NIH/NLM CDEs are provided in CSV format with columns for:
CDE name
Description
Data type
Permissible values
Ontology mappings
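As an illustration of the column layout above, the following minimal sketch reads such a CSV with Python's standard library. The header names (`name`, `description`, `data_type`, `permissible_values`, `ontology_mappings`), the `|` separator inside multi-valued cells, and the `EXAMPLE:0001` identifier are all assumptions made for the example; the real NIH/NLM export may differ, and this is not how cde2linkml itself is known to parse the files.

```python
import csv
import io

# Hypothetical sample data: headers, separator, and identifiers are
# illustrative only and may not match the real NIH/NLM export.
SAMPLE_CSV = """name,description,data_type,permissible_values,ontology_mappings
Smoking status,Current smoking status of the participant,Value List,Never|Former|Current,EXAMPLE:0001
Age,Age of the participant in years,Number,,
"""


def split_cell(cell, separator="|"):
    """Split a multi-valued cell into a list, dropping empty entries."""
    return [part.strip() for part in cell.split(separator) if part.strip()]


def read_cdes(handle):
    """Yield one dict per CDE row, with multi-valued cells as lists."""
    for row in csv.DictReader(handle):
        yield {
            "name": row["name"].strip(),
            "description": row["description"].strip(),
            "data_type": row["data_type"].strip(),
            "permissible_values": split_cell(row["permissible_values"]),
            "ontology_mappings": split_cell(row["ontology_mappings"]),
        }


cdes = list(read_cdes(io.StringIO(SAMPLE_CSV)))
for cde in cdes:
    print(cde["name"], cde["data_type"], cde["permissible_values"])
```

Normalizing every row to the same keys up front keeps later conversion steps independent of the source file's quirks.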
Conversion Command
cde2linkml --nih-nlm \
--input-folder data/nlm \
--output-folder linkml
Output
The command generates a LinkML schema with:
Classes for each project or survey
Slots for each data element
Enumerations for categorical values
Type definitions and constraints
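To make the mapping from data elements to slots and enumerations concrete, here is a rough sketch that builds the corresponding LinkML fragments as plain Python dictionaries. The `TYPE_MAP` table, the naming helpers, and the `Enum` suffix are assumptions for illustration; the names and type choices that cde2linkml actually produces may be different.

```python
import re

# Assumed mapping from source datatype labels to LinkML built-in types.
TYPE_MAP = {"Text": "string", "Number": "float", "Date": "date"}


def snake_case(label):
    """Turn a human-readable label into a slot-style identifier."""
    return re.sub(r"[^0-9a-z]+", "_", label.lower()).strip("_")


def camel_case(label):
    """Turn a human-readable label into an enum-style identifier."""
    parts = re.split(r"[^0-9A-Za-z]+", label)
    return "".join(part.capitalize() for part in parts if part)


def cde_to_schema_fragments(cde):
    """Return (slots, enums) dictionaries for a single CDE record."""
    slots, enums = {}, {}
    slot = {"description": cde["description"]}
    if cde.get("permissible_values"):
        # Categorical element: emit an enum and point the slot at it.
        enum_name = camel_case(cde["name"]) + "Enum"
        enums[enum_name] = {
            "permissible_values": {v: {} for v in cde["permissible_values"]}
        }
        slot["range"] = enum_name
    else:
        # Scalar element: fall back to string for unknown datatypes.
        slot["range"] = TYPE_MAP.get(cde["data_type"], "string")
    slots[snake_case(cde["name"])] = slot
    return slots, enums


slots, enums = cde_to_schema_fragments(
    {
        "name": "Smoking status",
        "description": "Current smoking status of the participant",
        "data_type": "Value List",
        "permissible_values": ["Never", "Former", "Current"],
    }
)
print(slots)
print(enums)
```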
Converting PhenX CDEs
Input Format
PhenX CDEs come from REDCap data dictionaries, which include:
Variable names
Field labels
Field types
Validation rules
Branching logic
Conversion Command
cde2linkml --phenx \
--input-folder data/phenx-redcap \
--output-folder linkml
Special Considerations
PhenX conversions handle:
REDCap-specific field types (radio, dropdown, checkbox)
Branching logic translation
Calculated fields
Multiple choice questions
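REDCap data dictionaries store the options of radio, dropdown, and checkbox fields as a single string of the form `code, label | code, label`. The sketch below shows one plausible way to turn that string into a LinkML enumeration and to mark checkbox fields as multivalued. The function names are hypothetical, and the choice to keep the REDCap code as the permissible value with the label as its description is an assumption rather than documented cde2linkml behavior.

```python
def parse_redcap_choices(raw):
    """Parse 'code, label | code, label' into an ordered {code: label} dict."""
    choices = {}
    for item in raw.split("|"):
        item = item.strip()
        if not item:
            continue
        # Split on the first comma only, since labels may contain commas.
        code, _, label = item.partition(",")
        choices[code.strip()] = label.strip()
    return choices


def redcap_field_to_fragments(field_type, raw_choices, enum_name):
    """Return (slot, enum) dictionaries for a categorical REDCap field."""
    choices = parse_redcap_choices(raw_choices)
    enum = {
        "permissible_values": {
            code: {"description": label} for code, label in choices.items()
        }
    }
    slot = {"range": enum_name}
    if field_type == "checkbox":
        # Checkbox fields allow several selections at once.
        slot["multivalued"] = True
    return slot, enum


slot, enum = redcap_field_to_fragments(
    "checkbox", "1, Yes | 2, No | 99, Don't know", "ExampleAnswerEnum"
)
print(slot)
print(enum)
```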
Converting RADx-UP CDEs
Input Format
RADx-UP uses CSV data dictionaries with:
Variable names
Descriptions
Data types
Value ranges
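Value ranges map naturally onto the LinkML slot attributes `minimum_value` and `maximum_value`. The sketch below assumes ranges are written as `low-high` or `low to high`; that textual format and the `range_constraints` helper are assumptions for illustration, so check them against the actual RADx-UP dictionaries before relying on them.

```python
import re

# Assumed range syntax: "0-120", "0 to 120", "-5-10", "0.5 - 2.5".
RANGE_PATTERN = re.compile(
    r"^\s*(-?\d+(?:\.\d+)?)\s*(?:-|to)\s*(-?\d+(?:\.\d+)?)\s*$"
)


def range_constraints(raw):
    """Return LinkML min/max constraints, or {} if the text is not a range."""
    match = RANGE_PATTERN.match(raw)
    if not match:
        return {}
    low, high = (float(group) for group in match.groups())
    if low > high:
        raise ValueError(f"lower bound exceeds upper bound: {raw!r}")
    return {"minimum_value": low, "maximum_value": high}


slot = {"description": "Age of the participant in years", "range": "integer"}
slot.update(range_constraints("0-120"))
print(slot)
```

Returning an empty dictionary for unparseable text leaves the slot unconstrained instead of guessing at a bound.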
Conversion Command
cde2linkml --radx-up \
--input-folder data/radx-up \
--output-folder linkml
Customizing Conversions
Adding Ontology Mappings
Edit the generated schema to add ontology mappings:
slots:
  blood_pressure:
    description: Systolic blood pressure
    range: integer
    unit:
      ucum_code: mm[Hg]
    exact_mappings:
      - LOINC:8480-6
Creating Enumerations
Define reusable value sets:
enums:
  YesNoUnknown:
    permissible_values:
      # Quoted so that YAML 1.1 parsers do not read Yes/No as booleans
      "Yes":
        meaning: NCIT:C49488
      "No":
        meaning: NCIT:C49487
      Unknown:
        meaning: NCIT:C17998
Validation
Validate the generated schema:
linkml-validate linkml/nih_nlm_schema.yaml
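Hand-added mappings and meanings use CURIEs such as LOINC:8480-6, whose prefixes need to be declared in the schema's `prefixes` section. As a lightweight complement to validation, the sketch below lists prefixes that are used but not declared. It works on a schema already loaded into a plain dictionary; the `undeclared_prefixes` helper and the LOINC expansion URI are assumptions for illustration and are not part of cde2linkml or LinkML.

```python
def curie_prefix(curie):
    """Return the prefix part of a CURIE such as 'LOINC:8480-6'."""
    return curie.split(":", 1)[0]


def undeclared_prefixes(schema):
    """List prefixes used in mappings or meanings but missing from 'prefixes'."""
    declared = set(schema.get("prefixes") or {})
    used = set()
    for slot in (schema.get("slots") or {}).values():
        for curie in (slot or {}).get("exact_mappings", []):
            used.add(curie_prefix(curie))
    for enum in (schema.get("enums") or {}).values():
        values = (enum or {}).get("permissible_values") or {}
        for value in values.values():
            meaning = (value or {}).get("meaning")
            if meaning:
                used.add(curie_prefix(meaning))
    return sorted(used - declared)


# Schema fragment as it would look after loading the YAML into a dict.
# The prefix expansion URI is illustrative.
schema = {
    "prefixes": {"LOINC": "https://loinc.org/"},
    "slots": {
        "blood_pressure": {"range": "integer", "exact_mappings": ["LOINC:8480-6"]}
    },
    "enums": {
        "YesNoUnknown": {
            "permissible_values": {
                "Yes": {"meaning": "NCIT:C49488"},
                "No": {"meaning": "NCIT:C49487"},
                "Unknown": {"meaning": "NCIT:C17998"},
            }
        }
    },
}

missing = undeclared_prefixes(schema)
print(missing)
```

In this example the NCIT prefix is reported because only LOINC is declared.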
Next Steps
See the Project Roadmap for planned features including AI-assisted mapping, human curation workflows, and data harmonization.