Collecting and Converting CDEs

This guide shows how to collect common data elements (CDEs) from major repositories and convert them into computable LinkML schemas.

Overview

The CDE harmonization project integrates CDEs from multiple repositories:

  • NIH CDE Repository: NIH/NLM Common Data Elements

  • PhenX Toolkit: Phenotype and exposure assessment measures

  • caDSR: NCI Cancer Data Standards Registry and Repository

  • RADx-UP: COVID-19 data elements for underserved populations

  • HEAL: Pain and opioid research data elements

The cde2linkml tool converts these heterogeneous formats into unified LinkML schemas that capture:

  • Identifiers and provenance

  • Permissible values and datatypes

  • Metadata and context

  • Relationships to other CDEs
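To make the four kinds of captured information concrete, here is a minimal Python sketch of an in-memory record for one CDE. The class and field names (CommonDataElement, PermissibleValue, related) are illustrative assumptions, not the data model cde2linkml actually uses.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PermissibleValue:
    code: str                      # value as stored in the data
    label: str                     # human-readable text
    meaning: Optional[str] = None  # ontology CURIE, if one is known


@dataclass
class CommonDataElement:
    identifier: str                # ID assigned by the source repository
    source: str                    # provenance, e.g. "PhenX Toolkit"
    name: str
    datatype: str
    description: str = ""          # metadata and context
    permissible_values: list[PermissibleValue] = field(default_factory=list)
    related: dict[str, list[str]] = field(default_factory=dict)  # relation -> CDE IDs

    def is_categorical(self) -> bool:
        """A CDE with permissible values would map to an enum-ranged slot."""
        return bool(self.permissible_values)
```

A record with a non-empty permissible_values list is the kind of element that would end up with an enumeration as its range.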

Converting NIH/NLM CDEs

Input Format

NIH/NLM CDEs are provided in CSV format with columns for:

  • CDE name

  • Description

  • Data type

  • Permissible values

  • Ontology mappings
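As a rough illustration of this input, the sketch below reads such a CSV with Python's standard csv module. The column headers, the "|" separator inside multi-valued cells, and the EX: placeholder CURIE are all assumptions made for the example; check them against the actual export before relying on them.

```python
import csv
import io

# Hypothetical headers and delimiter; real NIH/NLM exports may differ.
SAMPLE_CSV = """CDE Name,Description,Data Type,Permissible Values,Ontology Mappings
Smoking Status,Current smoking status,Value List,Never|Former|Current,EX:0001
Age,Age in years at enrollment,Number,,
"""


def split_cell(cell):
    """Split a multi-valued cell on '|' and drop empty entries."""
    return [part.strip() for part in cell.split("|") if part.strip()]


def read_cdes(handle):
    """Return one dict per CDE row, with multi-valued columns as lists."""
    records = []
    for row in csv.DictReader(handle):
        records.append({
            "name": row["CDE Name"].strip(),
            "description": row["Description"].strip(),
            "datatype": row["Data Type"].strip(),
            "permissible_values": split_cell(row["Permissible Values"]),
            "mappings": split_cell(row["Ontology Mappings"]),
        })
    return records


records = read_cdes(io.StringIO(SAMPLE_CSV))
```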

Conversion Command

cde2linkml --nih-nlm \
  --input-folder data/nlm \
  --output-folder linkml

Output

The command generates a LinkML schema with:

  • Classes for each project or survey

  • Slots for each data element

  • Enumerations for categorical values

  • Type definitions and constraints
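The shape of that output can be sketched in Python as a plain dictionary mirroring the LinkML YAML layout. This is an assumption-laden illustration rather than the tool's implementation: the naming rules (snake_case slots, an "_enum" suffix, a CamelCase class name) are hypothetical, and required top-level keys such as id and prefixes are omitted for brevity.

```python
import re


def to_identifier(label):
    """Turn a free-text label into a snake_case name."""
    return re.sub(r"[^0-9a-z]+", "_", label.lower()).strip("_")


def build_schema(project, elements):
    """Build one class for the project, one slot per element, one enum per value list."""
    schema = {"name": to_identifier(project), "classes": {}, "slots": {}, "enums": {}}
    slot_names = []
    for element in elements:
        slot_name = to_identifier(element["name"])
        slot = {"description": element.get("description", "")}
        values = element.get("permissible_values") or []
        if values:
            # Categorical element: point the slot at a dedicated enum.
            enum_name = slot_name + "_enum"
            schema["enums"][enum_name] = {
                "permissible_values": {value: {} for value in values}
            }
            slot["range"] = enum_name
        else:
            slot["range"] = element.get("datatype", "string")
        schema["slots"][slot_name] = slot
        slot_names.append(slot_name)
    class_name = "".join(p.capitalize() for p in to_identifier(project).split("_"))
    schema["classes"][class_name] = {"slots": slot_names}
    return schema
```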

Converting PhenX CDEs

Input Format

PhenX CDEs come from REDCap data dictionaries, which include:

  • Variable names

  • Field labels

  • Field types

  • Validation rules

  • Branching logic
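In a REDCap data dictionary, the choices for a categorical field are stored in a single cell as "code, label" pairs separated by "|". The Python sketch below shows one way to split that cell; the function name parse_choices is hypothetical, and labels that themselves contain "|" are not handled.

```python
def parse_choices(raw):
    """Parse a REDCap choices cell such as '1, Yes | 2, No' into (code, label) pairs."""
    choices = []
    for part in raw.split("|"):
        part = part.strip()
        if not part:
            continue
        # Split on the first comma only, so labels may contain commas.
        code, _, label = part.partition(",")
        choices.append((code.strip(), label.strip()))
    return choices
```

For example, parse_choices("1, Yes | 2, No") yields the pairs ("1", "Yes") and ("2", "No"), which could then become the permissible values of an enumeration.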

Conversion Command

cde2linkml --phenx \
  --input-folder data/phenx-redcap \
  --output-folder linkml

Special Considerations

PhenX conversions handle:

  • REDCap-specific field types (radio, dropdown, checkbox)

  • Branching logic translation

  • Calculated fields

  • Multiple choice questions
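Two of these considerations can be sketched in Python. The mapping from REDCap field types to LinkML slot attributes below is an assumption for illustration (cde2linkml may choose differently), and the branching-logic helper only extracts the fields a condition depends on; it does not translate the condition itself or handle event-prefixed references.

```python
import re

# REDCap wraps field names in square brackets; checkbox options appear as [field(code)].
FIELD_REF = re.compile(r"\[([a-z][a-z0-9_]*)(?:\([^)]*\))?\]")


def slot_for_field(field_type, enum_name=None, validation=""):
    """Pick LinkML slot attributes for one REDCap field (assumed mapping)."""
    if field_type in ("radio", "dropdown"):
        return {"range": enum_name}                       # single choice
    if field_type == "checkbox":
        return {"range": enum_name, "multivalued": True}  # several choices allowed
    if field_type == "calc":
        return {"range": "float"}                         # calculated fields are numeric
    if field_type == "text" and validation == "integer":
        return {"range": "integer"}
    if field_type == "text" and validation == "number":
        return {"range": "float"}
    return {"range": "string"}


def referenced_fields(branching_logic):
    """List the field names a branching-logic expression refers to, in first-seen order."""
    seen = []
    for name in FIELD_REF.findall(branching_logic):
        if name not in seen:
            seen.append(name)
    return seen
```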

Converting RADx-UP CDEs

Input Format

RADx-UP uses CSV data dictionaries with:

  • Variable names

  • Descriptions

  • Data types

  • Value ranges
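Value ranges map naturally onto LinkML's minimum_value and maximum_value slot attributes. The Python sketch below assumes ranges are written as "low-high" or "low to high"; that notation is a guess for illustration, so inspect the actual RADx-UP dictionaries before reusing it.

```python
import re

# Assumed notation: "0-120", "18 - 65", or "1 to 5" (optionally negative or decimal).
RANGE = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*(?:-|to)\s*(-?\d+(?:\.\d+)?)\s*$")


def _number(text):
    """Return an int when the text is a whole number, otherwise a float."""
    value = float(text)
    return int(value) if value.is_integer() else value


def range_constraints(raw):
    """Convert a range string into LinkML min/max attributes, or {} if unparseable."""
    match = RANGE.match(raw)
    if not match:
        return {}
    low, high = (_number(group) for group in match.groups())
    if low > high:
        return {}  # reversed bounds are treated as unparseable
    return {"minimum_value": low, "maximum_value": high}
```

Returning an empty dictionary for anything that does not parse keeps free-text cells from producing bogus constraints.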

Conversion Command

cde2linkml --radx-up \
  --input-folder data/radx-up \
  --output-folder linkml

Customizing Conversions

Adding Ontology Mappings

Edit the generated schema to add ontology mappings:

slots:
  blood_pressure:
    description: Systolic blood pressure
    range: integer
    unit:
      ucum_code: mm[Hg]
    exact_mappings:
      - LOINC:8480-6
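Each CURIE used in a mapping needs its prefix declared in the schema's prefixes section. As a hedged sketch, the Python helper below scans a schema already loaded into a dictionary and reports mapping prefixes that are not declared; the function name and the assumption that the schema is a plain dict are illustrative.

```python
# Mapping attributes defined by LinkML for slots, classes, and enums.
MAPPING_KEYS = (
    "exact_mappings",
    "close_mappings",
    "related_mappings",
    "narrow_mappings",
    "broad_mappings",
)


def undeclared_prefixes(schema):
    """Return the sorted CURIE prefixes used in slot mappings but missing from 'prefixes'."""
    declared = set(schema.get("prefixes") or {})
    used = set()
    for slot in (schema.get("slots") or {}).values():
        for key in MAPPING_KEYS:
            for curie in (slot or {}).get(key) or []:
                used.add(curie.split(":", 1)[0])
    return sorted(used - declared)
```

For a schema whose only mapping is LOINC:8480-6 and whose prefixes section lacks a LOINC entry, the helper returns ["LOINC"].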

Creating Enumerations

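Define reusable value sets. Quote Yes and No: YAML 1.1 parsers such as PyYAML read the bare words as booleans, which would turn these permissible values into true and false.

enums:
  YesNoUnknown:
    permissible_values:
      "Yes":
        meaning: NCIT:C49488
      "No":
        meaning: NCIT:C49487
      Unknown:
        meaning: NCIT:C17998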
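One way to find candidates for reuse is to group slots that share the same set of values. The Python sketch below does this for a mapping of slot names to value lists; the function name and the generated names (ValueSet1, ValueSet2, ...) are placeholders that a curator would rename by hand.

```python
def share_enums(slot_values):
    """Give slots with identical value sets one shared enum.

    slot_values maps slot name -> list of permissible values.
    Returns (enums, ranges), where ranges maps slot name -> enum name.
    """
    enums = {}
    ranges = {}
    names_by_values = {}
    for slot_name, values in slot_values.items():
        key = frozenset(values)  # order of values does not matter for matching
        if key not in names_by_values:
            enum_name = f"ValueSet{len(names_by_values) + 1}"
            names_by_values[key] = enum_name
            enums[enum_name] = {"permissible_values": {v: {} for v in values}}
        ranges[slot_name] = names_by_values[key]
    return enums, ranges
```

Matching on exact value sets is deliberately conservative: near-duplicates such as "Unknown" versus "Don't know" stay separate and need human review.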

Validation

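Check the generated schema with the LinkML linter:

linkml-lint linkml/nih_nlm_schema.yaml

linkml-validate serves a different purpose: it checks data files against a schema passed with --schema.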
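A quick structural check can also be scripted. The Python sketch below assumes the schema has already been loaded into a dictionary and flags slots whose range names neither a common LinkML built-in type nor a class, enum, or type defined in the schema. The built-in list is a partial subset, and the helper is illustrative, not a substitute for the LinkML tooling.

```python
# A subset of the types provided by linkml:types.
BUILTIN_TYPES = {
    "string", "integer", "float", "double", "decimal", "boolean",
    "date", "datetime", "time", "uri", "uriorcurie", "curie", "ncname",
}


def dangling_ranges(schema):
    """Return (slot_name, range) pairs whose range is not defined anywhere."""
    known = (
        BUILTIN_TYPES
        | set(schema.get("classes") or {})
        | set(schema.get("enums") or {})
        | set(schema.get("types") or {})
    )
    problems = []
    for name, slot in (schema.get("slots") or {}).items():
        # Slots without an explicit range are assumed to default to string.
        slot_range = (slot or {}).get("range", "string")
        if slot_range not in known:
            problems.append((name, slot_range))
    return problems
```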

Next Steps

See the Project Roadmap for planned features including AI-assisted mapping, human curation workflows, and data harmonization.