Project Roadmap¶
This document outlines the current status and planned development for the CDE Harmonization project.
Current Status¶
✅ Completed¶
CDE Collection
Ingested CDEs from multiple repositories:
NIH/NLM CDE Repository
PhenX Toolkit
caDSR (NCI Cancer Data Standards Registry)
RADx-UP
Raw data stored with full provenance
Beta LinkML Conversion
Prototype
cde2linkml
tool converts CDEs to LinkML schemasCaptures identifiers, permissible values, datatypes, and metadata
Generated schemas for NIH NLM, PhenX, and RADx-UP CDEs
Data Infrastructure
Database views for querying CDE metadata
CSV exports of flattened CDE data
Planned Development¶
🔨 In Progress¶
LinkML Schema Refinement
Improve LinkML schema quality and consistency
Add more metadata and provenance tracking
Standardize slot and class naming conventions
📋 Planned Features¶
AI-Powered Mapping Generation
Use large language models (LLMs) to generate SSSOM mappings
Map CDEs to standard ontologies (LOINC, HPO, Mondo, NCIT, OBA)
Identify equivalent CDEs across repositories
Generate confidence scores for AI suggestions
Human-in-the-Loop Curation Workflow
GitHub-based review process for AI-generated mappings
Expert validation of ontology bindings
Transparent provenance tracking
Collaborative curation with multiple reviewers
Common Value Sets
Build library of standardized permissible values
Map value sets to ontology terms
Enable reuse across CDEs (e.g., Yes/No/Unknown, Likert scales)
Retrospective Data Harmonization
Generate schema crosswalks for aligning datasets
Transform data using validated mappings
Assess harmonization quality
Prospective CDE Tools
Semantic clustering to recommend CDEs
API integration with CDE repositories
REDCap integration for harmonized data collection
Timeline¶
Detailed timeline to be determined based on funding and resources.
Technology Stack¶
Current:
LinkML for schema modeling
Python for data processing
SSSOM for mapping representation
Git/GitHub for version control
Planned:
CurateGPT or similar AI tools for mapping generation
OAK (Ontology Access Kit) for ontology access
Semantic embedding models for CDE similarity
Get Involved¶
This project is under active development. Contributions and feedback are welcome:
Report issues or suggest features on GitHub
Contribute CDE mappings or value sets
Help review and validate AI-generated mappings (once available)