Using the Cancer Data Aggregator (CDA) API
This document explains how Oncopacket interacts with the Cancer Data Aggregator (CDA) API to extract cancer research data and transform it into GA4GH Phenopackets.
Overview of CDA API Integration
Oncopacket utilizes the CDA Python library (cdapython
) to access data from the NCI Cancer Research Data Commons. The CDA provides a unified API that aggregates data from multiple NCI data repositories, including:
- Genomic Data Commons (GDC)
- Proteomic Data Commons (PDC)
- Imaging Data Commons (IDC)
- Clinical Trial Data Commons (CTDC)
Accessing Specific Data Elements
Cancer Stage Information
While CDA's diagnoses.tumor_stage
endpoint doesn't return data, Oncopacket directly accesses GDC API endpoints for cancer staging information:
diagnoses.ajcc_pathologic_stage
diagnoses.ajcc_clinical_stage
diagnoses.ann_arbor_pathologic_stage
diagnoses.ann_arbor_clinical_stage
These endpoints are constructed based on the GDC schema document: https://github.com/NCI-GDC/gdcdictionary/blob/develop/src/gdcdictionary/schemas/diagnosis.yaml
Vital Status Information
For vital status, Oncopacket uses the demographic.vital_status
endpoint.
Data Transformation Process
- Extract: Fetch data from CDA tables
- Transform: Convert CDA data models to Oncopacket model classes
- Map: Use ontology mappers to standardize terminology
- Load: Populate GA4GH Phenopacket structures
The CDA factory classes in Oncopacket handle these transformation processes, creating a streamlined pipeline from CDA data to standardized phenopackets.
Updates and Changes
Note: This documentation reflects the CDA API implementation as of April 10, 2024. As the CDA API evolves, Oncopacket's interaction with it may change.