Project Overview
Oncopacket is a Python library that harmonizes genetic and clinical cancer data from the National Cancer Institute (NCI) into GA4GH Phenopackets, an ISO standard (ISO 4454:2022) for representing clinical case data.
Goals and Objectives
The primary goal of Oncopacket is to facilitate the integration of cancer research data by:
- Converting data from the Cancer Research Data Commons (CRDC) to the GA4GH Phenopacket Schema
- Enabling downstream analysis using a standardized data format
- Creating a foundation for interoperability with other data sources
Related Projects
The pyphetools project provides a comparable code base for rare diseases, whereas Oncopacket focuses specifically on cancer data.
Data Sources
Oncopacket currently uses:
- The Cancer Data Aggregator (CDA) API to access most cancer data elements
- Direct access to the Genomic Data Commons (GDC) API for specific elements like variant data, cancer stage, and vital status
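As an illustration of the direct GDC access mentioned above, the following is a minimal sketch of how a request for vital status and cancer stage could be assembled for the public GDC `cases` endpoint. It is not Oncopacket's own code: the function name `build_gdc_case_query`, the choice of fields, and the placeholder case ID are assumptions made for this example, and no network request is sent.

```python
import json

# Public GDC endpoint for case records; this sketch only builds the payload.
GDC_CASES_ENDPOINT = "https://api.gdc.cancer.gov/cases"


def build_gdc_case_query(submitter_ids, size=100):
    """Build a POST payload asking GDC for vital status and stage of the given cases.

    Hypothetical helper for illustration; not part of Oncopacket.
    """
    if not submitter_ids:
        raise ValueError("at least one submitter ID is required")
    return {
        # GDC filters are nested operator/content objects.
        "filters": {
            "op": "in",
            "content": {"field": "submitter_id", "value": list(submitter_ids)},
        },
        # Fields are passed as a comma-separated string.
        "fields": ",".join([
            "submitter_id",
            "demographic.vital_status",
            "diagnoses.ajcc_pathologic_stage",
        ]),
        "format": "JSON",
        "size": size,
    }


# "TCGA-XX-0000" is a placeholder, not a real case identifier.
payload = build_gdc_case_query(["TCGA-XX-0000"])
print(json.dumps(payload, indent=2))
```

The resulting dictionary could then be posted to `GDC_CASES_ENDPOINT` as a JSON body with any HTTP client, for example `requests.post(GDC_CASES_ENDPOINT, json=payload)`.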
Project Status and Components
Oncopacket transforms clinical and genomic data from 12 different cancer types into standardized Phenopackets. The current implementation includes:
| Component | Description | Status |
|---|---|---|
| CdaIndividualFactory | Converts subject data to Individual objects | Complete |
| CdaDiseaseFactory | Converts diagnosis data to Disease objects | Complete |
| CdaBiosampleFactory | Handles biological sample data | Complete |
| CdaMutationFactory | Transforms mutation/variant data | Complete |
| CdaMedicalactionFactory | Processes interventions and treatments | Complete |
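To give a sense of the kind of mapping these factories perform, here is a rough sketch that turns a flat subject record into a minimal Phenopacket-style JSON structure with an individual and a disease. It does not use or reproduce the Oncopacket factory classes: the function `to_phenopacket`, the input keys (`subject_id`, `sex`, `vital_status`, `disease_id`, `disease_label`), and the example record are all hypothetical. The output uses plain dictionaries with the JSON field names of Phenopacket Schema version 2, and the `resources` list is left empty, whereas a complete phenopacket would describe each ontology it uses there.

```python
import json
from datetime import datetime, timezone

# Map free-text source values to Phenopacket Schema enum names.
SEX_MAP = {"male": "MALE", "female": "FEMALE"}
VITAL_STATUS_MAP = {"alive": "ALIVE", "dead": "DECEASED"}


def to_phenopacket(record):
    """Convert a flat record (hypothetical keys) into a minimal phenopacket dict."""
    sex = SEX_MAP.get(str(record.get("sex", "")).lower(), "UNKNOWN_SEX")
    status = VITAL_STATUS_MAP.get(
        str(record.get("vital_status", "")).lower(), "UNKNOWN_STATUS"
    )
    return {
        "id": f"phenopacket-{record['subject_id']}",
        "subject": {
            "id": record["subject_id"],
            "sex": sex,
            "vitalStatus": {"status": status},
        },
        "diseases": [
            {"term": {"id": record["disease_id"], "label": record["disease_label"]}}
        ],
        "metaData": {
            "created": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "createdBy": "example-script",  # assumption: any creator label
            "resources": [],  # left empty in this sketch
            "phenopacketSchemaVersion": "2.0",
        },
    }


# Illustrative record; the identifiers below are example values only.
example = {
    "subject_id": "subject-001",
    "sex": "Female",
    "vital_status": "Alive",
    "disease_id": "NCIT:C3512",
    "disease_label": "Lung Adenocarcinoma",
}
print(json.dumps(to_phenopacket(example), indent=2))
```

Values that do not match a known source term fall back to `UNKNOWN_SEX` or `UNKNOWN_STATUS`, so an unexpected input is recorded as unknown rather than mapped incorrectly.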
Current Accomplishments
Oncopacket has been used to generate phenopackets for 23,650 individuals across 12 cancer types; 7,816 of them (about 33%) include detailed mutational data. These datasets are available in a Zenodo repository.
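Counts like these can be recomputed from a local copy of the phenopackets. The sketch below assumes one phenopacket per `.json` file in a single directory and assumes that mutational data is stored under `interpretations[].diagnosis.genomicInterpretations`, as in Phenopacket Schema version 2. Both are assumptions about the layout of the published files, and the function names are hypothetical.

```python
import json
from pathlib import Path


def has_variant_data(phenopacket):
    """Return True if any interpretation carries genomic interpretations."""
    for interpretation in phenopacket.get("interpretations", []):
        diagnosis = interpretation.get("diagnosis", {})
        if diagnosis.get("genomicInterpretations"):
            return True
    return False


def summarize_cohort(directory):
    """Count all phenopackets in a directory and those with variant data."""
    total = 0
    with_variants = 0
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            phenopacket = json.load(handle)
        total += 1
        if has_variant_data(phenopacket):
            with_variants += 1
    return total, with_variants
```

For example, `total, with_variants = summarize_cohort("phenopackets")` would scan a directory named `phenopackets` (a placeholder path) and return both counts as a tuple.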
Future Directions
Future development efforts include:
- Extending the code to incorporate more data elements from CRDC
- Adding support for additional external data sources beyond NCI
- Enhancing the mapping capabilities for oncology terms and concepts
- Developing more comprehensive analysis tools that leverage the phenopacket format
Project Tracking
Development is tracked on the Oncopacket GitHub Project Board.
The code repository is available on GitHub.