Project Overview

Oncopacket is a Python library designed to harmonize genetic and clinical cancer data from the National Cancer Institute (NCI) into GA4GH Phenopackets, an ISO standard for representing clinical case data.

Goals and Objectives

The primary goal of Oncopacket is to facilitate the integration of cancer research data by:

  1. Converting data from the Cancer Research Data Commons (CRDC) to the GA4GH Phenopacket Schema
  2. Enabling downstream analysis using a standardized data format
  3. Creating a foundation for interoperability with other data sources
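
To make the target format concrete, here is a minimal sketch of what a phenopacket for one cancer case might look like as JSON. It uses plain dictionaries with Phenopacket Schema v2 field names rather than Oncopacket's own classes. The function name, the subject identifier, the `createdBy` value, and the resource version are illustrative assumptions, and the result has not been validated against the schema.

```python
import json
from datetime import datetime, timezone


def build_minimal_phenopacket(subject_id: str, sex: str,
                              disease_id: str, disease_label: str) -> dict:
    """Assemble a phenopacket-like dict (hypothetical helper, not Oncopacket API)."""
    return {
        "id": f"phenopacket-{subject_id}",
        # The subject block describes the individual.
        "subject": {"id": subject_id, "sex": sex},
        # Each disease is identified by an ontology term (CURIE plus label).
        "diseases": [{"term": {"id": disease_id, "label": disease_label}}],
        "metaData": {
            "created": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "createdBy": "example-script",  # assumption: any identifier string
            "resources": [
                {
                    "id": "ncit",
                    "name": "NCI Thesaurus",
                    "url": "http://purl.obolibrary.org/obo/ncit.owl",
                    "version": "UNSPECIFIED",  # placeholder, fill in the release used
                    "namespacePrefix": "NCIT",
                    "iriPrefix": "http://purl.obolibrary.org/obo/NCIT_",
                }
            ],
            "phenopacketSchemaVersion": "2.0",
        },
    }


# Illustrative values only; the subject is invented for this example.
packet = build_minimal_phenopacket(
    "subject-001", "FEMALE", "NCIT:C3512", "Lung Adenocarcinoma"
)
print(json.dumps(packet, indent=2))
```

Because the result is ordinary JSON, downstream tools can read it without knowing anything about the source system it came from.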

The pyphetools project provides a comparable code base for rare diseases, while Oncopacket focuses specifically on cancer data.

Data Sources

Oncopacket currently uses:

  1. The Cancer Data Aggregator (CDA) API to access most cancer data elements
  2. Direct access to the Genomic Data Commons (GDC) API for specific elements like variant data, cancer stage, and vital status
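
As a rough illustration of the second access path, the sketch below builds a request URL for the public GDC `cases` endpoint that asks for vital status and pathologic stage. It only constructs the query and does not send it. The function names and the choice of project are assumptions made for this example, and this is not necessarily how Oncopacket itself queries GDC.

```python
import json
from urllib.parse import urlencode

GDC_CASES_ENDPOINT = "https://api.gdc.cancer.gov/cases"


def build_gdc_cases_params(project_id: str, size: int = 10) -> dict:
    """Build query parameters for the GDC cases endpoint (hypothetical helper)."""
    # GDC filters are JSON objects with an operator and a field/value pair.
    filters = {
        "op": "in",
        "content": {"field": "project.project_id", "value": [project_id]},
    }
    return {
        "filters": json.dumps(filters),
        # Comma-separated list of fields to return for each case.
        "fields": ",".join([
            "submitter_id",
            "demographic.vital_status",
            "diagnoses.ajcc_pathologic_stage",
        ]),
        "format": "JSON",
        "size": str(size),
    }


def build_gdc_cases_url(project_id: str, size: int = 10) -> str:
    """Return the full GET URL; pass it to any HTTP client to run the query."""
    return f"{GDC_CASES_ENDPOINT}?{urlencode(build_gdc_cases_params(project_id, size))}"


params = build_gdc_cases_params("TCGA-LUAD", size=5)
url = build_gdc_cases_url("TCGA-LUAD", size=5)
print(url)
```

Keeping query construction separate from the HTTP call makes the request easy to inspect and test without network access.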

Project Status and Components

Oncopacket transforms clinical and genomic data from 12 different cancer types into standardized Phenopackets. The current implementation includes:

| Component | Description | Status |
|---|---|---|
| CdaIndividualFactory | Converts subject data to Individual objects | Complete |
| CdaDiseaseFactory | Converts diagnosis data to Disease objects | Complete |
| CdaBiosampleFactory | Handles biological sample data | Complete |
| CdaMutationFactory | Transforms mutation/variant data | Complete |
| CdaMedicalactionFactory | Processes interventions and treatments | Complete |
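
The factory idea behind these components can be sketched as follows: each factory takes one row of source data and returns one phenopacket element. The class below is a hypothetical stand-in written for illustration. It is not the actual `CdaIndividualFactory`, and the column names `subject_id`, `sex`, and `vital_status` are assumptions about the input.

```python
from typing import Mapping, Optional

# Map free-text values to Phenopacket Schema enum names.
_SEX_MAP = {"male": "MALE", "m": "MALE", "female": "FEMALE", "f": "FEMALE"}
_VITAL_STATUS_MAP = {"alive": "ALIVE", "dead": "DECEASED", "deceased": "DECEASED"}


class ToyIndividualFactory:
    """Hypothetical example factory; not part of Oncopacket."""

    def to_individual(self, row: Mapping[str, Optional[str]]) -> dict:
        """Convert one subject row into an Individual-like dict."""
        subject_id = row.get("subject_id")
        if not subject_id:
            raise ValueError("row is missing a subject_id")

        # Normalize inputs so that case and stray whitespace do not matter.
        raw_sex = (row.get("sex") or "").strip().lower()
        raw_status = (row.get("vital_status") or "").strip().lower()

        return {
            "id": subject_id,
            # Unrecognized or missing values fall back to the schema's unknowns.
            "sex": _SEX_MAP.get(raw_sex, "UNKNOWN_SEX"),
            "vitalStatus": {
                "status": _VITAL_STATUS_MAP.get(raw_status, "UNKNOWN_STATUS")
            },
        }


factory = ToyIndividualFactory()
individual = factory.to_individual(
    {"subject_id": "subject-001", "sex": " Female ", "vital_status": "Dead"}
)
print(individual)
```

One small factory per element type keeps each mapping rule isolated, so a change in how one source field is encoded touches only one class.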

Current Accomplishments

Oncopacket has been used to generate phenopackets for 23,650 individuals across 12 cancer types, with 7,816 of those having detailed mutational data. These datasets are available in a Zenodo repository.

Future Directions

Future development efforts include:

  1. Extending the code to incorporate more data elements from CRDC
  2. Adding support for additional external data sources beyond NCI
  3. Enhancing the mapping capabilities for oncology terms and concepts
  4. Developing more comprehensive analysis tools that leverage the phenopacket format

Project Tracking

Development is tracked on the Oncopacket GitHub Project Board.

The code repository is available on GitHub.