Input data

The gpsea analysis must be provided with genotype and phenotype data in a standardized form: an instance of Cohort, which consists of individuals represented as Patient objects.

See also

See the cohort-exploratory section for more info.

The first step of a gpsea analysis is to standardize the genotype and phenotype data and to perform functional annotation of the variants. Here we describe how to prepare a Cohort for the exploratory and downstream analyses.

Create a cohort from GA4GH phenopackets

The easiest way to input data into gpsea is to use phenopackets of the GA4GH Phenopacket Schema. gpsea provides an out-of-the-box solution for loading a cohort from a folder of phenopacket JSON files.

Create cohort creator

Let’s prepare a CohortCreator that will turn a phenopacket collection into a Cohort. The cohort creator also performs input validation, and the validation requires Human Phenotype Ontology data. Let’s start by loading the Human Phenotype Ontology, a requisite for the input Q/C steps. We’ll use the hpo-toolkit library, which is installed along with the standard gpsea installation:

>>> import hpotk
>>> store = hpotk.configure_ontology_store()
>>> hpo = store.load_minimal_hpo(release='v2024-07-01')

The easiest way to get a CohortCreator is to use the configure_caching_cohort_creator() convenience function:

>>> from gpsea.preprocessing import configure_caching_cohort_creator

>>> cohort_creator = configure_caching_cohort_creator(hpo)

Note

The default CohortCreator will call the Variant Effect Predictor and UniProt APIs to perform functional and protein annotation, and the responses will be cached in the current working directory to reduce network traffic. See the configure_caching_cohort_creator() pydoc for more options.

Load phenopackets

We can create a cohort starting from a collection of Phenopacket objects provided by the Python Phenopackets library. For the purpose of this example, we will load a cohort of patients with pathogenic mutations in the RERE gene, included in release 0.1.18 of Phenopacket Store. We use the Phenopacket Store Toolkit (ppktstore in the code) to reduce the boilerplate code needed to load the phenopackets:

>>> from ppktstore.registry import configure_phenopacket_registry
>>> registry = configure_phenopacket_registry()
>>> with registry.open_phenopacket_store(release='0.1.18') as ps:
...     phenopackets = tuple(ps.iter_cohort_phenopackets('RERE'))
>>> len(phenopackets)
19

We loaded 19 phenopackets. Now we can turn the phenopackets into a Cohort using the cohort_creator and the load_phenopackets() loader function:

>>> from gpsea.preprocessing import load_phenopackets
>>> cohort, qc_results = load_phenopackets(phenopackets, cohort_creator)  
Individuals Processed: ...
>>> len(cohort)
19

The cohort includes all 19 individuals. Besides the cohort, the loader function also returns qc_results with the Q/C results. We call summarize() to display the Q/C summary:

>>> qc_results.summarize()  
Validated under none policy
No errors or warnings were found

Alternative phenopacket sources

If you do not already have a Phenopacket collection at your fingertips, gpsea provides a few other convenience functions for loading phenopackets from JSON files.

The load_phenopacket_files() function can be used to load one or more phenopacket JSON files:

>>> from gpsea.preprocessing import load_phenopacket_files
>>> pp_files = ('path/to/phenopacket1.json', 'path/to/phenopacket2.json')
>>> cohort, qc_results = load_phenopacket_files(pp_files, cohort_creator)  

or you can load an entire directory of JSON files with load_phenopacket_folder():

>>> from gpsea.preprocessing import load_phenopacket_folder
>>> pp_dir = 'path/to/folder/with/many/phenopacket/json/files'
>>> cohort, qc_results = load_phenopacket_folder(pp_dir, cohort_creator)
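If the phenopacket JSON files are scattered across nested subfolders, load_phenopacket_folder() may not be enough, but you can assemble the file list yourself with the standard library and hand it to load_phenopacket_files(). The following is a sketch, not part of the gpsea API; the folder layout and file names are hypothetical stand-ins created only to make the example self-contained:

```python
import tempfile
from pathlib import Path

# Set up a throwaway folder with two mock phenopacket JSON files
# (minimal stand-ins; real phenopackets carry much richer content).
pp_dir = Path(tempfile.mkdtemp())
(pp_dir / "nested").mkdir()
(pp_dir / "patient_1.json").write_text("{}")
(pp_dir / "nested" / "patient_2.json").write_text("{}")

# Recursively collect the JSON file paths. The resulting tuple has the
# same shape as the `pp_files` tuple above and could be passed to
# load_phenopacket_files() together with a cohort creator, e.g.:
#   cohort, qc_results = load_phenopacket_files(pp_files, cohort_creator)
pp_files = tuple(sorted(str(p) for p in pp_dir.rglob("*.json")))
```

Sorting the paths keeps the individual order deterministic across runs.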