Phenopacket Store Summary

phenopacket-store-toolkit provides functions to generate various summaries and visualizations of the phenopackets and cohorts contained in a release. We use these functions to generate a summary of each Phenopacket-Store release.

Interested users should study the notebook file to learn how to access the functions.

Display cohorts sorted by size

In addition to the functions shown in the notebook, the following function generates a pandas DataFrame with all cohorts sorted by size.

import zipfile
from ppktstore.model import PhenopacketStore
from ppktstore.release.stats import PPKtStoreStats

# Adjust this path to point at your copy of the release archive.
ppkt_zip_path = "../your/path/all_phenopackets.zip"
# Load the store from the release ZIP file.
with zipfile.ZipFile(ppkt_zip_path) as zf:
    store = PhenopacketStore.from_release_zip(zf)
stats = PPKtStoreStats(store)
df = stats.get_disease_count_table()
df.head()
# ...

Note that some of the cohorts (which are usually gene-based) contain multiple disease entities. To get the total counts per cohort, the following code can be used.

# Sum the per-disease counts within each cohort.
df_grouped = df.groupby('cohort')['count'].sum().reset_index()
# Sort cohorts from largest to smallest and renumber the rows.
df_sorted = df_grouped.sort_values(by='count', ascending=False)
df_sorted = df_sorted[["cohort", "count"]].reset_index(drop=True)

This produces a table similar to the following.

cohort	count
STXBP1	463
SCN2A	393
ANKRD11	337
RPGRIP1	229
SATB2	158
TBX5	156
…	…
MAF	1
OCA2	1
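
To try the grouping step without downloading a release, here is a minimal sketch on a small, made-up DataFrame. The cohort names, the counts, the `disease` column name, and the helper `total_counts_per_cohort` are all assumptions for illustration. Only the `cohort` and `count` columns are taken from the code above, and the real table returned by `get_disease_count_table()` may contain other columns.

```python
import pandas as pd

# Made-up rows for illustration only; this is not Phenopacket Store data.
# The 'disease' column name is an assumption. 'cohort' and 'count' are the
# column names used in the grouping code above.
toy = pd.DataFrame(
    {
        "cohort": ["GENE_A", "GENE_A", "GENE_B", "GENE_C"],
        "disease": ["disease 1", "disease 2", "disease 3", "disease 4"],
        "count": [5, 7, 20, 1],
    }
)


def total_counts_per_cohort(df: pd.DataFrame) -> pd.DataFrame:
    """Sum 'count' per cohort and sort cohorts from largest to smallest."""
    return (
        df.groupby("cohort", as_index=False)["count"]
        .sum()
        .sort_values(by="count", ascending=False)
        .reset_index(drop=True)
    )


totals = total_counts_per_cohort(toy)
print(totals)
```

In this toy input, GENE_A appears with two disease entities, so its two rows are summed into a single cohort total before sorting.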