Validation
Many types of errors can occur in phenopackets. The Java application phenopacket-tools is a general-purpose app for validating and working with phenopackets. pyphetools provides a limited number of commands for checking the validity of generated phenopackets; these can be used conveniently within the notebooks that create the phenopackets.
Commonly encountered errors include redundancy and inheritance conflicts.
Mistaken HPO identifiers or labels
Sometimes a phenopacket may contain an obsolete HPO id or a spelling error in the label.
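To illustrate what such a check involves, the following is a minimal sketch and not the pyphetools implementation. The dictionary `CURRENT_TERMS` and the function `check_term` are hypothetical names introduced here; a real check would look terms up in the loaded HPO instead of a hand-written dictionary.

```python
import re

# Hypothetical stand-in for the current HPO release: id -> primary label.
CURRENT_TERMS = {
    "HP:0000518": "Cataract",
    "HP:0100018": "Nuclear cataract",
}

# HPO identifiers consist of the prefix "HP:" followed by seven digits.
HPO_ID_PATTERN = re.compile(r"^HP:\d{7}$")


def check_term(hpo_id, label, current_terms=CURRENT_TERMS):
    """Return a list of problems found for one (id, label) pair."""
    problems = []
    if not HPO_ID_PATTERN.match(hpo_id):
        problems.append(f"malformed id: {hpo_id}")
    elif hpo_id not in current_terms:
        problems.append(f"id not found (possibly obsolete): {hpo_id}")
    elif current_terms[hpo_id] != label:
        expected = current_terms[hpo_id]
        problems.append(
            f"label mismatch for {hpo_id}: got '{label}', expected '{expected}'"
        )
    return problems
```

An empty list means the pair passed all three checks; a misspelled label such as "Cataracts" for HP:0000518 yields a single label-mismatch message.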
Redundant terms
If an individual is found to have Nuclear cataract (HP:0100018), which means an opacity that develops in the nucleus of the lens of the eye, then the individual can always be said to have a Cataract (HP:0000518), which refers to an opacity anywhere in the lens of the eye. This follows from the so-called true-path rule of ontologies, according to which if an HPO term is used to annotate an individual, then the parent of that term and all of the ancestors of that term must also apply. In this case, Cataract is a grandparent of Nuclear cataract.
Because of this, if we have annotated with Nuclear cataract (HP:0100018), it is not necessary to annotate with Cataract (HP:0000518), because it is implicitly true.
We therefore recommend that only the most specific HPO term be used for a time point.
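The true-path rule can be sketched with a toy ontology. This is an illustration under stated assumptions, not how pyphetools implements the check: the `PARENTS` map, the helper names, and in particular the intermediate id `HP:INTERMEDIATE` are placeholders made up for this example (the real term between the two is not given here).

```python
# Toy child -> parents map. "HP:INTERMEDIATE" is a placeholder, not a real
# HPO id; it stands in for the term between Nuclear cataract and Cataract.
PARENTS = {
    "HP:0100018": ["HP:INTERMEDIATE"],  # Nuclear cataract
    "HP:INTERMEDIATE": ["HP:0000518"],  # placeholder intermediate term
    "HP:0000518": [],                   # Cataract
}


def ancestors(term_id, parents=PARENTS):
    """Collect every ancestor of term_id by walking the parent links."""
    seen = set()
    stack = list(parents.get(term_id, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def remove_redundant(observed, parents=PARENTS):
    """Keep only terms that are not an ancestor of another observed term."""
    observed = set(observed)
    implied = set()
    for term_id in observed:
        implied |= ancestors(term_id, parents)
    return observed - implied
```

Given both Nuclear cataract and Cataract as observed terms, `remove_redundant` keeps only the more specific Nuclear cataract, since Cataract is implied by it.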
Conflicting terms
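In some datasets we have seen, a patient is annotated with a specific term for an organ, while abnormalities are also marked as excluded at a higher level. For instance, we might see Ventricular septal hypertrophy (HP:0005144) as observed but Abnormal heart morphology (HP:0001627) as excluded. Because the true-path rule implies that the ancestor term applies whenever the specific term does, these two annotations contradict each other.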
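A conflict of this kind can be detected by checking whether any ancestor of an observed term appears among the excluded terms. The sketch below is a toy illustration, not the pyphetools code; `PARENTS`, `ancestors`, `find_conflicts`, and the id `HP:PLACEHOLDER` (which stands in for whatever intermediate terms lie between the two real terms) are assumptions made for the example.

```python
# Toy child -> parents map. "HP:PLACEHOLDER" is not a real HPO id; it stands
# in for the intermediate terms between the two real terms below.
PARENTS = {
    "HP:0005144": ["HP:PLACEHOLDER"],  # Ventricular septal hypertrophy
    "HP:PLACEHOLDER": ["HP:0001627"],  # placeholder intermediate term(s)
    "HP:0001627": [],                  # Abnormal heart morphology
}


def ancestors(term_id, parents=PARENTS):
    """Collect every ancestor of term_id by walking the parent links."""
    seen = set()
    stack = list(parents.get(term_id, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def find_conflicts(observed, excluded, parents=PARENTS):
    """Return (observed_term, excluded_ancestor) pairs that contradict."""
    excluded = set(excluded)
    conflicts = []
    for term_id in sorted(set(observed)):
        for ancestor_id in sorted(ancestors(term_id, parents) & excluded):
            conflicts.append((term_id, ancestor_id))
    return conflicts
```

For the example above, `find_conflicts(["HP:0005144"], ["HP:0001627"])` reports the pair of the observed specific term and its excluded ancestor.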
QC with pyphetools
We recommend checking all generated phenopackets with the following steps. First obtain the list of Individual objects. Pass this list together with a reference to the HPO to a CohortValidator object. To display the results of validation, use a QcVisualizer.
The QcVisualizer can show either a list of all issues with the to_html method or a summary of issues with the to_summary_html method.
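from IPython.display import display, HTML
from pyphetools.validation import CohortValidator  # import path may vary by pyphetools version
from pyphetools.visualization import QcVisualizer  # import path may vary by pyphetools version

# encoder and hpo_ontology are assumed to have been created earlier in the notebook
individuals = encoder.get_individuals()
cvalidator = CohortValidator(cohort=individuals, ontology=hpo_ontology, min_hpo=1)
qc = QcVisualizer(ontology=hpo_ontology, cohort_validator=cvalidator)
display(HTML(qc.to_html()))
# alternatively: display(HTML(qc.to_summary_html()))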
This will either display a message that no errors were found, or will show a table as in the following figure.
Some kinds of errors need to be corrected in the notebook, such as malformed HPO ids or labels. Others can be corrected automatically, such as redundant terms.
Category | Explanation | Autocorrect? |
---|---|---|
REDUNDANT | HPO term and ancestor term both reported as observed | Yes (redundant ancestor term removed) |
CONFLICT | HPO term observed and ancestor term excluded | Yes (conflicting ancestor term removed) |
INSUFFICIENT_HPOS | Individual has fewer than the threshold number of HPO terms | Yes (individual removed from cohort) |
INCORRECT_ALLELE_COUNT | Number of alleles unexpected given mode of inheritance | No |
INCORRECT_VARIANT_COUNT | Number of variants unexpected given mode of inheritance | No |
MALFORMED_ID | HPO id (e.g., HP:0001234) obsolete or incorrect | No |
MALFORMED_LABEL | Label obsolete or incorrect | No |
NOT_MEASURED | HPO term reported as not measured | Yes (not measured term removed) |
OBSERVED_AND_EXCLUDED | Same HPO term reported as observed and excluded | No |
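The table can be read as a simple lookup from category to whether the issue is fixed automatically. The sketch below encodes it as a plain dictionary in order to list which reported categories still need manual attention. Representing categories as strings, and the names `AUTOCORRECT` and `needs_manual_fix`, are assumptions made for illustration; pyphetools may represent categories differently internally.

```python
# Category -> whether the issue is autocorrected, transcribed from the table.
AUTOCORRECT = {
    "REDUNDANT": True,
    "CONFLICT": True,
    "INSUFFICIENT_HPOS": True,
    "INCORRECT_ALLELE_COUNT": False,
    "INCORRECT_VARIANT_COUNT": False,
    "MALFORMED_ID": False,
    "MALFORMED_LABEL": False,
    "NOT_MEASURED": True,
    "OBSERVED_AND_EXCLUDED": False,
}


def needs_manual_fix(categories):
    """Return the distinct categories that are not autocorrected, sorted.

    Raises KeyError for a category that is not listed in the table.
    """
    return sorted(c for c in set(categories) if not AUTOCORRECT[c])
```

For example, a cohort whose issues are REDUNDANT and MALFORMED_ID would still require the malformed id to be fixed in the notebook.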
The issues that are shown as Autocorrect = Yes will be fixed automatically with the following command.
individuals = cvalidator.get_error_free_individual_list()
Individuals with issues that cannot be autocorrected are removed from the cohort; such issues should usually be fixed in the notebook before proceeding.
If desired, it is possible to double-check that these individuals have no errors by doing another round of checks with the CohortValidator.