Skip to content

Comparison with HPOA ETL Process

Background

The Human Phenotype Ontology Annotation (HPOA) team previously processed OMIM gene-to-disease associations via MedGen's genes_to_disease.txt file. This document compares our direct OMIM morbidmap.txt ingest with HPOA's approach.

Data Sources

Pipeline Source File Data Origin Gene IDs Rows
HPOA genes_to_disease.txt MedGen via OMIM NCBIGene 16,966
HPOA Transformed hpoa_gene_to_disease_edges.tsv HPOA ETL (infores:omim only) NCBIGene 9,625
Our Direct Ingest morbidmap.txt OMIM directly OMIM gene IDs 5,990

Key Differences

1. Predicate Distribution

HPOA Edges (infores:omim): - biolink:causes: 9,029 (93.81%) - biolink:contributes_to: 596 (6.19%)

Our Direct OMIM Ingest (current): - biolink:causes: 6,983 (91.35%) - biolink:contributes_to: 661 (8.65%)

2. Association Type Mapping

HPOA Approach (from genes_to_disease.txt):

association_type Count HPOA Predicate HPOA Category
MENDELIAN 8,164 biolink:causes (primary) CausalGeneToDiseaseAssociation
MENDELIAN - biolink:contributes_to (some) CorrelatedGeneToDiseaseAssociation
POLYGENIC 586 biolink:contributes_to CorrelatedGeneToDiseaseAssociation
UNKNOWN 8,216 Varies -

Our Approach (from morbidmap.txt):

OMIM Marker Predicate Category
Confidence (3), no {} biolink:causes CausalGeneToDiseaseAssociation
Confidence (1,2), no {} biolink:contributes_to CorrelatedGeneToDiseaseAssociation
Any {} marker biolink:contributes_to CorrelatedGeneToDiseaseAssociation

Predicate Agreement Analysis

Disease Coverage (with newer data)

  • Overlapping diseases: 6,810
  • HPOA only: 38
  • Our ingest only: 46

Actual Predicate Agreement: 91.1%

  • Total disease-level matches: 15,352
  • Agreements: 13,979 (91.1%)
  • Disagreements: 1,373 (8.9%)

Agreement Matrix

HPOA Predicate Our Predicate Count % of Matches
biolink:causes biolink:causes 11,356 73.97% ✓
biolink:contributes_to biolink:contributes_to 2,623 17.09% ✓
biolink:causes biolink:contributes_to 761 4.96%
biolink:contributes_to biolink:causes 612 3.99%

Total Agreement: 13,979 / 15,352 = 91.1%

Understanding the Agreement

The 91.1% agreement reflects strong alignment where: - Both use causes for non-susceptibility confidence (3) cases (73.97%) - Both use contributes_to for susceptibility/POLYGENIC cases (17.09%)

The 8.9% disagreement represents: - 4.96%: We use contributes_to (detected {}), HPOA uses causes (classified as MENDELIAN) - 3.99%: We use causes (no {}), HPOA uses contributes_to (classified as POLYGENIC)

These differences likely stem from: 1. Subtle discrepancies between OMIM's {} markers and HPOA's POLYGENIC classification 2. Data differences between morbidmap.txt and MedGen's genes_to_disease.txt

HPOA Alignment Strategy

Our Approach

We align with HPOA by respecting susceptibility markers {}, which correspond to HPOA's POLYGENIC classification:

Excellent agreement: 91.1% predicate alignment ✅ Semantic precision: Captures OMIM's susceptibility semantics ✅ HPOA compatibility: Matches POLYGENIC → contributes_to pattern ✅ Two-predicate model: Uses causes/contributes_to

How It Aligns with HPOA

HPOA's classification: - MENDELIAN → causes - POLYGENIC (includes susceptibility) → contributes_to

Our mapping: - Confidence (3), no {}causes (matches MENDELIAN) - Any {} marker → contributes_to (matches POLYGENIC)

This works because OMIM's {} susceptibility markers correspond to HPOA's POLYGENIC classification.

Recommendations

For Users

Use this direct OMIM ingest for: - Better semantic precision (respects OMIM susceptibility markers) - Excellent alignment with HPOA (91.1%) - Direct parsing of morbidmap.txt (no MedGen intermediary) - Proper handling of susceptibility relationships

Key Advantage

By parsing {} markers directly from morbidmap.txt, we achieve alignment with HPOA's POLYGENIC classification without relying on MedGen's intermediate association_type field.

Data Quality Comparison

Aspect HPOA ETL Our Direct Ingest
Predicate agreement Baseline 91.1%
Predicate usage 2 predicates 2 predicates
Susceptibility handling POLYGENIC classification {} marker detection
Semantic precision Via association_type Direct from OMIM
RO alignment RO:0002326, RO:0003303 RO:0002326, RO:0003303
Source MedGen (indirect) morbidmap.txt (direct)

Conclusion

Our susceptibility-aware approach achieves 91.1% predicate agreement with HPOA by aligning {} markers with HPOA's POLYGENIC classification.

Key advantages: - ✅ Semantic precision: Directly interprets OMIM's susceptibility markers - ✅ Excellent HPOA alignment: 91.1% agreement - ✅ Direct source: Parses morbidmap.txt without MedGen intermediary - ✅ Explicit logic: Clear handling of susceptibility cases - ✅ Similar distribution: 91.35% causes vs HPOA's 93.81%

The 8.9% disagreement is minimal and reflects edge cases where OMIM's {} markers and HPOA's POLYGENIC classification diverge slightly, plus data differences between morbidmap.txt and MedGen's genes_to_disease.txt.

Analysis Methodology

This comparison was performed using DuckDB to: 1. Load HPOA genes_to_disease.txt (source data) 2. Load HPOA transformed edges filtered to infores:omim 3. Load our OMIM ingest output 4. Join on disease IDs (object) to compare predicates 5. Analyze disagreement patterns by HPOA association_type

Analysis code and detailed results available upon request.