koza
Details
| GitHub | monarch-initiative/koza |
| Language | Python |
| Description | Data transformation framework for LinkML data models |
Dependencies
External Dependencies
| Package | Version |
|---|---|
| coverage | >=7.13.0 |
| duckdb | |
| loguru | |
| biolink-model | >=4.3.6 |
| mergedeep | ==1.3.4 |
| ordered-set | >=4.1.0 |
| pydantic | >=2.12.5 |
| pyyaml | >=6.0.3 |
| requests | >=2.32.5 |
| sssom | >=0.4 |
| tqdm | >=4.67.1 |
| typer | >=0.20.0 |
| LinkML | >=1.9.0 |
Documentation
Koza - Knowledge Graph Transformation and Operations Toolkit
Disclaimer: Koza is in beta - we are looking for testers!
Overview
Koza is a Python library and CLI tool for transforming biomedical data and performing graph operations on Knowledge Graph Exchange (KGX) files. It provides two main capabilities:
Graph Operations (New!)
Powerful DuckDB-based operations for KGX knowledge graphs:
- Join multiple KGX files with schema harmonization
- Split files by field values with format conversion
- Prune dangling edges and handle singleton nodes
- Append new data to existing databases with schema evolution
- Multi-format support for TSV, JSONL, and Parquet files
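To make the pruning step concrete, here is a minimal framework-free sketch (plain Python, not Koza's actual DuckDB implementation) of dropping dangling edges — edges whose subject or object has no matching node — with an optional pass that also drops singleton nodes:

```python
def prune(nodes, edges, keep_singletons=True):
    """Drop edges whose subject or object is not in the node set.

    nodes: list of dicts with an 'id' key
    edges: list of dicts with 'subject' and 'object' keys
    """
    node_ids = {n["id"] for n in nodes}
    kept_edges = [
        e for e in edges
        if e["subject"] in node_ids and e["object"] in node_ids
    ]
    if not keep_singletons:
        # Drop nodes that no surviving edge references
        used = {e["subject"] for e in kept_edges} | {e["object"] for e in kept_edges}
        nodes = [n for n in nodes if n["id"] in used]
    return nodes, kept_edges


nodes = [{"id": "HGNC:1"}, {"id": "HGNC:2"}, {"id": "HGNC:3"}]
edges = [
    {"subject": "HGNC:1", "object": "HGNC:2"},
    {"subject": "HGNC:1", "object": "HGNC:99"},  # dangling: HGNC:99 has no node
]
kept_nodes, kept_edges = prune(nodes, edges, keep_singletons=False)
```

Here the dangling edge is removed, and `HGNC:3` (now a singleton) is dropped because `keep_singletons` is off.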
Data Transformation (Core)
Transform biomedical data sources into KGX format:
- Transform csv, json, yaml, jsonl, and xml to target formats
- Output in KGX format
- Write data transforms in semi-declarative Python
- Configure source files, columns/properties, and metadata in YAML
- Create mapping files and translation tables between vocabularies
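As a hedged illustration of what such a transform produces (the `protein1`/`protein2` column names are hypothetical, and the real wiring to source files is configured in YAML), one row-to-KGX mapping could look like:

```python
def transform_row(row):
    """Turn one source row into KGX-style node and edge dicts.
    Column names here are illustrative, not Koza's API."""
    nodes = [
        {"id": row["protein1"], "category": "biolink:Protein"},
        {"id": row["protein2"], "category": "biolink:Protein"},
    ]
    edge = {
        "subject": row["protein1"],
        "predicate": "biolink:interacts_with",
        "object": row["protein2"],
    }
    return nodes, edge


row = {"protein1": "ENSP00000000233", "protein2": "ENSP00000272298", "score": "490"}
nodes, edge = transform_row(row)
```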
Installation
Koza is available on PyPI and can be installed via pip or pipx:
pip install koza
# or
pipx install koza
Usage
Quick Start with Graph Operations
Koza's graph operations work seamlessly across multiple KGX formats (TSV, JSONL, Parquet):
# Join multiple KGX files into a unified database
koza join --nodes genes.tsv pathways.jsonl --edges interactions.parquet --output merged_graph.duckdb
# Prune dangling edges and handle singleton nodes
koza prune --database merged_graph.duckdb --keep-singletons
# Append new data to existing database with schema evolution
koza append --database merged_graph.duckdb --nodes new_genes.tsv --edges new_interactions.jsonl
# Split database by source with format conversion
koza split --database merged_graph.duckdb --split-on provided_by --output-format parquet
NOTE: As of version 0.2.0, there is a new method for getting your ingest's KozaApp instance. Please see the updated documentation for details.
See the Koza documentation for complete usage information
Examples
Validate
Give Koza a local or remote CSV file to get basic information about it (headers, number of rows):
koza validate \
--file https://raw.githubusercontent.com/monarch-initiative/koza/main/examples/data/string.tsv \
--delimiter ' '
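Conceptually, this inspection step boils down to reading the header row and counting the remaining rows — a stdlib sketch (not Koza's implementation):

```python
import csv
import io


def basic_info(text, delimiter="\t"):
    """Report headers and row count for delimited text."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    headers = next(reader)
    n_rows = sum(1 for _ in reader)
    return headers, n_rows


# Space-delimited sample mirroring the STRING example above
sample = "protein1 protein2 combined_score\nA B 100\nC D 200\n"
headers, n_rows = basic_info(sample, delimiter=" ")
```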
Passing a JSON or JSONL file will confirm whether it is valid JSON or JSONL:
koza validate \
--file ./examples/data/ZFIN_PHENOTYPE_0.jsonl.gz \
--format jsonl
koza validate \
--file ./examples/data/ddpheno.json.gz \
--format json
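JSONL validity amounts to checking that every non-empty line parses as a standalone JSON document; a minimal sketch:

```python
import json


def is_valid_jsonl(text):
    """Return True if every non-empty line parses as JSON."""
    try:
        for line in text.splitlines():
            if line.strip():
                json.loads(line)
    except json.JSONDecodeError:
        return False
    return True


good = '{"id": "ZFIN:1"}\n{"id": "ZFIN:2"}\n'
bad = '{"id": "ZFIN:1"}\nnot json\n'
```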
Transform
Run the example ingest, "string/protein-links-detailed"
koza transform \
--source examples/string/protein-links-detailed.yaml \
--global-table examples/translation_table.yaml
koza transform \
--source examples/string-declarative/protein-links-detailed.yaml \
--global-table examples/translation_table.yaml
Note: Koza expects a directory structure as in the example above, with the source config file and transform code in the same directory:
.
├── ...
│   ├── your_source
│   │   ├── your_ingest.yaml
│   │   └── your_ingest.py
│   └── some_translation_table.yaml
└── ...
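Inside `your_ingest.py`, the transform typically pulls rows from a KozaApp instance and writes entities back to it. The stand-in class below mimics that loop in plain Python (illustrative only — the real KozaApp is obtained from Koza itself, and the import path depends on your Koza version, as the 0.2.0 note above mentions):

```python
class FakeKozaApp:
    """Stand-in for a KozaApp: feeds rows in, collects entities out.
    (Illustrative only; not Koza's actual class.)"""

    def __init__(self, rows):
        self._rows = iter(rows)
        self.written = []

    def get_row(self):
        return next(self._rows)

    def write(self, *entities):
        self.written.extend(entities)


app = FakeKozaApp([{"protein1": "P1", "protein2": "P2"}])
row = app.get_row()
app.write(
    {"id": row["protein1"], "category": "biolink:Protein"},
    {"id": row["protein2"], "category": "biolink:Protein"},
)
```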
Graph Operations
Create and manipulate knowledge graphs from existing KGX files:
# Join heterogeneous KGX files with automatic schema harmonization
koza join \
--nodes genes.tsv proteins.jsonl pathways.parquet \
--edges gene_protein.tsv protein_pathway.jsonl \
--output unified_graph.duckdb \
--schema-report
# Clean up graph integrity issues
koza prune \
--database unified_graph.duckdb \
--keep-singletons \
--dry-run # Preview changes before applying
# Incrementally add new data with schema evolution
koza append \
--database unified_graph.duckdb \
--nodes new_genes.tsv updated_pathways.jsonl \
--deduplicate \
--show-progress
# Export subsets with format conversion
koza split \
--database unified_graph.duckdb \
--split-on provided_by \
--output-format parquet \
--output-dir ./split_graphs
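The split operation is, conceptually, a group-by over a record field; a stdlib sketch of what `--split-on provided_by` does (minus the DuckDB backing and format conversion):

```python
from collections import defaultdict


def split_edges(edges, split_on="provided_by"):
    """Group edge records by the value of a field."""
    groups = defaultdict(list)
    for edge in edges:
        groups[edge.get(split_on, "unknown")].append(edge)
    return dict(groups)


edges = [
    {"subject": "A", "object": "B", "provided_by": "infores:string"},
    {"subject": "C", "object": "D", "provided_by": "infores:zfin"},
    {"subject": "E", "object": "F", "provided_by": "infores:string"},
]
groups = split_edges(edges)
```

Each group would then be written out in the requested output format (TSV, JSONL, or Parquet).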
Key Features
Multi-Format Support
- Native support for TSV, JSONL, and Parquet KGX files
- Automatic format detection and conversion
- Mixed-format operations in single commands
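A hedged sketch of extension-based format detection (the extension map here is hypothetical; Koza's real detection logic may differ), including gzipped inputs like the `.jsonl.gz` example above:

```python
from pathlib import Path

# Hypothetical extension map for illustration
KGX_FORMATS = {".tsv": "tsv", ".jsonl": "jsonl", ".parquet": "parquet"}


def detect_format(path):
    """Guess a KGX file format from its extension, ignoring .gz."""
    suffixes = [s for s in Path(path).suffixes if s != ".gz"]
    return KGX_FORMATS.get(suffixes[-1]) if suffixes else None
```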
Schema Flexibility
- Automatic schema harmonization across heterogeneous files
- Schema evolution with backward compatibility
- Comprehensive schema reporting and validation
High Performance
- DuckDB-powered operations for fast bulk processing
- Memory-efficient handling of large knowledge graphs
- Parallel processing and streaming where possible
Rich CLI Experience
- Progress indicators for long-running operations
- Detailed statistics and operation summaries
- Dry-run modes for safe operation preview
Data Integrity
- Dangling edge detection and preservation
- Duplicate detection and removal strategies
- Non-destructive operations with data archiving
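One simple deduplication strategy is keep-first-by-id, sketched below (Koza's `--deduplicate` may instead merge properties or apply other strategies; this is only an illustration):

```python
def deduplicate_nodes(nodes):
    """Keep the first occurrence of each node id."""
    seen = set()
    unique = []
    for node in nodes:
        if node["id"] not in seen:
            seen.add(node["id"])
            unique.append(node)
    return unique


deduped = deduplicate_nodes([{"id": "A"}, {"id": "B"}, {"id": "A"}])
```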