
koza

Details

GitHub monarch-initiative/koza
Language Python
Description Data transformation framework for LinkML data models

Dependencies

External Dependencies

Package Version
coverage >=7.13.0
duckdb
loguru
biolink-model >=4.3.6
mergedeep ==1.3.4
ordered-set >=4.1.0
pydantic >=2.12.5
pyyaml >=6.0.3
requests >=2.32.5
sssom >=0.4
tqdm >=4.67.1
typer >=0.20.0
LinkML >=1.9.0

Documentation

Koza - Knowledge Graph Transformation and Operations Toolkit




Disclaimer: Koza is in beta - we are looking for testers!

Overview

Koza is a Python library and CLI tool for transforming biomedical data and performing graph operations on Knowledge Graph Exchange (KGX) files. It provides two main capabilities:

📊 Graph Operations (New!)

Powerful DuckDB-based operations for KGX knowledge graphs:

  • Join multiple KGX files with schema harmonization
  • Split files by field values with format conversion
  • Prune dangling edges and handle singleton nodes
  • Append new data to existing databases with schema evolution
  • Multi-format support for TSV, JSONL, and Parquet files

🔄 Data Transformation (Core)

Transform biomedical data sources into KGX format:

  • Transform csv, json, yaml, jsonl, and xml to target formats
  • Output in KGX format
  • Write data transforms in semi-declarative Python
  • Configure source files, columns/properties, and metadata in YAML
  • Create mapping files and translation tables between vocabularies
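To make the row-to-KGX idea concrete, here is a toy, stdlib-only sketch of mapping source rows to edge records. This is not Koza's API (a real ingest declares its source in YAML and emits Biolink-model entities); the column names `protein1`/`protein2` are illustrative:

```python
import csv
import io

def transform_rows(tsv_text: str):
    """Map each delimited source row to a KGX-style edge dict.

    Illustrative only: a real Koza transform is driven by a YAML
    source config and writes typed Biolink entities.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    edges = []
    for row in reader:
        edges.append({
            "subject": row["protein1"],
            "predicate": "biolink:interacts_with",
            "object": row["protein2"],
        })
    return edges

sample = "protein1\tprotein2\tscore\nP1\tP2\t900\n"
edges = transform_rows(sample)
```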

Installation

Koza is available on PyPI and can be installed via pip or pipx:

[pip|pipx] install koza

Usage

Quick Start with Graph Operations

Koza's graph operations work seamlessly across multiple KGX formats (TSV, JSONL, Parquet):

# Join multiple KGX files into a unified database
koza join --nodes genes.tsv pathways.jsonl --edges interactions.parquet --output merged_graph.duckdb

# Prune dangling edges and handle singleton nodes
koza prune --database merged_graph.duckdb --keep-singletons

# Append new data to existing database with schema evolution
koza append --database merged_graph.duckdb --nodes new_genes.tsv --edges new_interactions.jsonl

# Split database by source with format conversion
koza split --database merged_graph.duckdb --split-on provided_by --output-format parquet

NOTE: As of version 0.2.0, there is a new method for getting your ingest's KozaApp instance. Please see the updated documentation for details.

See the Koza documentation for complete usage information.

Examples

Validate

Give Koza a local or remote CSV file and get back basic information (headers, number of rows):

koza validate \
  --file https://raw.githubusercontent.com/monarch-initiative/koza/main/examples/data/string.tsv \
  --delimiter ' '

Passing a JSON- or JSONL-formatted file will confirm whether the file is valid JSON or JSONL:

koza validate \
  --file ./examples/data/ZFIN_PHENOTYPE_0.jsonl.gz \
  --format jsonl
koza validate \
  --file ./examples/data/ddpheno.json.gz \
  --format json
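Conceptually, JSONL validation amounts to checking that every line parses as a standalone JSON document. A minimal stdlib sketch of that idea (not Koza's implementation):

```python
import json

def is_valid_jsonl(text: str) -> bool:
    """Return True if every non-empty line parses as JSON."""
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines are tolerated in this sketch
        try:
            json.loads(line)
        except json.JSONDecodeError:
            return False
    return True
```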

Transform

Run the example ingest, "string/protein-links-detailed":

koza transform \
  --source examples/string/protein-links-detailed.yaml \
  --global-table examples/translation_table.yaml

koza transform \
  --source examples/string-declarative/protein-links-detailed.yaml \
  --global-table examples/translation_table.yaml

Note: Koza expects a directory structure as described in the above example,
with the source config file and transform code in the same directory:

.
├── ...
│   ├── your_source
│   │   ├── your_ingest.yaml
│   │   └── your_ingest.py
│   └── some_translation_table.yaml
└── ...
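For orientation, a minimal your_ingest.yaml might look roughly like the fragment below. The exact field names here are illustrative, so check them against the shipped example configs (e.g. examples/string/protein-links-detailed.yaml) before relying on them:

```yaml
# Illustrative source config sketch -- verify field names against
# the examples/ directory for your installed Koza version.
name: 'your_ingest'
format: 'csv'        # delimited text; delimiter set below
delimiter: ' '
files:
  - './examples/data/string.tsv'
columns:
  - 'protein1'
  - 'protein2'
  - 'combined_score'
```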

Graph Operations

Create and manipulate knowledge graphs from existing KGX files:

# Join heterogeneous KGX files with automatic schema harmonization
koza join \
  --nodes genes.tsv proteins.jsonl pathways.parquet \
  --edges gene_protein.tsv protein_pathway.jsonl \
  --output unified_graph.duckdb \
  --schema-report

# Clean up graph integrity issues
koza prune \
  --database unified_graph.duckdb \
  --keep-singletons \
  --dry-run  # Preview changes before applying

# Incrementally add new data with schema evolution
koza append \
  --database unified_graph.duckdb \
  --nodes new_genes.tsv updated_pathways.jsonl \
  --deduplicate \
  --show-progress

# Export subsets with format conversion
koza split \
  --database unified_graph.duckdb \
  --split-on provided_by \
  --output-format parquet \
  --output-dir ./split_graphs

Key Features

🔧 Multi-Format Support

  • Native support for TSV, JSONL, and Parquet KGX files
  • Automatic format detection and conversion
  • Mixed-format operations in single commands
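Automatic format detection is typically extension-driven. A hedged sketch of that idea, using a hypothetical extension map (Koza's actual detection logic may differ):

```python
from pathlib import Path

# Hypothetical extension-to-format map; Koza's real detection may differ.
KGX_FORMATS = {".tsv": "tsv", ".jsonl": "jsonl", ".parquet": "parquet"}

def detect_format(path: str) -> str:
    """Guess the KGX serialization from a file path's extension."""
    # Walk suffixes right-to-left so 'nodes.jsonl.gz' detects as jsonl.
    for suffix in reversed(Path(path).suffixes):
        if suffix == ".gz":
            continue
        fmt = KGX_FORMATS.get(suffix)
        if fmt:
            return fmt
    raise ValueError(f"unrecognized KGX format: {path}")
```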

๐Ÿ›ก๏ธ Schema Flexibility

  • Automatic schema harmonization across heterogeneous files
  • Schema evolution with backward compatibility
  • Comprehensive schema reporting and validation
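At its simplest, schema harmonization means taking the union of columns across inputs and padding missing values. Koza performs this with DuckDB; the stdlib sketch below only illustrates the concept on row dicts:

```python
def harmonize(tables):
    """Union the columns of several row-dict tables, filling gaps with None."""
    columns = []
    for rows in tables:
        for row in rows:
            for col in row:
                if col not in columns:
                    columns.append(col)  # preserve first-seen column order
    merged = []
    for rows in tables:
        for row in rows:
            merged.append({col: row.get(col) for col in columns})
    return columns, merged
```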

⚡ High Performance

  • DuckDB-powered operations for fast bulk processing
  • Memory-efficient handling of large knowledge graphs
  • Parallel processing and streaming where possible

๐Ÿ” Rich CLI Experience

  • Progress indicators for long-running operations
  • Detailed statistics and operation summaries
  • Dry-run modes for safe operation preview

🧹 Data Integrity

  • Dangling edge detection and preservation
  • Duplicate detection and removal strategies
  • Non-destructive operations with data archiving
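Conceptually, a dangling edge is one whose subject or object has no matching node record. The real prune command works on the DuckDB database; this stdlib sketch only shows the detection step:

```python
def find_dangling_edges(nodes, edges):
    """Split edges into (kept, dangling) by checking both endpoints."""
    node_ids = {n["id"] for n in nodes}
    kept, dangling = [], []
    for edge in edges:
        if edge["subject"] in node_ids and edge["object"] in node_ids:
            kept.append(edge)
        else:
            dangling.append(edge)
    return kept, dangling
```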