
Knowledge Base Partitioning

BOOMER uses graph-based partitioning to decompose large knowledge bases into smaller, independent components called cliques. This approach significantly improves computational efficiency while maintaining logical consistency.

Overview

Knowledge base partitioning identifies groups of related entities that form strongly connected components in the entity relationship graph. Each component can be solved independently, dramatically reducing the combinatorial explosion that would occur when solving the entire KB at once.

How Partitioning Works

Entity Relationship Graph

BOOMER constructs a directed graph where:

  • Nodes represent entities (concepts, terms, identifiers)
  • Edges represent relationships between entities

Different fact types create different edge patterns:

  • EquivalentTo: Creates bidirectional edges (A ↔ B)
  • SubClassOf/ProperSubClassOf: Creates a unidirectional edge (A → B)
  • DisjointWith: Tracked, but doesn't create edges for partitioning
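
To make the edge patterns concrete, here is a minimal sketch that builds such a graph by hand with NetworkX. It only illustrates the rules above; BOOMER's own kb_to_graph helper (shown later on this page) does this for you, and its exact implementation may differ:

import networkx as nx

graph = nx.DiGraph()

# EquivalentTo(A, B): edges in both directions
graph.add_edge("A", "B")
graph.add_edge("B", "A")

# SubClassOf(C, A): a single directed edge
graph.add_edge("C", "A")

# DisjointWith facts add no edges; they only constrain solving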

Strongly Connected Components

A strongly connected component (SCC) is a group of entities where every entity can reach every other entity through directed paths. BOOMER uses NetworkX's strongly_connected_components() algorithm to identify these groups.

Examples:

  1. Simple Equivalence Chain: If A≡B and B≡C, then {A, B, C} form one component
  2. Mixed Relationships: If A⊆B, B≡C, C⊆D, D≡A, then {A, B, C, D} form one component (all mutually reachable)
  3. Isolated Hierarchies: If A⊆B and C⊆D with no connections between them, they form separate components
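
Example 2 can be checked directly with NetworkX: encode the equivalences as edge pairs and the subclass facts as single edges, then ask for the strongly connected components. This uses plain NetworkX calls, not BOOMER internals:

import networkx as nx

g = nx.DiGraph()
g.add_edge("A", "B")                        # A ⊆ B
g.add_edges_from([("B", "C"), ("C", "B")])  # B ≡ C
g.add_edge("C", "D")                        # C ⊆ D
g.add_edges_from([("D", "A"), ("A", "D")])  # D ≡ A

print(list(nx.strongly_connected_components(g)))
# [{'A', 'B', 'C', 'D'}] — all four entities are mutually reachable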

Automatic Subclustering Strategy

When the solve() function encounters a large KB, it automatically applies a two-phase partitioning strategy:

  1. Initial Threshold Check: If the KB has more pfacts than partition_initial_threshold (which defaults to max_pfacts_per_clique), the KB is partitioned into strongly connected components

  2. Recursive Subclustering: Each partition is then solved recursively:

     • If a partition still exceeds max_pfacts_per_clique, it's further subdivided
     • The partition_initial_threshold is updated to prevent infinite recursion
     • Each sub-KB is solved independently
     • Solutions are combined to form the final result

This automatic subclustering ensures that even very large KBs can be processed efficiently without manual intervention.
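
The overall control flow can be pictured as a small recursive function. The sketch below is only an outline of the strategy described above, not BOOMER's actual solve() code; solve_directly and combine_solutions are hypothetical stand-ins:

from boomer.splitter import partition_kb

def solve_with_partitioning(kb, config):
    # Small enough: solve the KB directly
    if len(kb.pfacts) <= config.partition_initial_threshold:
        return solve_directly(kb, config)        # hypothetical direct solver

    solutions = []
    for sub_kb in partition_kb(kb, max_pfacts_per_clique=config.max_pfacts_per_clique):
        # Recurse: oversized partitions are subdivided again
        solutions.append(solve_with_partitioning(sub_kb, config))

    return combine_solutions(solutions)          # hypothetical combining step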

Benefits of Partitioning

Computational Efficiency

Without partitioning, a KB with n probabilistic facts requires exploring 2^n combinations. With partitioning:

Total combinations = 2^n1 + 2^n2 + ... + 2^nk

Where n1, n2, ..., nk are the sizes of individual cliques.

Example: A KB with 20 pfacts

  • Without partitioning: 2^20 = 1,048,576 combinations
  • With 4 cliques of 5 pfacts each: 4 × 2^5 = 128 combinations
  • Speedup: 8,192x faster
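
The arithmetic is easy to verify in a few lines of Python:

clique_sizes = [5, 5, 5, 5]

without = 2 ** sum(clique_sizes)                       # 2^20 = 1,048,576
with_partitioning = sum(2 ** n for n in clique_sizes)  # 4 × 2^5 = 128

print(without // with_partitioning)                    # 8192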

Logical Independence

Each clique represents a logically independent subset of the KB:

  • Facts in different cliques don't interact
  • Solutions for each clique can be combined without conflicts
  • Parallel processing becomes possible

Clique Size Management

Large cliques can still be computationally expensive. BOOMER provides the max_pfacts_per_clique parameter to limit clique sizes.

How It Works

When a clique exceeds the maximum size:

  1. Probabilistic facts are sorted by probability (highest first)
  2. Only the top N facts are retained
  3. Lower-probability facts are pruned
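
A minimal sketch of that pruning step, assuming pfacts carry a prob attribute as in the examples later on this page (illustrative only, not BOOMER's internal code):

def prune_pfacts(pfacts, max_pfacts_per_clique):
    # Keep the N highest-probability facts; drop the rest
    ranked = sorted(pfacts, key=lambda pf: pf.prob, reverse=True)
    return ranked[:max_pfacts_per_clique]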

CLI Usage

# Limit each clique to maximum 100 probabilistic facts
boomer-cli solve large_kb.json --max-pfacts-per-clique 100

# More aggressive pruning for very large KBs
boomer-cli solve huge_kb.yaml -C 50

Python API

from boomer.splitter import partition_kb
from boomer.search import solve

# Partition with size limit
partitions = partition_kb(kb, max_pfacts_per_clique=100)

# Solve each partition
solutions = []
for partition in partitions:
    solution = solve(partition)
    solutions.append(solution)

Advanced Subclustering Algorithm

How Subclustering Works

When a clique is still too large after initial partitioning, BOOMER employs an iterative subclustering algorithm:

  1. Sort by Probability: Pfacts are sorted by probability (highest first)
  2. Iterative Dropping: Temporarily remove the lowest probability pfacts
  3. Re-evaluate Components: Check if removing pfacts creates smaller connected components
  4. Component Extraction: When a suitable component is found, extract it as a sub-KB
  5. Restore and Repeat: Re-add dropped pfacts and continue with remaining entities

The split_connected_components Algorithm

The algorithm progressively "chips away" at large cliques:

# Pseudocode for the subclustering strategy
while clique has pfacts:
    dropped_pfacts = []
    split_found = False

    while not split_found:
        # Analyze the current graph structure
        components = find_strongly_connected_components(clique)

        for component in components:
            if min_size <= size(component) <= max_size:
                # Found a good sub-clique: extract it as a sub-KB
                yield component
                remove component's entities from clique
                split_found = True
                break

        if not split_found:
            # Drop the lowest-probability pfacts temporarily (in batches)
            move a batch of the lowest-probability pfacts to dropped_pfacts

    # Restore dropped pfacts before processing the remaining entities
    restore dropped_pfacts to clique

Key Parameters

  • Step Size: Number of pfacts to drop at once (default: KB size / 20 + 1)
  • Min PFacts per Clique: Minimum size for a viable sub-clique (default: 5)
  • Max PFacts per Clique: Maximum allowed clique size
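
The step-size default follows directly from the documented formula:

# Default batch size for temporarily dropping pfacts
step_size = len(kb.pfacts) // 20 + 1   # e.g. 200 pfacts → batches of 11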

Example: Progressive Subclustering

Consider a highly connected KB with 200 pfacts in a single component:

from boomer.model import SearchConfig
from boomer.search import solve

config = SearchConfig(
    max_pfacts_per_clique=50,      # Target maximum size
    partition_initial_threshold=50  # When to start partitioning
)

# The solve() function automatically:
# 1. Detects KB has 200 pfacts > threshold of 50
# 2. Attempts initial partitioning
# 3. If still too large, applies subclustering:
#    - Drops low-probability pfacts in batches of ~10
#    - Checks if this creates smaller components
#    - Extracts components of size 5-50
#    - Continues until all pfacts are assigned
solution = solve(kb, config)

Practical Examples

Example 1: Disease Mapping with Multiple Systems

Consider mapping between three disease classification systems:

from boomer.model import KB, PFact, EquivalentTo

kb = KB(pfacts=[
    # MONDO to ICD10 mappings
    PFact(fact=EquivalentTo("MONDO:001", "ICD10:A01"), prob=0.9),
    PFact(fact=EquivalentTo("MONDO:001", "ICD10:A02"), prob=0.3),

    # MONDO to SNOMED mappings
    PFact(fact=EquivalentTo("MONDO:001", "SNOMED:123"), prob=0.8),

    # Separate disease cluster
    PFact(fact=EquivalentTo("MONDO:002", "ICD10:B01"), prob=0.9),
    PFact(fact=EquivalentTo("MONDO:002", "SNOMED:456"), prob=0.8),
])

# This creates 2 cliques:
# Clique 1: {MONDO:001, ICD10:A01, ICD10:A02, SNOMED:123}
# Clique 2: {MONDO:002, ICD10:B01, SNOMED:456}
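
You can confirm the expected split by partitioning this KB directly (using partition_kb as in the Visualizing Partitions section below):

from boomer.splitter import partition_kb

partitions = list(partition_kb(kb))
print(len(partitions))   # Expected: 2, one per disease cluster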

Example 2: Taxonomic Relationships

from boomer.model import KB, PFact, EquivalentTo, ProperSubClassOf

kb = KB(
    facts=[
        ProperSubClassOf("cat", "mammal"),
        ProperSubClassOf("dog", "mammal"),
        ProperSubClassOf("mammal", "animal"),
    ],
    pfacts=[
        PFact(fact=EquivalentTo("cat", "feline"), prob=0.9),
        PFact(fact=EquivalentTo("dog", "canine"), prob=0.9),

        # Separate plant taxonomy
        PFact(fact=EquivalentTo("rose", "Rosa"), prob=0.9),
        PFact(fact=EquivalentTo("oak", "Quercus"), prob=0.9),
    ]
)

# Creates multiple cliques based on connectivity

Visualizing Partitions

You can inspect how your KB was partitioned:

from boomer.splitter import partition_kb, kb_to_graph
import networkx as nx

# Create the entity graph
graph = kb_to_graph(kb)

# Find strongly connected components
components = list(nx.strongly_connected_components(graph))
print(f"Found {len(components)} cliques")

for i, component in enumerate(components):
    print(f"Clique {i}: {component}")

# Or use the partition function directly
partitions = list(partition_kb(kb))
for i, partition in enumerate(partitions):
    print(f"Partition {i}:")
    print(f"  Entities: {len(partition.pfacts)} pfacts")
    print(f"  Facts: {len(partition.facts)} facts")

Performance Considerations

When Partitioning Helps Most

  1. Multiple independent mappings: Different ontology alignments in the same file
  2. Sparse connectivity: Facts that relate to distinct entity groups
  3. Large knowledge bases: KBs with hundreds or thousands of facts

When Partitioning Has Limited Effect

  1. Highly connected KBs: Where most entities relate to each other
  2. Single large equivalence class: All entities are potentially equivalent
  3. Dense relationship networks: Heavy use of transitive relationships

Configuration Guidelines

Choosing max_pfacts_per_clique

The optimal value depends on:

  • Available memory: Solving a clique of n pfacts requires memory on the order of 2^n
  • Time constraints: Larger cliques take exponentially longer to solve
  • Accuracy requirements: Pruning may remove important low-probability facts

Recommended starting points:

  • Small KBs (< 100 pfacts): No limit needed
  • Medium KBs (100-1000 pfacts): 100-200
  • Large KBs (> 1000 pfacts): 50-100
  • Very large KBs: 20-50
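
These rules of thumb are easy to encode; the helper below is a hypothetical convenience, not part of BOOMER's API:

def suggest_max_pfacts_per_clique(n_pfacts):
    # Hypothetical helper encoding the recommended starting points above
    if n_pfacts < 100:
        return None       # small KB: no limit needed
    if n_pfacts <= 1000:
        return 200        # medium KB
    return 100            # large KB; drop toward 20-50 for very large KBs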

Monitoring Partitioning

Enable verbose output to see partitioning details:

boomer-cli solve kb.json -vv

This will show:

  • Number of cliques found
  • Size of each clique
  • Which facts were pruned (if any)

Advanced Topics

Custom Partitioning Strategies

You can implement custom partitioning logic:

from boomer.splitter import partition_kb, extract_sub_kb

def custom_partition(kb, strategy="conservative"):
    if strategy == "conservative":
        # Keep more pfacts per clique
        return partition_kb(kb, max_pfacts_per_clique=200)
    elif strategy == "aggressive":
        # Aggressive pruning for speed
        return partition_kb(kb, max_pfacts_per_clique=30)
    else:
        # No pruning
        return partition_kb(kb)

Parallel Processing

Since cliques are independent, you can process them in parallel:

from concurrent.futures import ProcessPoolExecutor
from boomer.search import solve

def solve_partition(partition):
    return solve(partition)

# Solve all partitions in parallel
partitions = list(partition_kb(kb))
with ProcessPoolExecutor() as executor:
    solutions = list(executor.map(solve_partition, partitions))

How-to Guide: Tuning Subclustering

Diagnosing When You Need Subclustering

Check if subclustering would help:

from boomer.splitter import kb_to_graph
import networkx as nx

# Analyze your KB structure
graph = kb_to_graph(kb)
components = list(nx.strongly_connected_components(graph))

# Check component sizes
sizes = [len(c) for c in components]
print(f"Number of components: {len(components)}")
print(f"Largest component: {max(sizes)} entities")
print(f"Component size distribution: {sorted(sizes, reverse=True)[:10]}")

# If you have large components (>100 entities), subclustering will help

Choosing the Right Strategy

For Speed-Critical Applications

config = SearchConfig(
    max_pfacts_per_clique=30,       # Aggressive limit
    partition_initial_threshold=30   # Start partitioning early
)

For Accuracy-Critical Applications

config = SearchConfig(
    max_pfacts_per_clique=150,       # Conservative limit
    partition_initial_threshold=100  # Partition only when necessary
)

For Balanced Performance

config = SearchConfig(
    max_pfacts_per_clique=75,        # Moderate limit
    partition_initial_threshold=50   # Balanced threshold
)

Monitoring Subclustering Progress

Enable verbose logging to track subclustering:

# Set logging level to see subclustering details
export BOOMER_LOG_LEVEL=INFO
boomer-cli solve large_kb.json -C 50

You'll see output like:

Partitioning KB into 3 sub-KBs
Solving sub-KB: 45 facts, 89 pfacts
Splitting 89 pfacts into 50 pfacts per clique
Found split: 48 / 89 pfacts
Remaining: 41 pfacts
Sub-solution: pr:0.73 // post:0.95 // 1024 combinations

Reference: Subclustering Parameters

SearchConfig Parameters

| Parameter | Default | Description | When to Adjust |
| --- | --- | --- | --- |
| max_pfacts_per_clique | None | Maximum pfacts per clique | Set when memory/time is limited |
| partition_initial_threshold | Same as max_pfacts_per_clique | When to trigger partitioning | Lower for earlier partitioning |
| max_iterations | 1000000 | Max search iterations | Increase for complex problems |
| timeout_seconds | None | Time limit per solve | Set for production systems |
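
Putting the table's parameters together, a production-leaning configuration might look like this (the values are illustrative, not recommendations):

from boomer.model import SearchConfig

config = SearchConfig(
    max_pfacts_per_clique=100,        # cap clique size
    partition_initial_threshold=100,  # partition as soon as the cap is exceeded
    max_iterations=1000000,           # default search budget
    timeout_seconds=300,              # e.g. a 5-minute cap per solve
)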

partition_kb() Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| kb | KB | Knowledge base to partition |
| max_pfacts_per_clique | Optional[int] | Maximum pfacts per partition |
| min_pfacts_per_clique | int (default: 5) | Minimum viable partition size |

split_connected_components() Internal Parameters

| Parameter | Calculation | Purpose |
| --- | --- | --- |
| step_size | len(kb.pfacts) // 20 + 1 | Pfacts dropped per iteration |
| dropped_pfacts | List | Temporarily removed pfacts |
| is_split | Boolean | Whether a viable split was found |

Troubleshooting Subclustering

Issue: Subclustering Takes Too Long

Symptoms: The partitioning phase runs for minutes

Solutions:

  1. Reduce max_pfacts_per_clique to create smaller partitions faster
  2. Increase the step_size in the algorithm (requires code modification)
  3. Use more aggressive probability filtering before partitioning

Issue: Important Facts Are Being Dropped

Symptoms: Low-probability but important facts missing from solution

Solutions:

  1. Increase max_pfacts_per_clique to retain more facts
  2. Pre-boost probabilities of known important facts
  3. Use manual partitioning for critical fact groups
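
One way to apply solution 2, assuming pfacts expose fact and prob attributes as in the earlier examples (must_keep is a hypothetical collection of protected facts, and raising probabilities will of course bias the solution toward them):

from boomer.model import KB, PFact

def boost_pfacts(kb, must_keep, floor=0.8):
    # Raise protected facts to at least `floor` so that size-based
    # pruning (which keeps the highest-probability facts) retains them
    boosted = [
        PFact(fact=pf.fact, prob=max(pf.prob, floor))
        if pf.fact in must_keep else pf
        for pf in kb.pfacts
    ]
    return KB(facts=kb.facts, pfacts=boosted)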

Issue: Memory Exhaustion

Symptoms: Out of memory errors during solving

Solutions:

  1. Reduce max_pfacts_per_clique significantly (try 20-30)
  2. Enable incremental solving with smaller batches
  3. Use the CLI with output streaming instead of in-memory processing

Best Practices

  1. Start without limits: First try solving without max_pfacts_per_clique
  2. Monitor performance: Use verbose mode to understand partitioning behavior
  3. Adjust gradually: If needed, start with high limits and reduce gradually
  4. Validate results: Ensure pruning doesn't remove critical facts
  5. Document choices: Record why specific limits were chosen for reproducibility
  6. Test subclustering: Use smaller test KBs to understand behavior before production