Content is user-generated and unverified.

SBML_dfs: A Comprehensive Overview

Introduction

SBML_dfs (Systems Biology Markup Language Data Frames) is the core data structure in Napistu for representing biological pathway models as interconnected pandas DataFrames. It provides a structured, programmatic way to work with complex biological networks by decomposing SBML models into manageable tabular components.

Core Concept

Traditional SBML files are XML-based and can be difficult to manipulate programmatically. SBML_dfs transforms these hierarchical structures into a collection of related DataFrames, making it easier to:

  • Query and filter biological networks
  • Merge multiple pathway models
  • Perform network analysis
  • Apply data science techniques to biological data
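
The flattening idea can be sketched with the standard library alone. The snippet below is not Napistu's parser: it assumes a tiny hand-written SBML Level 3 fragment (the ids and names are invented) and a hypothetical helper, flatten, that turns each XML list into a list of row dictionaries of the kind pandas.DataFrame accepts.

```python
import xml.etree.ElementTree as ET

# Toy SBML Level 3 fragment; ids and names are invented for illustration.
SBML_TEXT = """<sbml xmlns="http://www.sbml.org/sbml/level3/version1/core" level="3" version="1">
  <model id="toy_model">
    <listOfCompartments>
      <compartment id="c1" name="cytoplasm" constant="true"/>
    </listOfCompartments>
    <listOfSpecies>
      <species id="s1" name="ProteinA" compartment="c1"/>
      <species id="s2" name="ProteinB" compartment="c1"/>
    </listOfSpecies>
  </model>
</sbml>"""

NAMESPACES = {"sbml": "http://www.sbml.org/sbml/level3/version1/core"}


def flatten(xml_text, list_tag, item_tag):
    """Return one dict of attributes per <item_tag> nested in <list_tag>."""
    root = ET.fromstring(xml_text)
    path = f".//sbml:{list_tag}/sbml:{item_tag}"
    return [dict(element.attrib) for element in root.findall(path, NAMESPACES)]


compartment_rows = flatten(SBML_TEXT, "listOfCompartments", "compartment")
species_rows = flatten(SBML_TEXT, "listOfSpecies", "species")

for row in species_rows:
    print(row["id"], row["name"], row["compartment"])
```

Each species row keeps a reference to its compartment id, which is the same kind of key-based link the tabular representation relies on.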

Data Structure Overview

An SBML_dfs object contains five core tables that represent different aspects of a biological network:

Core Tables

  1. compartments - Sub-cellular locations (e.g., cytoplasm, nucleus)
  2. species - Molecular entities (e.g., proteins, metabolites, genes)
  3. compartmentalized_species - Species-compartment combinations
  4. reactions - Biological processes or interactions
  5. reaction_species - Relationships between reactions and their participants

Optional Data Tables

  • species_data - Additional annotations for species (expression data, properties)
  • reactions_data - Additional annotations for reactions (kinetic parameters, scores)

Table Relationships and Schema

The tables are connected through a well-defined schema with primary and foreign keys:

compartments (c_id) ←─── compartmentalized_species (sc_id)
                              ↑
species (s_id) ←──────────────┘
                              
reactions (r_id) ←─── reaction_species (rsc_id)
                              ↑
compartmentalized_species ────┘

Key Relationships

  • Each compartmentalized species links a species to a compartment
  • Each reaction species links a reaction to a compartmentalized species with stoichiometry and role information
  • Identifiers provide cross-references to external databases (UniProt, KEGG, etc.)
  • Sources track the origin of each entity across different pathway databases
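
The key relationships above amount to four foreign-key constraints. As a rough sketch, assuming toy tables stored as plain dictionaries keyed by primary key (the rows are invented, and dangling_references is a hypothetical helper rather than a Napistu function), a referential-integrity check could look like this:

```python
# Toy tables keyed by primary key; ids and names are invented for illustration.
TABLES = {
    "compartments": {"C1": {"c_name": "cytoplasm"}},
    "species": {"S1": {"s_name": "ProteinA"}, "S2": {"s_name": "ProteinB"}},
    "compartmentalized_species": {
        "SC1": {"s_id": "S1", "c_id": "C1"},
        "SC2": {"s_id": "S2", "c_id": "C1"},
    },
    "reactions": {"R1": {"r_name": "ProteinA activates ProteinB"}},
    "reaction_species": {
        "RSC1": {"r_id": "R1", "sc_id": "SC1", "stoichiometry": 0},
        "RSC2": {"r_id": "R1", "sc_id": "SC2", "stoichiometry": 1},
    },
}

# (child table, foreign-key column, parent table), following the schema diagram.
FOREIGN_KEYS = [
    ("compartmentalized_species", "s_id", "species"),
    ("compartmentalized_species", "c_id", "compartments"),
    ("reaction_species", "r_id", "reactions"),
    ("reaction_species", "sc_id", "compartmentalized_species"),
]


def dangling_references(tables):
    """Map (child table, key column) to child rows whose key has no parent row."""
    problems = {}
    for child, key, parent in FOREIGN_KEYS:
        missing = sorted(
            pk for pk, row in tables[child].items() if row[key] not in tables[parent]
        )
        if missing:
            problems[(child, key)] = missing
    return problems


print(dangling_references(TABLES))
```

An empty result means every compartmentalized species points at an existing species and compartment, and every reaction species points at an existing reaction and compartmentalized species.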

Creating SBML_dfs Objects

Method 1: From SBML Files

python
from napistu.ingestion.sbml import SBML
from napistu.sbml_dfs_core import SBML_dfs
from napistu.source import Source

# Load SBML file
sbml_model = SBML("path/to/model.sbml")

# Create source metadata
model_source = Source.single_entry(
    model="my_model",
    name="My Pathway Model",
    data_source="Reactome",
    organismal_species="Homo sapiens"
)

# Create SBML_dfs
sbml_dfs = SBML_dfs(sbml_model, model_source)

Method 2: From Interaction Edgelists

python
import pandas as pd
from napistu.identifiers import Identifiers
from napistu.sbml_dfs_core import SBML_dfs

# Define species
species_df = pd.DataFrame({
    's_name': ['ProteinA', 'ProteinB', 'MetaboliteC'],
    's_Identifiers': [
        Identifiers([{'ontology': 'uniprot', 'identifier': 'P12345'}]),
        Identifiers([{'ontology': 'uniprot', 'identifier': 'P67890'}]),
        Identifiers([{'ontology': 'chebi', 'identifier': '12345'}])
    ]
})

# Define compartments
compartments_df = pd.DataFrame({
    'c_name': ['cytoplasm', 'nucleus'],
    'c_Identifiers': [
        Identifiers([{'ontology': 'go', 'identifier': 'GO:0005829'}]),
        Identifiers([{'ontology': 'go', 'identifier': 'GO:0005634'}])
    ]
})

# Define interactions
interaction_edgelist = pd.DataFrame({
    'upstream_name': ['ProteinA'],
    'downstream_name': ['ProteinB'],
    'r_name': ['ProteinA activates ProteinB'],
    'upstream_compartment': ['cytoplasm'],
    'downstream_compartment': ['cytoplasm'],
    'upstream_sbo_term_name': ['stimulator'],
    'downstream_sbo_term_name': ['product'],
    'r_isreversible': [False]
})

# Create SBML_dfs from interactions (reuses model_source from Method 1)
sbml_dfs = SBML_dfs.from_edgelist(
    interaction_edgelist=interaction_edgelist,
    species_df=species_df,
    compartments_df=compartments_df,
    model_source=model_source
)

Method 3: From Consensus Models

Napistu can merge multiple pathway models into consensus networks:

python
from napistu.consensus import construct_consensus_model, prepare_consensus_model
from napistu.sbml_dfs_core import SBML_dfs

# Load multiple models
model_list = [
    SBML_dfs.from_pickle("reactome_model.pkl"),
    SBML_dfs.from_pickle("kegg_model.pkl"),
    SBML_dfs.from_pickle("bigg_model.pkl")
]

# Prepare for consensus
sbml_dfs_dict, pw_index = prepare_consensus_model(model_list)

# Create consensus model (merges entities with shared identifiers)
consensus_model = construct_consensus_model(
    sbml_dfs_dict, 
    pw_index, 
    dogmatic=True  # Keep genes/transcripts/proteins separate
)
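
Merging entities with shared identifiers is, at heart, a connected-components problem. The sketch below is not Napistu's consensus algorithm and ignores the dogmatic option. It assumes each entity is described by a set of (ontology, identifier) pairs, uses invented keys such as "reactome:S1", and groups entities with a small union-find in a hypothetical helper.

```python
def merge_by_shared_identifiers(entities):
    """Group entity keys that share at least one (ontology, identifier) pair."""
    parent = {key: key for key in entities}

    def find(key):
        # Follow parent links to the group's representative (with path halving).
        while parent[key] != key:
            parent[key] = parent[parent[key]]
            key = parent[key]
        return key

    first_owner = {}
    for key, identifiers in entities.items():
        for identifier in identifiers:
            # The first entity seen with an identifier anchors later matches.
            owner = first_owner.setdefault(identifier, key)
            parent[find(key)] = find(owner)

    groups = {}
    for key in entities:
        groups.setdefault(find(key), []).append(key)
    return sorted(sorted(group) for group in groups.values())


# Invented entities from three hypothetical source models.
ENTITIES = {
    "reactome:S1": {("uniprot", "P12345")},
    "kegg:S7": {("uniprot", "P12345"), ("uniprot", "P67890")},
    "bigg:S3": {("chebi", "12345")},
}

print(merge_by_shared_identifiers(ENTITIES))
```

Because grouping is transitive, two entities end up merged even when they share no identifier directly, as long as a chain of shared identifiers connects them. That behaviour is worth keeping in mind when annotations are noisy.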

Key Features and Capabilities

1. Identifier Management

SBML_dfs attaches Identifiers objects to each entity so that the same biological entity can be linked across databases:

python
# Get species identifiers
species_ids = sbml_dfs.get_identifiers('species')

# Get characteristic identifiers (filters out subcomponents)
char_ids = sbml_dfs.get_characteristic_species_ids(dogmatic=True)

# Search by specific identifiers
entity_subset, matching_ids = sbml_dfs.search_by_ids(
    id_table=species_ids,
    identifiers=['P04637'],  # UniProt accession for human TP53
    ontologies=['uniprot']
)
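
Restricting a search by ontology matters because the same identifier string can occur in more than one ontology. A minimal sketch, assuming a long-format identifier table with s_id, ontology and identifier columns (the rows are invented, and search_by_identifier is a hypothetical stand-in, not the Napistu method):

```python
# Invented identifier table in long format: one row per annotation.
ID_TABLE = [
    {"s_id": "S1", "ontology": "uniprot", "identifier": "P12345"},
    {"s_id": "S2", "ontology": "uniprot", "identifier": "P67890"},
    {"s_id": "S3", "ontology": "chebi", "identifier": "12345"},
    {"s_id": "S4", "ontology": "ncbi_entrez_gene", "identifier": "12345"},
]


def search_by_identifier(id_table, identifiers, ontologies=None):
    """Return matching rows and the sorted species ids they belong to."""
    wanted = set(identifiers)
    allowed = None if ontologies is None else set(ontologies)
    matches = [
        row
        for row in id_table
        if row["identifier"] in wanted
        and (allowed is None or row["ontology"] in allowed)
    ]
    return matches, sorted({row["s_id"] for row in matches})


_, any_ontology = search_by_identifier(ID_TABLE, ["12345"])
_, chebi_only = search_by_identifier(ID_TABLE, ["12345"], ontologies=["chebi"])
print(any_ontology, chebi_only)
```

Without the ontology filter, the numeric identifier matches both a metabolite and a gene. With it, only the intended species is returned.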

2. Network Analysis

python
# Get network statistics
network_stats = sbml_dfs.get_network_summary()

# Get species connectivity features
species_features = sbml_dfs.get_species_features()
cspecies_features = sbml_dfs.get_cspecies_features()

# Generate reaction formulas
formulas = sbml_dfs.reaction_formulas()
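
To illustrate how a formula can be derived from reaction_species rows, the sketch below assumes a common sign convention: negative stoichiometry for substrates, positive for products, and zero for modifiers. Both that convention and the output format are assumptions made for illustration. The text produced by reaction_formulas() may differ.

```python
def format_side(terms):
    """Join (name, coefficient) pairs, omitting coefficients equal to 1."""
    parts = []
    for name, coefficient in terms:
        prefix = "" if coefficient == 1 else f"{coefficient:g} "
        parts.append(prefix + name)
    return " + ".join(parts)


def reaction_formula(participants, reversible=False):
    """Render a formula from (name, stoichiometry) pairs.

    Assumed convention: negative = substrate, positive = product, zero = modifier.
    """
    substrates = [(name, -s) for name, s in participants if s < 0]
    products = [(name, s) for name, s in participants if s > 0]
    modifiers = [name for name, s in participants if s == 0]
    arrow = "<->" if reversible else "->"
    formula = f"{format_side(substrates)} {arrow} {format_side(products)}"
    if modifiers:
        formula += " ; modifiers: " + ", ".join(modifiers)
    return formula


hexokinase_step = [
    ("glucose", -1),
    ("ATP", -1),
    ("glucose 6-phosphate", 1),
    ("ADP", 1),
    ("hexokinase", 0),
]
print(reaction_formula(hexokinase_step))
print(reaction_formula([("H2", -2), ("O2", -1), ("H2O", 2)]))
```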

3. Data Integration

python
# Add expression data
expression_data = pd.DataFrame({
    's_id': ['S00001', 'S00002'],
    'liver_expression': [5.2, 3.1],
    'brain_expression': [2.8, 7.4]
})
sbml_dfs.add_species_data('gtex_expression', expression_data)

# Add reaction scores
reaction_scores = pd.DataFrame({
    'r_id': ['R00001', 'R00002'], 
    'confidence_score': [0.95, 0.82]
})
sbml_dfs.add_reactions_data('confidence', reaction_scores)
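
Attaching data by primary key behaves like a left join onto the entity table. The sketch below shows that idea with plain dictionaries. It is not how Napistu stores species_data, and the helper attach_data and its choice to reject unknown ids are assumptions made for illustration.

```python
SPECIES = {
    "S00001": {"s_name": "ProteinA"},
    "S00002": {"s_name": "ProteinB"},
    "S00003": {"s_name": "MetaboliteC"},
}

EXPRESSION_ROWS = [
    {"s_id": "S00001", "liver_expression": 5.2, "brain_expression": 2.8},
    {"s_id": "S00002", "liver_expression": 3.1, "brain_expression": 7.4},
]


def attach_data(entity_table, data_rows, key):
    """Left-join data rows onto an entity table, rejecting unknown ids."""
    unknown = sorted(row[key] for row in data_rows if row[key] not in entity_table)
    if unknown:
        raise KeyError(f"{key} values not found in the entity table: {unknown}")
    extra = {
        row[key]: {column: value for column, value in row.items() if column != key}
        for row in data_rows
    }
    # Entities without data keep their original attributes only.
    return {pk: {**attrs, **extra.get(pk, {})} for pk, attrs in entity_table.items()}


annotated = attach_data(SPECIES, EXPRESSION_ROWS, key="s_id")
print(annotated["S00001"])
```

Failing early on ids that are absent from the model catches mismatched identifier spaces before they silently produce empty annotations.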

4. Model Refinement

python
# Infer missing compartments
sbml_dfs.infer_uncompartmentalized_species_location()

# Infer SBO terms from stoichiometry
sbml_dfs.infer_sbo_terms()

# Name compartmentalized species
sbml_dfs.name_compartmentalized_species()

# Remove reactions and unused species
sbml_dfs.remove_reactions(['R00001', 'R00002'], remove_species=True)
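
Removing reactions together with unused species is a cascade through the foreign keys. The sketch below walks through that cascade on invented toy tables. The helper drop_reactions is hypothetical and only mirrors the idea, not the implementation, of remove_reactions.

```python
TABLES = {
    "species": {"S1": {}, "S2": {}, "S3": {}},
    "compartmentalized_species": {
        "SC1": {"s_id": "S1"},
        "SC2": {"s_id": "S2"},
        "SC3": {"s_id": "S3"},
    },
    "reactions": {"R1": {}, "R2": {}},
    "reaction_species": {
        "RSC1": {"r_id": "R1", "sc_id": "SC1"},
        "RSC2": {"r_id": "R1", "sc_id": "SC2"},
        "RSC3": {"r_id": "R2", "sc_id": "SC2"},
        "RSC4": {"r_id": "R2", "sc_id": "SC3"},
    },
}


def drop_reactions(tables, reaction_ids, remove_species=False):
    """Return new tables without the given reactions (inputs are not mutated)."""
    doomed = set(reaction_ids)
    reactions = {k: v for k, v in tables["reactions"].items() if k not in doomed}
    reaction_species = {
        k: v for k, v in tables["reaction_species"].items() if v["r_id"] not in doomed
    }
    cspecies = dict(tables["compartmentalized_species"])
    species = dict(tables["species"])
    if remove_species:
        # Keep only entries still referenced further down the chain. This also
        # drops entries that were already unused before the removal.
        used_sc = {row["sc_id"] for row in reaction_species.values()}
        cspecies = {k: v for k, v in cspecies.items() if k in used_sc}
        used_s = {row["s_id"] for row in cspecies.values()}
        species = {k: v for k, v in species.items() if k in used_s}
    return {
        "species": species,
        "compartmentalized_species": cspecies,
        "reactions": reactions,
        "reaction_species": reaction_species,
    }


pruned = drop_reactions(TABLES, ["R1"], remove_species=True)
print(sorted(pruned["species"]))
```

S2 survives because it still participates in R2, while S1 disappears along with its only reaction.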

5. Export and Visualization

python
# Export model tables to files prefixed with "my_model_" in output_directory/
sbml_dfs.export_sbml_dfs("my_model_", "output_directory/")

# Convert to network graph
from napistu.network.net_create import process_napistu_graph

network_graph = process_napistu_graph(
    sbml_dfs,
    directed=True,
    wiring_approach="regulatory",
    weighting_strategy="unweighted"
)
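
One way to picture the conversion to a directed graph is a bipartite layout in which substrates and modifiers point at a reaction node, and the reaction node points at its products. This layout, the sign convention for stoichiometry, and the helper names below are assumptions made for illustration. The "regulatory" wiring used above may connect nodes differently.

```python
from collections import defaultdict, deque

# Invented reaction_species rows: R1 turns SC1 into SC2 (modified by SC3),
# R2 turns SC2 into SC4.
ROWS = [
    {"r_id": "R1", "sc_id": "SC1", "stoichiometry": -1},
    {"r_id": "R1", "sc_id": "SC2", "stoichiometry": 1},
    {"r_id": "R1", "sc_id": "SC3", "stoichiometry": 0},
    {"r_id": "R2", "sc_id": "SC2", "stoichiometry": -1},
    {"r_id": "R2", "sc_id": "SC4", "stoichiometry": 1},
]


def build_edges(rows):
    """Products receive an edge from the reaction; all other roles feed into it."""
    edges = []
    for row in rows:
        if row["stoichiometry"] > 0:
            edges.append((row["r_id"], row["sc_id"]))
        else:
            edges.append((row["sc_id"], row["r_id"]))
    return edges


def reachable_from(edges, start):
    """Breadth-first search over directed edges, excluding the start node."""
    neighbours = defaultdict(list)
    for source, target in edges:
        neighbours[source].append(target)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target in neighbours[node]:
            if target not in seen:
                seen.add(target)
                queue.append(target)
    seen.discard(start)
    return seen


EDGES = build_edges(ROWS)
print(sorted(reachable_from(EDGES, "SC3")))
```

Downstream questions, such as which species a modifier can influence, then reduce to ordinary graph traversals.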

# Save/load models
sbml_dfs.to_pickle("my_model.pkl")
loaded_model = SBML_dfs.from_pickle("my_model.pkl")

Advanced Usage Patterns

Context-Specific Filtering

python
from napistu.context import filtering

# Filter by tissue expression (requires GTEx data)
filtering.filter_species_by_attribute(
    sbml_dfs, 
    'gtex', 
    attribute_name='liver',
    attribute_value=1.0,  # minimum expression threshold
    inplace=True
)

# Filter by subcellular localization (requires HPA data)
filtering.filter_reactions_with_disconnected_cspecies(
    sbml_dfs, 'hpa', inplace=True
)

Ontology-Based Operations

python
from napistu.ontologies.genodexito import Genodexito

# Expand identifiers using external databases
genodexito = Genodexito(
    organismal_species="Homo sapiens",
    preferred_method="bioconductor"
)

genodexito.expand_sbml_dfs_ids(
    sbml_dfs, 
    ontologies=['ensembl_gene', 'uniprot']
)

Best Practices

1. Validation

Validate SBML_dfs objects after creating or modifying them:

python
sbml_dfs.validate()  # Raises informative errors
# or
sbml_dfs.validate_and_resolve()  # Attempts automatic fixes

2. Source Tracking

Maintain proper source metadata for reproducibility:

python
model_source = Source.single_entry(
    model="unique_model_id",
    pathway_id="pathway_identifier", 
    name="Human readable name",
    data_source="Database name",
    organismal_species="Homo sapiens",
    date="20240315"
)

3. Identifier Quality

Use characteristic identifiers for mapping and analysis:

python
# Prefer this for cross-dataset mapping
char_ids = sbml_dfs.get_characteristic_species_ids(dogmatic=True)

# Over this (includes subcomponents and non-characteristic annotations)
all_ids = sbml_dfs.get_identifiers('species')
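
Why subcomponent annotations can mislead a mapping is easiest to see on a protein complex. The sketch below assumes each annotation row carries a BioModels-style qualifier in a bqb column, with "BQB_IS" for the entity itself and "BQB_HAS_PART" for its parts. The rows, the "CPX-toy" identifier and both helpers are invented, and the exact rule applied by get_characteristic_species_ids is not reproduced here.

```python
# S1 is a complex annotated with its two subunits; S2 is one of those proteins.
ANNOTATIONS = [
    {"s_id": "S1", "ontology": "complexportal", "identifier": "CPX-toy", "bqb": "BQB_IS"},
    {"s_id": "S1", "ontology": "uniprot", "identifier": "P12345", "bqb": "BQB_HAS_PART"},
    {"s_id": "S1", "ontology": "uniprot", "identifier": "P67890", "bqb": "BQB_HAS_PART"},
    {"s_id": "S2", "ontology": "uniprot", "identifier": "P12345", "bqb": "BQB_IS"},
]


def characteristic_only(annotations, keep_qualifiers=("BQB_IS",)):
    """Keep annotations that describe the entity itself rather than its parts."""
    return [row for row in annotations if row["bqb"] in keep_qualifiers]


def species_matching(annotations, ontology, identifier):
    """Return the sorted species ids annotated with the given identifier."""
    return sorted(
        {
            row["s_id"]
            for row in annotations
            if row["ontology"] == ontology and row["identifier"] == identifier
        }
    )


print(species_matching(ANNOTATIONS, "uniprot", "P12345"))
print(species_matching(characteristic_only(ANNOTATIONS), "uniprot", "P12345"))
```

With every annotation included, a measurement for the protein would also be assigned to the complex that merely contains it. Restricting to characteristic identifiers keeps the mapping on the protein alone.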

4. Memory Management

For large models, work with individual tables instead of copying the whole object, and drop attached datasets once they are no longer needed:

python
# Access individual tables
species_table = sbml_dfs.get_table('species')
reactions_table = sbml_dfs.get_table('reactions')

# Remove unused data
sbml_dfs.remove_species_data('large_dataset')

Integration with Napistu Ecosystem

SBML_dfs is the shared data structure that other Napistu components build on:

  • CLI Tools: Use napistu command-line interface for batch processing
  • Network Analysis: Convert to igraph objects for graph algorithms
  • Matching Module: Map experimental data to pathway entities
  • Context Module: Apply tissue/condition-specific filters
  • Ingestion Module: Import from 10+ pathway databases

Common Use Cases

  1. Multi-Database Integration: Merge Reactome, KEGG, and BiGG models
  2. Context-Specific Networks: Filter pathways by tissue expression
  3. Experimental Data Mapping: Link omics data to pathway entities
  4. Network Analysis: Calculate centrality, shortest paths, and modules
  5. Comparative Analysis: Compare pathway content across conditions
  6. Visualization: Export networks for Cytoscape, Gephi, or web tools

Conclusion

SBML_dfs provides a powerful, flexible framework for working with biological pathway data. Its tabular structure makes complex biological networks accessible to standard data science workflows while maintaining the biological semantics and relationships essential for systems biology research.

The modular design allows users to start simple (single pathway analysis) and scale up to complex multi-database consensus models, making it suitable for both exploratory analysis and production pipelines in computational biology.
