Content is user-generated and unverified.

SBML_dfs High-Level Overview - Napistu

Introduction

SBML_dfs (Systems Biology Markup Language DataFrames) is Napistu's core data structure for representing biological pathway models. It stores a pathway as a collection of interconnected pandas DataFrames that behave like an in-memory relational database, so pathway data can be queried and manipulated with standard pandas operations.

Core Concept

SBML_dfs is modeled on the SBML data model. Rather than working directly with XML-based SBML files, it represents a biological pathway as structured pandas DataFrames linked by primary key-foreign key relationships. This approach provides several advantages:

  • Performance: Faster data access and manipulation compared to XML parsing
  • Integration: Seamless integration with Python's data science ecosystem
  • Flexibility: Easy querying, filtering, and analysis using pandas operations
  • Scalability: Efficient handling of large-scale biological networks
  • Validation: Built-in data integrity checks and automatic error resolution

Data Structure Architecture

SBML_dfs organizes biological pathway information into five core tables connected by primary-foreign key relationships:

Core Tables

1. Compartments (compartments)

  • Purpose: Defines sub-cellular compartments (e.g., cytoplasm, nucleus, mitochondria)
  • Index: Compartment ID (c_id)
  • Key Columns: c_name, c_Identifiers

2. Species (species)

  • Purpose: Catalog of molecular species (proteins, metabolites, genes, etc.)
  • Index: Species ID (s_id)
  • Key Columns: s_name, s_Identifiers
  • Features: Systematic identifiers for cross-referencing with external databases

3. Reactions (reactions)

  • Purpose: Biochemical reactions and molecular interactions
  • Index: Reaction ID (r_id)
  • Key Columns: r_name, r_Identifiers, r_isreversible

4. Reaction Species (reaction_species)

  • Purpose: Links species to reactions with stoichiometry and role information
  • Index: Reaction-Species ID (rsc_id)
  • Key Columns: r_id, sc_id, stoichiometry, sbo_term_name
  • Function: Defines how each species participates in each reaction

5. Compartmentalized Species (derived)

  • Purpose: Species-compartment combinations
  • Index: Compartmentalized species ID (sc_id), the key that reaction_species refers to
  • Function: Represents the same molecular species in different cellular compartments
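
To make the primary key-foreign key layout concrete, here is a minimal pandas-only sketch. It does not use Napistu. The toy rows and the compartmentalized species columns (sc_name, s_id, c_id) are assumptions made for illustration; the other index and column names follow the table descriptions above.

```python
import pandas as pd

# Toy tables; each one is indexed by its primary key.
compartments = pd.DataFrame(
    {"c_name": ["cytoplasm", "nucleus"]},
    index=pd.Index(["C1", "C2"], name="c_id"),
)
species = pd.DataFrame(
    {"s_name": ["glucose", "ATP"]},
    index=pd.Index(["S1", "S2"], name="s_id"),
)
# Assumed layout: one row per species-compartment combination.
compartmentalized_species = pd.DataFrame(
    {
        "sc_name": ["glucose [cytoplasm]", "ATP [cytoplasm]", "ATP [nucleus]"],
        "s_id": ["S1", "S2", "S2"],
        "c_id": ["C1", "C1", "C2"],
    },
    index=pd.Index(["SC1", "SC2", "SC3"], name="sc_id"),
)
reaction_species = pd.DataFrame(
    {
        "r_id": ["R1", "R1"],
        "sc_id": ["SC1", "SC2"],
        "stoichiometry": [-1, -1],
    },
    index=pd.Index(["RSC1", "RSC2"], name="rsc_id"),
)

# Follow the foreign keys: reaction_species -> compartmentalized species
# -> species and compartments. Each join matches a column on the left
# against the index (primary key) of the table on the right.
participants = (
    reaction_species
    .join(compartmentalized_species, on="sc_id")
    .join(species, on="s_id")
    .join(compartments, on="c_id")
)
```

Because every foreign key points at another table's index, each hop is a single index-aligned join, and the result keeps one row per reaction participant.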

Extended Data Storage

Species Data (species_data)

  • Type: Dictionary of labeled DataFrames
  • Purpose: Additional species-specific information (e.g., expression data, annotations)
  • Index: Species ID (s_id)

Reactions Data (reactions_data)

  • Type: Dictionary of labeled DataFrames
  • Purpose: Additional reaction-specific information (e.g., kinetic parameters, confidence scores)
  • Index: Reaction ID (r_id)
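
The "dictionary of labeled DataFrames" idea can be sketched with plain pandas. This is an illustration under assumed toy data, not Napistu's internal code; the label "expression_data" and the column log2_fold_change are hypothetical.

```python
import pandas as pd

species = pd.DataFrame(
    {"s_name": ["insulin", "glucose", "ATP"]},
    index=pd.Index(["S1", "S2", "S3"], name="s_id"),
)

# A dictionary of labeled DataFrames, each indexed by s_id.
species_data = {
    "expression_data": pd.DataFrame(
        {"log2_fold_change": [1.5, -0.3]},
        index=pd.Index(["S1", "S3"], name="s_id"),
    ),
}

# The shared index makes attaching a table an index-aligned left join.
# Species without a measurement (here S2) get NaN.
annotated = species.join(species_data["expression_data"], how="left")
```

Keeping supplementary tables separate from the core species table means they can be added or dropped without touching the pathway structure itself.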

Schema Validation

  • Schema: Built-in data structure validation system
  • Function: Ensures data integrity and proper relationships between tables
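
As a rough idea of what schema validation involves, the sketch below checks a table against a small, hypothetical schema dictionary. The SCHEMA structure and the check_table helper are assumptions for illustration only and are not Napistu's actual schema object; the key and column names come from the table descriptions above.

```python
import pandas as pd

# Hypothetical schema: primary key name and required columns per table.
SCHEMA = {
    "species": {"pk": "s_id", "required": {"s_name", "s_Identifiers"}},
    "reactions": {
        "pk": "r_id",
        "required": {"r_name", "r_Identifiers", "r_isreversible"},
    },
}


def check_table(name: str, df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable schema problems (empty if none)."""
    spec = SCHEMA[name]
    problems = []
    if df.index.name != spec["pk"]:
        problems.append(f"{name}: index should be named {spec['pk']}")
    if not df.index.is_unique:
        problems.append(f"{name}: primary key values are not unique")
    missing = spec["required"] - set(df.columns)
    if missing:
        problems.append(f"{name}: missing columns {sorted(missing)}")
    return problems


# A deliberately broken table: duplicated key and a missing column.
species = pd.DataFrame(
    {"s_name": ["glucose", "ATP"]},
    index=pd.Index(["S1", "S1"], name="s_id"),
)
problems = check_table("species", species)
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a single pass.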

Object Creation Methods

1. From SBML Files

python
from napistu.sbml_dfs_core import SBML_dfs
from napistu import sbml, source

# Load from SBML file
sbml_model = sbml.SBML("pathway.sbml")
model_source = source.Source("MyDatabase", "v1.0")
sbml_dfs = SBML_dfs(sbml_model, model_source)

2. From Component DataFrames

python
# Direct construction from prepared DataFrames
tables = {
    'compartments': compartments_df,
    'species': species_df,
    'reactions': reactions_df,
    'reaction_species': reaction_species_df
}
sbml_dfs = SBML_dfs(tables, model_source)

3. From Interaction Edgelist (Recommended)

python
# Simplified creation from molecular interactions
sbml_dfs = SBML_dfs.from_edgelist(
    interaction_edgelist=interactions_df,  # upstream/downstream pairs
    species_df=species_df,                 # species definitions
    compartments_df=compartments_df,       # compartment definitions
    model_source=model_source,
    keep_species_data=True,                # preserve extra columns
    keep_reactions_data=True
)
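
Conceptually, building a model from an edgelist means turning each upstream/downstream pair into one reaction with two participants. The pandas sketch below illustrates that idea only. It is not how from_edgelist() is implemented, and the column names upstream, downstream, and role, as well as the generated IDs, are hypothetical.

```python
import pandas as pd

# Hypothetical edgelist: one row per upstream -> downstream interaction.
interactions = pd.DataFrame(
    {"upstream": ["INS", "INSR"], "downstream": ["INSR", "IRS1"]}
)
interactions = interactions.assign(
    r_id=[f"R{i}" for i in range(len(interactions))]
)

# One reaction per interaction, indexed by its primary key.
reactions = pd.DataFrame(
    {"r_name": (interactions["upstream"] + " -> " + interactions["downstream"]).tolist()},
    index=pd.Index(interactions["r_id"].tolist(), name="r_id"),
)

# Two participant rows per reaction: the upstream and the downstream member.
reaction_species = (
    interactions.melt(
        id_vars=["r_id"],
        value_vars=["upstream", "downstream"],
        var_name="role",
        value_name="s_name",
    )
    .sort_values(["r_id", "role"], ascending=[True, False])  # upstream first
    .reset_index(drop=True)
)
reaction_species.index = pd.Index(
    [f"RSC{i}" for i in range(len(reaction_species))], name="rsc_id"
)
```

A two-row edgelist therefore expands into two reactions and four reaction-species rows, which is the bookkeeping a user would otherwise have to do by hand.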

4. From Pickle Files

python
# Save and load complete SBML_dfs objects
sbml_dfs.to_pickle("my_pathway.pkl")
loaded_sbml_dfs = SBML_dfs.from_pickle("my_pathway.pkl")
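
For readers unfamiliar with pickling, the round trip can be sketched with Python's standard pickle module and a toy dictionary of DataFrames. This is a generic illustration and makes no claim about how to_pickle() or from_pickle() work internally; the file name and table contents are made up.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd

tables = {
    "species": pd.DataFrame(
        {"s_name": ["glucose"]},
        index=pd.Index(["S1"], name="s_id"),
    )
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_model.pkl"
    # Serialize the whole dictionary of tables to disk.
    with open(path, "wb") as fh:
        pickle.dump(tables, fh)
    # Read it back; only unpickle files from sources you trust.
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

round_trip_ok = restored["species"].equals(tables["species"])
```

Pickle preserves indexes and dtypes exactly, which makes it convenient for caching, but unpickling can execute arbitrary code, so it is unsuitable for files of unknown origin.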

Key Features and Capabilities

Data Access and Querying

python
# Access core tables
species = sbml_dfs.species
reactions = sbml_dfs.reactions
reaction_species = sbml_dfs.reaction_species

# Get tables with validation
species_table = sbml_dfs.get_table("species", required_attributes={'s_name'})

# Search functionality
results = sbml_dfs.search_by_name("insulin", "species", partial_match=True)
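# id_table is assumed here to be an identifier table such as the one returned by get_identifiers()
id_table = sbml_dfs.get_identifiers("species")
entities, ids = sbml_dfs.search_by_ids(id_table, identifiers=["HGNC:6091"])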

Network Analysis

python
# Network statistics
summary = sbml_dfs.get_network_summary()
# Returns: species counts, reaction counts, connectivity statistics, etc.

# Species connectivity analysis
species_features = sbml_dfs.get_species_features()  # species types
cspecies_features = sbml_dfs.get_cspecies_features()  # degree, parents, children

# Reaction analysis
formulas = sbml_dfs.reaction_formulas()  # human-readable reaction equations
summaries = sbml_dfs.reaction_summaries()  # reaction names and formulas
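
To show the kind of computation behind connectivity statistics and reaction formulas, here is a small pandas sketch over a toy reaction_species table. It is not Napistu's implementation. The sc_name column and the sign convention (negative stoichiometry for substrates, positive for products) are assumptions for this example.

```python
import pandas as pd

reaction_species = pd.DataFrame(
    {
        "r_id": ["R1", "R1", "R1", "R2", "R2"],
        "sc_name": ["glucose", "ATP", "G6P", "G6P", "F6P"],
        "stoichiometry": [-1, -1, 1, -1, 1],
    }
)

# Degree: number of distinct reactions each species takes part in.
degree = reaction_species.groupby("sc_name")["r_id"].nunique()


def formula(rxn: pd.DataFrame) -> str:
    """Build 'substrates -> products' under the assumed sign convention."""
    substrates = rxn.loc[rxn["stoichiometry"] < 0, "sc_name"]
    products = rxn.loc[rxn["stoichiometry"] > 0, "sc_name"]
    return " + ".join(substrates) + " -> " + " + ".join(products)


# One human-readable equation per reaction.
formulas = {r_id: formula(rxn) for r_id, rxn in reaction_species.groupby("r_id")}
```

Both results come from the same long-format table: grouping by species gives connectivity, and grouping by reaction gives the equations.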

Species and Reaction Investigation

python
# Investigate specific species
status = sbml_dfs.species_status("species_id")
# Returns all reactions the species participates in with stoichiometry

# Get characteristic identifiers
char_ids = sbml_dfs.get_characteristic_species_ids(dogmatic=True)

Data Management

python
# Add supplementary data
sbml_dfs.add_species_data("expression_data", expression_df)
sbml_dfs.add_reactions_data("kinetics", kinetics_df)

# Access supplementary data
expression_data = sbml_dfs.select_species_data("expression_data")

# Remove data
sbml_dfs.remove_species_data("old_annotation")
sbml_dfs.remove_reactions_data("outdated_scores")

Model Modification

python
# Remove entities
sbml_dfs.remove_reactions(["reaction_1", "reaction_2"], remove_species=True)
sbml_dfs.remove_compartmentalized_species(["sc_id_1", "sc_id_2"])

# Infer missing information
sbml_dfs.infer_sbo_terms()  # fill missing SBO terms based on stoichiometry
sbml_dfs.infer_uncompartmentalized_species_location()  # assign compartments
sbml_dfs.name_compartmentalized_species()  # standardize naming
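
Filling missing SBO terms from stoichiometry could look roughly like the following. The rule used here (negative means "reactant", positive means "product", zero is left unfilled) is an assumption for illustration and may differ from what infer_sbo_terms() actually does.

```python
import pandas as pd

reaction_species = pd.DataFrame(
    {
        "stoichiometry": [-1, 1, 0, -2],
        "sbo_term_name": [None, None, "catalyst", "reactant"],
    },
    index=pd.Index(["RSC1", "RSC2", "RSC3", "RSC4"], name="rsc_id"),
)


def infer_role(stoichiometry):
    """Assumed rule: the sign of the stoichiometry decides the role."""
    if stoichiometry < 0:
        return "reactant"
    if stoichiometry > 0:
        return "product"
    return None  # zero stoichiometry: role cannot be inferred from sign


# Only touch rows where the term is missing; existing terms are kept.
missing = reaction_species["sbo_term_name"].isna()
reaction_species.loc[missing, "sbo_term_name"] = (
    reaction_species.loc[missing, "stoichiometry"].map(infer_role)
)
```

Restricting the update to missing rows matters: a curated term such as "catalyst" should never be overwritten by an inferred one.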

Validation and Quality Control

python
# Validate data integrity
sbml_dfs.validate()  # comprehensive validation

# Auto-fix common issues
sbml_dfs.validate_and_resolve()  # iterative validation and repair

# Validate specific components (underscore-prefixed methods are private and may change between versions)
sbml_dfs._validate_identifiers()
sbml_dfs._validate_pk_fk_correspondence()
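
The idea of iterative validation and repair can be sketched as a loop that checks primary key-foreign key correspondence and drops offending rows until the check passes. This is a hypothetical illustration: find_orphans and validate_and_repair are made-up helpers, and the actual repairs performed by validate_and_resolve() may be different.

```python
import pandas as pd


def find_orphans(reaction_species: pd.DataFrame, reactions: pd.DataFrame) -> pd.Index:
    """Index labels of reaction_species rows whose r_id has no matching reaction."""
    mask = ~reaction_species["r_id"].isin(reactions.index)
    return reaction_species.loc[mask].index


def validate_and_repair(
    reaction_species: pd.DataFrame, reactions: pd.DataFrame, max_rounds: int = 3
) -> pd.DataFrame:
    """Hypothetical repair loop: drop orphaned rows, then re-validate."""
    for _ in range(max_rounds):
        orphans = find_orphans(reaction_species, reactions)
        if len(orphans) == 0:
            return reaction_species
        reaction_species = reaction_species.drop(index=orphans)
    raise ValueError("could not resolve all validation issues")


reactions = pd.DataFrame(
    {"r_name": ["hexokinase"]}, index=pd.Index(["R1"], name="r_id")
)
reaction_species = pd.DataFrame(
    {"r_id": ["R1", "R2"], "sc_id": ["SC1", "SC2"], "stoichiometry": [-1, 1]},
    index=pd.Index(["RSC1", "RSC2"], name="rsc_id"),
)
repaired = validate_and_repair(reaction_species, reactions)
```

Re-validating after each repair, with a cap on the number of rounds, guards against fixes that expose new problems without risking an endless loop.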

Export and Sharing

python
# Export to files
sbml_dfs.export_sbml_dfs(
    model_prefix="my_pathway",
    outdir="./exports/",
    overwrite=True,
    dogmatic=True  # treat genes/transcripts/proteins as separate
)

# Create deep copies
sbml_dfs_copy = sbml_dfs.copy()

Advanced Usage Patterns

Working with External Identifiers

python
# Get reference URLs for entities
urls = sbml_dfs.get_uri_urls("species", required_ontology="UniProt")

# Extract identifier tables
species_ids = sbml_dfs.get_identifiers("species")
reaction_ids = sbml_dfs.get_identifiers("reactions")

Source Tracking and Provenance

python
# Track data sources (important for consensus models)
source_counts = sbml_dfs.get_source_total_counts("species")

Integration with Consensus Models

SBML_dfs objects are designed to work seamlessly with Napistu's consensus modeling framework:

python
from napistu import consensus

# Combine multiple SBML_dfs objects
sbml_dfs_list = [sbml_dfs_1, sbml_dfs_2, sbml_dfs_3]
sbml_dfs_dict, pw_index = consensus.prepare_consensus_model(sbml_dfs_list)
consensus_model = consensus.construct_consensus_model(sbml_dfs_dict, pw_index)
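
At its simplest, building a consensus means recognizing that entries from different sources describe the same entity and collapsing them into one row while recording where each came from. The sketch below shows that idea with plain pandas. It is not the consensus module's algorithm, and the identifier values, the identifier and source columns, and the merge rule are all hypothetical.

```python
import pandas as pd

# Hypothetical species tables from two sources sharing an identifier column.
source_a = pd.DataFrame(
    {"s_name": ["insulin", "glucose"], "identifier": ["toy:001", "toy:002"]}
)
source_b = pd.DataFrame(
    {"s_name": ["INS", "ATP"], "identifier": ["toy:001", "toy:003"]}
)

# Stack the tables, tagging each row with its origin.
combined = pd.concat(
    [source_a.assign(source="A"), source_b.assign(source="B")],
    ignore_index=True,
)

# Collapse rows that share an identifier; keep the first name seen
# and record every contributing source.
consensus_species = (
    combined.groupby("identifier", sort=True)
    .agg(
        s_name=("s_name", "first"),
        sources=("source", lambda s: ",".join(sorted(set(s)))),
    )
    .reset_index()
)
```

Here "insulin" and "INS" merge into a single entry attributed to both sources, which is the kind of provenance that source tracking is meant to preserve.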

Best Practices

1. Always Validate

  • Use validate() or validate_and_resolve() after creating or modifying SBML_dfs objects
  • Enable validation during initialization (default behavior)

2. Use Edgelist Creation

  • Prefer from_edgelist() for new models - it's simpler and handles complex relationships automatically
  • Set keep_species_data=True and keep_reactions_data=True to preserve additional information

3. Leverage Built-in Analysis

  • Use get_network_summary() for quick model overview
  • Use get_species_features() and get_cspecies_features() for network analysis
  • Use species_status() to investigate specific molecular species

4. Manage Additional Data Systematically

  • Use descriptive labels for species_data and reactions_data
  • Document the meaning and source of additional data tables
  • Clean up unused data tables regularly

5. Handle Identifiers Properly

  • Ensure systematic identifiers are properly formatted
  • Use get_characteristic_species_ids() for analysis to avoid double-counting
  • Leverage identifier search functions for cross-referencing

Common Use Cases

  1. Pathway Integration: Combine multiple pathway databases into consensus models
  2. Network Analysis: Analyze pathway topology, connectivity, and structure
  3. Data Integration: Add experimental data (expression, proteomics) to pathway models
  4. Quality Control: Validate and clean pathway data from various sources
  5. Model Export: Convert between different pathway data formats
  6. Comparative Analysis: Compare pathways across species or conditions

Error Handling and Troubleshooting

SBML_dfs reports validation problems explicitly and can repair some of them automatically:

  • Validation Errors: Detailed error messages indicate specific issues
  • Auto-Resolution: validate_and_resolve() attempts to fix common problems
  • Schema Enforcement: Built-in schema validation prevents invalid data structures
  • Relationship Integrity: Automatic checking of primary-foreign key relationships

Taken together, these features make SBML_dfs a foundation for computational biology workflows that need to analyze biological pathways, maintain data integrity, and integrate data from multiple sources.
