SBML_dfs (Systems Biology Markup Language Data Frames) is the core data structure in Napistu for representing biological pathway models as interconnected pandas DataFrames. It provides a structured, programmatic way to work with complex biological networks by decomposing SBML models into manageable tabular components.
Traditional SBML files are XML-based and can be difficult to manipulate programmatically. SBML_dfs transforms these hierarchical structures into a collection of related DataFrames that are easier to query and modify.
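The idea of flattening a hierarchical XML model into related tables can be sketched with the standard library alone. The snippet below is only an illustration and is not how Napistu parses SBML: the XML is a hand-written, namespace-free toy, and the `flatten` helper is a hypothetical name introduced here.

```python
import xml.etree.ElementTree as ET

# A tiny, hand-written SBML-like document (assumption: namespaces omitted;
# real SBML files declare an XML namespace and carry far more detail).
SBML_TEXT = """
<sbml>
  <model id="toy_model">
    <listOfCompartments>
      <compartment id="c1" name="cytoplasm"/>
      <compartment id="c2" name="nucleus"/>
    </listOfCompartments>
    <listOfSpecies>
      <species id="s1" name="ProteinA" compartment="c1"/>
      <species id="s2" name="ProteinB" compartment="c1"/>
      <species id="s3" name="ProteinB" compartment="c2"/>
    </listOfSpecies>
  </model>
</sbml>
"""


def flatten(sbml_text):
    """Return (compartment_rows, species_rows) as lists of dicts (hypothetical helper)."""
    root = ET.fromstring(sbml_text)
    compartments = [dict(node.attrib) for node in root.iter("compartment")]
    species = [dict(node.attrib) for node in root.iter("species")]
    return compartments, species


compartment_rows, species_rows = flatten(SBML_TEXT)

# Once flattened, "which species are in the nucleus?" is a filter plus a
# key lookup rather than a traversal of nested XML elements.
nucleus_ids = {row["id"] for row in compartment_rows if row["name"] == "nucleus"}
in_nucleus = [row["name"] for row in species_rows if row["compartment"] in nucleus_ids]
print(in_nucleus)
```

Each list of dicts maps directly onto a table with one row per XML element, which is the shape the rest of this page works with.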
An SBML_dfs object contains five core tables that represent different aspects of a biological network:
compartments - Sub-cellular locations (e.g., cytoplasm, nucleus)
species - Molecular entities (e.g., proteins, metabolites, genes)
compartmentalized_species - Species-compartment combinations
reactions - Biological processes or interactions
reaction_species - Relationships between reactions and their participants

Two further tables can be attached alongside the core five:
species_data - Additional annotations for species (expression data, properties)
reactions_data - Additional annotations for reactions (kinetic parameters, scores)

The tables are connected through a well-defined schema with primary and foreign keys:
compartments (c_id) ←─── compartmentalized_species (sc_id)
                              ↑
species (s_id) ←──────────────┘

reactions (r_id) ←─── reaction_species (rsc_id)
                              ↑
compartmentalized_species ────┘

from napistu.ingestion.sbml import SBML
from napistu.sbml_dfs_core import SBML_dfs
from napistu.source import Source
# Load SBML file
sbml_model = SBML("path/to/model.sbml")
# Create source metadata
model_source = Source.single_entry(
    model="my_model",
    name="My Pathway Model",
    data_source="Reactome",
    organismal_species="Homo sapiens"
)
# Create SBML_dfs
sbml_dfs = SBML_dfs(sbml_model, model_source)

import pandas as pd
from napistu.identifiers import Identifiers
# Define species
species_df = pd.DataFrame({
    's_name': ['ProteinA', 'ProteinB', 'MetaboliteC'],
    's_Identifiers': [
        Identifiers([{'ontology': 'uniprot', 'identifier': 'P12345'}]),
        Identifiers([{'ontology': 'uniprot', 'identifier': 'P67890'}]),
        Identifiers([{'ontology': 'chebi', 'identifier': '12345'}])
    ]
})
# Define compartments
compartments_df = pd.DataFrame({
    'c_name': ['cytoplasm', 'nucleus'],
    'c_Identifiers': [
        Identifiers([{'ontology': 'go', 'identifier': 'GO:0005829'}]),
        Identifiers([{'ontology': 'go', 'identifier': 'GO:0005634'}])
    ]
})
# Define interactions
interaction_edgelist = pd.DataFrame({
    'upstream_name': ['ProteinA'],
    'downstream_name': ['ProteinB'],
    'r_name': ['ProteinA activates ProteinB'],
    'upstream_compartment': ['cytoplasm'],
    'downstream_compartment': ['cytoplasm'],
    'upstream_sbo_term_name': ['stimulator'],
    'downstream_sbo_term_name': ['product'],
    'r_isreversible': [False]
})
# Create SBML_dfs from interactions
sbml_dfs = SBML_dfs.from_edgelist(
    interaction_edgelist=interaction_edgelist,
    species_df=species_df,
    compartments_df=compartments_df,
    model_source=model_source
)

Napistu can merge multiple pathway models into consensus networks:
from napistu.consensus import construct_consensus_model, prepare_consensus_model
# Load multiple models
model_list = [
    SBML_dfs.from_pickle("reactome_model.pkl"),
    SBML_dfs.from_pickle("kegg_model.pkl"),
    SBML_dfs.from_pickle("bigg_model.pkl")
]
# Prepare for consensus
sbml_dfs_dict, pw_index = prepare_consensus_model(model_list)
# Create consensus model (merges entities with shared identifiers)
consensus_model = construct_consensus_model(
    sbml_dfs_dict,
    pw_index,
    dogmatic=True  # Keep genes/transcripts/proteins separate
)

SBML_dfs uses a sophisticated identifier system to link biological entities across databases:
# Get species identifiers
species_ids = sbml_dfs.get_identifiers('species')
# Get characteristic identifiers (filters out subcomponents)
char_ids = sbml_dfs.get_characteristic_species_ids(dogmatic=True)
# Search by specific identifiers
entity_subset, matching_ids = sbml_dfs.search_by_ids(
    id_table=species_ids,
    identifiers=['P53_HUMAN'],
    ontologies=['uniprot']
)

# Get network statistics
network_stats = sbml_dfs.get_network_summary()
# Get species connectivity features
species_features = sbml_dfs.get_species_features()
cspecies_features = sbml_dfs.get_cspecies_features()
# Generate reaction formulas
formulas = sbml_dfs.reaction_formulas()

# Add expression data
expression_data = pd.DataFrame({
    's_id': ['S00001', 'S00002'],
    'liver_expression': [5.2, 3.1],
    'brain_expression': [2.8, 7.4]
})
sbml_dfs.add_species_data('gtex_expression', expression_data)
# Add reaction scores
reaction_scores = pd.DataFrame({
    'r_id': ['R00001', 'R00002'],
    'confidence_score': [0.95, 0.82]
})
sbml_dfs.add_reactions_data('confidence', reaction_scores)

# Infer missing compartments
sbml_dfs.infer_uncompartmentalized_species_location()
# Infer SBO terms from stoichiometry
sbml_dfs.infer_sbo_terms()
# Name compartmentalized species
sbml_dfs.name_compartmentalized_species()
# Remove reactions and unused species
sbml_dfs.remove_reactions(['R00001', 'R00002'], remove_species=True)

# Export to various formats
sbml_dfs.export_sbml_dfs("my_model_", "output_directory/")
# Convert to network graph
from napistu.network.net_create import process_napistu_graph
network_graph = process_napistu_graph(
    sbml_dfs,
    directed=True,
    wiring_approach="regulatory",
    weighting_strategy="unweighted"
)
# Save/load models
sbml_dfs.to_pickle("my_model.pkl")
loaded_model = SBML_dfs.from_pickle("my_model.pkl")

from napistu.context import filtering
# Filter by tissue expression (requires GTEx data)
filtering.filter_species_by_attribute(
    sbml_dfs,
    'gtex',
    attribute_name='liver',
    attribute_value=1.0,  # minimum expression threshold
    inplace=True
)
# Filter by subcellular localization (requires HPA data)
filtering.filter_reactions_with_disconnected_cspecies(
    sbml_dfs, 'hpa', inplace=True
)

from napistu.ontologies.genodexito import Genodexito
# Expand identifiers using external databases
genodexito = Genodexito(
    organismal_species="Homo sapiens",
    preferred_method="bioconductor"
)
genodexito.expand_sbml_dfs_ids(
    sbml_dfs,
    ontologies=['ensembl_gene', 'uniprot']
)

Always validate your SBML_dfs objects:
sbml_dfs.validate() # Raises informative errors
# or
sbml_dfs.validate_and_resolve()  # Attempts automatic fixes

Maintain proper source metadata for reproducibility:
model_source = Source.single_entry(
    model="unique_model_id",
    pathway_id="pathway_identifier",
    name="Human readable name",
    data_source="Database name",
    organismal_species="Homo sapiens",
    date="20240315"
)

Use characteristic identifiers for mapping and analysis:
# Prefer this for cross-dataset mapping
char_ids = sbml_dfs.get_characteristic_species_ids(dogmatic=True)
# Over this (includes subcomponents and non-characteristic annotations)
all_ids = sbml_dfs.get_identifiers('species')

For large models, consider selective loading:
# Load specific tables only
species_table = sbml_dfs.get_table('species')
reactions_table = sbml_dfs.get_table('reactions')
# Remove unused data
sbml_dfs.remove_species_data('large_dataset')

SBML_dfs integrates seamlessly with other Napistu components:
napistu command-line interface for batch processing

SBML_dfs provides a powerful, flexible framework for working with biological pathway data. Its tabular structure makes complex biological networks accessible to standard data science workflows while maintaining the biological semantics and relationships essential for systems biology research.
The modular design allows users to start simple (single pathway analysis) and scale up to complex multi-database consensus models, making it suitable for both exploratory analysis and production pipelines in computational biology.
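To make the key relationships described earlier (c_id, s_id, sc_id, r_id, rsc_id) concrete, here is a minimal, dependency-free sketch that uses plain dictionaries in place of DataFrames. The table contents, the `role` column, and the helper names `check_foreign_keys` and `describe_participants` are hypothetical and chosen only for illustration; they are not part of the Napistu API.

```python
# Miniature stand-ins for the core tables, keyed by their primary keys.
compartments = {"C1": {"c_name": "cytoplasm"}}
species = {"S1": {"s_name": "ProteinA"}, "S2": {"s_name": "ProteinB"}}
compartmentalized_species = {
    "SC1": {"s_id": "S1", "c_id": "C1"},
    "SC2": {"s_id": "S2", "c_id": "C1"},
}
reactions = {"R1": {"r_name": "ProteinA activates ProteinB"}}
reaction_species = {
    "RSC1": {"r_id": "R1", "sc_id": "SC1", "role": "stimulator"},
    "RSC2": {"r_id": "R1", "sc_id": "SC2", "role": "product"},
}


def check_foreign_keys():
    """Return a list of (table, row_key, column, missing_value) for dangling keys."""
    problems = []
    for sc_id, row in compartmentalized_species.items():
        if row["s_id"] not in species:
            problems.append(("compartmentalized_species", sc_id, "s_id", row["s_id"]))
        if row["c_id"] not in compartments:
            problems.append(("compartmentalized_species", sc_id, "c_id", row["c_id"]))
    for rsc_id, row in reaction_species.items():
        if row["r_id"] not in reactions:
            problems.append(("reaction_species", rsc_id, "r_id", row["r_id"]))
        if row["sc_id"] not in compartmentalized_species:
            problems.append(("reaction_species", rsc_id, "sc_id", row["sc_id"]))
    return problems


def describe_participants(r_id):
    """Follow reaction_species -> compartmentalized_species -> species/compartments."""
    participants = []
    for row in reaction_species.values():
        if row["r_id"] != r_id:
            continue
        cspecies = compartmentalized_species[row["sc_id"]]
        participants.append((
            species[cspecies["s_id"]]["s_name"],
            compartments[cspecies["c_id"]]["c_name"],
            row["role"],
        ))
    return participants


dangling = check_foreign_keys()
participants = describe_participants("R1")
print(dangling)
print(participants)
```

With DataFrames the same lookups would typically be expressed as merges on the shared key columns; the dictionary version simply makes each hop explicit.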