SBML_dfs (Systems Biology Markup Language Data Frames) is Napistu's core data structure for representing biological pathway models as an in-memory relational database. It stores a biological network as a collection of interconnected pandas DataFrames, which makes pathway data easy to query and manipulate in computational analyses.
The design is inspired by the SBML data model. Rather than working directly with XML-based SBML files, SBML_dfs represents a biological pathway as structured pandas DataFrames linked by primary key-foreign key relationships, so ordinary pandas operations such as filtering and merging apply directly to the model.
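To make the relational idea concrete, here is a minimal sketch in plain pandas. It does not use Napistu's API: the two small tables, their rows, and the choice to hold ids in ordinary columns are assumptions made for illustration. Only the column names (`r_id`, `rsc_id`, `sc_id`, `stoichiometry`, `r_name`) follow the schema described in this document.

```python
import pandas as pd

# Hypothetical "reactions" table: one row per reaction, r_id is the primary key.
reactions = pd.DataFrame(
    {
        "r_id": ["R1", "R2"],
        "r_name": ["glucose phosphorylation", "ATP hydrolysis"],
    }
)

# Hypothetical "reaction_species" table: rsc_id is the primary key and
# r_id is a foreign key pointing back at the reactions table.
reaction_species = pd.DataFrame(
    {
        "rsc_id": ["RSC1", "RSC2", "RSC3"],
        "r_id": ["R1", "R1", "R2"],
        "sc_id": ["SC1", "SC2", "SC3"],
        "stoichiometry": [-1, 1, -1],
    }
)

# Follow the foreign key: attach each participant row to its reaction.
# validate="many_to_one" makes pandas raise if r_id is not unique in reactions.
participants = reaction_species.merge(
    reactions, on="r_id", how="left", validate="many_to_one"
)

print(participants[["rsc_id", "r_id", "sc_id", "stoichiometry", "r_name"]])
```

A left merge keeps every participant row, so a missing reaction would surface as a null `r_name` rather than as a silently dropped row.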
SBML_dfs organizes biological pathway information into five core tables connected by primary key-foreign key relationships:
| Table | Primary key | Other columns |
|---|---|---|
| compartments | c_id | c_name, c_Identifiers |
| species | s_id | s_name, s_Identifiers |
| reactions | r_id | r_name, r_Identifiers, r_isreversible |
| reaction_species | rsc_id | r_id, sc_id, stoichiometry, sbo_term_name |
| species_data | s_id | |
| reactions_data | r_id | |

from napistu.sbml_dfs_core import SBML_dfs
from napistu import sbml, source
# Load from SBML file
sbml_model = sbml.SBML("pathway.sbml")
model_source = source.Source("MyDatabase", "v1.0")
sbml_dfs = SBML_dfs(sbml_model, model_source)

# Direct construction from prepared DataFrames
tables = {
    'compartments': compartments_df,
    'species': species_df,
    'reactions': reactions_df,
    'reaction_species': reaction_species_df
}
sbml_dfs = SBML_dfs(tables, model_source)

# Simplified creation from molecular interactions
sbml_dfs = SBML_dfs.from_edgelist(
    interaction_edgelist=interactions_df,  # upstream/downstream pairs
    species_df=species_df,  # species definitions
    compartments_df=compartments_df,  # compartment definitions
    model_source=model_source,
    keep_species_data=True,  # preserve extra columns
    keep_reactions_data=True
)

# Save and load complete SBML_dfs objects
sbml_dfs.to_pickle("my_pathway.pkl")
loaded_sbml_dfs = SBML_dfs.from_pickle("my_pathway.pkl")

# Access core tables
species = sbml_dfs.species
reactions = sbml_dfs.reactions
reaction_species = sbml_dfs.reaction_species
# Get tables with validation
species_table = sbml_dfs.get_table("species", required_attributes={'s_name'})
# Search functionality
results = sbml_dfs.search_by_name("insulin", "species", partial_match=True)
entities, ids = sbml_dfs.search_by_ids(id_table, identifiers=["HGNC:6091"])

# Network statistics
summary = sbml_dfs.get_network_summary()
# Returns: species counts, reaction counts, connectivity statistics, etc.
# Species connectivity analysis
species_features = sbml_dfs.get_species_features() # species types
cspecies_features = sbml_dfs.get_cspecies_features() # degree, parents, children
# Reaction analysis
formulas = sbml_dfs.reaction_formulas() # human-readable reaction equations
summaries = sbml_dfs.reaction_summaries()  # reaction names and formulas

# Investigate specific species
status = sbml_dfs.species_status("species_id")
# Returns all reactions the species participates in with stoichiometry
# Get characteristic identifiers
char_ids = sbml_dfs.get_characteristic_species_ids(dogmatic=True)

# Add supplementary data
sbml_dfs.add_species_data("expression_data", expression_df)
sbml_dfs.add_reactions_data("kinetics", kinetics_df)
# Access supplementary data
expression_data = sbml_dfs.select_species_data("expression_data")
# Remove data
sbml_dfs.remove_species_data("old_annotation")
sbml_dfs.remove_reactions_data("outdated_scores")

# Remove entities
sbml_dfs.remove_reactions(["reaction_1", "reaction_2"], remove_species=True)
sbml_dfs.remove_compartmentalized_species(["sc_id_1", "sc_id_2"])
# Infer missing information
sbml_dfs.infer_sbo_terms() # fill missing SBO terms based on stoichiometry
sbml_dfs.infer_uncompartmentalized_species_location() # assign compartments
sbml_dfs.name_compartmentalized_species()  # standardize naming

# Validate data integrity
sbml_dfs.validate() # comprehensive validation
# Auto-fix common issues
sbml_dfs.validate_and_resolve() # iterative validation and repair
# Manual validation of specific components
sbml_dfs._validate_identifiers()
sbml_dfs._validate_pk_fk_correspondence()

# Export to files
sbml_dfs.export_sbml_dfs(
    model_prefix="my_pathway",
    outdir="./exports/",
    overwrite=True,
    dogmatic=True  # treat genes/transcripts/proteins as separate
)
# Create deep copies
sbml_dfs_copy = sbml_dfs.copy()

# Get reference URLs for entities
urls = sbml_dfs.get_uri_urls("species", required_ontology="UniProt")
# Extract identifier tables
species_ids = sbml_dfs.get_identifiers("species")
reaction_ids = sbml_dfs.get_identifiers("reactions")

# Track data sources (important for consensus models)
source_counts = sbml_dfs.get_source_total_counts("species")

SBML_dfs objects are designed to work with Napistu's consensus modeling framework:
from napistu import consensus
# Combine multiple SBML_dfs objects
sbml_dfs_list = [sbml_dfs_1, sbml_dfs_2, sbml_dfs_3]
sbml_dfs_dict, pw_index = consensus.prepare_consensus_model(sbml_dfs_list)
consensus_model = consensus.construct_consensus_model(sbml_dfs_dict, pw_index)

- Call validate() or validate_and_resolve() after creating or modifying SBML_dfs objects.
- Use from_edgelist() for new models; it is simpler and handles complex relationships automatically.
- Set keep_species_data=True and keep_reactions_data=True to preserve additional information.
- Use get_network_summary() for a quick model overview.
- Use get_species_features() and get_cspecies_features() for network analysis.
- Use species_status() to investigate specific molecular species.
- Keep supplementary data in species_data and reactions_data.
- Use get_characteristic_species_ids() for analysis to avoid double-counting.

SBML_dfs includes robust error handling and automatic resolution capabilities:
- validate_and_resolve() attempts to fix common problems.

The SBML_dfs framework provides a powerful, flexible foundation for computational biology workflows, enabling sophisticated analysis of biological pathways while maintaining data integrity and supporting complex multi-source data integration.
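As a rough illustration of the primary key-foreign key integrity that validation is concerned with, the sketch below checks one relationship in plain pandas. It is not Napistu's implementation: `find_orphan_foreign_keys` is a hypothetical helper, the tables and rows are invented, and dropping the offending rows is only one possible repair, not a description of what `validate_and_resolve()` actually does.

```python
import pandas as pd


def find_orphan_foreign_keys(child: pd.DataFrame, parent: pd.DataFrame, key: str) -> set:
    """Return values of child[key] that have no matching value in parent[key]."""
    return set(child[key]) - set(parent[key])


# Hypothetical parent table: r_id is the primary key.
reactions = pd.DataFrame({"r_id": ["R1", "R2"], "r_name": ["A to B", "B to C"]})

# Hypothetical child table: r_id is a foreign key into reactions.
reaction_species = pd.DataFrame(
    {
        "rsc_id": ["RSC1", "RSC2", "RSC3"],
        "r_id": ["R1", "R2", "R9"],  # R9 is deliberately absent from reactions
        "sc_id": ["SC1", "SC2", "SC3"],
    }
)

# Detect foreign keys that point at nothing.
orphans = find_orphan_foreign_keys(reaction_species, reactions, "r_id")
print("orphaned r_id values:", orphans)

# One possible repair (an assumption, not Napistu's behavior): drop orphan rows.
repaired = reaction_species[~reaction_species["r_id"].isin(orphans)]
print(repaired)
```

Reporting the orphaned keys before changing anything keeps the repair step explicit and easy to audit.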