Background & Summary

The annual increase in the volume and complexity of biomedical data poses significant challenges for analysts interested in comprehensive data integration, and advanced tools are required to harness the potential of biomedical datasets that include omics data. Knowledge graphs are the best near-term solutions for the integration and analysis of heterogeneous data within and across large biomedical datasets1. Methods such as node and link prediction algorithms and supervised and unsupervised machine learning can be applied to biomedical knowledge graphs for many types of use cases.

Most biomedical knowledge graphs are customized for particular use cases. In recent years, knowledge graphs have been adopted rapidly, with applications spanning drug discovery, drug repurposing, and the prediction of drug targets2,3,4,5,6. Other biomedical knowledge graphs have integrated heterogeneous COVID-19 data7,8,9,10,11,12,13, oncology datasets14,15,16, and gene-disease associations3,17. These knowledge graphs are highly application-specific, which is expected given the gains in efficiency and ease of analysis. General genomic data-integrated knowledge graphs such as Petagraph and GenomicKB18 are expected to increase in number, helped in no small part by the maturation of projects on ontological unification such as the Monarch Initiative19 and on genomics data standards such as GA4GH20.

To facilitate the widespread adoption of knowledge graphs in the biomedical research community, we aimed to develop a modular knowledge graph framework comprising ontologies, vocabularies, standards, and commonly used data resources including omics data. This framework would enable the efficient creation of knowledge graphs for diverse applications. Our requirements for this knowledge graph framework included the need to accommodate various omics data types within a network of interconnected standards and ontology systems to ingest and seamlessly link experimentally-derived data to other data sources within the graph.

To support these requirements, we created the Unified Biomedical Knowledge Graph (UBKG)21. The UBKG takes the form of a property graph, based on the NIH Unified Medical Language System (UMLS)22, and incorporates 105 English language-based ontologies and standards updated regularly from the biannual UMLS releases. The UBKG extends the content from the UMLS by adding biomedical data from a variety of other sources, including both standard ontologies and custom reference data. The UBKG can be adapted to construct a variety of ontology-based biomedical knowledge graphs. It has already been adapted for projects such as HuBMAP23, SenNet24, and the NIH Common Fund Data Ecosystem Data Distillery Project25. Table 2 provides a list of additional ontologies that the UBKG contributes to the UMLS ontology collection, which will enhance support for building knowledge graphs and supporting future omics initiatives.

Adopting the UBKG as a base ontological scaffold, we constructed Petagraph, a biomedical knowledge graph that embeds and connects omics and related data into the UBKG’s ontological structure, effectively ‘bringing the data to the ontologies’ by embedding these data within a richly connected annotation environment. Petagraph has added over 12 million nodes to the UBKG, an increase of almost 44%, more than doubling the number of relationships from 52 million to 118 million. Petagraph increases independent relationship types to 1,861 versus UBKG’s 1,756. Petagraph was originally piloted as a resource for rapid feature selection to identify, annotate, and explore gene candidates for human diseases, and has since expanded to act as a base module for user-customized knowledge graphs for other types of biomedical use cases. To create Petagraph, we added 21 sources of supporting omics and annotation to UBKG (Tables 2 and 3). Among these additions is the Homo sapiens Chromosomal Location Ontology (HSCLO38), developed to allow Petagraph queries to easily link relevant genomics features across different resolutions by chromosome position and chromosomal vicinity26. The modular design for data incorporation enables any user to add and/or subset data on top of Petagraph’s omics-rich data structure for their particular use case. Petagraph is therefore distinguished by its generalizable schema and richly annotated structure, which will allow for the subsetting of highly integrated omics data for a myriad of use cases.

In choosing datasets for Petagraph, we prioritized the integration of interpreted knowledge over raw experimental data. The genomic and related datasets within Petagraph have been curated, harmonized, and interpreted through thousands of hours of expert effort, ensuring that users have access to high-quality data for their queries and thus increasing the likelihood of generating meaningful insights. We show that queries on Petagraph can, in fact, rapidly return meaningful results for a diverse set of example use cases, including those for annotation and analysis. The integration of large datasets into Petagraph has the potential to advance how biomedical and biomolecular data is mined and analyzed. By harmonizing diverse data types, including genomics, transcriptomics, proteomics, and clinical data, Petagraph supports system-wide approaches to analysis.

Use cases for Petagraph include but are not limited to identifying genomic features functionally linked to genes or diseases, linking genetics data between human and animal models, linking transcriptional perturbations by compound in tissues of interest, and identifying cell types from single-cell data that are most associated with diseases or genes. Analytical use cases for Petagraph’s data collection include applying machine learning methods or topological analyses to predict new relationships (link prediction) or to predict new properties of biomedical data types (node property prediction).

The scalable and modular design of Petagraph allows for continuous incorporation of new datasets. Petagraph’s mature architecture is designed to be extended easily by users wishing to build their own custom knowledge graph. Utilizing the Unified Biomedical Knowledge Graph (UBKG) ingestion protocols, Petagraph users can easily integrate additional data sources, whether they are publicly available or proprietary. This modularity ensures that researchers can curate the knowledge graph to include specific builds relevant to their unique applications. The node and edge CSV files that are used to construct Petagraph are 12.5 GB in size and can be used to reconstruct the database or can be used with our instructions to expand or change the knowledge graph with the UBKG as the base scaffold. We provide a 4.5 GB database dump of Petagraph that can be installed on local laptops with at least 16 GB of memory. Petagraph is currently being utilized as a base module by the NIH Common Fund Data Ecosystem (CFDE) Data Distillery Project to integrate Common Fund omics data from twelve data coordination centers25.

Future work will focus on expanding Petagraph’s capabilities, enhancing automated validation techniques, and developing standardized benchmarks for knowledge graph comparisons. Another emerging and important research area is the integration of knowledge graphs with Large Language Models (LLMs) to enhance biomedical data analysis. Curated knowledge graphs like Petagraph provide a source of structured knowledge that can improve LLMs’ understanding and generation of biomedical knowledge. LLMs can use Petagraph to generate more accurate and contextually relevant responses to complex biomedical queries, with many potential applications in personalized medicine, drug discovery, and clinical outcome prediction.

In conclusion, Petagraph represents a significant advancement in the integration and analysis of multi-omics and biomedical data. Its robust framework, extensive dataset integration, and advanced analytical capabilities position it as a valuable resource for the biomedical research community, enabling new discoveries and fostering a deeper understanding of complex biological systems.

Methods

We discuss the origins and versions of records used in Petagraph by the type of data source: ontologies, mappings, and source genomics data sets. Details of these datasets can be found in Tables 2 and 3. For clarity, we indicate graph elements in courier font, for example, distinguishing graph elements such as HGNC nodes from the dataset source HGNC.

Records for supporting ontologies

This section represents ontologies and ontology collections.

UBKG base context

The UBKG acts as the base build for Petagraph. The foundation of the UBKG is a set of entities and relationships obtained from the NIH Unified Medical Language System (UMLS). The UMLS consolidates the content of a large number of standard biomedical vocabularies. The set of assertions obtained from extracted UMLS 2023AB content is referred to as the UMLS source context27. The UBKG adds to the data from UMLS information from a variety of sources, including:

  • Ontologies published in OBO Foundry28 and NCBO BioPortal29

  • Reference sites, such as UniProt30 and GENCODE31

  • Custom data sources, including those from CFDE partners such as HuBMAP23

The combination of the UMLS source context with the additional sources is referred to as the UBKG base context27.

Homo sapiens chromosomal location ontology (HSCLO38)

The Petagraph team created the Homo sapiens Chromosomal Location Ontology for GRCh38 (HSCLO38) primarily to connect genomic features by chromosomal position. HSCLO38 represents the locations of features from annotation standards such as GENCODE and from datasets such as 4DN and GTEXEQTL as searchable nodes in the graph at 1 kbp resolution. The HSCLO38 nodes are defined at five resolution levels (chromosome, 1 Mbp, 100 kbp, 10 kbp, and 1 kbp), with each level connected to the levels above and below it. The HSCLO38 tree contains 3,431,155 nodes and 6,862,195 relationships (13,724,390 including reverse edges)26.
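As a rough illustration of the five-level layout described above, the following sketch computes the interval that contains a GRCh38 position at each resolution. The function name, the 1-based closed-interval convention (1-1000, 1001-2000, and so on), and the returned tuples are assumptions made for illustration; they are not the actual HSCLO38 node identifiers.

```python
from __future__ import annotations

# Resolution levels below the chromosome level, coarsest first (sizes in bp).
RESOLUTIONS_BP = [("1Mbp", 1_000_000), ("100kbp", 100_000), ("10kbp", 10_000), ("1kbp", 1_000)]


def hsclo_like_path(chrom: str, pos: int) -> list[tuple[str, str, int | None, int | None]]:
    """Return (level, chromosome, start, end) tuples from coarsest to finest.

    Assumes 1-based positions and closed intervals; the chromosome level
    carries no coordinates because chromosome lengths are not modelled here.
    """
    if pos < 1:
        raise ValueError("position must be >= 1")
    path = [("chromosome", chrom, None, None)]
    for level, size in RESOLUTIONS_BP:
        start = ((pos - 1) // size) * size + 1
        path.append((level, chrom, start, start + size - 1))
    return path


# Each entry is the parent of the next one, mirroring the up/down links between levels.
path = hsclo_like_path("chr1", 1_234_567)
```

Walking the returned list from the last entry to the first gives the chain of progressively larger intervals that a feature mapped at 1 kbp resolution would belong to.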

Supporting mappings

Genomic features-to-chromosomal position (GENCODEHSCLO)

Gene names with chromosomal positions were downloaded on 2023-12-10 from GENCODE31 v41 (https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_41/). GENCODE genomic features are mapped to 1 kbp resolution nodes in HSCLO38 at their 5′ and 3′ genomic positions. The preprocessing script is provided in GitHub (Table 9).

Human-to-mouse ortholog mappings (HCOP)

Human-to-mouse orthology mappings were downloaded from the HGNC Comparisons of Orthology Predictions (HCOP) tool (https://www.genenames.org/tools/hcop/)32 on 2021-04-15. HCOP maps mouse orthologs to 20,715 of the 41,638 HGNC Concept nodes. Each orthologous pair of human-mouse gene nodes shares the reciprocal relationships ‘has_human_ortholog’ and ‘has_mouse_ortholog’.

Human gene-to-phenotype mappings (HGNCHPO)

Mapping data from human genes to human phenotypes were obtained in July 2021 from the Human Phenotype Ontology (HPO) (https://hpo.jax.org/app/data/annotations)33. The dataset contains 4,545 genes mapped to at least one phenotype and 10,896 phenotypes mapped to at least one gene. The HUGO Gene Nomenclature Committee’s (HGNC)34 gene names are included with the UMLS as HGNC Concept nodes, which are connected to HPO Concept nodes through an ‘associated_with’ relationship type.

Human-to-mouse phenotype mappings (HPOMP)

Human-to-mouse phenotype mapping data connecting Human Phenotype Ontology (HPO) terms to Mammalian Phenotype Ontology (MP) terms were generated on 2020-12-12 using PheKnowLator35 v1 (Table 9). We included only PheKnowLator’s “exact phenotype matches” between HPO and MP, resulting in ~1000 mappings. More detailed descriptions of the PheKnowLator mapping scores can be found on the project’s GitHub page (https://github.com/callahantiff/PheKnowLator). Precise mapping of human phenotypes in HPO to mammalian phenotypes in MP is an ongoing effort of the MONDO and uPheno projects19.

Mouse gene-to-phenotype mappings (MPMGI)

We were interested in including mouse gene-to-phenotype relationships. Data were downloaded (2021-01-10) as multiple datasets from two separate databases. The first set of datasets was downloaded from the FTP site of the International Mouse Phenotyping Consortium (IMPC)36 (https://www.mousephenotype.org/) and comprises genotype and phenotype assertions (http://ftp.ebi.ac.uk/pub/databases/impc/all-data-releases/latest/results/genotype-phenotype-assertions-ALL.csv.gz) and the results of statistical analyses connecting mouse genes to phenotypes (http://ftp.ebi.ac.uk/pub/databases/impc/all-data-releases/latest/results/statistical-results-ALL.csv.gz). The second set of datasets was obtained from the Mouse Genome Informatics (MGI) database37 and can be found at http://www.informatics.jax.org/downloads/reports/index.html#pheno. We imported the MGI datasets PhenoGeno (http://www.informatics.jax.org/downloads/reports/MGI_PhenoGenoMP.rpt), GenePheno (http://www.informatics.jax.org/downloads/reports/MGI_GenePheno.rpt), and Geno_Disease (http://www.informatics.jax.org/downloads/reports/MGI_Geno_DiseaseDO.rpt). All three datasets contain, among other data, phenotype-to-gene mappings. The datasets from IMPC and MGI were combined to create a mouse genotype-to-phenotype dataset. This master dataset, MPMGI, contains 10,380 MP terms that are mapped to at least one gene and 17,936 genes that are mapped to at least one MP term.

Molecular signatures database (MSIGDB)

MSigDB is a collection of gene set resources, curated or collected from several different sources38,39 (https://www.gsea-msigdb.org/gsea/msigdb/). Five subsets of MSigDB v7.4 datasets were introduced as entity-gene relationships to the knowledge graph: C1 (positional gene sets), C2 (curated gene sets), C3 (regulatory target gene sets), C8 (cell type signature gene sets) and H (hallmark gene sets). With this subset, we created MSIGDB Concept nodes for 31,516 MSigDB systematic names (used as Codes). The MSIGDB and HGNC Concept nodes are connected by relationships that reflect the content of each of the MSigDB subsets. For example, a pathway in the MSigDB Hallmark dataset will link to its member genes through the has_signature_gene and reverse as inverse_has_signature_gene edge types. The MSIGDB Term names were also compiled according to the MSigDB generic entity names. Collectively, MSigDB adds 2,598,060 Concept-to-Concept direct and inverse relationships to Petagraph. Details on MSigDB relationships types connecting the Concept nodes are found in Table 7. Preprocessing scripts for MSIGDB relationships are deposited in GitHub (Table 9).

Human-to-rat ENSEMBL mappings (RATHCOP)

The human-to-rat ENSEMBL40 ortholog mappings were obtained from the HGNC Comparisons of Orthology Predictions tool (HCOP)32 (https://www.genenames.org/tools/hcop/) and downloaded on 2023-11-16.

Supporting genomics data sets

Summary details for each mapping and quantitative dataset are provided in Tables 3 and 4.

4D Nucleome program (4DN)

We obtained 21 loop files (Table 8) stored in dot call format from the 4D Nucleome project41 website (https://www.4dnucleome.org) on 2023-01-05. The loop files were processed for ingestion by creating four kinds of nodes: dataset nodes (SAB: 4DND), whose terms contain the dataset information (assay type, lab, and cell type involved); file nodes (SAB: 4DNF), whose terms contain the file information; loop nodes (SAB: 4DNL), which were attached to the HSCLO38 nodes at the 1 kbp resolution level that correspond to the start and end of the upstream and downstream anchors of each loop; and q-value nodes (SAB: 4DNQ), which correspond to the donut q-value of the loops. Preprocessing scripts to format the loop and q-value data were deposited in GitHub (Table 9).

Single cell fetal heart data (ASP2019)

This dataset includes single cell RNA-seq data from human fetal heart tissue as described in Asp et al.42. The data were downloaded on 2021-08-10 (https://www.spatialresearch.org/resources-published-datasets/doi-10-1016-j-cell-2019-11-025/). The average expression of each gene within each author-supplied cell type cluster was calculated with a preprocessing script (Table 9) and used to represent that gene within the cluster. Single cell fetal heart Concept nodes were created and connected to cell type nodes from the Cell Ontology (CL) and to HGNC nodes. Some cell types defined in the Asp et al. paper do not currently exist in the CL. We created our own cell type Concept nodes for these cell types with an SAB of ASP2019CLUSTER. The Single cell heart Code nodes have an SAB of ASP2019.
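The per-cluster averaging described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing script: the input layout (one cluster label and one gene-to-expression mapping per cell) is an assumption, as is the choice to treat a gene that is absent from a cell as zero expression.

```python
from collections import defaultdict


def mean_expression_by_cluster(cells):
    """Average expression per (cluster, gene).

    cells: iterable of (cluster_label, {gene: expression}) pairs, one per cell.
    A gene missing from a cell contributes zero, because the sum is divided
    by the total number of cells in the cluster.
    """
    totals = defaultdict(float)
    n_cells = defaultdict(int)
    for cluster, expression in cells:
        n_cells[cluster] += 1
        for gene, value in expression.items():
            totals[(cluster, gene)] += value
    return {key: total / n_cells[key[0]] for key, total in totals.items()}


# Hypothetical cluster and gene labels, for illustration only.
example_cells = [
    ("cluster_1", {"GENE_A": 2.0}),
    ("cluster_1", {"GENE_A": 4.0, "GENE_B": 3.0}),
    ("cluster_2", {"GENE_A": 1.0}),
]
means = mean_expression_by_cluster(example_cells)
```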

ClinVar (CLINVAR)

The ClinVar43 human genetic variants-phenotype submission summary dataset was used to define relationships between human genes and phenotypes. It was downloaded on 2023-01-05 from the FTP site (https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz) and preprocessed to create edge files (Table 9). As a preprocessing step, we imported only genes with single nucleotide variants (SNVs) annotated as pathogenic, likely pathogenic, or pathogenic/likely pathogenic. Diseases and phenotypes included from ClinVar were sourced from MEDGEN44, MONDO45, HPO33, EFO46 and MESH47. Diseases and phenotypes were linked to genes with the 'gene_associated_with_disease_or_phenotype' edge type and its reverse, 'inverse_gene_associated_with_disease_or_phenotype'. As a result, ClinVar contributes 214,040 relationships (including reverse) connecting genes to diseases and/or phenotypes, thus connecting HGNC Concept nodes with MEDGEN, MONDO, HPO, EFO, and MSH Concept nodes. Preprocessing scripts for CLINVAR relationships were deposited in GitHub (Table 9).
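A minimal sketch of the filtering step described above, under stated assumptions: it reads the `Type`, `ClinicalSignificance`, `GeneSymbol`, and `PhenotypeIDS` columns of a variant_summary-style table, and it assumes that phenotype identifiers within a row are separated by "," and "|". The sample rows and identifiers are invented for illustration, and the authors' own script (Table 9) may differ in detail.

```python
import csv
import io

# Clinical significance values retained, per the filtering described above.
KEEP_SIGNIFICANCE = {"pathogenic", "likely pathogenic", "pathogenic/likely pathogenic"}
FORWARD = "gene_associated_with_disease_or_phenotype"
INVERSE = "inverse_gene_associated_with_disease_or_phenotype"


def clinvar_edges(tsv_text):
    """Return a set of (subject, predicate, object) triples, forward and inverse."""
    edges = set()
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        if row["Type"] != "single nucleotide variant":
            continue
        if row["ClinicalSignificance"].strip().lower() not in KEEP_SIGNIFICANCE:
            continue
        gene = row["GeneSymbol"]
        # Assumption: identifiers are separated by "," and "|" within a row.
        for phenotype in row["PhenotypeIDS"].replace("|", ",").split(","):
            phenotype = phenotype.strip()
            if not phenotype or phenotype == "-":
                continue
            edges.add((gene, FORWARD, phenotype))
            edges.add((phenotype, INVERSE, gene))
    return edges


# Invented rows: only the first passes both filters.
SAMPLE = (
    "GeneSymbol\tType\tClinicalSignificance\tPhenotypeIDS\n"
    "GENE1\tsingle nucleotide variant\tPathogenic\tMedGen:C0000001,MedGen:C0000002\n"
    "GENE1\tDeletion\tPathogenic\tMedGen:C0000003\n"
    "GENE2\tsingle nucleotide variant\tBenign\tMedGen:C0000004\n"
)
edges = clinvar_edges(SAMPLE)
```

Collecting triples in a set deduplicates gene-phenotype pairs that are supported by more than one variant, which is one plausible way to arrive at a single relationship per pair.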

Connectivity MAP (CMAP)

We obtained the edge lists of the CMAP Signatures of Differentially Expressed Genes for Small Molecules dataset from the Harmonizome database (https://maayanlab.cloud/Harmonizome/dataset/CMAP+Signatures+of+Differentially+Expressed+Genes+for+Small+Molecules)48,49. These edge lists combine chemical data from CHEBI50 with HGNC gene IDs. The dataset features genes from microarray gene expression molecular signatures that were responsive to a chemical perturbation introduced to selected human cell lines. CHEBI and HGNC connectors in CMAP have edge types 'positively_correlated_with_gene' and 'negatively_correlated_with_gene' plus their inverses. The dataset added 2,625,336 relationships (including reverse). Preprocessing scripts for CMAP relationships were deposited in GitHub (Table 9).

GTEx, Expression and eQTL data (GTEXEXP and GTEXEQTL)

Human gene expression per tissue data was obtained from Genotype-Tissue Expression (GTEx) Portal (Version 8) (https://gtexportal.org/home/datasets) on 2023-04-10. We preprocessed the gene expression dataset (GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm), as well as the expression quantitative trait loci (eQTL) dataset (GTEx_Analysis_v8_eQTL) (Table 9). The gene expression dataset contains expression profiles from 54 tissues across 56,200 transcripts. The eQTLs dataset contains over 1.2 million eQTLs from 49 tissues. GTEx includes HGNC gene IDs and Uberon tissue names51. We created Concept nodes for each eQTL and each tissue-gene expression pair. The eQTL Concepts were then connected to their corresponding tissue node (UBERON), gene node (HGNC) and genomic location node (HSCLO38). The gene expression nodes are connected to their corresponding tissue node and gene node. We also integrated quantitative data from GTEx including p-values for the eQTL data and transcripts per million (TPM) for the gene expression data into the graph. In order to reduce redundancy of nodes we created bin Concept nodes for these quantitative data types. For example, if the gene TP53 has a TPM of 10.5 in the heart, the GTEx Concept for TP53 - Heart will be connected to the ‘[10.0.11.0]’ TPM bin Concept node. Similarly, if an eQTL has a p-value of 0.0005 it will be connected to the ‘[0.0001.0.001]’ p-value bin Concept node.
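The binning idea can be sketched as below. The two functions reproduce the examples in the text (a TPM of 10.5 falls between 10.0 and 11.0, and a p-value of 0.0005 falls between 0.0001 and 0.001), but the unit-width TPM bins and decade-wide p-value bins are assumptions inferred from those two examples; the actual bin edges used in Petagraph may differ.

```python
from __future__ import annotations

import math


def tpm_bin(tpm: float, width: float = 1.0) -> tuple[float, float]:
    """Return the (lower, upper) bounds of a fixed-width bin containing tpm."""
    if tpm < 0:
        raise ValueError("TPM must be non-negative")
    lower = math.floor(tpm / width) * width
    return (lower, lower + width)


def pvalue_bin(p: float) -> tuple[float, float]:
    """Return the (lower, upper) bounds of the power-of-ten bin containing p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p-value must be in (0, 1]")
    # Cap the exponent so that p == 1.0 falls in the (0.1, 1.0) bin.
    exponent = min(math.floor(math.log10(p)), -1)
    return (10.0 ** exponent, 10.0 ** (exponent + 1))
```

Because many gene-tissue pairs share a bin, each bin needs only one node, which is how binning reduces node redundancy compared with one node per distinct value.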

GTEx, Coexpression data (GTEXCOEXP)

Coexpression of human genes was computed using the GTEx gene TPM (transcripts per million) normalized data (GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm) (https://gtexportal.org/home/datasets) by first computing the correlation matrix for each of the GTEx tissues and then selecting the entries with a pairwise Pearson’s correlation above 0.99. This preprocessing was done for 54 human tissues and cell types. As a result, 1,078,042 relationships (including reverse) were ingested into the KG, connecting HGNC Concept nodes. These relationships link genes that are co-expressed, with evidence codes reflecting the number of tissues in which the connected genes meet the criteria for co-expression. Preprocessing scripts for GTEXCOEXP relationships were deposited in GitHub (Table 9).
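The selection procedure described above can be sketched in plain Python. This is an illustrative, unoptimized version that loops over gene pairs rather than computing a full correlation matrix; the input layout and function names are assumptions, and the authors' script (Table 9) should be consulted for the actual implementation.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def coexpressed_pairs(tpm_by_tissue, threshold=0.99):
    """Count, per gene pair, the tissues in which the correlation exceeds threshold.

    tpm_by_tissue: {tissue: {gene: [TPM per sample]}}
    Returns a Counter keyed by (gene_a, gene_b) with gene_a < gene_b.
    """
    tissue_counts = Counter()
    for genes in tpm_by_tissue.values():
        for gene_a, gene_b in combinations(sorted(genes), 2):
            if pearson(genes[gene_a], genes[gene_b]) > threshold:
                tissue_counts[(gene_a, gene_b)] += 1
    return tissue_counts


# Invented values for two tissues and three genes.
example = {
    "tissue_1": {"G1": [1, 2, 3, 4], "G2": [2, 4, 6, 8], "G3": [4, 1, 3, 2]},
    "tissue_2": {"G1": [1, 2, 3, 4], "G2": [1.0, 2.1, 2.9, 4.0], "G3": [1, 1, 1, 1]},
}
counts = coexpressed_pairs(example)
```

The per-pair tissue count plays the role of the evidence value mentioned above.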

GlyGen selected datasets (GLYGEN)

Five datasets from the GlyGen data website52,53 were chosen based on their relevance to our preliminary use cases; for all datasets we used release v1.12.3. The first two datasets are lists of genes that code for glycosyltransferase proteins in human (https://data.glygen.org/GLY_000004) and mouse (https://data.glygen.org/GLY_000030). These datasets were modeled by creating a ‘human glycosyltransferase’ Concept node as well as a ‘mouse glycosyltransferase’ Concept node. Then, the Concept nodes for human genes (HGNC nodes) and mouse genes (MGI nodes) were connected to their respective glycosyltransferase nodes with an ‘is_glycotransferase’ relationship. The next three datasets contain human O-linked and N-linked glycosylation information from GlyGen, namely O-GlcNac (https://data.glygen.org/GLY_000517), Glyconnect (https://data.glygen.org/GLY_000329) and UniCarbKB (https://data.glygen.org/GLY_000138). These datasets contain information on human proteoforms, such as the exact residue on a protein isoform that is glycosylated, the type of glycosylation, and the glycans found to bind that amino acid. To define relationships between human proteins from UniProtKB (UNIPROTKB Concept nodes)54 and glycans from the ChEBI resource50 (as included in CHEBI data), we introduced an intermediary ontology of glycosylation sites derived from the information in these three datasets. In this process, we added 38,344 protein isoform relationships and glycosylation_type_site relationships to GLYGEN based on the three selected data sources.

Gabriella Miller Kids First datasets (KFPT)

To provide a test set for human phenotype-to-genetic-variant analysis in Petagraph, we imported subject ID-to-phenotype mappings and cohort-wide gene-variant counts downloaded from the Gabriella Miller Kids First (GMKF) Data Resource Center (https://portal.kidsfirstdrc.org) on 2022-07-01. We used summarized de-identified data originating from a Kids First congenital heart defects cohort (phs001138.v4, https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001138.v4.p2)55 based on 5,006 subject-parent trios. Subjects were modeled as Concept nodes with an SAB of KFPT, for ‘Kids First Patient’, and subject IDs were connected to their respective HPO Concept nodes. Our analysis added counts of damaging de novo single nucleotide variants observed per gene across all probands’ VCFs, based on a VEP prediction of ‘HIGH’ impact. A single summed count of predicted damaging variants per gene is reported for this cohort.
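The cohort-wide summation described above amounts to a filtered count per gene. The sketch below assumes that each variant has already been annotated and flattened into a record; the record keys (`gene`, `impact`, `variant_class`, `de_novo`) are hypothetical names chosen for this example, with `HIGH` and `SNV` following the VEP impact and variant-class vocabulary.

```python
from collections import Counter


def damaging_de_novo_counts(variants):
    """Sum HIGH-impact de novo SNVs per gene across all probands.

    variants: iterable of dicts with the hypothetical keys
    'gene', 'impact', 'variant_class', and 'de_novo'.
    """
    counts = Counter()
    for variant in variants:
        if (
            variant["de_novo"]
            and variant["impact"] == "HIGH"
            and variant["variant_class"] == "SNV"
        ):
            counts[variant["gene"]] += 1
    return counts


# Invented records pooled from several probands.
example_variants = [
    {"gene": "GENE_A", "impact": "HIGH", "variant_class": "SNV", "de_novo": True},
    {"gene": "GENE_A", "impact": "HIGH", "variant_class": "SNV", "de_novo": True},
    {"gene": "GENE_A", "impact": "MODERATE", "variant_class": "SNV", "de_novo": True},
    {"gene": "GENE_B", "impact": "HIGH", "variant_class": "SNV", "de_novo": False},
    {"gene": "GENE_C", "impact": "HIGH", "variant_class": "deletion", "de_novo": True},
]
gene_counts = damaging_de_novo_counts(example_variants)
```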

LINCS L1000 (LINCS)

We introduced gene-small molecule perturbagen relationships to Petagraph based on the LINCS L1000 edge list available on the Harmonizome database (https://maayanlab.cloud/Harmonizome/search?t=all&q=l1000)49,56. These relationships were summarized from the LINCS L1000 Signatures of Differentially Expressed Genes for Small Molecules dataset57. This was done by first finding the corresponding CHEBI Concept nodes for the L1000 small molecules and then relating those nodes to the HGNC nodes according to the edge list mentioned above. The relationships were then collapsed to exclude the cell line, dosage, and treatment time information, but the effect directions were retained in the relationship types. This led to 3,198,094 relationships (noted as negatively or positively correlated) between HGNC and CHEBI nodes. Preprocessing scripts for LINCS relationships are available in GitHub (Table 9).
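The collapsing step described above can be illustrated as follows. This sketch assumes each input row carries a signed effect direction plus the experimental context, and it borrows the relationship names given for CMAP; whether LINCS edges use exactly those names is an assumption, and the identifiers are placeholders.

```python
def collapse_signatures(rows):
    """Drop cell line, dose, and time, keeping one edge per (compound, direction, gene).

    rows: iterable of (chebi_id, gene, direction, cell_line, dose, time),
    where direction is positive for up-regulation and negative for down-regulation.
    """
    edges = set()
    for chebi_id, gene, direction, _cell_line, _dose, _time in rows:
        relation = (
            "positively_correlated_with_gene"
            if direction > 0
            else "negatively_correlated_with_gene"
        )
        edges.add((chebi_id, relation, gene))
    return sorted(edges)


# Placeholder identifiers; three experiments collapse into two edges.
example_rows = [
    ("CHEBI:EXAMPLE1", "GENE_A", 1, "cell_line_1", "10 uM", "6 h"),
    ("CHEBI:EXAMPLE1", "GENE_A", 1, "cell_line_2", "1 uM", "24 h"),
    ("CHEBI:EXAMPLE1", "GENE_B", -1, "cell_line_1", "10 uM", "6 h"),
]
collapsed = collapse_signatures(example_rows)
```

Note that under this scheme a compound-gene pair observed in both directions in different contexts would keep both edges.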

Mouse Genome Informatics (MGI)

IDs for mouse genes were provided by MGI58 and downloaded from the MGI site at the Jackson Laboratory (http://www.informatics.jax.org/downloads/reports/MGI_Geno_DiseaseDO.rpt) on 2023-12-15. No filtering or preprocessing was done to the MGI Codes prior to ingestion. The MGI nodes have 22,682 relationships to their orthologous HGNC nodes. There are also 234,043 gene-to-phenotype relationships between MGI nodes and MP nodes.

STRING (STRING)

Human protein-protein interaction data were downloaded from the STRING website (https://stringdb-downloads.org/download/protein.links.v12.0/9606.protein.links.v12.0.txt.gz). As Petagraph preferentially uses UniProtKB, we converted ENSEMBL protein ID entries to UNIPROTKB and filtered the dataset for the top 10% of the STRING-provided “combined score”. The refined dataset contains 459,701 relationships linking UNIPROTKB nodes, each with a relationship property holding the STRING combined score as evidence of the interaction. The preprocessing script for STRING relationships was deposited in GitHub (Table 9).
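One way to implement a "top 10% by combined score" filter is sketched below. The text does not state whether the cut was made by rank or by a score cutoff, so this example makes an assumption: it finds the score at the 10% rank and keeps every link at or above it, which means ties at the cutoff are retained.

```python
def top_fraction_by_score(links, fraction=0.10):
    """Keep links whose score is at or above the cutoff for the top fraction.

    links: list of (protein_a, protein_b, combined_score) tuples.
    """
    if not links:
        return []
    ranked_scores = sorted((score for _, _, score in links), reverse=True)
    n_keep = max(1, int(len(ranked_scores) * fraction))
    cutoff = ranked_scores[n_keep - 1]
    return [link for link in links if link[2] >= cutoff]


# Twenty placeholder links with scores 100..119; the top 10% is two links.
example_links = [(f"P{i}", f"Q{i}", 100 + i) for i in range(20)]
kept = top_fraction_by_score(example_links)
```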

Knowledge graph construction

Here, we discuss how we built Petagraph from each of its individual components. Petagraph’s ingestion workflow facilitates the integration of large-scale biomolecular and biomedical data sets. The basic steps we took to build Petagraph were: (1) obtain a UMLS license from NIH, (2) build the UBKG from ingestion processing scripts, and (3) ingest the Petagraph CSVs into the UBKG base release. Users interested in building Petagraph from source can follow the same process; alternatively, we supply a database dump, discussed in the Usage Notes section.

Petagraph is built on a Neo4j graph database, version 5, community edition. The scripts for data ingestion were developed as part of the UBKG project and can be found on the UBKG project’s GitHub repository21. The ingestion scripts require any new data ingestion to utilize identifiers for the data sources and Codes supported by UBKG59 with examples such as HGNC symbols and Human Phenotype Ontology IDs. If a Concept identifier is already in the UBKG, the scripts identify the existing Concept and any new Code(s) and Term(s) are attached. In cases where there are no existing Concepts for newly introduced ontologies or data sources, the UBKG ingestion process will automatically generate a new Concept node along with its unique identifier. We have made minor adaptations in the UBKG scripts for loading biomolecular data for Petagraph, and those scripts have been included in the Petagraph GitHub repository60.
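The attach-or-create behaviour described above can be modelled with a small in-memory index. This is a conceptual sketch only, not the UBKG scripts: the class, the use of a cross-reference to locate an existing Concept, and the `NEW:` identifier scheme are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from itertools import count


@dataclass
class ConceptIndex:
    """Toy index mapping Codes to Concept identifiers (CUIs)."""

    code_to_cui: dict[str, str] = field(default_factory=dict)
    cui_to_codes: dict[str, set[str]] = field(default_factory=dict)
    _next_id: count = field(default_factory=lambda: count(1))

    def attach(self, code: str, xref: str | None = None) -> str:
        """Attach code to the Concept of xref if known, else mint a new Concept."""
        if code in self.code_to_cui:
            return self.code_to_cui[code]
        cui = self.code_to_cui.get(xref) if xref is not None else None
        if cui is None:
            # Hypothetical identifier scheme for newly generated Concepts.
            cui = f"NEW:{next(self._next_id):07d}"
        self.code_to_cui[code] = cui
        self.cui_to_codes.setdefault(cui, set()).add(code)
        return cui


index = ConceptIndex()
# Seed the index with one known Concept (the TP53 CUI quoted in this paper).
index.code_to_cui["HGNC:11998"] = "C0079419"
index.cui_to_codes["C0079419"] = {"HGNC:11998"}

existing = index.attach("ENSEMBL:ENSG00000141510", xref="HGNC:11998")
minted = index.attach("MYSAB:0001")  # no cross-reference, so a Concept is created
```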

The Petagraph ingestion pipeline we created contains four major steps (Figure 1). These are: (1) data selection and modeling, (2) cleaning and formatting data, (3) running the appropriate UBKG ETL scripts to convert data into UBKG CSV format, and (4) building Petagraph with the Neo4j bulk import tool. We discuss each of these steps in detail below.

Fig. 1
Petagraph Data Ingestion Workflow. The data ingestion workflow starts in Step 1 with downloading the UBKG base CSVs and the dataset(s) that will be integrated. In Step 2, the raw dataset (or ontology) must be formatted into nodes and edges files according to the guidelines in the UBKG user guide. If the import file is in OWL format, the user can skip to Step 3, which involves running the OWLNETS Python script to convert the edges and nodes files into the UBKG format; the converted files are then appended to the base UBKG files. Lastly, in Step 4, Neo4j’s command-line bulk import tool is used to build the graph database.

Step 1: Data selection and modeling

The selection, sourcing, review, and modeling of data appropriate for the knowledge graph required a comprehensive understanding of the knowledge graph’s existing model and structure. When creating a model that merges disparate types of biomolecular data, there should be an emphasis on modeling relationships between discrete entities in a biologically meaningful way. For example, when modeling relationships between Concepts for genes and Concepts for associated regulatory motifs, biological context, such as the proximity of a regulatory motif to a gene within the genome should be considered. Therefore, instead of just attaching the regulatory motif to a gene, we created the HSCLO38 in order to map genomic elements at any size scale to their location on the chromosomes.

Step 2: Cleaning and formatting of raw data

In the second step, the data were cleaned and formatted. The UBKG generation framework designed by our team converts data from a variety of data sources into a triplet representation based on the OWL NEtwork Transformation for Statistical Learning (OWLNETS) format61. The ingestion process is optimized for working with ontology data provided in Web Ontology Language (OWL) format. OWL, which is based on the Resource Description Framework (RDF), is a set of standards designed for creating formal, structured knowledge representations62.

The UBKG generation framework employs the Phenotype Knowledge Translator (PheKnowLator) Python package when ingesting OWL files63. The PheKnowLator package converts semantic information from an OWL file into a set of files in OWLNETS format. The UBKG generation framework works with OWL files in a variety of serializations. Currently supported serialization syntaxes are Turtle and RDF/XML, which appear to be the norm for ontology files published to repositories such as the Open Biological and Biomedical Ontology (OBO) Foundry64 or the National Center for Biomedical Ontology’s BioPortal29. If the source data is already in the UBKG Edge Node format65 (derived from the OWLNETS format), then the conversion to OWLNETS format is not necessary.

The UBKG generation framework also obtains biomedical reference data from sources other than OWL files, including data downloaded from GENCODE31, HUGO34, UniProtKB30, and RefSeq66, converting this data into files in the UBKG edge and node formats. To ingest genomics data into Petagraph, we must clean and reformat the data. The main quantitative data types selected from genomics datasets are: p-values, log2 fold changes, and gene expression data. Genomics data cleaning generally involves removing missing or irrelevant data points, and formatting the data according to the specific dataset schema that was decided on in Step 1.

The required edges file asserts a set of relationships between entities using triples, each consisting of a subject, a predicate, and an object. The subject and object of an assertion represent the starting node and ending node of a relationship and the predicate represents the relationship. The nodes file describes the nodes that comprise the subjects and objects in the triples of the edge file, providing information about nodes such as Codes for the nodes in a source (e.g., the HGNC ID), preferred terms, synonyms, definitions, and cross-references for the Code to other vocabularies. For data that required reformatting, we put all data into the nodes and edge file formats as described.
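To make the two file types concrete, the sketch below writes a tiny edges file and nodes file as tab-separated text. The column names are assumptions modelled on the description above (subject, predicate, object; node identifier, label, definition, synonyms, cross-references), and `MYSAB` is a hypothetical source abbreviation; the UBKG user guide is the authority on the exact headers.

```python
import csv
import io

# Assumed column names; consult the UBKG user guide for the exact headers.
EDGE_COLUMNS = ["subject", "predicate", "object"]
NODE_COLUMNS = ["node_id", "node_label", "node_definition", "node_synonyms", "node_dbxrefs"]


def to_tsv(rows, columns):
    """Serialize a list of dicts as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # missing keys are written as empty fields
    return buffer.getvalue()


# One triple: a hypothetical dataset entity associated with an HGNC gene Code.
edges = [{"subject": "MYSAB:0001", "predicate": "associated_with", "object": "HGNC:11998"}]
# Only nodes that are new to the graph need to be described here.
nodes = [{"node_id": "MYSAB:0001", "node_label": "example entity"}]

edges_tsv = to_tsv(edges, EDGE_COLUMNS)
nodes_tsv = to_tsv(nodes, NODE_COLUMNS)
```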

Step 3: Convert nodes and edges files into UBKG CSV format

The UBKG generation framework converts the content of a collection of edge and node files into a set of ontology CSV files. The ontology CSV files represent the entities, relationships, and metadata that will be in the UBKG. The ontology CSV files conform to the format recognized by the neo4j-admin import utility67. Importing the ontology CSV file into Neo4j by means of the neo4j-admin import tool generates the entities and relationships of the UBKG.

The UBKG generation framework builds a set of ontology CSV files iteratively, starting with an initial set of CSV files extracted from a release of the UMLS. The framework extends the content from the UMLS with that from another data source by appending to the ontology CSVs data that it extracts from files in UBKG Edge Node format.

The UBKG base build is provided by a set of CSV files that follow the Neo4j bulk import tool header format67. The UBKG import file headers specify any of the five node types that the metadata attributes are assigned to, and which types of nodes the relationships connect. This conversion is done via a set of UBKG ingestion scripts which can be found at the UBKG GitHub site68. After the CSVs from each dataset have been converted from the nodes and edges files into UBKG CSV format, they are appended to the base UBKG CSVs. We now refer to these updated CSVs as the “Petagraph CSVs” since they contain assertions from the 21 datasets added to the UBKG (Table 2).

Step 4: Build Petagraph with the Neo4j bulk import tool

Lastly, once the nodes and edges files for each dataset were converted to UBKG CSV format and appended to the base UBKG CSVs, we used the Neo4j bulk import tool (neo4j-admin database import) to build the graph on the Neo4j Desktop platform using Neo4j v5.

Schema structure

Petagraph’s data records arise from a heterogeneous collection of datasets that each have their own schema representing the underlying dataset structure.

Node types

There are five main data-specific types of nodes within Petagraph: Concept, Code, Term, Definition, and Semantic Type. These node types are discussed in further detail in the UMLS Metathesaurus69. Within Petagraph, the most important node type is the Concept node, as they form the central backbone of the entire graph model. All other node types are essentially metadata attached to Concept nodes.

Concept nodes serve as a nexus for organizing information around a particular conceptual meaning. For example, a Concept node for a particular gene will include references to many different representations of that gene from resources such as HGNC34, Ensembl40, and ENTREZ70. Every Concept node has just one property, the Concept Unique Identifier (CUI). CUIs are alphanumeric identifiers and are not informative outside of their use as unique identifiers. For example, the CUI of the Concept for the H. sapiens TP53 gene is C0079419.

Code nodes identify the reference IDs for a given Code in a particular ontology or standard. Several Code nodes can share associations with the same Concept. It is this one-to-many relationship between Concept and Code nodes that allows for traversal of the graph between data sources: if a Code in one source is linked to the same Concept as a Code in another source, then it may be possible to propose an equivalence between the two Codes. The only relationship between Concept nodes and their respective Code nodes is the CODE relationship. All Code nodes have three properties: (1) the source abbreviation (SAB), (2) the Code from the reference source, and (3) the CodeID which is the SAB and the Code separated by a colon. In Petagraph’s schema, the SAB signifies the identity of the originating dataset or ontology. Examples of three CodeIDs supporting the Concept node for the H. sapiens TP53 gene are HGNC:11998, NCI:C17359, and ENTREZ:7157, representing identifiers from the HGNC, NCI Thesaurus, and NCBI’s ENTREZ databases respectively. Figure 2 shows an example of two concept nodes (blue) connecting a human gene and phenotype. Code nodes are represented for each Concept node (blue) showing the different dataset identifiers for the concepts.
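The relationship between Concepts, Codes, and CodeIDs can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the graph implementation: the Code class and the equivalent_code_ids helper are hypothetical names, and the only data used are the three TP53 CodeIDs given above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Code:
    """A Code node: source abbreviation (SAB) plus the source's own Code."""
    sab: str
    code: str

    @property
    def code_id(self) -> str:
        # The CodeID is the SAB and the Code separated by a colon.
        return f"{self.sab}:{self.code}"


# Concept (keyed by CUI) -> Code nodes attached through CODE relationships.
concept_codes = {
    "C0079419": [Code("HGNC", "11998"),
                 Code("NCI", "C17359"),
                 Code("ENTREZ", "7157")],
}


def equivalent_code_ids(code_id, concept_codes):
    """Return CodeIDs of other Code nodes that share a Concept with code_id."""
    matches = set()
    for codes in concept_codes.values():
        ids = {c.code_id for c in codes}
        if code_id in ids:
            matches |= ids - {code_id}
    return sorted(matches)


print(equivalent_code_ids("HGNC:11998", concept_codes))
```

Starting from the HGNC identifier, the lookup returns the NCI and ENTREZ identifiers attached to the same Concept, which is the traversal between data sources that the shared Concept makes possible.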

Fig. 2

Petagraph dataset schema. (a) General schema design. Schematic summary of the Petagraph schema. Main “Concept” nodes (blue circle) are the bi-directional connected backbone of the Petagraph schema and represent organizing nodes where multiple annotation systems for a single type of entity converge. For example the Concept node for Type 2 diabetes (T2D) may have many different definitions and codings across multiple systems such as ICD10, MONDO, HPO, or SNOMED that can all link to one Concept node for T2D. Each system would have its own Code node connecting to the particular Concept representing the coded identifier from a system. The CUI-CUI connector between Concept nodes represents over 2,000 edge types defining Concept-to-Concept relationships. Code nodes (yellow circle) are the entities that store systematic IDs from different systems that connect to the Concept, and Terms (brown circle) give human-readable definitions for Codes and Concepts. Semantic types (light blue circle) classify the Concept node type while Definition nodes (orange circle) provide a unified definition of the Concept node. Bidirectional relationship links only exist between Concept nodes, simplifying queries and improving query times. (b) Specific Schema example. The Concept node for TP53 (C0079419) with three of its Codes and Terms are shown: HGNC (HGNC:11998), NCI Gene (NCI:C17359) and ENTREZ (ENTREZ:7157). Other Code nodes also connect to C0079419 but are not shown. It can also be seen in this example that the only bidirectional connections are those between Concept nodes, shown here between TP53’s Concept node and one of the HPO Concept nodes. PT: Preferred term. STY: Semantic Type. DEF: Definition.

Term nodes have a name property that provides human-readable annotation about the linked Code node. As an example, the Term node for the HGNC:11998 Code node (representing H. sapiens TP53) has the name property of “tumor protein p53” provided by HGNC. Most Code nodes have a relationship to at least one Term node, usually through a PT relationship type, which stands for preferred term. Code nodes may carry multiple associated Term nodes, such as synonyms, if the original data source provides them. Concept nodes also have relationships to Term nodes: many Concept nodes connect to a “preferred term” Term node through a PREF_TERM relationship, which allows for a quick evaluation of a Concept node’s identity.

There are two additional types of nodes that connect directly to Concept nodes where appropriate: the Definition nodes and the Semantic nodes. Definition nodes are connected through a DEF relationship type to Concepts and have a DEF property where they provide definitions for Concept nodes. (Sources provide definitions for Codes, not Concepts; because the corresponding Code nodes can share links to Concept nodes, a Concept generally has more than one Definition node.) Semantic nodes are connected to Concept nodes by a STY relationship type and have a name property that provides wider semantic type classifications such as Body System, Cell Component, Gene, Mammal, etc., to Concepts.

A data dictionary covering Petagraph’s schema and its node structures can be found at the Petagraph data dictionary website71. From here onward, we will often exclude the word “nodes” when referring to Concept nodes, Code nodes, and Term nodes in this manuscript for ease of reading.

Edge types

Edges define links and relationships. Petagraph has 1,861 distinct edge types. The majority of these edges fall into the category of Concept to Concept edges, which are Petagraph’s only bidirectional edge type. Each edge type comes with its own Source Abbreviation (SAB) property, which is particularly useful for quick inclusion or exclusion of edge-only data sources when querying the graph. Petagraph also extensively uses edge identifiers between Concepts from the Relations Ontology64 whenever possible. The Concept to Concept relationship network serves as the primary traversable component of Petagraph and constitutes the graph’s fundamental structural backbone.

There are five other main non-(Concept to Concept) relationship types in Petagraph that add metadata to the Concept nodes. These relationships are always unidirectional and point away from their respective Concept nodes. These relationships include: the Concept node to Code node relationship (relationship type = CODE), the Concept node to Term node relationship (relationship type = PREF_TERM), the Code node to Term node relationship(s) (relationship types = PT, SYN, ACR which stand for “preferred term”, “synonym” and “acronym”, respectively), the Concept node to Definition node relationship (relationship type = DEF) and the Concept node to Semantic node relationship (relationship type = STY which stands for “Semantic Type”). There are an additional 180 relationship types between Codes and Terms that are used with much lower frequency.

Data modeling

Data modeling was a major consideration in how we introduced quantitative data within the structure supported by the UBKG. The Petagraph schema supports quantitative values as node and edge properties. Under the framework of the UMLS and UBKG, the properties for each node type are well defined: every Concept node has a CUI property; every Code node has SAB, Code and CodeID properties; and every Term node has a name property. However, Code nodes can have an extra quantitative attribute called value. For example, GTEx eQTL Code nodes have a value property corresponding to their p-values. We also wanted to integrate quantitative data within the categorical and conceptual framework of the UMLS schema so that query results can be returned rapidly, even for multidimensional numerical searches. We accomplished this by creating Concept nodes representing interval bins of numerical ranges, for example, for p-values, expression TPM, and log2FC. Nodes with a quantitative value can then be assigned to the appropriate bin node, allowing for rapid selection of numerical values within the graph. As many bioinformatics analyses are performed with results that meet a certain threshold, the numerical bin nodes are especially useful for thresholding data in biologically meaningful queries (e.g. return all data points with p-value < 0.05).
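The binning idea can be illustrated with a short Python sketch. The bin boundaries, the function names, and the example CodeIDs and p-values below are all hypothetical; they are not the bins or data used in Petagraph. Each bin is treated as a half-open interval (lower, upper], so selecting by bin returns values at or below a threshold that coincides with a bin boundary.

```python
import bisect

# Hypothetical upper bounds of p-value bins. Each bin is (lower, upper],
# except the first, which also includes 0.
BIN_UPPER_BOUNDS = [1e-8, 1e-5, 1e-3, 0.01, 0.05, 1.0]


def assign_bin(p_value):
    """Return the (lower, upper) bounds of the bin that contains p_value."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p-value must lie in [0, 1]")
    index = bisect.bisect_left(BIN_UPPER_BOUNDS, p_value)
    lower = 0.0 if index == 0 else BIN_UPPER_BOUNDS[index - 1]
    return (lower, BIN_UPPER_BOUNDS[index])


def build_bin_index(values_by_code):
    """Group CodeIDs under the bin that holds their quantitative value."""
    index = {}
    for code_id, value in values_by_code.items():
        index.setdefault(assign_bin(value), []).append(code_id)
    return index


def codes_at_or_below(bin_index, threshold):
    """Select CodeIDs by bin membership alone, without comparing raw values."""
    selected = []
    for (lower, upper), code_ids in bin_index.items():
        if upper <= threshold:
            selected.extend(code_ids)
    return sorted(selected)


# Placeholder CodeIDs and p-values, not real GTEx results.
example_values = {
    "EXAMPLE:eqtl_1": 3e-9,
    "EXAMPLE:eqtl_2": 0.03,
    "EXAMPLE:eqtl_3": 0.4,
}
bin_index = build_bin_index(example_values)
print(codes_at_or_below(bin_index, 0.05))
```

Because the threshold test touches only the small set of bins rather than every stored value, the same pattern in a graph reduces a numerical filter to a traversal from a handful of bin nodes.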

In order for Petagraph to more fully support the use of experimental omics data, we also included dozens of relationship-only (“mapping”) datasets that may come from observational data themselves, many of which map to and within genomic and phenotypic databases for human and mouse models. Sixteen of these mapping datasets (and the datasets they map to) are visualized in Figure 3. The interconnectedness of these datasets enables many different types of queries and their contributions are represented in the heatmap intensities. For example, LINCS and CMAP are represented by the CHEBI to HGNC connections and STRING is represented by the UNIPROTKB to UNIPROTKB connections. HGNC is clearly a hub dataset, as human gene names connect across many different omics and annotation datasets. Of the sixteen datasets shown in Figure 3, nine have relationships to three or more others. Several datasets were already part of the UBKG (HGNC, HPO, UBERON) and are displayed in Figure 3 because of their high number of connections in Petagraph. Note that many of these datasets have connections to other datasets not shown in Figure 3; for example, the developmental heart scRNA-seq dataset also shares edges with the Cell Ontology (CL)72.

Fig. 3

Dataset connectivity. Querying across datasets in Petagraph requires connecting relationships. A hallmark feature of Petagraph is the rich set of ontological and biomedical mapping data sources that connects phenotypic and genomic data together across human and mouse models. This figure only captures a select set of datasets and relationships in Petagraph. The log10 relationship count between Petagraph-specific datasets reveals which of these datasets have direct Concept to Concept relationships through one-hop relationships; however, many of these datasets are linked together through one or more intermediary datasets. For example, HGNC establishes a connection to MP through the HPO dataset (HGNC to HPO to MP).

Most Concept nodes in Petagraph are classified by Semantic Type. The UBKG generation framework does not explicitly assign Semantic Types to Concepts outside of those brought in as part of the UMLS context. Currently, there are 127 Semantic Types inherited from the UMLS within Petagraph attached to their member dataset Concept nodes through STY relationship types. We summarize the categories of Semantic Types in Petagraph in Figure 4a, and depict the connectivity of 10 major Semantic Types in the graph in Figure 4b. As shown in this figure, all these major Semantic Types have relationships to their own types (e.g. gene-gene or phenotype-phenotype relationships). In this plot, the linkages represent direct connectivity between Semantic Types, regardless of any hierarchical relationship between STYs.

Fig. 4

Semantic Types in Petagraph (a) Major Semantic Types and their counts in Petagraph grouped by class. (b) Example of interconnections between selected Semantic Type Concept nodes in Petagraph where circle size corresponds to node degree. The relationships are considered for the immediate Semantic Types attached to the graph concept nodes, therefore upstream STY parent nodes were disregarded.

To understand how the Concept-Concept relationship data connect Semantic Types, we chose 55 Semantic Types and extracted the relationships connecting them through the Concept nodes directly connected to them (that is, no hierarchical connection between Semantic Types was considered). The selected STYs relate to anatomy, phenotypes, diseases, chemical species, and biological entities, cells, or metabolic pathways, and we mapped both the types and the number of relationships among them (Figures 5–7). The presence of relationships (Figure 5), the cumulative number of relationship types (Figure 6), and the relationship counts (Figure 7) were plotted using the package pheatmap v1.0.1273 in RStudio v1.4.110674 with R v4.0.475. For Figures 6 and 7, the frequency of relationships is shown between pairwise combinations of Semantic Types, connected through their Concept nodes. Therefore, Concept to Concept relationships where one or both Concept nodes are not connected to Semantic Types are excluded. To increase the dynamic range, the base-10 logarithm was used, and to avoid log10(0), we added 1 to all values.
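The three matrices described above (presence, number of distinct relationship types, and relationship counts, the latter two on a log10(x + 1) scale) can be sketched as follows. The records, relationship type names, and SABs are invented placeholders, and each relationship is stored in one direction only for brevity; the plotting step with pheatmap is not reproduced here.

```python
import math
from collections import defaultdict

# Placeholder records:
# (source Semantic Type, relationship type, SAB, target Semantic Type).
# Relationship types, SABs, and multiplicities are invented for illustration.
records = [
    ("Congenital Abnormality", "rel_A", "SAB1", "Physiologic Function"),
    ("Congenital Abnormality", "rel_A", "SAB1", "Physiologic Function"),
    ("Congenital Abnormality", "rel_B", "SAB2", "Physiologic Function"),
    ("Clinical Drug", "rel_C", "SAB3", "Physiologic Function"),
]
semantic_types = ["Congenital Abnormality", "Clinical Drug", "Physiologic Function"]


def log_scaled(count):
    """log10 with a +1 offset so that a count of zero maps to zero."""
    return math.log10(count + 1)


counts = defaultdict(int)    # relationships per (source, target) pair
distinct = defaultdict(set)  # distinct (type, SAB) combinations per pair
for source, rel_type, sab, target in records:
    counts[(source, target)] += 1
    distinct[(source, target)].add((rel_type, sab))


def matrix(value_for_pair):
    """Build a row-by-column matrix over the selected Semantic Types."""
    return [[value_for_pair(row, col) for col in semantic_types]
            for row in semantic_types]


presence = matrix(lambda r, c: 1 if counts.get((r, c), 0) > 0 else 0)
type_diversity = matrix(lambda r, c: log_scaled(len(distinct.get((r, c), ()))))
relationship_counts = matrix(lambda r, c: log_scaled(counts.get((r, c), 0)))

for name, row in zip(semantic_types, relationship_counts):
    print(name, [round(value, 3) for value in row])
```

The +1 offset keeps empty cells at exactly zero, so absent relationships remain distinguishable from pairs connected by a single relationship.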

Fig. 5

Heatmap representation of the presence or absence of a direct relationship between the 55 selected Semantic Types. This figure depicts the presence (red) or absence (blue) of at least one relationship (type + SAB) between pairwise combinations of Semantic Types through their respective Concept nodes. Note that the matrix is diagonally symmetrical because Concept to Concept relationships are bidirectional. Relationships are considered only for the immediate Semantic Types attached to the graph Concept nodes; upstream STY parent nodes were disregarded.

Fig. 6

Heatmap representation of relationship type statistics of 55 selected Semantic Types. Data represent the direct Semantic Types on both sides of Concept to Concept node connections. Here, the colors represent relationship diversity, shown as the log10 of the number of distinct relationships (type + SAB) plus 1, connecting the candidate Semantic Types through their respective Concept nodes.

Fig. 7

Heatmap representation of relationship count statistics of 55 selected semantic types. The colors represent the log10 number of outgoing relationships (+1) (from row to column) connecting the candidate Semantic Types through their respective Concept nodes.

Figure 5 illustrates the presence or absence of at least one relationship type between pairs of Concept nodes belonging to different Semantic Types. This figure shows how Semantic Type pairs that lack direct graph-wide relationships can be linked through other relationships. For example, Concept nodes categorized under the Semantic Types Congenital Abnormality and Clinical Drug do not share a direct relationship within the graph. However, both of these Semantic Types do have relationships with the Physiologic Function Semantic Type, which allows for a connection. Queries can therefore utilize the intermediary to link the Congenital Abnormality Semantic Type and the Clinical Drug Semantic Type. Similarly, we can bridge all data in Semantic Types through intermediary links. To further analyze these relationships, we quantified the number of relationship types (log10-transformed) per pair of the 55 Semantic Types, considering distinctions based on the start and end nodes, relationship type, and Source Abbreviation (SAB). Figure 7 shows the distribution of relationships among each pair of the 55 Semantic Types. Semantic Types positioned along the diagonal exhibit a higher likelihood of being connected to a greater number of links, as indicated by the log10-transformed relationship counts. Collectively, the information presented in Figures 5–7 enables us to draw conclusions regarding which Semantic Types within the graph possess the majority of relationships. Furthermore, it sheds light on the potential for extracting valuable information from these relationships, thereby unveiling insights that may not be directly accessible through other means.

Finally, we were interested in useful examples of pairwise shortest path lengths, and chose to map gene-to-gene relationships using HGNC Concept nodes on the Concept-Concept subgraph. Shortest path length and connectivity analyses were performed to measure the expected relationships between terms that are known to be related from orthogonal sources. This analysis shows how the HGNC dataset is connected to other datasets by secondary relationships within the graph, which is an important consideration in designing queries looking for gene-to-gene relationships. The distribution of the shortest path lengths between the graph’s gene-to-gene (HGNC-HGNC) Concept nodes was estimated using a sample of 1 million pairs of such nodes. Figure 8 shows the probability distribution of shortest path lengths between Concept nodes, which range from 1 to 11. The majority of HGNC Concept nodes are connected to each other through 4 or fewer hops, with the peak at 3. This shows that in the case of gene-to-gene relationships, most genes are not directly connected through a single intermediate resource, which allows for more informative results with queries spanning more intermediate relationships.
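The sampling procedure can be sketched on a toy graph. The following Python example uses breadth-first search to find shortest path lengths between randomly sampled pairs of nodes and normalizes the resulting counts into a probability distribution. The graph, node names, sample size, and function names are hypothetical; the sketch does not reproduce the tooling used for the analysis above.

```python
import random
from collections import Counter, deque


def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) pairs."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def shortest_path_length(adjacency, source, target):
    """Breadth-first search; returns the hop count, or None if unreachable."""
    if source == target:
        return 0
    visited = {source}
    queue = deque([(source, 0)])
    while queue:
        node, distance = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor == target:
                return distance + 1
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, distance + 1))
    return None


def sampled_length_distribution(adjacency, nodes, n_pairs, seed=0):
    """Estimate the shortest-path-length distribution from random node pairs."""
    rng = random.Random(seed)
    lengths = Counter()
    for _ in range(n_pairs):
        source, target = rng.sample(nodes, 2)
        length = shortest_path_length(adjacency, source, target)
        if length is not None:  # skip pairs with no connecting path
            lengths[length] += 1
    total = sum(lengths.values())
    return {length: count / total for length, count in sorted(lengths.items())}


# Toy graph: four gene Concepts (G*) and one intermediate Concept (P1).
toy_edges = [("G1", "G2"), ("G2", "P1"), ("P1", "G3"), ("G3", "G4")]
toy_adjacency = build_adjacency(toy_edges)
gene_nodes = ["G1", "G2", "G3", "G4"]

distribution = sampled_length_distribution(toy_adjacency, gene_nodes, n_pairs=200)
print(distribution)
```

Sampling pairs rather than enumerating all of them keeps the cost proportional to the sample size, which matters when the number of possible gene pairs is in the hundreds of millions.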

Fig. 8

Shortest Path Lengths in Petagraph. Probability density of associated Concept to Concept node shortest path lengths for 1 million pairwise combinations of human gene Concept nodes in Petagraph.

Data Records

Installing Petagraph from a Neo4j dump file

Users with a UMLS license can follow site instructions to obtain and utilize a UMLS license key at https://ubkg-downloads.xconsortia.org/ to download the most recent Petagraph dump file (currently 4.5 GB). The dump file can be used with Neo4j Desktop to build Petagraph quickly and easily. Detailed instructions on building the database with the dump file can be found in the Petagraph GitHub README60, or users can follow the vendor’s standard procedure for loading Neo4j dump files.

Petagraph installation from source

Building Petagraph from the source files can take several hours but allows for build customization. The instructions can also be used for ingesting new data into any Petagraph build. The code and instructions to recreate Petagraph from source data are on the project’s GitHub site60. The process for building from source consists of two stages: establishing the source framework and then the generation framework. In the source framework stage, a user downloads the UMLS Metathesaurus and Semantic Network files from the UMLS website76 and then runs a set of SQL queries to extract data to build the UMLS base CSVs. In the generation framework stage, the ingestion pipeline adds additional ontologies and datasets to the UMLS base. The same scripts are then run to add additional datasets onto the UBKG CSV files, creating Petagraph. We provide the source files for Petagraph at https://ubkg-downloads.xconsortia.org/ and on the project’s Open Science Framework site (https://osf.io/6jtc9/)77.

Technical Analysis and Validation

UBKG Generation framework ingestion validation

The heterogeneous and often custom nature of data sources means that the validation of ingestion from a particular data source involves analysis that is not obviously amenable to automation. The generation framework generates basic analytical reports during ingestion to aid manual validation; however, manual validation by a subject matter expert is still necessary.

Basic content and consistency requirements

A source for ingestion into UBKG represents a set of assertions. Each assertion involves two nodes (entities) and an edge (relationship). In the UBKG Edge Node format,

  • The edge file represents assertions as triplets in which both nodes and edges are represented with Codes.

  • The node file decodes nodes with metadata such as terms and definitions.

To be represented properly in the UBKG, the edge and node files must satisfy basic requirements for content and internal consistency. These requirements include:

  • Every node in the edge file must either be described explicitly in the node file or already exist in the UBKG.

  • If a node in the node file has a cross-reference to another Code, the cross-referenced Code should already exist in the UBKG.

Ingestion validation

Most data quality problems in the assertion files arise from missing node references, such as:

  • An assertion includes a Code for a node that is neither defined in the node file nor previously ingested into the UBKG.

  • A node has a cross-reference to a Code that is not in the UBKG.
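The two checks listed above can be expressed as a short Python sketch. The function name, the in-memory data structures, and the placeholder CodeIDs are assumptions made for illustration; this is not the generation framework's own validation code.

```python
def validate_assertions(edges, nodes, existing_codes):
    """Return a list of problems; an empty list means both checks pass.

    edges: iterable of (subject, predicate, object) CodeID triples.
    nodes: dict mapping each node's CodeID to a list of cross-referenced CodeIDs.
    existing_codes: set of CodeIDs already present in the graph.
    """
    problems = []
    # Check 1: every node in an assertion is defined in the node file
    # or already exists in the graph.
    for subject, predicate, obj in edges:
        for code_id in (subject, obj):
            if code_id not in nodes and code_id not in existing_codes:
                problems.append(
                    f"undefined node {code_id} in assertion "
                    f"({subject}, {predicate}, {obj})"
                )
    # Check 2: every cross-reference points to a Code already in the graph.
    for node_id, cross_references in nodes.items():
        for xref in cross_references:
            if xref not in existing_codes:
                problems.append(
                    f"node {node_id} cross-references unknown Code {xref}"
                )
    return problems


# Placeholder data: EXAMPLE:* and OTHER:* identifiers are invented.
existing = {"HGNC:11998", "ENTREZ:7157"}
node_file = {
    "EXAMPLE:0001": ["ENTREZ:7157"],
    "EXAMPLE:0002": ["OTHER:9"],
}
edge_file = [
    ("EXAMPLE:0001", "has_relationship_X_with", "HGNC:11998"),
    ("EXAMPLE:0002", "has_relationship_X_with", "EXAMPLE:0003"),
]

for problem in validate_assertions(edge_file, node_file, existing):
    print(problem)
```

Running both checks before import surfaces missing node references while they can still be corrected in the source files.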

Data ingestion validation

After a set of assertions from a source is ingested, the resulting ontology CSVs are imported into a Neo4j instance. Using the Neo4j browser, nodes and relationships from the source are compared in the UBKG with the corresponding edge and node files, using tools such as shortest path queries. For example, if the edge file asserts that SAB1:CodeA has_relationship_X_with SAB2:CodeB, then there should be a shortest path that connects the Code nodes SAB1:CodeA and SAB2:CodeB with a relationship of type has_relationship_X_with. Additional queries validate information from the node file – e.g., that the Code node in the UBKG that corresponds to the node in the node file has the expected terms, synonyms, definitions, and cross-references to Codes from other sources.
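A check of this kind could be written as a Cypher query built from Python, as sketched below. The sketch assumes node labels Concept and Code, a CodeID property, and CODE relationships pointing from Concept to Code, as described in the schema section; it matches the direct Concept-to-Concept pattern rather than running a general shortest-path search. The function name is hypothetical, and the query is only constructed here, not executed against a database.

```python
import re

# Relationship types cannot be passed as Cypher parameters, so the name is
# validated before being interpolated into the query text.
_RELATIONSHIP_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_assertion_check(subject_code_id, relationship_type, object_code_id):
    """Return (query, parameters) testing whether an asserted edge exists."""
    if not _RELATIONSHIP_NAME.match(relationship_type):
        raise ValueError(f"unexpected relationship type: {relationship_type!r}")
    query = (
        "MATCH (s:Code {CodeID: $subject})<-[:CODE]-(:Concept)"
        f"-[r:`{relationship_type}`]->"
        "(:Concept)-[:CODE]->(o:Code {CodeID: $object}) "
        "RETURN count(r) > 0 AS asserted"
    )
    parameters = {"subject": subject_code_id, "object": object_code_id}
    return query, parameters


query, parameters = build_assertion_check(
    "SAB1:CodeA", "has_relationship_X_with", "SAB2:CodeB"
)
print(query)
print(parameters)
```

Passing the CodeIDs as parameters rather than embedding them in the query text keeps one query string reusable for every row of the edge file.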

For ingestion of genomics and related datasets into the UBKG scaffold, we wrote a continuous integration (CI) workflow using GitHub Actions to ensure that the final Petagraph product contains the correct schema as well as the correct types and counts of nodes and edges. Our GitHub Actions CI downloads the latest stable release of Neo4j, downloads the latest version of the Petagraph CSVs from the Open Science Framework website, and then builds Petagraph. We then run tests comparing the counts of node and edge types in the Petagraph CSVs with those in the built graph to confirm that the graph has been built as expected.

For reviewing the integrity and accuracy of ingested data, the UBKG framework generates a summary report. This report helps identify potential issues and opportunities for alignment before finalizing the ingestion process (Table 1).

Table 1 Summary of checks and analyses performed during the ingestion of new data into UBKG: information assessed, their purpose, and comments on potential actions.
Table 2 Ontologies added to the existing UMLS collection by UBKG.
Table 3 Genomic data mappings added to the UBKG to create Petagraph.
Table 4 Quantitative genomics data added to the UBKG to create Petagraph. Those entries with 0% total Concept nodes are mapping (relationship-only) datasets.

FAIR validation

When building Petagraph we adhered strictly to findability, accessibility, interoperability and reusability (FAIR) data principles (https://www.nature.com/articles/sdata201618). Petagraph is findable by way of GitHub, the Open Science Framework (https://osf.io/search)77 and any major search engine. A data dictionary, available on the project GitHub, provides metadata describing the schema of every individual dataset that has been ingested. The data dictionary also contains detailed descriptions of the preprocessing and formatting that were applied to the data prior to ingestion. Descriptions of the ingested datasets and ontologies are also present in the Methods section of this paper. Petagraph is accessible by being freely downloadable and small enough (4.5 GB) that a dump of the entire database can easily fit onto a user’s personal computer. Whether building from source or from the database dump file, our build process is OS-agnostic. Petagraph and the UBKG are built on top of the UMLS, so they are inherently interoperable. The UMLS alone harmonizes hundreds of biomedical vocabularies and ontologies so that users can easily query across standards.

Adding new data to Petagraph is straightforward. Ontologies or datasets that are in OWL, RDF/XML, OBO or Turtle format can be incorporated into the graph automatically using our ingestion workflow, outlined in detail in the Methods section. Datasets that are not in one of these formats can be added to the graph after some simple preprocessing steps. Lastly, all data sources and Petagraph releases are versioned and the entire process of downloading the dump file and building the Neo4j database locally can easily be automated using the OSF and Neo4j APIs which helps ensure reproducibility and consistency for Petagraph users.

Structural validation

Validation through link prediction

As a link prediction-based validation, we computed the Common Neighbors scores for about 500,000 pairs of directly connected genes and compared the results to the distribution of Common Neighbors scores for randomly selected pairs of genes. As shown in Figure 9, the Common Neighbors scores of gene pairs with a direct connection in Petagraph are approximately three orders of magnitude greater than those of randomly selected gene pairs. This analysis has two implications. First, it suggests that the orthogonal datasets ingested in Petagraph could effectively be used for link prediction. Second, the links in the graph can be cross-validated using information from independent datasets among the graph’s data sources.
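The comparison can be illustrated on a toy graph. In the Python sketch below, the Common Neighbors score of a pair is the number of nodes adjacent to both members, and the mean score of directly linked gene pairs is compared with that of randomly drawn gene pairs. The graph, node names, and sample size are hypothetical and far smaller than the analysis described above.

```python
import random


def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) pairs."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def common_neighbors(adjacency, u, v):
    """Number of nodes adjacent to both u and v."""
    return len(adjacency.get(u, set()) & adjacency.get(v, set()))


# Toy graph: G1 and G2 are directly linked and share neighbors X1 and X2;
# G3 and G4 are not linked and share no neighbors.
toy_edges = [
    ("G1", "G2"), ("G1", "X1"), ("G2", "X1"), ("G1", "X2"), ("G2", "X2"),
    ("G3", "X3"), ("G4", "X4"),
]
adjacency = build_adjacency(toy_edges)
genes = ["G1", "G2", "G3", "G4"]

# Gene pairs joined by a direct edge.
linked_pairs = [(u, v) for i, u in enumerate(genes)
                for v in genes[i + 1:] if v in adjacency[u]]

# Gene pairs drawn at random, regardless of direct connectivity.
rng = random.Random(0)
random_pairs = [tuple(rng.sample(genes, 2)) for _ in range(20)]


def mean_score(pairs):
    """Average Common Neighbors score over a list of node pairs."""
    return sum(common_neighbors(adjacency, u, v) for u, v in pairs) / len(pairs)


print("linked:", mean_score(linked_pairs))
print("random:", mean_score(random_pairs))
```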

Fig. 9

Common Neighbors Scores in Petagraph. Here we show a comparison between the Common Neighbors scores for genes connected with direct links in Petagraph (blue) versus random selections from all genes regardless of their direct connectivity. The log10 Common Neighbors count distribution on the x-axis indicates a three-order-of-magnitude shift towards higher Common Neighbors counts where the genes are directly connected. These measures serve as additional evidence of how orthogonal datasets ingested in Petagraph can be utilized to predict links between entities of interest.

Validation using analysis of local structures (transitivity and triangle counts)

We analyzed the nodes of the Petagraph Concept-to-Concept subgraph in terms of transitivity (the probability that the nodes adjacent to a given node are connected to each other) and triangle counts (the number of triangles, or 3-cliques, that each node is part of). We compared the distributions of these measures of local connectivity in Petagraph with those of a randomized graph created with the same number of nodes, relationships, and node degrees but with random connections between nodes (Figure 10). As the distributions of transitivity (ranging between 0 and 1) and triangle counts show, the Petagraph Concept-to-Concept subgraph is shifted towards higher values of both measures. These shifts indicate that the connectivity of Petagraph Concept nodes is less random than that of the randomized graph, which is consistent with informational consistency in the data ingested in Petagraph.
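Both measures, and a degree-preserving randomization to compare against, can be sketched in plain Python. The toy graph and function names below are hypothetical, and the double-edge swap shown is one common way to randomize connections while keeping every node's degree fixed; it is an assumption, not necessarily the procedure used for Figure 10.

```python
import random
from collections import Counter
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) pairs."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def triangle_count(adjacency, node):
    """Number of triangles that include the given node."""
    return sum(1 for a, b in combinations(adjacency[node], 2)
               if b in adjacency[a])


def local_transitivity(adjacency, node):
    """Fraction of the node's neighbor pairs that are themselves connected."""
    k = len(adjacency[node])
    if k < 2:
        return 0.0
    return triangle_count(adjacency, node) / (k * (k - 1) / 2)


def degree_counts(edges):
    """Degree of every node in an edge list."""
    counts = Counter()
    for u, v in edges:
        counts[u] += 1
        counts[v] += 1
    return counts


def degree_preserving_shuffle(edges, n_swaps, seed=0):
    """Randomize connections with double-edge swaps: (a,b),(c,d) -> (a,d),(c,b)."""
    rng = random.Random(seed)
    edge_list = [tuple(edge) for edge in edges]
    edge_set = {frozenset(edge) for edge in edge_list}
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if len({a, b, c, d}) < 4:
            continue  # a swap here would create a self-loop
        new_1, new_2 = frozenset((a, d)), frozenset((c, b))
        if new_1 in edge_set or new_2 in edge_set:
            continue  # a swap here would create a duplicate edge
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {new_1, new_2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
    return edge_list


# Toy graph: a triangle A-B-C with a tail C-D-E-F.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("D", "E"), ("E", "F")]
adjacency = build_adjacency(edges)
shuffled = degree_preserving_shuffle(edges, n_swaps=100, seed=1)
shuffled_adjacency = build_adjacency(shuffled)

for node in sorted(adjacency):
    print(node,
          triangle_count(adjacency, node),
          round(local_transitivity(adjacency, node), 3),
          triangle_count(shuffled_adjacency, node))
```

Because the shuffle keeps each node's degree unchanged, any difference in triangle counts or transitivity between the original and shuffled graphs reflects how the connections are arranged rather than how many there are.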

Fig. 10

Validation using local structure. Here we show a comparison, in terms of Concept node transitivity and the number of triangles each node belongs to, between Petagraph and a random graph with the same number of relationships and node degrees as Petagraph but randomized Concept-to-Concept connections. The comparison indicates a shift towards lower means for both measures as a result of randomization.

Validation through low dimensional visualization of high dimensional graph embeddings

We visualized the UMAP components of a 100-dimensional embedding of a Petagraph subgraph consisting of ~400,000 nodes and ~12,000,000 (bidirectional) relationships. The subgraph was derived from a query that extracts all nodes one or two hops away from 12 abnormal heart morphology and 2 blood cancer phenotypes (Methods), together with all their interconnections. It included Concept nodes from all 127 Semantic Types, including ~10,000 genes and ~10,000 phenotypes. Figure 11 visualizes the results for a selection of Concept nodes from major Semantic Types including Gene or Genome, Disease or Syndrome, Neoplastic Process, Clinical Drug, Cells and Molecular Processes. In this figure, UMAP low-dimensional visualizations illustrate the spatial distribution of major Semantic Type-related Concept nodes. Nodes belonging to the same STYs are shown to cluster together, indicating their relative proximity and suggesting a higher likelihood of connections within the graph. Conversely, nodes from different STY groups are distinctly separated, although nodes that are adjacent in the graph may still appear close to each other in the embedding space, indicating potential linkages.

Fig. 11

Validation using embeddings and dimensionality reduction. UMAP representations of 100-dimensional Node2Vec embeddings of subsets of Concept nodes in an example subgraph, created around 12 abnormal heart morphology phenotypes and 2 blood cancer phenotypes by including Petagraph nodes 1 or 2 hops away from these source nodes. (a) Gene or Genome Concept nodes vs. Disease or Syndrome. (b) Neoplastic Processes Concept nodes vs. Clinical Drugs. (c) Perturbagen Concept nodes vs. Genes or Genomes and (d) Organ Component vs. Cell vs. Molecular Function Concept nodes.

Assessing major dataset contributions to link prediction

We analyzed the influence of major datasets on link prediction by using two graph-wide connectivity metrics: transitivity and assortativity (Figure 12). Transitivity measures the likelihood of new connections based on existing local patterns, such as the probability that two nodes sharing a neighbor will be directly connected. This is linked to the formation of triangles in the graph, indicated by metrics like the clustering coefficient and triangle count. Utilizing transitivity helps predict potential links based on the current graph structure. Assortativity evaluates the tendency of nodes with similar attributes or degrees to connect, offering insights into Petagraph's structural patterns. Including assortativity in predictive models can improve accuracy in forecasting future link formations. Together, these metrics provide insights into graph connectivity dynamics and link prediction capabilities.
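For concreteness, the two graph-wide metrics can be computed from their standard definitions as sketched below: global transitivity as the fraction of connected triples that are closed, and degree assortativity as the Pearson correlation between the degrees at the two ends of each edge. The toy graphs and function names are hypothetical; the sketch is written in plain Python for illustration and does not reproduce the software used for Figure 12. The negative reciprocal helper mirrors the rescaling used when plotting negative assortativity values.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) pairs."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def global_transitivity(adjacency):
    """Closed connected triples divided by all connected triples."""
    closed = 0
    triples = 0
    for neighbors in adjacency.values():
        k = len(neighbors)
        triples += k * (k - 1) // 2
        closed += sum(1 for a, b in combinations(neighbors, 2)
                      if b in adjacency[a])
    return closed / triples if triples else 0.0


def degree_assortativity(adjacency):
    """Pearson correlation of the degrees at the two ends of each edge."""
    xs, ys = [], []
    for u, neighbors in adjacency.items():
        for v in neighbors:  # each undirected edge is visited in both directions
            xs.append(len(adjacency[u]))
            ys.append(len(adjacency[v]))
    n = len(xs)
    if n == 0:
        return float("nan")
    mean = sum(xs) / n  # xs and ys contain the same values, so one mean suffices
    covariance = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    variance = sum((x - mean) ** 2 for x in xs) / n
    if variance == 0:
        return float("nan")  # undefined when all nodes have the same degree
    return covariance / variance


def negative_reciprocal(assortativity):
    """Rescale a negative assortativity a to -1/a."""
    if assortativity >= 0:
        raise ValueError("expected a negative assortativity coefficient")
    return -1.0 / assortativity


star = build_adjacency([("hub", "a"), ("hub", "b"), ("hub", "c")])
path = build_adjacency([("a", "b"), ("b", "c"), ("c", "d")])
paw = build_adjacency([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])

for name, graph in (("star", star), ("path", path), ("paw", paw)):
    a = degree_assortativity(graph)
    print(name, round(global_transitivity(graph), 3), round(a, 3),
          round(negative_reciprocal(a), 3))
```

A star graph, in which one high-degree hub connects only to degree-one leaves, is the extreme disassortative case, whereas graphs whose edges join nodes of similar degree move the coefficient toward zero or above.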

Fig. 12

Contribution of major datasets to link prediction. Impact of dataset removal on global transitivity and degree assortativity in subgraphs around human genes (1 hop) in Petagraph. Major datasets were removed one at a time to assess their contributions to graph connectivity, and the resulting subgraphs were each evaluated for transitivity and assortativity coefficients using the R package igraph82. As Petagraph is a sparse graph, assortativity is globally negative, in the [−1, 0) range. Therefore, assortativity is graphed as its negative reciprocal, calculated as (−1/assortativity), where less assortative subgraphs have lower values. Dataset descriptions are found under the SAB column in Tables 3 and 4.

To assess the impact of individual datasets on global transitivity and degree assortativity within subgraphs around human genes in Petagraph, we systematically removed our largest gene-related datasets from the graph, one at a time. This approach allows us to measure each dataset’s contribution to overall connectivity and structural features. Figure 12a shows that excluding the GTEXCOEXP dataset significantly reduces global transitivity, indicating decreased local clustering. Figure 12b reveals that removing most major datasets decreases degree assortativity, with the exception of the GTEXEQTL dataset. Removing GTEXEQTL increases degree assortativity because each GTEXEQTL node connects terminally to a gene target node and therefore provides no links between genes.
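The leave-one-dataset-out procedure can be sketched as follows. This is an illustrative Python outline under stated assumptions, not the igraph code used for Figure 12: edges are assumed to be available as (source, target, SAB) triples, the node names are hypothetical, and the toy edges are invented so that co-expression edges form a triangle while eQTL nodes attach terminally to genes.

```python
from itertools import combinations


def global_transitivity(edges):
    """Closed triples divided by all connected triples in an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    closed = triples = 0
    for nbrs in adj.values():
        k = len(nbrs)
        triples += k * (k - 1) // 2
        closed += sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return closed / triples if triples else 0.0


def leave_one_dataset_out(tagged_edges, metric):
    """Evaluate `metric` on the full graph and after removing each dataset (SAB) in turn."""
    datasets = sorted({sab for _, _, sab in tagged_edges})
    results = {"ALL": metric([(u, v) for u, v, _ in tagged_edges])}
    for removed in datasets:
        kept = [(u, v) for u, v, sab in tagged_edges if sab != removed]
        results[removed] = metric(kept)
    return results


# Toy data (hypothetical): three co-expressed genes and two terminal eQTL nodes.
TOY_EDGES = [
    ("GENE_A", "GENE_B", "GTEXCOEXP"),
    ("GENE_B", "GENE_C", "GTEXCOEXP"),
    ("GENE_A", "GENE_C", "GTEXCOEXP"),
    ("EQTL_1", "GENE_A", "GTEXEQTL"),
    ("EQTL_2", "GENE_B", "GTEXEQTL"),
]

RESULTS = leave_one_dataset_out(TOY_EDGES, global_transitivity)
```

On this toy graph, removing the co-expression edges eliminates every triangle, whereas removing the terminal eQTL edges leaves only the fully clustered gene triangle. Any other graph-wide metric, such as degree assortativity, could be passed as `metric` in the same way.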

These findings underscore the specific contributions of each dataset to the graph’s structural properties and functional relationships among human genes.

Use case validation

In this section, we validate Petagraph’s relevance to biomedical research by evaluating its capacity to return biologically relevant information in three use cases. The use case results are further validated against orthogonal information, such as literature review.

Use Case 1: Validation by re-predicting relationships between congenital heart defects and genes

Figure 13 illustrates our evaluation of topological link prediction methods in identifying associations between the Tetralogy of Fallot (ToF) phenotype (HP:0001636) and genes cataloged by HGNC (43,001 nodes). We employed four methods for this analysis: preferential attachment, total neighbors, common neighbors, and Jaccard’s Index. By analyzing the Receiver Operating Characteristic (ROC) and Precision-Recall Curve (PRC) of the binary classifiers against the graph’s existing gene-phenotype links, we show that the graph’s structure, specifically its node-to-node relationships, supports these topological link prediction methods, which achieve ROC AUCs greater than 0.9 (Figure 13a,b). Notably, the common neighbors method and Jaccard’s Index achieved exceptional performance, with ROC and PRC AUCs approaching 1, indicating high reliability for predicting links across the graph. We also conducted a literature review to validate the top 10 gene-phenotype relationships identified by the common neighbors method, using this external information as a form of orthogonal validation not originally incorporated into the graph. The supporting evidence from this review is detailed in Table 5.
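The four topological scores, and the ROC AUC used to compare them against existing links, can be sketched in a few lines of Python. This is an illustrative outline rather than the implementation behind Figure 13: the function names, the definition of total neighbors as the size of the neighbor-set union, and the toy phenotype-gene graph are all assumptions made for this example.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def link_scores(adj, u, v):
    """Return the four topological link prediction scores for the node pair (u, v)."""
    nu, nv = adj.get(u, set()), adj.get(v, set())
    common = nu & nv
    union = nu | nv
    return {
        "preferential_attachment": len(nu) * len(nv),
        "total_neighbors": len(union),
        "common_neighbors": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
    }


def roc_auc(scores, labels):
    """ROC AUC as the probability that a positive pair outscores a negative pair."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy graph (hypothetical): phenotype P is linked to genes G1 and G2 but not G3.
ADJ = build_adjacency([
    ("P", "G1"), ("P", "G2"), ("P", "D1"), ("P", "D2"),
    ("G1", "D1"), ("G1", "D2"), ("G2", "D1"), ("G3", "X"),
])
GENES = ["G1", "G2", "G3"]
LABELS = [1 if g in ADJ["P"] else 0 for g in GENES]  # existing direct links
CN_SCORES = [link_scores(ADJ, "P", g)["common_neighbors"] for g in GENES]
AUC = roc_auc(CN_SCORES, LABELS)
```

Here the common neighbors score separates linked from unlinked genes perfectly, because the linked genes share intermediate nodes with the phenotype and the unlinked gene shares none.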

Fig. 13

Validation with Use Case 1. Comparison of (a) ROC and (b) PRC curves calculated for the link prediction scores vs. the presence or absence of direct links (binary classifier) between Tetralogy of Fallot (HP:0001636) phenotype and human genes in Petagraph. The analysis shows that the Common Neighbors scores and Jaccard’s Index similarity metric behave like near-perfect classifiers of such phenotype to gene connectivities.

Table 5 Prioritization of genes related to heart defects based on KF data.

Use Case 2: Validation by predicting drug side effects within tissues of interest

The drug rofecoxib was recalled in 2004 because of an observed increased risk of cardiovascular events, most notably heart attack and stroke78. We queried Petagraph for all genes shared between the transcriptional profiles for rofecoxib (CHEBI:8887) in LINCS (L1000, CMAP) and all tissue transcriptional profiles in GTEx with expression above a minimum level (TPM > 5) (GTEXEXP). For each tissue, we then computed the number of shared genes as a ratio of the total number of genes with TPM > 5, normalized this ratio to the highest value across all tissues, and ranked the tissues accordingly.
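The ratio, normalization, and ranking steps can be sketched in Python as follows. This is a minimal illustration that assumes the query results have already been exported as a set of drug-associated genes and a per-tissue map of gene TPM values; the function name, gene symbols, tissue names, and TPM values are hypothetical and are not Petagraph output.

```python
def rank_tissues(drug_genes, tissue_tpm, min_tpm=5.0):
    """Rank tissues by the normalized fraction of expressed genes shared with the drug profile."""
    ratios = {}
    for tissue, tpm_by_gene in tissue_tpm.items():
        # Keep only genes expressed above the threshold in this tissue.
        expressed = {gene for gene, tpm in tpm_by_gene.items() if tpm > min_tpm}
        if expressed:
            ratios[tissue] = len(expressed & drug_genes) / len(expressed)
    top = max(ratios.values(), default=0.0)
    # Normalize to the highest ratio so the top tissue scores 1.0.
    ranked = [(tissue, ratio / top if top else 0.0) for tissue, ratio in ratios.items()]
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Toy inputs (hypothetical values).
DRUG_GENES = {"G1", "G2", "G3"}
TISSUE_TPM = {
    "heart": {"G1": 12.0, "G2": 8.0, "G3": 6.0, "G4": 20.0, "G5": 1.0},
    "liver": {"G1": 9.0, "G2": 2.0, "G6": 30.0, "G7": 7.0, "G8": 11.0},
}

RANKED = rank_tissues(DRUG_GENES, TISSUE_TPM)
```

In this toy example, three of the four genes expressed in heart are shared with the drug profile, compared with one of four in liver, so heart ranks first with a normalized score of 1.0.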

Figure 14 summarizes the result of the query used to extract the shared genes in the transcriptional profiles from rofecoxib and human tissues. This query is based on the genes with expression levels higher than a predefined threshold (TPM > 5) with relationships with rofecoxib in the LINCS dataset. As shown in Figure 14, the ranked tissues point to different organs; the LINCS data provides a correct prediction that heart and blood vessels in the GTEx dataset are most closely related to perturbation gene profiles in rofecoxib (e.g. right auricular appendage, myocardium of left ventricle, coronary arteries). These predictions are not obvious by just examining where rofecoxib’s target gene, PTGS2, is expressed.

Fig. 14

Validation with Use Case 2. Querying possible drug-tissue interactions for rofecoxib using the CMAP, LINCS L1000 and GTEXEXP datasets. Top 19 tissues with the highest ratio of genes correlated with rofecoxib (LINCS L1000) to all genes (TPM > 5).

Use Case 3: Validation through shortest path analysis of subgraphs

Epidemiological analysis has shown that brain tumors are more common in children with CNS abnormalities than in the general population79,80. In this analysis, we focused on constructing and examining a subgraph derived from Petagraph, specifically targeting the connections between HPO Concept nodes related to abnormal CNS morphology and CNS neoplastic processes. The subgraph, comprising 777 nodes as shown in Figure 15a, was formed by identifying all shortest paths linking a selected set of 54 CNS morphology abnormalities with 54 CNS neoplastic process nodes. Upon analyzing this subgraph, we observed particular patterns in node degree distribution, link traversal frequencies, and shortest path lengths, as detailed in Figure 15b–d. The distributions observed align with a scale-free structure, which has been recurrently identified within various segments of Petagraph. This type of structure is significant because scale-free networks are known for few highly connected nodes (hubs) amidst many nodes with fewer connections, which allows for shorter direct paths between nodes as compared to randomly configured graphs of the same size. Scale-free graph structures are prevalent in many biological systems and data collections.
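The construction of such a subgraph can be sketched with breadth-first search. The following Python outline is an illustration under stated assumptions, not the pipeline used for Figure 15: the graph is treated as undirected and unweighted, the function names are hypothetical, and the toy nodes stand in for phenotype, gene, and neoplasm Concept nodes. An edge (u, v) lies on a shortest path from source s to target t exactly when dist(s, u) + 1 + dist(v, t) equals dist(s, t).

```python
from collections import deque


def bfs_distances(adj, start):
    """Hop distances from `start` to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in adj.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def shortest_path_subgraph(adj, sources, targets):
    """Union of all shortest paths between every source and every target."""
    sub = {}
    dist_to_target = {t: bfs_distances(adj, t) for t in targets}
    for s in sources:
        dist_from_source = bfs_distances(adj, s)
        for t in targets:
            if t not in dist_from_source:
                continue  # no path between this pair
            total = dist_from_source[t]
            dt = dist_to_target[t]
            for u, du in dist_from_source.items():
                for v in adj.get(u, ()):
                    if v in dt and du + 1 + dt[v] == total:
                        sub.setdefault(u, set()).add(v)
                        sub.setdefault(v, set()).add(u)
    return sub


def rank_by_degree(sub, candidates):
    """Rank the candidate nodes present in the subgraph by their degree within it."""
    present = [n for n in candidates if n in sub]
    return sorted(((n, len(sub[n])) for n in present), key=lambda item: (-item[1], item[0]))


# Toy graph (hypothetical): phenotypes p1, p2; neoplasms n1, n2; genes g1, g2.
EDGES = [("p1", "g1"), ("p2", "g1"), ("g1", "n1"), ("g1", "n2"), ("p2", "g2"),
         ("g2", "n2"), ("p1", "x"), ("x", "y"), ("y", "n1")]
ADJ = {}
for a, b in EDGES:
    ADJ.setdefault(a, set()).add(b)
    ADJ.setdefault(b, set()).add(a)

SUB = shortest_path_subgraph(ADJ, ["p1", "p2"], ["n1", "n2"])
GENE_RANKING = rank_by_degree(SUB, ["g1", "g2"])
```

The longer detour through x and y is excluded from the subgraph, and the gene that sits on the most shortest paths acquires the highest degree, which mirrors the hub-like behavior described above.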

Fig. 15

Validation through shortest path analysis. Analysis of connections between HPO terms in a Petagraph subgraph between abnormal central nervous system (CNS) morphology and CNS neoplastic processes. (a) An example Petagraph subgraph consisting of shortest paths connecting 54 abnormal CNS morphology phenotypes and 54 CNS neoplasms, containing a total of 777 nodes including 191 human genes. Colors reflect the edge traversal counts and node degree centralities. (b) Node degree histogram of the subgraph. (c) Edge traversal count frequencies for the subgraph. (d) Shortest path length distribution of the subgraph.

Investigation into the node degrees within this subgraph allowed us to rank human genes based on their connectivity. Genes with common links between CNS phenotypic abnormalities and brain cancers could be implicated in both conditions with origins in perturbed developmental processes. The top-ranked genes, as listed in Table 6, offer potential targets for further research into their relevance to both CNS disorders and neoplasms.

Table 6 Genes ranked by their degree in a subgraph generated from all the shortest paths between selected abnormal CNS phenotypes and CNS neoplasms.
Table 7 Summary of Concept-to-Concept Relationships in the MSigDB dataset within Petagraph.
Table 8 List of the 4DN dot calls files, their description and download URL.
Table 9 Preprocessing scripts to prepare datasets ingested into Petagraph.

Limitations

Despite its strengths, Petagraph has certain limitations. The inherent complexity and variability of the data sources integrated into the graph will always be a challenge. Ensuring the accuracy and consistency of data ingestion and mapping processes requires expert curation and continuous refinement and validation. Additionally, the scalability of Petagraph, while robust, may face constraints as increasingly large and diverse datasets are incorporated, necessitating ongoing optimization of computational resources and algorithms.

Another limitation is the potential bias introduced by the selection and integration of specific datasets. While Petagraph aims to be comprehensive for general use, the inclusion of certain data and the exclusion of others can influence the results and interpretations derived from the graph. Biomedical data also tend to be sparse, which can compound biases in the available data. We recommend careful consideration and transparency in the data selection process for those who extend Petagraph with additional data for their own purposes.

Usage Notes

Usage documentation

To help users query Petagraph, we have written a user guide with example queries, available at https://github.com/TaylorResearchLab/Petagraph/blob/main/petagraph/user_guide.md.

Analysis considerations

Before using Petagraph for analysis, users should consider several factors. Once a use case is defined, users may want to identify and familiarize themselves with the datasets and ontologies that the use case will involve (if they are known). This helps for two reasons. First, explicitly stating which datasets or relationships to search for in a Cypher query can drastically reduce query run time. Second, restricting a query to a predefined set of datasets and relationships can make results easier to interpret. Furthermore, to avoid writing nonsensical queries and to interpret results correctly, it is important to consult the data dictionary and to understand the basic preprocessing and data modeling decisions made while creating Petagraph. Located on the Petagraph GitHub71, the data dictionary includes detailed descriptions of the datasets and preprocessing steps, as well as images and descriptions of how each dataset has been modeled. Users should also be wary of comparing quantitative results from different data sources. For example, p-values from two separate sources should not be directly compared with each other. For more complex use cases (and queries), it may be helpful to define a projected subgraph using functions from the Neo4j Graph Data Science (GDS) library81. A graph projection creates an in-memory graph that can be queried more quickly than the database and may help to speed up analysis.
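One way to state datasets explicitly is to pass them as a query parameter. The Python helper below is a hedged sketch that only builds a parameterized Cypher string; it does not connect to a database. The function name is hypothetical, and the schema it relies on (a `Concept` label, an `SAB` property on relationships, and a `CUI` property on Concept nodes) is an assumption that should be checked against the Petagraph data dictionary before use.

```python
def build_sab_filtered_query(sabs, limit=25):
    """Return (cypher, parameters) for Concept-to-Concept edges from the named datasets.

    Assumed schema (verify against the data dictionary): relationships carry an
    `SAB` property naming their source dataset, and Concept nodes carry `CUI`.
    """
    unique_sabs = sorted(set(sabs))
    if not unique_sabs:
        raise ValueError("at least one dataset (SAB) must be given")
    if int(limit) < 1:
        raise ValueError("limit must be a positive integer")
    cypher = (
        "MATCH (a:Concept)-[r]->(b:Concept) "
        "WHERE r.SAB IN $sabs "
        "RETURN a.CUI AS source, type(r) AS relationship, "
        "r.SAB AS dataset, b.CUI AS target "
        "LIMIT $limit"
    )
    return cypher, {"sabs": unique_sabs, "limit": int(limit)}


# Example: restrict a query to two GTEx-derived datasets named in this paper.
QUERY, PARAMS = build_sab_filtered_query(["GTEXEQTL", "GTEXEXP", "GTEXEXP"], limit=10)
```

Passing dataset names as parameters rather than concatenating them into the query string keeps the query text constant across runs and makes the set of datasets behind a result explicit, which supports the interpretability point above.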

Careful selection of Cypher query strategies can help return results rapidly. For example, recursive queries can be performed on OBO-compliant ontologies in Petagraph. Ontologies such as HPO, MP, and UBERON can be queried recursively to include child nodes at any specified level. This can be useful, for example, when a user wants to include information about a general disease condition instead of just a single phenotype term. In Figure 16, we show how this can be done by querying and returning eQTLs for the term Atrial Septal Defects in addition to all of its direct child terms. When writing a Cypher statement, a user can query recursively by using the ‘*’ operator inside a relationship definition.
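The ‘*’ operator takes optional hop bounds, so `*0..1` matches a term itself plus its direct children, and a larger upper bound reaches deeper descendants. The Python helper below is a sketch that builds such a variable-length pattern; it is not the query shown in Figure 16. The function name is hypothetical, and the relationship type `isa`, the `CODE` relationship between Concept and Code nodes, and the `CodeID` and `CUI` properties are assumptions to be verified against the Petagraph data dictionary. Because hop bounds and relationship types cannot be passed as Cypher parameters, they are validated before being placed in the query text.

```python
import re


def build_recursive_query(max_depth, rel_type="isa"):
    """Build a Cypher query returning a term and its descendants up to `max_depth` hops.

    The starting term is supplied at run time through the $code_id parameter.
    """
    if isinstance(max_depth, bool) or not isinstance(max_depth, int) or max_depth < 0:
        raise ValueError("max_depth must be a non-negative integer")
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", rel_type):
        raise ValueError("rel_type must be a simple identifier")
    return (
        "MATCH (:Code {CodeID: $code_id})<-[:CODE]-(parent:Concept) "
        f"MATCH (term:Concept)-[:{rel_type}*0..{max_depth}]->(parent) "
        "RETURN DISTINCT term.CUI AS cui"
    )


# Depth 1: the parent term and its direct children only.
DIRECT_CHILDREN_QUERY = build_recursive_query(1)
# Depth 3: also includes grandchildren and great-grandchildren.
DEEPER_QUERY = build_recursive_query(3)
```

Bounding the depth is a deliberate choice in this sketch: an unbounded `*` on a large ontology can expand to many paths, so starting with a small upper bound and increasing it as needed keeps run times predictable.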

Fig. 16

Recursive queries on ontologies allow results to be returned at increasing search depths. This is an example of a recursive (transitive closure) search using the mammalian phenotype (MP) term Atrial Septal Defects (ASD) as the parent phenotype. This query returns all of the eQTLs (p-value < 0.05) from ASD (MP:00110403) and its direct child phenotypes, based on genes that are expressed in the GTEx heart data. Concept nodes are in blue, Code nodes in purple.