Abstract
Over the past decade, there has been substantial growth in both the quantity and complexity of available biomedical data. In order to more efficiently harness this extensive data and alleviate challenges associated with integration of multi-omics data, we developed Petagraph, a biomedical knowledge graph that encompasses over 32 million nodes and 118 million relationships. Petagraph leverages more than 180 ontologies and standards in the Unified Biomedical Knowledge Graph (UBKG) to embed millions of quantitative genomics data points. Petagraph provides a cohesive data environment that enables users to efficiently analyze, annotate, and discern relationships within and across complex multi-omics datasets supported by UBKG’s annotation scaffold. We demonstrate how queries on Petagraph can generate meaningful results across various research contexts and use cases.
Background & Summary
The annual increase in the volume and complexity of biomedical data has posed significant challenges for analysts interested in comprehensive data integration, and requires advanced tools to harness the potential of biomedical datasets that include omics data. Knowledge graphs are the best near-term solutions for the integration and analysis of heterogeneous data within and between large biomedical datasets1. Methods such as node and link prediction algorithms, as well as supervised and unsupervised machine learning, can be applied to biomedical knowledge graphs for many types of use cases.
Most biomedical knowledge graphs are customized for particular use cases. In recent years, knowledge graphs have witnessed a rapid proliferation in their adoption, with applications spanning drug discovery, drug repurposing, and the prediction of drug targets2,3,4,5,6. Other biomedical knowledge graphs have integrated heterogeneous COVID-19 data7,8,9,10,11,12,13, oncology datasets14,15,16, and gene-disease associations3,17. These knowledge graphs are very application-specific, which is expected for efficiency and analysis. General genomic data-integrated knowledge graphs such as Petagraph and GenomicKB18 are expected to increase in number, helped in no small part by the maturation of projects on ontological unification such as the Monarch Initiative19 and on genomics data standards such as GA4GH20.
To facilitate the widespread adoption of knowledge graphs in the biomedical research community, we aimed to develop a modular knowledge graph framework comprising ontologies, vocabularies, standards, and commonly used data resources including omics data. This framework would enable the efficient creation of knowledge graphs for diverse applications. Our requirements for this knowledge graph framework included the need to accommodate various omics data types within a network of interconnected standards and ontology systems to ingest and seamlessly link experimentally-derived data to other data sources within the graph.
In order to support these requirements, we created the Unified Biomedical Knowledge Graph (UBKG)21. The UBKG takes the form of a property graph, based on the NIH Unified Medical Language System (UMLS)22, and incorporates 105 English language-based ontologies and standards updated regularly from the biannual UMLS releases. The UBKG extends the content from the UMLS by adding biomedical data from a variety of other sources, including both standard ontologies and custom reference data. The UBKG can be adapted to construct a variety of biomedical ontological knowledge graphs. It has already been adapted for projects such as HuBMAP23, SenNet24, and the NIH Common Fund Data Ecosystem Data Distillery Project25. Table 2 provides a list of additional ontologies that the UBKG contributes to the UMLS ontology collection, which will enhance support for building knowledge graphs and supporting future omics initiatives.
Adopting the UBKG as a base ontological scaffold, we constructed Petagraph, a biomedical knowledge graph that embeds and connects omics and related data into the UBKG’s ontological structure, effectively ‘bringing the data to the ontologies’ by embedding these data within a richly connected annotation environment. Petagraph has added over 12 million nodes to the UBKG, an increase of almost 44%, more than doubling the number of relationships from 52 million to 118 million. Petagraph increases independent relationship types to 1,861 versus UBKG’s 1,756. Petagraph was originally piloted as a resource for rapid feature selection to identify, annotate, and explore gene candidates for human diseases, and has since expanded to act as a base module for user-customized knowledge graphs for other types of biomedical use cases. To create Petagraph, we added 21 sources of supporting omics and annotation to UBKG (Tables 2 and 3). Among these additions is the Homo sapiens Chromosomal Location Ontology (HSCLO38), developed to allow Petagraph queries to easily link relevant genomics features across different resolutions by chromosome position and chromosomal vicinity26. The modular design for data incorporation enables any user to add and/or subset data on top of Petagraph’s omics-rich data structure for their particular use case. Petagraph is therefore distinguished by its generalizable schema and richly annotated structure, which will allow for the subsetting of highly integrated omics data for a myriad of use cases.
In choosing datasets for Petagraph, we prioritized the integration of interpreted knowledge over raw experimental data. The genomic and related datasets within Petagraph have been curated, harmonized, and interpreted through thousands of hours of expert effort, ensuring that users have access to high-quality data for their queries and thus increasing the likelihood of generating meaningful insights. We show that queries on Petagraph can, in fact, rapidly return meaningful results for a diverse set of example use cases, including those for annotation and analysis. The integration of large datasets into Petagraph has the potential to advance how biomedical and biomolecular data is mined and analyzed. By harmonizing diverse data types, including genomics, transcriptomics, proteomics, and clinical data, Petagraph supports system-wide approaches to analysis.
Use cases for Petagraph include but are not limited to identifying genomic features functionally linked to genes or diseases, linking across genetics data between human and animal models, linking transcriptional perturbations by compound in tissues of interest, or identifying cell types from single-cell data that are most associated with diseases or genes. Analytical use cases for Petagraph’s data collection include applying machine learning methods or topological analyses to predict relationships such as link prediction, or predicting new properties on biomedical data types, such as node property prediction.
The scalable and modular design of Petagraph allows for continuous incorporation of new datasets. Petagraph’s mature architecture is designed to be extended easily by users wishing to build their own custom knowledge graph. Utilizing the Unified Biomedical Knowledge Graph (UBKG) ingestion protocols, Petagraph users can easily integrate additional data sources, whether they are publicly available or proprietary. This modularity ensures that researchers can curate the knowledge graph to include specific builds relevant to their unique applications. The node and edge CSV files that are used to construct Petagraph are 12.5 GB in size and can be used to reconstruct the database or can be used with our instructions to expand or change the knowledge graph with the UBKG as the base scaffold. We provide a 4.5 GB database dump of Petagraph that can be installed on local laptops with at least 16 GB of memory. Petagraph is currently being utilized as a base module by the NIH Common Fund Data Ecosystem (CFDE) Data Distillery Project to integrate Common Fund omics data from twelve data coordination centers25.
Future work will focus on expanding Petagraph’s capabilities, enhancing automated validation techniques, and developing standardized benchmarks for knowledge graph comparisons. Another emerging and important research area is the integration of knowledge graphs with Large Language Models (LLMs) to enhance biomedical data analysis. Curated knowledge graphs like Petagraph provide a source of structured knowledge that can improve LLMs’ understanding and generation of biomedical knowledge. LLMs can use Petagraph to generate more accurate and contextually relevant responses to complex biomedical queries, with many potential applications in personalized medicine, drug discovery, and clinical outcome prediction.
In conclusion, Petagraph represents a significant advancement in the integration and analysis of multi-omics and biomedical data. Its robust framework, extensive dataset integration, and advanced analytical capabilities position it as a valuable resource for the biomedical research community, enabling new discoveries and fostering a deeper understanding of complex biological systems.
Methods
We discuss the origins and versions of records used in Petagraph by the type of data source: ontologies, mappings, and source genomics data sets. Details of these datasets can be found in Tables 2 and 3. For clarity, we indicate graph elements in courier font, for example, distinguishing graph elements such as HGNC nodes from the dataset source HGNC.
Records for supporting ontologies
This section represents ontologies and ontology collections.
UBKG base context
The UBKG acts as the base build for Petagraph. The foundation of the UBKG is a set of entities and relationships obtained from the NIH Unified Medical Language System (UMLS). The UMLS consolidates the content of a large number of standard biomedical vocabularies. The set of assertions obtained from extracted UMLS 2023AB content is referred to as the UMLS source context27. The UBKG adds to the data from UMLS information from a variety of sources, including:
- Custom data sources, including those from CFDE partners such as HuBMAP23
The combination of the UMLS source context with the additional sources is referred to as the UBKG base context27.
Homo sapiens chromosomal location ontology (HSCLO38)
The Petagraph team created the Homo sapiens Chromosomal Location Ontology for GRCh38 (HSCLO38) primarily to connect genomic features by chromosomal position. HSCLO38 is utilized to connect features from annotation standards such as GENCODE and datasets such as 4DN and GTEXEQTL to locations in the graph as searchable nodes at 1 kbp resolution. The HSCLO38 nodes are defined at 5 resolution levels: chromosome, 1 Mbp, 100 kbp, 10 kbp, and 1 kbp, with each level connected to the adjacent larger and smaller size scales. The HSCLO38 tree contains 3,431,155 nodes and 6,862,195 relationships (13,724,390 including reverse edges)26.
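As an illustration of how a genomic coordinate could be resolved to nested bins at the five resolution levels described above, the following Python sketch computes the enclosing bin at each level. The label format (`chr17.7668001-7669000`) and the 1-based closed-interval convention are assumptions made for this example and are not a specification of the actual HSCLO38 codes.

```python
# Bin sizes (in base pairs) for the four levels below the chromosome level.
RESOLUTIONS = {"1Mbp": 1_000_000, "100kbp": 100_000, "10kbp": 10_000, "1kbp": 1_000}


def enclosing_bins(chrom, pos):
    """Return the enclosing bin label at each resolution for a 1-based position.

    The 'chrom.start-end' label format is hypothetical. The final bin on a
    chromosome is not truncated to the chromosome length in this sketch.
    """
    if pos < 1:
        raise ValueError("position must be a positive 1-based coordinate")
    bins = {"chromosome": chrom}
    for name, size in RESOLUTIONS.items():
        start = ((pos - 1) // size) * size + 1
        end = start + size - 1
        bins[name] = f"{chrom}.{start}-{end}"
    return bins


def parent_child_pairs(chrom, pos):
    """List (coarser, finer) label pairs linking each level to the next one down."""
    labels = list(enclosing_bins(chrom, pos).values())
    return list(zip(labels[:-1], labels[1:]))


bins = enclosing_bins("chr17", 7_668_421)
pairs = parent_child_pairs("chr17", 7_668_421)
```

Because each finer bin nests exactly inside one coarser bin, a feature attached at 1 kbp resolution can be reached from any of the coarser levels by following the parent-child pairs.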
Supporting mappings
Genomic features-to-chromosomal position (GENCODEHSCLO)
Gene names with chromosomal positions were downloaded on 2023-12-10 from GENCODE31 v41 (https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_41/). GENCODE genomic features are mapped to 1 kbp resolution nodes in HSCLO38 at their 5′ and 3′ genomic positions. The preprocessing script is provided in GitHub (Table 9).
Human-to-mouse ortholog mappings (HCOP)
Human-to-mouse orthology mappings were downloaded from the HGNC Comparisons of Orthology Predictions (HCOP) tool (https://www.genenames.org/tools/hcop/)32 on 04/15/2021. HCOP maps mouse orthologs to 20,715 of the 41,638 HGNC Concept nodes. Each orthologous pair of human-mouse gene nodes shares the reciprocal relationships ‘has_human_ortholog’ and ‘has_mouse_ortholog’.
Human gene-to-phenotype mappings (HGNCHPO)
Mapping data from human genes to human phenotypes were obtained in July 2021 from the Human Phenotype Ontology (HPO) (https://hpo.jax.org/app/data/annotations)33. The dataset contains 4,545 genes mapped to at least one phenotype and 10,896 phenotypes mapped to at least one gene. The HUGO Gene Nomenclature Committee’s (HGNC)34 gene names are included with the UMLS as HGNC Concept nodes, which are connected to HPO Concept nodes through an ‘associated_with’ relationship type.
Human-to-mouse phenotype mappings (HPOMP)
Human-to-mouse phenotype mapping data connecting Human Phenotype Ontology (HPO) terms to Mammalian Phenotype Ontology (MP) terms were generated on 2020-12-12 using PheKnowLator35 v1 (Table 9). We included only PheKnowLator’s “exact phenotype matches” between HPO and MP, resulting in ~1000 mappings. More detailed descriptions of the PheKnowLator mapping scores can be found on their GitHub page (https://github.com/callahantiff/PheKnowLator). Precise mapping of human phenotypes in HPO to mammalian phenotypes in MP is an ongoing effort by the MONDO and uPheno projects19.
Mouse gene-to-phenotype mappings (MPMGI)
We were interested in including mouse gene-to-phenotype relationships. Data were downloaded (2021-01-10) as multiple datasets from two separate databases. The first set of datasets was downloaded from the ftp site of the International Mouse Phenotyping Consortium (IMPC)36 (https://www.mousephenotype.org/) and comprises genotype and phenotype assertions (http://ftp.ebi.ac.uk/pub/databases/impc/all-data-releases/latest/results/genotype-phenotype-assertions-ALL.csv.gz) and the results of statistical analyses connecting mouse genes to phenotypes (http://ftp.ebi.ac.uk/pub/databases/impc/all-data-releases/latest/results/statistical-results-ALL.csv.gz). The second set of datasets was obtained from the Mouse Genome Informatics (MGI) database37 and can be found at (http://www.informatics.jax.org/downloads/reports/index.html#pheno). We imported the MGI datasets PhenoGeno (http://www.informatics.jax.org/downloads/reports/MGI_PhenoGenoMP.rpt), GenePheno (http://www.informatics.jax.org/downloads/reports/MGI_GenePheno.rpt), and Geno_Disease (http://www.informatics.jax.org/downloads/reports/MGI_Geno_DiseaseDO.rpt). All 3 datasets contain, among other data, phenotype-to-gene mappings. The datasets from IMPC and MGI were combined to create a mouse genotype-to-phenotype dataset. This master dataset, MPMGI, contains 10,380 MP terms that are mapped to at least one gene and 17,936 genes that are mapped to at least one MP term.
Molecular signatures database (MSIGDB)
MSigDB is a collection of gene set resources, curated or collected from several different sources38,39 (https://www.gsea-msigdb.org/gsea/msigdb/). Five subsets of MSigDB v7.4 datasets were introduced as entity-gene relationships to the knowledge graph: C1 (positional gene sets), C2 (curated gene sets), C3 (regulatory target gene sets), C8 (cell type signature gene sets) and H (hallmark gene sets). With this subset, we created MSIGDB Concept nodes for 31,516 MSigDB systematic names (used as Codes). The MSIGDB and HGNC Concept nodes are connected by relationships that reflect the content of each of the MSigDB subsets. For example, a pathway in the MSigDB Hallmark dataset will link to its member genes through the has_signature_gene and reverse as inverse_has_signature_gene edge types. The MSIGDB Term names were also compiled according to the MSigDB generic entity names. Collectively, MSigDB adds 2,598,060 Concept-to-Concept direct and inverse relationships to Petagraph. Details on MSigDB relationships types connecting the Concept nodes are found in Table 7. Preprocessing scripts for MSIGDB relationships are deposited in GitHub (Table 9).
Human-to-rat ENSEMBL mappings (RATHCOP)
Human-to-rat ENSEMBL40 ortholog mappings were obtained from the HGNC Comparisons of Orthology Predictions tool (HCOP)32 (https://www.genenames.org/tools/hcop/), downloaded on 2023-11-16.
Supporting genomics data sets
Summary details for each mapping and quantitative dataset are provided in Tables 3 and 4.
4D Nucleome program (4DN)
We obtained 21 loop files (Table 8) stored in dot call format from the 4D Nucleome project41 website (https://www.4dnucleome.org) on 2023-01-05. The loop files were further processed for ingestion by creating four kinds of nodes: dataset nodes (SAB: 4DND), with the respective terms containing the dataset information (assay type, lab and cell type involved); file nodes (SAB: 4DNF), with the respective terms containing the file information; loop nodes (SAB: 4DNL), attached to HSCLO38 nodes at the 1 kbp resolution level corresponding to the upstream start and end and downstream start and end nodes of the characteristic anchor of the loop; and q-value nodes (SAB: 4DNQ), corresponding to the donut q-value of the loops. Preprocessing scripts to format the loop and q-value data were deposited in GitHub (Table 9).
Single cell fetal heart data (ASP2019)
This dataset includes single cell RNA-seq data from human fetal heart tissue as described in Asp et al.42. The data was downloaded on 2021-08-10 (https://www.spatialresearch.org/resources-published-datasets/doi-10-1016-j-cell-2019-11-025/). The average gene expression of each author-supplied cell type cluster was calculated and used to represent each gene within the cluster with a preprocessing script (Table 9). Single cell fetal heart Concept nodes were created and connected to cell type nodes from the Cell Ontology (CL) and to HGNC nodes. Some cell types defined in the Asp et al. paper do not currently exist in the CL. We created our own cell type Concept nodes for these cell types with an SAB of ASP2019CLUSTER. The Single cell heart Code nodes have an SAB of ASP2019.
ClinVar (CLINVAR)
The ClinVar43 human genetic variants-phenotype submission summary dataset was utilized to define relationships between human genes and phenotypes. It was downloaded on 2023-01-05 from the FTP site (https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz) and preprocessed to create edge files (Table 9). As a preprocessing step, we imported only genes with single nucleotide variants (SNVs) annotated as pathogenic, likely pathogenic, or pathogenic/likely pathogenic. Diseases and phenotypes included from ClinVar were sourced from MEDGEN44, MONDO45, HPO33, EFO46 and MESH47. Diseases and phenotypes were linked to genes with the 'gene_associated_with_disease_or_phenotype' edge type and its reverse, 'inverse_gene_associated_with_disease_or_phenotype'. As a result, ClinVar represents 214,040 relationships (including reverse) connecting genes to diseases and/or phenotypes, thus connecting HGNC and MEDGEN, MONDO, HPO, EFO, and MSH Concept nodes. Preprocessing scripts for CLINVAR relationships were deposited in GitHub (Table 9).
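The filtering rule described above can be sketched in Python as follows. This is a minimal illustration rather than the deposited preprocessing script: the column names (`Type`, `GeneSymbol`, `ClinicalSignificance`) are assumed to match the header of the variant summary file and should be checked against the downloaded version, and the sample rows are invented placeholders, not real ClinVar records.

```python
import csv
import io
from collections import Counter

# Clinical significance labels retained, per the filtering rule in the text.
KEPT_SIGNIFICANCE = {
    "pathogenic",
    "likely pathogenic",
    "pathogenic/likely pathogenic",
}


def count_pathogenic_snvs(handle):
    """Count qualifying SNV records per gene symbol in a tab-delimited table.

    Column names are assumptions about the file header, not confirmed here.
    """
    counts = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["Type"] != "single nucleotide variant":
            continue  # keep SNVs only
        if row["ClinicalSignificance"].strip().lower() not in KEPT_SIGNIFICANCE:
            continue  # drop benign, uncertain, conflicting, etc.
        counts[row["GeneSymbol"]] += 1
    return counts


# Made-up rows for demonstration only.
SAMPLE = (
    "Type\tGeneSymbol\tClinicalSignificance\n"
    "single nucleotide variant\tGENE_A\tPathogenic\n"
    "single nucleotide variant\tGENE_A\tLikely pathogenic\n"
    "single nucleotide variant\tGENE_B\tBenign\n"
    "Deletion\tGENE_C\tPathogenic\n"
    "single nucleotide variant\tGENE_D\tPathogenic/Likely pathogenic\n"
)

counts = count_pathogenic_snvs(io.StringIO(SAMPLE))
```

Genes that survive this filter would then be paired with their disease or phenotype identifiers to form the edge file.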
Connectivity MAP (CMAP)
We obtained the edge lists of the CMAP Signatures of Differentially Expressed Genes for Small Molecules dataset from the Harmonizome database (https://maayanlab.cloud/Harmonizome/dataset/CMAP+Signatures+of+Differentially+Expressed+Genes+for+Small+Molecules)48,49. These edge lists combine chemical data from CHEBI50 with HGNC gene IDs. The dataset features genes from microarray gene expression molecular signatures that were responsive to a chemical perturbation introduced to selected human cell lines. CHEBI and HGNC connectors in CMAP have edge types 'positively_correlated_with_gene' and 'negatively_correlated_with_gene' plus their inverses. The dataset added 2,625,336 relationships (including reverse). Preprocessing scripts for CMAP relationships were deposited in GitHub (Table 9).
GTEx, Expression and eQTL data (GTEXEXP and GTEXEQTL)
Human gene expression per tissue data was obtained from Genotype-Tissue Expression (GTEx) Portal (Version 8) (https://gtexportal.org/home/datasets) on 2023-04-10. We preprocessed the gene expression dataset (GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm), as well as the expression quantitative trait loci (eQTL) dataset (GTEx_Analysis_v8_eQTL) (Table 9). The gene expression dataset contains expression profiles from 54 tissues across 56,200 transcripts. The eQTLs dataset contains over 1.2 million eQTLs from 49 tissues. GTEx includes HGNC gene IDs and Uberon tissue names51. We created Concept nodes for each eQTL and each tissue-gene expression pair. The eQTL Concepts were then connected to their corresponding tissue node (UBERON), gene node (HGNC) and genomic location node (HSCLO38). The gene expression nodes are connected to their corresponding tissue node and gene node. We also integrated quantitative data from GTEx including p-values for the eQTL data and transcripts per million (TPM) for the gene expression data into the graph. In order to reduce redundancy of nodes we created bin Concept nodes for these quantitative data types. For example, if the gene TP53 has a TPM of 10.5 in the heart, the GTEx Concept for TP53 - Heart will be connected to the ‘[10.0.11.0]’ TPM bin Concept node. Similarly, if an eQTL has a p-value of 0.0005 it will be connected to the ‘[0.0001.0.001]’ p-value bin Concept node.
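To make the binning idea concrete, the sketch below assigns a TPM value and a p-value to bin intervals in Python. The bin scheme is an assumption inferred from the two examples in the text (unit-width bins for TPM and one bin per order of magnitude for p-values); the actual bin boundaries and bin node labels used in Petagraph may differ.

```python
import math


def tpm_bin(tpm, width=1.0):
    """Return the (lower, upper) bounds of the fixed-width bin containing tpm."""
    if tpm < 0:
        raise ValueError("TPM values cannot be negative")
    lower = math.floor(tpm / width) * width
    return (lower, lower + width)


def pvalue_bin(p):
    """Return the (lower, upper) bounds of the order-of-magnitude bin containing p."""
    if not 0 < p <= 1:
        raise ValueError("p-value must lie in (0, 1]")
    # p == 1 is placed in the top bin (0.1, 1) rather than opening a bin above 1.
    exponent = -1 if p == 1 else math.floor(math.log10(p))
    return (10.0 ** exponent, 10.0 ** (exponent + 1))


tpm_bounds = tpm_bin(10.5)      # the TP53 heart example from the text
p_bounds = pvalue_bin(0.0005)   # the eQTL example from the text
```

Linking many gene-tissue Concepts to a shared bin node, instead of creating one node per distinct numeric value, is what keeps the number of quantitative nodes small.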
GTEx, Coexpression data (GTEXCOEXP)
Coexpression of human genes was computed using the GTEx gene TPM (transcript per million) normalized data (GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm) (https://gtexportal.org/home/datasets) by first computing the correlation matrix for each of the GTEx tissues and selecting the entries with pairwise Pearson’s correlation above 0.99. This preprocessing was done for 54 human tissues and cell types. As a result, 1,078,042 relationships (including reverse) connecting HGNC Concept nodes were ingested into the KG. These relationships link genes that are co-expressed, with evidence codes reflecting the number of tissues where the connected genes meet the criteria for co-expression. Preprocessing scripts for GTEXCOEXP relationships were deposited in GitHub (Table 9).
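The per-tissue thresholding and the tissue-count evidence can be sketched as follows. This is a small pure-Python illustration under stated assumptions: the input layout (`{tissue: {gene: [TPM per sample]}}`), the function names, and the toy values are hypothetical, and a genome-scale run would need a vectorized correlation matrix (for example via `numpy.corrcoef`) rather than this pairwise loop.

```python
import math
from collections import Counter
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant profile: correlation undefined, treated as no co-expression
    return cov / math.sqrt(vx * vy)


def coexpression_evidence(tissue_expr, threshold=0.99):
    """Count, for each gene pair, the tissues where correlation exceeds threshold."""
    evidence = Counter()
    for expr in tissue_expr.values():
        for g1, g2 in combinations(sorted(expr), 2):
            if pearson(expr[g1], expr[g2]) > threshold:
                evidence[(g1, g2)] += 1
    return evidence


# Toy expression values, invented for the example.
toy = {
    "tissue_A": {"G1": [1, 2, 3, 4], "G2": [2, 4, 6, 8], "G3": [4, 1, 3, 2]},
    "tissue_B": {"G1": [1, 2, 3, 4], "G2": [1, 2, 3, 4.1], "G3": [4, 1, 3, 2]},
}
evidence = coexpression_evidence(toy)
```

Each retained pair would become one relationship (plus its reverse), with the tissue count carried as the evidence value.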
GlyGen selected datasets (GLYGEN)
Five datasets from the GlyGen data website52,53 were chosen based on their relevance to our preliminary use cases, and for all datasets we used release v1.12.3. The first two datasets were simply lists of genes that code for glycosyltransferase proteins in the human (https://data.glygen.org/GLY_000004) and mouse (https://data.glygen.org/GLY_000030). These datasets were modeled by creating a ‘human glycosyltransferase’ Concept node as well as a ‘mouse glycosyltransferase’ Concept node. Then, the Concept nodes for human genes (HGNC nodes) and mouse genes (MGI nodes) were connected to their respective glycosyltransferase nodes with an ‘is_glycotransferase’ relationship. The next three datasets contain human O-linked and N-linked glycosylation information from GlyGen, namely O-GlcNac (https://data.glygen.org/GLY_000517), Glyconnect (https://data.glygen.org/GLY_000329) and UniCarbKB (https://data.glygen.org/GLY_000138). These datasets contain information on human proteoforms, such as the exact residue on a protein isoform which is glycosylated, the type of glycosylation and the glycans found to bind that amino acid. To define relationships between human proteins from UniProtKB (UNIPROTKB Concept nodes)54 and glycans from the ChEBI resource50 (as included in CHEBI data), we introduced an intermediary ontology of glycosylation sites derived from the information included in these three datasets. In this process, we added 38,344 protein isoform relationships and glycosylation_type_site relationships to GLYGEN based on the three selected data sources.
Gabriella Miller Kids First datasets (KFPT)
To provide a test set for human phenotype-to-genetic-variant analysis in Petagraph, we imported subject ID-to-phenotype mappings and cohort-wide gene-variant counts downloaded from the Gabriella Miller Kids First (GMKF) Data Resource Center (https://portal.kidsfirstdrc.org) on 2022-07-01. We used summarized de-identified data originating from a Kids First congenital heart defects cohort (phs001138.v4, https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001138.v4.p2)55 based on 5,006 subject-parent trios. Subjects were modeled as Concept nodes with SAB of KFPT, for ‘Kids First Patient’, and subject IDs were connected to their respective HPO Concept nodes. Our analysis added counts of damaging de novo single nucleotide variants observed per gene across all probands’ VCFs, based on a VEP score prediction of ‘HIGH’ impact. A single summed count of predicted damaging variants per gene is reported for this cohort.
LINCS L1000 (LINCS)
We introduced gene-small molecule perturbagen relationships to Petagraph based on the LINCS L1000 edge list available on the Harmonizome database (https://maayanlab.cloud/Harmonizome/search?t=all&q=l1000)49,56. These relationships were summarized from the LINCS L1000 Signatures of Differentially Expressed Genes for Small Molecules dataset57. This was done by first finding the corresponding CHEBI Concept nodes for the L1000 small molecules and then establishing the relationships of these nodes to the HGNC nodes according to the edge list mentioned above. The relationships were then collapsed to exclude the cell line, dosage, and treatment time information, but the effect directions were retained in the relationship types. This led to 3,198,094 relationships (noted as negatively or positively correlated) between HGNC and CHEBI nodes. Preprocessing scripts for LINCS relationships are available in GitHub (Table 9).
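The collapsing step can be illustrated with a short Python sketch. The record fields, the placeholder identifiers, and the sign convention for direction are hypothetical, and the relationship names are borrowed from the CMAP edge types described earlier as an assumption; the deposited scripts remain the authoritative implementation.

```python
def collapse_perturbation_records(records):
    """Collapse per-condition records into unique directed compound-gene edges.

    Cell line, dose, and time are discarded; only the compound, the gene, and
    the direction of the effect are retained. Field names are hypothetical.
    """
    edges = set()
    for rec in records:
        if rec["direction"] > 0:
            relation = "positively_correlated_with_gene"
        else:
            relation = "negatively_correlated_with_gene"
        edges.add((rec["chebi"], relation, rec["hgnc"]))
    return sorted(edges)


# Placeholder identifiers and conditions, invented for the example.
records = [
    {"chebi": "CHEBI:X1", "hgnc": "HGNC:Y1", "direction": 1, "cell_line": "A", "dose": "10uM", "time": "6h"},
    {"chebi": "CHEBI:X1", "hgnc": "HGNC:Y1", "direction": 1, "cell_line": "B", "dose": "1uM", "time": "24h"},
    {"chebi": "CHEBI:X1", "hgnc": "HGNC:Y2", "direction": -1, "cell_line": "A", "dose": "10uM", "time": "6h"},
    {"chebi": "CHEBI:X2", "hgnc": "HGNC:Y1", "direction": -1, "cell_line": "C", "dose": "5uM", "time": "6h"},
]
edges = collapse_perturbation_records(records)
```

Note that under this scheme a compound-gene pair observed with opposite effects in different conditions would yield both a positive and a negative edge.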
Mouse Genome Informatics (MGI)
IDs for mouse genes were provided by MGI58 and downloaded from the MGI site at the Jackson Laboratory (http://www.informatics.jax.org/downloads/reports/MGI_Geno_DiseaseDO.rpt) on 2023-12-15. No filtering or preprocessing was done to the MGI Codes prior to ingestion. The MGI nodes have 22,682 relationships to their orthologous HGNC nodes. There are also 234,043 gene-to-phenotype relationships between MGI nodes and MP nodes.
STRING (STRING)
Human protein to protein interaction data was downloaded from the STRING website (https://stringdb-downloads.org/download/protein.links.v12.0/9606.protein.links.v12.0.txt.gz). As Petagraph preferentially utilizes UniprotKB, we converted ENSEMBL protein ID entries to UNIPROTKB and filtered the dataset for the top 10% of the STRING-provided “combined score”. The refined dataset contains 459,701 relationships linking UNIPROTKB nodes with a relationship property from the STRING combined score as evidence of the interaction. The preprocessing script for STRING relationships was deposited in GitHub (Table 9).
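A minimal Python sketch of the identifier conversion and score filter is given below. It assumes the links have already been parsed into `(protein1, protein2, combined_score)` tuples and that an ENSEMBL-protein-to-UniProtKB lookup table is available; the lookup table, identifiers, and function name are hypothetical, and how ties at the cutoff or one-to-many mappings were handled in the actual preprocessing is not stated in the text.

```python
def map_and_keep_top_fraction(links, ensp_to_uniprot, fraction=0.10):
    """Map protein IDs, drop unmapped links, and keep the top-scoring fraction.

    Links tied with the cutoff score are all retained, so slightly more than
    the requested fraction may be returned.
    """
    mapped = []
    for p1, p2, score in links:
        u1, u2 = ensp_to_uniprot.get(p1), ensp_to_uniprot.get(p2)
        if u1 and u2:
            mapped.append((u1, u2, score))
    if not mapped:
        return []
    ranked = sorted((score for _, _, score in mapped), reverse=True)
    keep_n = max(1, int(len(ranked) * fraction))
    cutoff = ranked[keep_n - 1]
    return [edge for edge in mapped if edge[2] >= cutoff]


# Invented identifiers and scores for demonstration.
lookup = {f"ENSP{i}": f"UP{i}" for i in range(11)}
links = [(f"ENSP{i}", f"ENSP{i + 1}", 100 * (i + 1)) for i in range(10)]
links.append(("ENSP0", "ENSP_UNMAPPED", 999))  # dropped: no UniProtKB mapping

kept = map_and_keep_top_fraction(links, lookup)
```

The retained score can then be stored as a relationship property, as described above.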
Knowledge graph construction
Here, we discuss how we built Petagraph from each of its individual components. Petagraph’s ingestion workflow facilitates the integration of large-scale biomolecular and biomedical data sets. The basic steps we took to build Petagraph were: (1) obtain a UMLS license from NIH, (2) build the UBKG from ingestion processing scripts, and (3) ingest the Petagraph CSVs into the UBKG base release. Users interested in building Petagraph from source can follow the same process; however, we also supply a database dump, discussed in the Usage Notes section.
Petagraph is built on a Neo4j graph database, version 5, community edition. The scripts for data ingestion were developed as part of the UBKG project and can be found on the UBKG project’s GitHub repository21. The ingestion scripts require any new data ingestion to utilize identifiers for the data sources and Codes supported by UBKG59 with examples such as HGNC symbols and Human Phenotype Ontology IDs. If a Concept identifier is already in the UBKG, the scripts identify the existing Concept and any new Code(s) and Term(s) are attached. In cases where there are no existing Concepts for newly introduced ontologies or data sources, the UBKG ingestion process will automatically generate a new Concept node along with its unique identifier. We have made minor adaptations in the UBKG scripts for loading biomolecular data for Petagraph, and those scripts have been included in the Petagraph GitHub repository60.
The Petagraph ingestion pipeline we created contains four major steps (Figure 1). These are: (1) data selection and modeling, (2) cleaning and formatting data, (3) running the appropriate UBKG ETL scripts to convert data into UBKG CSV format, and (4) building Petagraph with the Neo4j bulk import tool. We discuss each of these steps in detail below.
Step 1: Data selection and modeling
The selection, sourcing, review, and modeling of data appropriate for the knowledge graph required a comprehensive understanding of the knowledge graph’s existing model and structure. When creating a model that merges disparate types of biomolecular data, there should be an emphasis on modeling relationships between discrete entities in a biologically meaningful way. For example, when modeling relationships between Concepts for genes and Concepts for associated regulatory motifs, biological context, such as the proximity of a regulatory motif to a gene within the genome should be considered. Therefore, instead of just attaching the regulatory motif to a gene, we created the HSCLO38 in order to map genomic elements at any size scale to their location on the chromosomes.
Step 2: Cleaning and formatting of raw data
In the second step, we cleaned and formatted the data. The UBKG generation framework designed by our team converts data from a variety of data sources into a triplet representation based on the OWL NEtwork Transformation for Statistical Learning (OWLNETS) format61. The ingestion process is optimized for working with ontology data provided in Web Ontology Language (OWL) format. OWL format, which is based on the Resource Description Framework (RDF), is a set of standards designed for creating formal, structured knowledge representations62.
The UBKG generation framework employs the Phenotype Knowledge Translator (PheKnowLator) Python package when ingesting OWL files63. The PheKnowLator package converts semantic information from an OWL file into a set of files in OWLNETS format. The UBKG generation framework works with OWL files in a variety of serializations. Currently supported serialization syntaxes are Turtle and RDF/XML, which appear to be the norm for ontology files published to repositories such as the Open Biological and Biomedical Ontology (OBO) Foundry64 or the National Center for Biomedical Ontology’s BioPortal29. If the source data is already in the UBKG Edge Node format65 (derived from the OWLNETS format), then the conversion to OWLNETS format is not necessary.
The UBKG generation framework also obtains biomedical reference data from sources other than OWL files, including data downloaded from GENCODE31, HUGO34, UniProtKB30, and RefSeq66, converting this data into files in the UBKG edge and node formats. To ingest genomics data into Petagraph, we must clean and reformat the data. The main quantitative data types selected from genomics datasets are: p-values, log2 fold changes, and gene expression data. Genomics data cleaning generally involves removing missing or irrelevant data points, and formatting the data according to the specific dataset schema that was decided on in Step 1.
The required edges file asserts a set of relationships between entities using triples, each consisting of a subject, a predicate, and an object. The subject and object of an assertion represent the starting node and ending node of a relationship and the predicate represents the relationship. The nodes file describes the nodes that comprise the subjects and objects in the triples of the edge file, providing information about nodes such as Codes for the nodes in a source (e.g., the HGNC ID), preferred terms, synonyms, definitions, and cross-references for the Code to other vocabularies. For data that required reformatting, we put all data into the nodes and edge file formats as described.
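The following Python sketch illustrates the edges-and-nodes representation described above by writing a small edge table and node table as tab-delimited text. The `subject`, `predicate`, and `object` columns follow the triple structure in the text; the node column names, the identifiers, and the helper functions are illustrative assumptions, and the UBKG documentation should be consulted for the exact file specification.

```python
import csv
import io


def to_tsv(rows, fieldnames):
    """Serialize a list of dicts as tab-delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def dangling_edges(edges, nodes):
    """Return edges whose subject or object is missing from the node table.

    Nodes that already exist in the base graph would not need to appear in a
    new nodes file; this check only applies to a self-contained example.
    """
    known = {node["node_id"] for node in nodes}
    return [e for e in edges if e["subject"] not in known or e["object"] not in known]


# Placeholder identifiers and labels, invented for the example.
edges = [{"subject": "DEMO:GENE1", "predicate": "associated_with", "object": "DEMO:PHENO1"}]
nodes = [
    {"node_id": "DEMO:GENE1", "node_label": "example gene", "node_synonyms": "", "node_dbxrefs": ""},
    {"node_id": "DEMO:PHENO1", "node_label": "example phenotype", "node_synonyms": "", "node_dbxrefs": ""},
]

edge_tsv = to_tsv(edges, ["subject", "predicate", "object"])
node_tsv = to_tsv(nodes, ["node_id", "node_label", "node_synonyms", "node_dbxrefs"])
unresolved = dangling_edges(edges, nodes)
```

Checking for unresolved subjects and objects before ingestion is a cheap way to catch identifier typos in a newly prepared dataset.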
Step 3: Convert nodes and edges files into UBKG CSV format
The UBKG generation framework converts the content of a collection of edge and node files into a set of ontology CSV files. The ontology CSV files represent the entities, relationships, and metadata that will be in the UBKG. The ontology CSV files conform to the format recognized by the neo4j-admin import utility67. Importing the ontology CSV files into Neo4j by means of the neo4j-admin import tool generates the entities and relationships of the UBKG.
The UBKG generation framework builds a set of ontology CSV files iteratively, starting with an initial set of CSV files extracted from a release of the UMLS. The framework extends the content from the UMLS with that from another data source by appending to the ontology CSVs data that it extracts from files in UBKG Edge Node format.
The UBKG base build is provided by a set of CSV files that follow the Neo4j bulk import tool header format67. The UBKG import file headers specify any of the five node types that the metadata attributes are assigned to, and which types of nodes the relationships connect. This conversion is done via a set of UBKG ingestion scripts which can be found at the UBKG GitHub site68. After the CSVs from each dataset have been converted from the nodes and edges files into UBKG CSV format, they are appended to the base UBKG CSVs. We now refer to these updated CSVs as the “Petagraph CSVs” since they contain assertions from the 21 datasets added to the UBKG (Table 2).
Step 4: Build Petagraph with the Neo4j bulk import tool
Lastly, once the nodes and edges files for each dataset were converted to UBKG CSV format and appended to the base UBKG CSVs, we used the Neo4j bulk import tool (neo4j-admin database import) to build the graph on the Neo4j Desktop platform using Neo4j v5.
Schema structure
Petagraph’s data records arise from a heterogeneous collection of datasets that each have their own schema representing the underlying dataset structure.
Node types
There are five main data-specific types of nodes within Petagraph: Concept, Code, Term, Definition, and Semantic Type. These node types are discussed in further detail in the UMLS Metathesaurus69. Within Petagraph, the most important node type is the Concept node, because Concept nodes form the central backbone of the entire graph model. All other node types are essentially metadata attached to Concept nodes.
Each Concept node serves as a nexus that organizes the different representations of a particular conceptual meaning. For example, a Concept node for a particular gene will include references to many different representations of that gene from resources such as HGNC34, Ensembl40, and ENTREZ70. Every Concept node has just one property called the Concept Unique Identifier (CUI). CUIs are alphanumeric identifiers and are not informative outside of their use as unique identifiers. For example, the CUI for the concept for the H. sapiens TP53 gene is C0079419.
Code nodes identify the reference IDs for a given Code in a particular ontology or standard. Several Code nodes can share associations with the same Concept. It is this one-to-many relationship between Concept and Code nodes that allows for traversal of the graph between data sources: if a Code in one source is linked to the same Concept as a Code in another source, then it may be possible to propose an equivalence between the two Codes. The only relationship between Concept nodes and their respective Code nodes is the CODE relationship. All Code nodes have three properties: (1) the source abbreviation (SAB), (2) the Code from the reference source, and (3) the CodeID which is the SAB and the Code separated by a colon. In Petagraph’s schema, the SAB signifies the identity of the originating dataset or ontology. Examples of three CodeIDs supporting the Concept node for the H. sapiens TP53 gene are HGNC:11998, NCI:C17359, and ENTREZ:7157, representing identifiers from the HGNC, NCI Thesaurus, and NCBI’s ENTREZ databases respectively. Figure 2 shows an example of two concept nodes (blue) connecting a human gene and phenotype. Code nodes are represented for each Concept node (blue) showing the different dataset identifiers for the concepts.
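The relationship between Concepts, Codes, and CodeIDs described above can be illustrated with a minimal Python sketch. The `CodeNode` class, the `concept_codes` mapping, and the `equivalent_code_ids` helper are illustrative names only and are not part of Petagraph or the UBKG; the identifiers are the TP53 examples given in the text.

```python
from typing import NamedTuple


class CodeNode(NamedTuple):
    """Illustrative stand-in for a Code node (not the UBKG implementation)."""
    sab: str   # source abbreviation, e.g. "HGNC"
    code: str  # identifier within that source, e.g. "11998"

    @property
    def code_id(self) -> str:
        # The CodeID is the SAB and the Code separated by a colon.
        return f"{self.sab}:{self.code}"


# One Concept (keyed by CUI) linked to several Code nodes.
concept_codes = {
    "C0079419": [
        CodeNode("HGNC", "11998"),
        CodeNode("NCI", "C17359"),
        CodeNode("ENTREZ", "7157"),
    ],
}


def equivalent_code_ids(code_id, concept_codes):
    """Return CodeIDs from other sources that share a Concept with code_id."""
    matches = set()
    for codes in concept_codes.values():
        ids = {c.code_id for c in codes}
        if code_id in ids:
            matches |= ids - {code_id}
    return sorted(matches)


print(equivalent_code_ids("HGNC:11998", concept_codes))
```

Two Codes that meet at the same Concept are only candidates for equivalence, which is why the helper returns proposals rather than asserting identity.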
Term nodes have a name property that provides human-readable annotation about the linked Code node. As an example, the Term node for the HGNC:11998 Code node (representing H. sapiens TP53) has the name property of “tumor protein p53” provided by HGNC. Most Code nodes have a relationship to at least one Term node, usually through a PT relationship type, which stands for preferred term. Code nodes may carry multiple associated Term nodes, such as synonyms, if the original data source provides them. Concept nodes also have relationships to Term nodes: many Concept nodes connect to a “preferred term” Term node through a PREF_TERM relationship. This relationship allows for a quick evaluation of a Concept node’s identity.
There are two additional types of nodes that connect directly to Concept nodes where appropriate: the Definition nodes and the Semantic nodes. Definition nodes are connected through a DEF relationship type to Concepts and have a DEF property where they provide definitions for Concept nodes. (Sources provide definitions for Codes, not Concepts; because the corresponding Code nodes can share links to Concept nodes, a Concept generally has more than one Definition node.) Semantic nodes are connected to Concept nodes by a STY relationship type and have a name property that provides wider semantic type classifications such as Body System, Cell Component, Gene, Mammal, etc., to Concepts.
A data dictionary covering Petagraph’s schema and its node structures can be found at the Petagraph data dictionary website71. From here onward, we will often exclude the word “nodes” when referring to Concept nodes, Code nodes, and Term nodes in this manuscript for ease of reading.
Edge types
Edges define links and relationships. Petagraph has 1,861 distinct edge types. The majority of these edges fall into the category of Concept to Concept edges, which are Petagraph’s only bidirectional edge type. Each edge type comes with its own Source Abbreviation (SAB) property, which is particularly useful for quick inclusion or exclusion of edge-only data sources when querying the graph. Petagraph also extensively uses edge identifiers between Concepts from the Relations Ontology64 whenever possible. The Concept to Concept relationship network serves as the primary traversable component of Petagraph and constitutes the graph’s fundamental structural backbone.
There are five other main non-(Concept to Concept) relationship types in Petagraph that add metadata to the Concept nodes. These relationships are always unidirectional and point away from their respective Concept nodes. These relationships include: the Concept node to Code node relationship (relationship type = CODE), the Concept node to Term node relationship (relationship type = PREF_TERM), the Code node to Term node relationship(s) (relationship types = PT, SYN, ACR which stand for “preferred term”, “synonym” and “acronym”, respectively), the Concept node to Definition node relationship (relationship type = DEF) and the Concept node to Semantic node relationship (relationship type = STY which stands for “Semantic Type”). There are an additional 180 relationship types between Codes and Terms that are used with much lower frequency.
Data modeling
Data modeling was a major consideration in how we introduced quantitative data within the structure supported by the UBKG. The Petagraph schema supports quantitative values as node and edge properties. Under the framework of the UMLS and UBKG, the properties for each node type are well defined: every Concept node has a CUI property; every Code node has SAB, Code and CodeID properties; and every Term node has a name property. However, Code nodes can have an extra quantitative attribute called value. For example, GTEx eQTL Code nodes have a value property corresponding to their p-values. We also wanted to integrate quantitative data within the categorical and conceptual framework of the UMLS schema, so that query results can be returned rapidly, even for multidimensional, numerical searches. We accomplished this by creating Concept nodes representing interval bins of numerical ranges, for example, p-values, expression TPM, and log2FC. Nodes with a quantitative value can then be assigned to the appropriate bin node, allowing for rapid selection of numerical values within the graph. As many bioinformatics analyses are performed with results that meet a certain threshold, the numerical bin nodes are especially useful for thresholding data in biologically meaningful queries (e.g. return all data points with p-value < 0.05).
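The binning idea can be sketched in a few lines of Python. The bin edges below are hypothetical and chosen only for illustration; the actual intervals used in Petagraph are documented in its data dictionary. The function names are likewise assumptions.

```python
from bisect import bisect_right

# Hypothetical p-value bin edges (illustration only, not Petagraph's bins).
P_VALUE_EDGES = [0.0, 1e-8, 1e-5, 1e-3, 0.01, 0.05, 1.0]


def assign_bin(value, edges):
    """Return the (low, high) interval that contains value.

    Intervals are half-open [low, high), except that the last interval
    also includes its upper edge.
    """
    if not edges[0] <= value <= edges[-1]:
        raise ValueError(f"{value} is outside the binned range")
    index = min(bisect_right(edges, value) - 1, len(edges) - 2)
    return edges[index], edges[index + 1]


def bins_below(threshold, edges):
    """Return every interval that lies entirely below the threshold."""
    return [(lo, hi) for lo, hi in zip(edges, edges[1:]) if hi <= threshold]


print(assign_bin(0.03, P_VALUE_EDGES))
print(bins_below(0.05, P_VALUE_EDGES))
```

With this layout, a threshold query such as p-value < 0.05 reduces to selecting a small set of bin nodes instead of comparing every numerical value, provided the threshold coincides with a bin edge.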
In order for Petagraph to more fully support the use of experimental omics data, we also included dozens of relationship-only (“mapping”) datasets that may come from observational data themselves, many of which map to and within genomic and phenotypic databases for human and mouse models. Sixteen of these mapping datasets (and the datasets they map to) are visualized in Figure 3. The interconnectedness of these datasets enables many different types of queries and their contributions are represented in the heatmap intensities. For example, LINCS and CMAP are represented by the CHEBI to HGNC connections and STRING is represented by the UNIPROTKB to UNIPROTKB connections. HGNC is clearly a hub dataset, as human gene names connect across many different omics and annotation datasets. Of the sixteen datasets shown in Figure 3, nine have relationships to three or more others. Several datasets were already part of the UBKG (HGNC, HPO, UBERON) and are displayed in Figure 3 because of their high number of connections in Petagraph. Note that many of these datasets have connections to other datasets not shown in Figure 3; for example, the developmental heart scRNA-seq dataset also shares edges with the Cell Ontology (CL)72.
Most Concept nodes in Petagraph are classified by Semantic Type. The UBKG generation framework does not explicitly assign Semantic Types to Concepts outside of those brought in as part of the UMLS context. Currently, there are 127 Semantic Types inherited from the UMLS within Petagraph, attached to their member dataset Concept nodes through STY relationship types. We summarize the categories of Semantic Types in Petagraph in Figure 4a, and Figure 4b depicts the connectivity of 10 major Semantic Types in the graph. As shown in this figure, all these major Semantic Types have relationships to their own types (e.g. gene-gene or phenotype-phenotype relationships). In this plot, the linkages represent direct connectivity between Semantic Types, regardless of any hierarchical relationship between STYs.
To understand how the Concept-Concept relationship data connect Semantic Types, we chose 55 Semantic Types and extracted the relationships connecting them through the Concept nodes directly connected to them (that is, no hierarchical connection between Semantic Types was considered). Subsequently, we mapped the kinds and quantities of relationships among the selected STYs, which relate to anatomy, phenotypes, diseases, chemical species, and biological entities, cells or metabolic pathways (Figures 5–7). The presence of relationships (Figure 5), the cumulative number of relationship types (Figure 6), and the relationship counts (Figure 7) were plotted using the package pheatmap v1.0.1273 in RStudio v1.4.110674 with R v4.0.475. For Figures 6 and 7, the frequency of relationships is shown between pairwise combinations of Semantic Types, connected through their Concept nodes. Therefore, Concept to Concept relationships where one or both Concept nodes are not connected to Semantic Types are excluded. To increase the dynamic range, the base-10 logarithm was used, and to avoid log10(0), we added 1 to all values.
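The counting and transformation step can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the R code used for the figures; the function name, the toy CUIs (`c1` to `c4`), and the input layout are hypothetical.

```python
import math
from collections import Counter


def log_pair_counts(edges, semantic_types, selected):
    """Return log10(count + 1) for every ordered pair of selected types.

    edges: iterable of (start_cui, end_cui) Concept-to-Concept relationships.
    semantic_types: dict mapping a CUI to the list of its Semantic Types.
    selected: list of Semantic Type names to tabulate.
    """
    counts = Counter()
    for start, end in edges:
        # Concepts without a Semantic Type contribute nothing.
        for sty_start in semantic_types.get(start, []):
            for sty_end in semantic_types.get(end, []):
                counts[(sty_start, sty_end)] += 1
    # Adding 1 keeps pairs with zero relationships defined (log10(1) = 0).
    return {
        (a, b): math.log10(counts[(a, b)] + 1)
        for a in selected
        for b in selected
    }


toy_edges = [("c1", "c2"), ("c1", "c3"), ("c1", "c4")]
toy_types = {"c1": ["Gene"], "c2": ["Disease"], "c3": ["Disease"]}  # c4 untyped
matrix = log_pair_counts(toy_edges, toy_types, ["Gene", "Disease"])
print(matrix[("Gene", "Disease")], matrix[("Disease", "Gene")])
```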
Figure 5 illustrates the presence or absence of at least one relationship type between pairs of Concept nodes belonging to different Semantic Types. This figure shows how Semantic Type pairs that lack direct graph-wide relationships can be linked through other relationships. For example, Concept nodes categorized under the Semantic Types Congenital Abnormality and Clinical Drug do not share a direct relationship within the graph. However, both of these Semantic Types do have relationships with the Physiologic Function Semantic Type, which allows for a connection. Queries can therefore utilize the intermediary to link the Congenital Abnormality Semantic Type and the Clinical Drug Semantic Type. Similarly, we can bridge all data in Semantic Types through intermediary links. To further analyze these relationships, we quantified the number of relationship types (log10-transformed) per pair of the 55 Semantic Types, considering distinctions based on the start and end nodes, relationship type, and Source Abbreviation (SAB) (Figure 6). Figure 7 shows the distribution of relationships among each pair of the 55 Semantic Types. Semantic Type pairs positioned along the diagonal tend to have higher log10-transformed relationship counts. Collectively, the information presented in Figures 5–7 enables us to draw conclusions regarding which Semantic Types within the graph possess the majority of relationships. Furthermore, it sheds light on the potential for extracting valuable information from these relationships, thereby unveiling insights that may not be directly accessible through other means.
Finally, we were interested in useful examples of pairwise shortest path lengths, and chose to map gene-to-gene relationships using HGNC Concept nodes on the Concept-Concept subgraph. Shortest path length and connectivity analyses were performed to measure the expected relationships between terms that are known to be related from orthogonal sources. This analysis shows how the HGNC dataset is connected to other datasets by secondary relationships within the graph, which is an important consideration in designing queries looking for gene-to-gene relationships. The distribution of the shortest path lengths between the graph’s gene-to-gene (HGNC-HGNC) Concept nodes was estimated using a sample of 1 million pairs of such nodes. Figure 8 shows the probability distribution for shortest path lengths between Concept nodes, which range from 1 to 11. The majority of HGNC Concept nodes are connected to each other through 4 or fewer hops, with the peak at 3. This shows that in the case of gene-to-gene relationships, most genes are not directly connected through a single intermediate resource, which allows for more informative results with queries spanning more intermediate relationships.
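The sampling procedure can be approximated with a breadth-first search over an undirected adjacency map. The following Python sketch is an illustration only: the analysis in this paper was performed with igraph in R, and the function names and toy graph are assumptions.

```python
import random
from collections import Counter, deque


def shortest_path_length(adj, source, target):
    """Breadth-first search; returns the hop count or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in adj.get(node, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def sampled_length_distribution(adj, nodes, n_pairs, seed=0):
    """Estimate the shortest path length distribution from random node pairs."""
    rng = random.Random(seed)
    lengths = Counter()
    for _ in range(n_pairs):
        a, b = rng.sample(nodes, 2)  # two distinct nodes
        dist = shortest_path_length(adj, a, b)
        if dist is not None:  # unreachable pairs are skipped
            lengths[dist] += 1
    total = sum(lengths.values())
    return {d: n / total for d, n in sorted(lengths.items())} if total else {}


# Toy path graph a - b - c - d.
toy_adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
distribution = sampled_length_distribution(toy_adj, list(toy_adj), n_pairs=200)
print(distribution)
```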
Data Records
Installing Petagraph from a Neo4j dump file
Users with a UMLS license can follow site instructions to obtain and utilize a UMLS license key at https://ubkg-downloads.xconsortia.org/ to download the most recent Petagraph dump file (currently 4.5 GB). The dump file can be used with Neo4j Desktop to build Petagraph quickly and easily. Detailed instructions on building the database with the dump file can be found in the Petagraph GitHub README60, or users can follow the vendor’s standard procedure for loading Neo4j dump files.
Petagraph installation from source
Building Petagraph from the source files can take several hours but allows for build customization. The instructions can also be used for ingesting new data into any Petagraph build. The code and instructions to recreate Petagraph from source data are on the project’s GitHub site60. The process for building from source consists of two stages: establishing the source framework and then the generation framework. In the source framework stage, a user downloads the UMLS Metathesaurus and Semantic Network files from the UMLS website76 and then runs a set of SQL queries to extract data to build the UMLS base CSVs. In the generation framework stage, the ingestion pipeline adds further ontologies and datasets to the UMLS base. The same scripts are then run to add additional datasets onto the UBKG CSV files, creating Petagraph. We provide the source files for Petagraph at https://ubkg-downloads.xconsortia.org/ on the project’s Open Science Framework site (https://osf.io/6jtc9/)77.
Technical Analysis and Validation
UBKG Generation framework ingestion validation
The heterogeneous and often custom nature of data sources means that the validation of ingestion from a particular data source involves analysis that is not obviously amenable to automation. The generation framework generates basic analytical reports during ingestion to aid manual validation; however, manual validation by a subject matter expert is still necessary.
Basic content and consistency requirements
A source for ingestion into UBKG represents a set of assertions. Each assertion involves two nodes (entities) and an edge (relationship). In the UBKG Edge Node format,
- The edge file represents assertions as triplets in which both nodes and edges are represented with Codes.
- The node file decodes nodes with metadata such as terms and definitions.
To be represented properly in the UBKG, the edge and node files must satisfy basic requirements for content and internal consistency. These requirements include:
- Every node in the edge file must either be described explicitly in the node file or already exist in the UBKG.
- If a node in the node file has a cross-reference to another Code, the cross-referenced Code should already exist in the UBKG.
Ingestion validation
Most data quality problems in the assertion files arise from missing node references, such as:
- An assertion includes a Code for a node that is neither defined in the node file nor previously ingested into the UBKG.
- A node has a cross-reference to a Code that is not in the UBKG.
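Both kinds of missing node reference can be detected with simple set operations. The Python sketch below is a hedged illustration and not the UBKG generation framework’s own validation code; the function name, argument layout, and example CodeIDs (`SAB1:CodeA` and so on) are hypothetical.

```python
def find_missing_references(edges, node_codes, xrefs, ubkg_codes):
    """Report Codes that violate the two consistency requirements.

    edges: iterable of (subject, predicate, object) triples of CodeIDs.
    node_codes: CodeIDs described in the node file.
    xrefs: dict mapping a node-file CodeID to its cross-referenced CodeIDs.
    ubkg_codes: set of CodeIDs already present in the UBKG.
    """
    known = set(node_codes) | set(ubkg_codes)
    # Requirement 1: every edge endpoint is in the node file or the UBKG.
    undefined = sorted(
        {code for subj, _, obj in edges for code in (subj, obj) if code not in known}
    )
    # Requirement 2: every cross-referenced Code already exists in the UBKG.
    dangling = sorted(
        {x for targets in xrefs.values() for x in targets if x not in ubkg_codes}
    )
    return undefined, dangling


undefined, dangling = find_missing_references(
    edges=[("SAB1:CodeA", "has_relationship_X_with", "SAB2:CodeB"),
           ("SAB1:CodeA", "has_relationship_X_with", "SAB3:CodeC")],
    node_codes=["SAB1:CodeA"],
    xrefs={"SAB1:CodeA": ["SAB2:CodeB", "SAB4:CodeD"]},
    ubkg_codes={"SAB2:CodeB"},
)
print(undefined, dangling)
```

A report of this kind supports, but does not replace, the manual review by a subject matter expert described above.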
Data ingestion validation
After a set of assertions from a source is ingested, the resulting ontology CSVs are imported into a Neo4j instance. Using the Neo4j browser, nodes and relationships from the source are compared in the UBKG with the corresponding edge and node files, using tools such as shortest path queries. For example, if the edge file asserts that SAB1:CodeA has_relationship_X_with SAB2:CodeB, then there should be a shortest path that connects the Code nodes SAB1:CodeA and SAB2:CodeB with a relationship of type has_relationship_X_with. Additional queries validate information from the node file – e.g., that the Code node in the UBKG that corresponds to the node in the node file has the expected terms, synonyms, definitions, and cross-references to Codes from other sources.
For ingestion of genomics and related datasets into the UBKG scaffold, we wrote a continuous integration (CI) workflow using GitHub Actions to ensure that the final Petagraph product contains the correct schema as well as the correct types and counts of nodes and edges. Our GitHub Actions CI downloads the latest stable release of Neo4j, downloads the latest version of the Petagraph CSVs from the Open Science Framework website and then builds Petagraph. We then run tests comparing the counts of node and edge types in the Petagraph CSVs with those in the built graph to confirm that the graph has been built as expected.
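A count comparison of this kind can be sketched in Python as follows. This is not the project’s CI code: the function names are illustrative, and the `:TYPE` column name is an assumption based on the general neo4j-admin import header convention rather than on the actual Petagraph CSV headers. The observed counts would in practice come from a query against the built graph.

```python
import csv
import io
from collections import Counter


def count_types(csv_text, type_column=":TYPE"):
    """Count rows per relationship type in an import-style CSV.

    type_column is an assumed header name; adjust it to the real file.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[type_column] for row in reader)


def count_mismatches(expected, observed):
    """Return {type: (expected, observed)} for every type whose counts differ."""
    return {
        key: (expected.get(key, 0), observed.get(key, 0))
        for key in sorted(set(expected) | set(observed))
        if expected.get(key, 0) != observed.get(key, 0)
    }


toy_csv = ":START_ID,:END_ID,:TYPE\nC1,C2,isa\nC2,C3,isa\nC1,K1,CODE\n"
expected_counts = count_types(toy_csv)
observed_counts = {"isa": 2, "CODE": 2}  # stand-in for counts read from the graph
print(count_mismatches(expected_counts, observed_counts))
```

An empty result indicates that the counts agree for every type.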
For reviewing the integrity and accuracy of ingested data, the UBKG framework generates a summary report. This report helps identify potential issues and opportunities for alignment before finalizing the ingestion process (Table 1).
FAIR validation
When building Petagraph we adhered strictly to findability, accessibility, interoperability and reusability (FAIR) data principles (https://www.nature.com/articles/sdata201618). Petagraph is findable by way of GitHub, the Open Science Framework (https://osf.io/search)77 and any major search engine. Metadata in the form of a data dictionary is available on the project GitHub, which describes the schema of every individual dataset that has been ingested. The data dictionary also contains detailed descriptions of the preprocessing and formatting applied to the data prior to ingestion. Descriptions of the ingested datasets and ontologies are also present in the Methods section of this paper. Petagraph is accessible by being freely downloadable and small enough (4.5 GB) that a dump of the entire database can easily fit onto a user’s personal computer. Whether building from source or from the database dump file, our build process is OS-agnostic. Petagraph and the UBKG are built on top of the UMLS, so they are inherently interoperable. The UMLS alone harmonizes hundreds of biomedical vocabularies and ontologies so that users can easily query across standards.
Adding new data to Petagraph is straightforward. Ontologies or datasets that are in OWL, RDF/XML, OBO or Turtle format can be incorporated into the graph automatically using our ingestion workflow, outlined in detail in the Methods section. Datasets that are not in one of these formats can be added to the graph after some simple preprocessing steps. Lastly, all data sources and Petagraph releases are versioned and the entire process of downloading the dump file and building the Neo4j database locally can easily be automated using the OSF and Neo4j APIs which helps ensure reproducibility and consistency for Petagraph users.
Structural validation
Validation through link prediction
As a link prediction-based validation, we computed the Common Neighbors scores for about 500,000 pairs of directly connected genes and compared the results to the distribution of Common Neighbors scores for randomly selected pairs of genes. As shown in Figure 9, the Common Neighbors scores of gene pairs with a direct connection in Petagraph are approximately three orders of magnitude greater than those of randomly selected gene pairs. This analysis has two implications. First, it suggests that the orthogonal datasets ingested in Petagraph could effectively be used for link prediction. Second, the links in the graph can be cross-validated using the information in the graph data sources from independent datasets.
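The Common Neighbors score of a pair of nodes is the number of nodes adjacent to both. A minimal Python sketch of the comparison between directly connected and other pairs is shown below; the function names and the toy graph are illustrative, and this is not the code used to produce Figure 9.

```python
def common_neighbors(adj, a, b):
    """Number of nodes adjacent to both a and b (adj maps node -> set)."""
    return len(adj[a] & adj[b])


def mean_score(adj, pairs):
    """Average Common Neighbors score over a list of node pairs."""
    return sum(common_neighbors(adj, a, b) for a, b in pairs) / len(pairs)


# Toy graph: g1 and g2 are linked and share two neighbors; g3 is peripheral.
toy_adj = {
    "g1": {"g2", "x", "y"},
    "g2": {"g1", "x", "y"},
    "g3": {"z"},
    "x": {"g1", "g2"},
    "y": {"g1", "g2"},
    "z": {"g3"},
}
linked_pairs = [("g1", "g2")]
unlinked_pairs = [("g1", "g3"), ("g2", "g3")]
print(mean_score(toy_adj, linked_pairs), mean_score(toy_adj, unlinked_pairs))
```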
Validation using analysis of local structures (transitivity and triangle counts)
We analyzed the Petagraph Concept-to-Concept subgraph nodes in terms of transitivity (the probability that two nodes adjacent to a given node are connected to each other) and triangle counts (the number of triangles or 3-cliques each node is a part of). We compared the distribution of these measures of local structure connectivity derived from Petagraph with a randomized graph created with the same number of nodes and relationships and the same node degrees but with random connections to other nodes (Figure 10). As portrayed by the distributions of transitivity (ranging between 0 and 1) and triangle counts, the Petagraph Concept-to-Concept subgraph presents a shift towards higher values of both node transitivity and triangle counts. These shifts indicate that connectivity among Petagraph Concept nodes is less random than in the randomized graph, which shows informational consistency in the data ingested in Petagraph.
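Both local measures follow directly from a node’s neighborhood. The sketch below is a plain-Python illustration on an undirected adjacency map, with hypothetical function names; it does not reproduce the randomized-graph comparison, and it is not the analysis code used for Figure 10.

```python
from itertools import combinations


def triangle_count(adj, node):
    """Number of triangles that include node (adj maps node -> set)."""
    return sum(1 for u, v in combinations(adj[node], 2) if v in adj[u])


def local_transitivity(adj, node):
    """Fraction of neighbor pairs of node that are themselves connected."""
    degree = len(adj[node])
    if degree < 2:
        return 0.0  # no neighbor pairs exist
    possible_pairs = degree * (degree - 1) / 2
    return triangle_count(adj, node) / possible_pairs


# Toy graph: triangle a-b-c plus a pendant node d attached to a.
toy_adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
for node in toy_adj:
    print(node, triangle_count(toy_adj, node), local_transitivity(toy_adj, node))
```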
Validation through low dimensional visualization of high dimensional graph embeddings
We visualized the UMAP components of a 100-dimensional embedding of a subgraph of Petagraph consisting of ~400,000 nodes and ~12,000,000 relationships (bidirectional). This subgraph was derived from a query that extracts all nodes one or two hops away from 12 abnormal heart morphology and 2 blood cancer phenotypes (Methods) and all their interconnections. It included Concept nodes from all 127 Semantic Types, including ~10,000 genes and ~10,000 phenotypes. We show the results for a selection of Concept nodes from major Semantic Types including Gene or Genome, Disease or Syndrome, Neoplastic Process, Clinical Drug, Cells and Molecular Processes (Figure 11). In this figure, UMAP low-dimensional visualizations illustrate the spatial distribution of major Semantic Type-related Concept nodes. Nodes belonging to the same STYs are shown to cluster together, indicating their relative proximity and suggesting a higher likelihood of connections within the graph. Conversely, nodes from different STY groups are distinctly separated, although nodes that are adjacent in the graph may still appear close to each other in the embedding space, indicating potential linkages.
Assessing major dataset contributions to link prediction
We analyzed the influence of major datasets on link prediction by using two graph-wide connectivity metrics: transitivity and assortativity (Figure 12). Transitivity measures the likelihood of new connections based on existing local patterns, such as the probability that two nodes sharing a neighbor will be directly connected. This is linked to the formation of triangles in the graph, indicated by metrics like the clustering coefficient and triangle count. Utilizing transitivity helps predict potential links based on the current graph structure. Assortativity evaluates the tendency of nodes with similar attributes or degrees to connect, offering insights into Petagraph's structural patterns. Including assortativity in predictive models enhances accuracy in forecasting future link formations. Together, these metrics provide insights into graph connectivity dynamics and link prediction capabilities.
To assess the impact of individual datasets on global transitivity and degree assortativity within subgraphs around human genes in Petagraph, we systematically removed our largest gene-related datasets from the graph. This approach allows us to measure each dataset’s contribution to overall connectivity and structural features. Figure 12a shows that excluding the GTEXCOEXP dataset significantly reduces global transitivity, indicating decreased local clustering. Figure 12b reveals that removing most major datasets generally decreases degree assortativity, except for the GTEXEQTL dataset. Removal of the GTEXEQTL dataset leads to an increase in degree assortativity, because a GTEXEQTL node connects terminally to a gene target node and thus does not provide links between genes.
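A dataset-removal experiment of this type can be sketched in Python under simplifying assumptions. Degree assortativity is computed here as the Pearson correlation between the degrees at the two ends of each edge, counting each undirected edge in both orientations. The function names and the labels `A` and `B` standing in for SABs are hypothetical, and the analyses in this paper were performed with igraph in R rather than with this code.

```python
from collections import Counter


def degree_assortativity(edges):
    """Pearson correlation of end-point degrees over undirected edges."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    xs, ys = [], []
    for u, v in edges:  # include both orientations of every edge
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]
    if not xs:
        return float("nan")
    mean = sum(xs) / len(xs)
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys))
    var = sum((x - mean) ** 2 for x in xs)
    return cov / var if var else float("nan")


def without_dataset(labelled_edges, sab):
    """Drop every edge contributed by the given source abbreviation."""
    return [(u, v) for u, v, source in labelled_edges if source != sab]


# Toy graph: a 4-node path from source "A" plus one extra edge from source "B".
labelled = [(1, 2, "A"), (2, 3, "A"), (3, 4, "A"), (1, 4, "B")]
full = degree_assortativity([(u, v) for u, v, _ in labelled])
reduced = degree_assortativity(without_dataset(labelled, "B"))
print(full, reduced)
```

In the toy example the full graph is a 4-cycle in which all degrees are equal, so the correlation is undefined (NaN); removing source `B` leaves a path whose assortativity is negative.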
These findings underscore the specific contributions of each dataset to the graph’s structural properties and functional relationships among human genes.
Use case validation
In this section we validate Petagraph’s relevance to biomedical research by evaluating its capacity to return biologically relevant information in three use cases. Further validation of these use case results is performed through evaluation of orthogonal information such as literature review.
Use Case 1: Validation by re-predicting relationships between congenital heart defects and genes
Figure 13 illustrates our evaluation of topological link prediction methods in identifying associations between the Tetralogy of Fallot (ToF) phenotype (HP:0001636) and genes cataloged by HGNC (43,001 nodes). We employed four methods for this analysis: preferential attachment, total neighbors, common neighbors, and Jaccard’s Index. By analyzing the Receiver Operating Characteristic (ROC) and Precision-Recall Curve (PRC) of the binary classifiers against the graph’s existing gene-phenotype links, we show that the graph’s structure (specifically, the node-to-node relationships) empowers these topological link prediction methods, which achieve ROC AUCs greater than 0.9 (as shown in Figure 13a,b). Notably, the common neighbors method and Jaccard’s Index achieved exceptional performance, with ROC and PRC AUCs approaching 1, indicating high reliability for predicting links across the graph. We also conducted a literature review to validate the top 10 gene-phenotype relationships identified by the common neighbors method, using this external information as a form of orthogonal validation not originally incorporated into the graph. The supporting evidence from this review is detailed in Table 5.
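For illustration, three of the topological scores and a rank-based ROC AUC can be written in plain Python as below. The function names are ours, the definitions are the standard ones (product of degrees, size of the neighborhood union, and intersection over union), and the AUC is computed as the probability that a randomly chosen positive pair outscores a randomly chosen negative pair. This is a sketch, not the evaluation code behind Figure 13.

```python
def preferential_attachment(adj, a, b):
    """Product of the two node degrees (adj maps node -> set)."""
    return len(adj[a]) * len(adj[b])


def total_neighbors(adj, a, b):
    """Number of distinct nodes adjacent to a or b."""
    return len(adj[a] | adj[b])


def jaccard_index(adj, a, b):
    """Shared neighbors divided by all neighbors of the pair."""
    union = adj[a] | adj[b]
    return len(adj[a] & adj[b]) / len(union) if union else 0.0


def roc_auc(scores, labels):
    """Probability that a positive outranks a negative; ties count one half."""
    positives = [s for s, label in zip(scores, labels) if label]
    negatives = [s for s, label in zip(scores, labels) if not label]
    if not positives or not negatives:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


toy_adj = {"p": {"x", "y"}, "g": {"x", "y", "z"}, "x": {"p", "g"},
           "y": {"p", "g"}, "z": {"g"}}
print(jaccard_index(toy_adj, "p", "g"), roc_auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```

The pairwise AUC loop is quadratic in the number of pairs and is suitable only for small examples.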
Use Case 2: Validation by predicting drug side effects within tissues of interest
The drug rofecoxib was recalled in 2004 because of an observed increased risk of cardiovascular events, most notably heart attack and stroke78. We queried Petagraph for all genes shared between the transcriptional profiles for the drug rofecoxib (CHEBI:8887) in LINCS (L1000, CMAP) and all tissue transcriptional profiles in GTEx (GTEXEXP) with expression above a minimum level (TPM > 5). Having extracted these genes, we computed, for each tissue, the ratio of shared genes to the total number of genes with TPM > 5, normalized each ratio to the highest value across all tissues, and ranked the tissues accordingly.
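The ratio, normalization, and ranking steps can be sketched in Python once the gene sets have been retrieved from the graph. The function name, the input layout, and the toy gene and tissue names below are placeholders and do not represent real LINCS or GTEx values.

```python
def rank_tissues(drug_genes, tissue_tpm, min_tpm=5.0):
    """Rank tissues by the normalized share of expressed genes hit by the drug.

    drug_genes: set of genes in the drug's transcriptional profile.
    tissue_tpm: dict mapping tissue -> {gene: TPM}.
    """
    ratios = {}
    for tissue, tpm in tissue_tpm.items():
        expressed = {gene for gene, value in tpm.items() if value > min_tpm}
        if expressed:  # tissues with no gene above the threshold are skipped
            ratios[tissue] = len(expressed & drug_genes) / len(expressed)
    top = max(ratios.values(), default=0.0)
    if top == 0.0:
        return [(tissue, 0.0) for tissue in sorted(ratios)]
    # Normalize to the highest ratio, then sort from most to least related.
    return sorted(
        ((tissue, ratio / top) for tissue, ratio in ratios.items()),
        key=lambda item: item[1],
        reverse=True,
    )


toy_drug_genes = {"GENE_A", "GENE_B"}
toy_tissue_tpm = {
    "tissue_1": {"GENE_A": 12.0, "GENE_B": 8.0, "GENE_C": 1.0},
    "tissue_2": {"GENE_A": 9.0, "GENE_C": 20.0, "GENE_D": 6.0},
}
ranking = rank_tissues(toy_drug_genes, toy_tissue_tpm)
print(ranking)
```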
Figure 14 summarizes the result of the query used to extract the shared genes in the transcriptional profiles from rofecoxib and human tissues. This query is based on the genes with expression levels higher than a predefined threshold (TPM > 5) with relationships with rofecoxib in the LINCS dataset. As shown in Figure 14, the ranked tissues point to different organs; the LINCS data provides a correct prediction that heart and blood vessels in the GTEx dataset are most closely related to perturbation gene profiles in rofecoxib (e.g. right auricular appendage, myocardium of left ventricle, coronary arteries). These predictions are not obvious by just examining where rofecoxib’s target gene, PTGS2, is expressed.
Use Case 3: Validation through shortest path analysis of subgraphs
Epidemiological analysis has shown that brain tumors are more common in children with CNS abnormalities than in the general population79,80. In this analysis, we focused on constructing and examining a subgraph derived from Petagraph, specifically targeting the connections between HPO Concept nodes related to abnormal CNS morphology and CNS neoplastic processes. The subgraph, comprising 777 nodes as shown in Figure 15a, was formed by identifying all shortest paths linking a selected set of 54 CNS morphology abnormalities with 54 CNS neoplastic process nodes. Upon analyzing this subgraph, we observed particular patterns in node degree distribution, link traversal frequencies, and shortest path lengths, as detailed in Figure 15b–d. The distributions observed align with a scale-free structure, which has been recurrently identified within various segments of Petagraph. This type of structure is significant because scale-free networks are characterized by a few highly connected nodes (hubs) amidst many nodes with fewer connections, which allows for shorter paths between nodes than in randomly configured graphs of the same size. Scale-free graph structures are prevalent in many biological systems and data collections.
Investigation into the node degrees within this subgraph allowed us to rank human genes based on their connectivity. Genes with common links between CNS phenotypic abnormalities and brain cancers could be implicated in both conditions with origins in perturbed developmental processes. The top-ranked genes, as listed in Table 6, offer potential targets for further research into their relevance to both CNS disorders and neoplasms.
Limitations
Despite its strengths, Petagraph has certain limitations. Inherent complexity and variability of the data sources integrated into the graph will always be a challenge. Ensuring the accuracy and consistency of data ingestion and mapping processes requires expert curation, continuous refinement and validation. Additionally, the scalability of Petagraph, while robust, may face constraints with the incorporation of increasingly large and diverse datasets, necessitating ongoing optimization of computational resources and algorithms.
Another limitation is the potential bias introduced by the selection and integration of specific datasets. While Petagraph aims to be comprehensive for general use, the inclusion of certain data and the exclusion of others can influence the results and interpretations derived from the graph. Biomedical data also tends to be sparse which can compound biases in available data. Careful consideration and transparency in the data selection process is suggested for those who extend Petagraph with additional data for their own purposes.
Usage Notes
Usage documentation
To aid users who would like to query Petagraph, we have written a user guide with example queries, available at https://github.com/TaylorResearchLab/Petagraph/blob/main/petagraph/user_guide.md.
Analysis considerations
Before employing Petagraph for analytical use, users should consider several factors in setting up their analyses. After finding a use case, users may want to identify and familiarize themselves with the datasets and ontologies that their use case will include (if they are known). This is helpful for two reasons. First, stating explicitly what datasets or relationships to search for in the Cypher query can drastically speed up query run time. Secondly, including a predefined set of datasets and relationships can help with interpretability of results. Furthermore, to avoid writing nonsensical queries and to be able to correctly interpret results, it is important to consult the data dictionary and to understand the basic preprocessing and data modeling decisions that took place while creating Petagraph. Located on the Petagraph Github71, the data dictionary includes detailed descriptions of the datasets, preprocessing steps, as well as images and descriptions of how each dataset has been modeled. Users also need to be wary of comparing quantitative results from different data sources. For example, p-values from two separate sources should not be directly compared with each other. For more complex use cases (and queries) it may be helpful to define a projected subgraph using function(s) from Neo4j GDS: Graph Data Science library81. A graph projection creates an in-memory graph that can be queried more quickly than the database and may help to speed up analysis.
Careful selection of Cypher query strategies can help return results rapidly. For example, recursive queries can be performed on OBO-compliant ontologies in Petagraph. Ontologies such as HPO, MP, and UBERON can be queried recursively to include child nodes down to any specified level. This can be useful, for example, when a user wants to include information about a general disease condition instead of just a single phenotype term. In Figure 16 we show how this can be done by querying and returning eQTLs for the term Atrial Septal Defects in addition to all of its direct child terms. When writing a Cypher statement, a user can query recursively by using the ‘*’ operator inside a relationship definition.
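The ‘*’ operator can be sketched as follows. This minimal Python example builds a variable-length Cypher pattern that returns a term together with its descendants down to a chosen depth; a depth range of `*0..1` covers the term itself plus its direct children. The `Code` and `Concept` labels, the `CodeID` and `CUI` properties, the `CODE` relationship, and the child-to-parent `isa` relationship are assumptions for illustration and are not the query shown in Figure 16.

```python
"""Sketch: build a variable-length ('*') Cypher pattern for ontology recursion.

Assumptions (not from the paper): ontology terms are `Code` nodes with a
`CodeID` property, linked to `Concept` nodes by `CODE` relationships, and
child concepts point to their parents through an `isa` relationship. Verify
these names against the Petagraph data dictionary before use.
"""
from typing import Optional


def depth_range(min_depth: int, max_depth: Optional[int]) -> str:
    """Format the '*' quantifier, e.g. '*0..1', or '*1..' when unbounded."""
    if min_depth < 0:
        raise ValueError("min_depth must be non-negative")
    if max_depth is None:
        return f"*{min_depth}.."
    if max_depth < min_depth:
        raise ValueError("max_depth must be >= min_depth")
    return f"*{min_depth}..{max_depth}"


def term_with_descendants(max_depth: Optional[int] = 1,
                          rel_type: str = "isa") -> str:
    """Return a query for a term and its descendants up to `max_depth` hops.

    A minimum depth of 0 keeps the starting term itself in the result.
    The term identifier is supplied at run time as the `$code_id` parameter.
    """
    if not rel_type.isidentifier():
        raise ValueError(f"not a plain relationship type: {rel_type!r}")
    quantifier = depth_range(0, max_depth)
    return (
        "MATCH (:Code {CodeID: $code_id})<-[:CODE]-(parent:Concept)\n"
        f"MATCH (child:Concept)-[:{rel_type}{quantifier}]->(parent)\n"
        "RETURN DISTINCT child.CUI AS cui"
    )


if __name__ == "__main__":
    # Term plus direct children only; pass max_depth=None for all descendants.
    print(term_with_descendants(max_depth=1))
```

Bounding the depth is usually worthwhile: an unbounded `*` traversal over a large ontology can expand to many paths, so starting with a small upper limit and widening it as needed tends to keep run times predictable.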
Data availability
UMLS licensing information: Whether starting from source data or from the dump file, construction of Petagraph with the UBKG requires licensing of the UMLS82. The UBKG includes information extracted from the UMLS under a distributor licensing agreement, and consumers of the UBKG must have a UMLS license83. The UBKG’s Neo4j database, which contains licensed UMLS content, should not be stored in public repositories.
Software Versions: We used Neo4j Desktop 1.5.7 for general development, analyses, and prototyping for the analyses in this paper. We used Neo4j 5.14 for server-side hosting of Petagraph and for larger queries.
Python 3.10 was used for data formatting and preprocessing, such as in preparation of the edge and node source CSV files, as well as for Petagraph build testing.
Path and connectivity analyses were performed using igraph v1.2.11 (ref. 84) in RStudio 2022.12.0.353 (ref. 74) and R v4.2.2 (ref. 75).
References
Nicholson, D. N. & Greene, C. S. Constructing knowledge graphs and their biomedical applications. Comput. Struct. Biotechnol. J. 18, 1414–1428 (2020).
Moon, C. et al. Learning Drug-Disease-Target Embedding (DDTE) from knowledge graphs to inform drug repurposing hypotheses. J. Biomed. Inform. 119, 103838 (2021).
Alves, V. M. et al. Knowledge-based approaches to drug discovery for rare diseases. Drug Discov. Today https://doi.org/10.1016/j.drudis.2021.10.014 (2021).
Zheng, S. et al. PharmKG: a dedicated knowledge graph benchmark for biomedical data mining. Brief. Bioinform. 22 (2021).
Alshahrani, M. & Hoehndorf, R. Drug repurposing through joint learning on knowledge graphs and literature. bioRxiv 385617, https://doi.org/10.1101/385617 (2018).
Chandak, P., Huang, K. & Zitnik, M. Building a knowledge graph to enable precision medicine. Sci Data 10, 67 (2023).
Steenwinckel, B. et al. Facilitating the Analysis of COVID-19 Literature Through a Knowledge Graph. in The Semantic Web – ISWC 2020 344–357, https://doi.org/10.1007/978-3-030-62466-8_22 (Springer International Publishing, 2020).
Cernile, G. et al. Network graph representation of COVID-19 scientific publications to aid knowledge discovery. BMJ Health Care Inform 28 (2021).
Reese, J. T. et al. KG-COVID-19: A Framework to Produce Customized Knowledge Graphs for COVID-19 Response. Patterns (N Y) 2, 100155 (2021).
Domingo-Fernández, D. et al. COVID-19 Knowledge Graph: a computable, multi-modal, cause-and-effect knowledge model of COVID-19 pathophysiology. Bioinformatics 37, 1332–1334 (2021).
Zhang, P. et al. Toward a Coronavirus Knowledge Graph. Genes 12, (2021).
Chen, C., Ross, K. E., Gavali, S., Cowart, J. E. & Wu, C. H. COVID-19 knowledge graph from semantic integration of biomedical literature and databases. Bioinformatics https://doi.org/10.1093/bioinformatics/btab694 (2021).
Ostaszewski, M. et al. COVID19 Disease Map, a computational knowledge repository of virus-host interaction mechanisms. Mol. Syst. Biol. 17, e10387 (2021).
Zhao, L. et al. Biological knowledge graph-guided investigation of immune therapy response in cancer with graph neural network. Brief. Bioinform. https://doi.org/10.1093/bib/bbad023 (2023).
Zhu, Y., Zhou, Y., Liu, Y., Wang, X. & Li, J. SLGNN: Synthetic lethality prediction in human cancers based on factor-aware knowledge graph neural network. Bioinformatics, https://doi.org/10.1093/bioinformatics/btad015 (2023).
Jha, A., Khan, Y., Sahay, R. & d’Aquin, M. Metastatic Site Prediction in Breast Cancer using Omics Knowledge Graph and Pattern Mining with Kirchhoff’s Law Traversal. https://doi.org/10.1101/2020.07.14.203208.
Choi, W. & Lee, H. Identifying disease-gene associations using a convolutional neural network-based model by embedding a biological knowledge graph with entity descriptions. PLoS One 16, e0258626 (2021).
Feng, F. et al. GenomicKB: a knowledge graph for the human genome. Nucleic Acids Res. 51, D950–D956 (2023).
Shefchek, K. A. et al. The Monarch Initiative in 2019: an integrative data and analytic platform connecting phenotypes to genotypes across species. Nucleic Acids Res. 48, D704–D715 (2020).
Birney, E., Vamathevan, J. & Goodhand, P. Genomics in healthcare: GA4GH looks to 2022. bioRxiv 203554, https://doi.org/10.1101/203554 (2017).
Silverstein, J. C. et al. The Unified Biomedical Knowledge Graph (UBKG). GitHub https://github.com/x-atlas-consortia/ubkg-etl (2023).
Bodenreider, O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 32, D267–D270 (2004).
HuBMAP Consortium. The human body at cellular resolution: the NIH Human Biomolecular Atlas Program. Nature 574, 187–192 (2019).
SenNet Consortium. NIH SenNet Consortium to map senescent cells throughout the human lifespan to understand physiological health. Nat Aging 2, 1090–1100 (2022).
NIH Common Fund Data Ecosystem Data Distillery Partnership Repository. GitHub https://github.com/nih-cfde/data-distillery.
Ahooyi, T. M., Stear, B. J. & Taylor, D. M. Positioning Genomic Features in Biomedical Knowledge Graphs using the Homo sapiens Chromosomal Location Ontology for GRCh38 (HSCLO38). bioRxiv 2024.02.15.580505, https://doi.org/10.1101/2024.02.15.580505 (2024).
Simmons, J. A. & Silverstein, J. C. Unified Biomedical Knowledge Graph (UBKG) Source Contexts documentation. Unified Biomedical Knowledge Graph (UBKG) documentation pages https://ubkg.docs.xconsortia.org/contexts/#umls-source-context-umls-graph.
Jackson, R. et al. OBO Foundry in 2021: operationalizing open data principles to evaluate ontologies. Database 2021 (2021).
BioPortal. National Center for Biomedical Ontology https://bioportal.bioontology.org/.
UniProt Consortium. UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Res. 51, D523–D531 (2023).
Harrow, J. et al. GENCODE: the reference human genome annotation for the ENCODE Project. Genome Res. 22 (2012).
Yates, B., Gray, K. A., Jones, T. E. M. & Bruford, E. A. Updates to HCOP: the HGNC comparison of orthology predictions tool. Brief. Bioinform. 22 (2021).
Köhler, S. et al. The Human Phenotype Ontology in 2021. Nucleic Acids Res. 49, D1207–D1217 (2021).
Seal, R. L. et al. Genenames.org: the HGNC resources in 2023. Nucleic Acids Res. 51, D1003–D1009 (2023).
Callahan, T. J. et al. An open source knowledge graph ecosystem for the life sciences. Sci Data 11, 363 (2024).
Groza, T. et al. The International Mouse Phenotyping Consortium: comprehensive knockout phenotyping underpinning the study of human disease. Nucleic Acids Res. 51, D1038–D1045 (2023).
Eppig, J., Blake, J., Bult, C., Kadin, J. & Richardson, J. The Mouse Genome Database (MGD): facilitating mouse as a model for human biology and disease. Nucleic Acids Res. 43, D726–D736 (2014).
Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA 102, 15545–15550 (2005).
Liberzon, A. et al. The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell Syst 1, 417–425 (2015).
Harrison, P. W. et al. Ensembl 2024. Nucleic Acids Res. 52, D891–D899 (2024).
Dekker, J. et al. The 4D nucleome project. Nature 549, 219–226 (2017).
Asp, M. et al. A Spatiotemporal Organ-Wide Gene Expression and Cell Atlas of the Developing Human Heart. Cell 179, 1647–1660.e19 (2019).
Landrum, M. J. et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 46, D1062–D1067 (2018).
Louden, D. N. MedGen: NCBI’s Portal to Information on Medical Conditions with a Genetic Component. Med. Ref. Serv. Q. 39, 183–191 (2020).
Vasilevsky, N. A. et al. Mondo: Unifying diseases for the world, by the world. bioRxiv https://doi.org/10.1101/2022.04.13.22273750 (2022).
Malone, J. et al. Modeling sample variables with an Experimental Factor Ontology. Bioinformatics 26, 1112–1118 (2010).
National Library of Medicine. Medical Subject Headings (MESH). NIH - National Library of Medicine https://www.nlm.nih.gov/mesh/meshhome.html (2020).
Lamb, J. et al. The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science 313, 1929–1935 (2006).
Rouillard, A. D. et al. The harmonizome: a collection of processed datasets gathered to serve and mine knowledge about genes and proteins. Database 2016 (2016).
Hastings, J. et al. ChEBI in 2016: Improved services and an expanding collection of metabolites. Nucleic Acids Research 44, D1214–D1219 (2016).
Mungall, C. J., Torniai, C., Gkoutos, G. V., Lewis, S. E. & Haendel, M. A. Uberon, an integrative multi-species anatomy ontology. Genome Biol. 13, R5 (2012).
York, W. S. et al. GlyGen: Computational and Informatics Resources for Glycoscience. Glycobiology 30, 72–73 (2020).
GlyGen Datasets. https://data.glygen.org.
Boutet, E. et al. UniProtKB/Swiss-Prot, the Manually Annotated Section of the UniProt KnowledgeBase: How to Use the Entry View. Methods Mol. Biol. 1374, 23–54 (2016).
Christine E. Seidman, MD. Harvard Medical School, Boston, MA, USA. National Heart, Lung, and Blood Institute (NHLBI) Bench to Bassinet Program: The Gabriella Miller Kids First Pediatric Research Program of the Pediatric Cardiac Genetics Consortium (PCGC).
Duan, Q. et al. LINCS Canvas Browser: interactive web app to query, browse and interrogate LINCS L1000 gene expression signatures. Nucleic Acids Res. 42, W449–60 (2014).
Subramanian, A. et al. A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles. Cell 171, 1437–1452.e17 (2017).
Baldarelli, R. M., Smith, C. L., Ringwald, M., Richardson, J. E., Bult, C. J. & Mouse Genome Informatics Group. Mouse Genome Informatics: an integrated knowledgebase system for the laboratory mouse. Genetics 227 (2024).
Simmons, J. A. & Silverstein, J. C. Unified Biomedical Knowledge Graph (UBKG) Source Contexts. xconsortia.org https://ubkg.docs.xconsortia.org/contexts/ (2024).
Stear, B. J., Mohseni Ahooyi, T. & Taylor, D. M. Petagraph Project. GitHub https://github.com/TaylorResearchLab/Petagraph.
Callahan, T. J. owl-nets: Transforming OWL for statistical learning. github.com https://github.com/callahantiff/owl-nets.
Van Harmelen, F. & McGuinness, D. L. OWL web ontology language overview. World Wide Web Consortium (W3C) Recommendation 69, 70 (2004).
Callahan, T. J., Tripodi, I. J., Hunter, L. E. & Baumgartner, W. A. A Framework for Automated Construction of Heterogeneous Large-Scale Biomedical Knowledge Graphs. bioRxiv 2020.04.30.071407, https://doi.org/10.1101/2020.04.30.071407 (2020).
OBO_Foundry. OBO Relations Ontology 2023-01-04 Release. OBO Relations Ontology at GitHub https://github.com/oborel/obo-relations, https://doi.org/10.5281/zenodo.32899.
Simmons, J. A. & Silverstein, J. C. UBKG Edge Node Format instructions. xconsortia.org https://ubkg.docs.xconsortia.org/formats/#ubkg-edgesnodes-format.
O’Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 44, D733–45 (2016).
Neo4j. Neo4j Operations Manual: Import for the Neo4j Admin and Neo4j CLI. Neo4j Operations Manual Documentation https://neo4j.com/docs/operations-manual/current/tools/neo4j-admin/neo4j-admin-import/#import-tool-header-format.
Simmons, J. A. & Silverstein, J. C. UBKG ETL Generation Framework, OWLNETS-UMLS-GRAPH-12.py. github.com https://github.com/x-atlas-consortia/ubkg-etl/blob/main/generation_framework/owlnets_umls_graph/OWLNETS-UMLS-GRAPH-12.py
National Library of Medicine (US). UMLS® Reference Manual [Internet]. (National Library of Medicine, 2009).
Sayers, E. W. et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 52, D33–D43 (2024).
Stear, B. J., Mohseni Ahooyi, T. & Taylor, D. M. Petagraph Data Source Descriptions and Schema Reference. https://github.com/TaylorResearchLab/Petagraph/blob/main/petagraph/data_dict.md.
Osumi-Sutherland, D. et al. Cell type ontologies of the Human Cell Atlas. Nat. Cell Biol. 23, 1129–1135 (2021).
Kolde, R. Pheatmap: Pretty Heatmaps R Package Version 1.0.12. (2019).
Posit team. RStudio: Integrated Development Environment for R. (Posit Software, PBC, Boston, MA, 2022).
R Core Team. R: A Language and Environment for Statistical Computing. (R Foundation for Statistical Computing, Vienna, Austria, 2022).
National Institutes of Health. UMLS Metathesaurus. U.S. National Library of Medicine (2009).
Stear, B., Taylor, D. & Ahooyi, T. Petagraph. Center For Open Science https://doi.org/10.17605/OSF.IO/6JTC9 (2023).
Topol, E. J. Failing the Public Health — Rofecoxib, Merck, and the FDA. New England Journal of Medicine 351, 1707–1709 (2004).
Lupo, P. J. et al. Association Between Birth Defects and Cancer Risk Among Children and Adolescents in a Population-Based Assessment of 10 Million Live Births. JAMA Oncol 5, 1150–1158 (2019).
Schraw, J. M. et al. Cancer diagnostic profile in children with structural birth defects: An assessment in 15,000 childhood cancer cases. Cancer 126, 3483–3492 (2020).
Neo4j. The Neo4j Graph Data Science library manual v2.13. https://neo4j.com/docs/graph-data-science/2.13/.
How to License and Access the Unified Medical Language System® (UMLS®) Data. National Library of Medicine: UMLS https://www.nlm.nih.gov/databases/umls.html.
UMLS - Metathesaurus License Agreement. National Library of Medicine: UMLS https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/license_agreement.html.
Csardi, G. & Nepusz, T. The igraph software package for complex network research. InterJournal, complex systems 1695, 1–9 (2006).
Meehan, T. F. et al. Logical Development of the Cell Ontology. BMC Bioinformatics 12, 6 (2011).
Schriml, L. M. et al. The Human Disease Ontology 2022 update. Nucleic Acids Res. 50, D1255–D1261 (2022).
Acknowledgements
Funding for this project is acknowledged from the NIH Common Fund, through the Office of Strategic Coordination/Office of the NIH Director under awards R03OD030600 and OT2OD030162 (D.M.T.), and OT2OD026663 and OT2OD026675 (J.C.S.), and from the Department of Biomedical Informatics at The Children’s Hospital of Philadelphia (D.M.T.). We would like to thank Charles Borromeo at the University of Pittsburgh for his early work on UBKG.
Author information
Contributions
T.M.A. and B.J.S. wrote code for analyses and figures. B.J.S., T.M.A., J.A.S., D.M.T. and J.C.S. wrote and edited the paper. T.M.A., D.M.T. and B.J.S. designed the Petagraph schema. J.C.S. designed the UBKG schema, based on the NIH UMLS. B.J.S. and T.M.A. implemented and managed Petagraph’s build. J.A.S. and J.C.S. designed the UBKG build process. D.M.T. conceived of, supervised, and funded work on Petagraph.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Stear, B.J., Mohseni Ahooyi, T., Simmons, J.A. et al. Petagraph: A large-scale unifying knowledge graph framework for integrating biomolecular and biomedical data. Sci Data 11, 1338 (2024). https://doi.org/10.1038/s41597-024-04070-w
DOI: https://doi.org/10.1038/s41597-024-04070-w