BioCyc and Pathway Tools Blog: orthologs

Friday, October 12, 2018

Pan-Genome PGDBs Unify Genomic and Metabolic Data across Related Strains

Pan-Genome PGDBs are a relatively new feature of BioCyc. These BioCyc databases combine in one place information about multiple sequenced genomes for a given species. For example, the Helicobacter pylori Pan-Genome database covers 158 sequenced strains.

The Pan-Genome PGDBs contain one gene object for each orthologous group of genes in the organism. They also contain the union of all metabolic pathways across all the strains. Thus, a Pan-Genome PGDB allows you to quickly assess the full set of gene functions and metabolic pathways across the known strains. For example, the gene page shows all orthologs across all strains in the Pan-Genome. The page for the ftsX gene illustrates this for a gene with relatively few orthologs.
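To make the idea of "one gene object per ortholog group, plus the union of all pathways" concrete, here is a minimal Python sketch. It is not how BioCyc builds its Pan-Genome PGDBs; the strain names, gene ids, ortholog group labels, and pathway sets are all invented for illustration, and `build_pan_genome` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical toy input: (strain, gene_id, ortholog_group) triples.
# Every name here is invented for illustration.
strain_genes = [
    ("strainA", "A_0001", "OG1"),
    ("strainB", "B_0007", "OG1"),
    ("strainC", "C_0003", "OG1"),
    ("strainB", "B_0042", "OG2"),
]

# Hypothetical per-strain pathway sets.
strain_pathways = {
    "strainA": {"glycolysis", "TCA cycle"},
    "strainB": {"glycolysis", "urea degradation"},
    "strainC": {"glycolysis"},
}


def build_pan_genome(genes, pathways):
    """Collapse genes into one entry per ortholog group and union the pathways."""
    groups = defaultdict(list)
    for strain, gene_id, group in genes:
        groups[group].append((strain, gene_id))
    # The pan-genome pathway set is the union over all strains.
    all_pathways = set().union(*pathways.values())
    return dict(groups), all_pathways


pan_genes, pan_pathways = build_pan_genome(strain_genes, strain_pathways)
print(len(pan_genes), "ortholog groups")
print(sorted(pan_pathways))
```

In this toy example four strain-level genes collapse into two pan-genome gene entries, and a pathway present in only one strain still appears in the pan-genome pathway set.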

Other gene pages list hundreds of orthologs and synonyms.

You can visit a page for the orthologous genes by following the links in the 'Relationship Links' area near the bottom of the page.

Pan-Genomes provide a way to visualize genes in the genome browser as well. It's easier to understand this visualization if you know that Pan-Genomes are constructed starting with the PGDB of a base strain and adding other members of the collection of strains one-by-one. For example, the H. pylori Pan-Genome was constructed by starting with strain 26695. The visualization is based on dividing groups of orthologous genes into two sets: ortholog groups that include genes that occur in the base strain, and ortholog groups that only include genes in other strains.

Ortholog groups that include genes from the base strain are collected on a 'chromosome' that preserves the location and direction of the genes from the base strain. The remaining ortholog groups are mapped in arbitrary order onto an 'artificial replicon'. The names in the artificial replicon display are based on the (arbitrary) order that PGDBs were added to the Pan-Genome.
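The split between the 'chromosome' and the 'artificial replicon' can be sketched as a simple partition. This is only an illustration of the rule described above, not the Pathway Tools implementation: the ortholog group labels and gene ids are invented, `split_by_base_strain` is a hypothetical helper, and only the base strain name (26695) comes from the text.

```python
def split_by_base_strain(ortholog_groups, base_strain):
    """Partition ortholog groups by whether they contain a base-strain gene."""
    on_chromosome = {}
    on_artificial_replicon = {}
    for name, members in ortholog_groups.items():
        strains = {strain for strain, _gene in members}
        if base_strain in strains:
            on_chromosome[name] = members
        else:
            on_artificial_replicon[name] = members
    return on_chromosome, on_artificial_replicon


# Hypothetical ortholog groups: group label -> list of (strain, gene id).
groups = {
    "OG1": [("26695", "gene_a"), ("otherStrain1", "gene_b")],
    "OG2": [("otherStrain1", "gene_c")],
    "OG3": [("otherStrain2", "gene_d"), ("otherStrain1", "gene_e")],
}

chromosome, artificial = split_by_base_strain(groups, base_strain="26695")
print(sorted(chromosome))   # groups drawn at the base strain's coordinates
print(sorted(artificial))   # groups placed on the artificial replicon
```

Only the groups in the first set have a meaningful position and direction in the genome browser; the order of the second set carries no biological meaning.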

You can search for genes either by name (e.g., 'abc') or by an identifier that combines a strain name and an ID joined with an underscore (e.g., HPHPP11_0013); enter either one in the search box at the top right. Either quick search or gene search will take you to the page containing the gene and its orthologs in the Pan-Genome PGDB.
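If you are preparing a list of such identifiers outside BioCyc, a small helper can separate the strain part from the ID part. This is a sketch under one assumption that the text does not state: that the split should happen at the last underscore. `parse_strain_gene_id` is a hypothetical name, not a BioCyc function.

```python
def parse_strain_gene_id(identifier):
    """Split 'STRAIN_ID' at the last underscore; return None if it does not fit."""
    strain, sep, local_id = identifier.rpartition("_")
    if not sep or not strain or not local_id:
        return None  # a plain gene name such as 'abc' has no underscore
    return strain, local_id


print(parse_strain_gene_id("HPHPP11_0013"))
print(parse_strain_gene_id("abc"))
```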

The cellular overview diagram provides a way to visualize reactions associated with genes shared by all members of the Pan-Genome. You can also visualize those reactions that are unique to a single organism within the Pan-Genome. In the screen shot below, the reactions shared by all organisms in the H. pylori Pan-Genome are shown in red and those unique to a single organism are purple.

To create a diagram like this on the BioCyc web site, select a Pan-Genome PGDB and bring up the cellular overview. In the operations menu, choose Highlight Genes -> By Pan-Genome Core Genes. The core genes are the set of genes shared by all organism databases in the Pan-Genome. Choose Highlight Genes -> By Pan-Genome Unique Genes to highlight reactions associated with genes that are unique to a single organism database.

In the desktop version of Pathway Tools, show the Cellular Overview, then choose Overviews -> Highlight -> Highlight the core genome, followed by Overviews -> Highlight the unique genes, to achieve the same highlighting.
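The two highlighting commands correspond to two simple set definitions: a core gene is shared by every organism database in the Pan-Genome, and a unique gene occurs in exactly one. The sketch below applies those definitions to invented data; the strain names and ortholog group labels are hypothetical, and `classify_groups` is not part of Pathway Tools.

```python
def classify_groups(group_strains, all_strains):
    """Return (core, unique) ortholog groups given the strains each group occurs in."""
    all_strains = set(all_strains)
    core, unique = set(), set()
    for name, strains in group_strains.items():
        present = set(strains)
        if present == all_strains:
            core.add(name)      # shared by every organism database
        elif len(present) == 1:
            unique.add(name)    # found in a single organism database
    return core, unique


# Hypothetical data: ortholog group -> strains that have a member gene.
group_strains = {
    "OG1": ["strainA", "strainB", "strainC"],
    "OG2": ["strainB"],
    "OG3": ["strainA", "strainC"],
}

core, unique = classify_groups(group_strains, ["strainA", "strainB", "strainC"])
print(sorted(core))
print(sorted(unique))
```

Groups such as OG3, present in some but not all strains, fall into neither set and would stay unhighlighted.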

More details on how Pan-Genome PGDBs are created and how to use them are provided here.

You can select a Pan-Genome PGDB by entering the phrase “pan-genome” in the change organism database dialogue. Here's our current list of Pan-Genome PGDBs and the number of strains that each one contains. Over time we will be adding Pan-Genome PGDBs for additional species, and regenerating existing Pan-Genome PGDBs to include additional strains.

Clostridioides difficile 10 strains

Escherichia coli 374 strains

Helicobacter pylori 158 strains

Listeria monocytogenes 35 strains

Mycobacterium tuberculosis 24 strains

Pseudomonas aeruginosa 24 strains

Salmonella enterica 113 strains

Shigella flexneri 9 strains

Vibrio cholerae 81 strains

Friday, June 29, 2018

Generating a SmartTable of Orthologous Genes Across Multiple BioCyc Genomes

This post shows how, given a list of gene names and/or identifiers from one organism, to retrieve the orthologous genes in a second organism. For example, this procedure could be used to find the orthologs in EcoCyc of a set of genes of uncertain function in another organism, potentially providing insights about their functions.

The BioCyc project generates a database of orthologous genes between many of the organism Pathway/Genome Databases (PGDBs) we maintain. These orthologs are stored in a central MySQL database that is separate from the individual BioCyc PGDBs. This post describes a way to retrieve orthologs for a list of genes using a SmartTable. We don’t recommend generating an ortholog list for a whole genome because of performance problems that, as of June 2018, we are working to resolve.

Since we will be using a PGDB in addition to EcoCyc, you will need a BioCyc subscription (not just a free account) to follow through this demonstration. Here’s a screenshot of the final SmartTable, which you can access directly here. We will begin with a file of 88 genes from E. coli strain B str. REL606 and determine their orthologs in EcoCyc. The columns in the resulting SmartTable are as follows: the list of gene names from the input file, their ‘ECB’ accession IDs, the names of orthologous genes from EcoCyc, and ‘b-number’ accession IDs from EcoCyc. We also show how to add a column containing gene product names.

Here is the step-by-step procedure.

1. Go to the bottom of this post and cut and paste the list of gene identifiers into a text editor such as textedit or atom. Save the file as 88EcoliGenes.txt.

2. Go to BioCyc.org, log in, and use the change organism database command to change your organism to Escherichia coli B str. REL606.

3. Open the ‘Smart Tables’ menu and choose the ‘My Smart Tables’ command. It actually doesn’t matter which of the commands in the menu you choose.

4. You’ll find the operations menu in the upper right corner of the SmartTables page. Under the ‘New’ command, you’ll find ‘Smart Table from Uploaded File’. Choose that command.

5. In the resulting pop-up window, click the ‘Choose File’ button and select the file you saved out in step 1. Once you have located and selected the file, review the options below the ‘Choose File’ button in the upload window. Since this is a file of gene identifiers, you should keep the ‘Try to make objects of type’ box checked, along with the radio button next to ‘Gene’. Also leave the other two check boxes checked. Click the ‘Upload’ button and a new SmartTable with a single column will appear. If you see a warning message, ignore it and continue.

6. Add accession numbers for the genes by locating the ‘ADD PROPERTY COLUMN’ dropdown menu and choosing ‘Accession-1’ from the list. You’ll see a list of ‘ECB’ identifier numbers for each of the genes on the first page.

7. Now that the table has some basic identifiers for the E. coli B str. REL606 genes, you can proceed to adding orthologs from EcoCyc. Find the ‘ADD TRANSFORM COLUMN’, and choose the second transform ‘Compare – map to other species PGDB’. That will bring up a pop-up list of species, which will be rather long. Scroll down through the list until you find ‘Escherichia coli K-12 substr. MG1655’. Once you have selected the right organism, click the ‘Go’ button.

8. Now a column containing ortholog gene names appears. Each entry is linked to the gene page in its organism’s PGDB. Try it by clicking on thrL, and notice the organism in the upper left of the gene page. Clicking on the browser back button will take you back to the table and restore the current PGDB to EcoCyc.

9. To add a column containing accession numbers of each orthologous gene, select the ortholog column by clicking anywhere in the column header except where the title is displayed. When the column is selected the header turns darker (and the ‘Gene Name’ column lightens as it is no longer the selected column). Then, as in step 6, choose ‘Accession-1’ from the dropdown below ‘Add Property Column.’ A different list of accession numbers appears. Notice that both lists of accession numbers are simply strings that don’t link to any genes. A column of product names for the orthologs can be added in a similar fashion.

10. Next, suppose you were interested only in those B strain genes having orthologs in the K-12 strain. To remove genes lacking orthologs, select the third column (‘Map to Escherichia…’). Now look over to the operations menu on the right. Select the ‘Filter’ command. A pop-up window titled ‘Column text-value filter’ will appear.

11. This dialog has a series of options chained together in a ‘sentence.’ In the second drop-down, change ‘a copy of …’ to ‘this SmartTable’. In the third option, choose ‘contain an object’. Click the ‘Go’ button. You should be left with a SmartTable containing 84 rows, the B / K-12 strain ortholog pairs, as shown in the screen capture at the top of this post.

12. Finally, you can save the results to a file by selecting the Export command from the Operations menu. Choose ‘to Spreadsheet file’. Since all of the genes have names in column one, you can export common names, which are more readable in a spreadsheet than frame IDs.
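For readers who think in code, the SmartTable steps above (upload a gene list, add an ortholog column, filter out rows without an ortholog, export) can be sketched in a few lines of Python. This is only an analogy for what the web interface does, not a BioCyc API: the gene names, the `ortholog_table` lookup, and all function names are invented placeholders, and no real ortholog data is used.

```python
import csv
import io


def read_gene_list(text):
    """Return non-empty, stripped lines: one gene name or identifier per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def map_orthologs(genes, ortholog_table):
    """Pair each gene with its ortholog (or None), like the transform column."""
    return [(gene, ortholog_table.get(gene)) for gene in genes]


def keep_rows_with_ortholog(rows):
    """Drop rows whose ortholog cell is empty, like the Filter command."""
    return [row for row in rows if row[1] is not None]


def to_tsv(rows):
    """Serialize rows as tab-separated text, like a spreadsheet export."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["gene", "ortholog"])
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical inputs: an uploaded file's contents and an ortholog lookup.
uploaded = "geneA\ngeneB\n\nSTRAIN_00020\n"
ortholog_table = {"geneA": "geneA_k12", "geneB": "geneB_k12"}

rows = map_orthologs(read_gene_list(uploaded), ortholog_table)
filtered = keep_rows_with_ortholog(rows)
print(to_tsv(filtered))
```

In the toy input, the third identifier has no entry in the lookup, so it is dropped by the filter in the same way that genes without a K-12 ortholog are removed in step 11.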

As always, I welcome your questions and comments below.

Here is the example data. Copy the gene list below, paste it into a text editor such as textedit or atom, and save the file as 88EcoliGenes.txt.

thrL

thrA

thrB

thrC

yaaX

yaaA

yaaJ

talB

mog

satP

yaaI

dnaK

dnaJ

insL-1

mokC

hokC

nhaA

nhaR

ECB_00020

ECB_00021

ECB_00022

insA-1

insB-1

ECB_00025

ECB_00026

rpsT

yaaY

ribF

ileS

lspA

fkpB

ispH

rihC

dapB

carA

carB

caiF

caiE

caiD

caiC

caiB

caiA

caiT

fixA

fixB

fixC

fixX

yaaU

kefF

kefC

folA

apaH

apaG

rsmA

pdxA

surA

lptD

djlA

rluA

hepA

polB

araD

araA

araB

araC

yabI

thiQ

thiP

tbpA

sgrR

setA

leuD

leuC

leuB

leuA

leuL

leuO

ilvI

ilvH

cra

mraZ

rsmH

ftsL

ftsI

murE

murF

mraY

murD
