
Guide to EcoCyc

Contents

    1  EcoCyc Project Overview

    2  How to Cite EcoCyc

    3  The Roles of EcoCyc in Microbial Genome Annotation

    4  Conditions of E. coli Growth and Non-Growth

    5  Essential Gene Information

    6  EcoCyc Metabolic Flux Model

    7  Update Frequency

    8  Data Sources Incorporated into EcoCyc
        8.1  UniProt Features
        8.2  Gene Ontology
        8.3  RefSeq Collaboration
        8.4  MetaCyc

    9  EcoCyc Accession Numbers
        9.1  Gene Accession Numbers

    10  Other E. coli and Shigella PGDBs in BioCyc

    11  We Encourage Your Feedback

    12  How to Learn More

    13  Acknowledgments

1  EcoCyc Project Overview

EcoCyc¹ is a bioinformatics database that describes the genome and the biochemical machinery of E. coli K-12 MG1655. The long-term goal of the project is to describe the molecular catalog of the E. coli cell, as well as the functions of each of its molecular parts, to facilitate a system-level understanding of E. coli. EcoCyc is an electronic reference source for E. coli biologists, and for biologists who work with related microorganisms.

In addition, a steady-state metabolic flux model is generated from each new version of EcoCyc.

This chapter provides an overview of the data content of EcoCyc, and of the procedures by which these data have been and continue to enter EcoCyc.

EcoCyc is designed for several different modes of interactive use via the EcoCyc.org web site and in conjunction with the downloadable Pathway Tools [1] software (Section 12 explains how to learn more about using the web site and software).

EcoCyc data are also available for download in multiple file formats [2] and can be queried programmatically via web services [3].

Genome. EcoCyc contains the complete genome sequence of E. coli, and describes the nucleotide position and function of every E. coli gene. A staff of five full-time curators updates the annotation of the E. coli genome on an ongoing basis using a literature-based curation strategy (see below). Mini-review summaries of E. coli gene products can be found in EcoCyc protein and RNA pages. Users can retrieve the nucleotide sequence of a gene, and the amino-acid sequence of a gene product.

Regulation. EcoCyc describes several types of E. coli cellular regulation:

Membrane transporters. EcoCyc annotates E. coli transport proteins, and the associated transport reactions that they mediate.

Metabolism. EcoCyc describes all known metabolic pathways and signal-transduction pathways of E. coli. It describes each metabolic enzyme of E. coli, including its cofactors, activators, inhibitors, and subunit structure. See also the MetaCyc project.

Database links. EcoCyc is linked to other biological databases containing protein and nucleic acid sequence data, bibliographic data, protein structures, and descriptions of different E. coli strains.

Literature-Based Curation. Curation is the process of manually refining and updating a bioinformatics database. The EcoCyc project uses a literature-based curation approach in which database updates are based on evidence in the experimental literature. EcoCyc is largely up to date with respect to its curation activities. As of March 2013, EcoCyc has encoded information from more than 43,542 publications.

Curators collect gene, protein, pathway, and compound names and synonyms. They classify genes and gene products using the Gene Ontology and MultiFun ontology, and they classify pathways within the Pathway Tools pathway ontology. Protein complex components and the stoichiometry of these subunits are captured; cellular localization of polypeptides and protein complexes is entered, as are experimentally determined protein molecular weights; enzyme activities and any enzyme prosthetic groups, cofactors, activators, or inhibitors are captured. Operon structure and gene regulation information are encoded.

Textual summaries with extensive citations are authored by curators. Within the summaries for proteins, RNAs, pathways, and operons, curators capture additional information not captured in the highly structured database fields of EcoCyc. For example, curators use the free-text summary sections to capture phenotypes caused by mutation, depletion, or overproduction of each gene product; any genetic interactions known; protein domain architecture and structural studies; similarity to other proteins; or any functional complementation experiments that have been described. Summaries can also be used to note cases in which the published reports present contradictory results. In such cases, both viewpoints are presented with proper attribution. This approach ensures that no information is lost.

Underlying software. The Pathway Tools software that underlies EcoCyc is not specific to E. coli, but has been applied to manage genomic and biochemical data for hundreds of organisms.

2  How to Cite EcoCyc

Please cite EcoCyc in publications that benefited from the use of the EcoCyc database or web site. Please cite EcoCyc as:

Keseler et al., Nuc Acids Res, 39:D583–90, 2011.

3  The Roles of EcoCyc in Microbial Genome Annotation

The EcoCyc database can impact two aspects of microbial genome annotation: annotation of gene function, and annotation of metabolic pathways.

We suggest that microbial genome annotation pipelines include a BLAST search (or a search by other sequence similarity tools) against all proteins with experimentally defined functions from EcoCyc. As discussed in our article Multidimensional annotation of the Escherichia coli K-12 genome, E. coli contains more proteins with experimentally determined functions than any other organism. When functions are assigned to newly sequenced genes, strong similarity hits to these experimentally characterized proteins should be preferred over hits against other proteins, to minimize the chance of annotation errors caused by transitive annotation.
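The preference described above can be sketched as a small post-processing step over similarity-search hits. This is only an illustration, not part of EcoCyc or of any particular annotation pipeline: the `Hit` record, its `experimental` flag, the identifiers in the usage example, and the E-value cutoff used to define a "strong" hit are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Hit:
    """One similarity-search hit (hypothetical record, not a BLAST output format)."""
    subject_id: str
    function: str
    evalue: float
    experimental: bool  # True if the subject's function was determined experimentally


def choose_annotation(hits: Iterable[Hit], strong_evalue: float = 1e-20) -> Optional[Hit]:
    """Prefer the best strong hit to an experimentally characterized protein.

    Falls back to the best overall hit when no strong experimental hit exists.
    The default cutoff is an illustrative assumption, not a recommended value.
    """
    hits = list(hits)
    if not hits:
        return None
    strong_experimental = [
        h for h in hits if h.experimental and h.evalue <= strong_evalue
    ]
    pool = strong_experimental or hits
    return min(pool, key=lambda h: h.evalue)


if __name__ == "__main__":
    example = [
        Hit("other-db:P1", "putative kinase", 1e-80, False),
        Hit("ecocyc:X1", "galactokinase", 1e-60, True),
    ]
    # The experimentally characterized hit is chosen despite its weaker E-value.
    print(choose_annotation(example).subject_id)
```

A real pipeline would likely also consider alignment coverage and identity; those criteria are omitted here to keep the sketch short.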

4  Conditions of E. coli Growth and Non-Growth

As of 2011, EcoCyc incorporates media that have been shown experimentally to support or not support growth of both wild-type and knock-out strains of E. coli K-12. This work has two goals. The first is to assemble a comprehensive encyclopedia of E. coli growth conditions for experimentalists. The spectrum of environmental conditions supporting the growth of a bacterium is among its most important phenotypic traits. We cannot expect to understand the functions of all genes in an organism unless we understand the full range of environments in which the cell can grow. Second, a comprehensive collection of E. coli growth media will drive more accurate systems biology modeling of E. coli. The larger the set of growth media against which these models are validated, the more accurate and comprehensive the models will be.

EcoCyc captures approximately 20 media that are commonly used by E. coli laboratories. It also describes media used in the following high-throughput experiments from Biolog Phenotype Microarrays (PMs) that support respiration in E. coli.

These data on growth conditions can be accessed from the EcoCyc Web site by invoking the command Tools → Search → Growth Media, then clicking on the button “All Growth Media for this Organism.” Individual media are shown in the initial table; PM data are shown in the following tables. The coloring of each cell indicates the degree of growth observed under that condition. Three levels of growth can be recorded: no growth, low growth, and growth (see legend that indicates the colors associated with each level of growth). Click on any growth medium to request a page describing its composition, and to see genes that are essential or not essential for growth under that condition.

5  Essential Gene Information

As of 2011 EcoCyc incorporates several large-scale datasets on gene essentiality in E. coli. Gene essentiality information is useful for

EcoCyc incorporates data on essentiality from the following publications:

When essentiality data are available for a given gene, the EcoCyc gene page includes a table of the conditions under which that gene has been found to be essential, or not essential, for growth. Clicking on a condition navigates to a growth-medium page that lists all essentiality information under that growth condition.

6  EcoCyc Metabolic Flux Model

A quantitative steady-state metabolic flux model has been derived from EcoCyc using Flux-Balance Analysis (FBA). By running this model with different parameters, scientists can model the growth of E. coli under different nutrient conditions and under different gene knock-outs. Every time the model is executed, the model is freshly generated from EcoCyc, meaning that as the reactions in EcoCyc are updated due to curation, the model evolves to reflect those changes.
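For readers unfamiliar with FBA, the general form of the optimization problem is sketched below. This is the standard textbook formulation, not a description of the specific EcoCyc model: the symbols $S$ (stoichiometric matrix), $v$ (vector of reaction fluxes), $c$ (objective weights, typically selecting a biomass reaction), and the bounds $l_j$, $u_j$ are generic notation introduced here for illustration.

```latex
\begin{aligned}
\max_{v}\quad & c^{\top} v \\
\text{subject to}\quad & S\,v = 0, \\
& l_j \le v_j \le u_j \quad \text{for every reaction } j .
\end{aligned}
```

In this notation, a nutrient condition corresponds to a choice of bounds on the uptake reactions, and a gene knock-out can be represented by setting $l_j = u_j = 0$ for the reactions that depend on the deleted gene.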

To run the model, use the Tools → Metabolism → Run Metabolic Model command. MetaFlux is described in the Metabolic Models section of the website user guide.

7  Update Frequency

The EcoCyc.org and BioCyc.org Web sites and downloadable files are updated approximately three times per year. A faster, more powerful EcoCyc that you can install locally on your computer (Macintosh, PC/Windows, PC/Linux) is released semiannually. See the EcoCyc release history for details.

8  Data Sources Incorporated into EcoCyc

8.1  UniProt Features

UniProt protein features (the UniProt KB term is sequence annotations) from the complete proteome of E. coli K-12 MG1655 in SwissProt are imported into EcoCyc for every EcoCyc release. We import all protein features with experimental or non-experimental evidence qualifiers except for the following types: turn, helix, beta strand, and coiled‑coil. The chain type is only imported if it does not span the entire length of the protein. Examples of imported feature types include catalytic domains, phosphorylation sites, and metal ion binding sites. We import citations associated with UniProt protein features if they have an associated PubMed ID. The import of protein features into EcoCyc is done via the UniProt Feature Importer tool within the Pathway Tools software (which can be applied to any PGDB).
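The import rules above can be sketched as a simple filter. This is an illustration only and does not reproduce the UniProt Feature Importer: the dictionary layout of a feature (`type`, `start`, `end`, `citations`, `pubmed_id`) is a hypothetical representation chosen for the example.

```python
# Feature types that the text says are not imported.
EXCLUDED_TYPES = {"turn", "helix", "beta strand", "coiled-coil"}


def should_import(feature: dict, protein_length: int) -> bool:
    """Apply the stated rules to one feature (hypothetical dict layout)."""
    ftype = feature["type"].lower()
    if ftype in EXCLUDED_TYPES:
        return False
    if ftype == "chain":
        # A chain is imported only if it does not span the entire protein.
        spans_whole = feature["start"] == 1 and feature["end"] == protein_length
        return not spans_whole
    return True


def importable_citations(feature: dict) -> list:
    """Keep only citations that carry a PubMed ID."""
    return [c for c in feature.get("citations", []) if c.get("pubmed_id")]


if __name__ == "__main__":
    full_chain = {"type": "chain", "start": 1, "end": 100}
    partial_chain = {"type": "chain", "start": 2, "end": 100}
    print(should_import(full_chain, 100), should_import(partial_chain, 100))
```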

8.2  Gene Ontology

For several years, EcoCyc and EcoliWiki have been collaborating on improving and maintaining the GO annotations for E. coli. Since the summer of 2008, we have been periodically generating a file containing all E. coli K-12 GO term annotations, called ecocyc.gaf, that may be obtained from the Gene Ontology Consortium.

GO annotation has become a standard part of EcoCyc’s manual literature-based curation process. The GO annotations are added to the database objects that represent the functional gene products or protein complexes, not directly to the gene objects, so as to model the biology as accurately as possible. In parallel, manual annotation of E. coli genes with GO is ongoing at EcoliWiki. On a regular basis, the GO annotations are merged. The latest UniProt and EcoliWiki annotations are imported into EcoCyc. Because electronic annotations are not accepted by the GO consortium as part of the gene association file if they are more than one year old, these UniProt annotations are reimported into EcoCyc on a regular basis.

EcoCyc incorporates many electronic and experimental GO term annotations of E. coli K-12 gene products obtained from the “UniProt [multispecies] GO Annotations @ EBI” file downloaded from the Gene Ontology Consortium. When this import was first performed in 2007, about 30,000 new IEA (“Inferred from Electronic Annotation”) GO term assignments were added to EcoCyc, along with approximately 1,000 assignments with experimental evidence codes, including assignments from high-throughput protein-interaction studies.

During the import of GO terms from UniProt into EcoCyc, a filtering operation prunes GO term annotations that have solely computational (IEA) evidence if the EcoCyc gene product already has a more specific GO annotation (in other words, a GO term that is a child of the GO term being imported) supported by experimental evidence. For example, if a gene product already has an experimental annotation to the term “galactose kinase,” the software does not add the computational annotation “carbohydrate kinase.” This filtering removes about 1,000 of these less specific, redundant annotations.

A gene association file is generated from the quarterly releases of EcoCyc. This file is sent to the EcoliWiki team at Texas A&M for further processing. At EcoliWiki, annotations made in the wiki-based community annotation system since the last EcoCyc update are added to the file, along with annotations containing qualifiers (mainly contributes_to) not yet supported by EcoCyc. Only those annotations that are complete by GO consortium standards are extracted from EcoliWiki; incomplete annotations are left in place with the hope that community members will eventually complete them. EcoliWiki runs the GO consortium validation scripts and deposits the file with the GO consortium via their Concurrent Versions System.
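The filtering step applied during the UniProt import can be sketched as follows. This is a minimal illustration under stated assumptions, not the Pathway Tools implementation: the child-to-parents dictionary, the function names, and the use of term labels instead of GO identifiers are all hypothetical. The set of experimental evidence codes lists the standard GO experimental codes; which codes EcoCyc treats as experimental is an assumption here.

```python
# Standard GO experimental evidence codes (assumed to be the relevant set).
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def ancestors(term, parents):
    """Return every ancestor of `term` in a child -> list-of-parents mapping."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def keep_imported_annotation(term, evidence, existing, parents):
    """Decide whether an incoming annotation should be added.

    `existing` is an iterable of (term, evidence_code) pairs already attached
    to the gene product. An incoming IEA annotation is dropped when a more
    specific term with experimental evidence is already present.
    """
    if evidence != "IEA":
        return True
    for existing_term, existing_code in existing:
        if existing_code in EXPERIMENTAL_CODES and term in ancestors(existing_term, parents):
            return False
    return True


if __name__ == "__main__":
    # Toy ontology fragment using the labels from the example in the text.
    toy_parents = {"galactose kinase": ["carbohydrate kinase"]}
    already_present = [("galactose kinase", "IDA")]
    print(keep_imported_annotation("carbohydrate kinase", "IEA", already_present, toy_parents))
```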

8.3  RefSeq Collaboration

EcoCyc is involved in a collaboration to update the genome annotation of the GenBank (U00096.3) and RefSeq (NC_000913.3) entries for E. coli K-12 MG1655 on an ongoing basis. The primary collaborators include EcoCyc, EcoGene, UniProtKB/Swiss-Prot, and NCBI. The collaborators routinely share their data and resolve conflicts among the data. Updates of gene names, gene positions, and gene product names are shared among all partners.

8.4  MetaCyc

The EcoCyc and MetaCyc databases exchange data as part of the release processes for both databases. Updates that have occurred to enzymes, genes, pathways, reactions, and metabolites are exchanged between the databases based on automated comparisons of update dates to ensure that the latest information and corrections are propagated between databases.
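A comparison of update dates of this kind might look like the sketch below. It is purely illustrative: the mapping from object identifier to last-update date, the function name, and the rule that the more recently updated copy is the one propagated are assumptions, and the actual exchange procedure is not described in this guide.

```python
from datetime import date


def plan_exchange(ecocyc_updates: dict, metacyc_updates: dict) -> dict:
    """Return {object_id: direction} for shared objects whose update dates differ.

    Both arguments map a (hypothetical) object identifier to its last-update date.
    """
    plan = {}
    for obj_id in sorted(ecocyc_updates.keys() & metacyc_updates.keys()):
        eco_date = ecocyc_updates[obj_id]
        meta_date = metacyc_updates[obj_id]
        if eco_date > meta_date:
            plan[obj_id] = "EcoCyc -> MetaCyc"
        elif meta_date > eco_date:
            plan[obj_id] = "MetaCyc -> EcoCyc"
    return plan


if __name__ == "__main__":
    eco = {"RXN-A": date(2013, 3, 1), "RXN-B": date(2012, 1, 1)}
    meta = {"RXN-A": date(2012, 12, 1), "RXN-B": date(2012, 5, 1)}
    print(plan_exchange(eco, meta))
```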

9  EcoCyc Accession Numbers

9.1  Gene Accession Numbers

Three systems of accession numbers are typically available for genes within EcoCyc. Any of these accession numbers may be used when querying EcoCyc genes “by name,” and in the Web site Quick Search.

10  Other E. coli and Shigella PGDBs in BioCyc

EcoCyc is part of the larger BioCyc collection of Pathway/Genome Databases (PGDBs). BioCyc version 16.0 (2012) included more than 130 E. coli and Shigella PGDBs. Most of these PGDBs were generated computationally and lack the extensive manual literature-based curation of the EcoCyc K-12 database. Two of these PGDBs have undergone additional curation: the BioCyc PGDBs for strains W3110 and for E. coli B str. REL606. Both strains underwent a computational annotation normalization procedure in which gene names, product names, heteromultimeric protein complexes, and Gene Ontology terms were propagated from EcoCyc to their orthologous genes in these other two strains. This procedure was performed under the assumption that genome annotation pipelines typically introduce syntactically large but semantically insignificant variation in the naming of genes and gene products. In addition, E. coli B str. REL606 is undergoing literature-based curation to incorporate experimental information regarding the genes and pathways present in this strain but not in the EcoCyc strain MG1655. This curation is supported by the PortEco (formerly EcoliHub) project.
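The normalization idea described above, copying curated names and GO terms onto orthologous genes, can be sketched as follows. This is a simplified illustration and not the procedure actually used: the record fields, the ortholog mapping, and all identifiers are hypothetical, and the propagation of heteromultimeric protein complexes is omitted.

```python
# Fields copied from the curated reference record (hypothetical field names).
PROPAGATED_FIELDS = ("gene_name", "product_name", "go_terms")


def normalize(target_genes: dict, reference_genes: dict, orthologs: dict) -> dict:
    """Return a copy of `target_genes` with curated fields copied from orthologs.

    `orthologs` maps a target gene ID to a reference gene ID. Genes without an
    ortholog in the reference are returned unchanged. Inputs are not modified.
    """
    normalized = {}
    for gene_id, record in target_genes.items():
        updated = dict(record)
        reference = reference_genes.get(orthologs.get(gene_id))
        if reference is not None:
            for field in PROPAGATED_FIELDS:
                if field in reference:
                    updated[field] = reference[field]
        normalized[gene_id] = updated
    return normalized


if __name__ == "__main__":
    reference = {"REF_0001": {"gene_name": "geneA", "product_name": "enzyme A"}}
    target = {"T_0001": {"gene_name": "orf1", "product_name": "hypothetical protein"}}
    print(normalize(target, reference, {"T_0001": "REF_0001"}))
```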

To select a given genome for querying in the BioCyc Web site, click on the word “change” under the Quick Search and Gene Search buttons in the upper right corner of most Web pages.

11  We Encourage Your Feedback

Feedback from the scientific community has been invaluable to improving EcoCyc during its many years of development. We strongly encourage your comments and suggestions for improvements in areas including the following. Please email suggestions or questions to biocyc-support at ai dot sri dot com.

At every EcoCyc release we email a summary of new developments to our biocyc-users mailing list. To subscribe to this mailing list, please see http://biocyc.org/subscribe.shtml.

12  How to Learn More

13  Acknowledgments

The development of EcoCyc is funded by grants GM77678 and GM71962 from the NIH National Institute of General Medical Sciences.

Contributors to EcoCyc are listed on the credits page.

References

[1]   P. D. Karp, S. M. Paley, M. Krummenacker, M. Latendresse, J.M. Dale, T. Lee, P. Kaipa, F. Gilham, A. Spaulding, L. Popescu, T. Altman, I. Paulsen, I.M. Keseler, and R. Caspi. Pathway Tools version 13.0: Integrated software for pathway/genome informatics and systems biology. Brief Bioinform, 11:40–79, 2010. http://bib.oxfordjournals.org/cgi/content/abstract/bbp043.

[2]   BioCyc and Pathway Tools Download Information. https://biocyc.org/download.shtml.

[3]   Pathway Tools Web Services. https://biocyc.org/web-services.shtml.

[4]   M. AbuOun, P. F. Suthers, G. I. Jones, B. R. Carter, M. P. Saunders, C. D. Maranas, M. J. Woodward, and M. F. Anjum. Genome scale reconstruction of a Salmonella metabolic model: comparison of similarity and differences with a commensal Escherichia coli strain. J Biol Chem, 284(43):29480–8, 2009.

[5]   D. J. Baumler, R. G. Peplinski, J. L. Reed, J. D. Glasner, and N. T. Perna. The evolution of metabolic networks of E. coli. BMC Systems Biology, 5:182, 2011.

[6]   S. H. Yoon, M. J. Han, H. Jeong, C. H. Lee, X. X. Xia, D. H. Lee, J. H. Shim, S. Y. Lee, T. K. Oh, and J. F. Kim. Comparative multi-omics systems analysis of Escherichia coli strains B and K–12. Genome Biol, 13(5):R37, 2012.

[7]   S. Y. Gerdes, M. D. Scholle, J. W. Campbell, G. Balazsi, E. Ravasz, M. D. Daugherty, A. L. Somera, N. C. Kyrpides, I. Anderson, M. S. Gelfand, A. Bhattacharya, V. Kapatral, M. D’Souza, M. V. Baev, Y. Grechkin, F. Mseeh, M. Y. Fonstein, R. Overbeek, A. L. Barabasi, Z. N. Oltvai, and A. L. Osterman. Experimental determination and system level analysis of essential genes in Escherichia coli MG1655. Journal of Bacteriology, 185(19):5673–5684, Oct 2003.

[8]   A. R. Joyce, J. L. Reed, A. White, R. Edwards, A. Osterman, T. Baba, H. Mori, S. A. Lesely, B. Ø. Palsson, and S. Agarwalla. Experimental and computational assessment of conditionally essential genes in Escherichia coli. Journal of Bacteriology, 188(23):8259–8271, 2006.

[9]   T. Baba, T. Ara, M. Hasegawa, Y. Takai, Y. Okumura, M. Baba, K. A. Datsenko, M. Tomita, B. L. Wanner, and H. Mori. Construction of Escherichia coli K–12 in-frame, single-gene knockout mutants: The Keio collection. Mol Systems Biology, 2:2006.0008, 2006.

[10]   A.M. Feist, C.S. Henry, J.L. Reed, M. Krummenacker, A.R. Joyce, P. D. Karp, L.J. Broadbelt, V. Hatzimanikatis, and B.Ø. Palsson. A genome-scale metabolic reconstruction for Escherichia coli K–12 MG1655 that accounts for 1260 ORFs and thermodynamic information. Mol Systems Biology, 3:121–38, 2007. http://www.nature.com/doifinder/10.1038/msb4100155.

[11]   W. M. Patrick, E. M. Quandt, D. B. Swartzlander, and I. Matsumura. Multicopy suppression underpins metabolic evolvability. Mol Biol Evol, 24(12):2716–22, 2007.

[12]   M. Riley, T. Abe, M. B. Arnaud, M. K. Berlyn, F. R. Blattner, R. R. Chaudhuri, J. D. Glasner, T. Horiuchi, I. M. Keseler, T. Kosuge, H. Mori, N. T. Perna, G. Plunkett, K. E. Rudd, M. H. Serres, G. H. Thomas, N. R. Thomson, D. Wishart, and B. L. Wanner. Escherichia coli K-12: A cooperatively developed annotation snapshot–2005. Nuc Acids Res, 34(1):1–9, 2006.

[13]   I. M. Keseler, J. Collado-Vides, A. Santos-Zavaleta, M. Peralta-Gil, S. Gama-Castro, L. Muniz-Rascado, C. Bonavides-Martinez, S. Paley, M. Krummenacker, T. Altman, P. Kaipa, A. Spaulding, J. Pacheco, M. Latendresse, C. Fulcher, M. Sarker, A. G. Shearer, A. Mackie, I. Paulsen, R. P. Gunsalus, and P. D. Karp. EcoCyc: A Comprehensive Database of Escherichia coli biology. Nuc Acids Res, 39:D583–90, 2011.

[14]   I.M. Keseler, C. Bonavides-Martinez, J. Collado-Vides, S. Gama-Castro, R.P. Gunsalus, D. Aaron Johnson, M. Krummenacker, L.M. Nolan, S. M. Paley, I.T. Paulsen, M. Peralta-Gil, A. Santos-Zavaleta, A.G. Shearer, and P. D. Karp. EcoCyc: A comprehensive view of E. coli biology. Nuc Acids Res, 37:D464–70, 2009. http://nar.oxfordjournals.org/cgi/reprint/gkn751?ijkey=7epgizfnGFYQHCe&keytype=ref.

[15]   P. D. Karp, I.M. Keseler, A. Shearer, M. Latendresse, M. Krummenacker, S. M. Paley, I.T. Paulsen, J. Collado-Vides, S. Gama-Castro, M. Peralta-Gil, A. Santos-Zavaleta, M.I. Penaloza-Spinola, C. Bonavides-Martinez, and J. Ingraham. Multidimensional annotation of the Escherichia coli K-12 genome. Nuc Acids Res, 35:7577–90, 2007. http://nar.oxfordjournals.org/cgi/content/full/35/22/7577.

[16]   I.M. Keseler, J. Collado-Vides, S. Gama-Castro, J. Ingraham, S. Paley, I.T. Paulsen, M. Peralta-Gil, and P. D. Karp. EcoCyc: A comprehensive database resource for E. coli. Nuc Acids Res, 33:D334–7, 2005. http://nar.oupjournals.org/cgi/content/full/33/suppl_1/D334?ijkey=80p4BbGpEFjLQ&keytype=ref.

[17]   P. D. Karp, M. Arnaud, J. Collado-Vides, J. Ingraham, I.T. Paulsen, and M.H. Saier Jr. The E. coli EcoCyc database: No longer just a metabolic pathway database. ASM News, 70(1):25–30, 2004.

[18]   P. D. Karp, M. Riley, M. Saier, I.T. Paulsen, S. Paley, and A. Pellegrini-Toole. The EcoCyc database. Nuc Acids Res, 30(1):56–8, 2002.

[19]   P. D. Karp, M. Riley, M. Saier, I.T. Paulsen, S. Paley, and A. Pellegrini-Toole. The EcoCyc and MetaCyc databases. Nuc Acids Res, 28(1):56–59, 2000.

[20]   P. D. Karp. Using the EcoCyc database. In Nucleic Acid and Protein Databases and How To Use Them, pages 269–280. Academic Press, London, 1999.

[21]   P. D. Karp and M. Riley. EcoCyc: The resource and the lessons learned. In Bioinformatics Databases and Systems, pages 47–62. Kluwer Academic Publishers, Norwell, MA, 1999.

[22]   P. Karp, M. Riley, S. Paley, A. Pellegrini-Toole, and M. Krummenacker. EcoCyc: Electronic encyclopedia of E. coli genes and metabolism. Nuc Acids Res, 27(1):55–58, 1999.

[23]   P. Karp, M. Riley, S. Paley, A. Pellegrini-Toole, and M. Krummenacker. EcoCyc: Electronic encyclopedia of E. coli genes and metabolism. Nuc Acids Res, 26(1):50–53, 1998.

[24]   P. Karp, M. Riley, S. Paley, A. Pellegrini-Toole, and M. Krummenacker. EcoCyc: Electronic encyclopedia of E. coli genes and metabolism. Nuc Acids Res, 25(1):43–50, 1997.

[25]   P. Karp, M. Riley, S. Paley, and A. Pellegrini-Toole. EcoCyc: Electronic encyclopedia of E. coli genes and metabolism. Nuc Acids Res, 24(1):32–40, 1996.


1 “EcoCyc” is pronounced “eeko-sike”. It sounds like “ecology” and like “encyclopedia”.