This repository has been archived by the owner on Jan 16, 2023. It is now read-only.

neonMicrobe: Processing NEON soil microbe marker gene sequence data into ASV tables, duplicated from https://github.com/claraqin/neonMicrobe

zhulabgroup/qin-2021-ecosphere

neonMicrobe

neonMicrobe is a suite of functions for downloading, pre-processing, and assembling heterogeneous data around the NEON soil microbe marker gene sequence data. To do so, neonMicrobe downloads NEON data products from the NEON Data API and processes sequences using the DADA2 workflow. In the future, neonMicrobe will offer a processing-batch infrastructure to encourage explicit versioning of processed data.

How to cite

Please cite this package by citing the associated methods paper:

Qin, C., Bartelme, R., Chung, Y. A., Fairbanks, D., Lin, Y., Liptzin, D., Muscarella, C., Naithani, K., Peay, K., Pellitier, P., St. Rose, A., Werbin, Z., & Zhu, K. (2021). From DNA sequences to microbial ecology: Wrangling NEON soil microbe data with the neonMicrobe R package. Ecosphere, 12(11). https://doi.org/10.1002/ecs2.3842

Installation

The development version of neonMicrobe can be installed directly from GitHub (the upstream claraqin/neonMicrobe repository) using this code:

install.packages("devtools")
devtools::install_github("claraqin/neonMicrobe")

User-installed dependencies

In addition to the R package dependencies, which are installed alongside neonMicrobe, users may also need to meet the following requirements before using some functions in neonMicrobe:

  1. For taxonomic assignment via DADA2, you will need to install the latest taxonomic reference datasets for ITS or 16S sequences. Consult the DADA2 taxonomic reference data webpage for more information. For organizational purposes, we recommend keeping these files in the data/tax_ref subdirectory that is created after you run makeDataDirectories() (see "Input data" below).
  2. For trimming of ITS sequence primers, you will need to install cutadapt. Installation instructions can be found in the cutadapt documentation. Once installed, you can tell neonMicrobe where to look for it by specifying the cutadapt_path argument each time you use the trimPrimerITS function. For an example, see the "Process 16S Sequences" vignette or the "Process ITS Sequences" vignette.
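
Before running the ITS workflow, it can help to confirm that R can actually find cutadapt. The following is a minimal sketch using only base R. `find_cutadapt()` is a hypothetical helper, not part of neonMicrobe; it looks for an executable on the system PATH and returns its full path, which could then be supplied as the cutadapt_path argument of trimPrimerITS.

```r
# Hypothetical helper (not part of neonMicrobe): locate an executable on the
# system PATH. Returns the full path, or NA if nothing was found.
find_cutadapt <- function(exe = "cutadapt") {
  path <- unname(Sys.which(exe))
  if (is.na(path) || !nzchar(path)) {
    return(NA_character_)
  }
  path
}

cutadapt_path <- find_cutadapt()
if (is.na(cutadapt_path)) {
  message("cutadapt was not found on the PATH; install it or set its full path manually.")
} else {
  message("cutadapt found at: ", cutadapt_path)
}
```

If cutadapt lives somewhere that is not on the PATH (for example inside a conda environment), the full path can simply be written out by hand instead.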

Quick start

The following R script makes use of neonMicrobe to create ASV tables for 16S sequences collected from three NEON sites in the Great Plains:

Analyze NEON Great Plains 16S Sequences

Overview

Tutorials for neonMicrobe are available in the vignettes directory, and some are also linked here:

  1. Download NEON Data – how to use the functions in this package to download the specific scope of NEON soil microbe marker gene sequence data and associated data relevant to your analysis. Leverages the neonUtilities R package.
  2. Process 16S Sequences and Process ITS Sequences – how to use the functions in this package to process raw NEON marker gene sequences into ASV tables. Leverages the dada2 R package. The dada denoising algorithm partitions reads into amplicon sequence variants (ASVs), which are finer in resolution than OTUs.
  3. Add Environmental Variables to 16S Data – how to use the functions in this package to add associated environmental variables to the ASV tables. Joins the data together in the form of one or more phyloseq objects.
  4. (Optional) Sensitivity Analysis – how to use the functions in this package to test the effects of quality filtering parameters and decisions on the resulting ecological inference. Can be used to test for an acceptable range of custom parameters.
  5. (Coming soon) Processing Batches – how to use the processing-batch feature to keep track of the parameters used to create various sets of output data.

File storage structure

Input data

The Download NEON Data vignette demonstrates how to download NEON data, optionally writing to the file system. By default, the input data is downloaded into the following structure, which is created in the working directory after running makeDataDirectories():

[Figure: input data directory structure and linkages between NEON data products]

The tree structure in the upper-left of the figure represents the data directory structure constructed within the project root directory. Red dotted lines represent explicit linkages between NEON data products via shared data fields.

  (a) Sequence metadata is downloaded from NEON data product DP1.10108.001 (Soil microbe marker gene sequences) using the downloadSequenceMetadata() function.
  (b) Raw microbe marker gene sequence data is downloaded from NEON based on the sequence metadata using the downloadRawSequenceData() function.
  (c) Soil physical and chemical data is downloaded from NEON data product DP1.10086.001 using the downloadRawSoilData() function.
  (d) Taxonomic reference datasets (e.g. SILVA, UNITE) are added separately by the user.

Output data

The Process (16S/ITS) Sequences and Add Environmental Variables to 16S Data vignettes demonstrate how to process the NEON data inputs into useful sample-abundance tables with accompanying environmental data.

By default, output data from neonMicrobe is written to the outputs/ directory.

─ outputs
  ├── mid_process
  │   ├── 16S
  │   └── ITS
  └── track_reads
      ├── 16S
      └── ITS

The mid_process/ subdirectory contains files in the middle of being processed -- for example, fastq files that have been trimmed or filtered, and sequencing run-specific ASV tables that have not yet been joined together. Once the desired outputs have been created, you may choose to clear the contents of mid_process/, or leave them to retrace your processing steps.
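
Clearing mid_process/ can be done by hand, but a small base R sketch makes the idea concrete. `clear_mid_process()` is a hypothetical helper, not a neonMicrobe function; it assumes the default outputs/ layout shown above and deletes the contents of mid_process/16S and mid_process/ITS while leaving the folders themselves (and track_reads/) untouched. The demonstration runs in a temporary directory with a made-up file name.

```r
# Hypothetical helper (not part of neonMicrobe): empty mid_process/16S and
# mid_process/ITS, keeping the directories themselves in place.
clear_mid_process <- function(outputs_dir = "outputs") {
  gene_dirs <- file.path(outputs_dir, "mid_process", c("16S", "ITS"))
  contents <- list.files(gene_dirs, full.names = TRUE,
                         all.files = TRUE, no.. = TRUE)
  unlink(contents, recursive = TRUE)
  invisible(contents)  # paths that were targeted for deletion
}

# Demonstration in a temporary directory, so no real outputs are touched.
demo_outputs <- file.path(tempdir(), "outputs")
for (gene in c("16S", "ITS")) {
  dir.create(file.path(demo_outputs, "mid_process", gene),
             recursive = TRUE, showWarnings = FALSE)
  dir.create(file.path(demo_outputs, "track_reads", gene),
             recursive = TRUE, showWarnings = FALSE)
}
# A made-up intermediate file standing in for a filtered fastq.
invisible(file.create(
  file.path(demo_outputs, "mid_process", "16S", "example_filtered.fastq.gz")
))

removed <- clear_mid_process(demo_outputs)
```

Because deletion is irreversible, it is worth inspecting the returned paths (or listing the directory first) before pointing such a helper at real outputs.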

The track_reads/ subdirectory contains tables tracking the number of reads remaining at each step in the pipeline, from the "raw" sequence files downloaded from NEON to the ASV table. These tables can be useful for pinpointing steps and samples for which an unusual number of reads were lost.
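
As a rough illustration of how such a table might be used, the sketch below computes the fraction of reads retained at each step and flags sample/step combinations with heavy losses. The table is a toy example: the column names, sample names, counts, and the 50% threshold are all assumptions made for illustration, and the actual read-tracking tables written by neonMicrobe may be laid out differently.

```r
# Toy read-tracking table; column names and counts are made up for illustration.
track <- data.frame(
  sample   = c("sampleA", "sampleB", "sampleC"),
  raw      = c(10000, 12000, 8000),
  filtered = c(9000, 3000, 7200),
  denoised = c(8500, 2800, 7000),
  nonchim  = c(8000, 2500, 6800)
)

# Fraction of reads kept at each step, relative to the previous step.
steps <- c("raw", "filtered", "denoised", "nonchim")
counts <- as.matrix(track[, steps])
retained <- counts[, -1, drop = FALSE] / counts[, -ncol(counts), drop = FALSE]
rownames(retained) <- track$sample

# Flag sample/step combinations that keep less than half of the incoming reads.
flagged <- which(retained < 0.5, arr.ind = TRUE)
low_retention <- data.frame(
  sample = rownames(retained)[flagged[, "row"]],
  step   = colnames(retained)[flagged[, "col"]]
)
low_retention
```

In this toy table only sampleB at the filtering step is flagged, since it keeps 3000 of 12000 reads (25%).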

(Coming soon: When the processing batch feature is released, the default outputs directory will be switched to batch_outputs. More on this later!)