- Run tailseeker3 on your data
- if required - perform additional demultiplexing based on primer sequences (we used modified sabre for this purpose, available as a fork at https://github.com/smaegol/sabre)
- create samplesheet describing files to be analyzed, and they important features
- for LINE sequences (or other repetitive elements) identification run repeatmasker using
repeatmasker.sh
script (may take a long time) - run
analyze_race_seq.py
script to get tails analysis done - perform additional analyzes in R using attached scripts
-
repeatmasker with the repbase libraries downloaded and installed
-
Parsing-RepeatMasker-Outputs scripts downloaded
-
Python 3
-
Python libraries installed:
- biopython
- pandas
- numpy
- subprocess
-
Perl libraries installed:
- bioperl
- Getopt
- Term::ProgressBar
-
R together with:
- ggplot2
- plyr
- dplyr
- ggseqlogo
-
Basecalling, deduplication, identification of A-tails
This step is done using tailseeker software. Configuration of tailseeker run is provided in the tailseeker.yaml file which can be found in the flowcell1/flowcell2 folders. Before running the tailseeker modify the file to provide the proper path to flowcell data (dir variable). Tailseeker will generate R5 and R3 files, which are deduplicated (based on UMI sequences), and provide information about lengths of A-tails (and possible additions at the end of a tail). Tailseeker is run using command:
tseek -j
-
Demultiplexing (only for Seq2-5)
Demultiplexing is done using sabre. The code of sabre was modified to include primer sequences in the output (apropriate pull request sent to the sabre developer).
Script prepared for this purpose require the barcode files used for demultiplex (which can be found in
flowcell2/sabre_barcodes
) are located in the same folder as fastq files which will be demultiplexed (folderfastq
in the output of tailseeker).To perform demultiplexing copy
demultiplex_sabre.sh
to thefastq
folder and run:
./demultiplex_sabre.sh
-
LINE1 sequences identification
To identify LINE1 sequences in demultiplexed reads RepeatMasker is used.
Fastq sequences are first converted to fasta using
fastq_to_fasta
from fastx_toolkit. Then, RepeatMasker is run over the LINE1-specific database. Obtained hits are parsed usingparseRM_simple.pl
from Parsing-RepeatMasker-Outputs and analyzed usingidentify_LINE_repeatmasker_softclip.py
to get information about location of LINE1 in the sequencing reads and about non-templated nucleotides (possible tails).Scripts
identify_LINE_repeatmasker_softclip.py
andidentify_LINE_repeatmasker_softclip_R3.py
must be copied to theprocessing_out_sabre
folder.repeatmasker.sh
should be run in the same folder.
./repeatmasker.sh
This part of analysis can be time-consuming.
-
Tails analysis
In the next step the actual analysis is done. For the LINE1 sequences the information about non-templated nucleotides (possible tails) is already obtained. For other sequences (GAPDH, reporter LINE1) it is retrieved by mapping using
bowtie2
with--very-sensitive-local
option to get soft-clipping. Soft-clipped fragments are then retrieved usingget_softclipped_reads_from_sam.pl
script.Analysis is run on the all files with names ending with
_R5.fastq
in the folder specified with the--inputdir
option. More specific selection can be done with the--glob
option of the analysis script.For the all analyzed files a samplesheet is required, which contains all information regarding the samples, including experimental conditions, primer used, transcript, etc. This file must be prepared before the analysis. Example is located in the
flowcell2
folder (samplesheet.csv
). Default path to the samplesheet is provided in the script, but the alternative one can be provided with the--samplesheet
option.Script performs also calculation of the terminal nucleotides composition (used for the generation of sequence logos). To change the default sequence window used for calculation (7 nt) one can use the
--window
optionAs the output (specified with the
--output
option) a tsv file is generated, containing tailing information for each sequence analysed, as well as additional data regarding the procedure.Before running the script it is required to customize settings (at the beginning of the script), like the path to
bowtie2
, number of threads it can use, names and locations of bowtie2 indexes.Analysis for the flowcell2 can be run using:
./analyze_race_seq.py --inputdir processing_out_sabre/ --output flowcell2_output.tsv
- Statistical analysis, plots
Further analysis is done using provided R scripts. Computations may require around 20GB or RAM memory. Functions used to import and filter data and calculate various parameters (mostly the uridylation level) are located in the R/raceseq.R
file
Note that for the LINE-1 reporter overexpression experiments, reads should be filtered and include only those, which end in the LINE-1 3'-UTR (mapping_position_max = 9010), to exclude plasmid-borne and artifically elongated reads.
For the identification of LINE1 sequences we used RepeatMasker run with the library containing only human LINE1 sequences. It can be easily prepared using scripts available in the RepeatMasker distribution and HMMER. The detailed procedure is shown below:
-
in the
util
subfolder in the RepeatMasker base directory run./queryRepeatDatabase.pl -class "LINE" -species "Homo sapiens" > LINE_sequences.fasta
-
extract LINEs names:
cat LINE_sequences.fasta | grep "^>" | cut -d' ' -f1 | sed -r 's/>(.*)/\1/' > LINE_names.txt
-
extract hmms for LINEs from homo sapiens LINEs library (located in
Libraries/Dfam_2.0/homo_sapiens/
) using hmmfetchhmmfetch -f masklib.hmm lines_names.txt > LINEs.hmm
-
press obtained hmm database
hmmpress LINEs.hmm
-
use this database in repeatmasker by specifying the exact path with the -d option