[go: up one dir, main page]

Skip to content

Scripts and data used during the analysis of LINE-1 RACE seq tail-seq output, for the publication "Uridilation by TUT4/7 restricts retrotransposition of human LINE-1"s (Warkocki et al 2018, Cell, available at https://doi.org/10.1016/j.cell.2018.07.022)

License

Notifications You must be signed in to change notification settings

mywanuo/LINE_1_RACE_seq_analysis

 
 

Repository files navigation

RACE-seq

Overview of the procedure

  • Run tailseeker3 on your data
  • if required - perform additional demultiplexing based on primer sequences (we used modified sabre for this purpose, available as a fork at https://github.com/smaegol/sabre)
  • create samplesheet describing files to be analyzed, and they important features
  • for LINE sequences (or other repetitive elements) identification run repeatmasker using repeatmasker.sh script (may take a long time)
  • run analyze_race_seq.py script to get tails analysis done
  • perform additional analyzes in R using attached scripts

Requirements

Detailed procedure

  1. Basecalling, deduplication, identification of A-tails

    This step is done using tailseeker software. Configuration of tailseeker run is provided in the tailseeker.yaml file which can be found in the flowcell1/flowcell2 folders. Before running the tailseeker modify the file to provide the proper path to flowcell data (dir variable). Tailseeker will generate R5 and R3 files, which are deduplicated (based on UMI sequences), and provide information about lengths of A-tails (and possible additions at the end of a tail). Tailseeker is run using command:

 tseek -j
  1. Demultiplexing (only for Seq2-5)

    Demultiplexing is done using sabre. The code of sabre was modified to include primer sequences in the output (apropriate pull request sent to the sabre developer).

    Script prepared for this purpose require the barcode files used for demultiplex (which can be found in flowcell2/sabre_barcodes) are located in the same folder as fastq files which will be demultiplexed (folder fastq in the output of tailseeker).

    To perform demultiplexing copy demultiplex_sabre.sh to the fastq folder and run:

 ./demultiplex_sabre.sh
  1. LINE1 sequences identification

    To identify LINE1 sequences in demultiplexed reads RepeatMasker is used.

    Fastq sequences are first converted to fasta using fastq_to_fasta from fastx_toolkit. Then, RepeatMasker is run over the LINE1-specific database. Obtained hits are parsed using parseRM_simple.pl from Parsing-RepeatMasker-Outputs and analyzed using identify_LINE_repeatmasker_softclip.py to get information about location of LINE1 in the sequencing reads and about non-templated nucleotides (possible tails).

    Scripts identify_LINE_repeatmasker_softclip.py and identify_LINE_repeatmasker_softclip_R3.py must be copied to the processing_out_sabre folder. repeatmasker.sh should be run in the same folder.

 ./repeatmasker.sh
This part of analysis can be time-consuming.
  1. Tails analysis

    In the next step the actual analysis is done. For the LINE1 sequences the information about non-templated nucleotides (possible tails) is already obtained. For other sequences (GAPDH, reporter LINE1) it is retrieved by mapping using bowtie2 with --very-sensitive-local option to get soft-clipping. Soft-clipped fragments are then retrieved using get_softclipped_reads_from_sam.pl script.

    Analysis is run on the all files with names ending with _R5.fastq in the folder specified with the --inputdir option. More specific selection can be done with the --glob option of the analysis script.

    For the all analyzed files a samplesheet is required, which contains all information regarding the samples, including experimental conditions, primer used, transcript, etc. This file must be prepared before the analysis. Example is located in the flowcell2 folder (samplesheet.csv). Default path to the samplesheet is provided in the script, but the alternative one can be provided with the --samplesheet option.

    Script performs also calculation of the terminal nucleotides composition (used for the generation of sequence logos). To change the default sequence window used for calculation (7 nt) one can use the --window option

    As the output (specified with the --output option) a tsv file is generated, containing tailing information for each sequence analysed, as well as additional data regarding the procedure.

    Before running the script it is required to customize settings (at the beginning of the script), like the path to bowtie2, number of threads it can use, names and locations of bowtie2 indexes.

    Analysis for the flowcell2 can be run using:

 ./analyze_race_seq.py --inputdir processing_out_sabre/ --output flowcell2_output.tsv
  1. Statistical analysis, plots

Further analysis is done using provided R scripts. Computations may require around 20GB or RAM memory. Functions used to import and filter data and calculate various parameters (mostly the uridylation level) are located in the R/raceseq.R file

Note that for the LINE-1 reporter overexpression experiments, reads should be filtered and include only those, which end in the LINE-1 3'-UTR (mapping_position_max = 9010), to exclude plasmid-borne and artifically elongated reads.

How to get LINE-specific repeatmasker library

For the identification of LINE1 sequences we used RepeatMasker run with the library containing only human LINE1 sequences. It can be easily prepared using scripts available in the RepeatMasker distribution and HMMER. The detailed procedure is shown below:

  • in the util subfolder in the RepeatMasker base directory run

      ./queryRepeatDatabase.pl -class "LINE" -species "Homo sapiens" > LINE_sequences.fasta
    
  • extract LINEs names:

    cat LINE_sequences.fasta | grep "^>"  | cut -d' ' -f1 | sed -r 's/>(.*)/\1/' > LINE_names.txt
    
  • extract hmms for LINEs from homo sapiens LINEs library (located in Libraries/Dfam_2.0/homo_sapiens/) using hmmfetch

      hmmfetch -f masklib.hmm lines_names.txt > LINEs.hmm
    
  • press obtained hmm database

      hmmpress LINEs.hmm
    
  • use this database in repeatmasker by specifying the exact path with the -d option

About

Scripts and data used during the analysis of LINE-1 RACE seq tail-seq output, for the publication "Uridilation by TUT4/7 restricts retrotransposition of human LINE-1"s (Warkocki et al 2018, Cell, available at https://doi.org/10.1016/j.cell.2018.07.022)

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • R 46.1%
  • Python 43.6%
  • Perl 5.6%
  • Shell 4.7%