vignettes/articles/Helpers_and_Utilities.Rmd
Helpers_and_Utilities.Rmd
scCustomize has several helper functions to simplify/streamline common tasks in scRNA-seq analysis. Let’s load packages and raw data object for this tutorial.
# Load Packages
library(ggplot2)
library(dplyr)
library(magrittr)
library(patchwork)
library(Seurat)
library(scCustomize)
pbmc <- pbmc3k.SeuratData::pbmc3k
We’ll add some random meta data variables to pbmc data form use in this vignette
pbmc$sample_id <- sample(c("sample1", "sample2", "sample3", "sample4", "sample5", "sample6"), size = ncol(pbmc),
replace = TRUE)
pbmc$batch <- sample(c("Batch1", "Batch2"), size = ncol(pbmc), replace = TRUE)
As discussed in QC Plotting vignette one the first steps after creating object if often to calculate and add mitochondrial and ribosomal count percentages per cell/nucleus.
Add_Mito_Ribo()
scCustomize contains easy wrapper function to automatically add both
Mitochondrial and Ribosomal percentages to meta.data slot. If you are
using mouse, human, marmoset, zebrafish, rat, drosophila, or rhesus
macaque data all you need to do is specify the species
parameter.
# These defaults can be run just by providing accepted species name
pbmc <- Add_Mito_Ribo(object = pbmc, species = "human")
To view list of accepted values for default species names simply set
list_species_names = TRUE
.
Add_Mito_Ribo(list_species_names = TRUE)
Mouse_Options | Human_Options | Marmoset_Options | Zebrafish_Options | Rat_Options | Drosophila_Options | Macaque_Options | |
---|---|---|---|---|---|---|---|
1 | Mouse | Human | Marmoset | Zebrafish | Rat | Drosophila | Macaque |
2 | mouse | human | marmoset | zebrafish | rat | drosophila | macaque |
3 | Ms | Hu | CJ | DR | RN | DM | Rhesus |
4 | ms | hu | Cj | Dr | Rn | Dm | macaca |
5 | Mm | Hs | cj | dr | rn | dm | mmulatta |
6 | mm | hs | NA | NA | NA | NA | NA |
However custom prefixes can be used for
non-human/mouse/marmoset/rat/zebrafish/drosophila/macaque species with
different annotations. Simply specify species = other
and
supply feature lists or regex patterns for your species of interest.
NOTE: If desired please submit issue on GitHub for additional
default species. Please include regex pattern or list of genes for both
mitochondrial and ribosomal genes and I will add additional built-in
defaults to the function.
# Using gene name patterns
pbmc <- Add_Mito_Ribo(object = pbmc, species = "other", mito_pattern = "regexp_pattern", ribo_pattern = "regexp_pattern")
# Using feature name lists
mito_gene_list <- c("gene1", "gene2", "etc")
ribo_gene_list <- c("gene1", "gene2", "etc")
pbmc <- Add_Mito_Ribo(object = pbmc, species = "other", mito_features = mito_gene_list, ribo_features = ribo_gene_list)
# Using combination of gene lists and gene name patterns
pbmc <- Add_Mito_Ribo(object = pbmc, species = "Human", mito_features = mito_gene_list, ribo_pattern = "regexp_pattern")
The added benefit of Add_Mito_Ribo
is that it will
return informative warnings if no Mitochondrial or Ribosomal features
are found using the current species, features, or pattern
specification.
# For demonstration purposes we can set `species = mouse` for this object of human cells
pbmc <- Add_Mito_Ribo(object = pbmc, species = "mouse")
## Error in `Add_Mito_Ribo()`:
## ! No Mito or Ribo features found in object using patterns/feature list
## provided.
## ℹ Please check pattern/feature list and/or gene names in object.
# Or if providing custom patterns/lists and features not found
pbmc <- Add_Mito_Ribo(object = pbmc, species = "other", mito_pattern = "^MT-", ribo_pattern = "BAD_PATTERN")
## Warning: No Ribo features found in object using pattern/feature list provided.
## ℹ No column will be added to meta.data.
Add_Mito_Ribo
will also return warnings if columns are
already present in @meta.data
slot and prompt you to
provide override if you want to run the function.
pbmc <- Add_Mito_Ribo(object = pbmc, species = "human")
## Error in `Add_Mito_Ribo()`:
## ! Columns with "percent_mito" and/or "percent_ribo" already present in
## meta.data slot.
## ℹ *To run function and overwrite columns set parameter `overwrite = TRUE` or
## change respective `mito_name`, `ribo_name`, and/or `mito_ribo_name`*
Add_Mito_Ribo()
scCustomize Add_Mito_Ribo
also works seemlessly with
LIGER objects.
liger_obj <- Add_Mito_Ribo(object = liger_obj, species = "human")
In addition to metrics like number of features and UMIs it can often be helpful to analyze the complexity of expression within a single cell. scCustomize provides functions to add two of these metrics to meta data.
scCustomize contains easy shortcut function to add a measure of cell complexity/novelty that can sometimes be useful to filter low quality cells. The metric is calculated by calculating the result of log10(nFeature) / log10(nCount).
# These defaults can be run just by providing accepted species name
pbmc <- Add_Cell_Complexity(object = pbmc)
NOTE: The function also works seemlessly with LIGER objects.
Additionally, (or alternatively), scCustomize contains another metric
of complexity which is the top percent expression. The user supplies an
integer value for num_top_genes
(default is 50) which
species the number of genes and the function returns percentage of
counts occupied by top XX genes in each cell.
# These defaults can be run just by providing accepted species name
pbmc <- Add_Top_Gene_Pct_Seurat(seurat_object = pbmc, num_top_genes = 50)
To simplify the process of adding cell QC metrics scCustomize
contains a wrapper function which can be customized to add all or some
of the available QC metrics. The default parameters of the function
Add_Cell_QC_Metrics
will add:
pbmc <- Add_Cell_QC_Metrics(seurat_object = pbmc, species = "human")
scCustomize contains a set of functions to aid in use of meta data both within and outside of objects.
Fetch_Meta()
functions as simple getter function to
obtain meta data from object and return data.frame.
meta_data <- Fetch_Meta(object = pbmc)
head(meta_data, 10)
orig.ident | nCount_RNA | nFeature_RNA | seurat_annotations | sample_id | batch | |
---|---|---|---|---|---|---|
AAACATACAACCAC | pbmc3k | 2419 | 779 | Memory CD4 T | sample3 | Batch2 |
AAACATTGAGCTAC | pbmc3k | 4903 | 1352 | B | sample5 | Batch2 |
AAACATTGATCAGC | pbmc3k | 3147 | 1129 | Memory CD4 T | sample4 | Batch2 |
AAACCGTGCTTCCG | pbmc3k | 2639 | 960 | CD14+ Mono | sample1 | Batch2 |
AAACCGTGTATGCG | pbmc3k | 980 | 521 | NK | sample4 | Batch1 |
AAACGCACTGGTAC | pbmc3k | 2163 | 781 | Memory CD4 T | sample6 | Batch2 |
AAACGCTGACCAGT | pbmc3k | 2175 | 782 | CD8 T | sample3 | Batch1 |
AAACGCTGGTTCTT | pbmc3k | 2260 | 790 | CD8 T | sample4 | Batch2 |
AAACGCTGTAGCCA | pbmc3k | 1275 | 532 | Naive CD4 T | sample6 | Batch2 |
AAACGCTGTTTCTG | pbmc3k | 1103 | 550 | FCGR3A+ Mono | sample3 | Batch2 |
While cell-level meta data is helpful in some situations often all
that is required is sample-level meta data. This can easily be extracted
and filtered using Extract_Sample_Meta()
.
sample_meta <- Extract_Sample_Meta(object = pbmc, sample_name = "sample_id")
orig.ident | seurat_annotations | sample_id | batch | |
---|---|---|---|---|
1 | pbmc3k | CD14+ Mono | sample1 | Batch2 |
2 | pbmc3k | NA | sample2 | Batch2 |
3 | pbmc3k | Memory CD4 T | sample3 | Batch2 |
4 | pbmc3k | Memory CD4 T | sample4 | Batch2 |
5 | pbmc3k | B | sample5 | Batch2 |
6 | pbmc3k | Memory CD4 T | sample6 | Batch2 |
As you can see by default Extract_Sample_Meta
removes a
default set of columns (see documentation) which do not provide
meaningful sample-level information (e.g., nFeature_RNA). However, you
may want to remove other columns too. This can be achieved using either
positive or negative selection using variables_include
or
variables_exclude
parameters.
sample_meta <- Extract_Sample_Meta(object = pbmc, sample_name = "sample_id", variables_exclude = c("nFeature_RNA",
"nCount_RNA", "seurat_annotations", "orig.ident"))
sample_id | batch | |
---|---|---|
1 | sample1 | Batch2 |
2 | sample2 | Batch2 |
3 | sample3 | Batch2 |
4 | sample4 | Batch2 |
5 | sample5 | Batch2 |
6 | sample6 | Batch2 |
While some original number columns are not valid at sample-level it
can be valuable to get summary information for those variables. This can
be achieved by merging outputs with Median_Stats
function.
sample_meta <- Extract_Sample_Meta(object = pbmc, sample_name = "sample_id", variables_exclude = c("nFeature_RNA",
"nCount_RNA", "seurat_annotations", "orig.ident"))
sample_median <- Median_Stats(seurat_object = pbmc, group_by_var = "sample_id")
sample_merged <- right_join(x = sample_meta, y = sample_median)
sample_id | batch | Median_nCount_RNA | Median_nFeature_RNA | |
---|---|---|---|---|
1 | sample1 | Batch2 | 2237.0 | 830.0 |
2 | sample2 | Batch2 | 2223.0 | 810.0 |
3 | sample3 | Batch2 | 2197.0 | 816.0 |
4 | sample4 | Batch2 | 2234.5 | 815.5 |
5 | sample5 | Batch2 | 2110.0 | 797.5 |
6 | sample6 | Batch2 | 2168.0 | 820.0 |
7 | Totals (All Cells) | NA | 2196.0 | 816.0 |
scCustomize provides easy function to add sample-level meta data to object without the need to first convert it to cell-level meta data. This makes adding meta data from summary or supplemental tables to cell level object data very easy.
In order to add meta data you will need to specify:
@meta.data
column that matches
sample-level meta data (often “orig.ident”).This is example command:
obj <- Add_Sample_Meta(seurat_object = obj, meta_data = sample_meta, join_by_seurat = "orig.ident",
join_by_meta = "sample_id")
Starting in Seurat V5 each assay now possess it’s own meta.data slot which is feature-level meta data. During course of normal analysis this is where information on variable features is stored. However, we can also use it to store alternate feature names, in most cases this is Ensembl IDs matching the symbols used in object creation/analysis.
scCustomize provides the function Add_Alt_Feature_ID()
to automatically match and add these features using the same files used
in object creation. Users only need to supply either path to the
features.tsv.gz file or the hdf5 file produced from Cell Ranger
output.
# Using features.tsv.gz file
obj <- Add_Alt_Feature_ID(seurat_object = obj,
features_tsv = "sample01/outs/filtered_feature_bc_matrix/features.tsv.gz", assay = "RNA")
# Using hdf5 file
obj <- Add_Alt_Feature_ID(seurat_object = obj,
hdf5_file = "sample01/outs/outs/filtered_feature_bc_matrix.h5"", assay = "RNA")
NOTE: If using features.tsv.gz file the file from either filtered or raw outputs can be used as they are identical.
NOTE: If using hdf5 file the file from either filtered_feature_bc or raw_feature_bc can be used as the features slot is identical. Though it is faster to load filtered_feature_bc file due to droplet filtering.
scCustomize also makes forward-facing a number of utilities that are used internally in functions but may also have utility on their own.
Gene_Present()
to check for features.
Gene_Present
is fairly basic function to check if
feature exists in data. It can be used with Seurat or LIGER objects as
well as generic data formats (Matrix, data.frame, tibble).
In addition to some warning messages Gene_Present
returns a list with 3 entries when run:
bad_features
> 0 then
Gene_Present
will convert
the gene list
bad_features` to all upper case and to
sentence case and check against all possible features to see if wrong
case was provided.
# Example gene list with all examples (found genes, wrong case (lower) and misspelled (CD8A
# forgetting to un-shift when typing 8))
gene_input_list <- c("CD14", "CD3E", "Cd4", "CD*A")
genes_present <- Gene_Present(data = pbmc, gene_list = gene_input_list)
## Warning: `Gene_Present()` was deprecated in scCustomize 2.1.0.
## ℹ Please use `Feature_Present()` instead.
## ℹ Please adjust code now to prepare for full deprecation.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The following features were omitted as they were not found:
## ℹ Cd4 and CD*A
## Warning: NOTE: However, the following features were found: CD4
## ℹ Please check intended case of features provided.
Now let’s look at the output:
genes_present
## $found_features
## [1] "CD14" "CD3E"
##
## $bad_features
## [1] "Cd4" "CD*A"
##
## $wrong_case_found_features
## [1] "CD4"
By default Gene_Present
has 3 sets of warnings/messages
it prints to console when it finds issues. If using the function
yourself on its own or wrapped inside your own function and you prefer
no messages each of these can be toggled using optional parameters.
case_check_msg
prints and list of features if alternate
case features are found in data.omit_warn
prints warning and list of all features not
found in data.print_msg
prints message if all features in
gene_list
are found in data.In order to keep run times down and support offer greater support for
offline use Gene_Present
does not include a check for
updated gene symbols. If you’re dataset is from human cells/donors you
can simply supply the not found features from Gene_Present
to Seurat’s UpdateSymbolList
function.
gene_input_list <- c("CD14", "CD3E", "Cd4", "CD*A", "SEPT1")
genes_present <- Gene_Present(data = pbmc, gene_list = gene_input_list)
## Warning: The following features were omitted as they were not found:
## ℹ Cd4 and CD*A
## Warning: NOTE: However, the following features were found: CD4
## ℹ Please check intended case of features provided.
check_symbols <- UpdateSymbolList(symbols = genes_present[[2]], verbose = TRUE)
## Warning: No updated symbols found
It can often be advantageous to merge raw data before creating
analysis objects vs creating lots of objects and merging them all later.
scCustomize features a modified version of the internal LIGER function
MergeSparseDataAll()
.
Merge_Sparse_Data_All()
will combine a list of sparse
matrices and return single sparse matrix. Additionally, by specifying
the add_cell_ids
parameter you can specify a prefix to be
added to the barcodes from each entry in the list (using “_” as
delimiter).
This function can be especially useful when combined with any of the
scCustomize’s Read_
data functions which automatically
return named lists of matrices and the ability to specify sample
orig.ident
when creating Seurat objects. See Read
& Write Vignette for more info on the data import functions.
# Read in data
GEO_10X <- Read10X_GEO(data_dir = "assets/GSE152183_RAW_Marsh/")
# Merge data and add sample prefix
GEO_10X_merged <- Merge_Sparse_Data_All(matrix_list = GEO_10X, add_cell_ids = names(GEO_10X))
# Create Seurat Object and specify orig.ident location
GEO_10X_Seurat <- Seurat::CreateSeuratObject(counts = GEO_10X_merged, names.field = 1, names.delim = "_",
min.features = 200, min.cells = 5)
Sometimes it can be advantageous to create a list of multiple Seurat Objects in order to run similar pipeline on all objects in loop.
To facilitate ease in merging such lists into single object
scCustomize contains simple wrapper Merge_Seurat_List
that
uses purrr::reduce()
to merge all objects in list into
single combined object
list_of_objects <- list(obj1, obj2, obj2, ..., obj10)
merged_seurat <- Merge_Seurat_List(list_seurat = list_of_objects)
# Can also add sample specific ids to each object during the merge
cell_ids <- c("sample1", "sample2", "sample3", ..., "sample10")
merged_seurat <- Merge_Seurat_List(list_seurat = list_of_objects, add.cell.ids = cell_ids)
Seurat V5 objects now have the ability to split within the object
into layers. However, I find that the syntax to do this is not the most
intuitive and can be simplified with a new simple wrapper function:
Split_Layers()
pbmc <- Split_Layers(seurat_object = pbmc, split.by = "sample_id")
## • Splitting layers within assay: RNA into 6 parts by "sample_id"
## ℹ RNA is not Assay5, converting to Assay5 before splitting.
Split_Layers()
defaults to “RNA” assay but can be used
for any assay present in object (users should check whether splitting
assay other than “RNA” is valid before proceeding).
Seurat objects contain an extra empty slot that can be used to store
any extra information desired.
scCustomize contains two functions Store_Misc_Info_Seurat
and a wrapper around that function Store_Palette_Seurat
to
make this process easy.
# Data can be vectors or data.frames
misc_info <- "misc_vector_dataframe_list_etc"
# Add data to the @misc slot in Seurat Object
pbmc <- Store_Misc_Info_Seurat(seurat_object = pbmc, data_to_store = misc_info, data_name = "misc_info_name")
If you are storing a list in the @misc
slot there is
additional parameter that dictates whether to store the information as a
list or whether to store each entry in the list separately.
# Create list
misc_info <- list("misc_item1", "misc_item2", etc)
# Store the list directly
pbmc <- Store_Misc_Info_Seurat(seurat_object = pbmc, data_to_store = misc_info, data_name = "misc_info_name",
list_as_list = TRUE)
# Store each entry in list as separate entity in `@misc` slot
pbmc <- Store_Misc_Info_Seurat(seurat_object = pbmc, data_to_store = misc_info, data_name = "misc_info_name",
list_as_list = FALSE)
One of the most common times I use this function is to store color
palettes associated with clustering or subclustering.
To make it easier to remember function call in this situation
scCustomize contains a wrapper function
Store_Palette_Seurat
.
# Data can be vectors or data.frames
annotated_color_palette <- c("color1", "color2", "color3", "etc")
# Add data to the @misc slot in Seurat Object
pbmc <- Store_Palette_Seurat(seurat_object = pbmc, palette = annotated_color_palette, palette_name = "Round01_Color_Pal")
# Then you can easily call that palette (with tab completion) when plotting without ever
# needing to reload the palette in current environment
DimPlot(object = pbmc, cols = pbmc@misc$Round01_Color_Pal)
Sometimes, especially with public data, you may want to modify the cell barcode names before creating analysis object.
scCustomize contains a selection of functions to simplify this process:
Replace_Suffix
can be used on single matrix/data.frame
or list of matrices/data.frames to modify to remove suffixes
# For single object
data_mod <- Replace_Suffix(data = raw_data, current_suffix = "-1", new_suffix = "-2")
# For list of objects containing same suffix
raw_data_list <- list(raw_data1, raw_data2, raw_data3, ..., raw_data10)
new_suffixes <- c("-1", "-2", "-3", ..., "-10")
data_mod <- Replace_Suffix(data = raw_data_list, current_suffix = "-1", new_suffix = new_suffixes)
# For list of objects containing different suffixes
raw_data_list <- list(raw_data1, raw_data2, raw_data3, ..., raw_data10)
old_suffixes <- c("-A", "-B", "-C", ..., "-J")
new_suffixes <- c("-1", "-2", "-3", ..., "-10")
data_mod <- Replace_Suffix(data = raw_data_list, current_suffix = old_suffixes, new_suffix = new_suffixes)
Replace_Suffix
can also be used to strip suffixes from
data
# For single object
data_mod <- Replace_Suffix(data = raw_data, current_suffix = "-1", new_suffix = "")
scCustomize has 3 functions to facilitate changing the type of delimiters present in cell barcodes.
Change_Delim_Prefix()
Change just the suffix
delimiter.Change_Delim_Suffix()
Change just the prefix
delimiter.Change_Delim_All()
Change all delimiters.These functions all take identical inputs and can be applied to either single matrix/data.frames or lists of matrices/data.frames.
data_mod <- Change_Delim_Prefix(data = raw_data, current_delim = ".", new_delim = "_")
data_mod <- Change_Delim_Suffix(data = raw_data, current_delim = ".", new_delim = "_")
data_mod <- Change_Delim_All(data = raw_data, current_delim = ".", new_delim = "_")