rfca

The rfca package contains functions that trains a Random Forest Model from single-cell RNA-Seq data, and allows the prediction of cell types and cell states in unlabelled datasets. It also contains a pre-trained Random Forest model, as well as example datasets.

This can be used for automatically labelling cell types for downstream applications and subsetting, looking at proportions of cell subtypes, and finding similarities in cell types, sub cell types, or cell states in different datasets.

Introduction

Manual cell annotation of scRNAseq datasets, typically based on marker genes, can be time-consuming and biased. Being able to automatically predict cell types/states in a cell-by-cell and cluster-unbiased way is useful for fast and accurate phenotyping.

In addition, despite the increasing amounts of scRNAseq datasets being generated, thorough analysis of these datasets is lagging, and/or done in silos. This package comes with a preloaded Random Forest model based on different datasets and cell types/states, that will be constantly updated.

Installation

You can install this package from GitHub with devtools::install_github("kimberle9/rfca")

Functions

Load packages and examples

library(rfca)
library(Seurat)
data("exampleSeuratObjectUnlabelled") # example dataset
data("exampleSeuratObjectLabelled") # example dataset

createSeuratObjectPipeline()

createSeuratObjectPipeline() is a function that creates a Seurat object from 10X cell ranger output, specifically the filtered_feature_bc_matrix directory containing the three files: barcodes.tsv.gz, matrix.mtx.gz, features.tsv.gz. It performs several steps in the Seurat pipeline, and returns a Seurat object with a computer pca and umap clusters as Idents (identities). These cluster Idents may be used as labels for training and building the Random Forest model with createRFModel(seuratObject) to identify similar cells in other datasets. Alternatively, Idents (such as cell type) can be manually added and used as labels.

# Create Seurat Object with PCA and UMAP calculated
mySeuratObject <- createSeuratObjectPipeline(data.dir = "~/filtered_feature_bc_matrix", 
                                             nFeature_RNA_lower = 500, 
                                             nFeature_RNA_upper = 5000, 
                                             percent.mt = 5, 
                                             nfeatures = 2000, 
                                             dims = 20, 
                                             clusterResolution = 0.8)

createRFModel()

This function should be used when there is a well-labelled dataset containing cell types you want to predict. createRFModel() is a function that generates a Random Forest model from a labelled Seurat Object, with each cell as a datapoint, and gene expression levels of the most variable genes (HVGs) as features. The default nfeatures is 200. The Idents (use table(Idents(seuratObject))) of each cell are its labels, and the values that will be predicted when the model is used in predictCells(seuratObject, theRFModel). A low OOB error rate of the Random Forest model (<5%) implies the classified cell types from the input training dataset are well defined and separated.

# Create Random Forest Model with your labelled Seurat Object
myRandomForestModel <- createRFModel(exampleSeuratObjectLabelled, nfeatures = 200)
#> 
#> Call:
#>  randomForest(formula = V1 ~ ., data = merged, importance = TRUE,      proximity = TRUE, ntree = 400, na.action = na.roughfix) 
#>                Type of random forest: classification
#>                      Number of trees: 400
#> No. of variables tried at each split: 13
#> 
#>         OOB estimate of  error rate: 4.46%
#> Confusion matrix:
#>                  Astrocytes Endothelial Microglia Neurons Oligodendrocytes OPCs
#> Astrocytes               45           0         0       0                2    0
#> Endothelial               0          51         2       1                0    0
#> Microglia                 0           0        47       0                0    0
#> Neurons                   0           0         0      48                1    2
#> Oligodendrocytes          0           0         0       1               46    0
#> OPCs                      0           0         1       1                1   47
#> Tcells                    0           1         0       1                0    0
#> VSMCs                     1           0         1       0                0    0
#>                  Tcells VSMCs class.error
#> Astrocytes            0     0  0.04255319
#> Endothelial           0     0  0.05555556
#> Microglia             0     0  0.00000000
#> Neurons               1     0  0.07692308
#> Oligodendrocytes      0     0  0.02127660
#> OPCs                  1     0  0.07843137
#> Tcells               50     0  0.03846154
#> VSMCs                 0    52  0.03703704

markerGeneList()

Given a Random Forest model, the markerGeneList() function is used to create lists of marker genes for each of the labels in the model. The dafault number of genes to return is 10. For each label in the Random Forest model, the top 10 most important genes for that label is returned, with the label as the name of that gene list. The returned list of list can be used in the cellMarkerPlots() function to visualize the combined/average expression of a set of genes in a single UMAP plot.

# Create marker gene list from random forest model
markerGeneList <- createGeneLists(myRandomForestModel)
#> [1] "Adding gene list for Astrocytes"
#> [1] "Adding gene list for Endothelial"
#> [1] "Adding gene list for Microglia"
#> [1] "Adding gene list for Neurons"
#> [1] "Adding gene list for Oligodendrocytes"
#> [1] "Adding gene list for OPCs"
#> [1] "Adding gene list for Tcells"
#> [1] "Adding gene list for VSMCs"

cellMarkerPlots()

The cellMarkerPlots() function takes in a Seurat object and a list of gene list (list of list) and calculates the average expression of each gene list for each cell. A feature plot will be generated for each gene list, titled with the name of the name of the gene list, usually the labels of the Random Forest model if the list of gene list is generated with createGeneList(). Check documentation for params to customize the look of the feature plot. Alternatively, the list of gene lists can be generated manually (eg. with your favorite cell type markers) and can be used for quicker annotation of clusters in an unlabelled dataset.

# Visualize Feature Plot based on marker gene list
myPlot <- cellMarkerPlots(exampleSeuratObjectLabelled, geneList = markerGeneList)
#> [1] "Calculating average expression of genes in Astrocytes list."
#> [1] "Calculating average expression of genes in Endothelial list."
#> [1] "Calculating average expression of genes in Microglia list."
#> [1] "Calculating average expression of genes in Neurons list."
#> [1] "Calculating average expression of genes in Oligodendrocytes list."
#> [1] "Calculating average expression of genes in OPCs list."
#> [1] "Calculating average expression of genes in Tcells list."
#> [1] "Calculating average expression of genes in VSMCs list."
myPlot

# Make your own marker gene list
myList <- list()
myList$Oligodendrocytes <- c("Mbp", "Mobp", "Plp1")
myList$Microglia <- c("Iba1", "Cst7")

predictCells()

This function takes a Seurat object and predicts the cell type for each cell is in a cluster-unbiased way. It returns the Seurat object with new labels as Idents. The Random Forest model used can be:

  1. User-specified, where labels can be anything: cell types, cell subtypes, cell states, cluster identities etc.
  2. In-built models include mouseBrain (default), mouseLiver, mouseHeart, mouseKidney, mouseMuscle, humanPBMC to choose from (eg. predictCells(exampleSeuratObjectUnlabelled, mouseMuscle)). If no model is supplied, mouseBrain will be used as the default.
# Predict cells based on your own Random Forest Model created above
autoLabelledSeuratObject <- predictCells(exampleSeuratObjectUnlabelled, myRandomForestModel)
#> 
#>       Astrocytes      Endothelial        Microglia          Neurons 
#>               55               45               58               43 
#> Oligodendrocytes             OPCs           Tcells            VSMCs 
#>               55               49               48               43

# Predict cells based on my pre-loaded and pre-trained Random Forest Model
autoLabelledSeuratObject <- predictCells(exampleSeuratObjectUnlabelled)
#> 
#>       Astrocytes      Endothelial        Microglia          Neurons 
#>               53               46               53               48 
#> Oligodendrocytes             OPCs           Tcells            VSMCs 
#>               53               49               48               46

# Visualize autoLabelledSeuratObject
DimPlot(autoLabelledSeuratObject)

Datasets

The example datasets included, exampleSeuratObjectUnlabelled and exampleSeuratObjectLabelled are from mouse brain tissues. They are ~400 cells each and are in the Seurat object format. For exampleSeuratObjectUnlabelled, Idents are cluster numbers. For exampleSeuratObjectLabelled, Idents are cell types.

Some data used for training the different tissue models are from: https://tabula-muris.ds.czbiohub.org/

Examples

Some examples can be found in the README.