SingleCellNet
Vignette
Traditionally, single-cell annotation is based on the expression of cell type-specific markers, which precludes a quantitative assessment of cell identity. Despite many attempts to address the issue, classification and cell annotation methods remain largely restricted to a single platform. Our computational tool, SingleCellNet (SCN), implements a Random Forest classifier that learns cell type-specific gene pairs from cross-platform and cross-species datasets and thus quantitatively assesses cell identity at single-cell resolution. See our publication for more details.
This vignette demonstrates the 1) training, 2) query, and 3) visualization features of SCN. If you want to use bulk RNA-seq data, please go to bulk CellNet.
SCN provides utility functions to load rda and loom files. To load an rda file, users only need to provide the file names of the sample table and the expression matrix. To load a loom file, users need to provide the path to the loom file and specify the annotation column to use as the sample annotation.
# load rda files: one for the sample table, one for the expression matrix
sampTab <- utils_loadObject(fname) # fname = path to the sample table .rda
expDat <- utils_loadObject(fname) # fname = path to the expression matrix .rda
# load loom file
lfile <- loadLoomExpCluster(path, cellNameCol = "obs_names", xname = "cluster")
sampTab <- lfile$sampTab
expDat <- lfile$expDat
Though the direct input for SCN is a basic S3 class object, SCN incorporates utility functions to ease the transition from SingleCellExperiment and Seurat objects to SCN analysis.
# exp_slot_name options for a SingleCellExperiment object: counts, normcounts, or logcounts, if they are available
scefile <- extractSCE(sce_object, exp_slot_name = "counts")
sampTab <- scefile$sampTab
expDat <- scefile$expDat
# exp_slot_name options for a Seurat object: counts, data, or scale.data, if they are available
seuratfile <- extractSeurat(seurat_object, exp_slot_name = "counts")
sampTab <- seuratfile$sampTab
expDat <- seuratfile$expDat
In this vignette, a subset of the mouse Tabula Muris 10x dataset (The Tabula Muris Consortium et al.) is used as training data and a subset of a human peripheral blood mononuclear cell (PBMC) 10x dataset (10x Genomics) is used as query data. You can also download the example datasets here: training metadata, training expression matrix, query metadata, query expression matrix, human-mouse orthologs.
library(singleCellNet)
#loading training data
stTM <- utils_loadObject(fname = "../data/stTM_raw_060719_demo.rda") #metadata
expTMraw <- utils_loadObject(fname = "../data/expTM_raw_060719_demo.rda") #expression matrix
#loading query data
stQuery <- utils_loadObject(fname = "../data/stPBMC_demo.rda") #metadata
expQuery <- utils_loadObject(fname = "../data/expPBMC_demo.rda") #expression matrix
Since the training data are mouse scRNA-seq and the query data are human scRNA-seq, we convert orthologs between the two species using the ortholog table determined by HCOP (Seal et al., 2011) and subset the training and query data to the genes they have in common.
oTab <- utils_loadObject(fname = "../data/human_mouse_genes_Jul_24_2018.rda")
aa = csRenameOrth(expQuery = expQuery, expTrain = expTMraw, orthTable = oTab)
#> query genes in ortholog table = 15532
#> training genes in ortholog table and query data = 14550
expQueryOrth <- aa[['expQuery']]
expTrainOrth <- aa[['expTrain']]
However, if your training data and query data are from the same species, you can skip this ortholog conversion step and simply subset the datasets to their common genes before proceeding to training.
expTrain <- utils_loadObject(fname) #expression matrix
expQuery <- utils_loadObject(fname)
commonGenes<-intersect(rownames(expTrain), rownames(expQuery))
expTrain <- expTrain[commonGenes, ]
expQuery <- expQuery[commonGenes, ]
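As a quick sanity check (plain base R, not an SCN function), you can confirm that the two matrices are now aligned on the same genes:
# both matrices should now share identical, ordered gene names
stopifnot(identical(rownames(expTrain), rownames(expQuery)))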
We use the same number of cells per cell type, i.e. balanced data, to train the top-pair Random Forest (TP-RF) classifier. The remainder of the data, the held-out data, is used as validation data to evaluate the performance of the TP-RF classifier. Empirically, we have found 50 cells per type to be the minimum needed to create a classifier with good performance; however, this may vary with the quality of your reference data, so it is important to assess the performance of your classifier before querying your samples of interest.
stList<-splitCommon(sampTab = stTM, ncells = 50, dLevel = "newAnn")
#> B cell : 100
#> endothelial cell : 100
#> erythroblast : 100
#> granulocyte : 100
#> hematopoietic precursor cell : 100
#> late pro-B cell : 100
#> limb_mesenchymal : 100
#> macrophage : 100
#> mammary_basal_cell : 100
#> monocyte : 100
#> natural killer cell : 100
#> T cell : 100
#> trachea_epithelial : 100
#> trachea_mesenchymal : 100
stTrain<-stList[[1]]
expTrain <- expTrainOrth[,stTrain$cell]
stTest <- stList[[2]]
expTest <- expTrainOrth[,stTest$cell]
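As a quick check (plain base R, not an SCN function), you can tabulate how many cells of each cell type ended up in the training split and confirm that the data are balanced:
# number of training cells per cell type
table(stTrain$newAnn)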
This training step includes: 1) normalizing, log-transforming, and scaling the training data, 2) finding the top gene pairs and transforming the training data into a binary matrix, and 3) training the top-pair Random Forest classifier.
First, it is important to have the training data in raw counts form prior to this normalization and transformation step, because variability in normalization and transformation methods can affect the performance of the classifier.
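As an illustrative sanity check before training (a minimal base R sketch, not part of SCN), you can verify that the training matrix contains non-negative integer values, as expected for raw counts:
# inspect a subset of the matrix: raw counts should be non-negative integers
vals <- as.numeric(as.matrix(expTrain[, seq_len(min(100, ncol(expTrain)))]))
stopifnot(all(vals >= 0), all(vals == round(vals)))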
Once we have the properly normalized and transformed training expression matrix and metadata, we are ready to look for the topX most discriminating gene pairs for each cell type. To be computationally efficient, we restrict the gene-pair search to the topX most differentially expressed genes: within the subset of the expression matrix containing those genes, we then search for the topX most discriminating gene pairs. Note that as the number of unique cell types in the training data increases, topX should be increased accordingly for both the differentially expressed genes and the top gene pairs. After identifying the top gene pairs, we use template matching to transform the training data into a binary matrix.
After the training data are binarized using the top gene pairs, we use them to build a multi-class Random Forest classifier with 1,000 trees. We also create 70 randomized single-cell profiles by randomly permuting the expression matrix within the training data. This additional category in the multi-class Random Forest classifier serves as a "garbage collector" so that query cells whose true cell identity is not represented in the training data are not coerced into the most similar cell type in the training set.
system.time(class_info<-scn_train(stTrain = stTrain, expTrain = expTrain, nTopGenes = 10, nRand = 70, nTrees = 1000, nTopGenePairs = 25, dLevel = "newAnn", colName_samp = "cell"))
#> Sample table has been prepared
#> Expression data has been normalized
#> Finding classification genes
#> Done testing
#> There are 237 classification genes
#> Finding top pairs
#> nPairs = 190 for B cell
#> nPairs = 190 for endothelial cell
#> nPairs = 190 for erythroblast
#> nPairs = 190 for granulocyte
#> nPairs = 190 for hematopoietic precursor cell
#> nPairs = 190 for late pro-B cell
#> nPairs = 190 for limb_mesenchymal
#> nPairs = 190 for macrophage
#> nPairs = 190 for mammary_basal_cell
#> nPairs = 190 for monocyte
#> nPairs = 190 for natural killer cell
#> nPairs = 190 for T cell
#> nPairs = 190 for trachea_epithelial
#> nPairs = 190 for trachea_mesenchymal
#> There are 350 top gene pairs
#> Finished pair transforming the data
#> Number of missing genes 0
#> All Done
#> user system elapsed
#> 22.585 2.005 24.782
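scn_train() returns a list whose cnProc element holds the trained classifier and the associated processing information used in the prediction steps below (e.g. class_info[['cnProc']] and class_info$cnProc$xpairs). You can inspect the returned components:
# list the components returned by scn_train()
names(class_info)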
After building the top-pair Random Forest (TP-RF) classifier, it is crucial to assess its performance on its own held-out data or on an external dataset with known or well-annotated cell identities. Here, we use the held-out data from the Tabula Muris dataset to assess the TP-RF classifier. We first template-match the held-out (test) data to binarize it according to the top gene pairs selected above, and then classify the test data with the TP-RF classifier. Inspecting the precision-recall curves, we see that the performance of nearly every category in the TP-RF classifier is close to perfect (area under the PR curve near 1), so we can confidently proceed to classify the query dataset of interest.
#apply heldout data
system.time(classRes_val_all <- scn_predict(class_info[['cnProc']], expTest, nrand = 50))
#> Loaded in the cnProc
#> All Done
#> user system elapsed
#> 0.208 0.012 0.220
#assessment
tm_heldoutassessment <- assess_comm(ct_scores = classRes_val_all, stTrain = stTrain, stQuery = stTest, dLevelSID = "cell", classTrain = "newAnn", classQuery = "newAnn")
plot_PRs(tm_heldoutassessment)
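The same assessment object can also be summarized with other metrics; as listed in the function summary at the end of this vignette, plot_metrics() reports measures such as Cohen's kappa:
plot_metrics(tm_heldoutassessment)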
One of the key features of the TP-RF classifier is that the classification step is robust to how the query dataset was normalized and transformed. One can even query with raw data directly and still obtain accurate classifications. Refer to the SCN manuscript for more details on how cell-cycle regression may affect your classification results.
In this example, we binarize the raw human PBMC 10x query data by template matching against the top gene pairs selected in the previous steps, and then classify the query cells with the TP-RF classifier that showed favorable performance above. We also create another 50 randomized single-cell RNA-seq profiles, which serve as positive controls that should mostly be classified into the rand (unknown) category of the TP-RF classifier.
system.time(crPBMC <- scn_predict(class_info[['cnProc']], expQueryOrth, nrand = 50))
#> Loaded in the cnProc
#> All Done
#> user system elapsed
#> 0.122 0.006 0.128
This assigns each cell to the category with the highest classification score, requiring that score to exceed the threshold cThresh (here 0.5). The annotation result can be found in a column named category in the query sample table.
stQuery <- assign_cate(classRes = crPBMC, sampTab = stQuery, cThresh = 0.5)
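A quick way to summarize the resulting annotations (plain base R) is to tabulate the category column that assign_cate() added to the sample table:
# number of query cells assigned to each category
table(stQuery$category)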
There are three major ways to visualize the classification output (the classification matrix). The first is a classification heatmap. We create a vector of labels for each cell in the query data; here, since we know the true identity of each single cell, we use that as the label. In most cases, where the true identity of the query is unknown, the clustering assignment is used instead.
#create labels for classification heatmap
sgrp<-as.vector(stQuery$description)
names(sgrp)<-rownames(stQuery)
grpRand<-rep("rand", 50)
names(grpRand)<-paste("rand_", 1:50, sep='')
sgrp<-append(sgrp, grpRand)
sc_hmClass(crPBMC, sgrp, max=5000, isBig=TRUE, cCol=F, font=8)
The second major way to display the classification scores is the violin plot. Here we lay out the violin plots horizontally by setting the number of display columns (ncol) to match the number of groups in the query. This visualization is best for demonstrating the specificity of the classification method.
sc_violinClass(sampTab = stQuery,classRes = crPBMC, cellIDCol = "sample_name", dLevel = "description", ncol = 9)
To zoom in on a specific cell type of interest, you can also restrict the display of classification scores using the sub_cluster parameter.
sc_violinClass(sampTab = stQuery, classRes = crPBMC, cellIDCol = "sample_name", dLevel = "description", ncol = 9, sub_cluster = "B cell")
The third major visualization is the attribution plot, which is best for displaying the composition of cell types within clusters, tissues, organs, samples, and time points.
plot_attr(sampTab = stQuery, classRes = crPBMC, nrand=50, sid="sample_name", dLevel="description")
You can also visualize the average expression of the top-pair genes in both the query data and the training data.
gpTab <- compareGenePairs(query_exp = expQueryOrth, training_exp = expTrainOrth, training_st = stTrain, classCol = "newAnn", sampleCol = "cell", RF_classifier = class_info$cnProc$classifier, numPairs = 20, trainingOnly = FALSE)
#> [1] "All Good"
sgrp<-as.vector(stQuery$prefix)
names(sgrp)<-rownames(stQuery)
train <- findAvgLabel(gpTab, stTrain = stTrain, dLevel = "newAnn")
sgrp <- append(sgrp, train)
hm_gpa_sel(gpTab, genes = class_info$cnProc$xpairs, grps = sgrp, maxPerGrp = 5)
- Data preparation
  - `splitCommon()` -- split sample table into training and validation sets through random sampling
- Classifier
  - `scn_train()` -- train the random forest classifier using top gene pairs
  - `scn_predict()` -- predict classes for input data
  - `assign_cate()` -- assign cell annotation based on the highest classification score
- Assessment
  - `assess_comm()` -- summarize data for the precision-recall curves (`plot_PRs()`) and Cohen's kappa (`plot_metrics()`)
  - `sc_hmClass()` -- heatmap showing classification scores for each input cluster
  - `plot_attr()` -- barplot showing the fraction of each input cluster that is classified into a category
  - `hm_gpa_sel()` -- heatmap showing average expression for top gene pairs
  - `sc_violinClass()` -- violin plot showing the distribution of classification scores for each input cluster
- Training data
  - `stTM` -- sample table for Tabula Muris containing cell metadata
  - `expTMraw` -- raw counts matrix for Tabula Muris
- Query data
  - `stQuery` -- sample table for the human PBMC query cells
  - `expQuery` -- raw counts for the human PBMC query cells
- Training
  - Train
    - `stList` -- list containing cells for training and assessment
    - `stTrain` -- training cell metadata
    - `expTrain` -- training cell counts
    - `class_info` -- random forest classifier
  - Assess
    - `stTest` -- assessment cell metadata
    - `expTest` -- assessment cell counts
    - `classRes_val_all` -- matrix of classification results for the assessment cells
    - `tm_heldoutassessment` -- summary data for the Tabula Muris assessment
- Predicting query
  - `crPBMC` -- matrix of classification results for the PBMC query cells
sessionInfo()
#> R version 3.6.1 (2019-07-05)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS Mojave 10.14.4
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] singleCellNet_0.1.0 cowplot_1.0.0 reshape2_1.4.3
#> [4] pheatmap_1.0.12 dplyr_0.8.3 ggplot2_3.2.1
#> [7] BiocStyle_2.12.0
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.2 pillar_1.4.2 compiler_3.6.1
#> [4] BiocManager_1.30.4 RColorBrewer_1.1-2 plyr_1.8.4
#> [7] tools_3.6.1 boot_1.3-23 digest_0.6.20
#> [10] lattice_0.20-38 manipulate_1.0.1 evaluate_0.14
#> [13] tibble_2.1.3 gtable_0.3.0 pkgconfig_2.0.2
#> [16] rlang_0.4.0 Matrix_1.2-17 parallel_3.6.1
#> [19] yaml_2.2.0 expm_0.999-4 mvtnorm_1.0-11
#> [22] xfun_0.9 withr_2.1.2 stringr_1.4.0
#> [25] knitr_1.24 DescTools_0.99.28 grid_3.6.1
#> [28] tidyselect_0.2.5 glue_1.3.1 R6_2.4.0
#> [31] foreign_0.8-72 rmarkdown_1.15 bookdown_0.13
#> [34] purrr_0.3.2 magrittr_1.5 MASS_7.3-51.4
#> [37] scales_1.0.0 htmltools_0.3.6 randomForest_4.6-14
#> [40] assertthat_0.2.1 colorspace_1.4-1 labeling_0.3
#> [43] stringi_1.4.3 lazyeval_0.2.2 munsell_0.5.0
#> [46] crayon_1.3.4