Sparse Differential Clustering

Martin Barron

2018-01-04

The SparseDC package implements the sparse differential clustering algorithm described in “A sparse differential clustering algorithm for tracing cell type changes via single-cell RNA-Sequencing data”. This algorithm clusters samples (cells) from two conditions (identifying cell types), links the clusters across conditions and identifies variables (genes) which are markers for these changes.This vignette will guide you through the steps neccessary to apply the algorithm to your data.

Section 1 - Preliminaries

SparseDC takes as input data which is drawn from two conditions, for single-cell RNA-Sequencing (scRNA-seq) data, an example would be cells from populations before and after treatment. SparseDC assumes the data have been properly normalized beforehand.

The first step in the analysis is to load the sparseDC package:

library(SparseDC)
set.seed(10)

Real Data Example - Biase Data

For this vignette we will use the scRNA-seq data created by Biase et al. to study cell fate inclination in mouse embryos (Biase, Cao, and Zhong 2014). This dataset contains gene expression, FPKM, measurements for 49 cells and 16,514 genes. The cells in the dataset come from three different cell types, zygote, two-cell embryos and four-cell embryos. While the cells in this dataset are all from a single condition we have dveloped an approach to split the data into two conditions so that we can test the linking of clusters across conditions where we know the absolute truth about which clusters are present in each condition (please see the original manuscript for more details). For this dataset we put the zygote cells and half of the two-cell cells into condition 1 and the remaining two-cell cells and the four-cell cells are put into condition 2.

Splitting the data

We first view the top of the dataset to ensure it is in the correct format. Here the genes are rows and the columns are cells as desired.

data(data_biase)
head(data_biase[,1:5])
##                    GSM1377859 GSM1377860 GSM1377861 GSM1377862 GSM1377863
## ENSMUSG00000000001   3.288693   3.631147   2.290201   3.241467  3.4727581
## ENSMUSG00000000028   4.547849   4.533851   4.560077   4.682483  4.8076946
## ENSMUSG00000000037   3.662392   3.154039   3.192203   3.767524  3.2131756
## ENSMUSG00000000049   0.000000   0.000000   0.000000   0.000000  0.0000000
## ENSMUSG00000000056   2.544338   1.889089   2.143146   1.975677  1.9743810
## ENSMUSG00000000078   1.127634   1.278873   1.085679   2.132017  0.9719827

We then store the gene names for the data:

gene_names_biase <- row.names(data_biase)

We next examine the cell types of the cells present in the data:

summary(as.factor(cell_type_biase))
## Four-cell Embryo  Two-cell Embryo           Zygote 
##               20               20                9

With all the cell types as expected we then view the number of cells in each condition:

summary(as.factor(condition_biase))
##  A  B 
## 19 30

The next step is to split the data into their respective conditions:

data_A <- data_biase[ , which(condition_biase == "A")]
data_B <- data_biase[ , which(condition_biase == "B")]

Then check that the dimensions are correct for each of the datasets:

dim(data_A)
## [1] 16514    19
dim(data_B)
## [1] 16514    30

And check they are in the correct format:

head(data_A[ ,1:5])
##                    GSM1377859 GSM1377860 GSM1377861 GSM1377862 GSM1377863
## ENSMUSG00000000001   3.288693   3.631147   2.290201   3.241467  3.4727581
## ENSMUSG00000000028   4.547849   4.533851   4.560077   4.682483  4.8076946
## ENSMUSG00000000037   3.662392   3.154039   3.192203   3.767524  3.2131756
## ENSMUSG00000000049   0.000000   0.000000   0.000000   0.000000  0.0000000
## ENSMUSG00000000056   2.544338   1.889089   2.143146   1.975677  1.9743810
## ENSMUSG00000000078   1.127634   1.278873   1.085679   2.132017  0.9719827
head(data_B[ ,1:5])
##                    GSM1377878 GSM1377879 GSM1377880 GSM1377881 GSM1377882
## ENSMUSG00000000001  1.9891407  0.4542419   3.070172 3.15709403 4.20193210
## ENSMUSG00000000028  4.8749530  4.9535851   4.502423 4.97275342 5.23127973
## ENSMUSG00000000037  3.4309147  3.9816964   3.156383 3.02379967 2.64779051
## ENSMUSG00000000049  0.2570478  0.8972465   0.000000 0.00000000 0.17842313
## ENSMUSG00000000056  0.8593775  0.3787494   1.058426 0.06336928 0.03809113
## ENSMUSG00000000078  0.0000000  0.0000000   0.000000 0.03189917 0.00000000

As the dimension and format of these datasets are correct we can procede to pre-processing the data and estimating the parameters for using SparseDC.

Section 2 - Pre-processing the data

SparseDC requires that data be normalized for sequencing depth and centered prior to running SparseDC. We also recommend that the data is log-transformed. To do this we have included a function that can easily pre-process the data. For the normalization it is recommended that users make use of one of the many methods that exist for normalizing scRNA-Seq data. The centering of the data is crucially important to the function of SparseDC and is vital to accurately clustering the data and identifying marker genes. We recommend that all users use this function to center their data and that only experienced users set “center=FALSE”.

The biase data are FPKM measurements of gene expression and so have been normalized using an alternate method as advised. This means we can set “norm = FALSE”. The biase data then needs to be both log transformed and centered so we can set “log =TRUE”" and “center = TRUE”:

pre_data <- pre_proc_data(data_A, data_B, norm = FALSE, log = TRUE, center = TRUE)

The pre-processing function outputs the two processed datasets as a list so to extract the data we can run:

pdata_A <- pre_data[[1]]
pdata_B <- pre_data[[2]]

And view the processed data:

head(pdata_A[,1:5])
##                      GSM1377859   GSM1377860    GSM1377861  GSM1377862
## ENSMUSG00000000001  0.162775402  0.239597984 -0.1022580244  0.15170254
## ENSMUSG00000000028 -0.001604684 -0.004130993  0.0005968605  0.02237327
## ENSMUSG00000000037  0.459405948  0.343958400  0.3531038162  0.48170453
## ENSMUSG00000000049 -0.130220364 -0.130220364 -0.1302203638 -0.13022036
## ENSMUSG00000000056  0.511663747  0.307253398  0.3915364804  0.33678380
## ENSMUSG00000000078  0.441440323  0.510111080  0.4215243542  0.82810722
##                     GSM1377863
## ENSMUSG00000000001  0.20479864
## ENSMUSG00000000028  0.04416868
## ENSMUSG00000000037  0.35809405
## ENSMUSG00000000049 -0.13022036
## ENSMUSG00000000056  0.33634818
## ENSMUSG00000000078  0.36546939
head(pdata_B[,1:5])
##                     GSM1377878  GSM1377879   GSM1377880  GSM1377881
## ENSMUSG00000000001 -0.19822065 -0.91872184  0.110478568  0.13160968
## ENSMUSG00000000028  0.05568304  0.06897856 -0.009826495  0.07219301
## ENSMUSG00000000037  0.40848344  0.52564787  0.344522660  0.31210403
## ENSMUSG00000000049  0.09854563  0.51018328 -0.130220364 -0.13022036
## ENSMUSG00000000056 -0.13344603 -0.43251091 -0.031746393 -0.69224534
## ENSMUSG00000000078 -0.31357011 -0.31357011 -0.313570114 -0.28216916
##                     GSM1377882
## ENSMUSG00000000001  0.35582352
## ENSMUSG00000000028  0.11456670
## ENSMUSG00000000037  0.21399903
## ENSMUSG00000000049  0.03395685
## ENSMUSG00000000056 -0.71630419
## ENSMUSG00000000078 -0.31357011

Finally let us check the centering was succesful by examing the rowsums of the pooled data:

summary(rowSums(cbind(pdata_A,pdata_B)))
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -5.773e-15 -1.110e-15  0.000e+00 -1.578e-17  1.110e-15  5.551e-15

As these values are all close to zero we can see that the centering has been successful and we are now ready to move on to eatimating the parameters.

Section 2 - Estimating the parameters

There are two parameters to be calculated when using SparseDC, \(\lambda_{1}\) and \(\lambda_{2}\), which control the level of sparsity of the marker genes for each cluster and the level of difference in marker genes for each cluster across the conditions. The estimation of these parameters is described in the original manuscript.

Estimating \(\lambda_{1}\)

To estimate \(\lambda_{1}\) all that needs to be provided is the total number of clusters present in the data and two pre-processed centered datasets. For the Biase data there are three clusters present in the dataset so we set ‘n_cluster’ equal to three.

lambda1_value <- lambda1_calculator(pdata_A, pdata_B, ncluster = 3)
lambda1_value
## [1] 2.203603

Estimating \(\lambda_{2}\)

To estimate \(\lambda_{2}\) we again input the centered data from the two conditions and the number of clusters in the data.

lambda2_value <- lambda2_calculator(pdata_A, pdata_B, ncluster =3)
lambda2_value
## [1] 0.868194

With the \(\lambda_{1}\) and \(\lambda_{2}\) parameters calculated it is now time to run the SparseDC algorithm.

Section 3 - Running SparseDC

The SparseDC function requires us to input the pre-processed centered data sets, the number of clusters and the previously calculated \(\lambda_{1}\) and \(\lambda_{2}\) values:

sdc_res <- sparsedc_cluster(pdata_A, pdata_B, ncluster = 3, 
                              lambda1 = lambda1_value, lambda2 = lambda2_value)
## The number of unique start values is:  18

Viewing Results

The results of SparseDC are stored as a list containing the clustering results for each condition, the cluster centers for each condition, and the scores for each of the iterations. These can be accessed by:

clusters_1 <- sdc_res$clusters1  # Clusters for condition 1 data
clusters_2 <- sdc_res$clusters2  # Clusters for condition 2 data
centers_1 <- sdc_res$centers1  # Centers for condition 1 data
centers_2 <- sdc_res$centers2  # Centers for condition 2 data

The results can then be seen as:

summary(as.factor(clusters_1))
##  2  3 
##  9 10
summary(as.factor(clusters_2))
##  1  3 
## 20 10

We can then visualize the accuracy of the clustering by comparing the clusters to the true cell types:

table(cell_type_biase[which(condition_biase == "A")], clusters_1)
##                  clusters_1
##                    2  3
##   Two-cell Embryo  0 10
##   Zygote           9  0
table(cell_type_biase[which(condition_biase == "B")], clusters_2)
##                   clusters_2
##                     1  3
##   Four-cell Embryo 20  0
##   Two-cell Embryo   0 10
table(c(cell_type_biase),c(clusters_1,clusters_2))
##                   
##                     1  2  3
##   Four-cell Embryo 20  0  0
##   Two-cell Embryo   0  0 20
##   Zygote            0  9  0

Here see that each of the cells was clustered to the correct cell type, with the zygote cells making up cluster 1, the two-cell cells making up cluster 2 and the four-cell cells making up cluster 3.

Marker Genes

The marker gene results are stored as the centers with column 1 corresponding to the marker gene results for cluster 1 and so on. Any center value not equal to zero is a marker gene for the cluster, while center values equal to zero indicate null genes. Gene with a positive center value are up-regualted for the cluster while genes with a negative center value are down-regulated for the cluster. To view the top-10 up-regulated marker genes for the zygote cell type we can run:

zygote_top_10_index <- which(centers_1[,1] >= tail(sort(centers_1[,1]),10)[1])
zygote_top_10 <- gene_names_biase[zygote_top_10_index]
zygote_top_10
##  [1] "ENSMUSG00000000308" "ENSMUSG00000000903" "ENSMUSG00000015247"
##  [4] "ENSMUSG00000032494" "ENSMUSG00000060591" "ENSMUSG00000083068"
##  [7] "ENSMUSG00000084174" "ENSMUSG00000087881" "ENSMUSG00000093678"
## [10] "ENSMUSG00000095834"

To view the center values for each of the top-10 genes we can run:

zyg_t10_res <- cbind(zygote_top_10, centers_1[zygote_top_10_index])
zyg_t10_res
##       zygote_top_10                           
##  [1,] "ENSMUSG00000000308" "0.339386992951137"
##  [2,] "ENSMUSG00000000903" "0.345497519535562"
##  [3,] "ENSMUSG00000015247" "0.36862191083375" 
##  [4,] "ENSMUSG00000032494" "0.477640809239895"
##  [5,] "ENSMUSG00000060591" "0.367896779790343"
##  [6,] "ENSMUSG00000083068" "0.409185654814694"
##  [7,] "ENSMUSG00000084174" "0.572456920609876"
##  [8,] "ENSMUSG00000087881" "0.365451348202132"
##  [9,] "ENSMUSG00000093678" "0.340362695952872"
## [10,] "ENSMUSG00000095834" "0.402631577182611"

Condition-Specific and Condition-Dependent Marker Genes

To detect condition-specific and condition-dependent marker genes (as defined in the original manuscript) we must examine the differences between the centers for clusters which are present in both conditions. As only the two-cell cells are present in each condition we examine their marker genes for condition specific and condition dependent marker genes.

As the two-cell cells were clustered in cluster 2 we compare the two-center vectors for cluster 2 to identify condition-specific and condition-dependent marker genes:

diff_gene_index_2cell <- which(centers_1[,2] != centers_2[,2])
diff_gene_index_2cell
## integer(0)

However, for this dataset as the cells in both conditions are the same SparseDC correctly did not detect any genes as condtion-specific or condition-dependent.

Estimating the Number of Clusters in the Data via the Gap Statistic

One of the more succesful methods for estimating the number of clusters present in supervised clustering analysis is the gap statistic (Tibshirani, Walther, and Hastie 2001). To aid users who may not know the number of clusters in their data prior to analysis we have included an implementation of the gap statistic for SparseDC in the R-package. Please see the original paper for full details on the method.

Users should beware that the gap statistic function can take some time to run especially if using a large cluster range, a high number of bootstrap samples or a high dimensional dataset. Please uncomment the code below if you would like to run it. Here “min_clus” is the minimum number of clusters to try while “max_clus” is the maximum. “nboots” controls the number of bootstrap samples used, with a default value of 200.

#gap_stat <- sparsedc_gap(pdata_A, pdata_B,
#                                 min_clus = 2, max_clus=4,
#                                 nboots = 200, nitter = 20, nstarts = 10)
#plot(gap_stat$gap_stat, xlab = "Cluster Number", ylab = "Gap Statistic",
#     main = "Gap Statistic Plot")
#arrows(1:length(gap_stat$gap_stat),gap_stat$gap_stat-gap_stat$gap_se,
#       1:length(gap_stat$gap_stat),gap_stat$gap_stat+gap_stat$gap_se, 
#       code=3, length=0.02, angle = 90)

References

Biase, Fernando H., Xiaoyi Cao, and Sheng Zhong. 2014. “Cell Fate Inclination Within 2-Cell and 4-Cell Mouse Embryos Revealed by Single-Cell RNA Sequencing.” Genome Research 11 (4). Genome Research: 1787–96. doi:10.1101/gr.177725.114.

Tibshirani, Robert, Guenther Walther, and Trevor Hastie. 2001. “Estimating the Number of Clusters in a Data Set via the Gap Statistic.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63 (2). Blackwell Publishers Ltd.: 411–23. doi:10.1111/1467-9868.00293.