scRNA-系统学习单细胞转录组测序scRNA-Seq(五)

刘小泽写于19.4.5

许多单细胞分析包中都会遇到这个对象,那么说明书当然是最好的学习工具了

说明书地址:https://bioc.ism.ac.jp/packages/3.6/bioc/vignettes/SingleCellExperiment/inst/doc/intro.html

安装

options(BioC_mirror="https://mirrors.ustc.edu.cn/bioc/") 
library(BiocManager)
BiocInstaller::biocLite("SingleCellExperiment")
BiocInstaller::biocLite("scRNAseq")

两种创建方式

# 第一种
> counts <- matrix(rpois(100, lambda = 10), ncol=10, nrow=10)
> sce <- SingleCellExperiment(assays = list(counts = counts))
> sce
class: SingleCellExperiment 
dim: 10 10 
metadata(0):
assays(1): counts
rownames: NULL
rowData names(0):
colnames: NULL
colData names(0):
reducedDimNames(0):
spikeNames(0):
# 第二种
> se <- SummarizedExperiment(list(counts=counts))
> as(se, "SingleCellExperiment")

演示数据集

使用 scRNAseq 中的allen数据集

> data(allen)
> allen
class: SummarizedExperiment 
dim: 20908 379 
metadata(2): SuppInfo which_qc
assays(4): tophat_counts cufflinks_fpkm rsem_counts rsem_tpm
rownames(20908): 0610007P14Rik 0610009B22Rik ... Zzef1 Zzz3
rowData names(0):
colnames(379): SRR2140028 SRR2140022 ... SRR2139341 SRR2139336
colData names(22): NREADS NALIGNED ... Animal.ID passes_qc_checks_s

现在是一个SummarizedExperiment,那么使用第二种方法:

> sce <- as(allen, "SingleCellExperiment")
> sce
class: SingleCellExperiment 
dim: 20908 379 
metadata(2): SuppInfo which_qc
assays(4): tophat_counts cufflinks_fpkm rsem_counts rsem_tpm
rownames(20908): 0610007P14Rik 0610009B22Rik ... Zzef1 Zzz3
rowData names(0):
colnames(379): SRR2140028 SRR2140022 ... SRR2139341 SRR2139336
colData names(22): NREADS NALIGNED ... Animal.ID passes_qc_checks_s
reducedDimNames(0):
spikeNames(0):

可以对SingleCellExperiment进行的操作

添加spike-in信息

#代码的意思很清楚:先提取 rownames(sce)中的ERCC,结果返回逻辑值,然后传递给isSpike,如果结果是TRUE的部分,就在sce中记做”ERCC“
> isSpike(sce, "ERCC") <- grepl("^ERCC-", rownames(sce))
> sce
class: SingleCellExperiment 
dim: 20908 379 
metadata(2): SuppInfo which_qc
assays(4): tophat_counts cufflinks_fpkm rsem_counts rsem_tpm
rownames(20908): 0610007P14Rik 0610009B22Rik ... Zzef1 Zzz3
rowData names(0):
colnames(379): SRR2140028 SRR2140022 ... SRR2139341 SRR2139336
colData names(22): NREADS NALIGNED ... Animal.ID passes_qc_checks_s
reducedDimNames(0):
spikeNames(1): ERCC

有了这个信息,我们也可以非常方便进行提取,包括数量和名称信息

> table(isSpike(sce, "ERCC")) #提取数量
FALSE  TRUE 
20816    92 
> spikeNames(sce) # 提取名称
[1] "ERCC"

虽然大多数实验中只用到一个spike-in信息,但这个函数还可以继续加spike-in,比如要添加adam基因家族作为spike-in

> isSpike(sce, "Adam") <- grepl("^Adam[0-9]", rownames(sce)) #添加新spike-in
> sce
class: SingleCellExperiment 
dim: 20908 379 
metadata(2): SuppInfo which_qc
assays(4): tophat_counts cufflinks_fpkm rsem_counts rsem_tpm
rownames(20908): 0610007P14Rik 0610009B22Rik ... Zzef1 Zzz3
rowData names(0):
colnames(379): SRR2140028 SRR2140022 ... SRR2139341 SRR2139336
colData names(22): NREADS NALIGNED ... Animal.ID passes_qc_checks_s
reducedDimNames(0):
spikeNames(2): ERCC Adam
> table(isSpike(sce, "Adam"))

FALSE  TRUE 
20875    33 

要取消之前定义的spike-in也很简单

> isSpike(temp,"ERCC") <- NULL
> isSpike(temp,"Adam") <- NULL
> spikeNames(temp)
character(0)

添加Size factors

一般就是计算总reads数作为size factors,更复杂的计算情况可以参考: scran

> sizeFactors(sce) <- colSums(assay(sce))
> head(sizeFactors(sce))
SRR2140028 SRR2140022 SRR2140055 SRR2140083 SRR2139991 SRR2140067 
   5173863    6445002    2343379    5438526    4757468    2364851 

可以根据其他信息计算size factors并存储在一个对象中,然后给他一个名称,比如这里要计算ERCC的size factors

> sizeFactors(sce, "ERCC") <- colSums(assay(sce)[isSpike(sce, "ERCC"),])
# 那么要看ERCC的size factors就要指定ERCC名称。不指定还是默认看之前的
> head(sizeFactors(sce, "ERCC"))
SRR2140028 SRR2140022 SRR2140055 SRR2140083 SRR2139991 SRR2140067 
    224648     186208     162370     512991     278034      64975 

提取colData和rowData

colData and rowData 分别存储样本和基因信息

# 简单的提取
colData(sce)
rowData(sce)
# 如果想提取出size facotr
colData(sce,internal = T)
# 如果想提取出spike-in
rowData(sce,internal = T)

对研究的基因数量降维(取子集)

一般表达矩阵都有一万多基因,为了分析的简便和速度,可以提取前100基因,但是这也不是随便选取的。它的要求是:log转换后最大方差的前100

> library(magrittr)
> assay(sce) %>% log1p %>% rowVars -> vars 
# log1p(x) =  log(1+x);rowVars:each row variance in a matrix 
> names(vars) <- rownames(sce)
> vars <- sort(vars, decreasing = TRUE)  # 为了找到前100的名字

> sce_sub <- sce[names(vars[1:100]),]
> sce_sub
class: SingleCellExperiment 
dim: 100 379 
metadata(2): SuppInfo which_qc
assays(4): tophat_counts cufflinks_fpkm rsem_counts rsem_tpm
rownames(100): Lamp5 Fam19a1 ... Rnf2 Zfp35
rowData names(0):
colnames(379): SRR2140028 SRR2140022 ... SRR2139341 SRR2139336
colData names(22): NREADS NALIGNED ... Animal.ID passes_qc_checks_s
reducedDimNames(0):
spikeNames(2): ERCC Adam

看到reducedDimNames还是空值,因此下一步就是利用得到的100个方差最大的基因求PCA和tSNE

> library(Rtsne)
> set.seed(5252)

> pca_data <- prcomp(t(log1p(assay(sce_sub))))
> tsne_data <- Rtsne(pca_data$x[,1:50], pca = FALSE) ## 选取前50个主成分;之前做过了PCA,这里就F(默认T)
# t-SNE for visualization, PCA for pseudo-time inference

# reducedDims就像assays一样,可以存储许多不同维度的数据
> reducedDims(sce_sub) <- SimpleList(PCA=pca_data$x, TSNE=tsne_data$Y)
> sce_sub
class: SingleCellExperiment 
dim: 100 379 
metadata(2): SuppInfo which_qc
assays(4): tophat_counts cufflinks_fpkm rsem_counts rsem_tpm
rownames(100): Lamp5 Fam19a1 ... Rnf2 Zfp35
rowData names(0):
colnames(379): SRR2140028 SRR2140022 ... SRR2139341 SRR2139336
colData names(22): NREADS NALIGNED ... Animal.ID passes_qc_checks_s
reducedDimNames(2): PCA TSNE
spikeNames(2): ERCC Adam

reducedDimNames存储的是坐标,可以按名称或index提取。每一行表示一个基因,每一列表示一个坐标

> reducedDims(sce_sub)
List of length 2
names(2): PCA TSNE
> reducedDimNames(sce_sub)
[1] "PCA"  "TSNE"
> head(reducedDim(sce_sub, "PCA")[,1:2])
                 PC1        PC2
SRR2140028 17.557295  -7.717162
SRR2140022 21.468975  -1.198212
SRR2140055  4.303756 -11.360330
SRR2140083 21.440479  -9.435868
SRR2139991 15.592089 -11.043989
SRR2140067 16.539336  -9.831779
> head(reducedDim(sce_sub, "TSNE")[,1:2])
                [,1]      [,2]
SRR2140028 -4.547370 -16.76029
SRR2140022 -4.957524 -12.40431
SRR2140055  2.046637 -15.23982
SRR2140083 -3.609021 -14.32577
SRR2139991 -2.158066 -13.48076
SRR2140067 -2.361944 -15.70875

sce_sub取子集也就是直接对细胞降维的结果取子集

> dim(reducedDim(sce_sub, "PCA"))
[1] 379 100
> dim(reducedDim(sce_sub[,1:10], "PCA"))
[1]  10 100

快速命名assays并获取

SingleCellExperiment对象中,可以随意命名assays的entry。但是为了同其他包保持协调,官方给出了建议:

  • counts: 原始表达量,比如某个基因的的reads数或转录本数
  • normcounts: 利用原始表达量进行标准化,结果是原始值按一定比例缩小后得到的(中心化)。例如原始表达量除以每个细胞特定的size factor得到的就是中心化数据
  • logcounts: log转换后的counts值,多数情况下被定义为log-transformed normcounts, 例如log2(count+1)
  • cpm: Counts-per-million. 每个细胞中每个基因的read count值,再除以每个细胞的文库大小(单位是M)
  • tpm: Transcripts-per-million. 每个细胞中每个基因的转录本数量,再除以每个细胞的总转录本数(单位是M)
# 这里将counts这个entry,定义成快捷访问tophat_counts
> counts(sce) <- assay(sce, "tophat_counts")
> sce
class: SingleCellExperiment 
dim: 20908 379 
metadata(2): SuppInfo which_qc
assays(5): tophat_counts cufflinks_fpkm rsem_counts rsem_tpm counts
rownames(20908): 0610007P14Rik 0610009B22Rik ... Zzef1 Zzz3
rowData names(0):
colnames(379): SRR2140028 SRR2140022 ... SRR2139341 SRR2139336
colData names(22): NREADS NALIGNED ... Animal.ID passes_qc_checks_s
reducedDimNames(0):
spikeNames(2): ERCC Adam

这样以后直接用counts提取出tophat_counts ,而不用写后面一大串

Yunze Liu
Yunze Liu
Bioinformatics Sharer

Co-founder of Bioinfoplanet(生信星球)

Next
Previous

Related