235-一个非常良心的TCGA workshop推荐

刘小泽写于2021.2.25 今天翻看TCGAbiolinks说明书,看到几个相关的workshop,感觉很不错,推荐给大家

0 前言

直达workshop【更新于2019-08-30】:https://rpubs.com/tiagochst/TCGAworkshop

这个workshop的目的是:

  • 学会从Genomic Data Commons (GDC)下载TCGA数据
  • 如何将数据导入并分析

1 认识GDC

TCGA的数据主要存放在两个不同的数据库:

  • The legacy(译为:”遗留“) archive:https://portal.gdc.cancer.gov/legacy-archive/search/f,主要存放早期的一些数据。官方的介绍是:legacy data is not actively maintained, processed, or harmonized by the GDC

  • Harmonized database : https://portal.gdc.cancer.gov/,也就是大家最常见的网站。它是使用了统一的流程,并使用了hg38进行处理得到的结果,数据比较新

有一篇文章也介绍了二者的不同:

  • Gao, Galen F., et al. “Before and After: Comparison of Legacy and Harmonized TCGA Genomic Data Commons’ Data.” Cell systems 9.1 (2019): 24-34. (https://doi.org/10.1016/j.cels.2019.06.006)
  • 主要结论就是:
    • TCGA早期的大部分样本是比对到hg19(还有少部分比对到hg18),叫做legacy omics data
    • 后来各种软件流程逐渐成熟,GDC重新进行了一次操作,比对到了hg38,叫做harmonized omics data
    • hg19和hg38版本得到的数据差别不大:The GRCh37 (hg19) and GRCh38 (hg38) TCGA data versions are highly concordant

2 认识TCGA

2.1 数据的两种权限

  • Open: includes high level genomic data that is not individually identifiable, as well as most clinical and all biospecimen data elements.
  • Controlled: includes individually identifiable data such as low-level genomic sequencing data, germline variants, SNP6 genotype data, and certain clinical data elements

2.2 TCGA样本命名

  • 每个样本都有一个独特的名字,叫做:barcode
  • 图片来自:https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/

当然,每一部分还有更为详细的说明:https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables

3 GDC主要的数据类型

在这里可以看到 https://portal.gdc.cancer.gov/repository

  • Project:比如不同的大型项目,如TCGA、TARGET
  • Data category:比如Transcriptome Profiling, DNA methylation, Clinical
  • Data type:比如Gene Expression Quantification, Isoform Expression Quantification, Methylation Beta Value
  • Experimental strategy:比如miRNA-Seq, RNA-Seq
  • Workflow Type
  • Platform
  • Access type

总而言之,它们的层级关系是:

A project => has data on several categories => each category contains several data types => each type might have been produced with different workflows, experimental strategy and platforms

4 一个比较重要的R数据结构

SummarizedExperiment 是非常常用的,它的示意图是(来自http://bioconductor.org/packages/SummarizedExperiment/):

You can access samples metadata with colData function, genomics data with assays and genomics metadata with rowRanges.

5 实战

将从Harmonized GDC database (hg38 version)下载数据

主要包括以下几个方面:

  1. Access TCGA clinical data
  2. Search, download and make an R object from TCGA data
  • RNA-seq data
  • DNA methylation
  1. Download samples metadata
  2. Download and visualize mutations
  3. Download and visualize copy number alterations

5.0 加载包

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

if (!requireNamespace("ComplexHeatmap", quietly = TRUE))
  BiocManager::install("ComplexHeatmap") # from bioconductor

if (!requireNamespace("TCGAbiolinks", quietly = TRUE))
  BiocManager::install("BioinformaticsFMRP/TCGAbiolinks") # from github

library(TCGAbiolinks)
library(MultiAssayExperiment)
library(maftools)
library(dplyr)
library(ComplexHeatmap)

5.1 临床信息

GDC的临床信息有这几个来源:

  • XML files: original source of the data。包含更多的信息,比如radiation, drugs information, follow-ups, biospecimen, etc。下面的index数据就是XML的一个子集而已
  • indexed clinical: a refined clinical data that is created using the XML files. 它包含的信息是不断更新的,比如一个病人活着的时候被收集了临床信息,但是后来继续调查时不幸因病去世,这时index数据就会显示死亡;而XML中就会比较冗余(首先是记录之前还活着,然后又记录了后来死亡)
  • BCR Biotab: tsv files parsed from XML files

示例来自:https://portal.gdc.cancer.gov/cases/127f738d-afa1-42f6-a9a0-f33806667b84

使用GDCquery_clinic获取
clinical <- GDCquery_clinic("TCGA-COAD")
head(clinical)

就以前面看到的病人TCGA-AA-3562为例:

clinical %>% 
  dplyr::filter(submitter_id == "TCGA-AA-3562") %>% 
  t %>% 
  as.data.frame

获取Biotab信息

除了上面的index信息,还可以获取基于XML的Biotab文件(tab分隔),不过需要TCGAbiolinks版本高于2.13.5

query <- GDCquery(project = "TCGA-ACC", 
                  data.category = "Clinical",
                  data.type = "Clinical Supplement", 
                  data.format = "BCR Biotab")
GDCdownload(query)
clinical.BCRtab.all <- GDCprepare(query)

> names(clinical.BCRtab.all)
[1] "clinical_radiation_acc"         
[2] "clinical_omf_v4.0_acc"          
[3] "clinical_patient_acc"           
[4] "clinical_follow_up_v4.0_acc"    
[5] "clinical_follow_up_v4.0_nte_acc"
[6] "clinical_nte_acc"               
[7] "clinical_drug_acc" 

# 提取其中药物治疗信息
clinical.BCRtab.all$clinical_drug_acc  %>% 
  head  %>% 
  as.data.frame

5.2 获取RNA-Seq数据

之前说过,GDC后来将TCGA数据重新分析了一遍,采用了统一的流程,对于RNA-Seq就是

(图片来自:https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/)

然后有几种不同的RNA-Seq结果类型:

GDCquery中,可以这么设置参数,尤其是workflow.type有多种选择:

  • data.category: “Transcriptome Profiling”
  • data.type: “Gene Expression Quantification”
  • workflow.type: “HTSeq - Counts”, “HTSeq - FPKM”, “HTSeq - FPKM-UQ”
query.exp.hg38 <- GDCquery(project = "TCGA-GBM", 
                  data.category = "Transcriptome Profiling", 
                  data.type = "Gene Expression Quantification", 
                  workflow.type = "HTSeq - Counts",
                  barcode =  c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01"))
GDCdownload(query.exp.hg38)
raw.counts <- GDCprepare(query = query.exp.hg38, summarizedExperiment = FALSE)

5.3 获取突变数据

使用GDCquery_Maf:download MAF aligned against hg38

# Pipelines 可选参数:  muse, varscan2, somaticsniper, mutect
maf <- GDCquery_Maf("COAD", pipelines = "muse")
maf %>% head %>% as.data.frame

# 可视化
# using maftools for data summary 
maftools.input <- maf %>% read.maf

# Check summary
plotmafSummary(maf = maftools.input, 
               rmOutlier = TRUE, 
               addStat = 'median', 
               dashboard = TRUE)


oncoplot(maf = maftools.input, 
         top = 10, 
         removeNonMutated = TRUE)


# classifies Single Nucleotide Variants into Transitions and Transversions
titv = titv(maf = maftools.input, 
            plot = FALSE, 
            useSyn = TRUE)
plotTiTv(res = titv)

#  extract sample summary
getSampleSummary(maftools.input)

5.4 拷贝数变异数据

pipeline在:https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/CNV_Pipeline/

Numeric focal-level Copy Number Variation (CNV) values were generated with “Masked Copy Number Segment” files from tumor aliquots using GISTIC2 on a project level.

只保留了蛋白编码基因,并且设置的CNV values noise cutoff 为0.3

  • Genes with focal CNV values smaller than -0.3 are categorized as a “loss” (-1)
  • Genes with focal CNV values larger than 0.3 are categorized as a “gain” (+1)
  • Genes with focal CNV values between and including -0.3 and 0.3 are categorized as “neutral” (0)
query <- GDCquery(project = "TCGA-GBM",
             data.category = "Copy Number Variation",
             data.type = "Gene Level Copy Number Scores",              
             access = "open")
GDCdownload(query)
scores <- GDCprepare(query)
scores[1:5,1:5]

可以进行可视化

scores.matrix <- scores %>% 
  dplyr::select(-c(1:3)) %>%  # Removes metadata from the first 3 columns
  as.matrix

rownames(scores.matrix) <- paste0(scores$`Gene Symbol`,"_",scores$Cytoband)

# gain in more than 100 samples
gain.more.than.hundred.samples <- which(rowSums(scores.matrix == 1) > 100)

# loss in more than 100 samples
loss.more.than.hundred.samples <- which(rowSums(scores.matrix == -1) > 100)

lines.selected <- c(gain.more.than.hundred.samples,loss.more.than.hundred.samples)

Heatmap(scores.matrix[lines.selected,],
        show_column_names = FALSE, 
        show_row_names = TRUE,
         row_names_gp = gpar(fontsize = 8),
        col = circlize::colorRamp2(c(-1,0,1), colors = c("red","white","blue")))

5.5 甲基化数据

pipeline在:https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Methylation_LO_Pipeline/

数值在0-1之间,0 表示unmethylated , 1表示 fully methylated

query_met.hg38 <- GDCquery(project = "TCGA-GBM", 
                           data.category = "DNA Methylation", 
                           platform = "Illumina Human Methylation 27", 
                           barcode = c("TCGA-02-0116-01A","TCGA-14-3477-01A-01D"))
GDCdownload(query_met.hg38)
data.hg38 <- GDCprepare(query_met.hg38,summarizedExperiment = TRUE)

data.hg38
## class: RangedSummarizedExperiment 
## dim: 27578 2 
## metadata(1): data_release
## assays(1): ''
## rownames(27578): cg00000292 cg00002426 ... cg27662877 cg27665659
## rowData names(7): Composite.Element.REF Gene_Symbol ...
##   CGI_Coordinate Feature_Type
## colnames(2): TCGA-02-0116-01A-01D-0199-05
##   TCGA-14-3477-01A-01D-0915-05
## colData names(113): sample patient ...
##   subtype_Telomere.length.estimate.in.blood.normal..Kb.
##   subtype_Telomere.length.estimate.in.tumor..Kb.

# 看看探针信息
library(SummarizedExperiment)
data.hg38 %>% rowRanges %>% as.data.frame

然后样本注释信息

data.hg38 %>% colData %>% as.data.frame

DNA甲基化探针信号值信息

data.hg38 %>% assay %>% head %>% as.data.frame

看看top10变化最大的探针

# plot 10 most variable probes
data.hg38 %>% 
  assay %>% 
  rowVars %>% 
  order(decreasing = TRUE) %>% 
  head(10) -> idx

pal_methylation <- colorRampPalette(c("#000436",
                                      "#021EA9",
                                      "#1632FB",
                                      "#6E34FC",
                                      "#C732D5",
                                      "#FD619D",
                                      "#FF9965",
                                      "#FFD32B",
                                      "#FFFC5A"))(100)

Heatmap(assay(data.hg38)[idx,],
        show_column_names = TRUE, 
        show_row_names = TRUE,
         name = "Methylation Beta-value", 
         row_names_gp = gpar(fontsize = 8),
         column_names_gp = gpar(fontsize = 8),
        col = circlize::colorRamp2(seq(0, 1, by = 1/99), pal_methylation))

6 还有更多内容

Yunze Liu
Yunze Liu
Bioinformatics Sharer

Co-founder of Bioinfoplanet(生信星球)

Next
Previous

Related