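TCGA (The Cancer Genome Atlas) is a cancer research project run jointly by the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI). By collecting and curating many kinds of cancer-related omics data, it provides a large, free reference database for cancer research. Mining this data for bioinformatics papers has been very popular in recent years, and the first step is learning how to download data from TCGA. This post shows how to do that with the R package TCGAbiolinks, which downloads data through the official data portal's API and provides functions that make it easy to download the data and tidy it into a usable format. It plays a role similar to the Broad Institute's firehose command-line tool, but from within R.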
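# Use the Tsinghua (TUNA) CRAN mirror
local({
  r <- getOption("repos")
  r["CRAN"] <- "http://mirrors.tuna.tsinghua.edu.cn/CRAN/"
  options(repos = r)
})
# Install BiocManager if it is not available yet
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
# Use the Tsinghua Bioconductor mirror, then install and load TCGAbiolinks
options(BioC_mirror = "https://mirrors.tuna.tsinghua.edu.cn/bioconductor")
BiocManager::install("TCGAbiolinks")
library(TCGAbiolinks)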
# Arguments of GDCquery
GDCquery(project, data.category, data.type, workflow.type,
legacy = FALSE, access, platform, file.type, barcode, data.format,
experimental.strategy, sample.type)
# A simple usage example
query <- GDCquery(project = "TCGA-ACC",
data.category = "Copy Number Variation",
data.type = "Copy Number Segment")
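The project argument takes a GDC project ID. getGDCprojects()$project_id returns the current list of project IDs for the different cancer types in TCGA (and the other programs hosted on GDC). For the cancer name behind each project ID, see: https://www.omicsclass.com/article/1061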
> getGDCprojects()$project_id
[1] "TCGA-MESO" "TCGA-READ" "TCGA-SARC"
[4] "TCGA-ACC" "TCGA-LGG" "TCGA-THCA"
[7] "TARGET-CCSK" "TARGET-NBL" "BEATAML1.0-CRENOLANIB"
[10] "TARGET-AML" "TCGA-SKCM" "TCGA-CHOL"
[13] "TCGA-KIRC" "TCGA-BRCA" "VAREPOP-APOLLO"
[16] "HCMI-CMDC" "ORGANOID-PANCREATIC" "TCGA-GBM"
[19] "TCGA-OV" "FM-AD" "TCGA-UCEC"
[22] "TARGET-ALL-P3" "CGCI-BLGSP" "TARGET-ALL-P2"
[25] "TCGA-LAML" "TCGA-DLBC" "TCGA-KICH"
[28] "TCGA-THYM" "TCGA-UVM" "TCGA-PRAD"
[31] "TCGA-LUSC" "TCGA-TGCT" "CPTAC-3"
[34] "BEATAML1.0-COHORT" "TCGA-STAD" "TCGA-LIHC"
[37] "TCGA-COAD" "TARGET-OS" "TARGET-RT"
[40] "CTSP-DLBCL1" "TCGA-HNSC" "TCGA-ESCA"
[43] "TCGA-CESC" "TCGA-PCPG" "TCGA-KIRP"
[46] "TCGA-UCS" "TCGA-PAAD" "TCGA-LUAD"
[49] "TARGET-WT" "MMRF-COMMPASS" "TCGA-BLCA"
[52] "NCICCR-DLBCL" "TARGET-ALL-P1"
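The listing mixes TCGA projects with TARGET, CPTAC and other programs. Below is a minimal sketch of narrowing it down to TCGA cohorts with base R. The vector is a hand-copied subset of the output above so that the snippet runs offline; in a live session it would come from getGDCprojects()$project_id instead. The variable names are only illustrative.

```r
# A few project IDs copied from the listing above (a subset, for illustration).
# In a live session: project_ids <- TCGAbiolinks::getGDCprojects()$project_id
project_ids <- c("TCGA-MESO", "TARGET-CCSK", "BEATAML1.0-CRENOLANIB",
                 "TCGA-ACC", "CPTAC-3", "TCGA-BRCA", "MMRF-COMMPASS")

# Keep only IDs that start with the "TCGA-" prefix, sorted alphabetically.
tcga_ids <- sort(project_ids[startsWith(project_ids, "TCGA-")])

# Strip the prefix to get the short cancer-type abbreviation (e.g. "ACC").
cancer_codes <- sub("^TCGA-", "", tcga_ids)

print(tcga_ids)
print(cancer_codes)
```

Any element of tcga_ids can then be passed as the project argument of GDCquery.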
> TCGAbiolinks:::getProjectSummary("TCGA-ACC")
$data_categories
case_count file_count data_category
1 80 397 Transcriptome Profiling
2 92 361 Copy Number Variation
3 92 744 Simple Nucleotide Variation
4 80 80 DNA Methylation
5 92 105 Clinical
6 92 352 Sequencing Reads
7 92 517 Biospecimen
$case_count
[1] 92
$file_count
[1] 2556
$file_size
[1] 3.920606e+12
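Before downloading, it can help to sanity-check this summary. The sketch below rebuilds the data_categories table by hand from the output above (so it runs without a network connection) and derives a few quick figures. It assumes that file_size is reported in bytes, as the GDC API does; the variable names are illustrative.

```r
# Data-category table copied from the getProjectSummary("TCGA-ACC") output above.
summary_tbl <- data.frame(
  data_category = c("Transcriptome Profiling", "Copy Number Variation",
                    "Simple Nucleotide Variation", "DNA Methylation",
                    "Clinical", "Sequencing Reads", "Biospecimen"),
  case_count = c(80, 92, 92, 80, 92, 92, 92),
  file_count = c(397, 361, 744, 80, 105, 352, 517),
  stringsAsFactors = FALSE
)

# The per-category file counts should add up to $file_count.
total_files <- sum(summary_tbl$file_count)

# Assumption: $file_size is in bytes; convert to terabytes (10^12 bytes).
file_size_bytes <- 3.920606e12
file_size_tb <- file_size_bytes / 1e12

# Categories that do not cover all 92 cases of the project.
incomplete <- summary_tbl$data_category[summary_tbl$case_count < 92]

print(total_files)
print(round(file_size_tb, 2))
print(incomplete)
```

The values of the data_category column are the strings accepted by the data.category argument for this project.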
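The data.type argument depends on data.category: each data.category has its own set of valid data.type values.
For expression data and similar downloads, the usual settings are: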
# RNA-seq gene expression data
data.type = "Gene Expression Quantification"
# miRNA expression data
data.type = "miRNA Expression Quantification"
# Copy Number Variation data
data.type = "Copy Number Segment"
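Because a data.type is only valid under its matching data.category, a small lookup table can catch typos before a query is sent. The sketch below is an assumption-laden illustration: known_types holds only the combinations that appear in this post, not the full GDC list, and is_known_combination is a hypothetical helper, not part of TCGAbiolinks.

```r
# Partial lookup table: data.category -> data.type values used in this post.
# It is NOT the full GDC list; extend it from the query vignette as needed.
known_types <- list(
  "Transcriptome Profiling" = c("Gene Expression Quantification",
                                "miRNA Expression Quantification"),
  "Copy Number Variation"   = c("Copy Number Segment",
                                "Masked Copy Number Segment")
)

# Hypothetical helper: TRUE when data.type is listed under data.category.
is_known_combination <- function(data.category, data.type) {
  allowed <- known_types[[data.category]]
  !is.null(allowed) && data.type %in% allowed
}

ok_pair      <- is_known_combination("Transcriptome Profiling",
                                     "Gene Expression Quantification")
wrong_pair   <- is_known_combination("Copy Number Variation",
                                     "miRNA Expression Quantification")
unknown_cat  <- is_known_combination("Clinical", "Copy Number Segment")

print(c(ok_pair, wrong_pair, unknown_cat))
```

A category missing from the table returns FALSE rather than an error, so an unlisted combination should be read as "not checked here", not as "invalid on GDC".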
The legacy argument chooses between the two data sources: Harmonized data (legacy = FALSE) and Legacy archive data (legacy = TRUE).
The two sources store different data, and this changes the values that data.category, data.type and workflow.type accept. The accepted values for each source, including those for Harmonized data (legacy = FALSE), are listed at: http://www.bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/query.html
The access argument filters by access type. Possible values: controlled, open. Controlled data is not needed for this kind of work, so this is normally set to access = "open".
7.platform
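This filters by the platform that produced the data, for example a particular array or methylation platform. It is usually left unset unless you need data from one specific platform.
The file.type argument is used when downloading from the GDC Legacy Archive (legacy = TRUE); see the official documentation: http://www.bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/query.html
For the GDC Data Portal it does not need to be set.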
The barcode argument is a list of barcodes used to filter the files to download, so you can request specific samples, for example:
barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01")
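A full TCGA barcode encodes several fields separated by hyphens. The sketch below splits one into its parts, which is handy for deriving a patient ID from a sample barcode. parse_tcga_barcode is a hypothetical helper written for this post, not a TCGAbiolinks function, and it assumes the full seven-part TCGA aliquot barcode (shorter barcodes, such as the TARGET ones used later, would fail the length check).

```r
# Hypothetical helper: split a full TCGA aliquot barcode into named fields.
parse_tcga_barcode <- function(barcode) {
  parts <- strsplit(barcode, "-", fixed = TRUE)[[1]]
  stopifnot(length(parts) == 7)
  list(
    project     = parts[1],
    tss         = parts[2],                 # tissue source site
    participant = parts[3],
    sample      = substr(parts[4], 1, 2),   # two-digit sample type code
    vial        = substr(parts[4], 3, 3),
    portion     = substr(parts[5], 1, 2),
    analyte     = substr(parts[5], 3, 3),   # e.g. R = RNA, D = DNA
    plate       = parts[6],
    center      = parts[7],
    patient_id  = paste(parts[1:3], collapse = "-")
  )
}

fields <- parse_tcga_barcode("TCGA-14-0736-02A-01R-2005-01")
print(fields$patient_id)
print(fields$sample)
print(fields$analyte)
```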
The data.format argument selects files by format. Possible values: ("VCF", "TXT", "BAM", "SVS", "BCR XML", "BCR SSF XML", "TSV", "BCR Auxiliary XML", "BCR OMF XML", "BCR Biotab", "MAF", "BCR PPS XML", "XLSX"). It usually does not need to be set; the default is fine.
The experimental.strategy argument filters data by the experimental method that produced it:
Harmonized: WXS, RNA-Seq, miRNA-Seq, Genotyping Array.
Legacy: WXS, RNA-Seq, miRNA-Seq, Genotyping Array, DNA-Seq, Methylation array, Protein expression array, CGH array, VALIDATION, Gene expression array, WGS, MSI-Mono-Dinucleotide Assay, miRNA expression array, Mixed strategies, AMPLICON, Exon array, Total RNA-Seq, Capillary sequencing, Bisulfite-Seq
The sample.type argument filters by sample type, for example primary tumor tissue, recurrent tumor, and so on.
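The sample type is also encoded in the barcode itself: in a TCGA barcode, characters 14 and 15 hold a two-digit sample type code, where 01 to 09 are tumor samples, 10 to 19 are normal samples and 20 to 29 are controls. The sketch below groups barcodes on that basis. sample_group is a hypothetical helper, and the third barcode is made up purely to show a normal sample.

```r
# Hypothetical helper: classify TCGA barcodes by their sample type code.
sample_group <- function(barcodes) {
  # Characters 14-15 of a TCGA barcode hold the two-digit sample type code.
  code <- as.integer(substr(barcodes, 14, 15))
  ifelse(code <= 9, "tumor", ifelse(code <= 19, "normal", "control"))
}

barcodes <- c("TCGA-14-0736-02A-01R-2005-01",   # from the examples in this post
              "TCGA-OR-A5LR-01A-11D-A29H-01",   # from the examples in this post
              "TCGA-XX-0000-11A-01R-0000-01")   # made-up barcode, normal sample

groups <- sample_group(barcodes)
print(groups)
```

This kind of grouping is useful after download, for example to split an expression matrix into tumor and normal columns.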
With all the arguments covered, here are some usage examples:
query <- GDCquery(project = "TCGA-ACC",
data.category = "Copy Number Variation",
data.type = "Copy Number Segment")
## Not run:
query <- GDCquery(project = "TARGET-AML",
data.category = "Transcriptome Profiling",
data.type = "miRNA Expression Quantification",
workflow.type = "BCGSC miRNA Profiling",
barcode = c("TARGET-20-PARUDL-03A-01R","TARGET-20-PASRRB-03A-01R"))
query <- GDCquery(project = "TARGET-AML",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - Counts",
barcode = c("TARGET-20-PADZCG-04A-01R","TARGET-20-PARJCR-09A-01R"))
query <- GDCquery(project = "TCGA-ACC",
data.category = "Copy Number Variation",
data.type = "Masked Copy Number Segment",
sample.type = c("Primary solid Tumor"))
query.met <- GDCquery(project = c("TCGA-GBM","TCGA-LGG"),
legacy = TRUE,
data.category = "DNA methylation",
platform = "Illumina Human Methylation 450")
query <- GDCquery(project = "TCGA-ACC",
data.category = "Copy number variation",
legacy = TRUE,
file.type = "hg19.seg",
barcode = c("TCGA-OR-A5LR-01A-11D-A29H-01"))
query <- GDCquery(project = "TCGA-GBM",
data.category = "Gene expression",
data.type = "Gene expression quantification",
platform = "Illumina HiSeq",
file.type = "normalized_results",
experimental.strategy = "RNA-Seq",
barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01"),
legacy = TRUE)
GDCdownload(query, method = "client", files.per.chunk = 10, directory="D:/data")
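The main arguments are: query, the result returned by GDCquery; files.per.chunk = 10, the number of files downloaded per batch (use a smaller value on a slow connection); and directory = "D:/data", the path where the data is stored.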
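To make the idea behind files.per.chunk concrete, here is a minimal sketch of splitting a list of files into batches of a fixed size. It only illustrates the chunking concept and is not how TCGAbiolinks implements it internally; chunk_files is a hypothetical helper and the file IDs are placeholders.

```r
# Hypothetical helper: split a vector of file IDs into chunks of at most n.
chunk_files <- function(files, n) {
  stopifnot(n >= 1)
  split(files, ceiling(seq_along(files) / n))
}

file_ids <- paste0("file_", 1:23)   # placeholder IDs, not real GDC UUIDs
chunks <- chunk_files(file_ids, 10)

print(length(chunks))    # number of batches
print(lengths(chunks))   # number of files in each batch
```

With smaller batches, a failed transfer only affects the files in that batch, which is why a smaller value tends to be more robust on an unstable connection.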
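GDCprepare then reads the downloaded files and assembles the gene expression data for us automatically:
data <- GDCprepare(query = query,
save = TRUE,
directory = "D:/data", # must match the directory given to GDCdownload, otherwise GDCprepare cannot find the downloaded files
save.filename = "GBM.RData") # save the result so it can be loaded directly next time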
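Saving the prepared object matters because the download and prepare steps are slow. The same "build once, reload afterwards" pattern can be written as a small generic wrapper, sketched below. load_or_build is a hypothetical helper that uses saveRDS and readRDS rather than the RData file written by GDCprepare, and slow_step is a stand-in for the expensive call so that the snippet runs offline.

```r
# Hypothetical helper: return the cached object if the file exists,
# otherwise run build() once and cache its result with saveRDS().
load_or_build <- function(cache_file, build) {
  if (file.exists(cache_file)) {
    return(readRDS(cache_file))
  }
  result <- build()
  saveRDS(result, cache_file)
  result
}

# Offline demonstration with a stand-in for the expensive step.
# In real use, build could be: function() GDCprepare(query, directory = "D:/data")
cache_file <- tempfile(fileext = ".rds")
n_calls <- 0
slow_step <- function() {
  n_calls <<- n_calls + 1   # count how often the expensive step really runs
  data.frame(gene = c("A", "B"), count = c(10, 20))
}

first  <- load_or_build(cache_file, slow_step)   # runs slow_step and caches
second <- load_or_build(cache_file, slow_step)   # reads the cache instead

print(n_calls)
print(isTRUE(all.equal(first, second)))
```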