ChIP-Seq数据质控之ChIPQC使用

原创豆豆花花生信星球

今天是生信星球陪你的第653天

大神一句话，菜鸟跑半年。我不是大神，但我可以缩短你走弯路的半年~

就像歌儿唱的那样，如果你不知道该往哪儿走，就留在这学点生信好不好~

这里有豆豆和花花的学习历程，从新手到进阶，生信路上有你有我！

豆豆写于2020.6.18
首先是ChIP-Seq分析的前言介绍部分：
1：了解ChIP-seq的实验流程
2：继续了解ChIP-Seq
3：ChIP-seq的实验对照与偏差来源
4：ChIP-Seq的实验设计补充
5：ChIP-Seq数据库及实战数据介绍
然后开始实战部分：
6：ChIP-Seq计算资源准备与实战数据下载
7：ChIP-Seq数据质控和过滤
8：ChIP-Seq数据比对注意事项
9：ChIP-Seq数据比对实战
这次将介绍如何针对性地对ChIP-Seq数据进行质控。
这一次先学会用，下次会详细解释其中每个指标的含义

主要有两大函数：

ChIPQCsample() ：得到单个样本的QC报告
ChIPQC()：对多个样本进行ChIPQCsample()操作

两个函数均需要提供比对数据（一个或者多个比对bam文件），至于peak文件就是可选的，没有的话后面的peak相关的统计数据就不显示

如果是处理单个样本：

sample = ChIPQCsample("chip.bam")
sample
ChIPQCreport(sample)

如果是处理多个样本：

exampleExp = ChIPQC(samples,annotaiton="hg19")
ChIPQCreport(exampleExp)

而这个samples需要提供一个实验设计数据框，看一下说明书给的例子：

其中包括了比对bam和peak文件的位置

samples = read.csv(file.path(system.file("extdata", package="ChIPQC"),
                             "example_QCexperiment.csv"))
View(samples)

下面的解释来自：https://hbctraining.github.io/In-depth-NGS-Data-Analysis-Course/sessionV/lessons/05_combine_chipQC_and_metrics.html

SampleID: Identifier string for sample
Tissue, Factor, Condition: Identifier strings for up to three different factors (You will need to have all columns listed. If you don’t have infomation, then set values to NA) 但好像CHIPQC的帮助文档并没有把三个全部都列出来
Replicate: Replicate number of sample
bamReads: file path for BAM file containing aligned reads for ChIP sample
ControlID: an identifier string for the control sample
bamControl: file path for bam file containing aligned reads for control sample
Peaks: path for file containing peaks for sample
PeakCaller: Identifier string for peak caller used. Possible values include “raw”, “bed”, “narrow”, “macs”

注意：

当运行多个样本时，ChIPQC()会调用BiocParallel包，默认并行计算
ChIPQC()提供了一系列参数：
annotaiton参数表示比对使用的参考基因组；
chromosomes参数指定对哪条染色体进行统计，可以加快计算速度，并减小得到的ChIPQCexperiment对象大小；
mapQCth参数表示对比对结果质量过滤的阈值，默认15
blacklist参数可以指定一个blacklist文件或者一个GRange对象（存储相应的blacklist区域信息），后面统计时会过滤掉这部分区域；默认情况下，使用annotaiton=hg19 会自动使用一个blacklist，不需要额外提供

给出统计数据：QCmetrics()

画出peaks中reads的数量：plotRap()

plotRap(exampleExp, facetBy="Condition")

当然，facetBy可以指定多个值，只要存在在samples这个数据框的列名即可，例如 plotRap(tamoxifen,facetBy=c("Tissue","Condition")) 会同时按照两个因素进行区分

样本相关性热图：plotCorHeatmap()

plotCorHeatmap(exampleExp, attributes = "Condition:Replicate", lineBy = "Replicate")

PCA图：plotPrincomp()

plotPrincomp(exampleExp, attributes = "Condition", label="Replicate", dotSize=1, labelSize=0.75)

cross-coverage：plotCC()

plotCC(exampleExp, facetBy="Condition")
# 除了facetBy ，还有colourBy，lineBy，例如
# plotCC(exampleExp,facetBy="Sample",colourBy="Factor",lineBy="Tissue")

# 统计量的计算
# FragCC = CrossCoverageScore_max
FragmentLengthCrossCoverage(exampleExp)
# CrossCoverageScore_readLength
ReadLengthCrossCoverage(exampleExp)
# RelCC = FragCC/CrossCoverageScore_readLength
RelativeCrossCoverage(exampleExp)

Peaks的占比：plotFrip()

 plotFrip(exampleExp, facetBy="Condition")

富集区域分布：plotRegi()

plotRegi(exampleExp, facet=F)
# 单独查看
regi(exampleExp)["All5utrs",]

peak profile：plotPeakProfile()

plotPeakProfile(exampleExp)
# 好的实验会像下图：IP有峰，对照较平

保存以上全部的结果

ChIPQCreport(exampleExp, facetBy = "Condition", lineBy = "Replicate")

点击底部的“阅读原文”，获得更好的阅读体验哦😻

初学生信，很荣幸带你迈出第一步

🤓生信星球 🌎~ 一个不拽术语、通俗易懂的生信知识平台

继续滑动看下一个