在上一部分的文章中(PANDA姐的转录组入门(7):差异基因分析-1),我们尝试了三种R包进行差异分析,对得到的差异基因进行比较。假设分析方法没有错误的情况下,三个软件分析得到共有的差异基因个数有135个,总数排序是edgeR>limma>deseq2。今天我修改了一下之前的deseq2代码,并且进行了简单的数据可视化探索,仅供参考。
进对差异表达的结果进行检查:可视化或聚类分析
Counts plot
MA-plot
Gene clustering
Independent filtering
## DESeq2
# Analyzing RNA-seq data with DESeq2:
# http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html
### data input
# DESeq2内置library size的矫正,只需要reads计数文件即可。
# count文件所在目录
directory <- "D:/rna_seq/work/matrix"
# 样品的count文件名
sampleFiles <- grep("Mus_musculus.*.count",list.files(directory),value=TRUE)
#########################################################
## [1] "Mus_musculus_E14_cells_Akap95_shRNA_rep1_sorted.count"
## [2] "Mus_musculus_E14_cells_Akap95_shRNA_rep2_sorted.count"
## [3] "Mus_musculus_E14_cells_control_shRNA_rep1_sorted.count"
## [4] "Mus_musculus_E14_cells_control_shRNA_rep2_sorted.count"
########################################################
# 样品名
sampleNames <-sub("Mus_musculus_E14_cells_(.*)_sorted.*","\\1",sampleFiles)
# [1] "Akap95_shRNA_rep1" "Akap95_shRNA_rep2" "control_shRNA_rep1" "control_shRNA_rep2"
# 样品的处理条件
sampleCondition <- sub("Mus_musculus_E14_cells_(.*)_shRNA_rep.*","\\1",sampleFiles)
# [1] "Akap95" "Akap95" "control" "control"
sampleTable <- data.frame(sampleName = sampleNames,
fileName = sampleFiles,
condition = sampleCondition)
############################################################################################
## sampleName fileName condition
## 1 Akap95_shRNA_rep1 Mus_musculus_E14_cells_Akap95_shRNA_rep1_sorted.count Akap95
## 2 Akap95_shRNA_rep2 Mus_musculus_E14_cells_Akap95_shRNA_rep2_sorted.count Akap95
## 3 control_shRNA_rep1 Mus_musculus_E14_cells_control_shRNA_rep1_sorted.count control
## 4 control_shRNA_rep2 Mus_musculus_E14_cells_control_shRNA_rep2_sorted.count control
##########################################################################################
library("DESeq2") # 1.16.1
ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
directory = directory,
design= ~ condition)
# 提示使用factor(…,levels=…) 或者 relevel() 来设置reference level
ddsHTSeq$condition <- relevel(ddsHTSeq$condition, ref="control")
# ddsHTSeq$condition <- factor(ddsHTSeq$condition, levels=c("control","Akap95"))
# 也可以在后面的results() 设置contrast参数
### Pre-filtering
dds <- ddsHTSeq[ rowSums(counts(ddsHTSeq)) > 1, ] # 将所有样本基因表达量之和小于0的基因过滤掉
# 48321个基因 --> 26426个基因,45.3%的基因被过滤掉了
# dds[rowSums(fpm(dds)>1)>=2]
# 更加严格的过滤在差异分析结束后的 results()函数中自动进行Independent filtering of results
### Differential expression analysis
dds <- DESeq(dds)
# DESeq()相当于以下三个步骤
# dds <- estimateSizeFactors(dds)
# dds <- estimateDispersions(dds)
# dds <- nbinomWaldTest(dds)
### Count data transformations
# 为下游的可视化和聚类分析做准备
## rlog()、rlogTransformation()以及与ves()、varianceStabilizingTransformation()的功能是一样的,
## 样本数超过100时,使用vst() 比 rlog() 快. 因为rlog需要拟合每个样本和每个基因的收缩项很耗费时间;
# DESeq()已经估计过离散值了,所以不用再设置blind参数。
rld <- rlogTransformation(dds, blind=FALSE)
write.csv(assay(rld) ,file = 'mm.DESeq2.pseudo.counts.csv')
# res <- results(dds)
res <- results(dds, contrast=c("condition","Akap95","control"))
# 详情见:?results() 当有多个
# res
# Note on p-values set to NA: some values in the results table can be set to NA for one of the following reasons:
# 1. If within a row, all samples have zero counts, the baseMean column will be zero,
# and the log2 fold change estimates, p value and adjusted p value will all be set to NA.
# 2. If a row contains a sample with an extreme count outlier then the p value and adjusted p value will be set to NA.
# These outlier counts are detected by Cook’s distance.
# 3. If a row is filtered by automatic independent filtering, for having a low mean normalized count,
# then only the adjusted p value will be set to NA.
resLFC <- lfcShrink(dds,coef = 2,res=res)
write.table(resLFC,"mm.deseq2.resLFC.csv",quote = F,sep = "\t")
# DESeq(dds,betaPrior=TRUE)跟 lfcShrink()的功能是一样的;
# 1.61版本以上的才有把log2 fold changes修改功能分离为lfcShrink()。
# lfcShrink()运行后结果会少了一列lfccSE,betaPrior=TRUE会保留该行,结果不太一致,但log2fold changes绝对值缩小了。
# Question: New function lfcShrink() in DESeq2:https://support.bioconductor.org/p/95695/
# resLFC
### Exporting results to CSV files
# 在利用数据比较分析两个样品中同一个基因是否存在差异表达的时候,
# 一般选取两个标准:FoldChange和FDR校正后的p值
# 取padj值小于0.05,log2FoldChange大于1的基因作为差异基因集
sig.deseq2_LFC <-subset(resLFC,padj <0.05&(log2FoldChange >1| log2FoldChange < -1))
# log2FoldChange >1| log2FoldChange < -1 也可以写成 abs(log2FoldChange) >= 1)
sig.deseq2_LFC <- as.data.frame(sig.deseq2_LFC)
sig.deseq2<- sig.deseq2_LFC[order(sig.deseq2_LFC$padj),] # 按照padj进行排序
write.csv(sig.deseq2,"mm.deseq2.results.csv",quote = F) # 把结果输出
# padj最小的基因和感兴趣的基因Gapdh,Actb
# 基因在不同组别中标准化后的基因count数
plotCounts(dds, gene=which.min(res$padj), intgroup="condition")
plotCounts(dds, gene="Gapdh", intgroup="condition")
plotCounts(dds, gene="Actb", intgroup="condition")
这个后面两个基因组内差异怎么那么大,看起来,一个天一个地,那如果不做重复的话,就太不靠谱了吧。
# 也可以用ggplot2
library(ggplot2)
geneCounts <- plotCounts(dds, gene = which.min(res$padj), intgroup ="condition",returnData = TRUE)
ggplot(geneCounts, aes(x = condition, y = count)) + geom_point(size = 3)
# 也可以用plotly
# install.packages("plotly")
# library(plotly)
# 绘制可以交互的图
MA-plot (R. Dudoit et al. 2002) ,也叫 mean-difference plot或者Bland-Altman plot,用来估计模型中系数的分布。 X轴, the “A” ( “average”);Y轴,the “M” (“minus”) – subtraction of log values is equivalent to the log of the ratio。
plotMA(res, ylim = c(-5,5))
topGene <- rownames(res)[which.min(res$padj)]
with(res[topGene, ], {
points(baseMean, log2FoldChange, col="dodgerblue", cex=2, lwd=2)
text(baseMean, log2FoldChange, topGene, pos=2, col="dodgerblue")
})
data:image/s3,"s3://crabby-images/1d1cd/1d1cd17f07548956fe2c4e524037417408add969" alt="图片"' fill='%23FFFFFF'%3E%3Crect x='249' y='126' width='1' height='1'%3E%3C/rect%3E%3C/g%3E%3C/g%3E%3C/svg%3E)
resLFC <- lfcShrink(dds,coef = 2,res=res)
# resLFC <- lfcShrink(dds,contrast = c("condition","Akap95","control"),res=res)
plotMA(resLFC, ylim=c(-5,5))
topGene <- rownames(resLFC)[which.min(res$padj)]
with(resLFC[topGene, ], {
points(baseMean, log2FoldChange, col="dodgerblue", cex=2, lwd=2)
text(baseMean, log2FoldChange, topGene, pos=2, col="dodgerblue")
})
idx <- identify(res$baseMean, res$log2FoldChange)
rownames(res)[idx]
data:image/s3,"s3://crabby-images/1d1cd/1d1cd17f07548956fe2c4e524037417408add969" alt="图片"' fill='%23FFFFFF'%3E%3Crect x='249' y='126' width='1' height='1'%3E%3C/rect%3E%3C/g%3E%3C/g%3E%3C/svg%3E)
hist(res$pvalue[res$baseMean > 1], breaks = 0:20/20,
col = "grey50", border = "white")
Histogram of p values for genes with mean normalized count larger than 1.
热图
library("genefilter")
topVarGenes <- head(order(rowVars(assay(rld)), decreasing = TRUE), 20)
mat <- assay(rld)[ topVarGenes, ]
# mat <- mat - rowMeans(mat)
anno <- as.data.frame(colData(rld)[,c("condition","sizeFactor")])
# install.packages('pheatmap')
library(pheatmap)
pheatmap(mat, annotation_col = anno)
我想把sizeFactor去掉,但去掉就画不了,奇怪。
下面的图是对mat <- mat - rowMeans(mat)
减去了一个平均值,让数值更集中了。
火山图
# 这是一行perl单行命令,把差异分析结果中显著基因进行标注并删除padj为NA的基因
# perl -F'\t' -alne '$F[5]=~s/\r//;if(/baseMean/){print "gene\tlog2FoldChange\tpadj\tsignificant"} \
else{unless(/NA/){if($F[2] >=1 && $F[5]<0.05){print "$F[0]\t$F[2]\t$F[5]\tup"} \
elsif($F[2]<=-1 && $F[5]<0.05){print "$F[0]\t$F[2]\t$F[5]\tdown"} \
else{print "$F[0]\t$F[2]\t$F[5]\tno"}}}' mm.deseq2.resLFC.csv >volcano.txt
# 以下是画火山图的代码
data <- read.table(file="volcano.txt",header = TRUE, row.names =1,sep = "\t")
volcano <- ggplot(data, aes(x= log2FoldChange, y= -1*log10(padj))) + geom_point(aes(color=significant))
+ scale_color_manual(values =c("red","grey", "blue"))
+ labs(title="Volcanoplot",x=expression(log[2](FC)), y=expression(-log[10](padj)))
+ geom_hline(yintercept=1.3,linetype=4)+geom_vline(xintercept=c(-1,1),linetype=4)
data:image/s3,"s3://crabby-images/1d1cd/1d1cd17f07548956fe2c4e524037417408add969" alt="图片"' fill='%23FFFFFF'%3E%3Crect x='249' y='126' width='1' height='1'%3E%3C/rect%3E%3C/g%3E%3C/g%3E%3C/svg%3E)
这个图,我就模仿一个公众号的代码画的,不知道对不对,大概就是这样吧~~~
qs <- c(0, quantile(resLFC$baseMean[resLFC$baseMean > 0], 0:6/6))
bins <- cut(resLFC$baseMean, qs)
levels(bins) <- paste0("~", round(signif((qs[-1] + qs[-length(qs)])/2, 2)))
fractionSig <- tapply(resLFC$pvalue, bins, function(p)
mean(p < .05, na.rm = TRUE))
barplot(fractionSig, xlab = "mean normalized count",
ylab = "fraction of small p values")
The ratio of small p values for genes binned by mean normalized count
rwzx君教你画火山图
RNA-seq workflow: gene-level exploratory analysis and differential expression:https://bioconductor.org/help/workflows/rnaseqGene/
~ 以上代码不能看,我也拯救不了了,转到电脑上看吧 ~
有问题请联系我
个人微信ID:
Shenmengyuan1993