-
Notifications
You must be signed in to change notification settings - Fork 638
Limitations of regress_out as a standard processing step? #526
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Great, thank you for summarizing this so neatly. We could even link from the PBMC3k tutorial to this. Quick answer: yes, there are many situations in which I'm sure @LuckyMD has a lot to say on this. :) |
Hi @grst, We've been working on a best-practices workflow/tutorial for scRNA-sesq data for the past year. The tutorial is based on scanpy, but it also integrates tools from R. You can take a look at a case study which implements the workflow here. The revised manuscript should be submitted this week as well, so I hope it will be out soon. If you like, I can send it to you, if you pass me your email address. Regarding regressing out covariates. @falexwolf has already mentioned that this is very dataset dependent. It is also dependent on the downstream analysis. This is especially true for the covariates you suggest. MT gene expression and cell cycle are both biological, rather than technical covariates. Regressing out biological covariates is generally done to isolate particular processes in the data that you are interested in, while losing global structure in the data. This is helpful especially for trajectory inference, but maybe less so for global exploratory analysis with clustering. For example, cell cycle stage can be a major determinant in the difference between two cell types (e.g. stem cells and proliferating cells like transit amplifying cells). Removing this effect, hides the distinction. MT gene expression is also a biological covariate (as well as a technical indicator of cell stress). Higher levels of MT gene expression can indicate increased respiration in e.g., asthmatic conditions. An additional unwanted effect to regressing out biological covariates is that biological processes are not independent. Thus, regressing out one process, will partially remove effects of other, downstream processes as well. If you're interested, we can discuss this in more detail. Hope this helps a bit, |
Thanks for your comments. @LuckyMD, I would be really interested in the manuscript. I'll send you an email regarding that. Best, |
Hi @LuckyMD , |
I’m not sure how |
According to the PBMC 3k tutorial, which I consider as the 'best practice' tutorial for scanpy, regressing out the fraction of mitochondrial reads and the number of detected genes is recommended as a 'standard processing step'.
Having analysed two different datasets, I am so sure anymore if this is a good idea.
number of detected genes
I loaded these datasets into scanpy and processed them according to the 3k PBMC tutorial:
Regress-out seems to perfectly do its jobs on the Savas et al. dataset, that contains closely related cell types (1st row of figure): The 2nd PC is confounded by the number of detected genes and this effect is reduced.
On the Lambrechts et al dataset, that contains all kinds of cells (cancer, stromal, immune), this looks differently: Neither of the first 2 PCs seems to be related to the number of detected genes and it actually seems to me that I am 'loosing' information by applying
regress_out
(everything is now a single blob).cell cycle
The lower part of the plot shows regress out applied to the cell cycle (following the scanpy tutorial and the 'alternative approach' described in the seurat vignette, i.e. I regressed out the difference between the G2M and S phase scores):
Like that, the differences between dividing and non-dividing cells should be preserved.
Again, in the Savas dataset, after regressing out the cell cycle effects, G1 is correctly separated from G2M/S. In Lambrechts, there is no clear separation. Having eyeballed at the UMAP-plot (below) it seems that the cell-cycle labels correlate with the cell type (i.e. cancer cells and myeloid cells got the G1 label assigned more likely than T cells).
What is 'best practice'?
I quickly discussed this offline with @flying-sheep, and he encouraged me to create this issue.
regress_out
should be applied regardlessregress_out
be skipped and only applied in a more downstream step when focusing on a single cell type?regress_out
could do more harm than good?PCA plots before and after

regress_out
UMAP-plots


The cell cycle label correlates with the cell type (other dataset, but to show what I mean):
The text was updated successfully, but these errors were encountered: