Skip to content

Limitations of regress_out as a standard processing step? #526

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
grst opened this issue Mar 10, 2019 · 5 comments
Closed

Limitations of regress_out as a standard processing step? #526

grst opened this issue Mar 10, 2019 · 5 comments
Labels

Comments

@grst
Copy link
Contributor

grst commented Mar 10, 2019

According to the PBMC 3k tutorial, which I consider as the 'best practice' tutorial for scanpy, regressing out the fraction of mitochondrial reads and the number of detected genes is recommended as a 'standard processing step'.

Having analysed two different datasets, I am so sure anymore if this is a good idea.

number of detected genes
I loaded these datasets into scanpy and processed them according to the 3k PBMC tutorial:

Regress-out seems to perfectly do its jobs on the Savas et al. dataset, that contains closely related cell types (1st row of figure): The 2nd PC is confounded by the number of detected genes and this effect is reduced.

On the Lambrechts et al dataset, that contains all kinds of cells (cancer, stromal, immune), this looks differently: Neither of the first 2 PCs seems to be related to the number of detected genes and it actually seems to me that I am 'loosing' information by applying regress_out (everything is now a single blob).

cell cycle
The lower part of the plot shows regress out applied to the cell cycle (following the scanpy tutorial and the 'alternative approach' described in the seurat vignette, i.e. I regressed out the difference between the G2M and S phase scores):

adata.obs["cell_cycle_diff"] = adata.obs["S_score"] - adata.obs["G2M_score"]
sc.pp.regress_out(adata, ['cell_cycle_diff'])

Like that, the differences between dividing and non-dividing cells should be preserved.
Again, in the Savas dataset, after regressing out the cell cycle effects, G1 is correctly separated from G2M/S. In Lambrechts, there is no clear separation. Having eyeballed at the UMAP-plot (below) it seems that the cell-cycle labels correlate with the cell type (i.e. cancer cells and myeloid cells got the G1 label assigned more likely than T cells).

What is 'best practice'?
I quickly discussed this offline with @flying-sheep, and he encouraged me to create this issue.

  • Is it just a problem with visualizing the first PC's and regress_out should be applied regardless
  • Should regress_out be skipped and only applied in a more downstream step when focusing on a single cell type?
  • Are there any other situations where regress_out could do more harm than good?

PCA plots before and after regress_out
regress_out

UMAP-plots
The cell cycle label correlates with the cell type (other dataset, but to show what I mean):
2019-03-10_11:20:31_384x234
2019-03-10_11:25:30_428x231

@falexwolf
Copy link
Member

Great, thank you for summarizing this so neatly. We could even link from the PBMC3k tutorial to this.

Quick answer: yes, there are many situations in which pp.regress_out can do more harm than good. In most cases, I personally don't correct for mitochondrial gene expression, for instance. Quite a few people will agree with that. Whether a certain processing step makes sense or not depends on the data. You should choose with subject knowledge. Because, of that, I found it hard to come up with a best practice tutorial; what came into life as Benchmark with Seurat, remained Seurat's way of defining best practice. Many papers stick to this. But I'm sure that also many Seurat users won't always regress out mitochondrial genes and number of counts per cell.

I'm sure @LuckyMD has a lot to say on this. :)

@LuckyMD
Copy link
Contributor

LuckyMD commented Mar 11, 2019

Hi @grst,

We've been working on a best-practices workflow/tutorial for scRNA-sesq data for the past year. The tutorial is based on scanpy, but it also integrates tools from R. You can take a look at a case study which implements the workflow here. The revised manuscript should be submitted this week as well, so I hope it will be out soon. If you like, I can send it to you, if you pass me your email address.

Regarding regressing out covariates. @falexwolf has already mentioned that this is very dataset dependent. It is also dependent on the downstream analysis. This is especially true for the covariates you suggest. MT gene expression and cell cycle are both biological, rather than technical covariates. Regressing out biological covariates is generally done to isolate particular processes in the data that you are interested in, while losing global structure in the data. This is helpful especially for trajectory inference, but maybe less so for global exploratory analysis with clustering. For example, cell cycle stage can be a major determinant in the difference between two cell types (e.g. stem cells and proliferating cells like transit amplifying cells). Removing this effect, hides the distinction. MT gene expression is also a biological covariate (as well as a technical indicator of cell stress). Higher levels of MT gene expression can indicate increased respiration in e.g., asthmatic conditions.

An additional unwanted effect to regressing out biological covariates is that biological processes are not independent. Thus, regressing out one process, will partially remove effects of other, downstream processes as well.

If you're interested, we can discuss this in more detail. Hope this helps a bit,

@grst
Copy link
Contributor Author

grst commented Mar 11, 2019

Thanks for your comments.
In this case, I will indeed use the function more cautiously in the future.

@LuckyMD, I would be really interested in the manuscript. I'll send you an email regarding that.

Best,
Gregor

awnimo pushed a commit to dpeerlab/scanpy that referenced this issue Dec 17, 2019

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
Closes scverse#526
@shihsama
Copy link

Hi @LuckyMD ,
I'm dealing with a large dataset. I just found that pp.regress_out would densify a scipy.sparse._csc.csc_matrix, which might lead to death of Jupyter kernel due to running out of memory. Then I expanded the RAM of my cloud server, and it ran pp.regress_out smoothly.But, the size of the anndata object came to 39 GB from 8GB. That was shocking! I tried scipy.sparse.csr_matrix(adata.X) to recover it to sparse matrix, the kernel died again. I have no idea then.
I had used Seurat long before and the step of regression in Seurat does not change the type of matrix nor the size of the object.
Do you have any idea about how to avoid densifying the matrix?
Thanks for your help~

@flying-sheep
Copy link
Member

I’m not sure how regress_out could avoid densifying the matrix. Can you show an example of running Seurat’s regress_out on a sparse matrix that retains a similar amount of zeros in the matrix?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants