Limitations of regress_out as a standard processing step? #526

grst · 2019-03-10T10:31:59Z

According to the PBMC 3k tutorial, which I consider as the 'best practice' tutorial for scanpy, regressing out the fraction of mitochondrial reads and the number of detected genes is recommended as a 'standard processing step'.

Having analysed two different datasets, I am so sure anymore if this is a good idea.

number of detected genes
I loaded these datasets into scanpy and processed them according to the 3k PBMC tutorial:

Savas et al., 2018, ~6k cells, CD3+ T cells, BRCA
Lambrechts et al., 2018, ~32k cells, whole tissue NSCLC

Regress-out seems to perfectly do its jobs on the Savas et al. dataset, that contains closely related cell types (1st row of figure): The 2nd PC is confounded by the number of detected genes and this effect is reduced.

On the Lambrechts et al dataset, that contains all kinds of cells (cancer, stromal, immune), this looks differently: Neither of the first 2 PCs seems to be related to the number of detected genes and it actually seems to me that I am 'loosing' information by applying regress_out (everything is now a single blob).

cell cycle
The lower part of the plot shows regress out applied to the cell cycle (following the scanpy tutorial and the 'alternative approach' described in the seurat vignette, i.e. I regressed out the difference between the G2M and S phase scores):

adata.obs["cell_cycle_diff"] = adata.obs["S_score"] - adata.obs["G2M_score"]
sc.pp.regress_out(adata, ['cell_cycle_diff'])

Like that, the differences between dividing and non-dividing cells should be preserved.
Again, in the Savas dataset, after regressing out the cell cycle effects, G1 is correctly separated from G2M/S. In Lambrechts, there is no clear separation. Having eyeballed at the UMAP-plot (below) it seems that the cell-cycle labels correlate with the cell type (i.e. cancer cells and myeloid cells got the G1 label assigned more likely than T cells).

What is 'best practice'?
I quickly discussed this offline with @flying-sheep, and he encouraged me to create this issue.

Is it just a problem with visualizing the first PC's and regress_out should be applied regardless
Should regress_out be skipped and only applied in a more downstream step when focusing on a single cell type?
Are there any other situations where regress_out could do more harm than good?

PCA plots before and after regress_out

UMAP-plots
The cell cycle label correlates with the cell type (other dataset, but to show what I mean):

The text was updated successfully, but these errors were encountered:

falexwolf · 2019-03-10T15:51:33Z

Great, thank you for summarizing this so neatly. We could even link from the PBMC3k tutorial to this.

Quick answer: yes, there are many situations in which pp.regress_out can do more harm than good. In most cases, I personally don't correct for mitochondrial gene expression, for instance. Quite a few people will agree with that. Whether a certain processing step makes sense or not depends on the data. You should choose with subject knowledge. Because, of that, I found it hard to come up with a best practice tutorial; what came into life as Benchmark with Seurat, remained Seurat's way of defining best practice. Many papers stick to this. But I'm sure that also many Seurat users won't always regress out mitochondrial genes and number of counts per cell.

I'm sure @LuckyMD has a lot to say on this. :)

LuckyMD · 2019-03-11T10:39:55Z

Hi @grst,

We've been working on a best-practices workflow/tutorial for scRNA-sesq data for the past year. The tutorial is based on scanpy, but it also integrates tools from R. You can take a look at a case study which implements the workflow here. The revised manuscript should be submitted this week as well, so I hope it will be out soon. If you like, I can send it to you, if you pass me your email address.

Regarding regressing out covariates. @falexwolf has already mentioned that this is very dataset dependent. It is also dependent on the downstream analysis. This is especially true for the covariates you suggest. MT gene expression and cell cycle are both biological, rather than technical covariates. Regressing out biological covariates is generally done to isolate particular processes in the data that you are interested in, while losing global structure in the data. This is helpful especially for trajectory inference, but maybe less so for global exploratory analysis with clustering. For example, cell cycle stage can be a major determinant in the difference between two cell types (e.g. stem cells and proliferating cells like transit amplifying cells). Removing this effect, hides the distinction. MT gene expression is also a biological covariate (as well as a technical indicator of cell stress). Higher levels of MT gene expression can indicate increased respiration in e.g., asthmatic conditions.

An additional unwanted effect to regressing out biological covariates is that biological processes are not independent. Thus, regressing out one process, will partially remove effects of other, downstream processes as well.

If you're interested, we can discuss this in more detail. Hope this helps a bit,

grst · 2019-03-11T15:02:33Z

Thanks for your comments.
In this case, I will indeed use the function more cautiously in the future.

@LuckyMD, I would be really interested in the manuscript. I'll send you an email regarding that.

Best,
Gregor

Closes scverse#526

shihsama · 2025-03-19T13:31:07Z

Hi @LuckyMD ,
I'm dealing with a large dataset. I just found that pp.regress_out would densify a scipy.sparse._csc.csc_matrix, which might lead to death of Jupyter kernel due to running out of memory. Then I expanded the RAM of my cloud server, and it ran pp.regress_out smoothly.But, the size of the anndata object came to 39 GB from 8GB. That was shocking! I tried scipy.sparse.csr_matrix(adata.X) to recover it to sparse matrix, the kernel died again. I have no idea then.
I had used Seurat long before and the step of regression in Seurat does not change the type of matrix nor the size of the object.
Do you have any idea about how to avoid densifying the matrix?
Thanks for your help~

flying-sheep · 2025-03-29T12:42:19Z

I’m not sure how regress_out could avoid densifying the matrix. Can you show an example of running Seurat’s regress_out on a sparse matrix that retains a similar amount of zeros in the matrix?

flying-sheep added the Question label Nov 13, 2019

flying-sheep closed this as completed in 259bdb2 Nov 13, 2019

nictru mentioned this issue Feb 16, 2024

Add option for regressing out cycling, mitochondrial and ribosomal influences Mye-InfoBank/scRAFIKI#41

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Limitations of regress_out as a standard processing step? #526

Limitations of regress_out as a standard processing step? #526

grst commented Mar 10, 2019

falexwolf commented Mar 10, 2019

Uh oh!

LuckyMD commented Mar 11, 2019 •

edited

Loading

Uh oh!

grst commented Mar 11, 2019

Uh oh!

shihsama commented Mar 19, 2025

Uh oh!

flying-sheep commented Mar 29, 2025

Uh oh!

Limitations of regress_out as a standard processing step? #526

Limitations of regress_out as a standard processing step? #526

Comments

grst commented Mar 10, 2019

falexwolf commented Mar 10, 2019

Uh oh!

LuckyMD commented Mar 11, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

grst commented Mar 11, 2019

Uh oh!

shihsama commented Mar 19, 2025

Uh oh!

flying-sheep commented Mar 29, 2025

Uh oh!

LuckyMD commented Mar 11, 2019 •

edited

Loading