Unclear what IntegrateLayers (v5) is actually doing vs IntegrateData (v4) #8653

Closed
Jojace opened this issue Mar 20, 2024 · 4 comments
Labels: documentation (Error in documentation)


Jojace commented Mar 20, 2024

Hello,

In the v4 integration workflow, it was very clear what was going on in each integration step, as documented at the bottom of the documentation page for each command. A basic workflow looked like this:

1) SelectIntegrationFeatures()
Summary:
Choose the features to use when integrating multiple datasets. This function ranks features by the number of datasets they are deemed variable in, breaking ties by the median variable feature rank across datasets. It returns the top scoring features by this ranking.
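
For reference, step 1 in code. This is a sketch: obj.list (a list of per-dataset Seurat objects, each already normalized) and the nfeatures value are illustrative, not prescribed.

```r
library(Seurat)

# obj.list: list of normalized Seurat objects, one per dataset (illustrative)
# Ranks features by the number of datasets calling them variable, breaks ties
# by median variable-feature rank, and returns the top nfeatures features.
features <- SelectIntegrationFeatures(object.list = obj.list, nfeatures = 2000)
```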

2) FindIntegrationAnchors()
Summary:
Perform dimensional reduction on the dataset pair as specified via the reduction parameter. If l2.norm is set to TRUE, perform L2 normalization of the embedding vectors.

Identify anchors - pairs of cells from each dataset that are contained within each other's neighborhoods (also known as mutual nearest neighbors).

Filter low confidence anchors to ensure anchors in the low dimension space are in broad agreement with the high dimensional measurements. This is done by looking at the neighbors of each query cell in the reference dataset using max.features to define this space. If the reference cell isn't found within the first k.filter neighbors, remove the anchor.

Assign each remaining anchor a score. For each anchor cell, determine the nearest k.score anchors within its own dataset and within its pair's dataset. Based on these neighborhoods, construct an overall neighbor graph and then compute the shared neighbor overlap between anchor and query cells (analogous to an SNN graph). We use the 0.01 and 0.90 quantiles on these scores to dampen outlier effects and rescale to range between 0 and 1.
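
The parameters named in these steps map directly onto the function's arguments. A sketch continuing from the snippet above (the values shown are the documented defaults or illustrative choices):

```r
# reduction: dimensional reduction used for each dataset pair ("cca" or "rpca")
# l2.norm:   L2-normalize the embedding vectors before neighbor finding
# k.filter / max.features: filter low-confidence anchors against gene space
# k.score:   neighborhood size used to score each retained anchor
anchors <- FindIntegrationAnchors(
  object.list     = obj.list,
  anchor.features = features,
  reduction       = "cca",
  l2.norm         = TRUE,
  dims            = 1:30,
  k.filter        = 200,
  max.features    = 200,
  k.score         = 30
)
```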

3) IntegrateData()
Summary:
For pairwise integration:

Construct a weights matrix that defines the association between each query cell and each anchor. These weights are computed as (1 - the distance between the query cell and the anchor, divided by the distance of the query cell to the k.weight-th anchor), multiplied by the anchor score computed in FindIntegrationAnchors. We then apply a Gaussian kernel with a bandwidth defined by sd.weight and normalize across all k.weight anchors (see the formula sketch after these steps).

Compute the anchor integration matrix as the difference between the two expression matrices for every pair of anchor cells.

Compute the transformation matrix as the product of the integration matrix and the weights matrix.

Subtract the transformation matrix from the original expression matrix.
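
Reading those steps as formulas (my own sketch from the prose above, not taken from the code): for query cell $c$ and anchor $a$ with anchor score $S_a$, where $a_{(j)}$ denotes the $j$-th nearest anchor to $c$, $k$ stands for the k.weight parameter, and $d(\cdot,\cdot)$ is distance,

$$
w_{c,a} = \left(1 - \frac{d(c, a)}{d\big(c,\, a_{(k)}\big)}\right) S_a
$$

A Gaussian kernel with bandwidth sd.weight is then applied and the weights are normalized over each cell's k.weight anchors, giving the weights matrix $W$. With $B$ the anchor integration matrix (expression differences across anchor pairs) and $Y$ the original expression matrix, the last two steps are

$$
C = B\,W, \qquad \hat{Y} = Y - C
$$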

It is clear then that the output of this integration workflow is a corrected matrix used for downstream analysis.
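
Step 3 in code, continuing the sketch from above (k.weight and sd.weight shown at their documented defaults):

```r
combined <- IntegrateData(
  anchorset = anchors,
  dims      = 1:30,
  k.weight  = 100,  # number of anchors used to weight each query cell
  sd.weight = 1     # bandwidth of the Gaussian kernel on the weights
)

# The corrected matrix lands in a new "integrated" assay,
# which downstream analysis then runs on.
DefaultAssay(combined) <- "integrated"
combined <- ScaleData(combined)
combined <- RunPCA(combined, npcs = 30)
```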

In v5, integration is completed with one function, IntegrateLayers().
The output of this function is not a corrected matrix but a dimensionality reduction, generated by "correcting" a reduction already computed on the unintegrated data (e.g. an unintegrated PCA). The function therefore requires an initial dimensionality reduction as input: for example, it takes pca embeddings and outputs integrated.cca embeddings.

The input and output requirements of the new workflow are very different.
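
For concreteness, the v5 call looks roughly like this (names follow the v5 integration vignette; obj is assumed to already hold a PCA of the unintegrated data):

```r
obj <- IntegrateLayers(
  object         = obj,
  method         = CCAIntegration,
  orig.reduction = "pca",            # existing, unintegrated reduction (input)
  new.reduction  = "integrated.cca"  # corrected reduction that is added (output)
)
```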

Here are my questions:

  1. What are the actual steps implemented to produce a corrected dimensionality reduction, on which downstream integrated analysis is based?
  2. How do they differ from the workflow described above?
  3. Is there documentation for the steps involved? Right now, it seems the best way to figure out what's happening is by trying to read the code of each function.

I would like to use v5 for data analysis, but can't do so unless I can explain what it's doing.
Thanks for your help!

Jojace added the documentation label Mar 20, 2024
rsatija (Collaborator) commented Mar 25, 2024

Thanks for your comment. The integration workflow in v5 is streamlined from a code/usability perspective, but the steps are the same as the workflow you pasted above.

As we say in the documentation (and as you note above), the major difference is that in Seurat v5 we perform the integration in low-dimensional space. What that means is that, prior to performing any integration, we run a PCA on the full dataset and use those values (instead of the gene expression values themselves) as input for correction. This is represented by the orig.reduction parameter in IntegrateLayers.

I hope that helps. To summarize again: we did not intend to change the steps of the integration workflow when using IntegrateLayers, but we did streamline it for users (and we perform the correction in low-dimensional space rather than on gene expression values).
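
To make that ordering concrete, here is my sketch of the v5 sequence based on the vignette (the batch metadata column is illustrative):

```r
# Split the RNA assay into one layer per dataset
obj[["RNA"]] <- split(obj[["RNA"]], f = obj$batch)

# Standard preprocessing; RunPCA produces the unintegrated embedding
obj <- NormalizeData(obj)
obj <- FindVariableFeatures(obj)
obj <- ScaleData(obj)
obj <- RunPCA(obj)

# Correction happens in this low-dimensional space,
# not on the gene expression values themselves
obj <- IntegrateLayers(
  object = obj, method = CCAIntegration,
  orig.reduction = "pca", new.reduction = "integrated.cca"
)
```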

JABioinf commented

What is the expected output of IntegrateLayers() on a Seurat object? Notably, is it still supposed to generate an "integrated" assay? I have been having an issue in a recent version (5.0.2) where the integrated assay is no longer available after integration. Is that expected or not? (If not, I will try to generate a reproducible example of the bug.)

rsatija (Collaborator) commented Mar 27, 2024

We no longer generate an integrated assay; instead, we generate an integrated dimensional reduction embedding that can be used directly as input for clustering and identification of cell types and states across datasets.

If you do want to generate an integrated assay, the old integration workflows (using IntegrateData instead of IntegrateLayers) are still supported in Seurat v5.
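
In practice that means pointing the downstream steps at the new reduction rather than at an "integrated" assay. A sketch, using the reduction name from the v5 vignette:

```r
obj <- FindNeighbors(obj, reduction = "integrated.cca", dims = 1:30)
obj <- FindClusters(obj, resolution = 0.8)
obj <- RunUMAP(obj, reduction = "integrated.cca", dims = 1:30)

# Rejoin the split layers afterwards, e.g. before differential expression
obj[["RNA"]] <- JoinLayers(obj[["RNA"]])
```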

zofieLin commented

We no longer generate an integrated assay; instead, we generate an integrated dimensional reduction embedding that can be used directly as input for clustering and identification of cell types and states across datasets.

If you do want to generate an integrated assay, the old integration workflows (using IntegrateData instead of IntegrateLayers) are still supported in Seurat v5.

Does this mean that when using IntegrateLayers(), there is no need to run FindIntegrationAnchors() and IntegrateData() beforehand? I am still confused about this. If I have multiple samples, should I normalize each sample and integrate with FindIntegrationAnchors() and IntegrateData() first, then follow with IntegrateLayers()? Or should I directly merge all samples and then run IntegrateLayers()?
