
SentenceTransformer API vs. Transformer API + pooling #405

Open

Description

@githubrandomuser2017

In your documentation you mention two approaches to using your package to create sentence embeddings.

First, from the Quickstart, you wrote:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')

# Our sentences we would like to encode
sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.',
    'The quick brown fox jumps over the lazy dog.']

# Sentences are encoded by calling model.encode()
sentence_embeddings = model.encode(sentences)
print(sentence_embeddings.shape)
# (3, 768)

Second, from Sentence Embeddings with Transformers, you wrote:

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("sentence-transformers/bert-base-nli-mean-tokens")
# Model is of type: transformers.modeling_bert.BertModel

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print(sentence_embeddings.shape)
# torch.Size([3, 768])
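(For completeness, the quoted snippet relies on encoded_input and mean_pooling, which are defined earlier on that documentation page; a minimal version along the same lines would be roughly:)

import torch
from transformers import AutoTokenizer

# Tokenize the same `sentences` list as in the first snippet, padding to equal length
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/bert-base-nli-mean-tokens")
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Mean pooling: average the token embeddings, ignoring padding via the attention mask
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element contains all token embeddings
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * mask, dim=1) / torch.clamp(mask.sum(dim=1), min=1e-9)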

What are the important differences between these two approaches? The only difference I can see is that in the second approach, the BertModel returns token embeddings and you manually perform pooling (mean or max). If I use this second approach, what would I be missing compared to SentenceTransformer.encode()?

Activity

nreimers (Member) commented on Sep 2, 2020

SentenceTransformer.encode() performs various optimization steps to ensure the input is encoded as fast as possible. Tokenization and embedding computation can run in parallel (if a GPU is available); further, it batches the input so that only minimal padding is needed, which also improves performance.

In general, SentenceTransformer.encode() is, in my opinion, far more convenient than the AutoModel approach if you want to get embeddings for input texts.
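For example, a typical call looks like this (the argument values here are just illustrative; defaults work fine too):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')

sentences = ['This framework generates embeddings for each input sentence',
             'The quick brown fox jumps over the lazy dog.']

# encode() handles tokenization, batching with minimal padding, moving data
# to the GPU if one is available, and pooling, all in a single call.
embeddings = model.encode(
    sentences,
    batch_size=32,           # sentences per forward pass
    show_progress_bar=True,  # useful for large corpora
    convert_to_tensor=True,  # return a torch.Tensor instead of a numpy array
)
print(embeddings.shape)  # torch.Size([2, 768])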

githubrandomuser2017 (Author) commented on Sep 2, 2020

In your section Sentence Embeddings with Transformers, you wrote:

Most of our pre-trained models are based on Huggingface.co/Transformers and are also hosted in the model repository of Hugging Face.

In the HuggingFace models repository, I see a lot of different models, including those from your sentence-transformers package:

  • bert-base-uncased
  • gpt2
  • deepset/roberta-base-squad2
  • sentence-transformers/bert-base-nli-mean-tokens

Is it possible to use any of these models, or just those with sentence-transformers? Am I correct in assuming that your models are specifically configured to return the token embeddings? That model output can then be run through a pooler function.

nreimers (Member) commented on Sep 3, 2020

In theory you could use any of them; however, out of the box, they do not produce good sentence embeddings.

The sentence-transformers models were specifically trained to produce meaningful sentence embeddings.

The other models also return token embeddings. However, when you average them, the representation does not necessarily make sense.

MathewAlexander commented on Sep 8, 2020

@nreimers
When I fine-tuned XLNet-large on STS-B, I got a Pearson correlation coefficient of +0.917 on the development set. Also, in the leaderboards given here and here, I see many models above 90. Doesn't that mean they are better than the sentence-transformers models?

nreimers (Member) commented on Sep 9, 2020

Hi @MathewAlexander

It depends on your use case: If you just want to compute the similarity for two sentences, then using BERT & Co. works better. For this, you don't need this package. You pass both sentences together to BERT and get a score that indicates the similarity.
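A minimal sketch of that pairwise setup, using the CrossEncoder class from this package (the model name and printed score are just illustrative):

from sentence_transformers import CrossEncoder

# A cross-encoder reads both sentences together and outputs one similarity score.
model = CrossEncoder('cross-encoder/stsb-roberta-base')

score = model.predict([('A man is eating food.', 'A man is eating a piece of bread.')])
print(score)  # e.g. array([0.78]); higher means more similar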

However, this scales badly. Assume you have 10k sentences and you want to find the most similar pair: 10k sentences lead to roughly 50 million different pairs (n*(n-1)/2), so you would need to run the BERT cross-encoder on all of them, which takes a very long time.

With SentenceTransformer, you compute an embedding for each of the 10k sentences once and then compare them with cosine similarity. This takes only seconds.

The quality will be somewhat worse, but you get the results within seconds and don't have to wait hours or even days.
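A rough sketch of that embedding route (the corpus here is made up; in practice it would be the 10k sentences, each encoded exactly once):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')

corpus = ['A man is eating food.',
          'A man is eating a piece of bread.',
          'The girl is carrying a baby.',
          'A cheetah is running behind its prey.']
embeddings = model.encode(corpus, convert_to_tensor=True)

# Cosine similarity between all pairs is a single matrix operation
cos_scores = util.pytorch_cos_sim(embeddings, embeddings)  # shape: (len(corpus), len(corpus))
print(cos_scores)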

MathewAlexander commented on Sep 9, 2020

Hi @nreimers
That makes sense. Thanks for the explanation

githubrandomuser2017 (Author) commented on Sep 9, 2020

@nreimers

With SentenceTransformer, you compute an embedding for each of the 10k sentences once and then compare them with cosine similarity.

If you compute an embedding for each sentence individually, how do you update the BERT weights during training backprop? Your paper does say that you update BERT (in Section 3):

In order to fine-tune BERT / RoBERTa, we create siamese and triplet networks (Schroff et al., 2015) to update the weights

nreimers (Member) commented on Sep 10, 2020

As mentioned in the paper, by using siamese or triplet networks, depending on the loss.

You pass a sentence pair (or triplet) with a label during training, measure the error, and do backprop.
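Roughly, with the sentence-transformers training API (the pairs, labels, and hyperparameters below are made up):

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')

# Each InputExample is a sentence pair plus a gold similarity label in [0, 1]
train_examples = [
    InputExample(texts=['A man is eating food.', 'A man is eating a piece of bread.'], label=0.9),
    InputExample(texts=['A man is eating food.', 'A cheetah is running behind its prey.'], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Both sentences of a pair run through the same transformer weights (siamese setup);
# the loss compares the cosine similarity of the two embeddings to the label,
# and backprop updates the shared weights.
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)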

githubrandomuser2017 (Author) commented on Sep 19, 2020

@nreimers Why don't you use GPT2 as the basis of a Sentence Transformer model?

nreimers (Member) commented on Sep 21, 2020

@githubrandomuser2017
When SBERT was created, GPT2 was not available.

I never tested GPT2, but I think masked language modeling, as used in BERT, is a better pre-training task for obtaining sentence embeddings than the causal language modeling used by GPT2.

But it would be easy to fine-tune and test GPT2 with sentence-transformers.
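A rough, untested sketch of wiring GPT2 into a SentenceTransformer via the modules API (note that GPT2 ships without a padding token, so one has to be assigned):

from sentence_transformers import SentenceTransformer, models

# Wrap the Hugging Face GPT2 checkpoint as a word-embedding module
word_embedding_model = models.Transformer('gpt2', max_seq_length=128)

# GPT2 has no padding token by default; reuse the end-of-text token for padding
word_embedding_model.tokenizer.pad_token = word_embedding_model.tokenizer.eos_token

# Add mean pooling on top of the token embeddings to get one vector per sentence
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
# The resulting model can then be fine-tuned with the usual sentence-transformers losses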


Metadata

Assignees: none
Labels: none
Type: none
Projects: none
Milestone: none
Participants: @nreimers, @githubrandomuser2017, @MathewAlexander