tokenizer is slow after adding new tokens #615


Closed
davidnarganes opened this issue Feb 3, 2021 · 26 comments · Fixed by huggingface/transformers#13220


davidnarganes commented Feb 3, 2021

Hi,

I'm redirecting this issue here, as suggested in huggingface/transformers#9958.
I'll just copy-paste it; here it goes:

The tokenizer is slow when adding new tokens even with the Fast class:

from transformers import GPT2Config, TFGPT2LMHeadModel, GPT2TokenizerFast, GPT2Tokenizer

# Maybe this url for the files:
# https://huggingface.co/transformers/v3.1.0/_modules/transformers/tokenization_gpt2.html
paths = dict()
paths["tokenizer"] = "whatever/is/the/path/to/pretrained/vocab.json/merges.txt"

# They have to be sorted in reverse by length, otherwise the tokens aren't
newtokens = range(0, 20000)
newtokens = list(newtokens)
newtokens.sort(reverse=True)
newtokens = ["new_" + str(x) for x in newtokens]

# loading tokenizer from the saved model path
tokenizers = dict()
tokenizers["fast"] = GPT2TokenizerFast.from_pretrained(paths["tokenizer"])
tokenizers["fast_custom"] = GPT2TokenizerFast.from_pretrained(paths["tokenizer"])
tokenizers["slow_custom"] = GPT2Tokenizer.from_pretrained(paths["tokenizer"])
tokenizers["slow"] = GPT2Tokenizer.from_pretrained(paths["tokenizer"])

# add the special tokens to each tokenizer
for k in tokenizers:
    tokenizers[k].add_special_tokens({
        "eos_token": "</s>",
        "bos_token": "<s>",
        "unk_token": "<unk>",
        "pad_token": "<pad>",
        "mask_token": "<mask>"
    })

# Add new vocab
# https://huggingface.co/transformers/v2.11.0/main_classes/tokenizer.html
# https://github.com/deepset-ai/FARM/issues/157
for k in tokenizers:
    if "custom" in k:
        print(k)
        print("Vocab length before:", len(tokenizers[k].get_vocab()))
        tokenizers[k].add_tokens(newtokens)
        print("Vocab length after:", len(tokenizers[k].get_vocab()))

# creating the configurations from which the model can be made
config = GPT2Config(
  vocab_size=len(tokenizers["fast_custom"]),
  bos_token_id=tokenizers["fast_custom"].bos_token_id,
  eos_token_id=tokenizers["fast_custom"].eos_token_id
)

# creating the model
# https://huggingface.co/transformers/_modules/transformers/configuration_gpt2.html
model = TFGPT2LMHeadModel(config)

# Differences when tokenising the text...
text = "this is a sentence containing new_200"
for k,v in tokenizers.items():
    print(k, v.tokenize(text))

and then profiling the speed in Jupyter:

for k in tokenizers:
    print(k)
    %timeit tokenizers[k].tokenize(text)

Any ideas why this may be happening? I understand that I'm increasing the vocab size by ~20% and that may slow things down, but in this code there's a 1000-fold difference in speed. That doesn't seem right?

Just a note: it's crucial to add that many new tokens. I'm not considering reducing the number of new tokens.
Many thanks!


n1t0 commented Feb 3, 2021

Hi @davidnarganes.

This is due to the fact that these added tokens must be dealt with beforehand, because we can't add them to the initial vocabulary. So this part has nothing to do with the classic tokenization algorithm, and it doesn't scale the same way.
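
For intuition only, here is a rough sketch (not the actual tokenizers implementation, just an illustration) of why a large added-token list hurts: a pre-tokenization pass that splits the input on every added token scales with the number of added tokens, independently of the BPE step itself:

# Toy illustration (NOT the real implementation): splitting the input on every
# added token before running the normal algorithm costs O(number of added tokens)
# on every call.
def split_on_added_tokens(text, added_tokens):
    segments = [text]
    for token in added_tokens:  # this loop grows with the number of added tokens
        new_segments = []
        for seg in segments:
            if token not in seg:
                new_segments.append(seg)
                continue
            parts = seg.split(token)
            for i, part in enumerate(parts):
                if part:
                    new_segments.append(part)
                if i < len(parts) - 1:
                    new_segments.append(token)  # keep the added token whole
        segments = new_segments
    return segments

# With 20k added tokens, every encode call pays this scan even when nothing matches.
print(split_on_added_tokens("this is a sentence containing new_200", ["new_200"]))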

davidnarganes (Author) commented:

I found a workaround by manually adding the new tokens to the vocab.json file and adding the steps that generate these new tokens to the merges.txt file, following the byte pair encoding format.
It works for me and it's efficient.
Is there any better way to do it?
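
For reference, the kind of file manipulation such a workaround involves might look roughly like this (hypothetical paths; the merge rules that actually build each new token out of existing pieces have to be worked out per token, which is the hard part and is left empty here):

import json

vocab_path = "vocab.json"    # hypothetical paths, adjust to your tokenizer files
merges_path = "merges.txt"

new_tokens = [f"new_{x}" for x in range(20000)]

# 1) Append the new tokens to vocab.json with fresh ids.
with open(vocab_path, encoding="utf-8") as f:
    vocab = json.load(f)
next_id = max(vocab.values()) + 1
for tok in new_tokens:
    if tok not in vocab:
        vocab[tok] = next_id
        next_id += 1
with open(vocab_path, "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False)

# 2) Append merge rules so BPE can actually produce the new tokens. Each line is
#    "<left> <right>" where both pieces already exist in the vocab and concatenate
#    to the new token; deriving these pairs is tokenizer-specific.
merge_pairs = []  # must be computed per token, not guessed
with open(merges_path, "a", encoding="utf-8") as f:
    for left, right in merge_pairs:
        f.write(f"{left} {right}\n")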


n1t0 commented Feb 3, 2021

Interesting, can you give more information on the process you followed to update the merges.txt file?


n1t0 commented Feb 11, 2021

@davidnarganes Did this keep working as you expected? You piqued my curiosity last time, and I'd love to hear more about this!


rdemorais commented Apr 12, 2021

I'm also looking forward to hearing about it. I've added new tokens and now the time to process 1.5 GB of documents jumped from 5 min to 8 h.

Update:

I added new tokens using add_tokens and, after that, called model.resize_token_embeddings(len(tokenizer)). I also saved both the tokenizer and the model.

The difference is that I deleted the added_tokens.json file and manually appended the new tokens to the vocab file.

I don't know if what I did is the way to go, but it is working now.

manugarri commented:

@rdemorais @davidnarganes would you mind sharing a guide on the steps you took? We are facing the same issue: adding a single custom token to BertTokenizer makes tokenizing go from a few minutes to many, many hours (we had to stop it manually).


n1t0 commented Apr 16, 2021

@manugarri Can you share a way to reproduce? Adding a single custom token should definitely not have this kind of impact.


rdemorais commented Apr 17, 2021

@manugarri I believe that you should consider the following:

  1. How you will define the scope of the new tokens. To do that, I followed this article.
  2. Then, you should add the new tokens to the model.

Regardless of how you choose to get to the new token list, the steps below are what I did, but be aware that I'm not sure it is the way to go. On the other hand, it works for me, and when I say "works" I mean that I was able to train the model using MLM and use the result in a pipeline to guess the masked word. Maybe I should run the MLM for more epochs to get a good set of weights for the new vocab, but that is another story.

A. To add new tokens, I just merged the list generated by following the article. Here is the code I used:

# Compare the new token list against the tokenizer's existing vocabulary.
old_vocab = list(tokenizer.get_vocab().keys())
new_vocab = list(new_tokens)
idx_old_vocab_list = list()
same_tokens_list = list()
different_tokens_list = list()

for idx_new, w in enumerate(new_vocab):
    try:
        idx_old = old_vocab.index(w)
    except ValueError:
        idx_old = -1
    if idx_old >= 0:
        # Token already exists in the vocabulary.
        idx_old_vocab_list.append(idx_old)
        same_tokens_list.append((w, idx_new))
    else:
        # Genuinely new token.
        different_tokens_list.append((w, idx_new))

B. Then, I added the new vocab to the tokenizer and resized the embedding matrix:

## add new tokens to the existing vocabulary (only those not already present)
logger.info("[ BEFORE ] tokenizer vocab size: %d", len(tokenizer))
added_tokens = tokenizer.add_tokens([i[0] for i in different_tokens_list])
logger.info("[ AFTER ] tokenizer vocab size: %d", len(tokenizer))
logger.info("added_tokens: %d", added_tokens)

# resize the embeddings matrix of the model
model.resize_token_embeddings(len(tokenizer))

model.save_pretrained('./cn-v1')
tokenizer.save_pretrained('./cn-v1')

C. Here is the catch. After save_pretrained, you will find an added_tokens.json file in the folder. You will also see that vocab.txt remains the same.

When you then use the model with the new tokens, tokenization time explodes, as you are seeing. I believe this happens because the tokenizer tries to use added_tokens.json.

What I did, and once again I stress that I don't know if it is the correct way, was the following:

  1. I went into the added_tokens.json and:
import json

# print the added tokens in id order so they can be appended to vocab.txt in the same order
with open('added_tokens.json') as json_file:
    added_tokens = json.load(json_file)
    sorted_tokens = dict(sorted(added_tokens.items(), key=lambda item: item[1]))
    for tk in sorted_tokens.keys():
        print(tk)
  2. I manually copied the printed results and pasted them right after the last line of vocab.txt. You could also append them with Python code; see the sketch after this list.
  3. Then I removed the added_tokens.json file.
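
A scripted version of those three steps might look like this (a sketch only, assuming the save_pretrained directory used above):

import json
import os

saved_dir = "./cn-v1"  # the save_pretrained directory from step B

added_path = os.path.join(saved_dir, "added_tokens.json")
vocab_path = os.path.join(saved_dir, "vocab.txt")

# 1. Read added_tokens.json and sort the tokens by their assigned ids.
with open(added_path, encoding="utf-8") as f:
    added_tokens = json.load(f)
sorted_tokens = [tok for tok, _ in sorted(added_tokens.items(), key=lambda item: item[1])]

# 2. Append them, in id order, to the end of vocab.txt.
with open(vocab_path, "a", encoding="utf-8") as f:
    for tok in sorted_tokens:
        f.write(tok + "\n")

# 3. Remove added_tokens.json so the tokenizer no longer special-cases these tokens.
os.remove(added_path)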

After all this I was able to use the tokenizer:

CN_MODEL = 'cn-v1'

config = AutoConfig.from_pretrained(CN_MODEL)

tokenizer = AutoTokenizer.from_pretrained(CN_MODEL)
model = AutoModelForMaskedLM.from_pretrained(
  CN_MODEL,
  config=config
)
tokenizer('teste para diabetes e tamnbém para haloperidol')
{'input_ids': [101, 3515, 221, 30809, 122, 316, 22287, 22285, 22295, 312, 221, 607, 326, 1840, 243, 22290, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

The token with id 30809 is a new token.

The base model I used was bert-base-uncased.

Hope it can help.

manugarri commented:

@rdemorais I think that is exactly what is happening. Will give your "hack" a try :)

rdemorais commented:

let me know.


lacls commented May 6, 2021

Hi,
thanks for your great work!
I'm still facing the same issue, following the exact setup in the readme.

Error:
ValueError: Incorrect dictionary format, expected '<token> <cnt> [flags]'
def add_from_file(self, f):
    """
    Loads a pre-existing dictionary from a text file and adds its symbols to this instance.
    """
    if isinstance(f, str):
        try:
            with open(f, "r", encoding="utf-8") as fd:
                self.add_from_file(fd)
        except FileNotFoundError as fnfe:
            raise fnfe
        except UnicodeError:
            raise Exception(f"Incorrect encoding detected in {f}, please rebuild the dataset")
        return

    lines = f.readlines()
    for lineTmp in lines:
        line = lineTmp.strip()
        idx = line.rfind(" ")
        if idx == -1:
            raise ValueError("Incorrect dictionary format, expected '<token> <cnt>'")
        word = line[:idx]
        self.encoder[word] = len(self.encoder)
I just forced it to run by setting line = word and commenting out the lines that aren't needed. If you have any other approach, please share; I would really appreciate it.
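
For reference, a tolerant loader along the lines of what you describe might look like this (a hypothetical helper, not part of transformers; it falls back to treating the whole line as the token when there is no trailing count):

def add_tokens_from_vocab_file(path, encoder):
    """Sketch: load a token-per-line (optionally 'token count') file into 'encoder'."""
    with open(path, "r", encoding="utf-8") as fd:
        for raw in fd:
            line = raw.strip()
            if not line:
                continue
            idx = line.rfind(" ")
            # No "<cnt>" field: use the whole line as the token.
            word = line if idx == -1 else line[:idx]
            if word not in encoder:
                encoder[word] = len(encoder)
    return encoder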

rdemorais commented:

@lacls vocab.txt is just a list of tokens like this:

loquiação
esplenomegalia
fisiologicas
ortopedia
gastrite
bx
t.
inflamatório
branda

Your new token list is supposed to be appended at the end of the file.

Nevertheless, try to confirm whether you really need every symbol from each line added to vocab.txt. The way you wrote the code, everything will end up in there, including punctuation, stop words and so on.
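
A small sketch of that kind of filtering before anything is appended to vocab.txt (the stop-word list and the length threshold are arbitrary choices for illustration):

import string

stop_words = {"de", "para", "e", "o", "a"}  # example Portuguese stop words

def keep_token(token: str) -> bool:
    # Drop very short strings, stop words and punctuation-only entries.
    if len(token) < 3:
        return False
    if token.lower() in stop_words:
        return False
    if all(ch in string.punctuation for ch in token):
        return False
    return True

candidate_tokens = ["esplenomegalia", "t.", "de", "bx", "gastrite"]
print([t for t in candidate_tokens if keep_token(t)])  # ['esplenomegalia', 'gastrite']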

manugarri commented:

@rdemorais it worked!


lacls commented May 15, 2021

@rdemorais it worked, after all, thanks!


raphaelsty commented Aug 14, 2021

Hi, unless I'm mistaken, the solution that @rdemorais proposed effectively empties the unique_no_split_tokens attribute of the tokenizer. This attribute contains the list of tokens added to the tokenizer. Each time we encode a string, the tokenizer checks whether the added tokens occur in the string via the split_on_tokens method, to avoid cutting the new tokens into subword units (that's why it is very slow).
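
The speed difference comes down to membership tests: "token in list" is a linear scan over all added tokens, while "token in dict" (or a set) is a hash lookup. A tiny illustration of the gap, independent of tokenizers (numbers will vary by machine):

import timeit

added_as_list = [f"new_{i}" for i in range(20000)]
added_as_dict = {tok: None for tok in added_as_list}

# Worst case: the probed string is not an added token, so the list scans all 20k entries.
probe = "not_an_added_token"
print("list:", timeit.timeit(lambda: probe in added_as_list, number=1000))
print("dict:", timeit.timeit(lambda: probe in added_as_dict, number=1000))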

Replacing the tokenizer.unique_no_split_tokens attribute with a dictionary speeds up the processing time but this operation is still very expensive compared to the standard version of the tokenizer (without new tokens):

# Faster than the HuggingFace way but slower than the hacky way
# Still encode the entity as a single token ✅
>>> tokens = ['star wars episode vi: return of the jedi']
>>> tokenizer.add_tokens(tokens) 
>>> tokenizer.unique_no_split_tokens = {token: None for token in tokenizer.unique_no_split_tokens}
>>> model.resize_token_embeddings(len(tokenizer))
>>> tokenizer.tokenize('star wars episode vi: return of the jedi')
['star wars episode vi: return of the jedi']

The solution from @rdemorais is not perfect, because some of the words you add to the tokenizer will simply never be used, but for the specific problem I'm working on it's a good compromise, since the solution with the dictionary remains very slow.

>>> tokens = ['star wars episode vi: return of the jedi']
>>> tokenizer.save_pretrained(".")

>>> with open("vocab.txt", "a", encoding="utf-8") as tokenizer_vocab:
...    for token in tokens:
...        tokenizer_vocab.write(f"{token}\n")

>>> tokenizer = tokenizer.from_pretrained(".")
>>> model.resize_token_embeddings(len(tokenizer))
>>> tokenizer.tokenize('star wars episode vi: return of the jedi')
['star', 'wars', 'episode', 'vi', ':', 'return', 'of', 'the', 'jedi']

rdemorais commented:

Hello @raphaelsty, thank you for spending time on this matter. I believe it is a good way to generate discussion around the topic. Everybody benefits.

After struggling with this problem, I moved on to creating the actual model I was looking for. I figured out that breaking every term into individual words, and only adding the ones that really need to be added as new tokens, helped the downstream task behave as intended.

I believe it is not the job of vocab.txt to hold business logic. For instance, 'star wars episode vi: return of the jedi' is task dependent, but ['star', 'wars', 'episode', 'vi', ':', 'return', 'of', 'the', 'jedi'] is not. You can use the words from the second approach to train another model and leverage previous work.
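
A minimal sketch of that preprocessing step, splitting compound terms into individual words and only adding the ones the tokenizer does not already know (the terms and model name here are illustrative):

from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Task-specific compound terms ...
compound_terms = ["star wars episode vi: return of the jedi"]

# ... broken into individual, task-independent words (punctuation handling is ignored here).
existing_vocab = set(tokenizer.get_vocab())
candidate_words = {word for term in compound_terms for word in term.split()}
new_words = sorted(candidate_words - existing_vocab)

added = tokenizer.add_tokens(new_words)
model.resize_token_embeddings(len(tokenizer))
print(f"added {added} new word-level tokens")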

In the end, I managed to create something like this:

[Screenshot, 2021-08-16: a clinical sentence in Portuguese with entity spans tagged AUSENTE / PRESENTE]

Translating: no new syncope episodes. Maintains dyspnea on minimal exertion, under spontaneous ventilation with O2 support per CN at 2 L/min with a good respiratory pattern.

The words AUSENTE and PRESENTE mean absent and present. This is assertion detection built using the same approach we are talking about.

The thing is: note that I was able to get compound terms by training the NER accordingly. The MLM with new tokens is the underlying tech.

But, in the end, I'm still not sure it was really necessary to add more tokens... which leads me to that meme:

manugarri commented:

Nice example @rdemorais. When you mention:

> But, in the end, I'm still not sure it was really necessary to add more tokens...

At my company we are adding additional tokens that don't really have any semantic meaning (they are group identifiers), and I've noticed model performance on sequence classification improving significantly after adding those 'custom' tokens.

rdemorais commented:

Good to know, @manugarri.

The sequence classification is actually working well, maybe thanks to the new tokens you included. Fantastic.


FeryET commented Aug 20, 2021

I have the same issue with fine-tuning GPT2. Is there no easier way to fix this? Going into the vocab.json file and manipulating it manually seems like the worst kind of hack for long-term maintenance.


Narsil commented Aug 23, 2021

All the slowness described here is most likely linked to the "slow" (pure Python) tokenizers within transformers, not the Rust ones.

I opened a PR describing the problem and a proposed fix. The proposed fix, however, includes a breaking change (it should be small, but still), so it might not land (or only much later in the future).

For the Rust version, can people confirm the slowness? The regexp used in the Rust version should be basically the same as the one in the proposed pure Python PR.

huggingface/transformers#13220

manugarri commented:

@Narsil I can confirm the slowness happens with BertTokenizerFast as well.


Narsil commented Aug 23, 2021

@manugarri can you provide a test script demoing the slowness? In the test provided by @davidnarganes there's a slowdown, but it's definitely not on the same level as the "slow" one.

Edit: and everything touching unique_no_split_tokens, as mentioned in previous answers, definitely concerns the "slow" tokenizers.

manugarri commented:

@Narsil I think I misunderstood your previous comment. You meant testing the speed increase of the 'hack' versus the standard Hugging Face tokenizer add_tokens?

The hacky (file-removing) method shared by @rdemorais (@davidnarganes, it seems like you described the same thing but did not provide a sample implementation) does indeed work very fast with BertTokenizerFast. Is that what you wanted to find out?


Narsil commented Aug 24, 2021

No, the hack (touching unique_no_split_tokens) absolutely should NOT have any influence, as it is simply not used by fast tokenizers (at least it shouldn't be).

Benchmark script:

import datetime
from transformers import GPT2Tokenizer, GPT2TokenizerFast


# They have to be sorted in reverse by length, otherwise the tokens aren't
newtokens = range(0, 20000)
newtokens = list(newtokens)
newtokens.sort(reverse=True)
newtokens = [f"new_{x}" for x in newtokens]

slow = GPT2Tokenizer.from_pretrained("gpt2")
fast = GPT2TokenizerFast.from_pretrained("gpt2")

# Add new vocab
slow_custom = GPT2Tokenizer.from_pretrained("gpt2")
slow_custom.add_tokens(newtokens)
fast_custom = GPT2TokenizerFast.from_pretrained("gpt2")
fast_custom.add_tokens(newtokens)

# Differences when tokenising the text...
text = "this is a sentence containing new_200"
for k, tokenizer in {"slow": slow, "slow_custom": slow_custom, "fast": fast, "fast_custom": fast_custom}.items():
    start = datetime.datetime.now()
    print(tokenizer.tokenize(text))
    print(k, datetime.datetime.now() - start)

results:

['this', 'Ġis', 'Ġa', 'Ġsentence', 'Ġcontaining', 'Ġnew', '_', '200']
slow 0:00:00.000339
['this', 'Ġis', 'Ġa', 'Ġsentence', 'Ġcontaining', 'new_200']
slow_custom 0:00:00.001379
['this', 'Ġis', 'Ġa', 'Ġsentence', 'Ġcontaining', 'Ġnew', '_', '200']
fast 0:00:00.004188
['this', 'Ġis', 'Ġa', 'Ġsentence', 'Ġcontaining', 'Ġ', 'new_200']
fast_custom 0:00:00.059223
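
Since a single call measured with datetime is noisy, the same comparison can be made more robust by averaging many calls, for example (a sketch on top of the script above):

import timeit

for name, tok in {"slow": slow, "slow_custom": slow_custom, "fast": fast, "fast_custom": fast_custom}.items():
    per_call = timeit.timeit(lambda: tok.tokenize(text), number=1000) / 1000
    print(f"{name}: {per_call * 1e6:.1f} microseconds per call")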


mug2mag commented Oct 28, 2021

@rdemorais @Narsil @manugarri
Hi, thanks a lot for the solution example. But I tried what you posted and it did not work when I load from the save_pretrained directory. I don't know why.
The code is as follows:

  tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
  print("before adding new tokens:", tokenizer.tokenize('star wars episode vi: return of the jedi'))

  model = AutoModelForMaskedLM.from_pretrained('bert-base-uncased')
  tokenizer.add_tokens(['star wars episode vi: return of the jedi'])
  model.resize_token_embeddings(len(tokenizer))
  print("after adding new tokens:", tokenizer.tokenize('star wars episode vi: return of the jedi'))

  model.save_pretrained('./tf_test')
  tokenizer.save_pretrained('./tf_test')

result:

before adding new tokens: ['star', 'wars', 'episode', 'vi', ':', 'return', 'of', 'the', 'jedi']
after adding new tokens: ['star wars episode vi: return of the jedi']

But after I removed the added_tokens.json file, deleted the code above, and added the new token to vocab.txt, the tokenizer did not work when I loaded it from './tf_test'.

  CN_MODEL = './tf_test'
  tokenizer = AutoTokenizer.from_pretrained(CN_MODEL)
  print('loading from CN_MODEL result:', tokenizer.tokenize('star wars episode vi: return of the jedi'))

result:

loading from CN_MODEL result: ['star', 'wars', 'episode', 'vi', ':', 'return', 'of', 'the', 'jedi']

I don't know if I did something wrong. Please let me know if you find anything inappropriate in the code. Any reply would be much appreciated!

rdemorais commented:

Hi @mug2mag,

When you call tokenizer.add_tokens(['star wars episode vi: return of the jedi']), you are basically saying you want the entire phrase to be a single token, but the tokenizer splits text into tokens using spaces (among other things). That's why it is behaving like that.

Consider adding the individual words as entries to your vocabulary; that way your model will be able to understand variations of text containing the Star Wars terms.

Also, before save_pretrained, take a look at the config.json file, find the vocab_size field and write down the value. Apply the worst hack ever made, and check vocab_size again. Tell me what you see.
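
For the check suggested above, a small sketch (assuming the files were saved to ./tf_test as in the earlier snippet):

import json
from transformers import AutoTokenizer

saved_dir = "./tf_test"
tokenizer = AutoTokenizer.from_pretrained(saved_dir)

with open(f"{saved_dir}/config.json", encoding="utf-8") as f:
    config_vocab_size = json.load(f)["vocab_size"]

# If vocab.txt was edited by hand after save_pretrained, these two numbers can disagree,
# and the embedding matrix size recorded in config.json goes stale.
print("len(tokenizer):   ", len(tokenizer))
print("config vocab_size:", config_vocab_size)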
