
DeBERTa V3 Fast Tokenizer #14712

Open
ikergarcia1996 opened this issue Dec 10, 2021 · 10 comments · May be fixed by #14928

Comments

ikergarcia1996 commented Dec 10, 2021

🚀 Feature request

Fast Tokenizer for DeBERTA-V3 and mDeBERTa-V3

Motivation

DeBERTa V3 is an improved version of DeBERTa. With the V3 release, the authors also published a multilingual model, "mDeBERTa-base", that outperforms XLM-R-base. However, DeBERTa V3 currently lacks a FastTokenizer implementation, which makes it impossible to use with some of the example scripts (they require a FastTokenizer).
DeBERTa-V1 and DeBERTa-V2 both have a FastTokenizer implementation, so it would be great to have one for DeBERTa-V3.

bluesky333 commented Dec 22, 2021

From my understanding, DeBERTa-V3 has the same tokenizer as V2.
The problem is that the Transformers DeBERTa-V2 implementation does not have a FastTokenizer,
so what we really need is a V2 FastTokenizer.
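For reference, a quick check (a minimal sketch, assuming the microsoft/deberta-v3-base checkpoint is available locally or via the hub) shows that the V3 checkpoints resolve to the V2 slow tokenizer class:

from transformers import AutoTokenizer

# Load the slow tokenizer shipped with a DeBERTa-V3 checkpoint.
tok = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base", use_fast=False)
print(type(tok).__name__)  # expected: DebertaV2Tokenizer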

LysandreJik (Member) commented Dec 22, 2021

That would be a nice community contribution! I'll add the Good First Issue label, and I'm happy to guide anyone from the community who wants to add a DeBERTa v2 fast tokenizer, together with @SaulLu!

stefan-it (Collaborator) commented Dec 22, 2021

I've looked into it, and the only "problem" I see is the custom SPMTokenizer implementation (for the slow tokenizer). It basically wraps spm, but adds some of its own preprocessing steps...

SaulLu (Contributor) commented Dec 22, 2021

Indeed, it would be great to have fast versions of the tokenizers for these models:

  • deberta-v2
  • deberta-v3
  • mdeberta-v3

As @stefan-it pointed out, I think we are indeed missing some information needed to build a rust version of the spm tokenizer used. I had a quick look at the papers corresponding to each of the models (https://arxiv.org/abs/2006.03654 and https://arxiv.org/abs/2111.09543) and unfortunately, the tokenizer training is not described in them.

As far as I know, there is no way to reverse-engineer the spm binaries, but maybe I'm wrong! So, at the very least, I think we need to know the command the authors used to train their tokenizer. I saw that several people asked for it in issues on their repo (issue 1, issue 2), but the answer is not there. So, the only options I see are 1) to chime in on the existing issues to indicate that we would also be interested in this information, or 2) to contact the authors directly. Indeed, if we know the command that was used to train this tokenizer, we should be able to assemble the right tokenizer components to build a fast tokenizer!

Moreover, concerning mdeberta-v3, their paper mentions that:

We denote the model as mDeBERTa-base. We use the same SentencePiece vocabulary as mT5 which has 250k tokens.

But unfortunately, the binaries spiece.model for mT5 and spm.model for mdeberta-v3 are not identical.
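One way to verify this (a minimal sketch; the local paths are placeholders and assume both files have been downloaded from the mT5 and mDeBERTa-v3 model repos) is to load the two binaries with sentencepiece and compare their hashes and piece inventories:

import hashlib
import sentencepiece as spm

# Placeholder local paths for the downloaded mT5 and mDeBERTa-v3 sentencepiece models.
paths = {"mT5": "spiece.model", "mDeBERTa-v3": "spm.model"}

for name, path in paths.items():
    sp = spm.SentencePieceProcessor()
    sp.Load(path)
    digest = hashlib.sha256(open(path, "rb").read()).hexdigest()[:12]
    print(f"{name}: {sp.GetPieceSize()} pieces, sha256 prefix {digest}")
    print("  first pieces:", [sp.IdToPiece(i) for i in range(5)])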

LysandreJik (Member) commented Dec 22, 2021

Gently removing the "Good First Issue" label while we work through these questions. Also pinging the author, @BigBird01 :)

stefan-it (Collaborator) commented Dec 22, 2021

There's also a vocab mismatch:

mDeBERTa: 251000
mT5: 250112

stefan-it (Collaborator) commented Dec 22, 2021

I think the most convenient solution would be to reuse the T5 Fast Tokenizer (with the mDeBERTa vocab file), but there are a lot of details to be checked:

The special token mapping is different. T5:

eos_token="</s>",
unk_token="<unk>",
pad_token="<pad>",
extra_ids=100,

DeBERTa:

do_lower_case=False,
split_by_punct=False,
bos_token="[CLS]",
eos_token="[SEP]",
unk_token="[UNK]",
sep_token="[SEP]",
pad_token="[PAD]",
cls_token="[CLS]",
mask_token="[MASK]",

🤔
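As an illustration of that last point, a fast tokenizer built on a converted backend would need to be instantiated with DeBERTa's special-token set rather than T5's. A minimal sketch (the tokenizer.json path is a placeholder for an already-converted backend; there is no such DeBERTa fast tokenizer in the library yet):

from transformers import PreTrainedTokenizerFast

# Hypothetical: wrap an already-converted backend with DeBERTa's special tokens
# instead of T5's (</s>, <unk>, <pad>, extra_ids).
tok = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",  # placeholder path
    bos_token="[CLS]",
    eos_token="[SEP]",
    unk_token="[UNK]",
    sep_token="[SEP]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    mask_token="[MASK]",
)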

BigBird01 (Contributor) commented Dec 22, 2021

In the DeBERTa tokenizer, we remapped [CLS]=>1, [PAD]=>0, [UNK]=>3, [SEP]=>2 while keeping the other pieces unchanged.

I checked T5Converter; I think it should work to directly use T5Converter to convert the DeBERTa v2/v3 tokenizer to a fast tokenizer, except for the post_processor part:

tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:0",
    pair="[CLS]:0 $A:0 [SEP]:0 $B:0 [SEP]:0",
    special_tokens=[
        ("[CLS]", self.original_tokenizer.convert_tokens_to_ids("[CLS]")),
        ("[SEP]", self.original_tokenizer.convert_tokens_to_ids("[SEP]")),
    ],
)

Thanks!
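Putting the suggestion above together, a converter along these lines could work. This is only a sketch: it subclasses SpmConverter from transformers.convert_slow_tokenizer (which exists in the library) and overrides only the post_processor, as suggested; the DebertaV2Converter class itself does not exist yet, and the [CLS]/[SEP]/[PAD]/[UNK] id remapping may need extra handling in vocab() if it is not already reflected in spm.model:

from tokenizers import processors
from transformers.convert_slow_tokenizer import SpmConverter


class DebertaV2Converter(SpmConverter):
    # Sketch only: reuse the generic sentencepiece conversion and override
    # the post-processor so [CLS]/[SEP] are added as in the slow tokenizer.
    def post_processor(self):
        return processors.TemplateProcessing(
            single="[CLS]:0 $A:0 [SEP]:0",
            pair="[CLS]:0 $A:0 [SEP]:0 $B:0 [SEP]:0",
            special_tokens=[
                ("[CLS]", self.original_tokenizer.convert_tokens_to_ids("[CLS]")),
                ("[SEP]", self.original_tokenizer.convert_tokens_to_ids("[SEP]")),
            ],
        )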

SaulLu (Contributor) commented Dec 23, 2021

Thank you so much for your super fast response @BigBird01 😍 ! It's a great help to us!

It looks like all the information is there for this to be a good first issue 🎊 !

So I'm putting the label back on and would be really happy to provide a todo list and guide whoever would like to take care of this new feature! ☺️

mingboiz commented Dec 25, 2021

Hi all, thanks for all the information provided. I have written a converter class for DeBERTa-v2, and testing it manually, the tokenization looks correct. However, I need guidance on how I could write a DeBERTav2TokenizerFast class so I can add tests! I'd really appreciate any guidance, thank you!
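For anyone picking this up: fast tokenizer classes in the library generally follow a common pattern, so a rough skeleton might look like the following (names, defaults and file names are illustrative, not the final implementation; the real class would also need build_inputs_with_special_tokens, create_token_type_ids_from_sequences, save_vocabulary and tests mirroring the slow tokenizer's):

from transformers import DebertaV2Tokenizer, PreTrainedTokenizerFast


class DebertaV2TokenizerFast(PreTrainedTokenizerFast):
    # Rough skeleton only: file names and defaults mirror the slow tokenizer.
    vocab_files_names = {"vocab_file": "spm.model", "tokenizer_file": "tokenizer.json"}
    slow_tokenizer_class = DebertaV2Tokenizer

    def __init__(
        self,
        vocab_file=None,
        tokenizer_file=None,
        do_lower_case=False,
        split_by_punct=False,
        bos_token="[CLS]",
        eos_token="[SEP]",
        unk_token="[UNK]",
        sep_token="[SEP]",
        pad_token="[PAD]",
        cls_token="[CLS]",
        mask_token="[MASK]",
        **kwargs,
    ):
        super().__init__(
            vocab_file,
            tokenizer_file=tokenizer_file,
            do_lower_case=do_lower_case,
            split_by_punct=split_by_punct,
            bos_token=bos_token,
            eos_token=eos_token,
            unk_token=unk_token,
            sep_token=sep_token,
            pad_token=pad_token,
            cls_token=cls_token,
            mask_token=mask_token,
            **kwargs,
        )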

mingboiz added a commit to mingboiz/transformers that referenced this issue Dec 25, 2021
alcinos linked a pull request that will close this issue Dec 26, 2021