
DeBERTa V3 Fast Tokenizer #14712

Open
ikergarcia1996 opened this issue Dec 10, 2021 · 10 comments · May be fixed by #14928

Comments

ikergarcia1996 commented Dec 10, 2021

🚀 Feature request

Fast Tokenizer for DeBERTA-V3 and mDeBERTa-V3

Motivation

DeBERTa V3 is an improved version of DeBERTa. With the V3 release, the authors also published a multilingual model, "mDeBERTa-base", that outperforms XLM-R-base. However, DeBERTa V3 currently lacks a FastTokenizer implementation, which makes it impossible to use with some of the example scripts (they require a FastTokenizer).
DeBERTa-V1 and DeBERTa-V2 both have a FastTokenizer implementation, so it would be great to have one for DeBERTa-V3.

bluesky333 commented Dec 22, 2021

From my understanding, DeBERTa-V3 has the same tokenizer as V2.
The problem is that the Transformers DeBERTa-V2 implementation does not have a FastTokenizer,
so what we really need is a V2 FastTokenizer.
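For reference, a quick check (a minimal sketch, assuming the microsoft/deberta-v3-base checkpoint is available locally or via the hub) shows that the V3 checkpoints resolve to the V2 slow tokenizer class:

from transformers import AutoTokenizer

# Load the slow tokenizer shipped with a DeBERTa-V3 checkpoint.
tok = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base", use_fast=False)
print(type(tok).__name__)  # expected: DebertaV2Tokenizer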

LysandreJik (Member) commented Dec 22, 2021

That would be a nice community contribution! I'll add the Good First Issue label, and I'm happy to guide anyone from the community who wants to add a DeBERTa v2 fast tokenizer, together with @SaulLu!

stefan-it (Collaborator) commented Dec 22, 2021

I've looked into it, and the only "problem" I see is the custom SPMTokenizer implementation (for the slow tokenizer). It basically wraps spm, but adds some of its own preprocessing steps...

SaulLu (Contributor) commented Dec 22, 2021

Indeed, it would be great to have fast versions of the tokenizers for these models:

  • deberta-v2
  • deberta-v3
  • mdeberta-v3

As @stefan-it pointed out, I think we are indeed missing some information needed to build a rust version of the spm tokenizer used. I had a quick look at the papers corresponding to each of the models (https://arxiv.org/abs/2006.03654 and https://arxiv.org/abs/2111.09543) and unfortunately, the tokenizer training is not described in them.

As far as I know, there is no way to reverse-engineer the spm binaries, but maybe I'm wrong! So, at the very least, I think we need to know the command the authors used to train their tokenizer. I saw that several people asked for it in issues on their repo (issue 1, issue 2), but the answer is not there. So, the only options I see are 1) to chime in on the existing issues to indicate that we would also be interested in this information, or 2) to contact the authors directly. Indeed, if we know the command that was used to train this tokenizer, we should be able to assemble the right tokenizer components to build a fast tokenizer!

Moreover, concerning mdeberta-v3, their paper mentions that:

We denote the model as mDeBERTa-base. We use the same SentencePiece vocabulary as mT5 which has 250k tokens.

But unfortunately, the binaries spiece.model for mT5 and spm.model for mdeberta-v3 are not identical.
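One way to verify this (a minimal sketch; the local paths are placeholders and assume both files have been downloaded from the mT5 and mDeBERTa-v3 model repos) is to load the two binaries with sentencepiece and compare their hashes and piece inventories:

import hashlib
import sentencepiece as spm

# Placeholder local paths for the downloaded mT5 and mDeBERTa-v3 sentencepiece models.
paths = {"mT5": "spiece.model", "mDeBERTa-v3": "spm.model"}

for name, path in paths.items():
    sp = spm.SentencePieceProcessor()
    sp.Load(path)
    digest = hashlib.sha256(open(path, "rb").read()).hexdigest()[:12]
    print(f"{name}: {sp.GetPieceSize()} pieces, sha256 prefix {digest}")
    print("  first pieces:", [sp.IdToPiece(i) for i in range(5)])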

LysandreJik (Member) commented Dec 22, 2021

Gently removing the "Good First Issue" label while we work through these questions. Also pinging the author, @BigBird01 :)

stefan-it (Collaborator) commented Dec 22, 2021

There's also a vocab mismatch:

mDeBERTa: 251000
mT5: 250112

stefan-it (Collaborator) commented Dec 22, 2021

I think the most convenient solution would be to reuse the T5 Fast Tokenizer (with the mDeBERTa vocab file), but there are a lot of details to be checked:

The special token mapping is different. T5:

eos_token="</s>",
unk_token="<unk>",
pad_token="<pad>",
extra_ids=100,

DeBERTa:

do_lower_case=False,
split_by_punct=False,
bos_token="[CLS]",
eos_token="[SEP]",
unk_token="[UNK]",
sep_token="[SEP]",
pad_token="[PAD]",
cls_token="[CLS]",
mask_token="[MASK]",

🤔
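As an illustration of that last point, a fast tokenizer built on a converted backend would need to be instantiated with DeBERTa's special-token set rather than T5's. A minimal sketch (the tokenizer.json path is a placeholder for an already-converted backend; there is no such DeBERTa fast tokenizer in the library yet):

from transformers import PreTrainedTokenizerFast

# Hypothetical: wrap an already-converted backend with DeBERTa's special tokens
# instead of T5's (</s>, <unk>, <pad>, extra_ids).
tok = PreTrainedTokenizerFast(
    tokenizer_file="tokenizer.json",  # placeholder path
    bos_token="[CLS]",
    eos_token="[SEP]",
    unk_token="[UNK]",
    sep_token="[SEP]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    mask_token="[MASK]",
)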

BigBird01 (Contributor) commented Dec 22, 2021

In the DeBERTa tokenizer, we remapped [CLS]=>1, [PAD]=>0, [UNK]=>3, [SEP]=>2 while keeping the other pieces unchanged.

I checked T5Converter; I think it should work to directly use T5Converter to convert the DeBERTa v2/v3 tokenizer to a fast tokenizer, except for the post_processor part:

tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:0",
    pair="[CLS]:0 $A:0 [SEP]:0 $B:0 [SEP]:0",
    special_tokens=[
        ("[CLS]", self.original_tokenizer.convert_tokens_to_ids("[CLS]")),
        ("[SEP]", self.original_tokenizer.convert_tokens_to_ids("[SEP]")),
    ],
)

Thanks!
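Putting the suggestion above together, a converter along these lines could work. This is only a sketch: it subclasses SpmConverter from transformers.convert_slow_tokenizer (which exists in the library) and overrides only the post_processor, as suggested; the DebertaV2Converter class itself does not exist yet, and the [CLS]/[SEP]/[PAD]/[UNK] id remapping may need extra handling in vocab() if it is not already reflected in spm.model:

from tokenizers import processors
from transformers.convert_slow_tokenizer import SpmConverter


class DebertaV2Converter(SpmConverter):
    # Sketch only: reuse the generic sentencepiece conversion and override
    # the post-processor so [CLS]/[SEP] are added as in the slow tokenizer.
    def post_processor(self):
        return processors.TemplateProcessing(
            single="[CLS]:0 $A:0 [SEP]:0",
            pair="[CLS]:0 $A:0 [SEP]:0 $B:0 [SEP]:0",
            special_tokens=[
                ("[CLS]", self.original_tokenizer.convert_tokens_to_ids("[CLS]")),
                ("[SEP]", self.original_tokenizer.convert_tokens_to_ids("[SEP]")),
            ],
        )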

SaulLu (Contributor) commented Dec 23, 2021

Thank you so much for your super fast response @BigBird01 😍 ! It's a great help to us!

It looks like all the information is there for this to be a good first issue 🎊 !

So I'm putting the label back on and would be really happy to provide a todo list and guide whoever would like to take care of this new feature! ☺️

mingboiz commented Dec 25, 2021

Hi all, thanks for all the information provided. I have written a converter class for DeBERTa-v2, and testing it manually, the tokenization looks correct. However, I need guidance on how I could write a DeBERTav2TokenizerFast class so I can add tests! I'd really appreciate any guidance, thank you!
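For anyone picking this up: fast tokenizer classes in the library generally follow a common pattern, so a rough skeleton might look like the following (names, defaults and file names are illustrative, not the final implementation; the real class would also need build_inputs_with_special_tokens, create_token_type_ids_from_sequences, save_vocabulary and tests mirroring the slow tokenizer's):

from transformers import DebertaV2Tokenizer, PreTrainedTokenizerFast


class DebertaV2TokenizerFast(PreTrainedTokenizerFast):
    # Rough skeleton only: file names and defaults mirror the slow tokenizer.
    vocab_files_names = {"vocab_file": "spm.model", "tokenizer_file": "tokenizer.json"}
    slow_tokenizer_class = DebertaV2Tokenizer

    def __init__(
        self,
        vocab_file=None,
        tokenizer_file=None,
        do_lower_case=False,
        split_by_punct=False,
        bos_token="[CLS]",
        eos_token="[SEP]",
        unk_token="[UNK]",
        sep_token="[SEP]",
        pad_token="[PAD]",
        cls_token="[CLS]",
        mask_token="[MASK]",
        **kwargs,
    ):
        super().__init__(
            vocab_file,
            tokenizer_file=tokenizer_file,
            do_lower_case=do_lower_case,
            split_by_punct=split_by_punct,
            bos_token=bos_token,
            eos_token=eos_token,
            unk_token=unk_token,
            sep_token=sep_token,
            pad_token=pad_token,
            cls_token=cls_token,
            mask_token=mask_token,
            **kwargs,
        )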

mingboiz added a commit to mingboiz/transformers that referenced this issue Dec 25, 2021
alcinos linked a pull request that will close this issue Dec 26, 2021