DeBERTa V3 Fast Tokenizer #14712
Comments
From my understanding, DeBERTa-V3 has the same tokenizer as V2.
That would be a nice community contribution! I'll add the "Good First Issue" label.
I've looked into it, and the only "problem" I see is the custom SPMTokenizer implementation (for the slow tokenizer). It basically wraps sentencepiece, but performs some of its own preprocessing steps...
Indeed, it would be great to have fast versions of the tokenizers for these models:
As @stefan-it raised, I think we are indeed missing some information needed to build a Rust version of the spm tokenizer used. I had a quick look at the papers corresponding to each of the models (https://arxiv.org/abs/2006.03654 and https://arxiv.org/abs/2111.09543), and unfortunately the tokenizer modeling is not explained in them. As far as I know, there is no way to reverse-engineer the spm binaries, but maybe I'm wrong! So, at a minimum, I think we need to know the command the authors used to train their tokenizer. I saw that several people have asked for it in issues on their repo (issue 1, issue 2), but the answer is not there. So the only options I see are 1) to follow up on the existing issues to indicate that we would also be interested in this information, or 2) to contact the authors directly. Indeed, if we know the command that was used to train this tokenizer, we should be able to assemble the right tokenizer components to get a fast tokenizer! Moreover, concerning mdeberta-v3, in their paper they mention that:
But unfortunately the binaries (spiece.model for mT5 and spm.model for mdeberta-v3) are not identical.
Gently removing the "Good First Issue" label while we sort out these open questions. Also pinging the author @BigBird01 :)
There's also a vocab size mismatch: mDeBERTa: 251000
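To double-check this mismatch, one could compare the two sentencepiece models directly. A quick sketch along these lines (the local file names spiece.model / spm.model are just assumptions about how the downloaded files are named):

```python
import sentencepiece as spm

# Load the two sentencepiece models downloaded from the mT5 and mDeBERTa-v3 checkpoints.
mt5_sp = spm.SentencePieceProcessor()
mt5_sp.Load("spiece.model")        # mT5 vocabulary
mdeberta_sp = spm.SentencePieceProcessor()
mdeberta_sp.Load("spm.model")      # mDeBERTa-v3 vocabulary

print("mT5 pieces:     ", mt5_sp.GetPieceSize())
print("mDeBERTa pieces:", mdeberta_sp.GetPieceSize())

# Compare the first few pieces to see whether the vocabularies merely differ in size
# or are actually different vocabularies.
for i in range(10):
    print(i, mt5_sp.IdToPiece(i), mdeberta_sp.IdToPiece(i))
```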
I think the most convenient solution would be to use the T5 fast tokenizer (with the mDeBERTa vocab file), but there are a lot of details to check. The token mapping is different:
T5: transformers/src/transformers/models/t5/tokenization_t5.py, lines 113 to 116 at commit 824fd44
DeBERTa: transformers/src/transformers/models/deberta_v2/tokenization_deberta_v2.py, lines 108 to 116 at commit 824fd44
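To make that mapping difference concrete, one way to inspect it is to print the special-token ids of both slow tokenizers (the checkpoint names below are assumptions for illustration):

```python
from transformers import DebertaV2Tokenizer, T5Tokenizer

# Any mT5 and mDeBERTa-v3 checkpoints would do here.
t5_tok = T5Tokenizer.from_pretrained("google/mt5-base")
deberta_tok = DebertaV2Tokenizer.from_pretrained("microsoft/mdeberta-v3-base")

# Print each special token together with its id in the respective vocabulary.
print("mT5:     ", {t: t5_tok.convert_tokens_to_ids(t) for t in t5_tok.all_special_tokens})
print("mDeBERTa:", {t: deberta_tok.convert_tokens_to_ids(t) for t in deberta_tok.all_special_tokens})
```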
In the DeBERTa tokenizer, we remapped [CLS]=>1, [PAD]=>0, [UNK]=>3, [SEP]=>2 while keeping the other pieces unchanged. I checked T5Converter, and I think it should work to directly use T5Converter to convert the DeBERTa v2/v3 tokenizer to a fast tokenizer, except for the post_processor part: tokenizer.post_processor = processors.TemplateProcessing(...). Thanks!
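For anyone picking this up, a minimal sketch of what that post-processor could look like, assuming the [CLS]=1 / [SEP]=2 mapping quoted above and BERT-style segment ids (this is an illustration, not the exact code from the DeBERTa repo):

```python
from tokenizers import processors

# Ids taken from the mapping quoted above: [CLS]=1, [SEP]=2.
post_processor = processors.TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:0",
    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)
```

In a real converter, the ids would presumably be looked up from the slow tokenizer rather than hard-coded.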
Thank you so much for your super fast response @BigBird01! It looks like all the information is there for this to be a good first issue, so I'm putting the label back on. I would be really happy to provide a todo list and help whoever would like to take care of this new feature!
Hi all, thanks for all the information provided. I have written a converter class for DeBERTa-v2, and testing it manually, the tokenization looks correct. However, I need guidance on how to write a DebertaV2TokenizerFast class so I can add tests! I'd really appreciate any guidance, thank you!
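In case it helps, here is a rough sketch of what such a fast tokenizer class could look like, modeled on the other sentencepiece-based fast tokenizers in the library. The attribute values and defaults here are assumptions, and a real implementation would also need the build_inputs_with_special_tokens / create_token_type_ids_from_sequences helpers plus an entry in SLOW_TO_FAST_CONVERTERS:

```python
from transformers import DebertaV2Tokenizer, PreTrainedTokenizerFast


class DebertaV2TokenizerFast(PreTrainedTokenizerFast):
    """Sketch of a fast DeBERTa-v2/v3 tokenizer backed by a converted `tokenizers` object."""

    # File names are assumptions based on the checkpoints on the Hub.
    vocab_files_names = {"vocab_file": "spm.model", "tokenizer_file": "tokenizer.json"}
    slow_tokenizer_class = DebertaV2Tokenizer

    def __init__(
        self,
        vocab_file=None,
        tokenizer_file=None,
        do_lower_case=False,
        bos_token="[CLS]",
        eos_token="[SEP]",
        unk_token="[UNK]",
        sep_token="[SEP]",
        pad_token="[PAD]",
        cls_token="[CLS]",
        mask_token="[MASK]",
        **kwargs,
    ):
        # The positional vocab_file is forwarded so the base class can build the slow
        # tokenizer and convert it when no tokenizer.json is available.
        super().__init__(
            vocab_file,
            tokenizer_file=tokenizer_file,
            do_lower_case=do_lower_case,
            bos_token=bos_token,
            eos_token=eos_token,
            unk_token=unk_token,
            sep_token=sep_token,
            pad_token=pad_token,
            cls_token=cls_token,
            mask_token=mask_token,
            **kwargs,
        )
        self.do_lower_case = do_lower_case
        self.vocab_file = vocab_file
```

For tests, the existing slow-tokenizer test class could probably be extended with a rust_tokenizer_class attribute, following the pattern used by the other fast tokenizers.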
Fast Tokenizer for DeBERTa-V3 and mDeBERTa-V3
Motivation
DeBERTa V3 is an improved version of DeBERTa. With the V3 release, the authors also published a multilingual model "mDeBERTa-base" that outperforms XLM-R-base. However, DeBERTa V3 currently lacks a FastTokenizer implementation, which makes it impossible to use with some of the example scripts (they require a FastTokenizer).
DeBERTa-V1 and DeBERTa-V2 both have a FastTokenizer implementation; it would be great to have one for DeBERTa-V3.