natural-language-understanding

🚀 Feature request

Add better error message to HubertForCTC, Wav2Vec2ForCTC if labels are bigger than vocab size.

Motivation

Following this issue: huggingface/transformers#12264 it is clear that an error message should be thrown if any of the any of the labels are > self.config.vocab_size or else silent errors can sneak into the training script.

So w

The Split class accepts SplitDelimiterBehavior which is really useful. The Punctuation however always uses SplitDelimiterBehavior::Isolated (and Whitespace on the other hand behaves like SplitDelimiterBehavior::Removed).

impl PreTokenizer for Punctuation {
    fn pre_tokenize(&self, pretokenized: &mut PreTokenizedString) -> Result<()> {
        pretokenized.split(|_, s| s.spl

Description

While using tokenizers.create with the model and vocab file for a custom corpus, the code throws an error and is not able to generate the BERT vocab file

Error Message

ValueError: Mismatch vocabulary! All special tokens specified must be control tokens in the sentencepiece vocabulary.

To Reproduce

from gluonnlp.data import tokenizers
tokenizers.create('spm', model_p

natural-language-understanding

Here are 512 public repositories matching this topic...

huggingface / transformers

Add error message to Wav2Vec2 & Hubert if labels > vocab_size

🚀 Feature request

Motivation

[Performance] Tracking open Issues and PRs (pytorch transformers)

Getting time offsets of beginning and end of each word in Wav2Vec2

google-research / bert

hanxiao / bert-as-service

ludwig-ai / ludwig

microsoft / nlp-recipes

huggingface / tokenizers

Add SplitDelimiterBehavior to Punctuation constructor

dmlc / gluon-nlp

[Error Message] Improve error message in SentencepieceTokenizer when arguments are not expected.

Description

Error Message

To Reproduce

Use official MXNet batchify to implement the batchify functions

NMT Inference: Chunk overlength sequences and translate in sequence

opencog / opencog

google / sling

namisan / mt-dnn

explosion / spacy-transformers

KartikChugh / Otto

chatopera / insuranceqa-corpus-zh

MITESHPUTHRANNEU / Speech-Emotion-Analyzer

declare-lab / conv-emotion

microsoft / DeBERTa

turtlesoupy / this-word-does-not-exist

huggingface / autonlp

practical-nlp / practical-nlp

Decalogue / chat

suragnair / seqGAN

Picovoice / rhino

BotLibre / BotLibre

jayparks / tf-seq2seq

soulbliss / NLP-conference-compendium

chatopera / clause

graphbrain / graphbrain

Droidtown / ArticutAPI

gkiril / oie-resources

JohnSnowLabs / nlu

Improve this page

Add this topic to your repo