language-model

🚀 Feature request

Add a better error message to HubertForCTC and Wav2Vec2ForCTC if labels are larger than the vocab size.

Motivation

Following huggingface/transformers#12264, it is clear that an error should be raised if any of the labels are > self.config.vocab_size; otherwise silent errors can sneak into the training script.

So w
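A minimal sketch of the kind of check the request describes, in plain Python rather than the library's own code. The function name `check_labels_in_vocab` is hypothetical, and the sketch assumes label ids are zero-indexed (so an id equal to the vocab size is also rejected) and that negative ids are padding placeholders that should not be flagged.

```python
def check_labels_in_vocab(labels, vocab_size):
    """Raise ValueError if any label id cannot index a vocabulary of `vocab_size` entries.

    Hypothetical helper for illustration; not part of transformers.
    """
    # Assumption: valid ids are 0 .. vocab_size - 1. Negative ids are treated
    # as padding placeholders and are not flagged here.
    bad = sorted({label for label in labels if label >= vocab_size})
    if bad:
        raise ValueError(
            f"Label values must be < vocab_size ({vocab_size}), but found {bad}"
        )
    return labels


if __name__ == "__main__":
    check_labels_in_vocab([0, 5, 31], vocab_size=32)  # passes silently
    try:
        check_labels_in_vocab([0, 32, 40], vocab_size=32)
    except ValueError as err:
        print(err)
```

Failing early with the offending ids in the message is the point of the request: the training script stops at the bad batch instead of producing a misleading loss.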

The Split class accepts SplitDelimiterBehavior, which is really useful. Punctuation, however, always uses SplitDelimiterBehavior::Isolated (and Whitespace, on the other hand, behaves like SplitDelimiterBehavior::Removed).

impl PreTokenizer for Punctuation {
    fn pre_tokenize(&self, pretokenized: &mut PreTokenizedString) -> Result<()> {
        pretokenized.split(|_, s| s.spl
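To illustrate the difference between the two behaviors named above, here is a rough pure-Python sketch. It does not use the tokenizers library; `split_punctuation` and its `behavior` argument are hypothetical names, and only ASCII punctuation is treated as a delimiter.

```python
import string


def split_punctuation(text, behavior="isolated"):
    """Split `text` at ASCII punctuation characters.

    behavior="isolated": each punctuation character becomes its own piece.
    behavior="removed":  punctuation characters are dropped from the output.
    Hypothetical illustration only; not the tokenizers implementation.
    """
    if behavior not in ("isolated", "removed"):
        raise ValueError(f"unknown behavior: {behavior!r}")
    pieces, current = [], ""
    for ch in text:
        if ch in string.punctuation:
            # Close the piece collected so far, if any.
            if current:
                pieces.append(current)
                current = ""
            # Keep the delimiter as a separate piece only when isolating.
            if behavior == "isolated":
                pieces.append(ch)
        else:
            current += ch
    if current:
        pieces.append(current)
    return pieces


if __name__ == "__main__":
    print(split_punctuation("Hey, you!", "isolated"))
    print(split_punctuation("Hey, you!", "removed"))
```

Making the behavior a parameter, as Split already does, lets one pre-tokenizer cover both cases instead of hard-coding one of them.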

The paper says:

Instead, the training data generator chooses 15% of tokens at random, e.g., in the sentence my
dog is hairy it chooses hairy.

This means that 15% of the tokens are chosen for sure.

In https://github.com/codertimo/BERT-pytorch/blob/master/bert_pytorch/dataset/dataset.py#L68,
each token instead has a 15% chance of going through the follow-up procedure.
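The two readings can be compared with a small sketch. Both function names are hypothetical, and the rounding rule and the minimum of one selected token in the first function are assumptions made for illustration, not something stated in the paper or the repository.

```python
import random


def select_exact_fraction(tokens, fraction=0.15, rng=None):
    """Choose a fixed number of positions: about `fraction` of the sequence."""
    if rng is None:
        rng = random.Random()
    if not tokens:
        return []
    # Assumption: round to the nearest integer and select at least one token.
    k = max(1, round(len(tokens) * fraction))
    return sorted(rng.sample(range(len(tokens)), k))


def select_per_token(tokens, prob=0.15, rng=None):
    """Choose each position independently with probability `prob`."""
    if rng is None:
        rng = random.Random()
    return [i for i in range(len(tokens)) if rng.random() < prob]


if __name__ == "__main__":
    tokens = "my dog is hairy".split()
    rng = random.Random(0)
    # The first strategy always selects one of the four tokens here;
    # the second may select none, or several, depending on the draws.
    print(select_exact_fraction(tokens, rng=rng))
    print(select_per_token(tokens, rng=rng))
```

With the per-token approach the number of selected tokens varies from sequence to sequence and averages 15% only over many tokens, so a short sentence can end up with no selected token at all.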

Many users in our community have been asking to have easier ways to return the output of intermediate nodes. I can see that this could be very useful for debugging and also qualitative evaluation.

I think this feature would be very useful, though the exact design is not yet fully clear.

Issue to track tutorial requests:

Deep Learning with PyTorch: A 60 Minute Blitz - #69
Sentence Classification - #79

Here are 718 public repositories matching this topic...

huggingface / transformers

brightmart / nlp_chinese_corpus

EleutherAI / gpt-neo

huggingface / tokenizers

codertimo / BERT-pytorch

speechbrain / speechbrain

tensorflow / lingvo

CyberZHG / keras-bert

CLUEbenchmark / CLUE

deepset-ai / haystack

chiphuyen / lazynlp

zzw922cn / awesome-speech-recognition-speech-synthesis-papers

Separius / awesome-sentence-embedding

salesforce / awd-lstm-lm

NVIDIA / OpenSeq2Seq

huggingface / pytorch-openai-transformer-lm

prabhuomkar / pytorch-cpp

mihail911 / nlp-library

explosion / spacy-transformers

brightmart / bert_language_understanding

EleutherAI / gpt-neox

nlpodyssey / spago

ymcui / Chinese-ELECTRA

LiyuanLucasLiu / LM-LSTM-CRF

pykaldi / pykaldi

smilelight / lightNLP

microsoft / DeBERTa

IsaacChanghau / DL-NLP-Readings

codekansas / keras-language-modeling

SKTBrain / KoBERT
