bert

🚀 Feature request

Add a better error message to HubertForCTC and Wav2Vec2ForCTC when labels are larger than the vocab size.

Motivation

Following this issue: huggingface/transformers#12264, it is clear that an error should be raised if any of the labels are > self.config.vocab_size; otherwise silent errors can sneak into the training script.
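A minimal sketch of the kind of check being requested, in plain Python rather than the actual model code. The function name `check_labels_within_vocab` and the `ignore_index=-100` padding convention are assumptions for illustration. The sketch rejects ids >= vocab_size (valid ids are assumed to run from 0 to vocab_size - 1), which is slightly stricter than the "> vocab_size" wording above.

```python
def check_labels_within_vocab(labels, vocab_size, ignore_index=-100):
    """Raise ValueError if any label id falls outside [0, vocab_size).

    `labels` is a list of label-id lists (one per example).
    `ignore_index` marks padding positions and is skipped (assumed convention).
    """
    bad = sorted(
        {
            label
            for row in labels
            for label in row
            if label != ignore_index and not 0 <= label < vocab_size
        }
    )
    if bad:
        raise ValueError(
            f"Label ids {bad} are out of range for vocab_size={vocab_size}; "
            "valid ids are 0 to vocab_size - 1."
        )


# Passes silently: all ids are in range, padding is ignored.
check_labels_within_vocab([[1, 2, -100], [0, 31]], vocab_size=32)
```

Failing early with the offending ids in the message makes the problem visible before the loss is computed, instead of surfacing later as a confusing training failure.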

So w

The Split class accepts a SplitDelimiterBehavior, which is really useful. Punctuation, however, always uses SplitDelimiterBehavior::Isolated (and Whitespace, on the other hand, behaves like SplitDelimiterBehavior::Removed).

impl PreTokenizer for Punctuation {
    fn pre_tokenize(&self, pretokenized: &mut PreTokenizedString) -> Result<()> {
        pretokenized.split(|_, s| s.spl
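
To illustrate the difference between the two behaviors mentioned above, here is a small pure-Python sketch. It does not use the tokenizers library. The function name `split_on_punctuation` and the use of `string.punctuation` as the delimiter set are assumptions made for this example.

```python
import string


def split_on_punctuation(text, behavior="isolated"):
    """Split `text` at ASCII punctuation characters.

    behavior="isolated": each punctuation character becomes its own piece.
    behavior="removed":  punctuation characters are dropped from the output.
    """
    if behavior not in ("isolated", "removed"):
        raise ValueError(f"unsupported behavior: {behavior!r}")

    pieces, current = [], ""
    for ch in text:
        if ch in string.punctuation:
            # Flush the text collected before the delimiter.
            if current:
                pieces.append(current)
                current = ""
            if behavior == "isolated":
                pieces.append(ch)
        else:
            current += ch
    if current:
        pieces.append(current)
    return pieces


print(split_on_punctuation("Hello, world!", "isolated"))
# ['Hello', ',', ' world', '!']
print(split_on_punctuation("Hello, world!", "removed"))
# ['Hello', ' world']
```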

The paper states:

Instead, the training data generator chooses 15% of tokens at random, e.g., in the sentence my
dog is hairy it chooses hairy.

This means that 15% of the tokens are guaranteed to be chosen.

In https://github.com/codertimo/BERT-pytorch/blob/master/bert_pytorch/dataset/dataset.py#L68, however,
each token independently has a 15% chance of going through the follow-up procedure.
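
The two readings can be contrasted with a short sketch. The function names, the seeding, and the rounding rule (`max(1, round(n * ratio))`) are assumptions for illustration; neither the paper nor the repository is quoted here.

```python
import random


def choose_exact(num_tokens, ratio=0.15, seed=0):
    """Pick a fixed number of positions: about `ratio` of the tokens, at least one."""
    rng = random.Random(seed)
    k = max(1, round(num_tokens * ratio))
    return sorted(rng.sample(range(num_tokens), k))


def choose_per_token(num_tokens, ratio=0.15, seed=0):
    """Flip an independent coin for each position with probability `ratio`."""
    rng = random.Random(seed)
    return [i for i in range(num_tokens) if rng.random() < ratio]


# With 100 tokens, the first scheme always selects exactly 15 positions.
print(len(choose_exact(100, seed=1)), len(choose_exact(100, seed=2)))
# The second selects 15 on average, but the count varies from draw to draw.
print([len(choose_per_token(100, seed=s)) for s in range(5)])
```

Under the per-token scheme a short sentence can end up with no selected token at all, whereas the fixed-count scheme always selects at least one.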

A web crawler was added by #775, but the test cases are missing.

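1. It would be good to improve the low-level API documentation, e.g. for encoder and decoder, to make it easier to reproduce papers.
2. It would be good to maintain a side-by-side reference table of PyTorch and Paddle APIs, as complete as possible.
3. Could the error logs be more precise? Some errors raised from the dataloader are hard to locate.
4. Could gradient accumulation be supported, to further increase the effective batch size?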
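
For item 4, the idea behind gradient accumulation can be sketched in plain Python, without any Paddle or PyTorch API. The toy model y = w * x with a mean-squared-error loss and the helper names are assumptions for illustration. Averaging the gradients of equally sized micro-batches gives the same gradient as one large batch.

```python
import math


def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate gradients over equally sized micro-batches."""
    if len(xs) % micro_batch_size != 0:
        raise ValueError("batch must split into equal micro-batches")
    steps = len(xs) // micro_batch_size
    total = 0.0
    for i in range(steps):
        lo, hi = i * micro_batch_size, (i + 1) * micro_batch_size
        # Scale each micro-batch gradient so that the sum is an average.
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / steps
    return total


xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
full = mse_grad(0.5, xs, ys)
accumulated = accumulated_grad(0.5, xs, ys, micro_batch_size=2)
print(math.isclose(full, accumulated))  # True
```

In a real training loop the optimizer step would be taken only after the last micro-batch, so memory use stays at the micro-batch level while the update matches the larger batch.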

Here are 1,672 public repositories matching this topic...

huggingface / transformers

hanxiao / bert-as-service

graykode / nlp-tutorial

brightmart / nlp_chinese_corpus

ymcui / Chinese-BERT-wwm

huggingface / tokenizers

codertimo / BERT-pytorch

PaddlePaddle / ERNIE

macanv / BERT-BiLSTM-CRF-NER

brightmart / albert_zh

jessevig / bertviz

bentrevett / pytorch-sentiment-analysis

IntelLabs / nlp-architect

shibing624 / pycorrector

JohnSnowLabs / spark-nlp

CyberZHG / keras-bert

asyml / texar

deepset-ai / haystack

BrikerMan / Kashgari

CLUEbenchmark / CLUE

Separius / awesome-sentence-embedding

brightmart / roberta_zh

Jiakui / awesome-bert

utterworks / fast-bert

PaddlePaddle / PaddleNLP

dbiir / UER-py

ChineseGLUE / ChineseGLUE

github / CodeSearchNet

bytedance / lightseq

msgi / nlp-journey
