Add warning message if model uses input_ids that include padding tokens, but no attention_mask is provided. #16136

Open
patrickvonplaten opened this issue Mar 14, 2022 · 3 comments

Comments

patrickvonplaten (Member) commented Mar 14, 2022

First good issue

A common error is that a user forwards a batched tensor of input_ids that includes padding tokens, e.g. (illustratively, with tokens shown as strings): input_ids = torch.tensor([["hello", "this", "is", "a", "long", "string"], ["hello", "<pad>", "<pad>", "<pad>", "<pad>", "<pad>"]])

In this case, the attention_mask should be provided as well; otherwise the output hidden_states will be computed incorrectly. This is quite a common silent error IMO.
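To illustrate the silent error, here is a minimal sketch with toy one-dimensional "embeddings" and a mean-pooling step (both invented for demonstration; real transformer layers are affected the same way through self-attention): without a mask, padding positions contribute to the result.

```python
# Toy illustration: pooling token vectors with and without an attention mask.
# The numbers and the pooling function are made up for demonstration only.

def mean_pool(token_vectors, mask=None):
    """Average token vectors, optionally ignoring masked-out positions."""
    if mask is None:
        mask = [1] * len(token_vectors)
    kept = [v for v, m in zip(token_vectors, mask) if m]
    return sum(kept) / len(kept)

# One real token (value 4.0) followed by three padding positions (value 0.0).
vectors = [4.0, 0.0, 0.0, 0.0]
attention_mask = [1, 0, 0, 0]

print(mean_pool(vectors))                  # 1.0 -- padding silently dilutes the result
print(mean_pool(vectors, attention_mask))  # 4.0 -- padding correctly ignored
```

Because no exception is raised in either case, the user only notices the problem (if ever) through degraded model outputs.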

With @LysandreJik @sgugger, we have decided not to automatically create an attention_mask that masks out the padding tokens in this case, for the reasons explained here: #15479 (comment). However, as pointed out in #15479, we should IMO at least display a warning, since this error happens a lot.

As a good first issue, one could add such a warning to the BertModel first, which would go something like:

if attention_mask is None and (input_ids == pad_token_id).any():
    logger.warning("display nice warning here....")
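Fleshed out as a standalone sketch (hypothetical helper name; plain Python lists stand in for tensors so the example is self-contained, and pad_token_id may be None for configs that do not define one):

```python
import logging

logger = logging.getLogger(__name__)

def warn_if_padding_and_no_attention_mask(input_ids, attention_mask, pad_token_id):
    """Warn when input_ids contain padding tokens but no attention_mask is given.

    Hypothetical sketch: `input_ids` is a nested list of token ids here rather
    than a torch.Tensor. Returns True if a warning was emitted.
    """
    # Nothing to check if a mask was supplied or the pad token id is unknown.
    if attention_mask is not None or pad_token_id is None:
        return False
    if any(pad_token_id in row for row in input_ids):
        logger.warning(
            "input_ids contains padding tokens but no attention_mask was "
            "provided. The padded positions will be attended to, which usually "
            "produces wrong results. Pass the attention_mask returned by the "
            "tokenizer."
        )
        return True
    return False

# Example: the second sequence is padded with token id 0.
batch = [[101, 2023, 2003, 102], [101, 102, 0, 0]]
warn_if_padding_and_no_attention_mask(batch, attention_mask=None, pad_token_id=0)
```

Returning a boolean makes the helper easy to unit-test; a real implementation in the model's forward pass would likely also want to warn only once to avoid log spam.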

What do you think @sgugger @LysandreJik ?

sgugger (Member) commented Mar 14, 2022

Models usually don't know the right pad token ID, as pointed out in the issue (I'm also not sure that community-contributed models, or models not as heavily used as BERT, have the right pad token ID in their configs), so I'm not in favor of this. Plus, checking the inputs at each forward pass would hurt performance.

I agree that it's a common error, and it would make a very nice addition to the troubleshooting guide IMO, but I'm not sure we can add anything in the library to properly warn users without hurting performance or having a lot of false alarms.

patrickvonplaten (Member, Author) commented Mar 17, 2022

Hmm, I think we can be pretty confident that self.config.pad_token_id inside the model is the correct padding token. I agree that performance would suffer a bit here. I think putting it in the troubleshooting guide is a good idea, cc @stevhliu

stevhliu (Member) commented Mar 17, 2022

Yay more content for the troubleshooting guide! I'll work on a PR for this 👍
