LayoutLMv2Processor does not accept the XLMRobertaTokenizerFast #13972

Open
2 tasks done
JStumpp opened this issue Oct 12, 2021 · 5 comments

Comments

@JStumpp JStumpp commented Oct 12, 2021

Environment info

  • transformers version: 4.11.3
  • Platform: Linux-4.19.128-microsoft-standard-x86_64-with-glibc2.2.5
  • Python version: 3.8.12
  • PyTorch version (GPU?): 1.9.1+cu102 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help

@NielsRogge

Information

Model I am using: LayoutXLM

The problem arises when using:

The task I am working on is:

  • an official task: SequenceClassification

To reproduce

Steps to reproduce the behavior:

When we replace the layoutlmv2 tokenizer in cell 8 of https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/RVL-CDIP/Fine_tuning_LayoutLMv2ForSequenceClassification_on_RVL_CDIP.ipynb

from transformers import LayoutLMv2FeatureExtractor, LayoutLMv2Tokenizer, LayoutLMv2Processor
feature_extractor = LayoutLMv2FeatureExtractor()
tokenizer = LayoutLMv2Tokenizer.from_pretrained("microsoft/layoutlmv2-base-uncased")
processor = LayoutLMv2Processor(feature_extractor, tokenizer)

with the layoutxlm tokenizer as described in https://huggingface.co/transformers/model_doc/layoutxlm.html

from transformers import LayoutLMv2FeatureExtractor, LayoutLMv2Tokenizer, LayoutLMv2Processor, AutoTokenizer
feature_extractor = LayoutLMv2FeatureExtractor()
tokenizer = AutoTokenizer.from_pretrained('microsoft/layoutxlm-base')
processor = LayoutLMv2Processor(feature_extractor, tokenizer)

the following error occurs

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_3433/3030379235.py in <module>
      5 tokenizer = AutoTokenizer.from_pretrained('microsoft/layoutxlm-base')
      6 #tokenizer = LayoutLMv2Tokenizer.from_pretrained("microsoft/layoutlmv2-base-uncased")
----> 7 processor = LayoutLMv2Processor(feature_extractor, tokenizer)

~/.cache/pypoetry/virtualenvs/stp-experiment0-RgVp7VCN-py3.8/lib/python3.8/site-packages/transformers/models/layoutlmv2/processing_layoutlmv2.py in __init__(self, feature_extractor, tokenizer)
     54             )
     55         if not isinstance(tokenizer, (LayoutLMv2Tokenizer, LayoutLMv2TokenizerFast)):
---> 56             raise ValueError(
     57                 f"`tokenizer` has to be of type {LayoutLMv2Tokenizer.__class__} or {LayoutLMv2TokenizerFast.__class__}, but is {type(tokenizer)}"
     58             )

ValueError: `tokenizer` has to be of type <class 'type'> or <class 'type'>, but is <class 'transformers.models.xlm_roberta.tokenization_xlm_roberta_fast.XLMRobertaTokenizerFast'>

It looks like the LayoutLMv2Processor does not accept the XLMRobertaTokenizerFast.
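The traceback shows two separate things: the `isinstance` check rejects the tokenizer, and the error message prints `<class 'type'>` instead of the expected class names. A minimal sketch of both effects follows. The two classes are empty stand-ins defined here for illustration, not the real transformers classes, so the snippet runs without the library installed.

```python
# Stand-in classes (assumptions for illustration, NOT the transformers classes).
class LayoutLMv2Tokenizer:
    pass


class XLMRobertaTokenizerFast:
    pass


tokenizer = XLMRobertaTokenizerFast()

# The processor's guard is an isinstance check against a fixed tuple of types,
# so any tokenizer class outside that tuple is rejected.
accepted = isinstance(tokenizer, (LayoutLMv2Tokenizer,))

# `SomeClass.__class__` is the metaclass of the class (normally `type`),
# which is why the message reads "<class 'type'>".
buggy_name = f"{LayoutLMv2Tokenizer.__class__}"

# Formatting the class itself, or its __name__, gives the intended text.
clearer_name = f"{LayoutLMv2Tokenizer.__name__}"

print(accepted)      # False
print(buggy_name)    # <class 'type'>
print(clearer_name)  # LayoutLMv2Tokenizer
```

So the uninformative message is a formatting slip in the f-string, separate from the actual limitation that only the two LayoutLMv2 tokenizer classes pass the check.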

Expected behavior

That the LayoutLMv2Processor accepts the XLMRobertaTokenizerFast.

Contributor

@NielsRogge NielsRogge commented Oct 12, 2021

LayoutLMv2Processor currently only supports LayoutLMv2Tokenizer/LayoutLMv2TokenizerFast. It would be a good first issue to add support for a new LayoutXLMTokenizerFast, which is based on XLMRoBERTa and takes into account the bounding box and word label inputs.

@kingyiusuen kingyiusuen commented Oct 12, 2021

Hi @NielsRogge, I'd like to take a shot at this!

Contributor

@NielsRogge NielsRogge commented Oct 13, 2021

Great! So one would need to add tokenization_layoutxlm.py and tokenization_layoutxlm_fast.py to the LayoutLMv2 folder. These should be near-identical copies of tokenization_xlm_roberta.py and tokenization_xlm_roberta_fast.py (found here), respectively, but with added support for boxes and word_labels inputs (see tokenization_layoutlmv2.py and tokenization_layoutlmv2_fast.py, respectively, for how these are implemented).
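To make "added support for boxes and word_labels inputs" concrete, here is a rough sketch of the alignment step such a tokenizer has to perform: every subword inherits the bounding box of its word, and only the first subword of a word keeps the word's label. Everything in it is an assumption for illustration: `toy_subword_split` is a hypothetical stand-in for the SentencePiece model, `encode_words_with_boxes` is not a transformers function, and the special-token boxes and the ignore label of -100 are values chosen for the example rather than confirmed defaults.

```python
from typing import Callable, List, Optional, Tuple

Box = List[int]

# Assumed placeholder values for this sketch.
CLS_BOX: Box = [0, 0, 0, 0]
SEP_BOX: Box = [1000, 1000, 1000, 1000]
IGNORE_LABEL = -100


def toy_subword_split(word: str) -> List[str]:
    """Hypothetical splitter: cut a word into chunks of at most 3 characters."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]


def encode_words_with_boxes(
    words: List[str],
    boxes: List[Box],
    word_labels: Optional[List[int]] = None,
    split: Callable[[str], List[str]] = toy_subword_split,
) -> Tuple[List[str], List[Box], Optional[List[int]]]:
    """Expand word-level boxes and labels to token level."""
    if len(words) != len(boxes):
        raise ValueError("need exactly one box per word")
    if word_labels is not None and len(word_labels) != len(words):
        raise ValueError("need exactly one label per word")

    tokens, token_boxes, labels = ["<s>"], [CLS_BOX], [IGNORE_LABEL]
    for i, (word, box) in enumerate(zip(words, boxes)):
        pieces = split(word)
        if not pieces:  # skip empty words so all lists stay aligned
            continue
        tokens.extend(pieces)
        # Every subword of a word shares that word's box.
        token_boxes.extend([box] * len(pieces))
        if word_labels is not None:
            # Label only the first subword; mask the remaining ones.
            labels.extend([word_labels[i]] + [IGNORE_LABEL] * (len(pieces) - 1))
    tokens.append("</s>")
    token_boxes.append(SEP_BOX)
    labels.append(IGNORE_LABEL)
    return tokens, token_boxes, labels if word_labels is not None else None


tokens, token_boxes, labels = encode_words_with_boxes(
    ["Invoice", "42"], [[10, 10, 80, 20], [90, 10, 110, 20]], word_labels=[1, 2]
)
print(tokens)  # ['<s>', 'Inv', 'oic', 'e', '42', '</s>']
print(labels)  # [-100, 1, -100, -100, 2, -100]
```

A real implementation would also need to handle padding, truncation, and text pairs, which this sketch leaves out.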

@kingyiusuen kingyiusuen commented Oct 14, 2021

> Great! So one would need to add tokenization_layoutxlm.py and tokenization_layoutxlm_fast.py to the LayoutLMv2 folder. These should be near-identical copies of tokenization_xlm_roberta.py and tokenization_xlm_roberta_fast.py (found here), respectively, but with added support for boxes and word_labels inputs (see tokenization_layoutlmv2.py and tokenization_layoutlmv2_fast.py, respectively, for how these are implemented).

Thanks. Any advice on how I should go about writing the unit tests?

Contributor

@NielsRogge NielsRogge commented Oct 14, 2021

For the unit tests, I would define test_tokenization_layoutxlm.py and test_tokenization_layoutxlm_fast.py based on the corresponding tests of LayoutLMv2.
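As a rough idea of the kind of invariant such tests could check, the sketch below uses `unittest` with a hypothetical stand-in encoder. `toy_encode`, its placeholder token ids, and the test names are all assumptions for illustration; real tests would call the new LayoutXLM tokenizer classes instead and follow the structure of the existing LayoutLMv2 tests.

```python
import unittest


def toy_encode(words, boxes):
    """Hypothetical stand-in for a tokenizer call: one token per character."""
    input_ids, bbox = [0], [[0, 0, 0, 0]]  # placeholder start token and box
    for word, box in zip(words, boxes):
        for ch in word:
            input_ids.append(ord(ch))  # placeholder ids, not a real vocabulary
            bbox.append(box)
    input_ids.append(2)  # placeholder end token
    bbox.append([1000, 1000, 1000, 1000])
    return {"input_ids": input_ids, "bbox": bbox}


class BoxAlignmentTest(unittest.TestCase):
    words = ["hello", "world"]
    boxes = [[1, 2, 3, 4], [5, 6, 7, 8]]

    def test_one_box_per_token(self):
        # Each token id must have exactly one bounding box.
        enc = toy_encode(self.words, self.boxes)
        self.assertEqual(len(enc["input_ids"]), len(enc["bbox"]))

    def test_boxes_are_well_formed(self):
        # Each box has four coordinates in the 0-1000 range.
        enc = toy_encode(self.words, self.boxes)
        for box in enc["bbox"]:
            self.assertEqual(len(box), 4)
            self.assertTrue(all(0 <= coord <= 1000 for coord in box))


def run_tests():
    """Run the test case programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(BoxAlignmentTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


result = run_tests()
```

Running the same assertions against both the slow and the fast tokenizer, and comparing their outputs with each other, would be a natural extension.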
