LayoutLMv2Processor does not accept the XLMRobertaTokenizerFast #13972

JStumpp · 2021-10-12T08:09:22Z

Environment info

transformers version: 4.11.3
Platform: Linux-4.19.128-microsoft-standard-x86_64-with-glibc2.2.5
Python version: 3.8.12
PyTorch version (GPU?): 1.9.1+cu102 (False)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: no
Using distributed or parallel set-up in script?: no

Who can help

@NielsRogge

Information

Model I am using: LayoutXLM

The problem arises when using:

the official example scripts: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/RVL-CDIP/Fine_tuning_LayoutLMv2ForSequenceClassification_on_RVL_CDIP.ipynb

The task I am working on is:

an official task: SequenceClassification

To reproduce

Steps to reproduce the behavior:

When we replace the layoutlmv2 tokenizer in cell 8 of https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/RVL-CDIP/Fine_tuning_LayoutLMv2ForSequenceClassification_on_RVL_CDIP.ipynb

from transformers import LayoutLMv2FeatureExtractor, LayoutLMv2Tokenizer, LayoutLMv2Processor
feature_extractor = LayoutLMv2FeatureExtractor()
tokenizer = LayoutLMv2Tokenizer.from_pretrained("microsoft/layoutlmv2-base-uncased")
processor = LayoutLMv2Processor(feature_extractor, tokenizer)

with the layoutxlm tokenizer as described in https://huggingface.co/transformers/model_doc/layoutxlm.html

from transformers import LayoutLMv2FeatureExtractor, LayoutLMv2Tokenizer, LayoutLMv2Processor, AutoTokenizer
feature_extractor = LayoutLMv2FeatureExtractor()
tokenizer = AutoTokenizer.from_pretrained('microsoft/layoutxlm-base')
processor = LayoutLMv2Processor(feature_extractor, tokenizer)

the following error occurs

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_3433/3030379235.py in <module>
      5 tokenizer = AutoTokenizer.from_pretrained('microsoft/layoutxlm-base')
      6 #tokenizer = LayoutLMv2Tokenizer.from_pretrained("microsoft/layoutlmv2-base-uncased")
----> 7 processor = LayoutLMv2Processor(feature_extractor, tokenizer)

~/.cache/pypoetry/virtualenvs/stp-experiment0-RgVp7VCN-py3.8/lib/python3.8/site-packages/transformers/models/layoutlmv2/processing_layoutlmv2.py in __init__(self, feature_extractor, tokenizer)
     54             )
     55         if not isinstance(tokenizer, (LayoutLMv2Tokenizer, LayoutLMv2TokenizerFast)):
---> 56             raise ValueError(
     57                 f"`tokenizer` has to be of type {LayoutLMv2Tokenizer.__class__} or {LayoutLMv2TokenizerFast.__class__}, but is {type(tokenizer)}"
     58             )

ValueError: `tokenizer` has to be of type <class 'type'> or <class 'type'>, but is <class 'transformers.models.xlm_roberta.tokenization_xlm_roberta_fast.XLMRobertaTokenizerFast'>

It looks like the LayoutLMv2Processor does not accept the XLMRobertaTokenizerFast.
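
As a side note, the error message above prints `<class 'type'>` twice instead of the expected tokenizer names. That happens because `SomeClass.__class__` evaluates to the metaclass (`type`), not to the class itself. The sketch below reproduces the effect with hypothetical stand-in classes (`SlowTokenizer`, `FastTokenizer` and `OtherTokenizer` are illustrative names, not part of transformers) and shows one possible way to word the message more helpfully. It is an illustration only, not the library's actual fix.

```python
class SlowTokenizer:
    """Hypothetical stand-in for LayoutLMv2Tokenizer (illustration only)."""


class FastTokenizer:
    """Hypothetical stand-in for LayoutLMv2TokenizerFast (illustration only)."""


class OtherTokenizer:
    """Hypothetical stand-in for an unsupported tokenizer such as XLMRobertaTokenizerFast."""


def message_with_dunder_class(tokenizer):
    # Mirrors the f-string pattern in the traceback: the `__class__` of a
    # class object is its metaclass, so both placeholders render as `type`.
    return (
        f"`tokenizer` has to be of type {SlowTokenizer.__class__} or "
        f"{FastTokenizer.__class__}, but is {type(tokenizer)}"
    )


def message_with_names(tokenizer):
    # Using `__name__` on the classes themselves yields readable names.
    return (
        f"`tokenizer` has to be of type {SlowTokenizer.__name__} or "
        f"{FastTokenizer.__name__}, but is {type(tokenizer).__name__}"
    )


if __name__ == "__main__":
    tok = OtherTokenizer()
    print(message_with_dunder_class(tok))
    print(message_with_names(tok))
```

The first message is uninformative about which types are accepted, while the second names them directly.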

Expected behavior

That the LayoutLMv2Processor accepts the XLMRobertaTokenizerFast.


NielsRogge · 2021-10-12T08:28:34Z

LayoutLMv2Processor currently only supports LayoutLMv2Tokenizer/LayoutLMv2TokenizerFast. It would be a good first issue to add support for a new LayoutXLMTokenizerFast, which is based on XLMRoBERTa and takes into account the bounding box and word label inputs.

kingyiusuen · 2021-10-12T19:58:58Z

Hi @NielsRogge, I'd like to take a shot at this!

NielsRogge · 2021-10-13T08:22:25Z

Great! So one would need to add tokenization_layoutxlm.py and tokenization_layoutxlm_fast.py to the LayoutLMv2 folder. These should be near-identical copies of tokenization_xlm_roberta.py and tokenization_xlm_roberta_fast.py (found here), respectively, but with added support for the boxes and word_labels inputs (you can look at tokenization_layoutlmv2.py and tokenization_layoutlmv2_fast.py, respectively, to see how these are implemented).
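
To make the "support for boxes and word_labels" part more concrete, here is a minimal, library-free sketch of the general idea: each word's bounding box is repeated for every subword piece, and only the first piece of a word keeps the word's label. Everything in it is an assumption for illustration: `toy_subword_split` is a hypothetical stand-in for a real subword tokenizer, `align_boxes_and_labels` is not a transformers function, and `-100` is used as the ignored label because it is the default `ignore_index` of PyTorch's cross-entropy loss. The actual tokenizers also handle special tokens, padding and truncation, which this sketch leaves out.

```python
from typing import Callable, List, Optional, Tuple

Box = List[int]  # [x0, y0, x1, y1]

IGNORE_INDEX = -100  # assumed "ignore" label (PyTorch cross-entropy default)


def toy_subword_split(word: str, max_len: int = 3) -> List[str]:
    """Hypothetical subword splitter: cuts a word into chunks of max_len characters."""
    return [word[i:i + max_len] for i in range(0, len(word), max_len)]


def align_boxes_and_labels(
    words: List[str],
    boxes: List[Box],
    word_labels: Optional[List[int]] = None,
    split: Callable[[str], List[str]] = toy_subword_split,
) -> Tuple[List[str], List[Box], List[int]]:
    """Expand word-level boxes and labels to subword (token) level."""
    if len(words) != len(boxes):
        raise ValueError("Expected exactly one bounding box per word")
    if word_labels is not None and len(word_labels) != len(words):
        raise ValueError("Expected exactly one label per word")

    tokens: List[str] = []
    token_boxes: List[Box] = []
    token_labels: List[int] = []
    for i, (word, box) in enumerate(zip(words, boxes)):
        pieces = split(word)
        if not pieces:
            continue  # skip words that produce no tokens
        tokens.extend(pieces)
        # Every piece of a word shares that word's bounding box.
        token_boxes.extend([box] * len(pieces))
        if word_labels is not None:
            # Only the first piece carries the label; the rest are ignored.
            token_labels.extend([word_labels[i]] + [IGNORE_INDEX] * (len(pieces) - 1))
    return tokens, token_boxes, token_labels


if __name__ == "__main__":
    toks, tok_boxes, tok_labels = align_boxes_and_labels(
        ["invoice", "no"], [[0, 0, 10, 10], [12, 0, 20, 10]], [1, 2]
    )
    print(toks, tok_boxes, tok_labels)
```

Labeling only the first subword is one design choice among several; labeling every piece is another option, depending on how the loss should treat continuation tokens.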

kingyiusuen · 2021-10-14T03:07:34Z

Thanks. Any advice on how I should go about writing the unit tests?

NielsRogge · 2021-10-14T12:29:37Z

For the unit tests, I would define test_tokenization_layoutxlm.py and test_tokenization_layoutxlm_fast.py based on the corresponding tests of LayoutLMv2.

NielsRogge added the Good First Issue label Oct 12, 2021

kingyiusuen mentioned this issue Oct 15, 2021

Add LayoutXLMTokenizer and LayoutXLMTokenizerFast #14030
