How to use tokenizers with DataSet and DataBundle? #449

Open

@davidleejy

I've tried a few ways to use tokenizers with DataSet and DataBundle objects, but none have worked so far.

Basically, I'm just trying to do the following:

1. Initialize a DataSet object `ds` with data.
2. Initialize a DataBundle object with the DataSet object `ds`.
3. Define a tokenizer.
4. Associate the tokenizer with a field in the DataSet or DataBundle object.
5. Hopefully, see the tokenizer at work when batches of data are extracted from the DataSet object.

from fastNLP import DataSet
from fastNLP import Vocabulary
from fastNLP.io import DataBundle
from functools import partial
from transformers import GPT2Tokenizer

data = {'idx': [0, 1, 2],  
        'sentence':["This is an apple .", "I like apples .", "Apples are good for our health ."],
        'words': [['This', 'is', 'an', 'apple', '.'], 
                  ['I', 'like', 'apples', '.'], 
                  ['Apples', 'are', 'good', 'for', 'our', 'health', '.']],
        'num': [5, 4, 7]}

dataset = DataSet(data)    # Initialize DataSet object with data.

data_bundle = DataBundle(datasets={'train': dataset})    # Initialize DataBundle object

# Define tokenizer:
tokenizer_in = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer_in.pad_token, tokenizer_in.padding_side = tokenizer_in.eos_token, 'left'
tokenizer_in_fn = partial(tokenizer_in.encode_plus, padding=True, return_attention_mask=True)
print(tokenizer_in_fn)       # Check that the partial's settings are as expected.

# Associate tokenizer with field:
data_bundle.apply_field_more(tokenizer_in_fn, field_name='sentence', progress_bar='tqdm')

print(dataset[0:3])
# Gives:
# +-----+----------------+----------------+-----+----------------+--------------------+--------+
# | idx | sentence       | words          | num | input_ids      | attention_mask     | length |
# +-----+----------------+----------------+-----+----------------+--------------------+--------+
# | 0   | This is an ... | ['This', 'i... | 5   | [1212, 318,... | [1, 1, 1, 1, 1]... | 5      |
# | 1   | I like appl... | ['I', 'like... | 4   | [40, 588, 2... | [1, 1, 1, 1]       | 4      |
# | 2   | Apples are ... | ['Apples', ... | 7   | [4677, 829,... | [1, 1, 1, 1, 1,... | 8      |
# +-----+----------------+----------------+-----+----------------+--------------------+--------+

# Try to obtain batch data:
ds = data_bundle.get_dataset('train')
print(ds['sentence'].get([0,1,2])) # okay, no problem.
print(ds['input_ids'].get([0,1,2])) # throws exception.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[66], line 1
----> 1 print(ds['input_ids'].get([0,1,2]))

File ~/condaenvs/bbt-hf425-py310/lib/python3.10/site-packages/fastNLP/core/dataset/field.py:77, in FieldArray.get(self, indices)
     75 except BaseException as e:
     76     raise e
---> 77 return np.array(contents)

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (3,) + inhomogeneous part.
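
For what it's worth, the failure looks like plain NumPy behaviour rather than anything tokenizer-specific: since numpy 1.24, calling np.array on nested lists of unequal length raises a ValueError instead of silently building an object array. A minimal sketch (the ids below are stand-in numbers, not real GPT-2 token ids):

import numpy as np

# Ragged lists standing in for the per-sentence input_ids above
# (lengths 5, 4 and 8, matching the 'length' column).
ragged = [[1, 2, 3, 4, 5], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6, 7, 8]]
print(np.array(ragged, dtype=object))  # fine: 1-D object array holding the lists
print(np.array(ragged))                # raises the same ValueError as FieldArray.get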

I've also tried associating the tokenizer with the DataSet object directly, but the same exception is raised:

# ds is the initialized DataSet object
ds.apply_field_more(tokenizer_in_fn, field_name='sentence', progress_bar='tqdm')
ds['input_ids'].get([0,1,2])  # throws same exception as above.

Python 3.10, numpy 1.24.1 (are there other Python packages whose versions I need to be careful about?)
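
A possible workaround (a rough sketch; I'm not sure it's the intended fastNLP usage, and max_length=16 is an arbitrary value for this toy data) is to pad every sentence to a fixed length so that all input_ids rows end up the same shape:

# Hypothetical workaround: fixed-length padding makes every row the same
# length, so FieldArray can stack the rows into a homogeneous array.
tokenizer_fixed_fn = partial(tokenizer_in.encode_plus,
                             padding='max_length',
                             max_length=16,
                             truncation=True,
                             return_attention_mask=True)
ds.apply_field_more(tokenizer_fixed_fn, field_name='sentence', progress_bar='tqdm')
print(ds['input_ids'].get([0, 1, 2]))  # should now be a homogeneous (3, 16) array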
