Description
I've tried a few ways to use tokenizers with DataSet and DataBundle objects, but without success.
Basically, I'm trying to do the following:
1. Initialize a DataSet object `ds` with data.
2. Initialize a DataBundle object with the DataSet object `ds`.
3. Define a tokenizer.
4. Associate the tokenizer with a field in the DataSet or DataBundle object.
5. See the tokenizer take effect when batches of data are extracted from the DataSet object.
from fastNLP import DataSet
from fastNLP import Vocabulary
from fastNLP.io import DataBundle
from functools import partial
from transformers import GPT2Tokenizer
data = {'idx': [0, 1, 2],
        'sentence': ["This is an apple .", "I like apples .", "Apples are good for our health ."],
        'words': [['This', 'is', 'an', 'apple', '.'],
                  ['I', 'like', 'apples', '.'],
                  ['Apples', 'are', 'good', 'for', 'our', 'health', '.']],
        'num': [5, 4, 7]}
dataset = DataSet(data) # Initialize DataSet object with data.
data_bundle = DataBundle(datasets={'train': dataset})  # Initialize DataBundle object with the DataSet.
# Define tokenizer:
tokenizer_in = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer_in.pad_token, tokenizer_in.padding_side = tokenizer_in.eos_token, 'left'
tokenizer_in_fn = partial(tokenizer_in.encode_plus, padding=True, return_attention_mask=True)
print(tokenizer_in_fn) # ensure that settings are as expected.
# Associate tokenizer with field:
data_bundle.apply_field_more(tokenizer_in_fn, field_name='sentence', progress_bar='tqdm')
print(dataset[0:3])
# Gives:
# +-----+----------------+----------------+-----+----------------+--------------------+--------+
# | idx | sentence | words | num | input_ids | attention_mask | length |
# +-----+----------------+----------------+-----+----------------+--------------------+--------+
# | 0 | This is an ... | ['This', 'i... | 5 | [1212, 318,... | [1, 1, 1, 1, 1]... | 5 |
# | 1 | I like appl... | ['I', 'like... | 4 | [40, 588, 2... | [1, 1, 1, 1] | 4 |
# | 2 | Apples are ... | ['Apples', ... | 7 | [4677, 829,... | [1, 1, 1, 1, 1,... | 8 |
# +-----+----------------+----------------+-----+----------------+--------------------+--------+
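# Note the `length` column above: the three rows tokenize to 5, 4 and 8 ids
# respectively, so `input_ids` is a ragged (variable-length) field.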
# Try to obtain batch data:
ds = data_bundle.get_dataset('train')
print(ds['sentence'].get([0,1,2])) # okay, no problem.
print(ds['input_ids'].get([0,1,2])) # throws exception.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[66], line 1
----> 1 print(ds['input_ids'].get([0,1,2]))
File ~/condaenvs/bbt-hf425-py310/lib/python3.10/site-packages/fastNLP/core/dataset/field.py:77, in FieldArray.get(self, indices)
75 except BaseException as e:
76 raise e
---> 77 return np.array(contents)
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (3,) + inhomogeneous part.
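If I read the traceback right, the error comes from numpy rather than from the tokenizer itself: `FieldArray.get` calls `np.array(contents)` on the three `input_ids` lists, which have lengths 5, 4 and 8, and numpy 1.24 removed the old fallback of silently building a ragged object array (older versions emitted a VisibleDeprecationWarning instead). A minimal sketch reproducing just the numpy behavior, with dummy ids standing in for the real token ids:

import numpy as np

# Dummy id lists of lengths 5, 4 and 8, mirroring the table above.
contents = [[1, 2, 3, 4, 5],
            [1, 2, 3, 4],
            [1, 2, 3, 4, 5, 6, 7, 8]]

try:
    np.array(contents)  # raises on numpy >= 1.24
except ValueError as e:
    print(e)  # "setting an array element with a sequence. ..."

print(np.array(contents, dtype=object).shape)  # an explicit object dtype still works: (3,)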
I've also tried associating the tokenizer with the DataSet object directly, but the same exception is raised:
# ds is initialized DataSet object
ds.apply_field_more(tokenizer_in_fn, field_name='sentence', progress_bar='tqdm')
ds['input_ids'].get([0,1,2]) # throws same exception as above.
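A workaround I'd expect to sidestep the error (a sketch, not verified against this fastNLP version): pad every sentence to a fixed length so that all `input_ids` rows are homogeneous. Note that `padding=True` pads to the longest sequence within a single `encode_plus` call, which is a no-op when each call receives one sentence; `padding='max_length'` with an explicit `max_length` (16 here is an arbitrary choice) should produce equal-length rows:

# Sketch: fixed-length padding so FieldArray.get can build a regular array.
tokenizer_in_fn = partial(tokenizer_in.encode_plus,
                          padding='max_length', max_length=16,
                          truncation=True, return_attention_mask=True)
ds.apply_field_more(tokenizer_in_fn, field_name='sentence', progress_bar='tqdm')
print(ds['input_ids'].get([0, 1, 2]))  # now a regular (3, 16) array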
Environment: Python 3.10, numpy 1.24.1. (Are there other Python packages whose versions I need to be careful about?)