How to use tokenizers with DataSet and DataBundle? #449

Open

@davidleejy

I've tried a few ways to use tokenizers with DataSet and DataBundle objects, but none have worked so far.

Basically, I'm just trying to do the following:

1. Initialize a DataSet object `ds` with data.
2. Initialize a DataBundle object with the DataSet object `ds`.
3. Define a tokenizer.
4. Associate the tokenizer with a field in the DataSet or DataBundle object.
5. Hopefully, see the tokenizer at work when batches of data are extracted from the DataSet object.

from fastNLP import DataSet
from fastNLP import Vocabulary
from fastNLP.io import DataBundle
from functools import partial
from transformers import GPT2Tokenizer

data = {'idx': [0, 1, 2],  
        'sentence':["This is an apple .", "I like apples .", "Apples are good for our health ."],
        'words': [['This', 'is', 'an', 'apple', '.'], 
                  ['I', 'like', 'apples', '.'], 
                  ['Apples', 'are', 'good', 'for', 'our', 'health', '.']],
        'num': [5, 4, 7]}

dataset = DataSet(data)    # Initialize DataSet object with data.

data_bundle = DataBundle(datasets={'train': dataset})    # Initialize DataBundle object

# Define tokenizer:
tokenizer_in = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer_in.pad_token, tokenizer_in.padding_side = tokenizer_in.eos_token, 'left'
tokenizer_in_fn = partial(tokenizer_in.encode_plus, padding=True, return_attention_mask=True)
print(tokenizer_in_fn)       # Check that the partial's settings are as expected.

# Associate tokenizer with field:
data_bundle.apply_field_more(tokenizer_in_fn, field_name='sentence', progress_bar='tqdm')

print(dataset[0:3])
# Gives:
# +-----+----------------+----------------+-----+----------------+--------------------+--------+
# | idx | sentence       | words          | num | input_ids      | attention_mask     | length |
# +-----+----------------+----------------+-----+----------------+--------------------+--------+
# | 0   | This is an ... | ['This', 'i... | 5   | [1212, 318,... | [1, 1, 1, 1, 1]... | 5      |
# | 1   | I like appl... | ['I', 'like... | 4   | [40, 588, 2... | [1, 1, 1, 1]       | 4      |
# | 2   | Apples are ... | ['Apples', ... | 7   | [4677, 829,... | [1, 1, 1, 1, 1,... | 8      |
# +-----+----------------+----------------+-----+----------------+--------------------+--------+

# Try to obtain batch data:
ds = data_bundle.get_dataset('train')
print(ds['sentence'].get([0,1,2])) # okay, no problem.
print(ds['input_ids'].get([0,1,2])) # throws exception.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[66], line 1
----> 1 print(ds['input_ids'].get([0,1,2]))

File ~/condaenvs/bbt-hf425-py310/lib/python3.10/site-packages/fastNLP/core/dataset/field.py:77, in FieldArray.get(self, indices)
     75 except BaseException as e:
     76     raise e
---> 77 return np.array(contents)

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (3,) + inhomogeneous part.
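
For what it's worth, the failure looks like plain NumPy behaviour rather than anything tokenizer-specific: since numpy 1.24, calling np.array on nested lists of unequal length raises a ValueError instead of silently building an object array. A minimal sketch (the ids below are stand-in numbers, not real GPT-2 token ids):

import numpy as np

# Ragged lists standing in for the per-sentence input_ids above
# (lengths 5, 4 and 8, matching the 'length' column).
ragged = [[1, 2, 3, 4, 5], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6, 7, 8]]
print(np.array(ragged, dtype=object))  # fine: 1-D object array holding the lists
print(np.array(ragged))                # raises the same ValueError as FieldArray.get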

I've also tried associating the tokenizer with the DataSet object directly, but the same exception is raised:

# ds is the initialized DataSet object
ds.apply_field_more(tokenizer_in_fn, field_name='sentence', progress_bar='tqdm')
ds['input_ids'].get([0,1,2])  # throws same exception as above.

Python 3.10, numpy 1.24.1 (are there other Python packages whose versions I need to be careful about?)
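
A possible workaround (a rough sketch; I'm not sure it's the intended fastNLP usage, and max_length=16 is an arbitrary value for this toy data) is to pad every sentence to a fixed length so that all input_ids rows end up the same shape:

# Hypothetical workaround: fixed-length padding makes every row the same
# length, so FieldArray can stack the rows into a homogeneous array.
tokenizer_fixed_fn = partial(tokenizer_in.encode_plus,
                             padding='max_length',
                             max_length=16,
                             truncation=True,
                             return_attention_mask=True)
ds.apply_field_more(tokenizer_fixed_fn, field_name='sentence', progress_bar='tqdm')
print(ds['input_ids'].get([0, 1, 2]))  # should now be a homogeneous (3, 16) array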
