Dataset Features
- Add concatenate_datasets for iterable datasets by @lhoestq in #4500
- Support parallelism with PyTorch DataLoader with parquet/json/csv/text/image/etc. files by @mariosasko in #4625
- Support using PCM audio files (#4323) by @YooSungHyun in #4409
- [data_files] Files disambiguation: match split names in data files if they are between separators by @lhoestq in #4633
- Support extract 7-zip compressed data files by @albertvillanova in #4672
- Support extract lz4 compressed data files by @albertvillanova in #4700
- Support metadata.jsonl from parent directories in imagefolder by @mariosasko in #4576
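As an illustration of the metadata.jsonl entry, here is a minimal layout sketch using only the Python standard library. The folder names, the caption field, and the helper write_metadata_jsonl are hypothetical. It assumes each file_name is resolved relative to the directory that holds metadata.jsonl. It only builds the folder layout and does not call the library.

```python
import json
import tempfile
from pathlib import Path


def write_metadata_jsonl(root, rows):
    """Write one JSON object per line to <root>/metadata.jsonl (hypothetical helper)."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    metadata_path = root / "metadata.jsonl"
    with metadata_path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return metadata_path


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "my_images"
    # The image sits two levels below the metadata file.
    (root / "train" / "cats").mkdir(parents=True)
    (root / "train" / "cats" / "0001.png").touch()  # empty placeholder, not a real image

    # Assumption: file_name is relative to the directory containing metadata.jsonl.
    rows = [{"file_name": "train/cats/0001.png", "caption": "a cat"}]
    path = write_metadata_jsonl(root, rows)

    # Read the file back to confirm the one-object-per-line format.
    written = [json.loads(line) for line in path.read_text(encoding="utf-8").splitlines()]
```

With real image files in place of the placeholder, a folder like this is the kind of input the imagefolder loader is meant to read.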
Dataset changes
- Update: allocine - Support streaming by @albertvillanova in #4563
- Update: multi_news - Host data on the Hub instead of Google Drive by @albertvillanova in #4585
- Update: pn_summary - Host data on the Hub instead of Google Drive by @albertvillanova in #4586
- Update: financial_phrasebank - Host data on the Hub by @albertvillanova in #4598
- Update: cfq - Support streaming by @albertvillanova in #4579
- Update: head_qa - Host data on the Hub and fix NonMatchingChecksumError by @albertvillanova in #4588
- Update: bookcorpus - Support streaming dataset by @albertvillanova in #4564
- Update: fever - Refactor and add metadata by @albertvillanova in #4503
- Update: mlsum - Support streaming dataset by @albertvillanova in #4574
- Fix: cats_vs_dogs - Update download url and improve card by @mariosasko in #4523
- Fix: conll2003 - fix empty example by @lhoestq in #4662
- Fix: WMT datasets - fix loading issue when choosing specific subsets and docs update by @khushmeeet in #4554
- Fix: xtreme - fix empty examples in dataset for bucc18 config by @lhoestq in #4706
- Fix: crd3 - fix splits that were containing the same data by @lhoestq in #4705
Dataset Cards
- Add action names in schema_guided_dstc8 dataset card by @lhoestq in #4559
- Add evaluation data to acronym_identification by @lewtun in #4561
- Update WinoBias README by @sashavor in #4631
- Support "tags" yaml tag by @lhoestq in #4716
- Fix POS tags by @lhoestq in #4715
- AESLC dataset: Add summarization tags by @hobson in #4517
Documentation
- Update docs around audio and vision by @stevhliu in #4440
- Update Google Cloud Storage documentation and add Azure Blob Storage example by @alvarobartt in #4513
- Remove multiple config section by @stevhliu in #4600
- Create new sections for audio and vision in guides by @stevhliu in #4519
- Document installation of sox OS dependency for audio by @albertvillanova in #4713
General improvements and bug fixes
- Add regression test for ArrowWriter.write_batch when batch is empty by @alvarobartt in #4510
- Support all negative values in ClassLabel by @lhoestq in #4511
- Add uppercased versions of image file extensions for automatic module inference by @mariosasko in #4515
- Patch tests for hfh v0.8.0 by @LysandreJik in #4518
- Replace deprecated logging.warn with logging.warning by @hugovk in #4539
- [CI] Fix upstream hub test url by @lhoestq in #4543
- Fix timestamp conversion from Pandas to Python datetime in streaming mode by @lhoestq in #4541
- [CI] fixing seqeval install in ci by pinning setuptools-scm by @lhoestq in #4546
- Tell users to upload on the hub directly by @lhoestq in #4552
- Add batch_size parameter when calling add_faiss_index and add_faiss_index_from_external_arrays by @alvarobartt in #4535
- Make DuplicateKeysError more user friendly [For Issue #2556] by @VijayKalmath in #4545
- Properly raise FileNotFound even if the dataset is private by @lhoestq in #4536
- Fix hashing for python 3.9 by @lhoestq in #4516
- [CI] Fix some warnings by @lhoestq in #4547
- Validate new_fingerprint passed by user by @lhoestq in #4587
- Update CI Windows orb by @albertvillanova in #4604
- Perform hidden file check on relative data file path by @mariosasko in #4551
- Align more metadata with other repo types (models,spaces) by @julien-c in #4607
- Align/fix license metadata info by @julien-c in #4613
- Preserve member order by MockDownloadManager.iter_archive by @albertvillanova in #4611
- Add authentication tip to load_dataset by @mariosasko in #4577
- Stop dropping columns in to_tf_dataset() before we load batches by @Rocketknight1 in #4553
- fix(dataset_wrappers): Fixes access to fsspec.asyn in torch_iterable_dataset.py. by @gugarosa in #4630
- Fix xisfile, xgetsize, xisdir, xlistdir in private repo by @lhoestq in #4608
- Rename master to main by @lhoestq in #4643
- Set HF_SCRIPTS_VERSION to main by @lhoestq in #4645
- [Minor fix] Typo correction by @cakiki in #4644
- fixed duplicate calculation of spearmanr function in metrics wrapper. by @benlipkin in #4627
- Generalize meta_path json file creation in load.py [#4540] by @VijayKalmath in #4590
- Fix time type _arrow_to_datasets_dtype conversion by @mariosasko in #4628
- Fix _resolve_single_pattern_locally on Windows with multiple drives by @albertvillanova in #4660
- Replace assertEqual with assertTupleEqual in unit tests for verbosity by @alvarobartt in #4496
- Fix embed_storage on features inside lists/sequences by @mariosasko in #4615
- Add links to vision tasks scripts in ADD_NEW_DATASET template by @mariosasko in #4512
- Transfer CI to GitHub Actions by @albertvillanova in #4659
- Fix mock fsspec by @albertvillanova in #4685
- Trigger CI also on push to main by @albertvillanova in #4687
- Fix ImageFolder with parameters drop_metadata=True and drop_labels=False (when metadata.jsonl is present) by @polinaeterna in #4622
- Skip test_extractor only for zstd param if zstandard not installed by @albertvillanova in #4688
- Test extractors for all compression formats by @albertvillanova in #4689
- Refactor base extractors by @albertvillanova in #4690
- Update create dataset card docs by @stevhliu in #4683
- Add text decorators by @stevhliu in #4663
- Skip tests only for lz4/zstd params if not installed by @albertvillanova in #4704
- Ensure ConcatenationTable.cast uses target_schema metadata by @dtuit in #4614
- Docs: Fix same-page hashlinks by @mishig25 in #4722
- Fix broken link to the Hub by @stevhliu in #4726
- Refactor conftest fixtures by @albertvillanova in #4723
- Add object detection processing tutorial by @nateraw in #4710
- Fix require torchaudio and refactor test requirements by @albertvillanova in #4708
- docs: ✏️ fix TranslationVariableLanguages example by @severo in #4731
- Pin rouge_score test dependency by @albertvillanova in #4735
- Fix named split sorting and remove unnecessary casting by @albertvillanova in #4714
- Make cast in from_pandas more robust by @mariosasko in #4703
- Make Extractor accept Path as input by @albertvillanova in #4718
- Refactor Hub tests by @albertvillanova in #4729
- Fix to dict conversion of DatasetInfo/Features by @mariosasko in #4741
New Contributors
- @hugovk made their first contribution in #4539
- @VijayKalmath made their first contribution in #4545
- @gugarosa made their first contribution in #4630
- @benlipkin made their first contribution in #4627
- @YooSungHyun made their first contribution in #4409
- @hobson made their first contribution in #4517
- @khushmeeet made their first contribution in #4554
- @dtuit made their first contribution in #4614
Full Changelog: 2.3.2...2.4.0
Bug fixes
- Fix double dots in data files by @lhoestq in #4505
  - fix a bug when /../ is passed to data_files causing FileNotFoundError
- fix ETT m1/m2 test/val dataset by @kashif in #4499
- Corrected broken links in doc by @clefourrier in #4501
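To show what a `/../` segment in a data file path means, here is a small sketch that collapses such segments lexically with the standard library. This is not the library's resolution code. The helper name collapse_dot_segments and the example path are hypothetical.

```python
import posixpath


def collapse_dot_segments(path):
    """Collapse '.' and '..' segments lexically, without touching the filesystem."""
    return posixpath.normpath(path)


# 'train/..' cancels out, so the path points into 'data/test'.
resolved = collapse_dot_segments("data/train/../test/part-0.csv")
```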
New Contributors
- @clefourrier made their first contribution in #4501
Full Changelog: 2.3.1...2.3.2
Bug fixes
- Fix patching module that doesn't exist by @lhoestq in #4495
  - fix bug when importing the lib when scipy is not installed
- Re-add download_manager module in utils by @lhoestq in #4497
  - fix moved imports of DownloadConfig, DownloadMode, DownloadManager
- Support streaming UDHR dataset by @albertvillanova in #4487
Full Changelog: 2.3.0...2.3.1
Datasets Changes
- New: ImageNet-Sketch by @nateraw in #4301
- New: Biwi Kinect Head Pose by @dnaveenr in #3903
- New: enwik8 by @HallerPatrick in #4321
- New: LCCC dataset by @silverriver in #4416
- New: TruthfulQA by @jon-tow in #4159
- New: BIG-bench by @andersjohanandreassen in #4125
- New: QuickDraw by @mariosasko in #3592
- New: SST-2 by @albertvillanova in #4473
- Update: imagenet-1k - remove manual download by @mariosasko in #4299
  - ImageNet can now be loaded in Python with load_dataset without requiring a manual download!
  - It also supports streaming mode with load_dataset("imagenet-1k", streaming=True)
- Update: spider - Remove Google Drive URL by @albertvillanova in #4410
- Update: blended_skill_talk - add missing columns by @mariosasko in #4437
- Update: multi-news - Use newer version with fixes by @JohnGiorgi in #4451
- Update: fever - update data URLs by @albertvillanova in #4455
- Update: udhr - Add and fix language tags by @albertvillanova in #4459
- Update: udhr - update metadata by @leondz in #4362
- Update: wider_face - Replace data URLs once hosted on the Hub by @albertvillanova in #4469
- Update: PASS - update dataset version by @mariosasko in #4488
- Fix: GEM - fix bug in wiki_auto_asset_turk config by @albertvillanova in #4389
- Fix: GEM - fix URL for totto config by @albertvillanova in #4396
- Fix: timit_asr - fix DuplicatedKeysError by @albertvillanova in #4424
- Fix: timit_asr - Make extensions case-insensitive by @albertvillanova in #4425
- Fix: timit_asr - Fix directory names for LDC data by @albertvillanova in #4436
- Fix: iwslt2017 by @lhoestq in #4481
Dataset Features
- to_tf_dataset rewrite by @Rocketknight1 in #4170
  - see more in the documentation
- Support DataLoader with num_workers > 0 in streaming mode by @lhoestq in #4375
  - see more in the documentation
- Added stratify option to train_test_split by @nandwalritik in #4322
- Re-add support for Apache Beam functionality by @albertvillanova in #4328
- Resume push_to_hub: skip identical files in push_to_hub instead of overwriting by @mariosasko in #4402
- Support nested/complex feature types as features in packaged loaders by @mariosasko in #4364
- Optimize contiguous shard and select by @lhoestq in #4466
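The stratify entry above refers to splitting so that each label keeps roughly the same proportion in the train and test parts. Here is a minimal pure-Python sketch of that idea. It is not the library's implementation. The function stratified_split and its parameters are hypothetical names chosen for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.25, seed=0):
    """Return (train_indices, test_indices), sampling test_size from each label group."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        # Take the same fraction from every label group.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = ["a"] * 8 + ["b"] * 4
train_idx, test_idx = stratified_split(labels, test_size=0.25)
test_labels = [labels[i] for i in test_idx]
```

In this toy input, the 2:1 ratio between labels "a" and "b" is preserved in both parts, which is the property a stratified split is meant to guarantee.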
Dataset Cards
- Minor fixes/improvements in scene_parse_150 card by @mariosasko in #4447
- Tidy up license metadata for google_wellformed_query, newspop, sick by @leondz in #4378
- Fix example in opus_ubuntu, Add license info by @leondz in #4360
- Update README.md of fquad by @lhoestq in #4450
Documentation
- Add API code examples for loading methods by @stevhliu in #4300
- Add API code examples for remaining main classes by @stevhliu in #4292
- Generalize tutorials for audio and vision by @stevhliu in #4468
- [Docs] How to use with PyTorch page by @lhoestq in #4474
- First draft of the docs for TF + Datasets by @Rocketknight1 in #4457
Other improvements and bug fixes
- Update CI deprecated legacy image by @albertvillanova in #4393
- remove int documentation from logging docs by @lvwerra in #4392
- Fix docstring in DatasetDict::shuffle by @felixdivo in #4344
- Fix Version equality by @albertvillanova in #4359
- Set builder name from module instead of class by @albertvillanova in #4388
- Test dill by @albertvillanova in #4385
- Refactor download by @albertvillanova in #4384
- Fix dependency on dill version by @albertvillanova in #4397
- Support remote cache_dir by @albertvillanova in #4347
- Update imagenet gate by @lhoestq in #4408
- Fix dataset builder default version by @albertvillanova in #4356
- Uncomment logging deactivation for ArrowBasedBuilder by @thomasw21 in #4403
- Rename DatasetBuilder config_name by @albertvillanova in #4414
- Fix metadata validation by @albertvillanova in #4390
- Add HF.co for PRs/Issues for specific datasets by @lhoestq in #4427
- Fix type hint and documentation for new_fingerprint by @fxmarty in #4326
- Skip hidden files/directories in data files resolution and iter_files by @mariosasko in #4412
- Fix docstring of inspect_dataset by @albertvillanova in #4438
- Fix builder docstring by @albertvillanova in #4432
- Fix kwargs in docstrings by @albertvillanova in #4444
- Fix missing args in docstring of load_dataset_builder by @albertvillanova in #4445
- Add missing kwargs to docstrings by @albertvillanova in #4446
- Add extractor for bzip2-compressed files by @asivokon in #4421
- Fix dummy dataset generation script for handling nested types of _URLs by @silverriver in #4434
- Update dataset_infos.json with new split info in Dataset.push_to_hub to avoid verification error by @mariosasko in #4415
- Update builder docstring for deprecated/added arguments by @albertvillanova in #4429
- Extend support for streaming datasets that use xml.dom.minidom.parse by @albertvillanova in #4464
- Fix script fetching and local path handling in inspect_dataset and inspect_metric by @mariosasko in #4433
- Fix bigbench config names by @lhoestq in #4465
- Fix 401 error for unauthenticated requests to non-existing repos by @lhoestq in #4472
- Reorder returned validation/test splits in script template by @albertvillanova in #4470
- Better ImportError message when a dataset script dependency is missing by @lhoestq in #4484
- Fix cast to null by @lhoestq in #4485
- Update _format_columns in remove_columns by @alvarobartt in #4411
- Fix wrong map parameter name in cache docs by @h4iku in #4293
- Pin the revision in imagenet download links by @lhoestq in #4492
- Refactor column mappings for question answering datasets by @lewtun in #4391
New Contributors
- @leondz made their first contribution in #4378
- @felixdivo made their first contribution in #4344
- @nandwalritik made their first contribution in #4322
- @fxmarty made their first contribution in #4326
- @HallerPatrick made their first contribution in #4321
- @silverriver made their first contribution in #4416
- @asivokon made their first contribution in #4421
- @andersjohanandreassen made their first contribution in #4125
Full Changelog: 2.2.2...2.3.0
Datasets fixes
- Fix: irc_disentangle - fix checksum and bug dataset by @albertvillanova in #4377
- Fix: CC-Aligned - fix invalid url by @juntang-zhuang in #4231
- Fix: multi_news - don't strip proceeding hyphen by @JohnGiorgi in #4353
Bug fixes
- Support lists of multi-dimensional numpy arrays by @albertvillanova in #4194
- Check if dataset features match before push in DatasetDict.push_to_hub by @mariosasko in #4372
- Pin dill by @albertvillanova in #4380
  - dill 0.3.5 has some issues in transformers; pinning the version to <0.3.5 for now
Dataset Cards
- Adding eval metadata for ade v2 by @sashavor in #4319
- Adding eval metadata for AG News by @sashavor in #4329
- Adding eval metadata to Allociné dataset by @sashavor in #4330
- Adding eval metadata to Amazon Polarity by @sashavor in #4331
- Adding eval metadata for arabic speech corpus by @sashavor in #4332
- Adding eval metadata for Banking 77 by @sashavor in #4333
- Eval metadata Batch 4: Tweet Eval, Tweets Hate Speech Detection, VCTK, Weibo NER, Wisesight Sentiment, XSum, Yahoo Answers Topics, Yelp Polarity, Yelp Review Full by @sashavor in #4338
- Eval metadata batch 3: Reddit, Rotten Tomatoes, SemEval 2010, Sentiment 140, SMS Spam, Snips, SQuAD, SQuAD v2, Timit ASR by @sashavor in #4337
- Eval metadata batch 1: BillSum, CoNLL2003, CoNLLPP, CUAD, Emotion, GigaWord, GLUE, Hate Speech 18, Hate Speech by @sashavor in #4335
- Eval metadata batch 2 : Health Fact, Jigsaw Toxicity, LIAR, LJ Speech, MSRA NER, Multi News, NCBI Disease, Poem Sentiment by @sashavor in #4336
Docs
- Add API code examples for Builder classes by @stevhliu in #4313
- Add redirect to dataset script in the repo structure page by @lhoestq in #4369
Other improvements and bug fixes
- Fix failing CI on Windows for sari and wiki_split metrics by @albertvillanova in #4342
- Fix never ending GH Action to build documentation by @albertvillanova in #4345
- Fix warning in upload_file by @albertvillanova in #4355
- Fix warning in push_to_hub by @albertvillanova in #4357
- Remove config names as yaml keys by @lhoestq in #4367
- Add missing language tags for udhr dataset by @albertvillanova in #4371
- Remove links in docs to old dataset viewer by @mariosasko in #4373
New Contributors
- @JohnGiorgi made their first contribution in #4353
- @juntang-zhuang made their first contribution in #4231
Full Changelog: 2.2.1...2.2.2
Datasets bug fixes
- Fix cnn_dailymail (dm stories were ignored) by @lhoestq in #4317
  - datasets 2.2.0 introduced a bug in cnn_dailymail and some examples were missing in the dataset
General improvements and bug fixes
- Fix: Add missing comma by @mrm8488 in #4303
- Catch pull error when mirroring by @lhoestq in #4314
- Remove unused multiprocessing args from test CLI by @albertvillanova in #4308
- Fix CLI run_beam namespace by @albertvillanova in #4315
- Support passing config_kwargs to CLI run_beam by @albertvillanova in #4316
- Don't check f.loc in _get_extraction_protocol_with_magic_number by @lhoestq in #4318
New Contributors
Full Changelog: 2.2.0...2.2.1
Dataset Changes
- New: ImageNet by @apsdehal in #4178
  - Manual download only for now
- New: Google Conceptual Captions by @abhishekkrthakur in #1459
- New: Conceptual 12M by @thomasw21 in #4162
- New: Visual Genome by @thomasw21 in #4161
- New: RVL-CDIP by @dnaveenr in #4050
- New: Text-based NP Enrichment (TNE) by @yanaiela in #4153
- New: TextVQA by @apsdehal in #3967
- New: ETT time series dataset by @kashif in #4213
- Update: assin2 - update metadata by @lhoestq in #4172
- Update: Librispeech - Add 'all' config by @patrickvonplaten in #4184
- Update: XGLUE - Support streaming dataset by @albertvillanova in #4249
- Update: crd3 - group all the turns in one example by @shanyas10 in #4240
- Update: pubmed_qa - Remove google drive URL by @lhoestq in #4255
- Update: SAMSum - Replace data URL dataset and support streaming by @albertvillanova in #4254
- Update: SAMSum - Replace data URL dataset within the same repository by @albertvillanova in #4267
- Update: big_patent - Replace data URL in dataset and support streaming by @albertvillanova in #4236
- Update: openbookqa - Add missing features for additional config by @albertvillanova in #4278
- Update: commonsense_qa - Add missing features by @albertvillanova in #4280
- Fix: Common Voice - Make sure bytes are correctly deleted if path exists by @patrickvonplaten in #4212
- Fix: openbookqa - fix bug in choices labels by @manandey in #4259
- Fix: openbookqa - fix style in openbookqa dataset by @albertvillanova in #4270
Dataset Features
- Add support for metadata files to imagefolder by @mariosasko in #4069
  - load a folder of images and metadata stored in metadata.jsonl, more info in the documentation on how to load an image dataset
- Infer splits from the data_dir parameter when loading datasets without script by @polinaeterna in #4144
  - splits are inferred from the directory and file names, see more info in the documentation on how to structure your repository
- Enable label alignment for token classification datasets by @lewtun in #4277
- Add drop_last_batch to IterableDataset.map by @mariosasko in #4215
- Load dataset with TSV files by @albertvillanova in #4246
Dataset Cards
- Autoeval config by @nrajani in #4234
  - Add train-eval-index metadata to automate evaluation on your datasets based on their tasks
- Adding license information for Openbookcorpus by @meg-huggingface in #3525
- Make code for image downloading from image urls cacheable by @mariosasko in #4218
- Fix description links in dataset cards by @albertvillanova in #4222
- Add YAML tags to Dataset Card rotten tomatoes by @mo6zes in #4262
- Remove a copy-paste sentence in dataset cards by @albertvillanova in #4281
- Update LexGLUE README.md by @iliaschalkidis in #4285
- leaderboard info added for TNE by @yanaiela in #4273
- Add Lahnda language tag by @mariosasko in #4286
- Add license and point of contact to big_patent dataset by @albertvillanova in #4269
- Add HF Speech Bench to Librispeech Dataset Card by @sanchit-gandhi in #4266
Metrics Changes
- Perplexity Speedup by @emibaylor in #4108
- Add AUC ROC Metric by @emibaylor in #4158
- Small fixes in ROC AUC docs by @wschella in #4239
- Fix/start token mask issue and update documentation by @TristanThrush in #4258
- Add pearsonr mc, update functionality to match the original docs by @emibaylor in #4226
Metric Cards
- Metric card for the XTREME-S dataset by @sashavor in #4251
- Creating metric card for MAE by @sashavor in #4252
- Create metric cards for mean IOU by @sashavor in #4253
- Create metric card for Mahalanobis Distance by @sashavor in #4257
- Create metric card for MSE by @sashavor in #4256
- Fix exact match by @emibaylor in #4166
- Fix google bleu typos, examples by @emibaylor in #4165
- Add f1 metric card, update docstring in py file by @emibaylor in #4227
- Add Recall Metric Card by @emibaylor in #4204
- Matthews Correlation Metric Card by @emibaylor in #4110
- Add Precision Metric Card by @emibaylor in #4203
- Add Accuracy Metric Card by @emibaylor in #4223
- Add Spearmanr Metric Card by @emibaylor in #4109
- Metric card template by @emibaylor in #3915
Documentation
- Document save_to_disk and push_to_hub on images and audio files by @lhoestq in #4193
- Add to docs how to load from local script by @albertvillanova in #4200
- Add code examples to API docs by @stevhliu in #4168
- Add code examples for DatasetDict by @stevhliu in #4245
- Add API code examples for IterableDataset by @stevhliu in #4274
- Add packaged builder configs to the documentation by @lhoestq in #4307
- [Imagefolder] Docs + Don't infer labels from file names when there are metadata + Error messages when metadata and images aren't linked correctly by @lhoestq in #4311
General improvements and bug fixes
- Generate tasks.json taxonomy from huggingface_hub by @julien-c in #4154
- Fix when map function modifies input in-place by @thomasw21 in #4174
- Support streaming cnn_dailymail dataset by @albertvillanova in #4188
- Don't duplicate data when encoding audio or image by @lhoestq in #4187
- Fix outdated docstring about default dataset config by @lhoestq in #4186
- Deprecate shard_size in push_to_hub in favor of max_shard_size by @mariosasko in #4190
- Fix some type annotation in doc by @thomasw21 in #4202
- Update GH template for dataset viewer issues by @albertvillanova in #4201
- Update auth when mirroring datasets on the hub by @lhoestq in #4242
- Rename imagenet2012 -> imagenet-1k by @lhoestq in #4263
- Skip checksum computation in Imagefolder by default by @mariosasko in #4214
- Fix convert_file_size_to_int for kilobits and megabits by @mariosasko in #4205
- Fix typo in logging docs by @stevhliu in #4272
- Bump PyArrow Version to 6 by @dnaveenr in #4250
- task id update by @nrajani in #4244
- Avoid recursion error in map if example is returned as dict value by @mariosasko in #4216
- Update minimal PyArrow version warning by @mariosasko in #4279
- [Minor edit] Fix typo in class name by @cakiki in #4207
- Stream private zipped images by @lhoestq in #4173
- Fix filesystem docstring by @stevhliu in #4283
- Document how to use FAISS index for special operations by @albertvillanova in #4189
- Contributing MedMCQA dataset by @monk1337 in #4064
- Don't do unnecessary list type casting to avoid replacing None values by empty lists by @lhoestq in #4282
- Fix missing lz4 dependency for tests by @albertvillanova in #4295
- Altered faiss installation comment by @vishalsrao in #4220
- Fix CLI run_beam save_infos by @albertvillanova in #4294
- Add missing faiss import to fix #4287 by @alvarobartt in #4288
New Contributors
- @shanyas10 made their first contribution in #4240
- @apsdehal made their first contribution in #4178
- @wschella made their first contribution in #4239
- @TristanThrush made their first contribution in #4258
- @yanaiela made their first contribution in #4153
- @mo6zes made their first contribution in #4262
- @nrajani made their first contribution in #4244
- @sanchit-gandhi made their first contribution in #4266
- @cakiki made their first contribution in #4207
- @monk1337 made their first contribution in #4064
- @alvarobartt made their first contribution in #4288
Full Changelog: 2.1.0...2.2.0
Datasets Changes
- New: initial monash time series forecasting by @kashif in #3743
- New: Roman Urdu Hate Speech dataset by @bp-high in #3972
- New: Adversarial GLUE by @jxmorris12 in #3849
- New: MetaShift by @dnaveenr in #3900
- New: GSM8K by @jon-tow in #4103
- New: SBU Captions Photo by @thomasw21 in #4130
- Deprecated: Multilingual Librispeech - deprecate dataset in favor of facebook/multilingual_librispeech by @polinaeterna in #4060
- Update (BREAKING): TIMIT - Redirect users to download data manually from LDC by @lhoestq in #4145
- Update: Wikipedia by @albertvillanova in #3821 and #3989
- Update: conll2012_ontonotesv5 - Support streaming by @albertvillanova in #4002
- Update: daily_dialog - Support streaming by @albertvillanova in #4008
- Update: id_clickbait - Support streaming by @albertvillanova in #4014
- Update: blimp - Support streaming by @albertvillanova in #4016
- Update: scan - Support streaming by @albertvillanova in #4017
- Update: yelp_review_full - Replace data url by @lhoestq in #4018
- Update: yelp_polarity - Support streaming by @lhoestq in #4019
- Update: amazon_polarity - Replace data URL by @lhoestq in #4020
- Update: dbpedia_14 - Replace data url by @lhoestq in #4022
- Update: xtreme - Support streaming dataset for bucc18 config by @albertvillanova in #4026
- Update: yahoo_answers_topics - Replace data url by @lhoestq in #4023
- Update: ASSIN 2 dataset - replace broken Google Drive URLs by links on GitHub by @ruanchaves in #4004
- Update: xcopa - Support streaming by @albertvillanova in #4039
- Update: medical_dialog - Add configs with processed data by @albertvillanova in #4127
- Update: xtreme - Support streaming for udpos config by @albertvillanova in #4131
- Update: xtreme - Support streaming for PAWS-X config by @albertvillanova in #4132
- Update: xtreme - Support streaming for PAN-X config by @albertvillanova in #4135
- Update: SQuAD v2 - Use a constant for the articles regex by @bryant1410 in #4030
- Update: HANS - Support streaming by @mariosasko in #4155
- Fix: cats_vs_dogs - fix checksum error dataset by @albertvillanova in #4033
- Fix: xcopa - fix null checksum by @albertvillanova in #4034
- Fix: amazon_us_reviews - fix metadata - 4/4/2022 by @trentonstrong in #4092
Dataset Cards
- Updated annotations for nli_tr dataset by @e-budur in #4058
- Add missing label for emotion description by @lijiazheng99 in #4151
- Remove unnecessary 'pylint disable' message in ReadMe by @Datta0 in #3955
- Improve RedCaps dataset card by @mariosasko in #4100
- Fix duplicate key in multi_news by @lhoestq in #4164
Datasets Tags and Search on the Hugging Face Hub
- Tasks alignment with models by @lhoestq in #4066
- Update datasets task tags to align tags with models by @lhoestq in #4067
Metrics Changes
- Xtreme-S Metrics by @patrickvonplaten in #3799
- Fix xtreme s metrics by @patrickvonplaten in #3957
- Avoid info log messages from transformers in FrugalScore metric by @albertvillanova in #3938
- Add exact match metric by @emibaylor in #3899
- Fix comet metric by @lhoestq in #3945
- Add zero_division argument to precision and recall metrics by @albertvillanova in #4035
- Support float data types in pearsonr/spearmanr metrics by @albertvillanova in #4054
- Remove GLEU metric by @emibaylor in #3949
Metric Cards
- Perplexity Metric Card by @emibaylor in #3905
- Create README.md by @sashavor in #3917
- Create README.md for CER metric by @sashavor in #3911
- Create README.md by @sashavor in #3944
- Update README.md by @sashavor in #3933
- Create SARI metric card by @sashavor in #3932
- Create MAUVE metric card by @sashavor in #3934
- Create CoVAL metric card by @sashavor in #3940
- Google BLEU Metric Card by @emibaylor in #3948
- Create metric card for BERTScore by @sashavor in #3966
- Rename wer to cer by @pmgautam in #4012
- Create metric card for XNLI by @sashavor in #4046
- Create metric card for the Code Eval metric by @sashavor in #4049
- Add TER metric card by @emibaylor in #3981
- BLEU metric card by @emibaylor in #3947
- Create metric card for CUAD by @sashavor in #4043
- Create metric card for METEOR by @sashavor in #4065
- Create a metric card for Competition MATH by @sashavor in #4073
- Create metric card for seqeval by @sashavor in #4070
- Create README.md by @sashavor in #3930
- Create metric card for Frugal Score by @sashavor in #4089
- Updating FrugalScore metric card by @sashavor in #4097
- Proposing WikiSplit metric card by @sashavor in #4098
- Fix formatting in BLEU metric card by @mariosasko in #4157
Documentation
- Doc maintenance by @stevhliu in #3926
- [Doc] Don't use v for version tags on GitHub by @sgugger in #3943
- Use templates for doc-building jobs by @sgugger in #3914
- Add align_labels_with_mapping docs by @stevhliu in #3931
- Add tip on how to speed up loading with ImageFolder by @mariosasko in #3980
- Fix main_classes docs index by @lhoestq in #3925
- More consistent references in docs by @mariosasko in #3988
- Docs maintenance by @stevhliu in #3999
- Add ROUGE Metric Card by @emibaylor in #4076
- Add chrF(++) Metric Card by @emibaylor in #4082
- Add SacreBLEU Metric Card by @emibaylor in #4083
General improvements and bug fixes
- Fix flatten of complex feature types by @mariosasko in #3723
- Fix flatten of Sequence feature type by @lhoestq in #3962
- Exclude Google Drive tests of the CI by @lhoestq in #3982
- Close PIL.Image file handler in Image.decode_example by @mariosasko in #3995
- Fix Faiss custom_index device by @albertvillanova in #3987
- Fix None issue with Sequence of dict by @lhoestq in #4010
- Update main readme by @lhoestq in #3927
- Fix map remove_columns on empty dataset by @lhoestq in #4021
- Fix Audio.encode_example() when writing an array by @polinaeterna in #3998
- Use audio feature in ASR task template by @lhoestq in #4006
- Improve out of bounds error message by @lhoestq in #4068
- Increase max retries for GitHub metrics by @albertvillanova in #4063
- Fix CLI dummy data generation by @albertvillanova in #4045
- Fix docs on audio feature installation by @albertvillanova in #4028
- Add installation instructions to image_process doc by @mariosasko in #4072
- Fix GithubMetricModuleFactory instantiation with None download_config by @albertvillanova in #4078
- Increase max retries for GitHub datasets by @albertvillanova in #4079
- Close parquet writer properly in push_to_hub by @lhoestq in #4081
- fix typo in rename_column error message by @hunterlang in #4095
- Fix BeamWriter output Parquet file by @albertvillanova in #4087
- Remove unused legacy Beam utils by @albertvillanova in #4088
- Hotfix failing CI tests on Windows by @albertvillanova in #4119
- Update security policy by @albertvillanova in #4111
- Avoid writing empty license files by @albertvillanova in #4090
- Support huggingface_hub 0.5 by @lhoestq in #4106
- Pretty print dataset info files by @mariosasko in #4116
- Add single dataset citations for TweetEval by @gchhablani in #4137
- Adjust path to datasets tutorial in How-To by @NimaBoscarino in #4147
- Applied index-filters on scores in search.py. by @vishalsrao in #3971
- More robust cast_to_python_objects in TypedSequence by @mariosasko in #4128
- Sync Features dictionaries by @mariosasko in #3997
- Avoid rate limit in update hub repositories by @lhoestq in #4167
New Contributors
- @bp-high made their first contribution in #3972
- @ruanchaves made their first contribution in #4004
- @pmgautam made their first contribution in #4012
- @hunterlang made their first contribution in #4095
- @trentonstrong made their first contribution in #4092
- @NimaBoscarino made their first contribution in #4147
- @jon-tow made their first contribution in #4103
- @lijiazheng99 made their first contribution in #4151
- @Datta0 made their first contribution in #3955
- @vishalsrao made their first contribution in #3971
Full Changelog: 2.0.0...2.1.0
🤗 Datasets 2.0.0
We're happy to announce that our new documentation is available at hf.co/docs/datasets !
Dataset Features
- Load a folder of images using the imagefolder dataset loader:
  - Add imagefolder dataset by @nateraw in #2830
  - Faster ImageFolder + add option to drop labels by @mariosasko in #3887
- Push your image and audio datasets on the Hugging Face Hub with push_to_hub:
  - Add support for Audio and Image feature in push_to_hub by @mariosasko in #3685
- New processing methods for streaming datasets:
- And more:
  - Add more compression types for to_json by @bhavitvyamalik in #3551
  - Multi-GPU support for FaissIndex by @rentruewang in #3721
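As a rough illustration of the imagefolder loader mentioned above, the sketch below builds a class-per-directory layout and lists the label names such a layout implies. The helper names (`make_layout`, `labels_from_layout`, `load_image_folder`) are hypothetical, the placeholder files are empty rather than real images, and the assumption that labels come from the directory names should be checked against the library documentation for your version.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def make_layout(root: Path, classes: Dict[str, List[str]]) -> None:
    """Create root/<label>/<filename> with empty placeholder files."""
    for label, filenames in classes.items():
        class_dir = root / label
        class_dir.mkdir(parents=True, exist_ok=True)
        for name in filenames:
            (class_dir / name).touch()  # placeholder only, not a real image


def labels_from_layout(root: Path) -> List[str]:
    """Hypothetical helper: class directories under root, sorted by name."""
    return sorted(p.name for p in root.iterdir() if p.is_dir())


def load_image_folder(root: Path):
    """Load the folder with the imagefolder loader.

    Not called below: it needs the datasets library, Pillow, and real image files.
    """
    from datasets import load_dataset

    return load_dataset("imagefolder", data_dir=str(root))


root = Path(tempfile.mkdtemp())
make_layout(root, {"cat": ["1.png", "2.png"], "dog": ["1.png"]})
labels = labels_from_layout(root)
print(labels)
```

With real images in place, calling `load_image_folder(root)` would be the step that actually exercises the loader.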
Breaking changes
- API changes for map and shuffle for datasets loaded in streaming mode:
- Rename GenerateMode to DownloadMode by @albertvillanova in #3759
- Remove deprecated methods/params (preparation for v2.0) by @mariosasko in #3803
- Remove deprecated remove_columns param in filter by @mariosasko in #3827
- Module namespace cleanup for v2.0 by @mariosasko in #3875
Dataset Changes
- New: CFPB Consumer Complaints by @kayvane1 in #3617
- New: told-br (brazilian hate speech) by @JAugusto97 in #3683
- New: electricity load diagram by @kashif in #3722
- New: MIT Scene Parsing Benchmark by @mariosasko in #3607
- New: ElkarHizketak v1.0 by @antxa in #3780
- New: wikitablequestions by @SivilTaram in #3870
- New: ontonotes_conll by @richarddwang in #3853
- Update: BnL Historical Newspapers - make the dataset streamable by @albertvillanova in #3616
- Update: Common voice - add validated partition by @shalymin-amzn in #3669
- Update: Common Voice - add local paths to audio files by @lhoestq in #3736
- Update: Common Voice - simplify code by @lhoestq in #3817
- Update: Natural Questions - add dev-only configuration by @albertvillanova in #3699
- Update: pubmed - update data url by @albertvillanova in #3692
- Update: pubmed - make the dataset streamable by @abhi-mosaic in #3740
- Update: RedCaps - make the dataset streamable by @mariosasko in #3737
- Update: cats_vs_dogs - update metadata by @albertvillanova in #3752
- Update: newsroom - update manual download url by @albertvillanova in #3779
- Update: xcopa - update to new version by @albertvillanova in #3810
- Update: cats_vs_dogs size by @mariosasko in #3878
- Fix: sem_eval_2018_task_1 - fix download location by @maxpel in #3643
- Fix: newsqa - fix unique keys by @albertvillanova in #3696
- Fix: The Pile datasets - fix host urls by @albertvillanova in #3627
- Fix: Evidence Infer Treatment - fix dataset script by @albertvillanova in #3718
- Fix: NewsQA - fix dataset script by @albertvillanova in #3734
- Fix: head_qa - fix data url by @albertvillanova in #3766
- Fix: msr_sqa - fix unique keys by @albertvillanova in #3771
- Fix: reddit_tifu - fix data url by @albertvillanova in #3774
- Fix: wiki_lingua - fix spanish data file url by @albertvillanova in #3806
- Fix: beans - fix data urls by @mariosasko in #3890
- Fix: CRD3 - fix NonMatchingChecksumError by @albertvillanova in #3921
- Fix: MultiWOZ 2.2 - fix NonMatchingChecksumError by @albertvillanova in #3922
Dataset cards
- Add code example in wikipedia card by @lhoestq in #3678
- Fix Multi-News dataset metadata and card by @albertvillanova in #3731
- Reddit dataset card additions by @anna-kay in #3781
- Update gigaword card and info by @mariosasko in #3775
- Reddit dataset card contribution by @anna-kay in #3797
Metric Changes
- New: FrugalScore by @moussaKam in #3674
- New: Mahalanobis distance by @JoaoLages in #3794
- New: mIoU by @NielsRogge in #3745
- New: MSE and MAE - V2 by @dnaveenr in #3874
- Fix: METEOR - fix bug due to nltk version by @albertvillanova in #3884
Metric cards
- Add perplexity to metrics by @emibaylor in #3757
- Create SQuAD metric README.md by @sashavor in #3873
- SQuAD v2 metric: create README.md by @sashavor in #3879
- Update README.md for SQuAD v2 metric by @sashavor in #3908
- Update README.md for SQuAD metric by @sashavor in #3907
- Create README.md for WER metric by @sashavor in #3898
- Create README.md for GLUE by @sashavor in #3916
New documentation
General improvements and bug fixes
- Better TQDM output by @mariosasko in #3654
- Prioritize module.builder_kwargs over defaults in TestCommand by @lvwerra in #3672
- Extend support for streaming datasets that use os.path.relpath by @albertvillanova in #3623
- Add Fon language tag by @albertvillanova in #3620
- Remove unnecessary 'r' arg in by @bryant1410 in #3661
- Fix TestCommand to copy dataset_infos to local dir with only data files by @albertvillanova in #3680
- Upgrade black to version ~=22.0 by @LysandreJik in #3691
- Fix streaming for servers not supporting HTTP range requests by @albertvillanova in #3689
- Pin ElasticSearch by @lhoestq in #3701
- Raise informative error when loading a save_to_disk dataset by @albertvillanova in #3705
- Fix ClassLabel to/from dict when passed names_file by @albertvillanova in #3695
- Fix CI code quality issue by @albertvillanova in #3710
- Check if indices values in Dataset.select are within bounds by @mariosasko in #3719
- Pin pandas to avoid bug in streaming mode by @albertvillanova in #3725
- Use config pandas version in CSV dataset builder by @albertvillanova in #3726
- Set base path to hub url for canonical datasets by @lhoestq in #3709
- Fix ValueError message formatting in int2str by @akulchik in #3742
- Patch all module attributes in its namespace by @albertvillanova in #3727
- Fix typo in train split name by @albertvillanova in #3751
- feat: 🎸 generate info if dataset_infos.json does not exist by @severo in #3670
- Support streaming in size estimation function in push_to_hub by @mariosasko in #3732
- Expose method and fix param by @severo in #3767
- Fix HfFileSystem docstring by @lhoestq in #3768
- process .opus files (for Multilingual Spoken Words) by @polinaeterna in #3666
- Fix: dataset name is stored in keys by @thomasw21 in #3772
- Use the same seed to shuffle shards and metadata in streaming mode by @lhoestq in #3746
- Start removing canonical datasets logic by @lhoestq in #3777
- Support passing str to iter_files by @albertvillanova in #3783
- Fix Google Drive URL to avoid Virus scan warning by @albertvillanova in #3787
- Skip checksum computation if ignore_verifications is True by @mariosasko in #3796
- Fix error message in CSV loader for newer Pandas versions by @mariosasko in #3798
- Add data_dir to data_files resolution and misc improvements to HfFileSystem by @mariosasko in #3791
- Error of writing with different schema, due to nonpreservation of nullability by @richarddwang in #3782
- Handle Nones in PyArrow struct by @mariosasko in #3814
- Fix iter_archive getting reset by @lhoestq in #3815
- Added computer vision tasks by @merveenoyan in #3800
- Fix typo in doc build yml by @mishig25 in #3819
- Allow not specifying feature cols other than predictions/references in Metric.compute by @mariosasko in #3824
- Logo float left by @mishig25 in #3836
- Pin responses to fix CI for Windows by @albertvillanova in #3840
- Fix dead dataset scripts creation link. by @dnaveenr in #3834
- Remove decode: true for image feature in head_qa by @craffel in #3805
- Update faiss device docstring by @lhoestq in #3846
- Update index.mdx margins by @gary149 in #3858
- Fix push_to_hub with null images by @lhoestq in #3856
- Redundant add dataset information and dead link. by @dnaveenr in #3852
- Update image dataset tags by @mariosasko in #3864
- Bring back imgs so that forks don't get broken by @mishig25 in #3866
- Small typos in How-to-train tutorial. by @lkhphuc in #3833
- Small doc fixes by @mishig25 in #3860
- add pandas to env command by @patrickvonplaten in #3871
- Ignore duplicate keys if ignore_verifications=True by @mariosasko in #3868
- Update code blocks by @lhoestq in #3863
- Fix download_mode in dataset_module_factory by @albertvillanova in #3876
- Fix some shuffle docs by @lhoestq in #3885
- Fix race condition in doc build by @lhoestq in #3891
- Add default branch for doc building by @sgugger in #3893
- [docs] make dummy data creation optional by @lhoestq in #3894
- Fix code examples indentation by @lhoestq in #3895
- Align tqdm control/cache control with Transformers by @mariosasko in #3897
- Fix CLI test checksums by @albertvillanova in #3892
- Fix Google Drive URL to avoid Virus scan warning in streaming mode by @mariosasko in #3843
- Change the framework switches to the new syntax by @sgugger in #3880
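To make the Dataset.select bounds check listed above more concrete, here is a minimal sketch of that kind of validation in plain Python. The function name `check_indices_in_bounds` is hypothetical, and accepting negative indices in the range -size..-1 is an assumption made for illustration; this is not the library's actual implementation.

```python
from typing import Sequence


def check_indices_in_bounds(indices: Sequence[int], size: int) -> None:
    """Raise IndexError if any index falls outside [-size, size)."""
    for i in indices:
        # Assumption: negative indices count from the end, as in Python lists.
        if not -size <= i < size:
            raise IndexError(f"Index {i} out of range for dataset of size {size}.")


# Valid indices pass silently.
check_indices_in_bounds([0, 2, -1], size=3)

# An out-of-range index fails early, with a message naming the offending value.
try:
    check_indices_in_bounds([0, 3], size=3)
except IndexError as err:
    message = str(err)
    print(message)
```

Validating up front means the error names the bad index directly instead of surfacing later from a lower-level lookup.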
New Contributors
- @kayvane1 made their first contribution in #3617
- @JAugusto97 made their first contribution in #3683
- @shalymin-amzn made their first contribution in #3669
- @kashif made their first contribution in #3722
- @akulchik made their first contribution in #3742
- @abhi-mosaic made their first contribution in #3740
- @emibaylor made their first contribution in #3757
- @anna-kay made their first contribution in #3781
- @JoaoLages made their first contribution in #3794
- @mishig25 made their first contribution in #3690
- @antxa made their first contribution in #3780
- @dnaveenr made their first contribution in #3834
- @lkhphuc made their first contribution in #3833
- @rentruewang made their first contribution in #3721
- @gary149 made their first contribution in #3858
- @NielsRogge made their first contribution in #3745
- @sashavor made their first contribution in #3873
- @SivilTaram made their first contribution in #3870
- Document cases for github datasets by @lhoestq in #3924
- Fix text loader to split only on universal newlines by @albertvillanova in #3910
- Retry HfApi call inside push_to_hub when 504 error by @albertvillanova in #3886
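The text loader fix above concerns which characters count as line breaks. The standard-library sketch below shows why this matters: `str.splitlines` also splits on characters such as U+2028, while universal newlines cover only `\n`, `\r`, and `\r\n`. The regular-expression split is an illustrative stand-in and is not claimed to be how the loader does it.

```python
import re

# One logical line contains U+2028 (LINE SEPARATOR), which is not a universal newline.
text = "first\u2028still first\nsecond\r\nthird\rfourth"

# str.splitlines treats U+2028 (and several other characters) as a line boundary.
splitlines_result = text.splitlines()

# Splitting only on universal newlines keeps the U+2028 line intact.
# "\r\n" is listed first so it is consumed as a single separator.
universal_result = re.split(r"\r\n|\r|\n", text)

print(len(splitlines_result))  # 5
print(len(universal_result))   # 4
```

The extra boundary from `splitlines` would turn one text example into two, which is the kind of mismatch such a fix avoids.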
Full Changelog: 1.18.3...2.0.0
Bug fixes
- Prioritize module.builder_kwargs over defaults in TestCommand #3672 (@lvwerra)
- Fix TestCommand to copy dataset_infos to local dir with only data files #3680 (@albertvillanova)
- Upgrade black to version ~=22.0 #3691 (@LysandreJik)
- Fix streaming for servers not supporting HTTP range requests #3689 (@albertvillanova)
- Pin ElasticSearch #3701 (@lhoestq)
- Fix ClassLabel to/from dict when passed names_file #3695 (@albertvillanova)
- Fix CI code quality issue #3710 (@albertvillanova)
- Check if indices values in Dataset.select are within bounds #3719 (@mariosasko)
- Pin pandas to avoid bug in streaming mode #3725 (@albertvillanova)
- Use config pandas version in CSV dataset builder #3726 (@albertvillanova)
- Fix dataset mirroring (@lhoestq)
- Fix ValueError message formatting in int2str #3742 (@akulchik)
- Patch all module attributes in its namespace #3727 (@albertvillanova)
- Fix HfFileSystem docstring #3768 (@lhoestq)
- Fix: dataset name is stored in keys #3772 (@thomasw21)
- Fix Google Drive URL to avoid Virus scan warning #3787 (@albertvillanova)
- Fix error message in CSV loader for newer Pandas versions #3798 (@mariosasko)
- Pin responses to fix CI for Windows #3840 (@albertvillanova)
Full Changelog: 1.18.3...1.18.4