2.4.0

@lhoestq

Dataset Features

Add concatenate_datasets for iterable datasets by @lhoestq in #4500
Support parallelism with PyTorch DataLoader with parquet/json/csv/text/image/etc. files by @mariosasko in #4625
Support using PCM audio files (#4323) by @YooSungHyun in #4409
[data_files] Files disambiguation: match split names in data files if they are between separators by @lhoestq in #4633
Support extract 7-zip compressed data files by @albertvillanova in #4672
Support extract lz4 compressed data files by @albertvillanova in #4700
Support metadata.jsonl from parent directories in imagefolder by @mariosasko in #4576
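
The data-files disambiguation entry above (#4633) says a split name is matched only when it sits between separators. The sketch below illustrates that idea with a plain regular expression. The separator set and the helper name `mentions_split` are assumptions made for illustration and are not taken from the library's source, whose actual patterns may differ.

```python
import re

# Hypothetical separator set chosen for this sketch (not the library's own list).
SEPARATORS = r"[-._ /0-9]"


def mentions_split(path: str, split: str) -> bool:
    """Return True if `split` occurs in `path` bounded by separators or string edges."""
    pattern = rf"(?:^|{SEPARATORS}){re.escape(split)}(?:{SEPARATORS}|$)"
    return re.search(pattern, path) is not None


# "test" is delimited by "/" and "." or by "_" on both sides: counted as a split name.
print(mentions_split("data/test.csv", "test"))
print(mentions_split("my_test_file.csv", "test"))

# "test" is only a substring of a longer word: not counted as a split name.
print(mentions_split("latest.csv", "test"))
print(mentions_split("contest/data.csv", "test"))
```

Requiring a separator on both sides is what keeps a file such as latest.csv from being assigned to the test split by accident.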

Dataset changes

Update: allocine - Support streaming by @albertvillanova in #4563
Update: multi_news - Host data on the Hub instead of Google Drive by @albertvillanova in #4585
Update: pn_summary - Host data on the Hub instead of Google Drive by @albertvillanova in #4586
Update: financial_phrasebank - Host data on the Hub by @albertvillanova in #4598
Update: cfq - Support streaming by @albertvillanova in #4579
Update: head_qa - Host data on the Hub and fix NonMatchingChecksumError by @albertvillanova in #4588
Update: bookcorpus - Support streaming dataset by @albertvillanova in #4564
Update: fever - Refactor and add metadata by @albertvillanova in #4503
Update: mlsum - Support streaming dataset by @albertvillanova in #4574
Fix: cats_vs_dogs - Update download url and improve card by @mariosasko in #4523
Fix: conll2003 - fix empty example by @lhoestq in #4662
Fix: WMT datasets - fix loading issue when choosing specific subsets and docs update by @khushmeeet in #4554
Fix: xtreme - fix empty examples in dataset for bucc18 config by @lhoestq in #4706
Fix: crd3 - fix splits that were containing the same data by @lhoestq in #4705

Dataset Cards

Add action names in schema_guided_dstc8 dataset card by @lhoestq in #4559
Add evaluation data to acronym_identification by @lewtun in #4561
Update WinoBias README by @sashavor in #4631
Support "tags" yaml tag by @lhoestq in #4716
Fix POS tags by @lhoestq in #4715
AESLC dataset: Add summarization tags by @hobson in #4517

Documentation

Update docs around audio and vision by @stevhliu in #4440
Update Google Cloud Storage documentation and add Azure Blob Storage example by @alvarobartt in #4513
Remove multiple config section by @stevhliu in #4600
Create new sections for audio and vision in guides by @stevhliu in #4519
Document installation of sox OS dependency for audio by @albertvillanova in #4713

General improvements and bug fixes

Add regression test for ArrowWriter.write_batch when batch is empty by @alvarobartt in #4510
Support all negative values in ClassLabel by @lhoestq in #4511
Add uppercased versions of image file extensions for automatic module inference by @mariosasko in #4515
Patch tests for hfh v0.8.0 by @LysandreJik in #4518
Replace deprecated logging.warn with logging.warning by @hugovk in #4539
[CI] Fix upstream hub test url by @lhoestq in #4543
Fix timestamp conversion from Pandas to Python datetime in streaming mode by @lhoestq in #4541
[CI] fixing seqeval install in ci by pinning setuptools-scm by @lhoestq in #4546
Tell users to upload on the hub directly by @lhoestq in #4552
Add batch_size parameter when calling add_faiss_index and add_faiss_index_from_external_arrays by @alvarobartt in #4535
Make DuplicateKeysError more user friendly [For Issue #2556] by @VijayKalmath in #4545
Properly raise FileNotFound even if the dataset is private by @lhoestq in #4536
Fix hashing for python 3.9 by @lhoestq in #4516
[CI] Fix some warnings by @lhoestq in #4547
Validate new_fingerprint passed by user by @lhoestq in #4587
Update CI Windows orb by @albertvillanova in #4604
Perform hidden file check on relative data file path by @mariosasko in #4551
Align more metadata with other repo types (models,spaces) by @julien-c in #4607
Align/fix license metadata info by @julien-c in #4613
Preserve member order by MockDownloadManager.iter_archive by @albertvillanova in #4611
Add authentication tip to load_dataset by @mariosasko in #4577
Stop dropping columns in to_tf_dataset() before we load batches by @Rocketknight1 in #4553
fix(dataset_wrappers): Fixes access to fsspec.asyn in torch_iterable_dataset.py. by @gugarosa in #4630
Fix xisfile, xgetsize, xisdir, xlistdir in private repo by @lhoestq in #4608
Rename master to main by @lhoestq in #4643
Set HF_SCRIPTS_VERSION to main by @lhoestq in #4645
[Minor fix] Typo correction by @cakiki in #4644
fixed duplicate calculation of spearmanr function in metrics wrapper. by @benlipkin in #4627
Generalize meta_path json file creation in load.py [#4540] by @VijayKalmath in #4590
Fix time type _arrow_to_datasets_dtype conversion by @mariosasko in #4628
Fix _resolve_single_pattern_locally on Windows with multiple drives by @albertvillanova in #4660
Replace assertEqual with assertTupleEqual in unit tests for verbosity by @alvarobartt in #4496
Fix embed_storage on features inside lists/sequences by @mariosasko in #4615
Add links to vision tasks scripts in ADD_NEW_DATASET template by @mariosasko in #4512
Transfer CI to GitHub Actions by @albertvillanova in #4659
Fix mock fsspec by @albertvillanova in #4685
Trigger CI also on push to main by @albertvillanova in #4687
Fix ImageFolder with parameters drop_metadata=True and drop_labels=False (when metadata.jsonl is present) by @polinaeterna in #4622
Skip test_extractor only for zstd param if zstandard not installed by @albertvillanova in #4688
Test extractors for all compression formats by @albertvillanova in #4689
Refactor base extractors by @albertvillanova in #4690
Update create dataset card docs by @stevhliu in #4683
Add text decorators by @stevhliu in #4663
Skip tests only for lz4/zstd params if not installed by @albertvillanova in #4704
Ensure ConcatenationTable.cast uses target_schema metadata by @dtuit in #4614
Docs: Fix same-page hashlinks by @mishig25 in #4722
Fix broken link to the Hub by @stevhliu in #4726
Refactor conftest fixtures by @albertvillanova in #4723
Add object detection processing tutorial by @nateraw in #4710
Fix require torchaudio and refactor test requirements by @albertvillanova in #4708
docs: ✏️ fix TranslationVariableLanguages example by @severo in #4731
Pin rouge_score test dependency by @albertvillanova in #4735
Fix named split sorting and remove unnecessary casting by @albertvillanova in #4714
Make cast in from_pandas more robust by @mariosasko in #4703
Make Extractor accept Path as input by @albertvillanova in #4718
Refactor Hub tests by @albertvillanova in #4729
Fix to dict conversion of DatasetInfo/Features by @mariosasko in #4741

New Contributors

@hugovk made their first contribution in #4539
@VijayKalmath made their first contribution in #4545
@gugarosa made their first contribution in #4630
@benlipkin made their first contribution in #4627
@YooSungHyun made their first contribution in #4409
@hobson made their first contribution in #4517
@khushmeeet made their first contribution in #4554
@dtuit made their first contribution in #4614

Full Changelog: 2.3.2...2.4.0

@lhoestq

Bug fixes

Fix double dots in data files by @lhoestq in #4505
- fix a bug when /../ is passed to data_files causing FileNotFoundError
fix ETT m1/m2 test/val dataset by @kashif in #4499
Corrected broken links in doc by @clefourrier in #4501

New Contributors

@clefourrier made their first contribution in #4501

Full Changelog: 2.3.1...2.3.2

@lhoestq

Bug fixes

Fix patching module that doesn't exist by @lhoestq in #4495
- fix bug when importing the lib when scipy is not installed
Re-add download_manager module in utils by @lhoestq in #4497
- fix moved imports of DownloadConfig, DownloadMode, DownloadManager
Support streaming UDHR dataset by @albertvillanova in #4487

Full Changelog: 2.3.0...2.3.1

@nateraw

Datasets Changes

New: ImageNet-Sketch by @nateraw in #4301
New: Biwi Kinect Head Pose by @dnaveenr in #3903
New: enwik8 by @HallerPatrick in #4321
New: LCCC dataset by @silverriver in #4416
New: TruthfulQA by @jon-tow in #4159
New: BIG-bench by @andersjohanandreassen in #4125
New: QuickDraw by @mariosasko in #3592
New: SST-2 by @albertvillanova in #4473
Update: imagenet-1k - remove manual download by @mariosasko in #4299
- ImageNet can now be loaded in Python with load_dataset without requiring a manual download!
- It also supports streaming mode with load_dataset("imagenet-1k", streaming=True)
Update: spider - Remove Google Drive URL by @albertvillanova in #4410
Update: blended_skill_talk - add missing columns by @mariosasko in #4437
Update: multi-news - Use newer version with fixes by @JohnGiorgi in #4451
Update: fever - update data URLs by @albertvillanova in https://github.com/huggingface/datasets/pull/44554459
Update: udhr - Add and fix language tags by @albertvillanova in https://github.com/huggingface/datasets/pull/
Update: udhr - update metadata by @leondz in #4362
Update: wider_face - Replace data URLs once hosted on the Hub by @albertvillanova in #4469
Update: PASS - update dataset version by @mariosasko in #4488
Fix: GEM - fix bug in wiki_auto_asset_turk config by @albertvillanova in #4389
Fix: GEM - fix URL for totto config by @albertvillanova in #4396
Fix: timit_asr - fix DuplicatedKeysError by @albertvillanova in #4424
Fix: timit_asr - Make extensions case-insensitive by @albertvillanova in #4425
Fix: timit_asr - Fix directory names for LDC data by @albertvillanova in #4436
Fix: iwslt2017 by @lhoestq in #4481

Dataset Features

to_tf_dataset rewrite by @Rocketknight1 in #4170
- see more in the documentation
Support DataLoader with num_workers > 0 in streaming mode by @lhoestq in #4375
- see more in the documentation
Added stratify option to train_test_split by @nandwalritik in #4322
Re-add support for Apache Beam functionality by @albertvillanova in #4328
Resume push_to_hub: skip identical files in push_to_hub instead of overwriting by @mariosasko in #4402
Support nested/complex feature types as features in packaged loaders by @mariosasko in #4364
Optimize contiguous shard and select by @lhoestq in #4466
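
To illustrate what the stratify option of train_test_split (#4322) is for, here is a minimal pure-Python sketch of a stratified split: each label keeps roughly the same proportion in the train and test parts. The function `stratified_split` and its rounding rule are assumptions made for this illustration and are not the library's implementation. In the library itself the option is exposed as the `stratify_by_column` argument of `Dataset.train_test_split`.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size, seed=0):
    """Split indices so that every label contributes ~test_size of its rows to the test set."""
    rng = random.Random(seed)

    # Group row indices by label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, test = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_size)  # per-label test quota (illustrative rule)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)


labels = ["pos"] * 8 + ["neg"] * 4
train_idx, test_idx = stratified_split(labels, test_size=0.25)

# Label counts in each part; the 2:1 ratio of the full data is preserved in both.
print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```

A plain random split of the same data could easily leave the minority label out of the test part, which is the situation stratification avoids.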

Dataset Cards

Minor fixes/improvements in scene_parse_150 card by @mariosasko in #4447
Tidy up license metadata for google_wellformed_query, newspop, sick by @leondz in #4378
Fix example in opus_ubuntu, Add license info by @leondz in #4360
Update README.md of fquad by @lhoestq in #4450

Documentation

Add API code examples for loading methods by @stevhliu in #4300
Add API code examples for remaining main classes by @stevhliu in #4292
Generalize tutorials for audio and vision by @stevhliu in #4468
[Docs] How to use with PyTorch page by @lhoestq in #4474
First draft of the docs for TF + Datasets by @Rocketknight1 in #4457

Other improvements and bug fixes

Update CI deprecated legacy image by @albertvillanova in #4393
remove int documentation from logging docs by @lvwerra in #4392
Fix docstring in DatasetDict::shuffle by @felixdivo in #4344
Fix Version equality by @albertvillanova in #4359
Set builder name from module instead of class by @albertvillanova in #4388
Test dill by @albertvillanova in #4385
Refactor download by @albertvillanova in #4384
Fix dependency on dill version by @albertvillanova in #4397
Support remote cache_dir by @albertvillanova in #4347
Update imagenet gate by @lhoestq in #4408
Fix dataset builder default version by @albertvillanova in #4356
Uncomment logging deactivation for ArrowBasedBuilder by @thomasw21 in #4403
Rename DatasetBuilder config_name by @albertvillanova in #4414
Fix metadata validation by @albertvillanova in #4390
Add HF.co for PRs/Issues for specific datasets by @lhoestq in #4427
Fix type hint and documentation for new_fingerprint by @fxmarty in #4326
Skip hidden files/directories in data files resolution and iter_files by @mariosasko in #4412
Fix docstring of inspect_dataset by @albertvillanova in #4438
Fix builder docstring by @albertvillanova in #4432
Fix kwargs in docstrings by @albertvillanova in #4444
Fix missing args in docstring of load_dataset_builder by @albertvillanova in #4445
Add missing kwargs to docstrings by @albertvillanova in #4446
Add extractor for bzip2-compressed files by @asivokon in #4421
Fix dummy dataset generation script for handling nested types of _URLs by @silverriver in #4434
Update dataset_infos.json with new split info in Dataset.push_to_hub to avoid verification error by @mariosasko in #4415
Update builder docstring for deprecated/added arguments by @albertvillanova in #4429
Extend support for streaming datasets that use xml.dom.minidom.parse by @albertvillanova in #4464
Fix script fetching and local path handling in inspect_dataset and inspect_metric by @mariosasko in #4433
Fix bigbench config names by @lhoestq in #4465
Fix 401 error for unauthenticated requests to non-existing repos by @lhoestq in #4472
Reorder returned validation/test splits in script template by @albertvillanova in #4470
Better ImportError message when a dataset script dependency is missing by @lhoestq in #4484
Fix cast to null by @lhoestq in #4485
Update _format_columns in remove_columns by @alvarobartt in #4411
Fix wrong map parameter name in cache docs by @h4iku in #4293
Pin the revision in imagenet download links by @lhoestq in #4492
Refactor column mappings for question answering datasets by @lewtun in #4391

New Contributors

@leondz made their first contribution in #4378
@felixdivo made their first contribution in #4344
@nandwalritik made their first contribution in #4322
@fxmarty made their first contribution in #4326
@HallerPatrick made their first contribution in #4321
@silverriver made their first contribution in #4416
@asivokon made their first contribution in #4421
@andersjohanandreassen made their first contribution in #4125

Full Changelog: 2.2.2...2.3.0

@albertvillanova

Datasets fixes

Fix: irc_disentangle - fix checksum and bug dataset by @albertvillanova in #4377
Fix: CC-Aligned - fix invalid url by @juntang-zhuang in #4231
Fix: multi_news - don't strip proceeding hyphen by @JohnGiorgi in #4353

Bug fixes

Support lists of multi-dimensional numpy arrays by @albertvillanova in #4194
Check if dataset features match before push in DatasetDict.push_to_hub by @mariosasko in #4372
Pin dill by @albertvillanova in #4380
- dill 0.3.5 has some issues in transformers - pinning the version to <0.3.5 for now

Dataset Cards

Adding eval metadata for ade v2 by @sashavor in #4319
Adding eval metadata for AG News by @sashavor in #4329
Adding eval metadata to Allociné dataset by @sashavor in #4330
Adding eval metadata to Amazon Polarity by @sashavor in #4331
Adding eval metadata for arabic speech corpus by @sashavor in #4332
Adding eval metadata for Banking 77 by @sashavor in #4333
Eval metadata Batch 4: Tweet Eval, Tweets Hate Speech Detection, VCTK, Weibo NER, Wisesight Sentiment, XSum, Yahoo Answers Topics, Yelp Polarity, Yelp Review Full by @sashavor in #4338
Eval metadata batch 3: Reddit, Rotten Tomatoes, SemEval 2010, Sentiment 140, SMS Spam, Snips, SQuAD, SQuAD v2, Timit ASR by @sashavor in #4337
Eval metadata batch 1: BillSum, CoNLL2003, CoNLLPP, CUAD, Emotion, GigaWord, GLUE, Hate Speech 18, Hate Speech by @sashavor in #4335
Eval metadata batch 2 : Health Fact, Jigsaw Toxicity, LIAR, LJ Speech, MSRA NER, Multi News, NCBI Disease, Poem Sentiment by @sashavor in #4336

Docs

Add API code examples for Builder classes by @stevhliu in #4313
Add redirect to dataset script in the repo structure page by @lhoestq in #4369

Other improvements and bug fixes

Fix failing CI on Windows for sari and wiki_split metrics by @albertvillanova in #4342
Fix never ending GH Action to build documentation by @albertvillanova in #4345
Fix warning in upload_file by @albertvillanova in #4355
Fix warning in push_to_hub by @albertvillanova in #4357
Remove config names as yaml keys by @lhoestq in #4367
Add missing language tags for udhr dataset by @albertvillanova in #4371
Remove links in docs to old dataset viewer by @mariosasko in #4373

New Contributors

@JohnGiorgi made their first contribution in #4353
@juntang-zhuang made their first contribution in #4231

Full Changelog: 2.2.1...2.2.2

@lhoestq

Datasets bug fixes

Fix cnn_dailymail (dm stories were ignored) by @lhoestq in #4317
- datasets 2.2.0 introduced a bug in cnn_dailymail and some examples were missing in the dataset

General improvements and bug fixes

Fix: Add missing comma by @mrm8488 in #4303
Catch pull error when mirroring by @lhoestq in #4314
Remove unused multiprocessing args from test CLI by @albertvillanova in #4308
Fix CLI run_beam namespace by @albertvillanova in #4315
Support passing config_kwargs to CLI run_beam by @albertvillanova in #4316
Don't check f.loc in _get_extraction_protocol_with_magic_number by @lhoestq in #4318

New Contributors

@mrm8488 made their first contribution in #4303

Full Changelog: 2.2.0...2.2.1

@apsdehal

Dataset Changes

New: ImageNet by @apsdehal in #4178
- Manual download only for now
New: Google Conceptual Captions by @abhishekkrthakur in #1459
New: Conceptual 12M by @thomasw21 in #4162
New: Visual Genome by @thomasw21 in #4161
New: RVL-CDIP by @dnaveenr in #4050
New: Text-based NP Enrichment (TNE) by @yanaiela in #4153
New: TextVQA by @apsdehal in #3967
New: ETT time series dataset by @kashif in #4213
Update: assin2 - update metadata by @lhoestq in #4172
Update: Librispeech - Add 'all' config by @patrickvonplaten in #4184
Update: XGLUE - Support streaming dataset by @albertvillanova in #4249
Update: crd3 - group all the turns in one example by @shanyas10 in #4240
Update: pubmed_qa - Remove google drive URL by @lhoestq in #4255
Update: SAMSum - Replace data URL dataset and support streaming by @albertvillanova in #4254
Update: SAMSum - Replace data URL dataset within the same repository by @albertvillanova in #4267
Update: big_patent - Replace data URL in dataset and support streaming by @albertvillanova in #4236
Update: openbookqa - Add missing features for additional config by @albertvillanova in #4278
Update: commonsense_qa - Add missing features by @albertvillanova in #4280
Fix: Common Voice - Make sure bytes are correctly deleted if path exists by @patrickvonplaten in #4212
Fix: openbookqa - fix bug in choices labels by @manandey in #4259
Fix: openbookqa - fix style in openbookqa dataset by @albertvillanova in #4270

Dataset Features

Add support for metadata files to imagefolder by @mariosasko in #4069
- load a folder of images and metadata stored in metadata.jsonl, more info in the documentation on how to load an image dataset
Infer splits from the data_dir parameter when loading datasets without script by @polinaeterna in #4144
- splits are inferred from the directory and file names, see more info in the documentation on how to structure your repository
Enable label alignment for token classification datasets by @lewtun in #4277
Add drop_last_batch to IterableDataset.map by @mariosasko in #4215
Load dataset with TSV files by @albertvillanova in #4246
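
As a sketch of the metadata support for imagefolder described above (#4069), the snippet below writes a metadata.jsonl file next to a folder of images, one JSON object per line with a `file_name` key linking each row to an image. The helper `write_metadata`, the file names, and the `caption` column are made up for illustration, and no real images are created here.

```python
import json
import tempfile
from pathlib import Path


def write_metadata(image_dir, rows):
    """Write one JSON object per line; each row must carry a `file_name` key."""
    path = Path(image_dir) / "metadata.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return path


# Temporary folder standing in for a real image directory (hypothetical contents).
root = Path(tempfile.mkdtemp())
rows = [
    {"file_name": "0001.png", "caption": "a red square"},
    {"file_name": "0002.png", "caption": "a blue circle"},
]
metadata_path = write_metadata(root, rows)
print(metadata_path.read_text(encoding="utf-8"))
```

Once the matching image files are in place, the folder can be loaded with load_dataset("imagefolder", data_dir=...), and the extra keys (here `caption`) are expected to appear as additional columns.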

Dataset Cards

Autoeval config by @nrajani in #4234
- Add train-eval-index metadata to automate evaluation on your datasets based on their tasks
Adding license information for Openbookcorpus by @meg-huggingface in #3525
Make code for image downloading from image urls cacheable by @mariosasko in #4218
Fix description links in dataset cards by @albertvillanova in #4222
Add YAML tags to Dataset Card rotten tomatoes by @mo6zes in #4262
Remove a copy-paste sentence in dataset cards by @albertvillanova in #4281
Update LexGLUE README.md by @iliaschalkidis in #4285
Leaderboard info added for TNE by @yanaiela in #4273
Add Lahnda language tag by @mariosasko in #4286
Add license and point of contact to big_patent dataset by @albertvillanova in #4269
Add HF Speech Bench to Librispeech Dataset Card by @sanchit-gandhi in #4266

Metrics Changes

Perplexity Speedup by @emibaylor in #4108
Add AUC ROC Metric by @emibaylor in #4158
Small fixes in ROC AUC docs by @wschella in #4239
Fix/start token mask issue and update documentation by @TristanThrush in #4258
Add pearsonr mc, update functionality to match the original docs by @emibaylor in #4226

Metric Cards

Metric card for the XTREME-S dataset by @sashavor in #4251
Creating metric card for MAE by @sashavor in #4252
Create metric cards for mean IOU by @sashavor in #4253
Create metric card for Mahalanobis Distance by @sashavor in #4257
Create metric card for MSE by @sashavor in #4256
Fix exact match by @emibaylor in #4166
Fix google bleu typos, examples by @emibaylor in #4165
Add f1 metric card, update docstring in py file by @emibaylor in #4227
Add Recall Metric Card by @emibaylor in #4204
Matthews Correlation Metric Card by @emibaylor in #4110
Add Precision Metric Card by @emibaylor in #4203
Add Accuracy Metric Card by @emibaylor in #4223
Add Spearmanr Metric Card by @emibaylor in #4109
Metric card template by @emibaylor in #3915

Documentation

Document save_to_disk and push_to_hub on images and audio files by @lhoestq in #4193
Add to docs how to load from local script by @albertvillanova in #4200
Add code examples to API docs by @stevhliu in #4168
Add code examples for DatasetDict by @stevhliu in #4245
Add API code examples for IterableDataset by @stevhliu in #4274
Add packaged builder configs to the documentation by @lhoestq in #4307
[Imagefolder] Docs + Don't infer labels from file names when there are metadata + Error messages when metadata and images aren't linked correctly by @lhoestq in #4311

General improvements and bug fixes

Generate tasks.json taxonomy from huggingface_hub by @julien-c in #4154
Fix when map function modifies input in-place by @thomasw21 in #4174
Support streaming cnn_dailymail dataset by @albertvillanova in #4188
Don't duplicate data when encoding audio or image by @lhoestq in #4187
Fix outdated docstring about default dataset config by @lhoestq in #4186
Deprecate shard_size in push_to_hub in favor of max_shard_size by @mariosasko in #4190
Fix some type annotation in doc by @thomasw21 in #4202
Update GH template for dataset viewer issues by @albertvillanova in #4201
Update auth when mirroring datasets on the hub by @lhoestq in #4242
Rename imagenet2012 -> imagenet-1k by @lhoestq in #4263
Skip checksum computation in Imagefolder by default by @mariosasko in #4214
Fix convert_file_size_to_int for kilobits and megabits by @mariosasko in #4205
Fix typo in logging docs by @stevhliu in #4272
Bump PyArrow Version to 6 by @dnaveenr in #4250
task id update by @nrajani in #4244
Avoid recursion error in map if example is returned as dict value by @mariosasko in #4216
Update minimal PyArrow version warning by @mariosasko in #4279
[Minor edit] Fix typo in class name by @cakiki in #4207
Stream private zipped images by @lhoestq in #4173
Fix filesystem docstring by @stevhliu in #4283
Document how to use FAISS index for special operations by @albertvillanova in #4189
Contributing MedMCQA dataset by @monk1337 in #4064
Don't do unnecessary list type casting to avoid replacing None values by empty lists by @lhoestq in #4282
Fix missing lz4 dependency for tests by @albertvillanova in #4295
Altered faiss installation comment by @vishalsrao in #4220
Fix CLI run_beam save_infos by @albertvillanova in #4294
Add missing faiss import to fix #4287 by @alvarobartt in #4288

New Contributors

@shanyas10 made their first contribution in #4240
@apsdehal made their first contribution in #4178
@wschella made their first contribution in #4239
@TristanThrush made their first contribution in #4258
@yanaiela made their first contribution in #4153
@mo6zes made their first contribution in #4262
@nrajani made their first contribution in #4244
@sanchit-gandhi made their first contribution in #4266
@cakiki made their first contribution in #4207
@monk1337 made their first contribution in #4064
@alvarobartt made their first contribution in #4288

Full Changelog: 2.1.0...2.2.0

@kashif

Datasets Changes

New: initial monash time series forecasting by @kashif in #3743
New: Roman Urdu Hate Speech dataset by @bp-high in #3972
New: Adversarial GLUE by @jxmorris12 in #3849
New: MetaShift by @dnaveenr in #3900
New: GSM8K by @jon-tow in #4103
New: SBU Captions Photo by @thomasw21 in #4130
Deprecated: Multilingual Librispeech - deprecate dataset in favor of facebook/multilingual_librispeech by @polinaeterna in #4060
Update (BREAKING): TIMIT - Redirect users to download data manually from LDC by @lhoestq in #4145
Update: Wikipedia by @albertvillanova in #3821 and #3989
Update: conll2012_ontonotesv5 - Support streaming by @albertvillanova in #4002
Update: daily_dialog - Support streaming by @albertvillanova in #4008
Update: id_clickbait - Support streaming by @albertvillanova in #4014
Update: blimp - Support streaming by @albertvillanova in #4016
Update: scan - Support streaming by @albertvillanova in #4017
Update: yelp_review_full - Replace data url by @lhoestq in #4018
Update: yelp_polarity - Support streaming by @lhoestq in #4019
Update: amazon_polarity - Replace data URL by @lhoestq in #4020
Update: dbpedia_14 - Replace data url by @lhoestq in #4022
Update: xtreme - Support streaming dataset for bucc18 config by @albertvillanova in #4026
Update: yahoo_answers_topics - Replace data url by @lhoestq in #4023
Update: ASSIN 2 dataset - replace broken Google Drive URLs by links on GitHub by @ruanchaves in #4004
Update: xcopa - Support streaming by @albertvillanova in #4039
Update: medical_dialog - Add configs with processed data by @albertvillanova in #4127
Update: xtreme - Support streaming for udpos config by @albertvillanova in #4131
Update: xtreme - Support streaming for PAWS-X config by @albertvillanova in #4132
Update: xtreme - Support streaming for PAN-X config by @albertvillanova in #4135
Update: SQuAD v2 - Use a constant for the articles regex by @bryant1410 in #4030
Update: HANS - Support streaming by @mariosasko in #4155
Fix: cats_vs_dogs - fix checksum error dataset by @albertvillanova in #4033
Fix: xcopa - fix null checksum by @albertvillanova in #4034
Fix: amazon_us_reviews - fix metadata - 4/4/2022 by @trentonstrong in #4092

Dataset Cards

Updated annotations for nli_tr dataset by @e-budur in #4058
Add missing label for emotion description by @lijiazheng99 in #4151
Remove unnecessary 'pylint disable' message in ReadMe by @Datta0 in #3955
Improve RedCaps dataset card by @mariosasko in #4100
Fix duplicate key in multi_news by @lhoestq in #4164

Datasets Tags and Search on the Hugging Face Hub

Tasks alignment with models by @lhoestq in #4066
Update datasets task tags to align tags with models by @lhoestq in #4067

Metrics Changes

Xtreme-S Metrics by @patrickvonplaten in #3799
Fix xtreme s metrics by @patrickvonplaten in #3957
Avoid info log messages from transformers in FrugalScore metric by @albertvillanova in #3938
Add exact match metric by @emibaylor in #3899
Fix comet metric by @lhoestq in #3945
Add zero_division argument to precision and recall metrics by @albertvillanova in #4035
Support float data types in pearsonr/spearmanr metrics by @albertvillanova in #4054
Remove GLEU metric by @emibaylor in #3949

Metric Cards

Perplexity Metric Card by @emibaylor in #3905
Create README.md by @sashavor in #3917
Create README.md for CER metric by @sashavor in #3911
Create README.md by @sashavor in #3944
Update README.md by @sashavor in #3933
Create SARI metric card by @sashavor in #3932
Create MAUVE metric card by @sashavor in #3934
Create CoVAL metric card by @sashavor in #3940
Google BLEU Metric Card by @emibaylor in #3948
Create metric card for BERTScore by @sashavor in #3966
Rename wer to cer by @pmgautam in #4012
Create metric card for XNLI by @sashavor in #4046
Create metric card for the Code Eval metric by @sashavor in #4049
Add TER metric card by @emibaylor in #3981
BLEU metric card by @emibaylor in #3947
Create metric card for CUAD by @sashavor in #4043
Create metric card for METEOR by @sashavor in #4065
Create a metric card for Competition MATH by @sashavor in #4073
Create metric card for seqeval by @sashavor in #4070
Create README.md by @sashavor in #3930
Create metric card for Frugal Score by @sashavor in #4089
Updating FrugalScore metric card by @sashavor in #4097
Proposing WikiSplit metric card by @sashavor in #4098
Fix formatting in BLEU metric card by @mariosasko in #4157

Documentation

Doc maintenance by @stevhliu in #3926
[Doc] Don't use v for version tags on GitHub by @sgugger in #3943
Use templates for doc-building jobs by @sgugger in #3914
Add align_labels_with_mapping docs by @stevhliu in #3931
Add tip on how to speed up loading with ImageFolder by @mariosasko in #3980
Fix main_classes docs index by @lhoestq in #3925
More consistent references in docs by @mariosasko in #3988
Docs maintenance by @stevhliu in #3999
Add ROUGE Metric Card by @emibaylor in #4076
Add chrF(++) Metric Card by @emibaylor in #4082
Add SacreBLEU Metric Card by @emibaylor in #4083

General improvements and bug fixes

Fix flatten of complex feature types by @mariosasko in #3723
Fix flatten of Sequence feature type by @lhoestq in #3962
Exclude Google Drive tests of the CI by @lhoestq in #3982
Close PIL.Image file handler in Image.decode_example by @mariosasko in #3995
Fix Faiss custom_index device by @albertvillanova in #3987
Fix None issue with Sequence of dict by @lhoestq in #4010
Update main readme by @lhoestq in #3927
Fix map remove_columns on empty dataset by @lhoestq in #4021
Fix Audio.encode_example() when writing an array by @polinaeterna in #3998
Use audio feature in ASR task template by @lhoestq in #4006
Improve out of bounds error message by @lhoestq in #4068
Increase max retries for GitHub metrics by @albertvillanova in #4063
Fix CLI dummy data generation by @albertvillanova in #4045
Fix docs on audio feature installation by @albertvillanova in #4028
Add installation instructions to image_process doc by @mariosasko in #4072
Fix GithubMetricModuleFactory instantiation with None download_config by @albertvillanova in #4078
Increase max retries for GitHub datasets by @albertvillanova in #4079
Close parquet writer properly in push_to_hub by @lhoestq in #4081
fix typo in rename_column error message by @hunterlang in #4095
Fix BeamWriter output Parquet file by @albertvillanova in #4087
Remove unused legacy Beam utils by @albertvillanova in #4088
Hotfix failing CI tests on Windows by @albertvillanova in #4119
Update security policy by @albertvillanova in #4111
Avoid writing empty license files by @albertvillanova in #4090
Support huggingface_hub 0.5 by @lhoestq in #4106
Pretty print dataset info files by @mariosasko in #4116
Add single dataset citations for TweetEval by @gchhablani in #4137
Adjust path to datasets tutorial in How-To by @NimaBoscarino in #4147
Applied index-filters on scores in search.py. by @vishalsrao in #3971
More robust cast_to_python_objects in TypedSequence by @mariosasko in #4128
Sync Features dictionaries by @mariosasko in #3997
Avoid rate limit in update hub repositories by @lhoestq in #4167

New Contributors

@bp-high made their first contribution in #3972
@ruanchaves made their first contribution in #4004
@pmgautam made their first contribution in #4012
@hunterlang made their first contribution in #4095
@trentonstrong made their first contribution in #4092
@NimaBoscarino made their first contribution in #4147
@jon-tow made their first contribution in #4103
@lijiazheng99 made their first contribution in #4151
@Datta0 made their first contribution in #3955
@vishalsrao made their first contribution in #3971

Full Changelog: 2.0.0...2.1.0

@nateraw

🤗 Datasets 2.0.0

We're happy to announce that our new documentation is available at hf.co/docs/datasets.

Dataset Features

Load a folder of images using the imagefolder dataset loader:
- Add imagefolder dataset by @nateraw in #2830
- Faster ImageFolder + add option to drop labels by @mariosasko in #3887
Push your image and audio datasets on the Hugging Face Hub with push_to_hub:
- Add support for Audio and Image feature in push_to_hub by @mariosasko in #3685
New processing methods for streaming datasets:
- Add IterableDataset.filter by @lhoestq in #3826
- Manipulate columns on IterableDataset (rename columns, cast, etc.) by @lhoestq in #3862
- Add the new methods to IterableDatasetDict by @lhoestq in #3923
And more:
- Add more compression types for to_json by @bhavitvyamalik in #3551
- Multi-GPU support for FaissIndex by @rentruewang in #3721

Breaking changes

API changes for map and shuffle for datasets loaded in streaming mode:
- Align map when streaming: update instead of overwrite + add missing parameters by @lhoestq in #3801
- Align IterableDataset.shuffle with Dataset.shuffle by @lhoestq in #3842
Rename GenerateMode to DownloadMode by @albertvillanova in #3759
Remove deprecated methods/params (preparation for v2.0) by @mariosasko in #3803
Remove deprecated remove_columns param in filter by @mariosasko in #3827
Module namespace cleanup for v2.0 by @mariosasko in #3875
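
The `map` change listed above ("update instead of overwrite") can be pictured with a small sketch in plain Python. It does not call the library: `map_overwrite`, `map_update` and `add_length` are hypothetical helper names, and the sketch assumes the entry means that the mapped function's output is now merged into each streamed example rather than replacing it.

```python
from typing import Callable, Dict, Iterable, Iterator

Example = Dict[str, object]


def map_overwrite(
    examples: Iterable[Example], fn: Callable[[Example], Example]
) -> Iterator[Example]:
    """Assumed old streaming behaviour: fn's output replaces the example."""
    for example in examples:
        yield fn(example)


def map_update(
    examples: Iterable[Example], fn: Callable[[Example], Example]
) -> Iterator[Example]:
    """Assumed new streaming behaviour: fn's output is merged into the example."""
    for example in examples:
        # Keys returned by fn are added; existing keys are kept unless fn overrides them.
        yield {**example, **fn(example)}


def add_length(example: Example) -> Example:
    """Hypothetical mapping function that returns only one new column."""
    return {"length": len(str(example["text"]))}


stream = [{"text": "hello"}, {"text": "hi"}]

overwritten = list(map_overwrite(stream, add_length))
updated = list(map_update(stream, add_length))

print(overwritten)  # the "text" column is lost
print(updated)      # the "text" column is kept alongside "length"
```

Under this reading, code that relied on the old behaviour to drop columns implicitly would need to remove them explicitly after upgrading.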

Dataset Changes

New: CFPB Consumer Complaints by @kayvane1 in #3617
New: told-br (Brazilian hate speech) by @JAugusto97 in #3683
New: electricity load diagram by @kashif in #3722
New: MIT Scene Parsing Benchmark by @mariosasko in #3607
New: ElkarHizketak v1.0 by @antxa in #3780
New: wikitablequestions by @SivilTaram in #3870
New: ontonotes_conll by @richarddwang in #3853
Update: BnL Historical Newspapers - make the dataset streamable by @albertvillanova in #3616
Update: Common Voice - add validated partition by @shalymin-amzn in #3669
Update: Common Voice - add local paths to audio files by @lhoestq in #3736
Update: Common Voice - simplify code by @lhoestq in #3817
Update: Natural Questions - add dev-only configuration by @albertvillanova in #3699
Update: pubmed - update data url by @albertvillanova in #3692
Update: pubmed - make the dataset streamable by @abhi-mosaic in #3740
Update: RedCaps - make the dataset streamable by @mariosasko in #3737
Update: cats_vs_dogs - update metadata by @albertvillanova in #3752
Update: newsroom - update manual download url by @albertvillanova in #3779
Update: xcopa - update to new version by @albertvillanova in #3810
Update: cats_vs_dogs size by @mariosasko in #3878
Fix: sem_eval_2018_task_1 - fix download location by @maxpel in #3643
Fix: newsqa - fix unique keys by @albertvillanova in #3696
Fix: The Pile datasets - fix host urls by @albertvillanova in #3627
Fix: Evidence Infer Treatment - fix dataset script by @albertvillanova in #3718
Fix: NewsQA - fix dataset script by @albertvillanova in #3734
Fix: head_qa - fix data url by @albertvillanova in #3766
Fix: msr_sqa - fix unique keys by @albertvillanova in #3771
Fix: reddit_tifu - fix data url by @albertvillanova in #3774
Fix: wiki_lingua - fix spanish data file url by @albertvillanova in #3806
Fix: beans - fix data urls by @mariosasko in #3890
Fix: CRD3 - fix NonMatchingChecksumError by @albertvillanova in #3921
Fix: MultiWOZ 2.2 - fix NonMatchingChecksumError by @albertvillanova in #3922

Dataset cards

Add code example in wikipedia card by @lhoestq in #3678
Fix Multi-News dataset metadata and card by @albertvillanova in #3731
Reddit dataset card additions by @anna-kay in #3781
Update gigaword card and info by @mariosasko in #3775
Reddit dataset card contribution by @anna-kay in #3797

Metric Changes

New: FrugalScore by @moussaKam in #3674
New: Mahalanobis distance by @JoaoLages in #3794
New: mIoU by @NielsRogge in #3745
New: MSE and MAE - V2 by @dnaveenr in #3874
Fix: METEOR - fix bug due to nltk version by @albertvillanova in #3884

Metric cards

Add perplexity to metrics by @emibaylor in #3757
Create SQuAD metric README.md by @sashavor in #3873
SQuAD v2 metric: create README.md by @sashavor in #3879
Update README.md for SQuAD v2 metric by @sashavor in #3908
Update README.md for SQuAD metric by @sashavor in #3907
Create README.md for WER metric by @sashavor in #3898
Create README.md for GLUE by @sashavor in #3916

New documentation

Update docs to new frontend/UI by @mishig25 in #3690
Image process doc by @stevhliu in #3882

General improvements and bug fixes

Better TQDM output by @mariosasko in #3654
Prioritize module.builder_kwargs over defaults in TestCommand by @lvwerra in #3672
Extend support for streaming datasets that use os.path.relpath by @albertvillanova in #3623
Add Fon language tag by @albertvillanova in #3620
Remove unnecessary 'r' arg in by @bryant1410 in #3661
Fix TestCommand to copy dataset_infos to local dir with only data files by @albertvillanova in #3680
Upgrade black to version ~=22.0 by @LysandreJik in #3691
Fix streaming for servers not supporting HTTP range requests by @albertvillanova in #3689
Pin ElasticSearch by @lhoestq in #3701
Raise informative error when loading a save_to_disk dataset by @albertvillanova in #3705
Fix ClassLabel to/from dict when passed names_file by @albertvillanova in #3695
Fix CI code quality issue by @albertvillanova in #3710
Check if indices values in Dataset.select are within bounds by @mariosasko in #3719
Pin pandas to avoid bug in streaming mode by @albertvillanova in #3725
Use config pandas version in CSV dataset builder by @albertvillanova in #3726
Set base path to hub url for canonical datasets by @lhoestq in #3709
Fix ValueError message formatting in int2str by @akulchik in #3742
Patch all module attributes in its namespace by @albertvillanova in #3727
Fix typo in train split name by @albertvillanova in #3751
feat: 🎸 generate info if dataset_infos.json does not exist by @severo in #3670
Support streaming in size estimation function in push_to_hub by @mariosasko in #3732
Expose method and fix param by @severo in #3767
Fix HfFileSystem docstring by @lhoestq in #3768
process .opus files (for Multilingual Spoken Words) by @polinaeterna in #3666
Fix: dataset name is stored in keys by @thomasw21 in #3772
Use the same seed to shuffle shards and metadata in streaming mode by @lhoestq in #3746
Start removing canonical datasets logic by @lhoestq in #3777
Support passing str to iter_files by @albertvillanova in #3783
Fix Google Drive URL to avoid Virus scan warning by @albertvillanova in #3787
Skip checksum computation if ignore_verifications is True by @mariosasko in #3796
Fix error message in CSV loader for newer Pandas versions by @mariosasko in #3798
Add data_dir to data_files resolution and misc improvements to HfFileSystem by @mariosasko in #3791
Error of writing with different schema, due to nonpreservation of nullability by @richarddwang in #3782
Handle Nones in PyArrow struct by @mariosasko in #3814
Fix iter_archive getting reset by @lhoestq in #3815
Added computer vision tasks by @merveenoyan in #3800
Fix typo in doc build yml by @mishig25 in #3819
Allow not specifying feature cols other than predictions/references in Metric.compute by @mariosasko in #3824
Logo float left by @mishig25 in #3836
Pin responses to fix CI for Windows by @albertvillanova in #3840
Fix dead dataset scripts creation link. by @dnaveenr in #3834
Remove decode: true for image feature in head_qa by @craffel in #3805
Update faiss device docstring by @lhoestq in #3846
Update index.mdx margins by @gary149 in #3858
Fix push_to_hub with null images by @lhoestq in #3856
Redundant add dataset information and dead link. by @dnaveenr in #3852
Update image dataset tags by @mariosasko in #3864
Bring back imgs so that forks don't get broken by @mishig25 in #3866
Small typos in How-to-train tutorial. by @lkhphuc in #3833
Small doc fixes by @mishig25 in #3860
add pandas to env command by @patrickvonplaten in #3871
Ignore duplicate keys if ignore_verifications=True by @mariosasko in #3868
Update code blocks by @lhoestq in #3863
Fix download_mode in dataset_module_factory by @albertvillanova in #3876
Fix some shuffle docs by @lhoestq in #3885
Fix race condition in doc build by @lhoestq in #3891
Add default branch for doc building by @sgugger in #3893
[docs] make dummy data creation optional by @lhoestq in #3894
Fix code examples indentation by @lhoestq in #3895
Align tqdm control/cache control with Transformers by @mariosasko in #3897
Fix CLI test checksums by @albertvillanova in #3892
Fix Google Drive URL to avoid Virus scan warning in streaming mode by @mariosasko in #3843
Change the framework switches to the new syntax by @sgugger in #3880
Document cases for github datasets by @lhoestq in #3924
Fix text loader to split only on universal newlines by @albertvillanova in #3910
Retry HfApi call inside push_to_hub when 504 error by @albertvillanova in #3886

New Contributors

@kayvane1 made their first contribution in #3617
@JAugusto97 made their first contribution in #3683
@shalymin-amzn made their first contribution in #3669
@kashif made their first contribution in #3722
@akulchik made their first contribution in #3742
@abhi-mosaic made their first contribution in #3740
@emibaylor made their first contribution in #3757
@anna-kay made their first contribution in #3781
@JoaoLages made their first contribution in #3794
@mishig25 made their first contribution in #3690
@antxa made their first contribution in #3780
@dnaveenr made their first contribution in #3834
@lkhphuc made their first contribution in #3833
@rentruewang made their first contribution in #3721
@gary149 made their first contribution in #3858
@NielsRogge made their first contribution in #3745
@sashavor made their first contribution in #3873
@SivilTaram made their first contribution in #3870

Full Changelog: 1.18.3...2.0.0

@lvwerra

Bug fixes

Prioritize module.builder_kwargs over defaults in TestCommand #3672 (@lvwerra)
Fix TestCommand to copy dataset_infos to local dir with only data files #3680 (@albertvillanova)
Upgrade black to version ~=22.0 #3691 (@LysandreJik)
Fix streaming for servers not supporting HTTP range requests #3689 (@albertvillanova)
Pin ElasticSearch #3701 (@lhoestq)
Fix ClassLabel to/from dict when passed names_file #3695 (@albertvillanova)
Fix CI code quality issue #3710 (@albertvillanova)
Check if indices values in Dataset.select are within bounds #3719 (@mariosasko)
Pin pandas to avoid bug in streaming mode #3725 (@albertvillanova)
Use config pandas version in CSV dataset builder #3726 (@albertvillanova)
Fix dataset mirroring (@lhoestq)
Fix ValueError message formatting in int2str #3742 (@akulchik)
Patch all module attributes in its namespace #3727 (@albertvillanova)
Fix HfFileSystem docstring #3768 (@lhoestq)
Fix: dataset name is stored in keys #3772 (@thomasw21)
Fix Google Drive URL to avoid Virus scan warning #3787 (@albertvillanova)
Fix error message in CSV loader for newer Pandas versions #3798 (@mariosasko)
Pin responses to fix CI for Windows #3840 (@albertvillanova)

Full Changelog: 1.18.3...1.18.4
