Fix time type `_arrow_to_datasets_dtype` conversion #4628

mariosasko · 2022-07-04T16:20:15Z

The issue stems from the fact that `pa.array([time_data]).type` returns `DataType(time64[unit])`, which doesn't expose the `unit` attribute, instead of `Time64Type(time64[unit])`. I believe this is a bug in PyArrow. Luckily, both types have the same `str()`, so in this PR I call `pa.type_for_alias(str(type))` to convert them both to the `Time64Type(time64[unit])` format.

cc @severo

github-actions · 2022-07-04T16:17:55Z

PyArrow==6.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008753 / 0.011353 (-0.002600)	0.004159 / 0.011008 (-0.006850)	0.029933 / 0.038508 (-0.008575)	0.035409 / 0.023109 (0.012300)	0.307991 / 0.275898 (0.032093)	0.332821 / 0.323480 (0.009341)	0.006628 / 0.007986 (-0.001357)	0.004980 / 0.004328 (0.000652)	0.007406 / 0.004250 (0.003156)	0.038529 / 0.037052 (0.001477)	0.285959 / 0.258489 (0.027470)	0.343784 / 0.293841 (0.049943)	0.031949 / 0.128546 (-0.096597)	0.009837 / 0.075646 (-0.065809)	0.252360 / 0.419271 (-0.166912)	0.052521 / 0.043533 (0.008988)	0.292870 / 0.255139 (0.037731)	0.314168 / 0.283200 (0.030968)	0.092384 / 0.141683 (-0.049299)	1.817181 / 1.452155 (0.365026)	1.895129 / 1.492716 (0.402413)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.329868 / 0.018006 (0.311862)	0.578615 / 0.000490 (0.578125)	0.020492 / 0.000200 (0.020292)	0.000150 / 0.000054 (0.000095)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.026803 / 0.037411 (-0.010608)	0.109268 / 0.014526 (0.094742)	0.116576 / 0.176557 (-0.059980)	0.162952 / 0.737135 (-0.574184)	0.119381 / 0.296338 (-0.176957)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.420224 / 0.215209 (0.205015)	4.183476 / 2.077655 (2.105822)	1.795067 / 1.504120 (0.290947)	1.592892 / 1.541195 (0.051697)	1.715241 / 1.468490 (0.246751)	0.438456 / 4.584777 (-4.146321)	4.822341 / 3.745712 (1.076628)	2.222622 / 5.269862 (-3.047240)	0.942113 / 4.565676 (-3.623563)	0.052940 / 0.424275 (-0.371335)	0.012035 / 0.007607 (0.004428)	0.518633 / 0.226044 (0.292589)	5.202614 / 2.268929 (2.933686)	2.214123 / 55.444624 (-53.230501)	1.887806 / 6.876477 (-4.988670)	2.059426 / 2.142072 (-0.082647)	0.561196 / 4.805227 (-4.244032)	0.122998 / 6.500664 (-6.377667)	0.062548 / 0.075469 (-0.012921)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.613139 / 1.841788 (-0.228648)	14.902228 / 8.074308 (6.827920)	26.542624 / 10.191392 (16.351232)	0.871870 / 0.680424 (0.191447)	0.528610 / 0.534201 (-0.005591)	0.494000 / 0.579283 (-0.085283)	0.515395 / 0.434364 (0.081031)	0.329894 / 0.540337 (-0.210443)	0.348506 / 1.386936 (-1.038430)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008817 / 0.011353 (-0.002535)	0.004320 / 0.011008 (-0.006688)	0.029670 / 0.038508 (-0.008838)	0.035356 / 0.023109 (0.012247)	0.314365 / 0.275898 (0.038467)	0.329691 / 0.323480 (0.006211)	0.006897 / 0.007986 (-0.001088)	0.003889 / 0.004328 (-0.000439)	0.007736 / 0.004250 (0.003486)	0.040925 / 0.037052 (0.003873)	0.293496 / 0.258489 (0.035007)	0.336601 / 0.293841 (0.042760)	0.031702 / 0.128546 (-0.096844)	0.009894 / 0.075646 (-0.065753)	0.252631 / 0.419271 (-0.166641)	0.052360 / 0.043533 (0.008827)	0.298285 / 0.255139 (0.043146)	0.324213 / 0.283200 (0.041014)	0.097796 / 0.141683 (-0.043887)	1.853045 / 1.452155 (0.400890)	1.894968 / 1.492716 (0.402252)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.347924 / 0.018006 (0.329917)	0.569546 / 0.000490 (0.569056)	0.034078 / 0.000200 (0.033879)	0.000433 / 0.000054 (0.000378)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.028256 / 0.037411 (-0.009155)	0.107045 / 0.014526 (0.092519)	0.114863 / 0.176557 (-0.061694)	0.156924 / 0.737135 (-0.580211)	0.116138 / 0.296338 (-0.180201)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.422840 / 0.215209 (0.207631)	4.227137 / 2.077655 (2.149482)	1.889461 / 1.504120 (0.385341)	1.702655 / 1.541195 (0.161460)	1.853991 / 1.468490 (0.385500)	0.437951 / 4.584777 (-4.146826)	4.613988 / 3.745712 (0.868276)	3.711988 / 5.269862 (-1.557874)	0.930517 / 4.565676 (-3.635159)	0.053542 / 0.424275 (-0.370733)	0.012215 / 0.007607 (0.004608)	0.534874 / 0.226044 (0.308829)	5.385805 / 2.268929 (3.116876)	2.319419 / 55.444624 (-53.125205)	2.015965 / 6.876477 (-4.860511)	2.207421 / 2.142072 (0.065349)	0.560491 / 4.805227 (-4.244736)	0.124876 / 6.500664 (-6.375788)	0.061886 / 0.075469 (-0.013583)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.612840 / 1.841788 (-0.228948)	15.091749 / 8.074308 (7.017441)	26.719293 / 10.191392 (16.527900)	0.876356 / 0.680424 (0.195932)	0.526170 / 0.534201 (-0.008031)	0.494383 / 0.579283 (-0.084900)	0.508831 / 0.434364 (0.074467)	0.323148 / 0.540337 (-0.217189)	0.333935 / 1.386936 (-1.053001)

HuggingFaceDocBuilderDev · 2022-07-04T16:26:37Z

The documentation is not available anymore as the PR was closed or merged.

lhoestq

Thanks !

Fix time type _arrow_to_datasets_dtype conversion

1ac71f7

lhoestq approved these changes Jul 7, 2022

mariosasko merged commit e662d75 into main Jul 7, 2022
8 checks passed

mariosasko deleted the fix-4620 branch Jul 7, 2022
