[SPARK-38591][SQL] Add flatMapSortedGroups to KeyValueGroupedDataset #35899

EnricoMi · 2022-03-17T15:53:21Z

What changes were proposed in this pull request?

This adds a sorted version of ds.groupByKey(…).flatMapGroups(…).

Why are the changes needed?

The existing method flatMapGroups provides an iterator of rows for each group key. If user code requires those rows in a particular order, the iterator has to be materialized and sorted first, which defeats the purpose of an iterator. For groups that do not fit into the memory of one executor, this approach does not work.
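To make the limitation concrete, here is a minimal sketch of that workaround using the existing `groupByKey(…).flatMapGroups(…)` API. The `Event` case class, the object name, and the sample rows are assumptions made up for this illustration; they are not part of the PR.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical row type, used only for this illustration.
case class Event(id: Int, seq: Int, value: String)

object InMemorySortWorkaround {
  // Sorts each group inside flatMapGroups by materializing it first.
  def run(spark: SparkSession): Seq[Event] = {
    import spark.implicits._
    val ds = Seq(Event(1, 2, "b"), Event(2, 1, "x"), Event(1, 1, "a")).toDS()
    ds.groupByKey(e => e.id)
      .flatMapGroups { (_, rows) =>
        // sortBy has to pull every row of the group into memory
        // before it can return the first sorted row.
        rows.toSeq.sortBy(_.seq)
      }
      .collect()
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("in-memory-sort")
      .getOrCreate()
    try run(spark).foreach(println)
    finally spark.stop()
  }
}
```

This works for small groups, but the whole group is held in executor memory at once, which is exactly the case the quoted documentation below warns about.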

org.apache.spark.sql.KeyValueGroupedDataset:

Internally, the implementation will spill to disk if any given group is too large to fit into
memory. However, users must take care to avoid materializing the whole iterator for a group
(for example, by calling toList) unless they are sure that this is possible given the memory
constraints of their cluster.

The implementation of KeyValueGroupedDataset.flatMapGroups already sorts each partition according to the group key. By additionally sorting by some data columns, the iterator can be guaranteed to provide some order.
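The idea can be sketched in plain Scala, without Spark: once the rows are sorted by (group key, data column), a single pass can hand out one lazy iterator per group, and each of those iterators is already in the desired order. This is only an illustration of the technique under simplified assumptions; the sort here happens in memory, whereas Spark sorts the partition and can spill to disk. The name `groupSorted` and the sample tuples are made up for the example.

```scala
object SecondarySortSketch {
  /** Lazily splits an iterator that is already sorted by key
    * into (key, group iterator) pairs, without buffering a group. */
  def groupSorted[K, V](sorted: Iterator[V])(key: V => K): Iterator[(K, Iterator[V])] =
    new Iterator[(K, Iterator[V])] {
      private val in = sorted.buffered
      private var current: Iterator[V] = Iterator.empty

      def hasNext: Boolean = {
        // Skip whatever the caller left unread of the previous group.
        while (current.hasNext) current.next()
        in.hasNext
      }

      def next(): (K, Iterator[V]) = {
        if (!hasNext) throw new NoSuchElementException("no more groups")
        val k = key(in.head)
        // The group ends as soon as the next row carries a different key.
        current = new Iterator[V] {
          def hasNext: Boolean = in.hasNext && key(in.head) == k
          def next(): V =
            if (hasNext) in.next() else throw new NoSuchElementException("end of group")
        }
        (k, current)
      }
    }

  def main(args: Array[String]): Unit = {
    val rows = Seq((1, 2, "b"), (2, 1, "x"), (1, 1, "a"), (2, 0, "w"))
    // Sort by group key first, then by the data column (the secondary sort).
    val sorted = rows.sortBy { case (id, seq, _) => (id, seq) }.iterator
    groupSorted(sorted)(_._1).foreach { case (id, group) =>
      println(s"$id -> ${group.map(_._3).mkString(",")}")
    }
  }
}
```

Because each group iterator reads directly from the sorted stream, user code sees ordered rows without ever holding a full group in memory.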

Does this PR introduce any user-facing change?

This adds KeyValueGroupedDataset.flatMapSortedGroups.

How was this patch tested?

There are two tests: DatasetSuite."groupBy function, flatMapSorted by func" and DatasetSuite."groupBy function, flatMapSorted by expr".

AmplabJenkins · 2022-03-19T14:44:46Z

Can one of the admins verify this patch?

EnricoMi · 2022-04-11T10:03:23Z

@HyukjinKwon @cloud-fan @rxin @WeichenXu123

HyukjinKwon · 2022-04-29T03:56:38Z

sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala

+   *
+   * @since 3.4.0
+   */
+  def flatMapSortedGroups[S: Encoder, U : Encoder]


I think it's too much to add an API that only allows sorting the data part. In particular, we can already do this by sorting the iterator, right? The only problem this API solves is the case where a group is too big to fit in memory on the executor.

Another problem: what if we want to sort in reverse order, or by only a couple of columns?

EnricoMi · 2022-05-02

With V => S you can pick any columns of V you like:

case class Value(id, seq, timestamp, value)
val ds: Dataset[Value]
ds.groupByKey(v => v.id).flatMapSortedGroups(v => (v.seq, v.timestamp)) { (_, iter) => iter }

The sort order can be added to the function so it can easily be given by the user:
(s: V => S, direction: SortDirection = Ascending)

The Column-variant of flatMapSortedGroups below also allows for any number of columns and sort direction.
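As a rough illustration of how a key selector `V => S` combined with a direction argument can cover both a subset of columns and a reversed order, here is a plain-Scala sketch. The `Direction` type and the `sortGroup` helper are hypothetical stand-ins invented for this example; they are not Spark's `SortDirection` and not the signature proposed in the PR.

```scala
object SortDirectionSketch {
  // Hypothetical stand-in for the sort direction mentioned above.
  sealed trait Direction
  case object Ascending extends Direction
  case object Descending extends Direction

  // Sorts rows by the value that `s` extracts, in the requested direction.
  def sortGroup[V, S: Ordering](rows: Seq[V], direction: Direction = Ascending)(s: V => S): Seq[V] = {
    val ord = direction match {
      case Ascending  => Ordering[S]
      case Descending => Ordering[S].reverse
    }
    rows.sortBy(s)(ord)
  }

  def main(args: Array[String]): Unit = {
    // Rows of (id, seq, timestamp); sort by (seq, timestamp) only.
    val rows = Seq((1, 2, 10L), (1, 1, 30L), (1, 1, 20L))
    println(sortGroup(rows)(r => (r._2, r._3)))
    println(sortGroup(rows, Descending)(r => (r._2, r._3)))
  }
}
```

Returning a tuple from the selector picks any subset of columns, and reversing the derived `Ordering` flips the whole sort key at once. Mixed directions per column would need the Column-based variant or a custom `Ordering`.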

github-actions · 2022-08-12T00:19:33Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

EnricoMi changed the title ~~Add flatMapSortedGroups to KeyValueGroupedDataset~~ [SPARK-38591][SQL] Add flatMapSortedGroups to KeyValueGroupedDataset Mar 17, 2022

github-actions bot added the SQL and STRUCTURED STREAMING labels Mar 17, 2022

HyukjinKwon reviewed Apr 29, 2022


EnricoMi added 6 commits May 3, 2022 13:38

Add flatMapSortedGroups to KeyValueGroupedDataset

8931693

Add flatMapSortedGroups with sort expressions

c4c3846

Fix tests

5ccb1f8

Add docstring to column-based flatMapSortedGroups

715aa99

Bump since to 3.4.0

9feeada

Add sort direction to flatMapSortedGroups with func

3c8165d

EnricoMi force-pushed the branch-sorted-groups branch from 65392ee to 3c8165d Compare May 3, 2022 12:39

github-actions bot added the Stale label Aug 12, 2022

github-actions bot closed this Aug 13, 2022
