Skip to content

[SPARK-31798][SHUFFLE][API] Shuffle Writer API changes to return custom map output metadata #28616

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from

Conversation

mccheah
Copy link
Contributor

@mccheah mccheah commented May 22, 2020

Introduces the concept of a MapOutputMetadata opaque object that can be returned from map output writers.

Note that this PR only proposes the API changes on the shuffle writer side. Following patches will be proposed for actually accepting the metadata on the driver and persisting it in the driver's shuffle metadata storage plugin.

Why are the changes needed?

For a more complete design discussion on this subject as a whole, refer to this design document.

Does this PR introduce any user-facing change?

Enables additional APIs for the shuffle storage plugin tree. Usage will become more apparent as the API evolves.

How was this patch tested?

No tests here, since this is only an API-side change that is not consumed by core Spark itself.

@mccheah mccheah changed the title [SPARK-31798] Shuffle Writer API changes to return custom map output metadata [SPARK-31798][SHUFFLE][API] Shuffle Writer API changes to return custom map output metadata May 22, 2020
@SparkQA
Copy link

SparkQA commented May 22, 2020

Test build #123020 has finished for PR 28616 at commit 6ecd3ad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -63,7 +64,7 @@
* The returned array should contain, for each partition from (0) to (numPartitions - 1), the
* number of bytes written by the partition writer for that partition id.
*/
long[] commitAllPartitions() throws IOException;
MapOutputCommitMessage commitAllPartitions() throws IOException;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need rewrite above comment?

private final long[] partitionLengths;
private final Optional<MapOutputMetadata> mapOutputMetadata;

MapOutputCommitMessage(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: we can make this constructor private if we would like to propagate the usage of the static factory method MapOutputCommitMessage.of.

@SparkQA
Copy link

SparkQA commented Jun 3, 2020

Test build #123503 has finished for PR 28616 at commit 4fd056d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mccheah
Copy link
Contributor Author

mccheah commented Jun 9, 2020

retest this please

@mccheah
Copy link
Contributor Author

mccheah commented Jun 9, 2020

@holdenk @squito can you take a look? Starting off the SPARK-25299 features.

@SparkQA
Copy link

SparkQA commented Jun 9, 2020

Test build #123661 has finished for PR 28616 at commit 4fd056d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 15, 2020

Test build #124014 has finished for PR 28616 at commit 4fd056d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

Retest this please.

@SparkQA
Copy link

SparkQA commented Jun 20, 2020

Test build #124308 has finished for PR 28616 at commit 4fd056d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

Retest this please

@SparkQA
Copy link

SparkQA commented Jun 22, 2020

Test build #124341 has finished for PR 28616 at commit 4fd056d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @mccheah .
Merged to master for Apache Spark 3.1. (December 2020).

@HyukjinKwon
Copy link
Member

LGTM too

* All implementations must be serializable since this is sent from the executors to
* the driver.
*/
public interface MapOutputMetadata extends Serializable {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for commenting on closed PR, looking at this to review newer pro - https://github.com/apache/spark/pull/28618/files - these should probably be annotated with @SInCE

Also should these be @evolving or DeveloperApi vs Private? this by itself doesn't do any good and the intention is for people to be able to implement it right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I roughly remember I asked the same thing to @squito before. The reason was that it's not stable yet (?) and presumably wants to test it internally before making an API .. I guess.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants