-
Notifications
You must be signed in to change notification settings - Fork 28.5k
[SPARK-31798][SHUFFLE][API] Shuffle Writer API changes to return custom map output metadata #28616
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Test build #123020 has finished for PR 28616 at commit
|
@@ -63,7 +64,7 @@ | |||
* The returned array should contain, for each partition from (0) to (numPartitions - 1), the | |||
* number of bytes written by the partition writer for that partition id. | |||
*/ | |||
long[] commitAllPartitions() throws IOException; | |||
MapOutputCommitMessage commitAllPartitions() throws IOException; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need rewrite above comment?
private final long[] partitionLengths; | ||
private final Optional<MapOutputMetadata> mapOutputMetadata; | ||
|
||
MapOutputCommitMessage( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: we can make this constructor private
if we would like to propagate the usage of the static factory method MapOutputCommitMessage.of
.
Test build #123503 has finished for PR 28616 at commit
|
retest this please |
@holdenk @squito can you take a look? Starting off the SPARK-25299 features. |
Test build #123661 has finished for PR 28616 at commit
|
Test build #124014 has finished for PR 28616 at commit
|
Retest this please. |
Test build #124308 has finished for PR 28616 at commit
|
Retest this please |
Test build #124341 has finished for PR 28616 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. Thank you, @mccheah .
Merged to master for Apache Spark 3.1. (December 2020).
LGTM too |
* All implementations must be serializable since this is sent from the executors to | ||
* the driver. | ||
*/ | ||
public interface MapOutputMetadata extends Serializable {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry for commenting on closed PR, looking at this to review newer pro - https://github.com/apache/spark/pull/28618/files - these should probably be annotated with @SInCE
Also should these be @evolving or DeveloperApi vs Private? this by itself doesn't do any good and the intention is for people to be able to implement it right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I roughly remember I asked the same thing to @squito before. The reason was that it's not stable yet (?) and presumably wants to test it internally before making an API .. I guess.
Introduces the concept of a
MapOutputMetadata
opaque object that can be returned from map output writers.Note that this PR only proposes the API changes on the shuffle writer side. Following patches will be proposed for actually accepting the metadata on the driver and persisting it in the driver's shuffle metadata storage plugin.
Why are the changes needed?
For a more complete design discussion on this subject as a whole, refer to this design document.
Does this PR introduce any user-facing change?
Enables additional APIs for the shuffle storage plugin tree. Usage will become more apparent as the API evolves.
How was this patch tested?
No tests here, since this is only an API-side change that is not consumed by core Spark itself.