Include CollectiveKernel in inductor debug visualization #146561
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146561
Note: Links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit 085fdac with merge base 98bd2bd.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@wconstab has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
ghstack-source-id: 756320a54467e4e55c3e3a49cb35b661426f076d
Pull Request resolved: pytorch/pytorch#146561
Differential Revision: [D69617797](https://our.internmc.facebook.com/intern/diff/D69617797)
Possible to add a test? People often don't test debug utilities, and then they end up breaking as a result.
Starting merge as part of PR stack under #152060
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Pull Request resolved: #146561
Approved by: https://github.com/eellison
ghstack dependencies: #152060
This is a new pass to replace the pre-existing passes. It has the same basic goal, to achieve communication overlap (latency hiding), but it also constrains the solution to not increase peak memory. The principles of operation are detailed in code comments, but summarized here (a toy sketch of the greedy move follows below):

- never reorder collectives relative to each other (TBD if we should relax this later)
- before performing reordering, push all comm and wait nodes as late as possible, respecting data dependencies
- estimate peak memory and current memory at each scheduler node
- move collective nodes forward one position at a time, if the move does not increase current memory beyond peak memory

The pass logs a summary table for each graph to TORCH_LOGS=overlap, e.g. (exact format may have been tweaked, but this shows the idea):

```
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] Collective node initial exposed final exposed improvement limiting factor moves
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ----------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------- --------------- ------------- ------------------- -------
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op2') (torch.ops._c10d_functional.all_gather_into_tensor.default) (size=[2256, 256], stride=[256, 1]) (buf2) (12142 ns) 12141.6 6514.53 5627.08 prefetch limit 75
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op6') (torch.ops._c10d_functional.reduce_scatter_tensor.default) (size=[282, 256], stride=[256, 1]) (buf7) (32266 ns) 32265.8 28429.2 3836.61 data dependency 78
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op9') (torch.ops._c10d_functional.all_gather_into_tensor.default) (size=[256], stride=[1]) (buf11) (10801 ns) 10800.6 10732.3 68.254 peak memory 1
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op14') (torch.ops._c10d_functional.reduce_scatter_tensor.default) (size=[32], stride=[1]) (buf17) (10810 ns) 10809.5 10809.5 0 data dependency 4
```

Pull Request resolved: #146562
Approved by: https://github.com/eellison
ghstack dependencies: #152060, #146561
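To make the greedy move concrete, here is a minimal, self-contained sketch in plain Python. Everything in it (`Node`, `mem_delta`, `hoist_collectives`, the memory model) is a hypothetical illustration, not the actual `torch/_inductor/comms.py` implementation: the real pass operates on scheduler nodes and must also respect data dependencies, which this sketch omits.

```python
# Hypothetical toy model of the greedy reordering step described above --
# NOT the actual torch/_inductor/comms.py implementation. Data-dependency
# checks are omitted; only the peak-memory constraint and the
# "never reorder collectives relative to each other" rule are modeled.
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    mem_delta: int            # bytes allocated (+) or freed (-) when the node runs
    is_collective: bool = False


def peak_memory(schedule):
    """Peak of the running memory estimate over a schedule."""
    cur = peak = 0
    for node in schedule:
        cur += node.mem_delta
        peak = max(peak, cur)
    return peak


def hoist_collectives(schedule):
    """Swap each collective one position earlier at a time, undoing any
    swap that pushes the running memory estimate above the schedule's
    original peak."""
    budget = peak_memory(schedule)
    sched = list(schedule)
    for i in range(len(sched)):
        if not sched[i].is_collective:
            continue
        pos = i
        # Stop at another collective so collectives keep their relative order.
        while pos > 0 and not sched[pos - 1].is_collective:
            sched[pos - 1], sched[pos] = sched[pos], sched[pos - 1]
            if peak_memory(sched) > budget:
                # Undo the swap that exceeded the original peak.
                sched[pos - 1], sched[pos] = sched[pos], sched[pos - 1]
                break
            pos -= 1
    return sched


if __name__ == "__main__":
    sched = [
        Node("matmul_a", +64),
        Node("matmul_b", +64),
        Node("all_gather", +32, is_collective=True),
        Node("wait_and_free", -96),
    ]
    # The all_gather hoists past both matmuls; peak stays at 160 bytes.
    print([n.name for n in hoist_collectives(sched)])
```

In the real pass, the reason a collective stops moving shows up in the summary table above as the limiting factor ("peak memory", "data dependency", "prefetch limit"), and running with TORCH_LOGS=overlap surfaces the per-collective improvement.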
Stack from ghstack (oldest at bottom):
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov