Include CollectiveKernel in inductor debug visualization #146561
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146561
Note: Links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit 085fdac with merge base 98bd2bd.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@wconstab has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
ghstack-source-id: 756320a54467e4e55c3e3a49cb35b661426f076d
Pull Request resolved: pytorch/pytorch#146561
Differential Revision: [D69617797](https://our.internmc.facebook.com/intern/diff/D69617797)
Possible to add a test? People often don't test debug utilities, and then they end up breaking as a result.
Starting merge as part of PR stack under #152060
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Pull Request resolved: #146561
Approved by: https://github.com/eellison
ghstack dependencies: #152060
This is a new pass to replace the pre-existing passes. It has the same basic goal, to achieve communication overlap (latency hiding), but it also constrains the solution to not increase peak memory. The principles of operation are detailed in code comments, but summarized here (a toy sketch of the greedy move follows below):

- never reorder collectives relative to each other (TBD if we should relax this later)
- before performing reordering, push all comm and wait nodes as late as possible, respecting data dependencies
- estimate peak memory and current memory at each scheduler node
- move collective nodes forward one position at a time, if the move does not increase current memory beyond peak memory

The pass logs a summary table for each graph to TORCH_LOGS=overlap, e.g. (exact format may have been tweaked, but this shows the idea):

```
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] Collective node initial exposed final exposed improvement limiting factor moves
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ----------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------- --------------- ------------- ------------------- -------
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op2') (torch.ops._c10d_functional.all_gather_into_tensor.default) (size=[2256, 256], stride=[256, 1]) (buf2) (12142 ns) 12141.6 6514.53 5627.08 prefetch limit 75
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op6') (torch.ops._c10d_functional.reduce_scatter_tensor.default) (size=[282, 256], stride=[256, 1]) (buf7) (32266 ns) 32265.8 28429.2 3836.61 data dependency 78
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op9') (torch.ops._c10d_functional.all_gather_into_tensor.default) (size=[256], stride=[1]) (buf11) (10801 ns) 10800.6 10732.3 68.254 peak memory 1
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op14') (torch.ops._c10d_functional.reduce_scatter_tensor.default) (size=[32], stride=[1]) (buf17) (10810 ns) 10809.5 10809.5 0 data dependency 4
```

Pull Request resolved: #146562
Approved by: https://github.com/eellison
ghstack dependencies: #152060, #146561
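To make the greedy move concrete, here is a minimal, self-contained sketch in plain Python. Everything in it (`Node`, `mem_delta`, `hoist_collectives`, the memory model) is a hypothetical illustration, not the actual `torch/_inductor/comms.py` implementation: the real pass operates on scheduler nodes and must also respect data dependencies, which this sketch omits.

```python
# Hypothetical toy model of the greedy reordering step described above --
# NOT the actual torch/_inductor/comms.py implementation. Data-dependency
# checks are omitted; only the peak-memory constraint and the
# "never reorder collectives relative to each other" rule are modeled.
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    mem_delta: int            # bytes allocated (+) or freed (-) when the node runs
    is_collective: bool = False


def peak_memory(schedule):
    """Peak of the running memory estimate over a schedule."""
    cur = peak = 0
    for node in schedule:
        cur += node.mem_delta
        peak = max(peak, cur)
    return peak


def hoist_collectives(schedule):
    """Swap each collective one position earlier at a time, undoing any
    swap that pushes the running memory estimate above the schedule's
    original peak."""
    budget = peak_memory(schedule)
    sched = list(schedule)
    for i in range(len(sched)):
        if not sched[i].is_collective:
            continue
        pos = i
        # Stop at another collective so collectives keep their relative order.
        while pos > 0 and not sched[pos - 1].is_collective:
            sched[pos - 1], sched[pos] = sched[pos], sched[pos - 1]
            if peak_memory(sched) > budget:
                # Undo the swap that exceeded the original peak.
                sched[pos - 1], sched[pos] = sched[pos], sched[pos - 1]
                break
            pos -= 1
    return sched


if __name__ == "__main__":
    sched = [
        Node("matmul_a", +64),
        Node("matmul_b", +64),
        Node("all_gather", +32, is_collective=True),
        Node("wait_and_free", -96),
    ]
    # The all_gather hoists past both matmuls; peak stays at 160 bytes.
    print([n.name for n in hoist_collectives(sched)])
```

In the real pass, the reason a collective stops moving shows up in the summary table above as the limiting factor ("peak memory", "data dependency", "prefetch limit"), and running with TORCH_LOGS=overlap surfaces the per-collective improvement.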
Stack from ghstack (oldest at bottom):
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov