
Include CollectiveKernel in inductor debug visualization #146561


Closed · wants to merge 9 commits

Conversation


pytorch-bot bot commented Feb 6, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146561

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 085fdac with merge base 98bd2bd:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@wconstab wconstab added the topic: not user facing label Feb 13, 2025
@wconstab (Contributor Author)

@wconstab has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@pytorch-bot pytorch-bot bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label Feb 13, 2025
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 amjames chauhang aakhundov

Differential Revision: [D69617797](https://our.internmc.facebook.com/intern/diff/D69617797)

[ghstack-poisoned]
@wconstab (Contributor Author) commented Mar 6, 2025

@wconstab has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Divigroup-RAP pushed a commit to Divigroup-RAP/PYTORCH that referenced this pull request Apr 22, 2025
ghstack-source-id: 756320a54467e4e55c3e3a49cb35b661426f076d
Pull Request resolved: pytorch/pytorch#146561
@wconstab wconstab changed the title from "Improve comms debug visualization" to "Include CollectiveKernel in inductor debug visualization" Apr 23, 2025
@wconstab wconstab requested a review from eellison April 23, 2025 21:23
@eellison (Contributor) left a comment


Possible to add a test? People often don't test debug utilities, and then they end up breaking as a result.

@pytorchmergebot (Collaborator)

Starting merge as part of PR stack under #152060


@wconstab (Contributor Author)

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Apr 29, 2025
This is a new pass to replace the pre-existing passes. It has the same
basic goal, achieving communication overlap (latency hiding), but it also
constrains the solution to not increase peak memory.

The principles of operation are detailed in code comments, but
summarized here (a minimal sketch of the core move follows the list):
- never reorder collectives relative to each other (TBD whether we should
  relax this later)
- before performing reordering, push all comm and wait nodes as late as possible, respecting data dependencies
- estimate peak memory and current memory at each scheduler node
- move collective nodes forward one position at a time, if the move does
  not increase current memory beyond peak memory
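Here is that minimal sketch of the last step. `Node`, `mem_delta`, `deps`, and `peak_memory` are toy stand-ins for Inductor's scheduler structures, not the actual pytorch/pytorch implementation; the point is only to illustrate the greedy, memory-budgeted move:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    is_collective: bool = False
    mem_delta: int = 0                      # bytes allocated (+) / freed (-) here (hypothetical)
    deps: set = field(default_factory=set)  # names of nodes that must run first (hypothetical)

def peak_memory(order):
    # Estimate peak memory of a schedule by accumulating per-node deltas.
    cur = peak = 0
    for n in order:
        cur += n.mem_delta
        peak = max(peak, cur)
    return peak

def move_collectives_forward(order):
    # Move each collective one slot earlier at a time, stopping at a data
    # dependency, at another collective (so their relative order is
    # preserved), or when the move would push memory beyond the original peak.
    budget = peak_memory(order)
    for idx in [i for i, n in enumerate(order) if n.is_collective]:
        i = idx
        while i > 0:
            prev = order[i - 1]
            if prev.is_collective or prev.name in order[i].deps:
                break
            order[i - 1], order[i] = order[i], order[i - 1]
            if peak_memory(order) > budget:  # undo the swap and stop
                order[i - 1], order[i] = order[i], order[i - 1]
                break
            i -= 1
    return order
```

The real pass tracks current memory incrementally at each scheduler node rather than re-scanning the whole schedule after every swap; the O(n) recompute here just keeps the sketch short.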

The pass logs a summary table for each graph to TORCH_LOGS=overlap.

For example (the exact format may have been tweaked since, but this shows the idea):

```
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] Collective node                                                                                                                                                initial exposed    final exposed    improvement  limiting factor        moves
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] -----------------------------------------------------------------------------------------------------------------------------------------------------------  -----------------  ---------------  -------------  -------------------  -------
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op2')  (torch.ops._c10d_functional.all_gather_into_tensor.default) (size=[2256, 256], stride=[256, 1]) (buf2) (12142 ns)               12141.6          6514.53       5627.08   prefetch limit            75
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op6')  (torch.ops._c10d_functional.reduce_scatter_tensor.default) (size=[282, 256], stride=[256, 1]) (buf7) (32266 ns)                 32265.8         28429.2        3836.61   data dependency           78
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op9')  (torch.ops._c10d_functional.all_gather_into_tensor.default) (size=[256], stride=[1]) (buf11) (10801 ns)                         10800.6         10732.3          68.254  peak memory                1
[rank0]:[rank0]:I0210 17:24:28.494000 2711253 torch/_inductor/comms.py:195] [0/0] [__overlap] ExternKernelSchedulerNode(name='op14')  (torch.ops._c10d_functional.reduce_scatter_tensor.default) (size=[32], stride=[1]) (buf17) (10810 ns)                          10809.5         10809.5           0      data dependency            4
```
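To reproduce a table like the one above, enable the `overlap` log that the commit message names; setting `TORCH_LOGS` before torch starts is the route it documents. A minimal sketch, assuming it is set from Python rather than the shell:

```python
import os

# TORCH_LOGS is read when torch initializes its logging, so set it before
# importing torch (equivalently: export TORCH_LOGS=overlap in the shell).
os.environ["TORCH_LOGS"] = "overlap"

import torch  # noqa: E402
```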

Pull Request resolved: #146562
Approved by: https://github.com/eellison
ghstack dependencies: #152060, #146561