Change persistent reduction threshold to 32 #147899


Open · wants to merge 1 commit into main

Conversation

@PaulZhang12 (Contributor) commented Feb 25, 2025

Summary:

Increasing the persistent reduction threshold used by Inductor's multi-kernel feature from 16 to 32 can lead to significant performance gains. This change is safe because TORCHINDUCTOR_MULTI_KERNEL is disabled by default.
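
For anyone reproducing this, here is a minimal sketch of opting in to multi-kernel. The TORCHINDUCTOR_MULTI_KERNEL env var is named in the summary above; the in-process `triton.multi_kernel` config knob is my assumption about how that env var is surfaced, so verify it against your torch version.

# Sketch, not part of this PR: enable Inductor multi-kernel before compiling.
import os
os.environ["TORCHINDUCTOR_MULTI_KERNEL"] = "1"  # read when the inductor config is imported

import torch
from torch._inductor import config as inductor_config

# Assumed in-process equivalent of the env var above.
inductor_config.triton.multi_kernel = 1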

Example benchmark:

import math

import torch
import torch.nn.functional as F
from triton.testing import do_bench

# NB: the threshold change only matters when multi-kernel is enabled
# (TORCHINDUCTOR_MULTI_KERNEL=1; it is disabled by default, see summary above).

def position_bias_softmax(scores, weight):
    # T5-style relative position bias: bucket pairwise distances, embed the
    # bucket ids, add the bias to the attention scores, then softmax.
    scores = scores.to(torch.float32)
    context_position = torch.arange(2048, dtype=torch.long, device="cuda")[:, None]
    memory_position = torch.arange(2048, dtype=torch.long, device="cuda")[None, :]
    relative_position = memory_position - context_position  # shape (query_length, key_length)
    num_buckets = 32
    max_distance = 128
    # Keep only non-negative distances (unidirectional attention).
    relative_position = -torch.min(relative_position, torch.zeros_like(relative_position))
    # Half of the buckets cover exact small distances; the rest are log-spaced
    # up to max_distance and clamped to the last bucket.
    max_exact = num_buckets // 2
    is_small = relative_position < max_exact
    relative_position_if_large = max_exact + (
        torch.log(relative_position.float() / max_exact)
        / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    ).to(torch.long)
    relative_position_if_large = torch.min(
        relative_position_if_large, torch.full_like(relative_position_if_large, num_buckets - 1)
    )
    relative_buckets = torch.where(is_small, relative_position, relative_position_if_large)
    values = F.embedding(relative_buckets, weight)   # (2048, 2048, 1)
    values = values.permute([2, 0, 1]).unsqueeze(0)  # (1, 1, 2048, 2048)
    scores = scores + values

    return F.softmax(scores, dim=-1).to(torch.float16)


scores = torch.randn(8, 2048, 2048, device="cuda", dtype=torch.float16)
weight = torch.randn(32, 1, device="cuda")
position_bias_softmax(scores, weight)  # eager reference run
compiled = torch.compile(position_bias_softmax)

compiled(scores, weight=weight)  # warm up so compilation stays outside the timing
# Bytes moved: one read and one write of `scores` (fp16 = 2 bytes/element).
gb = 2 * scores.element_size() * scores.numel() / 1e9
sec = do_bench(lambda: compiled(scores, weight=weight)) / 1e3  # do_bench returns ms
print(f"weighted bias gb/s: {gb/sec}")

With this change: 987.08 GB/s
Baseline: 693.34 GB/s (roughly a 1.42x speedup)
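
For reference, the denominator in the GB/s figure works out as follows (constants taken directly from the script above):

# Read + write of `scores` once each; fp16 is 2 bytes per element.
elements = 8 * 2048 * 2048       # 33,554,432 elements
bytes_moved = 2 * 2 * elements   # 134,217,728 bytes
print(bytes_moved / 1e9)         # ~0.134 GB moved per call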

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @shunting314 @eellison

pytorch-bot bot commented Feb 25, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/147899

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 1 New Failure

As of commit 93e71d1 with merge base 1e894d2:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

linux-foundation-easycla bot commented Feb 25, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: PaulZhang12 / name: Paul Zhang (93e71d1)

@shunting314 (Contributor) commented:

More context from this meta internal post: https://fb.workplace.com/groups/1075192433118967/posts/1612836222687916/

@eellison (Contributor) left a comment:

This works, but ideally it would work out of the box, following analysis similar to what we did in #141916, where persistent reductions resulted in less memory traffic.

We should be able to use the memory analysis that @jansel did in #142026.

cc @FindHao

@jansel (Contributor) left a comment:

I agree we should use SIMDKernelFeatures to write a better heuristic here. Though this change seems fine in the shorter term.
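
For readers outside the Inductor codebase, here is a hypothetical sketch of the kind of heuristic being discussed. All names and signatures below are invented for illustration; the real SIMDKernelFeatures API may differ.

# Hypothetical illustration only -- not Inductor's actual code or API.
# Idea: choose persistent vs. looped reduction from estimated memory
# traffic instead of a fixed element-count threshold.
def choose_persistent_reduction(reduction_numel: int,
                                est_bytes_persistent: int,
                                est_bytes_looped: int,
                                threshold: int = 32) -> bool:
    """Return True to emit the persistent-reduction variant."""
    if est_bytes_persistent != est_bytes_looped:
        # Prefer whichever variant is predicted to move fewer bytes.
        return est_bytes_persistent < est_bytes_looped
    # Tie-break with a size threshold like the one this PR raises from 16 to 32.
    return reduction_numel <= threshold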

@PaulZhang12 (Contributor, Author) commented:

/easycla

1 similar comment
@PaulZhang12 (Contributor, Author) commented:

/easycla

github-actions bot commented:

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions bot added the Stale label on Apr 28, 2025