[TensorExpr] Cache use of fallback in kernel invocation #47812
Conversation
💊 CI failures summary and remediations. As of commit f838a25 (more details on the Dr. CI page):
🚧 2 ongoing upstream failures: These were probably caused by upstream breakages that are not fixed yet.
🚧 1 fixed upstream failure: These were probably caused by upstream breakages that were already fixed.
Please rebase on the
    runKernel(stack);
  } catch (...) {
    fallback_ = true;
} else if (!use_fallback_ && allow_fallback_) {
nitpick: I wonder if this structure might be a bit easier to read:
void TensorExprKernel::run(Stack& stack) {
  if (!use_fallback_) {
    try {
      runKernel(stack);
      return;
    } catch (const std::exception&) {
      if (!allow_fallback_) {
        throw; // re-throw the original exception
      }
      // fall through to `fallback`
    }
  }
  fallback(stack);
}
I was trying to optimize for the common case. I guess try/catch doesn't have a performance penalty if exceptions aren't thrown, though (?)
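For what it's worth, here is a minimal standalone sketch (toy code, not from this PR; `runOnce` and the iteration count are made up) of how one could sanity-check that the non-throwing path of a try/catch costs essentially nothing compared to the same loop without it:

```cpp
#include <chrono>
#include <iostream>

// Toy stand-in for a kernel invocation; never throws in this sketch.
int runOnce(int x) {
  return x + 1;
}

int main() {
  constexpr int kIters = 100000000;
  volatile int sink = 0;

  // Plain loop, no exception handling.
  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < kIters; ++i) {
    sink = runOnce(i);
  }
  auto t1 = std::chrono::steady_clock::now();

  // Same loop wrapped in try/catch; the handler is never entered here.
  for (int i = 0; i < kIters; ++i) {
    try {
      sink = runOnce(i);
    } catch (...) {
      sink = -1; // only reached if runOnce actually threw
    }
  }
  auto t2 = std::chrono::steady_clock::now();

  std::cout << "no try/catch:   "
            << std::chrono::duration<double>(t1 - t0).count() << " s\n"
            << "with try/catch: "
            << std::chrono::duration<double>(t2 - t1).count() << " s\n";
  return 0;
}
```

On the common "zero-cost" exception ABIs (e.g. the Itanium C++ ABI used on Linux x86-64), the non-throwing path pays essentially nothing at runtime; the cost is extra unwind tables and a slow path taken only when an exception actually propagates, so the two loops above should time about the same.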
    compile();
    return;
  }

  use_fallback_ = fallbackEnforced();
nitpick: maybe renaming `use_fallback_` to `failed_comp_or_required_fallback_` makes it a bit more obvious in which cases we are using the fallback?
Is fallback actually useful anymore? Since we default it to off, should we consider killing it?
I think I use the fallback for Block Codegen. I can look into removing the dependency on fallback.
This pull request has been merged in 0e666a9.
Stack from ghstack:
Summary:
Previously we were checking the environment every kernel invocation for `tensorExprFuserEnabled`, which checks the environment for `PYTORCH_TENSOREXPR`. This is only a dev-exposed API, so I think it is fine to only check once when the kernel is initialized. The `disable_optimization` flag, which is user-exposed, more or less covers the same functionality.

For fun, some benchmarking. I compared scripted before and after of `def foo(x, y): return x + y` for x, y = torch.tensor([1]). I also removed the prim::TypeCheck node to better isolate the kernel (I cheated). Here is the gist: https://gist.github.com/eellison/39f3bc368f5bd1f25ded4827feecd15e
Without Changes Run 1:
no fusion: sum 6.416894399004377 min: 0.6101883250012179 median 0.6412974080012646
with fusion: sum 6.437897570998757 min: 0.6350401220006461 median 0.6446951820034883
Without Changes Run2:
no fusion: sum 6.601341788002173 min: 0.6292048720024468 median 0.6642187059987918
with fusion: sum 6.734651455997664 min: 0.6365462899993872 median 0.6755226659988693
With Changes Run1:
no fusion: sum 6.097717430002376 min: 0.5977709550024883 median 0.613631643998815
with fusion: sum 6.1299369639964425 min: 0.5857932209983119 median 0.6159247440009494
With Changes Run2:
no fusion: sum 6.5672018059995025 min: 0.6245676209982776 median 0.6386050750006689
with fusion: sum 6.489086147994385 min: 0.6236886289989343 median 0.6535737619997235
Test Plan:
Reviewers:
Subscribers:
Tasks:
Tags:
Differential Revision: D25286210
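For illustration only, a minimal sketch of the caching pattern the summary describes: read the environment once when the kernel is constructed and reuse the cached flag on every invocation. The class and helper names below (`ToyTEKernel`, `envFlagEnabled`) are hypothetical stand-ins, not the actual PyTorch code, and the flag parsing is simplified.

```cpp
#include <cstdlib>

// Hypothetical helper standing in for tensorExprFuserEnabled(); the real
// flag's parsing rules may differ. Here "1" simply means enabled.
static bool envFlagEnabled(const char* name) {
  const char* val = std::getenv(name);
  return val != nullptr && val[0] == '1';
}

class ToyTEKernel {
 public:
  // The environment is consulted exactly once, at construction time.
  ToyTEKernel() : use_fallback_(!envFlagEnabled("PYTORCH_TENSOREXPR")) {}

  void run() {
    // Hot path: no getenv() per invocation, just a cached member read.
    if (use_fallback_) {
      fallback();
      return;
    }
    runKernel();
  }

 private:
  void runKernel() { /* compiled-kernel path */ }
  void fallback() { /* interpreter fallback path */ }

  bool use_fallback_;
};
```

Compared to calling `std::getenv` on every `run()`, this keeps the check off the hot path, which is the kind of small per-invocation saving reflected in the numbers above.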