`nan` gradient when `tf.where` is used #38349

Open
0x0badc0de opened this issue Apr 8, 2020 · 27 comments

@0x0badc0de 0x0badc0de commented Apr 8, 2020


System information

  • Have I written custom code (as opposed to using a stock
    example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g.,
    Linux Ubuntu 16.04): Debian GNU/Linux 10 (buster)
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if
    the issue happens on mobile device:
  • TensorFlow installed from (source or
    binary): binary
  • TensorFlow version (use command below): v2.1.0-rc2-17-ge5bf8de 2.1.0 / v1.12.1-29016-g38797a1c8b 2.2.0-dev20200407
  • Python version: 3.7.7
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version:
  • GPU model and memory:


Describe the current behavior
A well-defined function built with tf.where has nan gradients at points where the inactive tf.where branch is undefined.

Describe the expected behavior
The inactive branch should be ignored in gradient calculations.

Standalone code to reproduce the issue

import tensorflow as tf

for ex in range(-3, 3):
    x = tf.convert_to_tensor(10.**ex)
    with tf.GradientTape() as g:
        g.watch(x)
        y = tf.where(x >= -1., x, tf.math.log1p(-x))
#         y = tf.where(x >= -1., x, tf.math.log(1.-x))
#         y = tf.where(x >= -1., x, 1./(1.-x))
    dy_dx = g.gradient(y, x)
    print(f'y({x})={y}, dy/dx({x})={dy_dx}')

All three functions above are well defined for the positive values used in the test. Still, the gradient is nan at the point 1.0, while it should be equal to 1:

y(0.0010000000474974513)=0.0010000000474974513, dy/dx(0.0010000000474974513)=1.0
y(0.009999999776482582)=0.009999999776482582, dy/dx(0.009999999776482582)=1.0
y(0.10000000149011612)=0.10000000149011612, dy/dx(0.10000000149011612)=1.0
y(1.0)=1.0, dy/dx(1.0)=nan
y(10.0)=10.0, dy/dx(10.0)=1.0
y(100.0)=100.0, dy/dx(100.0)=1.0

Other info / logs Include any logs or source code that would be helpful to
diagnose the problem. If including tracebacks, please include the full
traceback. Large logs and files should be attached.

@ravikyram ravikyram commented Apr 8, 2020

I have tried this on Colab with TF versions 2.1.0 and 2.2.0-rc2 and was able to reproduce the issue. Please find the gist here. Thanks!

@ravikyram ravikyram assigned ymodak and unassigned ravikyram Apr 8, 2020
@mdanatg mdanatg commented Apr 8, 2020

This is due to a limitation in how gradients are calculated. Unfortunately, it is unlikely to be fixed in the foreseeable future.

You can find more detail here, along with a recipe for how to avoid it: https://stackoverflow.com/questions/33712178/tensorflow-nan-bug/42497444#42497444

In short, if either input to a tf.where contains NaNs, the gradient will always be NaN, regardless of whether that input is actually selected, and the workaround is to prevent the inputs from ever containing NaNs.
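
For reference, here is a minimal sketch of that recipe applied to the log1p example from this issue; the safe value -2. is an arbitrary point where log1p(-x) stays well defined:

import tensorflow as tf

x = tf.convert_to_tensor(1.0)
with tf.GradientTape() as g:
    g.watch(x)
    # Clamp the input of the inactive branch to a safe value before it
    # reaches log1p, so the unused branch can never produce nan/inf.
    safe_x = tf.where(x >= -1., -2., x)
    y = tf.where(x >= -1., x, tf.math.log1p(-safe_x))
dy_dx = g.gradient(y, x)
print(dy_dx)  # tf.Tensor(1.0, ...) instead of nan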

@mdanatg mdanatg closed this Apr 8, 2020

@0x0badc0de 0x0badc0de commented Apr 8, 2020

Shouldn't this be documented with a big warning in the tf.where docs, in that case?

@mdanatg mdanatg commented Apr 8, 2020

Indeed it should.

@joemaren joemaren commented Apr 9, 2020

@mdanatg Hello, this is my first time contributing to the TensorFlow lib. From the thread I gather you would like tf.where's docs to be updated. If so, can I work on this?

@ymodak ymodak removed their assignment Apr 9, 2020
@anorak-k anorak-k commented Apr 11, 2020

Hello @0x0badc0de, @mdanatg
Should the updated doc contain something like a warning, or will a small note at the end about the inputs not being NaN do? Also, should the workaround for avoiding it be added to the doc?

@mdanatg mdanatg commented Apr 11, 2020

@joemaren @anorak-k

Sorry for the delay. Feel free to send a PR - it's only a matter of adding a paragraph to the docstring.

The text should be more along the lines of a warning. Something like: Important: if any of the inputs contain NaN values, etc. And yes, it should include the workaround as well, which is something along the lines of: instead of tf.where(x, ops_that_can_nan(z), ...), write tf.where(x, ops_that_can_nan(tf.where(x, z, safe_value)), ...).
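
A runnable sketch of that pattern, using tf.sqrt as a stand-in for ops_that_can_nan (it is nan for negative inputs and has an infinite derivative at zero) and 1. as an assumed safe_value:

import tensorflow as tf

x = tf.constant([0.5, 2.0])
with tf.GradientTape() as tape:
    tape.watch(x)
    cond = x < 1.0
    z = 1.0 - x
    # Inner where: feed the risky op a safe value wherever its branch
    # is not selected, so neither its output nor its gradient can blow up.
    safe_z = tf.where(cond, z, tf.ones_like(z))
    y = tf.where(cond, tf.sqrt(safe_z), x)
print(tape.gradient(y, x))  # finite everywhere, no nan from the unused branch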

@anorak-k anorak-k commented Apr 13, 2020

@mdanatg I have added the change and raised a PR #38467

@mkaze mkaze commented Apr 18, 2020

@mdanatg Thanks for your reply. However, I would like to mention that this behavior also occurs when the value generated in the inactive branch is not finite (i.e., inf or -inf). Here is a minimal reproducible example:

import tensorflow as tf

a = tf.Variable(10.)
with tf.GradientTape() as tape:
  out = tf.where(a < 15., a, tf.math.pow(10.0, tf.math.exp(a)))
grads = tape.gradient(out, a)

print(grads)
# tf.Tensor(nan, shape=(), dtype=float32)

And if we reverse the condition so that the branch with the infinite value is selected, the gradient is infinite (which is a bit surprising, given that it does not produce nan like above):

with tf.GradientTape() as tape:
  out = tf.where(a > 15., a, tf.math.pow(10.0, tf.math.exp(a)))
grads = tape.gradient(out, a)

print(grads)
# tf.Tensor(inf, shape=(), dtype=float32)

So this behavior happens for both nan and infinite values in the inactive branch. I wish it weren't like this, because it is a bit unreasonable and makes it impossible to use user-defined ops/functions that generate extremely large values for some inputs; hence, the inner tf.where workaround may not always be practical (unfortunately, even gradient clipping does not help, because clipping a nan value produces nan in TF).

CC: @anorak-k for potential consideration in your PR after @mdanatg confirms this.

@mdanatg mdanatg commented Apr 19, 2020

@mkaze That's true: nan, inf, and any other special FP value will disrupt the gradient calculation.

What happens internally is that the gradients are aggregated in this fashion: 1 * <grad of branch taken> + 0 * <grad of branch not taken>. In your former case, you have 0 * inf = nan. In the latter case, you have 1 * inf = inf. I agree it's very confusing; unfortunately, a naive fix would add significant overhead to gradient calculations.
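
The arithmetic is easy to check in plain Python; multiplying by zero does not neutralize special FP values:

inf = float("inf")
print(0.0 * inf)  # nan -- the "branch not taken" term poisons the sum
print(1.0 * inf)  # inf -- the "branch taken" term propagates as-is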

Moreover, the forward calculation doesn't need to result in a nan or inf. You can also get weird results if the gradient alone is nan or inf. For example, the cube root function is defined and well-behaved everywhere, but its derivative at zero is infinite. So this will give you a nan gradient too:

a = tf.Variable(0.0)
with tf.GradientTape() as tape:
  out = tf.where(a < 1, a, tf.pow(a, 1.0/3.0))
grads = tape.gradient(out, a)
print(grads)

I think the tf.where workaround is useful with infinite values as well, so long as the branch not taken is forced to take a gradient that can be safely multiplied by 0. For your example, it would be something like this:

dummy_safe_value = 0.
safe_a = tf.where(a > 15., dummy_safe_value, a)
out = tf.where(a > 15., a, tf.math.pow(10.0, tf.math.exp(safe_a)))

I agree that it can sometimes be impractical, but in principle it should always be possible as long as you control the inputs to the sensitive functions: all you have to do is force finite values in all the elements that are dropped.
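
Applying the same idea to your first example (a sketch; the safe value 0. is arbitrary, any input for which the pow stays finite works):

import tensorflow as tf

a = tf.Variable(10.)
with tf.GradientTape() as tape:
    # Feed the explosive op a safe value wherever its branch is inactive.
    safe_a = tf.where(a < 15., 0., a)
    out = tf.where(a < 15., a, tf.math.pow(10.0, tf.math.exp(safe_a)))
grads = tape.gradient(out, a)
print(grads)  # tf.Tensor(1.0, ...) instead of nan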

@kari-554 kari-554 commented May 7, 2020

I want to fix the issue #38349

@tushar-dalal tushar-dalal commented May 14, 2020

This is due to a limitation in how gradients are calculated. Unfortunately, it is unlikely to be fixed in the foreseeable future.

You can find more detail here, along with a recipe for how to avoid it: https://stackoverflow.com/questions/33712178/tensorflow-nan-bug/42497444#42497444

In short, if either input to a tf.where contains NaNs, the gradient will always be NaN, regardless of whether that input is actually selected, and the workaround is to prevent the inputs from ever containing NaNs.

Couldn't it simply raise a ValueError if it's getting NaN inputs? Or does it not work like that?

@unicorn-io unicorn-io commented May 29, 2020

Can I work on this issue if no one else is on it now?

@mdanatg mdanatg commented May 29, 2020

@tushar-dalal The challenge is that checking for such NaN inputs can be taxing on performance. When debugging, tf.debugging.check_numerics can indeed help with that.
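
For example, a minimal sketch of using it while debugging (the message string is arbitrary):

import tensorflow as tf

x = tf.math.log(-1.0)  # nan
# Raises InvalidArgumentError if x contains nan or inf; a debugging
# aid rather than something to leave on hot paths.
x = tf.debugging.check_numerics(x, message="x went non-finite")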

@unicorn-io Feel free to tackle it, but note that it's extremely challenging to solve. That said, there was a PR (#38467) to add a warning message to the docs of tf.where; it would be useful to revive it.

@unicorn-io unicorn-io commented May 29, 2020

I am motivated to do this. Can you give me some tips to start with? I will try my best to understand and resolve this issue.

@unicorn-io unicorn-io commented Jun 2, 2020

I am motivated to do this. Can you give me some tips to start with? I will try my best to understand and resolve this issue. @mdanatg

@mdanatg mdanatg commented Jun 2, 2020

@unicorn-io You can start by looking at the gradient code and understanding how it works. Then you can reproduce what happens in the case of a tf.where with bad gradients.

@unicorn-io unicorn-io commented Jun 2, 2020

Cool, I'll get to it.

@madamalarevanth madamalarevanth commented Jun 8, 2020

Hey, I would like to work on it. Can I also help, please?

@AbhinavTalari AbhinavTalari commented Jun 17, 2020

Cool, I'll get to it.

This bug cannot be fixed as of now, it seems.

@mdanatg mdanatg commented Jun 17, 2020

It's indeed very challenging to fix. However, the documentation of affected ops, like tf.where, can still be updated to alert users about it.

@unicorn-io unicorn-io commented Jun 18, 2020

@mdanatg isn't #38497 addressing this and is closed?

@mdanatg mdanatg commented Jun 18, 2020

You mean #38467? It was closed due to staleness, and it would be useful to revive it. By the looks of it, it's safe to assume no one else is working on it.

@EbiereVO EbiereVO commented Jul 1, 2020

It seems like it's been a long time since the last activity. Is this issue still open to be worked on?

@mdanatg mdanatg commented Jul 1, 2020

I think so. There are two parts to it: (1) updating the docs of tf.where, which is fairly straightforward, and (2) actually trying to fix the behavior, which is a significant undertaking because it involves a rather fundamental limitation.

@iamharshit13 iamharshit13 commented Jul 8, 2020

Is this issue still addressable?
