container: protect health monitor channel #35482


Merged 1 commit into moby:master on Nov 14, 2017

Conversation

@stevvooe (Contributor)

While this code was likely called from a single thread before, we have
now seen panics indicating that it can be called in parallel. This
change adds a mutex to protect opening and closing of the channel. There
may be another root cause behind this panic, such as something that led
to this being called in parallel, since this code is old and we had not
seen this condition until recently.

This is by no means a permanent fix. Typically, bugs like this
indicate misplaced channel ownership. In idiomatic use, the channel
should have a particular "owner" that coordinates sending and closure.
In this case, the owner of the channel is unclear, so it gets opened
lazily. Synchronizing this access is a decent solution, but a refactor
may yield better results.

Signed-off-by: Stephen J Day <stephen.day@docker.com>

Terror Bird doesn't like race conditions

Terror Bird: They ranged in height from 1–3 metres (3.3–9.8 ft) tall. Their closest modern-day relatives are believed to be the 80 cm-tall seriemas. Titanis walleri, one of the larger species, is known from Texas and Florida in North America.
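
For reference, a minimal sketch of the pattern this change applies: a mutex guarding lazy creation and closure of the stop channel. The names (`Health`, `stop`, `OpenMonitorChannel`, `CloseMonitorChannel`) follow the discussion in this thread; the actual struct in moby may differ in detail.

```go
package container

import "sync"

// Health holds the channel used to signal the health-check monitor.
type Health struct {
	mu   sync.Mutex    // protects stop
	stop chan struct{} // nil when no monitor is running
}

// OpenMonitorChannel lazily creates the stop channel. Holding the mutex
// ensures two callers cannot both observe a nil channel and race to create it.
func (s *Health) OpenMonitorChannel() chan struct{} {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.stop == nil {
		s.stop = make(chan struct{})
	}
	return s.stop
}

// CloseMonitorChannel closes and clears the stop channel. Without the mutex,
// two concurrent callers could both pass the nil check and close the channel
// twice, which panics.
func (s *Health) CloseMonitorChannel() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.stop != nil {
		close(s.stop)
		s.stop = nil
	}
}
```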

@thaJeztah (Member) left a comment

LGTM, but would like @cpuguy83 to have a look as well

@tonistiigi (Member)

cc @talex5

@vieux (Contributor) left a comment

LGTM

@yongtang merged commit e4d0fe8 into moby:master on Nov 14, 2017
@andrewhsu (Member)

In the future, it would be good to have a health_test.go accompany a fix like this, to exercise what we can verify.

@thaJeztah (Member) commented Nov 14, 2017

@andrewhsu testing was already discussed, but the issue this is fixing is not possible to replicate in a testing scenario

@andrewhsu (Member)

Even if we can't replicate the scenario itself, just having the framework in place for a unit test prompts any future contributor to write an accompanying test where it's relevant.

Just saying... not all fix scenarios can be covered by a unit test, but some of the auxiliary code can be verified to not have changed behaviour in a fix attempt.

@thaJeztah (Member)

Adding empty boilerplate with the testing framework in place would make compilation fail (unused imports), and although I generally agree on having (unit) tests, we should guard against adding tests that don't test anything, as they give a false sense of coverage.

@stevvooe (Contributor, Author)

@andrewhsu If a valid test were possible here, I would have included one. Unfortunately, races like these fall into the category of a negative assertion: without the race detector, one cannot prove the absence of a race with a test that doesn't cover all possible timings. We have to rely on our mental faculties and experience to identify and address these kinds of conditions, and within that framework we can only show that we are reasonably certain.

In this case, we have a race condition where the stop field can end up with a few different values depending on how the race plays out. In all cases, the first if statement sees that stop is not nil; the problems come after this branch. One thread sees stop, closes the channel, and sets it to nil. Another thread comes along, sees a partial result of that operation, and panics either on the double close or on the nil channel, depending on when the value of stop is synced over to the other thread's core.
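
To make that failure mode concrete, this is roughly the unsynchronized shape described above (illustrative only, using the Health sketch earlier in this thread):

```go
// Pre-fix shape, without the mutex: both goroutines can pass the nil check.
func (s *Health) CloseMonitorChannel() {
	if s.stop != nil {
		// A second goroutine interleaving here either closes the channel
		// a second time ("close of closed channel") or, if it observes
		// the nil store below first, panics on closing a nil channel.
		close(s.stop)
		s.stop = nil
	}
}
```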

Indeed, we can reproduce that situation in a test case by running CloseMonitorChannel and OpenMonitorChannel alternately in separate threads. The problem is that reproducing the ordering is probabilistic, so you have to find a number of threads that reproduces the timing with high probability. In practice, this is a huge waste of resources: the test only has a chance of reproducing the condition. Even then, we'd really only be testing Go's runtime behavior and the implementation of the synchronization primitives, so the value is limited.
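
For what it's worth, a hypothetical health_test.go along those lines might look like the sketch below; it only has a probabilistic chance of triggering the panic on the old code, although `go test -race` flags the underlying data race on stop much more reliably. This assumes the Health sketch above; the real type may differ.

```go
package container

import (
	"sync"
	"testing"
)

// TestMonitorChannelConcurrentOpenClose hammers the two accessors from many
// goroutines. Run with `go test -race` to let the race detector do the work
// rather than relying on the panic actually firing.
func TestMonitorChannelConcurrentOpenClose(t *testing.T) {
	h := &Health{}

	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(2)
		go func() {
			defer wg.Done()
			h.OpenMonitorChannel()
		}()
		go func() {
			defer wg.Done()
			h.CloseMonitorChannel()
		}()
	}
	wg.Wait()
}
```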

Ideally, the functionality here would be covered by some functional test that ensures these implementation details are working; it seems the concurrency level of that testing might not be sufficient. I am not wildly familiar with the root cause, but there was some talk of a higher-level condition that could cause this. A test there, with the race detector enabled, would likely address this issue more thoroughly and expose other possible issues.
