container: protect health monitor channel #35482


Merged 1 commit into moby:master on Nov 14, 2017

Conversation

@stevvooe (Contributor)

While this code was likely called from a single thread before, we have
now seen panics indicating that it can be called in parallel. This
change adds a mutex to protect opening and closing of the channel. There
may be another root cause behind this panic, such as something that led
to this being called in parallel, since this code is old and we had not
seen this condition until recently.

This is by no means a permanent fix. Typically, bugs like this
indicate misplaced channel ownership. In idiomatic use, the channel
should have a particular "owner" that coordinates sending and closure.
In this case, the owner of the channel is unclear, so it gets opened
lazily. Synchronizing this access is a decent solution, but a refactor
may yield better results.

Signed-off-by: Stephen J Day <stephen.day@docker.com>

Terror Bird doesn't like race conditions

Terror Bird: They ranged in height from 1–3 metres (3.3–9.8 ft) tall. Their closest modern-day relatives are believed to be the 80 cm-tall seriemas. Titanis walleri, one of the larger species, is known from Texas and Florida in North America.
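
For reference, a minimal sketch of the pattern this change applies: a mutex guarding lazy creation and closure of the stop channel. The names (`Health`, `stop`, `OpenMonitorChannel`, `CloseMonitorChannel`) follow the discussion in this thread; the actual struct in moby may differ in detail.

```go
package container

import "sync"

// Health holds the channel used to signal the health-check monitor.
type Health struct {
	mu   sync.Mutex    // protects stop
	stop chan struct{} // nil when no monitor is running
}

// OpenMonitorChannel lazily creates the stop channel. Holding the mutex
// ensures two callers cannot both observe a nil channel and race to create it.
func (s *Health) OpenMonitorChannel() chan struct{} {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.stop == nil {
		s.stop = make(chan struct{})
	}
	return s.stop
}

// CloseMonitorChannel closes and clears the stop channel. Without the mutex,
// two concurrent callers could both pass the nil check and close the channel
// twice, which panics.
func (s *Health) CloseMonitorChannel() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.stop != nil {
		close(s.stop)
		s.stop = nil
	}
}
```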

@thaJeztah (Member) left a comment

LGTM, but would like @cpuguy83 to have a look as well

@tonistiigi (Member)

cc @talex5

@vieux (Contributor) left a comment

LGTM

@yongtang merged commit e4d0fe8 into moby:master on Nov 14, 2017
@andrewhsu (Member)

In the future, it would be good to have a health_test.go accompany a fix like this, to exercise what we can verify.

@thaJeztah (Member) commented Nov 14, 2017

@andrewhsu testing was already discussed, but the issue this is fixing is not possible to replicate in a testing scenario

@andrewhsu (Member)

Even if we can't replicate the scenario itself, just having the framework in place for a unit test prompts any future contributor to write an accompanying test where it's relevant.

Just saying... not all fix scenarios can be covered by a unit test, but some of the auxiliary code can be verified to not have changed behaviour in a fix attempt.

@thaJeztah (Member)

Adding empty boilerplate with the testing framework in place would make compilation fail (unused imports), and although I generally agree on having (unit) tests, we should guard against adding tests that don't test anything, as they give a false sense of coverage.

@stevvooe (Contributor, Author)

@andrewhsu If a valid test were possible here, I would have included one. Unfortunately, races like these fall into the category of a negative assertion: without the race detector, one cannot prove the absence of a race with a test that doesn't cover all possible timings. We have to rely on our mental faculties and experience to identify and address these kinds of conditions, and within that framework we can only show that we are reasonably certain.

In this case, we have a race condition where the stop field can end up with a few different values depending on how the race plays out. In all cases, the first if statement sees that stop is not nil; the problems come after this branch. One thread sees stop, closes the channel, and sets it to nil. Another thread comes along, sees a partial result of that operation, and panics either on the double close or on the nil channel, depending on when the value of stop is synced over to the other thread's core.
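
To make that failure mode concrete, this is roughly the unsynchronized shape described above (illustrative only, using the Health sketch earlier in this thread):

```go
// Pre-fix shape, without the mutex: both goroutines can pass the nil check.
func (s *Health) CloseMonitorChannel() {
	if s.stop != nil {
		// A second goroutine interleaving here either closes the channel
		// a second time ("close of closed channel") or, if it observes
		// the nil store below first, panics on closing a nil channel.
		close(s.stop)
		s.stop = nil
	}
}
```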

Indeed, we can reproduce that situation in a test case by running CloseMonitorChannel and OpenMonitorChannel alternately in separate threads. The problem is that reproducing the ordering is probabilistic, so you have to find a number of threads that reproduces the timing with high probability. In practice, this is a huge waste of resources: the test only has a chance of reproducing the condition. Even then, we'd really only be testing Go's runtime behavior and the implementation of the synchronization primitives, so the value is limited.
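
For what it's worth, a hypothetical health_test.go along those lines might look like the sketch below; it only has a probabilistic chance of triggering the panic on the old code, although `go test -race` flags the underlying data race on stop much more reliably. This assumes the Health sketch above; the real type may differ.

```go
package container

import (
	"sync"
	"testing"
)

// TestMonitorChannelConcurrentOpenClose hammers the two accessors from many
// goroutines. Run with `go test -race` to let the race detector do the work
// rather than relying on the panic actually firing.
func TestMonitorChannelConcurrentOpenClose(t *testing.T) {
	h := &Health{}

	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(2)
		go func() {
			defer wg.Done()
			h.OpenMonitorChannel()
		}()
		go func() {
			defer wg.Done()
			h.CloseMonitorChannel()
		}()
	}
	wg.Wait()
}
```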

Ideally, the functionality here would be covered by some functional test that ensures these implementation details are working; it seems the concurrency level of that testing might not be sufficient. I am not wildly familiar with the root cause, but there was some talk of a higher-level condition that could cause this. A test there, with the race detector enabled, would likely address this issue more thoroughly and expose other possible issues.
