
Fix "duplicate mount point" when --tmpfs /dev/shm is used #35467


Merged (2 commits) on Nov 22, 2017

Conversation

kolyshkin
Contributor

This is a fix for #35455, i.e.:

$ docker run --tmpfs /dev/shm busybox sh
docker: Error response from daemon: linux mounts: Duplicate mount point '/dev/shm'.

In the current code (daemon.createSpec()), the tmpfs mount from --tmpfs is
already in the list of mounts (ms) when the mount from IpcMounts() is added.
While IpcMounts() checks for existing mounts first, it does so via the
container.HasMountFor() function, which only checks container.Mounts
but not container.Tmpfs.

Ultimately, the solution is to get rid of container.Tmpfs (moving its
data into container.Mounts). The current workaround is to add a check of
container.Tmpfs to container.HasMountFor().

A unit test case is included.

Unfortunately, we can't call daemon.createSpec() from a unit test,
as the code relies heavily on various daemon structures being properly
initialized, which is hard to achieve in a test. Therefore, we minimally
mimic the code flow of daemon.createSpec() -- just enough to reproduce
the issue.
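The workaround can be sketched roughly as follows. The types here are hypothetical stand-ins, trimmed to just what the check needs; in the actual daemon the tmpfs map lives on the container's HostConfig, and the real function differs in detail:

```go
package main

import "fmt"

// Minimal stand-ins for the daemon's container types (illustrative only).
type MountPoint struct{ Destination string }

type Container struct {
	MountPoints map[string]*MountPoint // from -v / --mount
	Tmpfs       map[string]string      // from --tmpfs: destination -> options
}

// HasMountFor reports whether a mount already exists at path.
// The workaround described above: consult Tmpfs as well as MountPoints,
// so IpcMounts() does not add a second /dev/shm mount.
func (c *Container) HasMountFor(path string) bool {
	if _, ok := c.MountPoints[path]; ok {
		return true
	}
	_, ok := c.Tmpfs[path]
	return ok
}

func main() {
	c := &Container{
		MountPoints: map[string]*MountPoint{},
		Tmpfs:       map[string]string{"/dev/shm": "rw,size=1g"},
	}
	fmt.Println(c.HasMountFor("/dev/shm")) // tmpfs now counts as an existing mount
	fmt.Println(c.HasMountFor("/tmp"))
}
```

With this check in place, IpcMounts() sees the user's --tmpfs /dev/shm and skips adding its own /dev/shm mount, avoiding the duplicate.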

@tonistiigi
Member

LGTM

@kolyshkin
Contributor Author

The follow-up to this is to make the client use the mount API for --tmpfs if it's available from the engine.

@thaJeztah
Member

Trying some of the examples mentioned in #6758 (comment): the exec option is applied, but it looks like the size option is still ignored when this patch is applied:

$ docker run -it --rm alpine sh
/ # mount | grep shm
shm on /dev/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,size=65536k)

$ docker run -it --rm --tmpfs /dev/shm:rw,nosuid,nodev,exec,size=1g alpine sh
/ # mount | grep shm
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,relatime,size=65536k)


$ docker run -it --rm -v /dev/shm --tmpfs /dev/shm:rw,nosuid,nodev,exec,size=1g alpine sh
/ # mount | grep shm
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,relatime,size=65536k)

I'm testing this inside the dev-container, so the --ipc=none workaround cannot be used (due to the Docker 17.06 CLI performing validation in the client)

@kolyshkin
Contributor Author

but it looks like the size option is still ignored when this patch is applied

Honoring the size= option is the subject of #35316; if you don't have the commits from that PR applied, any tmpfs will use the default size.

I will check it, too.

@thaJeztah
Member

Oh, right; I assumed this was rebased. Let me try rebasing on master and see if it works.

The code in question looks up mounts two times: first by using
HasMountFor(), and then directly by looking in container.MountPoints.
There is no need to do it twice.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
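The simplification in that commit can be illustrated with a runnable sketch (minimal stand-in types, not the daemon code):

```go
package main

import "fmt"

// Minimal stand-ins for the daemon's types (illustrative only).
type MountPoint struct{ Name string }

type Container struct {
	MountPoints map[string]*MountPoint
}

func (c *Container) HasMountFor(path string) bool {
	_, ok := c.MountPoints[path]
	return ok
}

func main() {
	c := &Container{MountPoints: map[string]*MountPoint{"/dev/shm": {Name: "shm"}}}

	// Before: two lookups for the same mount point.
	if c.HasMountFor("/dev/shm") {
		mp := c.MountPoints["/dev/shm"] // second, redundant lookup
		fmt.Println("found:", mp.Name)
	}

	// After: one direct comma-ok lookup gives both the existence check
	// and the value.
	if mp, ok := c.MountPoints["/dev/shm"]; ok {
		fmt.Println("found:", mp.Name)
	}
}
```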
This is a fix to the following issue:

$ docker run --tmpfs /dev/shm busybox sh
docker: Error response from daemon: linux mounts: Duplicate mount point '/dev/shm'.

In the current code (daemon.createSpec()), the tmpfs mount from --tmpfs is
already in the list of mounts (`ms`) when the mount from IpcMounts() is added.
While IpcMounts() checks for existing mounts first, it does so via the
container.HasMountFor() function, which only checks container.Mounts
but not container.Tmpfs.

Ultimately, the solution is to get rid of container.Tmpfs (moving its
data into container.Mounts). The current workaround is to add a check of
container.Tmpfs to container.HasMountFor().

A unit test case is included.

Unfortunately, we can't call daemon.createSpec() from a unit test,
as the code relies heavily on various daemon structures being properly
initialized, which is hard to achieve in a test. Therefore, we minimally
mimic the code flow of daemon.createSpec() -- just enough to reproduce
the issue.

moby#35455

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@kolyshkin
Contributor Author

I have rebased it and ran some manual tests:

# docker run -it --rm alpine mount | grep shm
shm on /dev/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,size=65536k)

# docker run -it --rm --tmpfs /dev/shm:rw,nosuid,nodev,exec,size=1g alpine mount | grep shm
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,relatime,size=1048576k)

# docker run -it --rm -v /dev/shm --tmpfs /dev/shm:rw,nosuid,nodev,exec,size=1g alpine mount | grep shm
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,relatime,size=1048576k)

# docker run -it --rm -v /dev/shm --tmpfs /dev/shm:rw,nosuid,nodev,exec alpine mount | grep shm
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,relatime)

# docker run -it --rm alpine df -h /dev/shm
Filesystem                Size      Used Available Use% Mounted on
shm                      64.0M         0     64.0M   0% /dev/shm

# docker run -it --rm --tmpfs /dev/shm:rw,nosuid,nodev,exec,size=1g alpine  df -h /dev/shm
Filesystem                Size      Used Available Use% Mounted on
tmpfs                     1.0G         0      1.0G   0% /dev/shm

# docker run -it --rm -v /dev/shm --tmpfs /dev/shm:rw,nosuid,nodev,exec,size=1g alpine df -h /dev/shm
Filesystem                Size      Used Available Use% Mounted on
tmpfs                     1.0G         0      1.0G   0% /dev/shm

# docker run -it --rm -v /dev/shm --tmpfs /dev/shm:rw,nosuid,nodev,exec alpine df -h /dev/shm
Filesystem                Size      Used Available Use% Mounted on
tmpfs                     3.8G         0      3.8G   0% /dev/shm

Note the last one is interesting -- if tmpfs size is not set, the in-kernel default is used (half the RAM as per kernel Documentation/filesystems/tmpfs.txt).
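As an aside, the size=1g passed on the command line above shows up as size=1048576k in the mount output. A tiny illustrative helper (not daemon code) shows the suffix conversion at work:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sizeToKiB converts a tmpfs size value with a k/m/g suffix to KiB,
// mirroring how a size ends up reported in mount output
// (1g -> 1048576k, 64m -> 65536k). Illustrative helper only.
func sizeToKiB(s string) (int64, error) {
	mult := int64(1) // no suffix: treat the value as KiB here
	switch {
	case strings.HasSuffix(s, "g"):
		mult = 1024 * 1024
		s = strings.TrimSuffix(s, "g")
	case strings.HasSuffix(s, "m"):
		mult = 1024
		s = strings.TrimSuffix(s, "m")
	case strings.HasSuffix(s, "k"):
		s = strings.TrimSuffix(s, "k")
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil {
		return 0, err
	}
	return n * mult, nil
}

func main() {
	kib, _ := sizeToKiB("1g")
	fmt.Printf("size=%dk\n", kib) // matches the size=1048576k seen above
}
```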

@thaJeztah
Member

I have rebased it and ran some manual tests:

Thanks! That was indeed the reason my earlier testing didn't work 😊

Note the last one is interesting -- if tmpfs size is not set, the in-kernel default is used (half the RAM as per kernel Documentation/filesystems/tmpfs.txt).

Yes, I also noticed that while testing; I think we should apply the default size if no size was provided, because even the most minimal case (e.g., --tmpfs=/dev/shm:rw, where :rw is already the default) causes the default size to get lost.

wdyt?
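A minimal sketch of what "apply the default size if none was provided" could look like. This is a hypothetical helper, not what the PR implements (the discussion below settles on leaving the user's options exactly as given):

```go
package main

import (
	"fmt"
	"strings"
)

// withDefaultSize appends the daemon's default shm size to a tmpfs option
// string if the user did not specify size= themselves. Hypothetical helper
// for illustration; not moby code.
func withDefaultSize(opts, defaultSize string) string {
	for _, o := range strings.Split(opts, ",") {
		if strings.HasPrefix(o, "size=") {
			return opts // the user already chose a size; leave it alone
		}
	}
	return opts + ",size=" + defaultSize
}

func main() {
	fmt.Println(withDefaultSize("rw,nosuid,nodev,exec", "65536k"))
	fmt.Println(withDefaultSize("rw,nosuid,size=1g", "65536k"))
}
```

With such a helper, --tmpfs /dev/shm:rw would keep the familiar 64M default instead of silently falling back to the kernel's RAM/2.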

@thaJeztah
Member

Hm, there's actually some prior discussion on this for tmpfs (in general, not just /dev/shm): #23398, #22420, and #22438

So, hm, I don't know what the behaviour should be

ping @justincormack as well 😄

@kolyshkin
Contributor Author

I'm not sure what the right thing to do would be. The default size we have is for the default /dev/shm created by dockerd, while --tmpfs /dev/shm is definitely not the default one.

Anyway, by looking at git commits I see there was one that added the default size:

That PR refers to issue #23394. The issue is that, with /tmp being 64M, things get broken. Interestingly, I tried running centos, fedora, and ubuntu, and there is no mount for /tmp. I saw that the issue mentions DCOS, and indeed it does create a mount for /tmp (https://github.com/dcos/dcos-docker/blob/2533532432d84199960bc4a72b76fa78796b8d3a/Makefile#L54), albeit with its size specified (as per commit mesosphere-backup/dcos-docker@4a1058b), as they probably hit the issue.

Anyway, my understanding is that some people rely on a --tmpfs of unspecified size being big enough.

Our options here are in abundance:

  1. Leave it as is, i.e. up to the kernel, which currently makes it RAM/2;
  2. Limit it to hard-coded 64M;
  3. Limit it to user-specified shm-size (which defaults to 64M and can be overridden on a per-daemon or a per-container basis);
  4. Limit it to half the container's RAM (if --memory is specified);
  5. Any clever combination of the above.

Based on my experience with OpenVZ, having restrictive options by default (as in 2 and 3 above) is good and bad at the same time.

Being restrictive by default is good because:

  • it makes the system more secure overall (less chance of DoS initiated from a container)
  • it makes users think about simple questions like "how much RAM/diskspace/whatever do I actually need for this container?", makes them learn some resource management, and ultimately better know and control their systems.

Being restrictive by default is bad because:

  • manual configuration and better system knowledge is required (in other words, steeper learning curve);
  • oftentimes, users hit the (restrictive) limits and just give up (in other words, the software as a whole doesn't work for them).

Another source of wisdom is UNIX; unfortunately, it sends us mixed signals, from "Unix is user-friendly; it's just choosy about who its friends are" (applied to this case, that would mean "be restrictive by default and let users figure out all the knobs and dials") to "UNIX gives its users many ways to shoot themselves in the foot" (applied to this case, that would mean "don't impose any restrictions").

Another way to look at it is "what do we usually do in such cases". Historically, Docker imposes no limits by default (say, RAM and disk space are not limited by default, although perhaps it happened that way because there were no means to control such resources at the time). This is probably what users expect.

To sum it up: while I would personally prefer limiting, I think leaving it as is would be more appropriate. Option 4, limiting it to half the container's RAM (if configured), is also sensible, but I guess that is better done in the kernel, which means it's the same as option 1.

@thaJeztah
Member

Thanks for summarising the options, @kolyshkin!

I was briefly discussing this with @justincormack yesterday, and although we set defaults for /dev/shm, we do not set any defaults for custom tmpfs (--tmpfs) mounts;

$ docker run --rm --tmpfs /foo busybox sh -c 'mount | grep shm'
shm on /dev/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,size=65536k)
$ docker run --rm --tmpfs /foo busybox sh -c 'mount | grep foo'
tmpfs on /foo type tmpfs (rw,nosuid,nodev,noexec,relatime)

So, even though --tmpfs /dev/shm happens to overwrite the default /dev/shm that was set up in the container, from Moby's perspective it's just another tmpfs, and it's the user's responsibility to set the desired options (i.e., we should not be making assumptions).

Based on the above, I consider the current implementation to be correct.

What we do have to do is:

  • Update the documentation;
  • Related, but important:
    • Describe how --shm-size, --tmpfs, --mount, --volume, and --device interact (in what order are they applied / which one takes precedence)
    • Consider combining above options with conflicting paths to be invalid (may be complicated to check)
    • Consider marking (some of) the above options to be deprecated
    • Decide if we really want to add --shm-size to services, or use --mount as the real solution (Allow setting of shm-size for docker service create #26714)

@thaJeztah
Member

LGTM

5 participants