While working on a feature for snapd, we had a need to perform a
"secure bind mount". In this context, "secure" meant:
- The source and/or target of the mount is owned by a less privileged
- User processes will continue to run while we're performing the
mount (so solutions that involve suspending all user processes are
- While we can't prevent the user from moving the mount point, they
should not be able to trick us into mounting to locations they
don't control (e.g. by replacing the path with a symbolic link).
The main problem is that the mount system call uses string path names to
identify the mount source and target. While we can perform checks on the
paths before the mounts, we have no way to guarantee that the paths
don't point to another location when we move on to the
call: a classic time of check to time of
One suggestion was to modify the kernel to add a
MS_NOFOLLOW flag to
prevent symbolic link attacks. This turns out to be harder than it would
appear, since the kernel is documented as ignoring any flags other than
MS_REC when performing a bind mount. So even if a
patched kernel also recognised the
MS_NOFOLLOW, there would be no way
to distinguish its behaviour from an unpatched kernel. Fixing this
properly would probably require a new system call, which is a rabbit
hole I don't want to dive down.
So what can we do using the tools the kernel gives us? The common way to
reuse a reference to a file between system calls is the file descriptor.
We can securely open a file descriptor for a path using the following
- Break the path into segments, and check that none are empty,
- Open the root directory with
- Open the first segment with
openat(parent_fd, "segment", O_PATH|O_NOFOLLOW|O_DIRECTORY).
- Repeat for each of the remaining file descriptors, closing parent
descriptors as needed.
Now we just need to find a way to use these file descriptors with the
mount system call. I came up with two strategies to achieve this.
Use the current working directory
The first idea I tried was to make use of the fact that the mount system
call accepts relative paths. We can use the
fchdir system call to
change to a directory identified by a file descriptor, and then refer to
".". Putting those together, we can perform a secure bind mount
as a multi step process:
fchdir to the mount source directory.
- Perform a bind mount from
"." to a private stash directory.
fchdir to the mount target directory.
- Perform a bind mount from the private stash directory to
- Unmount the private stash directory.
While this works, it has a few downsides. It requires a third
intermediate location to stash the mount. It could interfere with
anything else that relies on the working directory. It also only works
for directory bind mounts, since you can't
fchdir to a regular file.
Faced with these downsides, I started thinking about whether there was
any simpler options available.
/proc symbolic links
For every open file descriptor in a process, there is a corresponding
/proc/self/fd/. These files appear to be symbolic links that
point at the file associated with the descriptor. So what if we pass
/proc/self/fd/NNN paths to the mount system call?
The obvious question is that if these paths are symbolic links, is this
any different than passing the real paths directly? It turns out that
there is a difference, because the kernel does not resolve these
symbolic links in the standard fashion. Rather than recursively
resolving link targets, the kernel short circuits the process and uses
the path structure associated with the file descriptor. So it will
follow file moves and can even refer to paths in other mount namespaces.
Furthermore the kernel keeps track of deleted paths, so we will get a
clean error if the user deletes a directory after we've opened it.
So in pseudo-code, the secure mount operation becomes:
source_fd = secure_open("/path/to/source", O_PATH);
target_fd = secure_open("/path/to/target", O_PATH);
NULL, MS_BIND, NULL);
This resolves all the problems I had with the first solution: it
doesn't alter the working directory, it can do file bind mounts, and
doesn't require a protected third location.
The only downside I've encountered is that if I wanted to flip the bind
mount to read-only, I couldn't use
refer to the mount point. My best guess is that it continues to refer to
the directory shadowed by the mount point, rather than the mount point
itself. So it is necessary to re-open the mount point, which is another
opportunity for shenanigans. One possibility would be to read
/proc/self/fdinfo/NNN to determine the mount ID associated with the
file descriptor and then double check it against