
Interaction between User namespaces and Capabilities

User namespaces and capabilities are important kernel features for making containers secure. They allow containers to be better isolated and limit the privileges a container can have. A while back a user reported a bug describing odd behavior when namespaces are shared between containers, which could lead to security problems. Let's take a closer look at what can go wrong if you are not aware of this behavior.

First, let's build a container image with nftables installed:

$ cat Containerfile 
FROM fedora
RUN dnf install -y nftables
$ podman build -t testimg .

Now we have nft installed in the image and can use it. nft manages the firewall and as such needs the CAP_NET_ADMIN capability to function; without it, nft fails with an error. Let's run it once without and once with that capability:

$ podman run --rm testimg nft list ruleset
Operation not permitted (you must be root)
netlink: Error: cache initialization failed: Operation not permitted
$ podman run --rm --cap-add CAP_NET_ADMIN testimg nft list ruleset
<Results omitted for brevity>

Only the second command works because CAP_NET_ADMIN is not given to a container by default. Note that the command does not return any output because there are no rules.
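
To see which capabilities a process actually holds, one option is to read the CapEff line from /proc/self/status and test the bit for CAP_NET_ADMIN, which is capability number 12 according to capabilities(7). The following is a minimal sketch and not something Podman provides; the helper name has_cap is made up for this example.

```shell
#!/bin/sh
# CAP_NET_ADMIN is capability number 12 (see capabilities(7)).
CAP_NET_ADMIN=12

# has_cap MASK BIT (hypothetical helper): succeed if bit BIT is set in the
# hexadecimal mask MASK, e.g. the CapEff value from /proc/<pid>/status.
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Demo on the current process, only where /proc is available.
if [ -r /proc/self/status ]; then
    mask=$(awk '/^CapEff:/ {print $2}' /proc/self/status)
    if [ -n "$mask" ]; then
        if has_cap "$mask" "$CAP_NET_ADMIN"; then
            echo "CAP_NET_ADMIN is in the effective set ($mask)"
        else
            echo "CAP_NET_ADMIN is not in the effective set ($mask)"
        fi
    fi
fi
```

For the image above, the mask could be obtained with something like podman run --rm testimg cat /proc/self/status and then passed to has_cap on the host. Keep in mind that this only shows the capabilities the process holds in its own user namespace, which matters for the rest of this post.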

Now assume we run two containers and want to share the network between them. This can be done with the --network container:<name> option. In the first terminal run

$ podman run --rm --name test -it testimg 
[root@a6a4ddc4a7c8 /]# nft list ruleset
Operation not permitted (you must be root)
netlink: Error: cache initialization failed: Operation not permitted
[root@a6a4ddc4a7c8 /]#

In a second terminal run

$ podman run --rm --network container:test -it testimg
[root@6d4b196e75be /]# nft list ruleset
Operation not permitted (you must be root)
netlink: Error: cache initialization failed: Operation not permitted

As expected, neither container is allowed to modify the network namespace.

Let's run the commands again, but this time we use --userns keep-id for the first container.

# terminal 1
$ podman run --rm --name test --userns keep-id --user 0:0 -it testimg 
[root@91896ef2fac8 /]# nft list ruleset
Operation not permitted (you must be root)
netlink: Error: cache initialization failed: Operation not permitted
[root@91896ef2fac8 /]#

# terminal 2
$ podman run --rm --network container:test -it testimg
[root@59ad31cb4de1 /]# nft list ruleset
[root@59ad31cb4de1 /]#

Now, even though we did not give the second container CAP_NET_ADMIN, it can still modify the network. This is unexpected for most people and can thus create a security problem if you do something like this without being aware of it.

Now let's do it one last time, but use --userns keep-id for the second container.

# terminal 1
$ podman run --rm --name test -it testimg 
[root@8860d6004a8d /]# nft list ruleset
Operation not permitted (you must be root)
netlink: Error: cache initialization failed: Operation not permitted
[root@8860d6004a8d /]# 

# terminal 2
$ podman run --rm --network container:test --userns keep-id --user 0:0 -it testimg
[root@10cb690d5b93 /]# nft list ruleset
Operation not permitted (you must be root)
netlink: Error: cache initialization failed: Operation not permitted
[root@10cb690d5b93 /]#

Now again neither container can modify the network. You can even add --cap-add CAP_NET_ADMIN to the second container and it will still not work.

What is happening?

The behavior may seem unexpected and broken at first, but it works exactly as designed by the kernel security checks. Capabilities are always per user namespace: when a new user namespace is created, the process gets all capabilities in the new namespace but drops all of them in the parent namespace. It is also important to know that the kernel always checks permissions for a namespace based on the user namespace from which that namespace was created. If the namespace was created from a parent user namespace, the kernel will not allow you to modify it; you basically have no capabilities for it. On the other hand, for namespaces created from a child user namespace, a process in the parent will always have all capabilities, even if it dropped them in its own namespace (per user_namespaces(7), this holds when the process's effective user ID matches the owner of the child user namespace).

This is exactly what happened in the second case. The OCI runtime first creates the child user namespace and then creates a new network namespace within it. The second container is in the parent user namespace, so it always has all capabilities for the shared network namespace.

In the third case it is the other way around: the second container is part of the new user namespace and thus the child, so it can never modify the network namespace, which was created from the parent user namespace.
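
The rule can be written down as a small decision function. This is only a toy model of the check described in user_namespaces(7), not the kernel's code, and the function name and its arguments are invented for this example. It takes the relation of the caller's user namespace to the user namespace that owns the network namespace, whether the caller has the capability in its own effective set, and whether the caller's effective UID matches the owner of the child user namespace.

```shell
#!/bin/sh
# may_use_cap RELATION CAP_EFFECTIVE OWNER_MATCH (hypothetical helper)
#   RELATION       where the caller's user namespace sits relative to the one
#                  owning the target namespace: same | parent | ancestor | descendant
#   CAP_EFFECTIVE  yes/no: capability is in the caller's own effective set
#   OWNER_MATCH    yes/no: caller's effective UID owns the child user namespace
may_use_cap() {
    relation=$1 cap_effective=$2 owner_match=$3
    case $relation in
        # Same user namespace: the effective set decides.
        same)     [ "$cap_effective" = yes ] ;;
        # Direct parent: the owner has all capabilities in the child,
        # otherwise a capability held in the parent carries over.
        parent)   [ "$owner_match" = yes ] || [ "$cap_effective" = yes ] ;;
        # More distant ancestor: only a held capability carries over.
        ancestor) [ "$cap_effective" = yes ] ;;
        # Descendant or unrelated user namespace: never allowed.
        *)        return 1 ;;
    esac
}

describe() {
    if may_use_cap "$@"; then echo allowed; else echo denied; fi
}

# The second container in each of the three experiments above:
echo "case 1, same userns, no CAP_NET_ADMIN:     $(describe same no yes)"        # denied
echo "case 2, parent userns, no CAP_NET_ADMIN:   $(describe parent no yes)"      # allowed
echo "case 3, child userns, with CAP_NET_ADMIN:  $(describe descendant yes yes)" # denied
```

The model makes the asymmetry visible: in the parent case the capability set of the container is not even consulted when the owner matches, and in the child case it is never consulted at all.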

To fix this issue, it is best to also share the user namespace between the containers. As long as both containers are part of the same user namespace, you get the expected behavior where the given capabilities are respected. Let's try the second case again, but this time with --userns container:<name> for the second container as well.

# terminal 1
$ podman run --rm --name test --userns keep-id --user 0:0 -it testimg 
[root@198c6101c2cd /]# nft list ruleset
Operation not permitted (you must be root)
netlink: Error: cache initialization failed: Operation not permitted
[root@198c6101c2cd /]#

# terminal 2
$ podman run --rm --network container:test --userns container:test  -it testimg
[root@43644d997e2d /]# nft list ruleset
Operation not permitted (you must be root)
netlink: Error: cache initialization failed: Operation not permitted
[root@43644d997e2d /]# 
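
If you want to confirm that two containers really share a user namespace, you can compare the namespace links under /proc for their main processes. The sketch below assumes the PIDs are visible from where you run it and that you are allowed to read the links; the helper names ns_id and same_ns and the container name "other" are placeholders for this example.

```shell
#!/bin/sh
# ns_id PID TYPE: print the namespace identifier of a process,
# e.g. "user:[4026531837]" for TYPE=user.
ns_id() {
    readlink "/proc/$1/ns/$2"
}

# same_ns PID_A PID_B TYPE: succeed if both processes are in the same
# namespace of the given type; return 2 if either link cannot be read.
same_ns() {
    a=$(ns_id "$1" "$3") || return 2
    b=$(ns_id "$2" "$3") || return 2
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Trivial self-check: a process shares its user namespace with itself.
if same_ns "$$" "$$" user; then
    echo "same user namespace"
fi

# Usage sketch with two running containers ("other" is a placeholder name):
#   pid_a=$(podman inspect --format '{{.State.Pid}}' test)
#   pid_b=$(podman inspect --format '{{.State.Pid}}' other)
#   same_ns "$pid_a" "$pid_b" net  && echo "network namespace is shared"
#   same_ns "$pid_a" "$pid_b" user || echo "user namespaces differ, check capabilities"
```

When the network namespace is shared but the user namespaces differ, you are in one of the two surprising cases described above.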

The network namespace was just one example; the rule applies to the other namespaces as well. You can read more about the behavior in the user_namespaces(7) man page. With this knowledge you can avoid accidentally giving your containers more privileges than they should have.
