I recently had a discussion in a container-selinux issue on why we allow certain capabilities by default for containers. The conversation centered on DAC_OVERRIDE, a Linux capability that allows privileged processes, usually root, to ignore file ownership and read/write permission checks, that is, Discretionary Access Control (DAC).
“As @wrabcak notes in Why do you see DAC_OVERRIDE SELinux denials?, In most cases [the use of dac_override] is a bug in the application package, and as @rhatdan notes in SELinux team works to remove DAC_OVERRIDE Permissions, In most cases the requirement for DAC_OVERRIDE is a simple programmer error in the way he sets up his application and can be fixed by adjusting the permissions/ownership on file system objects. Loosening the SELinux constraints should be the last resort and When I look at containers, we allow DAC_OVERRIDE by default, because so many containers are badly written, but I think it would be great for us to be able to remove this permission by default.”
Saying that containers are badly written and need DAC_OVERRIDE is incorrect. Almost all containers actually need neither DAC_OVERRIDE nor a whole bunch of other capabilities.
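One way to see whether a process actually holds DAC_OVERRIDE is to inspect its effective capability mask. The sketch below assumes a Linux system where /proc/self/status exposes a CapEff line as a hexadecimal mask, and relies on CAP_DAC_OVERRIDE being capability number 1 in the kernel headers. The function names are hypothetical helpers, not part of any existing tool.

```shell
# CAP_DAC_OVERRIDE is capability number 1 in <linux/capability.h>.
CAP_DAC_OVERRIDE_BIT=1

# has_dac_override HEXMASK
# Succeeds (exit status 0) if the DAC_OVERRIDE bit is set in the given
# hexadecimal capability mask, e.g. the value of the CapEff field.
has_dac_override() {
    [ $(( (0x$1 >> $CAP_DAC_OVERRIDE_BIT) & 1 )) -eq 1 ]
}

# current_capeff
# Prints the effective capability mask of the process reading the file
# (awk here, which inherits the capabilities of the calling shell). Linux only.
current_capeff() {
    awk '/^CapEff:/ { print $2 }' /proc/self/status
}
```

For example, running `has_dac_override "$(current_capeff)" && echo "DAC_OVERRIDE is available"` inside a container would show what that container was actually granted.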
A couple of years ago, I gave a talk on Goldilocks and the three bears. In that talk, I explained that for container security we walk a fine line: we try to make containers as secure as possible while still allowing most applications to run without turning off security. If users are confronted with permission denied, their instinct is to just use --privileged mode, which turns off almost all container security.
In my blog, Container permission denied: How to diagnose this error, I explained how difficult it is to diagnose permission denied errors. The bottom line when it comes to SELinux confinement of containers is that I chose to be lenient in controlling capabilities and network access.
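Going in the opposite direction from privileged mode is also possible: a single capability can be dropped while everything else stays at its default. The following is a minimal sketch, assuming Podman's existing --rm and --cap-drop options. The wrapper name and the PODMAN override variable are hypothetical and only here for illustration; the image and command are supplied by the caller.

```shell
# PODMAN can be overridden, for example PODMAN=echo for a dry run
# that only prints the arguments instead of starting a container.
PODMAN="${PODMAN:-podman}"

# run_without_dac_override IMAGE [COMMAND...]
# Hypothetical wrapper: starts a throwaway container with DAC_OVERRIDE
# dropped, leaving all other default settings untouched.
run_without_dac_override() {
    "$PODMAN" run --rm --cap-drop CAP_DAC_OVERRIDE "$@"
}
```

If an application still works when launched this way, it probably never needed DAC_OVERRIDE in the first place.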
With container-selinux policy, the main goal is to limit the container’s access to file systems using Mandatory Access Control (MAC).
Discretionary access permissions like capabilities are handled directly in the Linux kernel along with user-namespaces. If SELinux blocked all capability access by default, then we would need to have different types for every combination of capabilities.
podman run --cap-add CAP_DAC_OVERRIDE --security-opt label=type:container_dac_override_t …
Because there are over 40 different Linux capabilities, and each one can be either granted or dropped, we would end up with at least 2^40 (over a trillion) different types just to cover all possible combinations of Linux capabilities. It would be possible to use booleans instead, but it is difficult to distinguish which capabilities are more powerful than others. For example, CAP_SYS_ADMIN is even more dangerous than CAP_DAC_OVERRIDE. Adding 40 or so booleans would complicate things too much for the average user, who barely understands Linux security. Similarly, we would need controls for the network stack.
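To get a feel for the scale, note that each capability is independently either granted or dropped, so n capabilities yield 2^n distinct sets. A quick sketch, assuming a round 40 capabilities (the exact count depends on the kernel version) and a hypothetical helper name:

```shell
# capability_combinations N
# Prints how many distinct sets can be formed from N capabilities,
# since each capability is independently either granted or dropped: 2^N.
capability_combinations() {
    echo $(( 1 << $1 ))
}

capability_combinations 40   # prints 1099511627776
```

Whichever way the combinations are counted, the result is far too large to express as one SELinux type per combination.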
Other parts of the kernel support controlling capabilities and network access. These controls are much more flexible than SELinux type enforcement. In containers, the dropping of capabilities and the use of network namespaces are both enforced by the same kernel that enforces SELinux. We decided to concentrate on the most common container escape route, the file system.
Writing a general-purpose container-selinux policy to control containers forced me to make Goldilocks-like compromises. In a perfect world, everyone would run their containers as securely as possible, and users can use a tool like Udica to generate custom policies that further lock down specific containers. You can even grab the container-selinux policy and write your own types to run your containers with. For example:
podman run --security-opt label=type:confined_container_t …
Relying on container engines to control other security subsystems, such as capabilities like DAC_OVERRIDE, has proven effective. The container-selinux package has been used for more than ten years to control millions of containers launched by container engines like Podman, Docker, CRI-O, Containerd, and Buildah. The container SELinux policy also runs under container orchestrators like Kubernetes and OpenShift, Red Hat’s Kubernetes, and provides separation for their containers. Furthermore, SELinux has been incredibly effective and has blocked many container escapes.