by Chris Evich
Let’s say you’re test-driving a service that needs to interact with a rootless Podman socket. The service is potentially destructive to containers, volumes, or images, so you don’t want to risk your host installation. How can you do this all from inside a rootless container? The answer is somewhat complicated, but in this article, I’ll try to walk you through each of the challenges step-by-step.
Nested Podman inside Podman isn’t a new thing. On the host side, we’ve got to run --privileged and provide the container storage from the host (-v container-storage:/home/<name>/.local/share/containers:Z). This is necessary to avoid an ugly overlay-in-overlay situation. In any case, this is fairly simple and unsurprising, as the following Containerfile demonstrates:
FROM registry.fedoraproject.org/fedora:latest
RUN dnf upgrade -y && \
    dnf install -y podman && \
    rpm --setcaps shadow-utils 2>/dev/null && \
    dnf clean all
RUN groupadd -g 1000 fred && \
    useradd -u 1000 -g 1000 fred && \
    echo -e "fred:1:999\nfred:1001:64535" | \
    tee /etc/subuid > /etc/subgid
VOLUME /home/fred/.local/share/containers
The base-image package upgrade and installation shouldn’t be surprising. The --setcaps on shadow-utils is a Fedora-specific workaround for a long-standing bug. Next, we add our user and set up the nested subuid/subgid mapping. This is needed so nested containers don’t grab IDs that overlap with the namespaced UID 0 (the root user) and UID 1000 (Fred). Were that to happen, users in the nested containers could have unintended access to files and directories of the outer container. So this subuid/subgid setup is important.
Since Fred is our only nested user and group namespace, we can simply clobber the default contents in the /etc/sub{uid,gid} files. If there were other users with known IDs, they should similarly be excluded from Fred’s namespace ranges to avoid unintentional clashes. Finally, the image requests a volume from the host to provide the above-mentioned container storage.
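To make the "no overlap" rule concrete, here is a minimal sketch of a check that could be run against a subuid or subgid file. The function name check_subid_file is hypothetical and not part of Podman or shadow-utils; it only assumes the standard name:start:count line format.

```shell
# Hypothetical helper: report subordinate ID ranges that include a reserved ID.
# Usage: check_subid_file FILE RESERVED_ID...
# Prints one line per clash and returns 1 if any clash was found.
check_subid_file() {
    file=$1
    shift
    awk -F: -v reserved="$*" '
        BEGIN { n = split(reserved, ids, " ") }
        NF == 3 {
            first = $2 + 0
            last = first + $3 - 1    # each line is name:start:count
            for (i = 1; i <= n; i++) {
                id = ids[i] + 0
                if (id >= first && id <= last) {
                    printf "%s: range %d-%d includes reserved ID %d\n", $1, first, last, id
                    clash = 1
                }
            }
        }
        END { exit clash + 0 }
    ' "$file"
}
```

Inside the outer container, calling check_subid_file /etc/subuid 0 1000 should print nothing for the two ranges used above (1-999 and 1001-65535), since neither contains UID 0 or UID 1000.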
Running a nested container inside this container image is straightforward and unsurprising:
$ podman build -t piptest .
...cut...
$ podman run -it --rm --privileged --user fred --hostname outer piptest
[fred@outer /]$ podman run -it --rm --privileged --hostname inner fedora:latest
...cut...
Writing manifest to image destination
Storing signatures
[root@inner /]# echo "no magic here"
no magic here
[root@inner /]# exit
exit
[fred@outer /]$ exit
exit
Next comes the tricky part. Since Podman doesn’t have a daemon, a running Podman process is needed to service API socket requests. While we could run podman system service -t0 as the container’s command, this won’t allow us to use the socket at the same time from an app. With containers, any time you’re contemplating multiple high-level processes, unless they’re entirely trivial (which these aren’t), you’ll want an init system like systemd.
Running systemd inside a container complicates things, primarily due to the additional services and their non-trivial configuration. However, it’s necessary in this use case because we want the container to do multiple things. Specifically, systemd is needed to properly handle a (potentially) large number of Podman child processes popping in and out of existence, along with signal handling up and down their sub-trees. And that’s before mentioning the app that will be using the Podman socket, which will inevitably need systemd handling as well.
In other words, without a PID 1 process manager, our container will quickly end up an uncoordinated mishmash, constantly testing its own imminent collapse.
Back in the Containerfile, updating the dnf install line to include systemd is easy enough, though some additional magic is needed to coax life into a Podman socket service and keep operations observable when the container starts. Assuming your host is systemd-based and has podman installed, the podman systemd files can simply be copied into the container build context, for example:
$ cp /lib/systemd/system/podman.s* ./
The first thing that needs changing is a minor update to the podman.service file, so it logs to the console at the warn level (the default on Fedora is the info level):
[Unit]
Description=Podman API Service
Requires=podman.socket
After=podman.socket
Documentation=man:podman-system-service(1)
StartLimitIntervalSec=0

[Service]
Delegate=true
Type=exec
KillMode=process
Environment=LOGGING="--log-level=warn"
ExecStart=/usr/bin/podman $LOGGING system service
StandardOutput=journal+console
StandardError=inherit

[Install]
WantedBy=default.target
Secondly, let’s have systemd manage the listening podman.sock file in Fred’s home directory, where it’s easier to interact with and saves a bit of typing:
[Unit]
Description=Podman API Socket
Documentation=man:podman-system-service(1)

[Socket]
ListenStream=%h/podman.sock
SocketMode=0660

[Install]
WantedBy=sockets.target
Installing the socket and service, and setting up a systemd slice for Fred, happens in the Containerfile, which to this point looks like this:
FROM registry.fedoraproject.org/fedora:latest
RUN dnf upgrade -y && \
    dnf install -y podman systemd && \
    rpm --setcaps shadow-utils 2>/dev/null && \
    dnf clean all
RUN useradd -u 1000 fred && \
    echo -e "fred:1:999\nfred:1001:64535" | tee /etc/subuid > /etc/subgid
VOLUME /home/fred/.local/share/containers
ADD /podman.service /podman.socket /home/fred/.config/systemd/user/
RUN cd /home/fred/.config/systemd/user/ && \
    mkdir sockets.target.wants && \
    ln -s ../podman.socket ./sockets.target.wants/ && \
    mkdir -p /var/lib/systemd/linger && \
    touch /var/lib/systemd/linger/fred && \
    chown -R 1000:1000 /home/fred
ENTRYPOINT /sbin/init
The key to having the user-slice services start with the container is creating the file /var/lib/systemd/linger/fred, which is equivalent to running loginctl enable-linger fred. However, neither systemd nor dbus is available in a container-image build environment, so the file is simply touched into existence.
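Because both the linger marker and the socket symlink are created by hand, a small sanity check can catch a typo before the container ever boots. The following is only a sketch: the function name verify_user_socket_setup is hypothetical, and it assumes the directory layout used in the Containerfile.

```shell
# Hypothetical helper: confirm the hand-made linger marker and socket
# enablement exist.
# Usage: verify_user_socket_setup USER HOME_DIR [LINGER_DIR]
verify_user_socket_setup() {
    user=$1
    home_dir=$2
    linger_dir=${3:-/var/lib/systemd/linger}
    unit_dir=$home_dir/.config/systemd/user
    rc=0
    if [ ! -e "$linger_dir/$user" ]; then
        echo "missing linger marker: $linger_dir/$user"
        rc=1
    fi
    # -e follows symlinks, so a dangling link is reported as well
    if [ ! -e "$unit_dir/sockets.target.wants/podman.socket" ]; then
        echo "podman.socket is not enabled under sockets.target.wants"
        rc=1
    fi
    return $rc
}
```

For the image built here, the call would be verify_user_socket_setup fred /home/fred, which should return success and print nothing.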
The final Containerfile steps fix Fred’s file ownership and indicate that the container should start init (systemd) as PID 1. However, since the container will now start up as the (namespaced) root user, we need to pre-create the container-storage volume as follows; otherwise, the ownership will be incorrect, undoubtedly generating a ton of permission-denied errors:
$ podman volume create -o o=uid=1000,gid=1000 freds-containers
Should you want to view or manipulate the contents of that volume, be sure to prefix your commands with podman unshare to enter your user namespace. Otherwise, that’s it: all the technical bits uncovered, and the magic unobscured. All that’s left is to start the container and prove the nested, remote, rootless Podman connection is functional:
$ podman build -t piptest .
...cut...
$ podman run -dt --rm --privileged --hostname outer \
    -v freds-containers:/home/fred/.local/share/containers \
    --systemd true piptest
0af2a31de3467d0ddb2b540072c0864f7738d5de5745aa5b3596d70f4a5f7a04
$ podman exec -itl bash
[root@outer /]# ls -la /home/fred/podman.sock
srw-rw----. 1 fred fred 0 Feb 9 17:06 /home/fred/podman.sock
[root@outer /]# export CONTAINER_HOST=unix:///home/fred/podman.sock
[root@outer /]# podman --remote info --format={{.Store.GraphRoot}}
/home/fred/.local/share/containers/storage
[root@outer /]# podman --remote run -it --rm --hostname inner fedora:latest
...cut...
Writing manifest to image destination
Storing signatures
[root@inner /]# echo "Hello from $HOSTNAME"
Hello from inner
[root@inner /]# exit
exit
[root@outer /]# exit
exit
$ podman stop -l
0af2a31de3467d0ddb2b540072c0864f7738d5de5745aa5b3596d70f4a5f7a04
Here you can see the container is built and then run in --privileged mode (required for nested rootless containers) with the pre-created container-storage volume. We then exec into it as the namespaced root user, connect to Fred’s Podman socket, and print out the container storage location. The use of root is merely a convenience; you can connect to the socket as any configured/namespaced user with appropriate permissions. As demonstrated above, the location and permissions are configurable in the podman.socket file.
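An app or test script started right after the container boots may get there before systemd has created the socket. One possible way to handle that is sketched below; wait_for_socket and socket_url are hypothetical helper names, and the one-second polling interval is an arbitrary choice.

```shell
# Hypothetical helper: poll until a Unix socket exists or a timeout expires.
# Usage: wait_for_socket PATH [TIMEOUT_SECONDS]   (default timeout: 10)
wait_for_socket() {
    sock=$1
    remaining=${2:-10}
    while [ ! -S "$sock" ]; do
        if [ "$remaining" -le 0 ]; then
            return 1
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}

# Hypothetical helper: build the unix:// URL that CONTAINER_HOST expects
# from an absolute socket path.
socket_url() {
    printf 'unix://%s\n' "$1"
}
```

With these defined, a script could call wait_for_socket /home/fred/podman.sock 30, then export CONTAINER_HOST="$(socket_url /home/fred/podman.sock)" before running podman --remote info as in the session above.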
At this point, development of any service which connects to a Podman socket is possible and will be isolated from the host’s Podman setup. Adding more users and systemd slices, and making them linger, should be fairly trivial, though you’ll need to remember to exclude other namespaced UIDs/GIDs from the Podman user’s range within the nested /etc/sub{uid,gid} files to prevent clashes.
Despite needing to run the top-level container in --privileged mode, the containerized systemd slices provide some additional level of isolation. This setup is certainly not as secure as one without systemd or the additional privileges. However, the privileged option isn’t nearly as bad as it would be if the container were run as root. In this case, the extra enabled capabilities are of limited effect, due to the rootless user namespace. So it’s a perfectly fine arrangement for testing, development, or non-critical purposes.
Overall these containers succeed in isolating their software, dependencies, and runtime environments from the host operating system. Further, since they’re already systemd-enabled, it should be easy to add additional apps and service files – directing them toward the nested Podman socket location. However, since this article is already a bit long, these are all left as exercises for the reader.
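As a starting point for that exercise, here is a rough sketch of generating a user unit for an app that talks to the nested socket. Everything about the app is an assumption: write_app_unit, the unit name, and the executable path are hypothetical, and only the podman.socket dependency and the %h/podman.sock location come from the setup above.

```shell
# Hypothetical helper: write a systemd user unit for an app that uses the
# nested Podman socket.
# Usage: write_app_unit UNIT_DIR APP_NAME EXEC_PATH
write_app_unit() {
    unit_dir=$1
    app=$2
    exec_path=$3
    mkdir -p "$unit_dir"
    cat > "$unit_dir/$app.service" <<EOF
[Unit]
Description=$app (uses the nested Podman socket)
Requires=podman.socket
After=podman.socket

[Service]
# %h expands to the user's home directory, matching ListenStream in podman.socket
Environment=CONTAINER_HOST=unix://%h/podman.sock
ExecStart=$exec_path

[Install]
WantedBy=default.target
EOF
}
```

For example, write_app_unit /home/fred/.config/systemd/user myapp /usr/local/bin/myapp would create myapp.service next to the Podman units. As with podman.socket, it would presumably still need a symlink (here under default.target.wants) created in the Containerfile to start with the container.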