Surviving a database corruption

I have yet to bump into perfect software. Bugs, failures, and shortcomings are a reality of software development. They often have upsides, though, whether that is learning about a new area of code in a larger application or coming up with ideas to prevent future problems.

We had an interesting problem brought to our attention recently where the Podman database had become corrupted. The database’s primary job is to keep track of containers, their states, and their configurations. In this case, the reporters were testing Podman by simulating power loss on a special appliance. While the Podman team has done a limited amount of power-loss testing, as have our users (either on purpose or accidentally), we rarely see database corruption because, in order for the database to become corrupted, it must be in the middle of a write (at least theoretically). And writing to our database is usually a very quick action, so the window is quite small.

While we could not immediately pin down the root cause, we were asked how to recover from this type of failure. If your database is corrupted, most Podman commands will not work. In this situation, the reporters’ containers were normally recreated on each boot, and this was done programmatically using the RESTful API.

Podman DB corruption

If your Podman database becomes corrupted, in almost all cases you will not be able to recover any existing containers.

In this case, the containers were being run by a privileged user: root. As such, I will show the recovery as a privileged user and add notes for rootless users where they differ. The procedure is roughly the same. First, the error:

$ sudo podman ps
panic: invalid freelist page: 66, page type is leaf

goroutine 1 [running]:
go.etcd.io/bbolt.(*freelist).read(0x50c95d?, 0x7f96a7e42000)
	/home/baude/go/pkg/mod/go.etcd.io/bbolt@v1.3.6/freelist.go:266 +0x22e
	/home/baude/go/pkg/mod/go.etcd.io/bbolt@v1.3.6/db.go:323 +0xb8
sync.(*Once).doSlow(0xc00011e1c8?, 0x10?)
	/usr/lib/golang/src/sync/once.go:74 +0xc2
	/home/baude/go/pkg/mod/go.etcd.io/bbolt@v1.3.6/db.go:316 +0x47
go.etcd.io/bbolt.Open({0x7ffd1855f26a, 0x23}, 0x1b6?, 0x0)
	/home/baude/go/pkg/mod/go.etcd.io/bbolt@v1.3.6/db.go:293 +0x48b
main.(*CheckCommand).Run(0xc00005fe58, {0xc0000141a0, 0x1, 0x1})
	/home/baude/go/pkg/mod/go.etcd.io/bbolt@v1.3.6/cmd/bbolt/main.go:202 +0x1a5
main.(*Main).Run(0xc000104f40, {0xc000014190, 0x2, 0x2})
	/home/baude/go/pkg/mod/go.etcd.io/bbolt@v1.3.6/cmd/bbolt/main.go:112 +0x979
	/home/baude/go/pkg/mod/go.etcd.io/bbolt@v1.3.6/cmd/bbolt/main.go:70 +0xae

The first step to recovery is to delete the existing database.

$ sudo rm /var/lib/containers/storage/libpod/bolt_state.db
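If you want to keep the corrupted file around for later inspection (for example, with the bbolt tooling shown in the panic above), a safer variant is to move it aside rather than delete it. This is a sketch; the `backup_name` helper is hypothetical, not part of Podman:

```shell
# Sketch: move the corrupted database aside instead of deleting it outright,
# so it can be inspected later. backup_name is a hypothetical helper.
backup_name() {
  # Append a .corrupt suffix plus a timestamp to the given path.
  echo "$1.corrupt.$(date +%Y%m%d%H%M%S)"
}

db=/var/lib/containers/storage/libpod/bolt_state.db
# Uncomment to actually move the file:
# sudo mv "$db" "$(backup_name "$db")"
```

Either way, the database must be out of Podman's default path before the next step.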

Rootless database path

The privileged user’s database is by default stored at /var/lib/containers/storage/libpod/bolt_state.db. The rootless user’s is stored at ~/.local/share/containers/storage/libpod/bolt_state.db by default.
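The two default locations above can be selected based on the user's UID. This is a minimal sketch assuming the default storage locations; a custom storage configuration can place the database elsewhere:

```shell
# Sketch: print the default Podman database path for the current user.
# Assumes the default storage locations described in the article.
db_path() {
  if [ "$(id -u)" -eq 0 ]; then
    echo "/var/lib/containers/storage/libpod/bolt_state.db"
  else
    echo "$HOME/.local/share/containers/storage/libpod/bolt_state.db"
  fi
}

db_path
```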

When Podman is unable to find its database, it will create a new empty database.

$ sudo podman ps -a

Now we can recreate our containers, which were called container1, container2, and container3.

$ sudo podman create --name container1 alpine top
Error: creating container storage: the container name "container1" is already in use by e77786c096e083b258bad2e196255f7dc1a2859cfb9dd35436648e1541bdce23. You have to remove that container to be able to reuse that name: that name is already in use

How can the container already exist but not appear in a list of all containers? The error message could be more helpful. There is a little-known option for `podman ps` called `--external` that shows containers in this external storage state.

$ sudo podman ps -a --external
CONTAINER ID  IMAGE                            COMMAND     CREATED             STATUS                    PORTS       NAMES
69f78dfaa0a6  docker.io/library/alpine:latest  storage     About a minute ago  Storage                               container1
e5db3ad9125e  docker.io/library/alpine:latest  storage     About a minute ago  Storage                               container2
52591b8b7676  docker.io/library/alpine:latest  storage     About a minute ago  Storage                               container3

Notice how the STATUS column lists Storage for all three containers. Again, this is because these containers are still present in container storage on the filesystem, which explains the earlier error. Rather than deleting them and recreating the containers (which is also perfectly valid), we can simply use the --replace option, which is available for both podman run and podman create.

$ sudo podman create --replace --name container1 alpine top

The --replace option removes the previous container from the filesystem and then runs or creates the new container with that name. It works both for containers tracked in your database and for those that exist only in storage.
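Since all three containers need recreating, the same command can be repeated in a small loop. The `recreate_containers` helper below is hypothetical (not a Podman command), and the image and command (`alpine top`) are taken from the article's example:

```shell
# Sketch: recreate a list of containers with --replace.
# The runner argument lets you pass "sudo podman" for root or
# "podman" for rootless. recreate_containers is a hypothetical helper.
recreate_containers() {
  runner=$1; shift
  for name in "$@"; do
    $runner create --replace --name "$name" alpine top
  done
}

# Usage, matching the article's container names:
# recreate_containers "sudo podman" container1 container2 container3
```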
