Earlier this year, my colleague Jan Rodak worked on Accelerating Parallel Layer Removal. It worked out so well that we wanted to tackle the other side of the problem: Layer Creation. Layer creation is even slower than removal, as we must write all data to disk instead of just unlinking files, so there is potential for much bigger gains. However, creation is also a more complex process than removal, so there were more challenges for me to solve.
Measuring Lock Times
We knew that we held locks during the layer tar decompression and extraction, and we knew that on big layers this can take a while. However, we didn't have an accurate picture of how big a problem this was. So in order to get some accurate data, I created a bpftrace script to track our lock times.
To write such a script we need to know some internals of how the locking is implemented and what we want to track. I wanted data on how long each lock is held and how long we block waiting for it. Figuring out how the storage locks work is fairly easy, as we can just read the code to see what types of locks it uses. Our library uses file locks via the fcntl() syscall with the F_SETLKW command. In addition, it does not use F_UNLCK to unlock but rather closes the file descriptor to release the lock.
Now I just had to find the right bpftrace attach points in the kernel which correspond to these events:
- kfunc:fcntl_setlk: fcntl() entry when taking the lock
- kretfunc:fcntl_setlk: fcntl() exit, so we can get the time delta between entry and exit and know how long we were blocked
- tracepoint:syscalls:sys_enter_close: to know when the lock is unlocked again
However, the bpftrace program will capture all processes that hit these probes, while I only care about our storage locks, so I used the lock file names to filter them. Of course this isn't perfect, as other processes could use file locks with the same names, but I assumed that is not the case on my system.
You can find the full script I came up with on GitHub. I won't go into more detail on how it works, but you can find plenty of other bpftrace script examples online if you are interested in its capabilities.
Test Case
In order to get a good view of the problem I decided to pull two big images in parallel: ghcr.io/home-assistant/home-assistant:stable (the official home-assistant image) and quay.io/libpod/get_ci_vm:latest (an image we use to run our CI VMs in order to reproduce CI issues).
The exact images here do not matter, I just picked them because they are big and have some big layers which should be ideal to showcase the problem. The home-assistant image is 2.18 GB in total size on disk and the get_ci_vm image is 938 MB.
I have a somewhat slow internet connection, so in order to avoid any variance and keep the network from becoming a bottleneck, I first set up a local registry on my system and mirrored both images there.
I run my bpftrace script in one terminal, and in another I simply pull both images in parallel from my local registry using this command:
podman pull --tls-verify=false 192.168.122.100:5000/home-assistant:stable 2>/dev/null & podman pull --tls-verify=false 192.168.122.100:5000/get_ci_vm:latest
The Problem
Running the bpftrace script and then the test from above with podman 5.7 generates output like this:
@lock_max[storage.lock]: 5
@lock_duration[storage.lock]:
[0] 314 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1] 1 | |
[2, 4) 0 | |
[4, 8) 1 | |
This is the default bpftrace output format for the max and histogram maps we create. While this looked decent enough for quick comparisons on the CLI to see whether my changes worked, I transformed the data into some pretty graphs so you can follow along easily.
| Lockfile | Maximum Duration held |
| --- | --- |
| storage.lock | 5 ms |
| images.lock | 23 ms |
| layers.lock | 7221 ms |
| containers.lock | 7220 ms |
This shows the maximum duration each lock was held, in milliseconds. As we can see, both containers.lock and layers.lock were held at one point for over 7 full seconds. That is of course not great.

This shows a logarithmic histogram of the values captured by my bpftrace script. The X-axis grows exponentially with base 2, while the Y-axis grows exponentially with base 10.
The majority of lock times are short; there are, however, still plenty of outliers for containers.lock and layers.lock. The long lock times are what cause other processes to become unresponsive, as they can do nothing but wait until they get the lock. Looking at the results for blocked time shows exactly this:

We can see that we only block on layers.lock, and in the majority of cases we get it quickly. There are, however, a few outliers that block for a while, simply because of the lock duration outliers above that hold the lock for so long. The longest a process was blocked was 7221 ms, which matches the longest time the lock was held, so we hit the worst possible case there.
Interestingly, there is no contention on containers.lock. This may seem surprising at first, but in the pull code layers.lock and containers.lock were taken serially, so any time we locked layers.lock, containers.lock was locked as well. With a bit more profiling of the code I could confirm our initial suspicion: the slow parts where the locks are held for a while are indeed all caused by the slow extraction of the layer tarballs.
The Fix
It sounds easy to say "well, just extract without holding the locks", but the locks are there for a reason. What happens if two processes create the same layer, or another one deletes the parent layer while we create ours? What happens if the process is killed midway through extraction? The metadata is stored in a single JSON file; it cannot be written without the lock, as we risk corrupting it otherwise.
Thankfully I didn't start from zero: I could adapt the tricks Jan used for the layer removal speedup to creation as well. We extract the tarball to a temporary directory, then take the locks, create the layer metadata, and rename the temporary directory to the final layer location, which is of course fast. For the layer removal work we had already added the `TempDir` object, which takes care of cleanup if the operation fails midway or the process is killed. As such, I only had to extend it to support adding content.
The more challenging part was figuring out the various prerequisites that were handled before the extraction to set specific metadata, such as the layer's ID mapping. The IDs are critical during extraction, as we must write the files to the filesystem with the correct UID and GID. In addition, some of the extraction code path is driver specific and cannot be generalized. It would also be nice not to do all the extraction work when the layer already exists or the parent doesn't exist.
So in total I managed to get something like this working:
Lock the store and check whether the layer already exists or the parent doesn't; in either case we can quickly return an error. If the parent is there, we fetch the required metadata, such as the ID mapping. Then we unlock everything again and perform the tar extraction into our temporary directory. It is important to pick a directory on the same filesystem as the final layer location, as otherwise the rename would fail.
Once the extraction is complete, lock the store again, check that the parent still exists, and ensure the same layer with our ID wasn't created in the meantime. If not, we add the metadata for this new layer to the store, which is now a much quicker process as it no longer has to deal with the extraction.
One thing to mention: due to driver-specific implementation requirements, I only implemented this feature for overlayfs; other storage drivers will still do the extraction while holding the lock. They would need to be ported over to the new code paths I added, but they posed additional challenges, which is why I didn't do it yet.
In addition, the containers.lock file was previously held for the entire duration of the layer creation path. This was never needed, because we only use it to check that the parent layer ID doesn't conflict with a container. Now we unlock the container store again as soon as that check is done.
You can see the total amount of changes in my PR: https://github.com/containers/container-libs/pull/378
The Result
Let's see the numbers using a podman binary with my patches, running the same script and pull command again to verify the improvements:
| Lockfile | Maximum Duration held |
| --- | --- |
| storage.lock | 1 ms |
| images.lock | 24 ms |
| layers.lock | 24 ms |
| containers.lock | 0 ms |
We can already see a big difference: the longest time we held any lock was 24 milliseconds.


The difference is massive; keep in mind the X-axis scale is much smaller now. We only blocked once for 6 milliseconds on layers.lock and another time for 1 millisecond. All the big lock contention between the two pulls is gone.
The containers.lock is now held for a very short time, less than 1 millisecond, and the longest we hold the layers.lock and images.lock files is just 24 milliseconds. Remember, we held both layers.lock and containers.lock for over 7 seconds before. We are talking about orders of magnitude of difference.
Looking at the time it took to complete both pull commands with the new version, it is about 2 seconds faster. In theory one could expect bigger improvements, but just because we no longer contend on a lock doesn't mean there aren't other bottlenecks. In particular, the extraction is still limited by our storage speed: being able to extract in parallel doesn't magically double my SSD's speed.
Keep in mind these numbers are not meant to be representative of real-world performance on your systems. Many things could affect the timings here, but I measured both runs on the same system with the same versions, with only the podman binary differing. So I can confidently compare the numbers between the runs, and the difference is clear and well outside any margin of error. I ran the tests several times and the results were all very similar.
Conclusion
We have successfully removed another bottleneck during image layer creation. The reduction in lock times is massive and results in a noticeable real-world difference in command response times. While my example here focused on two parallel pulls, many of our commands need to take the four storage locks for various operations. For example, running podman images in parallel with a pull can also block on the layers.lock file. As such, this fix does not just help parallel pulls; it helps basically any podman command running in parallel with a pull.
However, it is worth pointing out that there is no actual speed advantage for a single pull. That will take about the same time, so if you only use podman from the terminal, one command at a time, this change doesn't matter for you.
These improvements are contained in our storage library and work fully within the existing API interfaces, which means that not just Podman but also Buildah, Skopeo (when copying to the local storage), and even CRI-O benefit from this change, as long as they use the overlayfs driver. Therefore I assume many users will benefit in one way or another. We are planning to release this change in podman v5.8.

