30x faster!

Text translated from: https://www.scrivano.org/posts/2022-10-21-the-journey-to-spee...

The original author, Giuseppe Scrivano, is a Red Hat engineer; the post reviews a 30x improvement in OCI container startup time.

When I started working on crun ( https://github.com/containers/crun ), I was looking for a way to start and stop containers faster by improving the OCI runtime, the component in the OCI stack that is responsible for ultimately interacting with the kernel and setting up the environment the container is in.

The OCI runtime itself has a very limited lifetime; its job is to execute a series of system calls that map directly to the OCI configuration file.

I was surprised that such a trivial task could take so long.

Disclaimer: For my tests, I used the default kernel and all libraries available in the Fedora installation. In addition to the fixes described in this blog post, there may have been other fixes over the years that could affect overall performance.

The same crun version is used for all the tests below.

For all tests I use hyperfine, which is installed via cargo.

How it was in 2017

To compare how far we have come from the past, we need to go back to 2017, or just install an old Fedora image. For the tests below, I used Fedora 24 based on Linux kernel 4.5.5.

On a fresh installation of Fedora 24, run a crun build from the master branch:

# hyperfine 'crun run foo'
Benchmark 1: 'crun run foo'
  Time (mean ± σ):     159.2 ms ±  21.8 ms    [User: 43.0 ms, System: 16.3 ms]
  Range (min ... max):    73.9 ms ... 194.9 ms    39 runs

User time and system time refer to the time spent by a process in user mode and kernel mode respectively.

160 milliseconds is a lot, and as far as I know, similar to what I observed five years ago.

Analysis of the OCI runtime immediately shows that the majority of user time is spent on libseccomp compiling seccomp filters.

To verify this, let's try running a container with the same configuration but without the seccomp profile:

# hyperfine 'crun run foo'
Benchmark 1: 'crun run foo'
  Time (mean ± σ):     139.6 ms ±  20.8 ms    [User: 4.1 ms, System: 22.0 ms]
  Range (min ... max):    61.8 ms ... 177.0 ms    47 runs

We're using 1/10 of the user time (43 ms -> 4.1 ms) that was previously required, and the overall time has improved too!

So there are mainly two different problems: 1) the system time is rather high, and 2) the user time is dominated by libseccomp. The two issues can be addressed separately.

Let's focus on system time for now, we'll get back to seccomp later.

System time

Create and destroy network namespaces

Creating and destroying a network namespace used to be very expensive. The problem can be reproduced using just the unshare tool; on Fedora 24 I get:

# hyperfine 'unshare -n true'
Benchmark 1: 'unshare -n true'
  Time (mean ± σ):      47.7 ms ±  51.4 ms    [User: 0.6 ms, System: 3.2 ms]
  Range (min ... max):     0.0 ms ... 190.5 ms    365 runs

That is a huge amount of time!

I tried to fix it in the kernel and came up with a patch. Florian Westphal rewrote it in a better way and merged it into the Linux kernel:

commit 8c873e2199700c2de7dbd5eedb9d90d5f109462b
Author: Florian Westphal
Date:   Fri Dec 1 00:21:04 2017 +0100

    netfilter: core: free hooks with call_rcu

    Giuseppe Scrivano says:
      "SELinux, if enabled, registers for each new network namespace 6
        netfilter hooks."

    Cost for this is high.  With synchronize_net() removed:
       "The net benefit on an SMP machine with two cores is that creating a
       new network namespace takes -40% of the original time."

    This patch replaces synchronize_net+kvfree with call_rcu().
    We store rcu_head at the tail of a structure that has no fixed layout,
    i.e. we cannot use offsetof() to compute the start of the original
    allocation.  Thus store this information right after the rcu head.

    We could simplify this by just placing the rcu_head at the start
    of struct nf_hook_entries.  However, this structure is used in
    packet processing hotpath, so only place what is needed for that
    at the beginning of the struct.

    Reported-by: Giuseppe Scrivano
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

commit 26888dfd7e7454686b8d3ea9ba5045d5f236e4d7
Author: Florian Westphal
Date:   Fri Dec 1 00:21:03 2017 +0100

    netfilter: core: remove synchronize_net call if nfqueue is used

    since commit 960632ece6949b ("netfilter: convert hook list to an array")
    nfqueue no longer stores a pointer to the hook that caused the packet
    to be queued.  Therefore no extra synchronize_net() call is needed after
    dropping the packets enqueued by the old rule blob.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

commit 4e645b47c4f000a503b9c90163ad905786b9bc1d
Author: Florian Westphal
Date:   Fri Dec 1 00:21:02 2017 +0100

    netfilter: core: make nf_unregister_net_hooks simple wrapper again

    This reverts commit d3ad2c17b4047
    ("netfilter: core: batch nf_unregister_net_hooks synchronize_net calls").

    Nothing wrong with it.  However, followup patch will delay freeing of hooks
    with call_rcu, so all synchronize_net() calls become obsolete and there
    is no need anymore for this batching.

    This revert causes a temporary performance degradation when destroying
    network namespace, but its resolved with the upcoming call_rcu conversion.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

These patches made a huge difference: the time to create and destroy a network namespace has dropped to an incredibly low level. Here are the stats for a modern 5.19.15 kernel:

# hyperfine 'unshare -n true'
Benchmark 1: 'unshare -n true'
  Time (mean ± σ):       1.5 ms ±   0.5 ms    [User: 0.3 ms, System: 1.3 ms]
  Range (min ... max):     0.8 ms ...   6.7 ms    1907 runs

Mount mqueue

Mounting an mqueue is also a relatively expensive operation.

On Fedora 24 it used to look like this:

# mkdir /tmp/mqueue; hyperfine 'unshare --propagation=private -m mount -t mqueue mqueue /tmp/mqueue'; rmdir /tmp/mqueue
Benchmark 1: 'unshare --propagation=private -m mount -t mqueue mqueue /tmp/mqueue'
  Time (mean ± σ):      16.8 ms ±   3.1 ms    [User: 2.6 ms, System: 5.0 ms]
  Range (min ... max):     9.3 ms ...  26.8 ms    261 runs

In this case too I tried to fix it and came up with a patch. It wasn't accepted, but Al Viro came up with a better version to solve the problem:

commit 36735a6a2b5e042db1af956ce4bcc13f3ff99e21
Author: Al Viro
Date:   Mon Dec 25 19:43:35 2017 -0500

    mqueue: switch to on-demand creation of internal mount

    Instead of doing that upon each ipcns creation, we do that the first
    time mq_open(2) or mqueue mount is done in an ipcns.  What's more,
    doing that allows to get rid of mount_ns() use - we can go with
    considerably cheaper mount_nodev(), avoiding the loop over all
    mqueue superblock instances; ipcns->mq_mnt is used to locate preexisting
    instance in O(1) time instead of O(instances) mount_ns() would've
    cost us.

    Based upon the version by Giuseppe Scrivano ; I've
    added handling of userland mqueue mounts (original had been broken in
    that area) and added a switch to mount_nodev().

    Signed-off-by: Al Viro

After this patch, the cost of creating mqueue mounts also dropped:

# mkdir /tmp/mqueue; hyperfine 'unshare --propagation=private -m mount -t mqueue mqueue /tmp/mqueue'; rmdir /tmp/mqueue
Benchmark 1: 'unshare --propagation=private -m mount -t mqueue mqueue /tmp/mqueue'
  Time (mean ± σ):       0.7 ms ±   0.5 ms    [User: 0.5 ms, System: 0.6 ms]
  Range (min ... max):     0.0 ms ...   3.1 ms    772 runs

Create and destroy IPC namespaces

I put off the work on speeding up container startup for a few years and picked it up again in early 2020. Another issue I noticed was the time needed to create and destroy an IPC namespace.

As with network namespaces, the issue can be reproduced using just the unshare tool:

# hyperfine 'unshare -i true'
Benchmark 1: 'unshare -i true'
  Time (mean ± σ):      10.9 ms ±   2.1 ms    [User: 0.5 ms, System: 1.0 ms]
  Range (min ... max):     4.2 ms ...  17.2 ms    310 runs

Unlike the previous two attempts, this time the patch I sent was accepted upstream:

commit e1eb26fa62d04ec0955432be1aa8722a97cb52e7
Author: Giuseppe Scrivano
Date:   Sun Jun 7 21:40:10 2020 -0700

    ipc/namespace.c: use a work queue to free_ipc

    the reason is to avoid a delay caused by the synchronize_rcu() call in
    kern_umount() when the mqueue mount is freed.

    the code:

        #define _GNU_SOURCE
        #include <sched.h>
        #include <error.h>
        #include <errno.h>
        #include <stdlib.h>

        int main()
        {
            int i;

            for (i = 0; i < 1000; i++)
                if (unshare(CLONE_NEWIPC) < 0)
                    error(EXIT_FAILURE, errno, "unshare");
        }

    goes from

            Command being timed: "./ipc-namespace"
            User time (seconds): 0.00
            System time (seconds): 0.06
            Percent of CPU this job got: 0%
            Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.05

    to

            Command being timed: "./ipc-namespace"
            User time (seconds): 0.00
            System time (seconds): 0.02
            Percent of CPU this job got: 96%
            Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.03

    Signed-off-by: Giuseppe Scrivano
    Signed-off-by: Andrew Morton
    Reviewed-by: Paul E. McKenney
    Reviewed-by: Waiman Long
    Cc: Davidlohr Bueso
    Cc: Manfred Spraul
    Link: http://lkml.kernel.org/r/20200225145419.527994-1-gscrivan@redhat.com
    Signed-off-by: Linus Torvalds

With this patch, the time to create and destroy IPC namespaces has also been drastically reduced, as outlined in the commit message. On a modern 5.19.15 kernel I now get:

# hyperfine 'unshare -i true'
Benchmark 1: 'unshare -i true'
  Time (mean ± σ):       0.1 ms ±   0.2 ms    [User: 0.2 ms, System: 0.4 ms]
  Range (min ... max):     0.0 ms ...   1.5 ms    1966 runs

User time

Kernel mode time seems to be under control now. What can we do to reduce user time?

As we found out earlier, libseccomp is the culprit here, so that is what needed fixing next; this work came after the IPC fix in the kernel.

Most of libseccomp's cost comes from syscall lookups. The OCI configuration file contains a list of syscalls by name, and each one is resolved through the seccomp_syscall_resolve_name function, which returns the syscall number for a given syscall name.

libseccomp used to perform a linear search through the syscall table for each syscall name; for x86_64, for example, the table looks like this:

/* NOTE: based on Linux v5.4-rc4 */
const struct arch_syscall_def x86_64_syscall_table[] = {
    { "_llseek", __PNR__llseek },
    { "_newselect", __PNR__newselect },
    { "_sysctl", 156 },
    { "accept", 43 },
    { "accept4", 288 },
    { "access", 21 },
    { "acct", 163 },
    /* ... */
};

int x86_64_syscall_resolve_name(const char *name)
{
    unsigned int iter;
    const struct arch_syscall_def *table = x86_64_syscall_table;

    /* XXX - plenty of room for future improvement here */
    for (iter = 0; table[iter].name != NULL; iter++) {
        if (strcmp(name, table[iter].name) == 0)
            return table[iter].num;
    }

    return __NR_SCMP_ERROR;
}

The complexity of building a seccomp profile with libseccomp is O(n*m), where n is the number of syscalls in the profile and m is the number of syscalls known to libseccomp.

I followed the advice in the code comment and spent some time trying to fix it. In January 2020 I developed a patch that solves the problem by using a perfect hash function to look up syscall names.

The patch for libseccomp is this:

commit 9b129c41ac1f43d373742697aa2faf6040b9dfab
Author: Giuseppe Scrivano
Date:   Thu Jan 23 17:01:39 2020 +0100

    arch: use gperf to generate a perfact hash to lookup syscall names

    This patch significantly improves the performance of
    seccomp_syscall_resolve_name since it replaces the expensive strcmp
    for each syscall in the database, with a lookup table.

    The complexity for syscall_resolve_num is not changed and it
    uses the linear search, that is anyway less expensive than
    seccomp_syscall_resolve_name as it uses an index for comparison
    instead of doing a string comparison.

    On my machine, calling 1000 seccomp_syscall_resolve_name_arch and
    seccomp_syscall_resolve_num_arch over the entire syscalls DB passed
    from ~0.45 sec to ~0.06s.

    PM: After talking with Giuseppe I made a number of additional
    changes, some substantial, the highlights include:
    * various style tweaks
    * .gitignore fixes
    * fixed subject line, tweaked the description
    * dropped the arch-syscall-validate changes as they were masking
      other problems
    * extracted the syscalls.csv and file deletions to other patches
      to keep this one more focused
    * fixed the x86, x32, arm, all the MIPS ABIs, s390, and s390x ABIs as
      the syscall offsets were not properly incorporated into this change
    * cleaned up the ABI specific headers
    * cleaned up generate_syscalls_perf.sh and renamed to
    * fixed problems with automake's file packaging

    Signed-off-by: Giuseppe Scrivano
    Reviewed-by: Tom Hromatka
    [PM: see notes in the "PM" section above]
    Signed-off-by: Paul Moore

The patch was merged and released, and building a seccomp profile now has O(n) complexity, where n is the number of syscalls in the profile.

The improvement is significant, with a sufficiently recent libseccomp:

# hyperfine 'crun run foo'
Benchmark 1: 'crun run foo'
  Time (mean ± σ):      28.9 ms ±   5.9 ms    [User: 16.7 ms, System: 4.5 ms]
  Range (min ... max):    19.1 ms ...  41.6 ms    73 runs

The user time is only 16.7ms. It used to be more than 40ms, and it was about 4ms when seccomp was not used at all.

So using 4.1ms as the user time cost without seccomp, we have:

time_used_by_seccomp_before = 43.0ms - 4.1ms = 38.9ms
time_used_by_seccomp_after = 16.7ms - 4.1ms = 12.6ms

More than 3 times faster! But syscall lookup is only part of what libseccomp does; a considerable amount of time is also spent compiling the BPF filter.

BPF filter compilation

Can we do better?

BPF filter compilation is done by the seccomp_export_bpf function, which is still quite expensive.

A simple observation is that most containers reuse the same seccomp configuration files over and over, with very little customization.

So it makes sense to cache the compilation result and reuse it where possible.

There is a pending pull request in crun to cache the results of BPF filter compilation. At the time of writing it has not been merged yet, although it is almost done.

With this, the cost of compiling the seccomp configuration file is only paid if the generated BPF filter is not in the cache, which is what we have now:

# hyperfine 'crun-from-the-future run foo'
Benchmark 1: 'crun-from-the-future run foo'
  Time (mean ± σ):       5.6 ms ±   3.0 ms    [User: 1.0 ms, System: 4.5 ms]
  Range (min ... max):     4.2 ms ...  26.8 ms    101 runs

In conclusion

Over five years, the total time needed to create and destroy an OCI container has dropped from nearly 160 milliseconds to just over 5 milliseconds.

That's almost a 30x improvement!


Posted by umguy on Tue, 28 Mar 2023 21:18:32 +1030