Manjusaka

Manjusaka

Continue discussing the number one process in the container

After discussing some overviews of the init process in containers in last week's article, under the guidance and collaboration of my mentor, Mr. Chuan (you can find him on GitHub, jschwinger23), we explored the implementations of two widely used init processes in mainstream containers: dumb-init and tini, and I’m continuing to write this commentary.

Main Text#

Why do we need an init process, and what responsibilities do we expect it to undertake?#

Before continuing the discussion on dumb-init and tini, we need to review a question: Why do we need an init process? And what responsibilities should the init process we choose undertake?

In fact, there are two main scenarios where we need an init process to be hosted in front in the container context:

  1. For scenarios involving graceful upgrades of binaries within the container, one mainstream approach is to fork a new process, exec a new binary file, with the new process handling new connections and the old process handling old connections. (Nginx adopts this solution)

  2. Situations where signals are not properly forwarded and processes are not correctly reaped.

  3. In some cases like calico-node, for the sake of convenience in packaging, we run multiple binaries in the same container.

There isn’t much to say about the first scenario, so let’s look at the testing for the second point.

First, we prepare the simplest Python file, demo1.py

import time

time.sleep(10000)

Then, as usual, we wrap it in a bash script.

#!/bin/bash

python /root/demo1.py

Finally, we write the Dockerfile.

FROM python:3.9

ADD demo1.py /root/demo1.py
ADD demo1.sh /root/demo1.sh

ENTRYPOINT ["bash", "/root/demo1.sh"]

After building and starting execution, let’s first look at the process structure.

Process Structure

No problem, now we use strace to trace the two processes 2049962 and 2050009, and then send a SIGTERM signal to the bash process 2049962.

Let’s look at the results.

Trace Result of Process 2049962

Trace Result of Process 2050009

We can clearly see that when process 2049962 receives SIGTERM, it does not forward it to process 2050009. After we manually SIGKILL 2049962, 2050009 also exits immediately. Some may wonder why 2050009 exits after 2049962 exits.

This is due to the characteristics of the PID namespace. Let’s take a look at the relevant introduction in pid_namespaces:

If the "init" process of a PID namespace terminates, the kernel terminates all of the processes in the namespace via a SIGKILL signal.

When the init process in the current PID namespace exits, the kernel directly SIGKILLs the remaining processes in that PID namespace.

OK, when we combine this with the container scheduling framework, many pitfalls can arise in production. Here’s a segment of my previous complaint:

We have a test service, Spring Cloud, where after going offline, the node cannot be removed from the registry, and after much confusion, we found the problem. Essentially, when the POD is removed, the K8S Scheduler sends a SIGTERM signal to the POD's ENTRYPOINT and waits thirty seconds (the default graceful shutdown timeout), and if there’s no response, it will SIGKILL directly.
The issue is that our Eureka version of the service starts via start.sh, ENTRYPOINT ["/home/admin/start.sh"], and the default shell in the container is /bin/sh in fork/exec mode, causing my service process to not correctly receive the SIGTERM signal and thus be SIGKILLed.

Isn’t that frustrating? Besides the inability to handle signal forwarding properly, a common issue with applications is the appearance of Z processes, where child processes cannot be correctly reaped after they finish. For example, the notorious Z process issue with early puppeteer. In such cases, aside from issues with the application itself, another possible reason is that in daemon processes, orphaned processes do not have the capability to reap child processes after being re-parented.

OK, after reviewing the common issues above, let’s review the responsibilities that the init process in a container should undertake:

  1. Signal forwarding

  2. Reaping Z processes

Currently, in container scenarios, people mainly use two solutions as their init processes: dumb-init and tini. Both solutions handle orphaned and Z processes in containers reasonably well. However, the implementation of signal forwarding is quite complicated. Now, let’s get to the discussion!

The Flaws of dumb-init#

To some extent, dumb-init is a prime example of false advertising. The code implementation is very rough.

Let’s take a look at the official promotion:

dumb-init runs as PID 1, acting like a simple init system. It launches a single process and then proxies all received signals to a session rooted at that child process.

Here, dumb-init claims to use process sessions in Linux. We all know that a process session, by default, shares a Process Group ID. So we can understand that dumb-init can forward signals to every process in the process group. Sounds great, right?

Let’s test it out.

The test code is as follows, demo2.py:

import os
import time

pid = os.fork()
if pid == 0:
    cpid = os.fork()
time.sleep(1000)

The Dockerfile is as follows:

FROM python:3.9

RUN wget -O /usr/local/bin/dumb-init https://github.com/Yelp/dumb-init/releases/download/v1.2.5/dumb-init_1.2.5_x86_64
RUN chmod +x /usr/local/bin/dumb-init

ADD demo2.py /root/demo2.py

ENTRYPOINT ["/usr/local/bin/dumb-init", "--"]

CMD ["python", "/root/demo2.py"]

Build and run, let’s first look at the process structure.

Process Structure of demo2

Then, as usual, we use strace on processes 2103908, 2103909, and 2103910, and then we send a SIGTERM signal to the dumb-init process.

strace 2103908

strace 2103909

strace 2103910

Hey? What happened, dumb-init? Why did 2103909 get SIGKILL directly without receiving SIGTERM?

Here we need to look at the key implementation of dumb-init.

void handle_signal(int signum) {
    DEBUG("Received signal %d.\n", signum);

    if (signal_temporary_ignores[signum] == 1) {
        DEBUG("Ignoring tty hand-off signal %d.\n", signum);
        signal_temporary_ignores[signum] = 0;
    } else if (signum == SIGCHLD) {
        int status, exit_status;
        pid_t killed_pid;
        while ((killed_pid = waitpid(-1, &status, WNOHANG)) > 0) {
            if (WIFEXITED(status)) {
                exit_status = WEXITSTATUS(status);
                DEBUG("A child with PID %d exited with exit status %d.\n", killed_pid, exit_status);
            } else {
                assert(WIFSIGNALED(status));
                exit_status = 128 + WTERMSIG(status);
                DEBUG("A child with PID %d was terminated by signal %d.\n", killed_pid, exit_status - 128);
            }

            if (killed_pid == child_pid) {
                forward_signal(SIGTERM);  // send SIGTERM to any remaining children
                DEBUG("Child exited with status %d. Goodbye.\n", exit_status);
                exit(exit_status);
            }
        }
    } else {
        forward_signal(signum);
        if (signum == SIGTSTP || signum == SIGTTOU || signum == SIGTTIN) {
            DEBUG("Suspending self due to TTY signal.\n");
            kill(getpid(), SIGSTOP);
        }
    }
}

This is the signal handling code of dumb-init. Upon receiving a signal, it forwards all signals except SIGCHLD (note that SIGKILL is an unhandleable signal). Let’s look at the signal forwarding logic.

void forward_signal(int signum) {
    signum = translate_signal(signum);
    if (signum != 0) {
        kill(use_setsid ? -child_pid : child_pid, signum);
        DEBUG("Forwarded signal %d to children.\n", signum);
    } else {
        DEBUG("Not forwarding signal %d to children (ignored).\n", signum);
    }
}

By default, it directly sends signals using kill, where -child_pid has this characteristic:

If pid is less than -1, then sig is sent to every process in the process group whose ID is -pid.

Directly forwarding to the process group seems fine, right? So what’s the reason for the issue? Let’s revisit the previous statement: the logic of sending signals to the process group is sig is sent to every process. Got it, it’s an O(N) traversal. No problem, right? Well, here’s the catch: dumb-init's implementation has a race condition.

As we just mentioned, sending a signal to the process group is an O(N) traversal, so some processes will receive the signal before others. For example, assuming our dumb-init's child process receives SIGTERM first, gracefully exits, and then dumb-init receives the SIGCHLD signal, it waits for the child process ID, determines it is a process it directly manages, and then exits. Since dumb-init is the init process in our current PID namespace, let’s revisit the characteristics of the PID namespace.

If the "init" process of a PID namespace terminates, the kernel terminates all of the processes in the namespace via a SIGKILL signal.

After dumb-init exits, the remaining processes will be directly SIGKILLed by the kernel. This leads to the situation we observed where the child process did not receive the forwarded signal!

So let’s emphasize this: the claim made by dumb-init that it can forward signals to all processes is completely false advertising!

Moreover, please note that dumb-init claims to manage processes within a session! However, in reality, they only perform signal forwarding for a process group! This is completely false advertising! Fake News!

Additionally, as mentioned above, in scenarios like hot updating binaries, dumb-init directly commits suicide after the process exits. This is no different from not using an init process at all!

We can test this with the following code, demo3.py:

import os
import time

pid = os.fork()
time.sleep(1000)

Forking one process results in two processes total.

The Dockerfile is as follows:

FROM python:3.9

RUN wget -O /usr/local/bin/dumb-init https://github.com/Yelp/dumb-init/releases/download/v1.2.5/dumb-init_1.2.5_x86_64
RUN chmod +x /usr/local/bin/dumb-init

ADD demo3.py /root/demo3.py

ENTRYPOINT ["/usr/local/bin/dumb-init", "--"]

CMD ["python", "/root/demo3.py"]

Build and execute, let’s first look at the process structure.

Process Structure of demo3

Then, simulating the old process exiting, we directly SIGKILL 2134836, and let’s look at the strace result of 2134837.

strace 2134837

As expected, after dumb-init commits suicide, 2134837 is SIGKILLed by the kernel.

So let’s review dumb-init’s flaws! Now, let’s discuss the implementation of tini.

A Friendly Discussion on Tini#

Fairly speaking, the implementation of tini, while still having some flaws, is much more refined than dumb-init. Let’s take a look at the code.

while (1) {
    /* Wait for one signal, and forward it */
    if (wait_and_forward_signal(&parent_sigset, child_pid)) {
        return 1;
    }

    /* Now, reap zombies */
    if (reap_zombies(child_pid, &child_exitcode)) {
        return 1;
    }

    if (child_exitcode != -1) {
        PRINT_TRACE("Exiting: child has exited");
        return child_exitcode;
    }
}

First, tini does not set a signal handler; it continuously loops through wait_and_forward_signal and reap_zombies.

int wait_and_forward_signal(sigset_t const* const parent_sigset_ptr, pid_t const child_pid) {
    siginfo_t sig;

    if (sigtimedwait(parent_sigset_ptr, &sig, &ts) == -1) {
        switch (errno) {
            case EAGAIN:
                break;
            case EINTR:
                break;
            default:
                PRINT_FATAL("Unexpected error in sigtimedwait: '%s'", strerror(errno));
                return 1;
        }
    } else {
        /* There is a signal to handle here */
        switch (sig.si_signo) {
            case SIGCHLD:
                /* Special-cased, as we don't forward SIGCHLD. Instead, we'll
                 * fallthrough to reaping processes.
                 */
                PRINT_DEBUG("Received SIGCHLD");
                break;
            default:
                PRINT_DEBUG("Passing signal: '%s'", strsignal(sig.si_signo));
                /* Forward anything else */
                if (kill(kill_process_group ? -child_pid : child_pid, sig.si_signo)) {
                    if (errno == ESRCH) {
                        PRINT_WARNING("Child was dead when forwarding signal");
                    } else {
                        PRINT_FATAL("Unexpected error when forwarding signal: '%s'", strerror(errno));
                        return 1;
                    }
                }
                break;
        }
    }

    return 0;
}

Using sigtimedwait to receive signals, it filters out SIGCHLD for forwarding.

int reap_zombies(const pid_t child_pid, int* const child_exitcode_ptr) {
    pid_t current_pid;
    int current_status;

    while (1) {
        current_pid = waitpid(-1, &current_status, WNOHANG);

        switch (current_pid) {

            case -1:
                if (errno == ECHILD) {
                    PRINT_TRACE("No child to wait");
                    break;
                }
                PRINT_FATAL("Error while waiting for pids: '%s'", strerror(errno));
                return 1;

            case 0:
                PRINT_TRACE("No child to reap");
                break;

            default:
                /* A child was reaped. Check whether it's the main one. If it is, then
                 * set the exit_code, which will cause us to exit once we've reaped everyone else.
                 */
                PRINT_DEBUG("Reaped child with pid: '%i'", current_pid);
                if (current_pid == child_pid) {
                    if (WIFEXITED(current_status)) {
                        /* Our process exited normally. */
                        PRINT_INFO("Main child exited normally (with status '%i')", WEXITSTATUS(current_status));
                        *child_exitcode_ptr = WEXITSTATUS(current_status);
                    } else if (WIFSIGNALED(current_status)) {
                        /* Our process was terminated. Emulate what sh / bash
                         * would do, which is to return 128 + signal number.
                         */
                        PRINT_INFO("Main child exited with signal (with signal '%s')", strsignal(WTERMSIG(current_status)));
                        *child_exitcode_ptr = 128 + WTERMSIG(current_status);
                    } else {
                        PRINT_FATAL("Main child exited for unknown reason");
                        return 1;
                    }

                    // Be safe, ensure the status code is indeed between 0 and 255.
                    *child_exitcode_ptr = *child_exitcode_ptr % (STATUS_MAX - STATUS_MIN + 1);

                    // If this exitcode was remapped, then set it to 0.
                    INT32_BITFIELD_CHECK_BOUNDS(expect_status, *child_exitcode_ptr);
                    if (INT32_BITFIELD_TEST(expect_status, *child_exitcode_ptr)) {
                        *child_exitcode_ptr = 0;
                    }
                } else if (warn_on_reap > 0) {
                    PRINT_WARNING("Reaped zombie process with pid=%i", current_pid);
                }

                // Check if other childs have been reaped.
                continue;
        }

        /* If we make it here, that's because we did not continue in the switch case. */
        break;
    }

    return 0;
}

Then in the reap_zombies function, it continuously uses waitpid to handle processes, exiting the loop only when there are no child processes to wait for or when encountering other system errors.

Note the difference in implementation between tini and dumb-init: dumb-init commits suicide after reaping its entry child process, while tini will only exit the loop after all its child processes have exited, then determine whether to commit suicide.

Now let’s test it out.

Using the demo2 example, let’s test the grandchild process scenario.

FROM python:3.9

ADD demo2.py /root/demo2.py
ENV TINI_VERSION v0.19.0
ADD https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini /tini
RUN chmod +x /tini

ENTRYPOINT [ "/tini","-s", "-g", "--"]
CMD ["python", "/root/demo2.py"]

Then build and execute, the process structure is as follows.

Process Structure of demo2-tini

Then, as usual, we use strace, kill and send SIGTERM to see what happens.

strace 2160093

strace 2160094

strace 2160095

Hmm, as expected. So does this mean there are no issues with tini's implementation? Let’s prepare another example, demo4.py.

import os
import time
import signal
pid = os.fork()
if pid == 0:
    signal.signal(15, lambda _, __: time.sleep(1))
    cpid = os.fork()
time.sleep(1000)

Here we use time.sleep(1) to simulate the program needing to handle SIGTERM gracefully. Then we prepare the Dockerfile.

FROM python:3.9

ADD demo4.py /root/demo4.py
ENV TINI_VERSION v0.19.0
ADD https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini /tini
RUN chmod +x /tini

ENTRYPOINT [ "/tini","-s", "-g", "--"]
CMD ["python", "/root/demo4.py"]

Then build and run, let’s look at the process structure, it’s quite fast.

Process Structure of demo4

Then we use strace, sending SIGTERM in a series.

strace 2173315

strace 2173316

strace 2173317

We find that processes 2173316 and 2173317 successfully received the SIGTERM signal, but while processing, they were SIGKILLed. So why is this happening? In fact, there is a potential race condition here.

When we start using tini, after 2173315 exits, 2173316 will be re-parented.

According to the kernel's re-parenting process, 2173317 is re-parented to the tini process.

However, when tini uses waitpid, it uses the WNOHANG option, so if the child process has not yet exited when executing waitpid, it will immediately return 0. This causes it to exit the loop and start the suicide process.

Isn’t that frustrating? My mentor and I raised an issue about this: tini Exits Too Early Leading to Graceful Termination Failure.

I also made a patch for it; you can refer to use new threading to run waitpid (still in PoC, no unit tests written, and the handling is a bit rough).

In fact, the idea is simple: we don’t use the WNOHANG option in waitpid, making it a blocking call, and then use a new thread to handle waitpid.

The testing effect of this build is as follows.

Process Structure of demo5

strace 1808102

strace 1808104

strace 1808105

Hmm, as expected, the test has no issues.

Of course, attentive friends may notice that the original tini also cannot handle binary updates, for the same reason as in demo5. You can test this yourself.

In fact, my handling is quite rough and aggressive; we just need to ensure that tini's exit condition becomes it must wait until waitpid()=-1 && errno==ECHILD before exiting. The specific implementation method is something everyone can think about (there are actually quite a few).

Finally, let’s summarize the core of the issue:

Both dumb-init and tini, in their current implementations, commit the same error: in the special scenario of containers, they do not wait for all descendant processes to exit before exiting. The solution is quite simple; the exit condition must be waitpid()=-1 && errno==ECHILD.

Conclusion#

This article criticized dumb-init and tini. While dumb-init's implementation is indeed flawed, tini's implementation is much more refined. However, tini still exhibits unreliable behavior, and the expected use case of forked binary updates cannot be achieved with either dumb-init or tini. Moreover, both dumb-init and tini currently share a common limitation: they cannot handle the situation where child process groups escape (for example, if ten child processes each escape into their own process group).

Additionally, in the tests in this article, we used time.sleep(1) to simulate graceful shutdown behavior, and tini also fails to meet the requirements. So...

Ultimately, the bottom line is that the application’s signals and process reaping should be self-determined. Any reliance on an init process for handling these basic behaviors is irresponsible in production. (If you really want an init process, just use tini; definitely don’t use dumb-init.)

So, the bare exec method is the way to go; no need for an init process for safety!

That’s about it for this commentary. This piece took nearly a week of my spare time from raising the issue to verifying the conclusion and patching the PoC (the first draft was completed after 4 AM). Finally, thanks to Mr. Chuan for working late into the night with me. I hope you enjoy reading!

Loading...
Ownership of this post data is guaranteed by blockchain and smart contracts to the creator alone.