Engineering

The Garden Team and the Strange Case of a Connection Being Reset

We all know that the night is dark and full of flakes. There are different kinds of flakes though: some of them are like a refreshing walk in the park, some make you shiver like a sudden gust of wind on a cold winter day, others are as scary as the nightmare that wakes you up in the middle of the night in cold sweat over and over again.

Today I am going to tell you a story about a flake that felt like an emotional roller coaster that used to elevate us to the highest hopes and then throw us down into total despair and rejection, but also a story about unforeseen help from a friend. The bitter memory has made us put off writing this article for more than a year so that we almost lost context on what happened… but finally here it is.

Who are we?

We are Garden, the team that builds the Cloud Foundry container engine. Every time you want MOAR resources for your Cloud Foundry application, the container engine is there to give you MOAR containers so that your application gets even more EPIC.

Containers, especially the Linux ones, are quite exciting themselves. If you are new to containers I would highly recommend reading the excellent tutorial on DIY containers by my colleague Georgi. For this article to make sense, you should know that our container engine uses runc under the hood to jump through most of those Linux hoops.

The container engine runs on special Cloud Foundry VMs (called Diego cells) as a daemon that exposes an API to manage containers: create a container, delete a container, list containers, run a process in a container. Cloud Foundry users usually do not have access to that API; however, when they, for example, scale up an application, that eventually results in a create-container request to the container engine. The curious reader can have a look at the Cloud Foundry architecture docs for all the nitty-gritty.

And real quick: what’s a flake?

Conventional computers are (still) quite simple machines — they produce one and the same output when performing a task with identical input. Their only advantage over humans is that they are faster, do not get bored or tired, and do not complain. Therefore, developers expect that if an automated test succeeds once, then it will always succeed unless they change the implementation under test or the test input parameters. 

The expectation holds water pretty well for single threaded programs. Nowadays pure single threaded programs are on the brink of extinction; even simple automated tests utilise concurrent threads under the hood. In a multithreaded environment, time becomes a factor, which introduces a huge amount of volatility. Developers usually take that volatility into consideration when they write their programs and automated tests, but in the end they are just humans; they can only predict almost all execution paths. In the context of automated tests, almost means that tests would succeed almost always; they would only fail like once in a decade (or, when volatility is really high, once a month or once a week). As per Murphy's law, they never fail when developers are desperately running them over and over again trying to figure out why they sometimes fail.

We call such tests “flaky tests” or just “flakes.”

Once upon a time…

… there was a continuous integration (CI) environment which ran all Garden automated tests. As Garden is all about containers, most of those tests create at least one container. Even though containers themselves are exciting, they are quite useless if they do not contain anything; that’s why tests usually proceed with executing a process in that container. Unfortunately, sometimes process execution would fail with an error:

runc exec: exit status 1: exec failed: container_linux.go:348: starting container process caused "read init-p: connection reset by peer"

As it was pretty hard to reproduce the failure in our local test environment, we wanted to believe this was one of those things that do not happen to good people in production. Alas, our CI records showed the failure happened several times per week, so we had to bite the bullet and debug the crap out of it.

20180806: It can’t be a bug in Garden… right?

At the time those mysterious failures started to show up, the Garden process execution flow had not been changed for ages. Therefore the most obvious explanation was that we had bumped a dependency of ours, thus picking up someone else's bugs. And the most obvious culprit was Golang (one of the languages we use to talk to computers) itself. There was some tribal knowledge that we had not experienced the flake with Golang 1.9.4, so our first shot was to create a test environment running our tests against Golang 1.9.4 for a while, just to check whether the flake would show up there.

  • Hope level: 1
  • Desperation level: 0

20180809: No, it is not Golang

Three days later the flake happened on our Golang 1.9.4 test environment.

  • Hope level: 0
  • Desperation level: 1

20180820: Okay, let’s try to understand what is going on

As mentioned above, we had not changed the process execution flow in a long time, so it was time to roll up our sleeves and meet the monsters who dwell in runc. It turned out that the string init-p is actually the name of the parent pipe (hence -p) of a pipe pair that parent processes use to talk to their children and vice versa. The error message suggests that a parent process reading messages from its child over the parent pipe occasionally gets a connection reset error while doing so. Indeed, when runc executes a new process in the container, it sets up a pipe pair and uses it to talk to the child process.

Wait a minute, runc starting a new process in the container does make lots of sense, but why would it want to talk to it? Why wouldn’t runc just start a process and let it be itself and do its processy things? Well, if you have already read the DIY containers tutorial, you already know that processes in a container run in their own namespace (or set of Linux namespaces, if you want to be technically precise) that was created when the container itself was created. Therefore, when executing a process in the container, we want the new process to run in exactly that namespace.

Unfortunately, running processes in a given namespace in Linux is not that trivial:

  • First, runc needs to clone into a new process which will eventually execute the binary of the process we want to run in the container
  • Then, the cloned process needs to unshare the inherited namespaces and enter the container namespace
  • Finally, the cloned process executes the binary of the requested container process inside the container namespace

As a matter of fact, things are even more complex; Aleksa Sarai explains all the additional complexity in detail here. Don’t worry if you have no idea what he is talking about after reading the explanation for the first time; it took me months to wrap my head around it.
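
To get a rough feel for the “enter the namespaces, then exec” part, here is a minimal Go sketch that achieves a similar effect by shelling out to util-linux’s nsenter. To be clear, this is an illustration of the concept only, not what runc actually does; runc performs the clone/setns dance itself, in C, before the Go runtime even starts. The target PID and the command are placeholders:

// Illustration only: run a command inside the namespaces of an existing
// process (e.g. a container's init process) by delegating to nsenter.
// runc does NOT do this; it enters the namespaces itself via clone/setns.
package main

import (
    "fmt"
    "os"
    "os/exec"
)

func main() {
    if len(os.Args) < 3 {
        fmt.Fprintln(os.Stderr, "usage: nsrun <target-pid> <command> [args...]")
        os.Exit(1)
    }
    targetPid := os.Args[1]

    // Join the PID, mount, network, UTS and IPC namespaces of the target
    // process, then exec the requested command in there.
    args := append([]string{
        "--target", targetPid,
        "--pid", "--mount", "--net", "--uts", "--ipc",
        "--",
    }, os.Args[2:]...)

    cmd := exec.Command("nsenter", args...)
    cmd.Stdout = os.Stdout
    cmd.Stderr = os.Stderr
    if err := cmd.Run(); err != nil {
        fmt.Fprintf(os.Stderr, "failed to run the command in the target namespaces: %s\n", err)
        os.Exit(1)
    }
}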

Runc needs to keep track of all that machinery so that it eventually returns the PID of the process executed in the container to Garden. All the necessary communication and synchronization is carried out over the aforementioned pipe pairs through special JSON messages.
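
For the curious, here is a toy Go sketch of how the two ends of such a pair can exchange JSON messages. The message shape and field names below are invented for illustration and are much simpler than runc’s real protocol, and the child end is simulated with a goroutine instead of a forked process:

// Toy illustration of parent/child synchronisation over a socket pair using
// JSON messages; the message shape is made up and simpler than runc's.
package main

import (
    "encoding/json"
    "fmt"
    "os"

    "golang.org/x/sys/unix"
)

type syncMsg struct {
    Type string `json:"type"`
    Pid  int    `json:"pid,omitempty"`
}

func main() {
    // A socket pair gives us two connected file descriptors, one per side.
    fds, err := unix.Socketpair(unix.AF_UNIX, unix.SOCK_STREAM|unix.SOCK_CLOEXEC, 0)
    if err != nil {
        panic(err)
    }
    parentEnd := os.NewFile(uintptr(fds[0]), "init-p")
    childEnd := os.NewFile(uintptr(fds[1]), "init-c")

    // In runc the child end lives in a separate (forked) process; a goroutine
    // keeps this example self-contained.
    go func() {
        json.NewEncoder(childEnd).Encode(syncMsg{Type: "procReady", Pid: os.Getpid()})
        childEnd.Close()
    }()

    // The parent blocks here decoding the child's message. In runc, this is
    // the kind of read that fails with "read init-p: connection reset by peer"
    // when the child dies unexpectedly.
    var msg syncMsg
    if err := json.NewDecoder(parentEnd).Decode(&msg); err != nil {
        fmt.Println("read init-p:", err)
        return
    }
    fmt.Printf("child sent %q (pid %d)\n", msg.Type, msg.Pid)
}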

Back to our flake: it seems that for some reason the new container process terminates unexpectedly and therefore the parent process (i.e. runc) is getting a connection reset while trying to read messages from its child over the pipe pair.

  • Hope level: 0
  • Desperation level: 5 (due to all that process-namespace madness)

20180821 (morning): Maybe someone already fixed that for us? 🤞

Why waste time trying to fix something if someone already did that for us? Let’s bump runc to latest. Hopefully that should do it.

  • Hope level: 1
  • Desperation level: 5

20180821 (afternoon): No, it is not fixed

Our CI reproduced the flake almost immediately 🙁

  • Hope level: 0
  • Desperation level: 6

20180822: Logs or it did not happen!

We find runc not really chatty, which often leaves us puzzled about what a problem could be. We decided that we needed to update our reproduction CI environment with a custom runc build that produced extra log lines in interesting places. That should have given us all the details we needed to figure things out!

  • Hope level: 1
  • Desperation level: 6

~20180917: Fast-forward about a month

The additional logs did not help us explain why the child process dies.

  • Hope level: 0
  • Desperation level: 10

20180918: The corrupted message

Remember the pipe pairs and the JSON messages parent and child processes use to talk to each other? If you have ever written code that communicates via JSON, you know that JSON marshalling/unmarshalling is an inevitable part of the game.

We came across an old moby issue; according to the people involved, the writing end of the pipe encodes a new line (\n) as the last character in the message. But that last character does not always get into the reader’s decoding buffer. Therefore it may happen that the reader has reached the last useful character (}) of the JSON message and thinks that the message is over; however, it then finds a leftover new line character remaining in the pipe and freaks out. Maybe we could inspect our inter-process communication very-VERY-VERY closely and find where such a new line could come from.
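
Incidentally, that trailing newline is easy to see for yourself: Go’s encoding/json Encoder terminates every encoded value with one. A tiny, standalone example (the message content is made up):

// encoding/json's Encoder appends a newline after every encoded value.
package main

import (
    "bytes"
    "encoding/json"
    "fmt"
)

func main() {
    var buf bytes.Buffer
    json.NewEncoder(&buf).Encode(map[string]string{"type": "procReady"})
    fmt.Printf("%q\n", buf.String()) // prints "{\"type\":\"procReady\"}\n" (note the trailing \n)
}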

  • Hope level: 1
  • Desperation level: 10

20180927: Let’s get rid of that new line

The only place runc would add an additional new line to the parent-child JSON message is here. Let’s update our custom runc build in our CI with a version that has this line removed.

  • Hope level: 2
  • Desperation level: 10

20181003: You guessed it – still not fixed

Well, it happened again…

  • Hope level: 0
  • Desperation level: 12

20181017: (Get By) With a Little Help from My Friends

Several days later I was working on an unrelated bug reported by our Diego friends. In short, their tests were fine; however, they noticed a goroutine leak caused by Garden after running their test suite.

A few hours later we managed to isolate a test that caused such a goroutine leak. The test starts a container with a PID limit of 1, then runs a process and expects that to fail. I am not going to discuss container limits here, but a PID limit of 1 means that there cannot be more than one process running in the container. As a container always has an init process (started once the container is created), the limit is already exhausted, hence it makes sense that running another process fails.

Setting the maximum number of processes in a container is implemented via the PID cgroup. Each container is created in its own cgroup; setting the PID limit of a container actually writes the limit number into pids.max of the container’s PID cgroup.
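
Roughly speaking (the cgroup v1 layout and the garden/mycontainer cgroup name below are assumptions for illustration, not Garden’s actual paths), that boils down to something like:

// Rough illustration: limiting the number of pids in a container amounts to
// writing the limit into the pids.max file of the container's pids cgroup.
// The cgroup path below assumes a cgroup v1 layout.
package main

import (
    "fmt"
    "os"
    "path/filepath"
)

func setPidLimit(containerCgroup string, limit int) error {
    pidsMax := filepath.Join("/sys/fs/cgroup/pids", containerCgroup, "pids.max")
    return os.WriteFile(pidsMax, []byte(fmt.Sprintf("%d\n", limit)), 0644)
}

func main() {
    if err := setPidLimit("garden/mycontainer", 10); err != nil {
        fmt.Fprintln(os.Stderr, "failed to set pid limit:", err)
        os.Exit(1)
    }
    fmt.Println("pid limit set")
}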

Okay, let’s bash out a program (full source code here) that does that; we can check for goroutine leaks after it is done (the program takes a container name and a PID limit as parameters):

func main() {
    // Boring initializations
    container, err := createContainer(gardenClient, containerName, pidLimit)
    // Boring error handling
    processSpec := garden.ProcessSpec{
        Path: "echo",
        Args: []string{"hi"},
    }
    processExitCode, err := runProcess(container, processSpec)
    if err != nil {
        fmt.Printf("failed to run the container process: %s\n", err.Error())
        return
    }
    if processExitCode != 0 {
        fmt.Printf("process exited with non-zero exit code %d", processExitCode)
        return
    }
    fmt.Println("container process finished successfully")
    // More boring stuff
}

The code above would create a container and run echo hi in that container, easy peasy! Let’s run it!

# ./bin/conn-res mycontainer 1
failed to run the container process: runc exec: exit status 1: exec failed: container_linux.go:346: starting container process caused "read init-p: connection reset by peer"

WAIT! WHAAAAAAT!? HOW? WHY? Did we just manage to reproduce our flake!?

  • Hope level: 100
  • Desperation level: 12

20181018: The investigation

Let’s see whether our reproduction is reliable:

# ./bin/conn-res mycontainer 1
failed to run the container process: runc exec: exit status 1: exec failed: container_linux.go:346: starting container process caused "read init-p: connection reset by peer"

Hmmm… let’s try with 10

# ./bin/conn-res mycontainer 10
container process finished successfully

So what is the PID limit when the connection reset error starts to appear? Let’s play with the PID limit values:

# ./bin/conn-res mycontainer 9
container process finished successfully
# ./bin/conn-res mycontainer 8
container process finished successfully
# ./bin/conn-res mycontainer 7
failed to run the container process: runc exec: exit status 1: exec failed: container_linux.go:346: starting container process caused "read init-p: connection reset by peer"
# ./bin/conn-res mycontainer 7
failed to run the container process: runc exec: exit status 1: exec failed: container_linux.go:346: starting container process caused "read init-p: connection reset by peer"
# ./bin/conn-res mycontainer 7
failed to run the container process: runc exec: exit status 1: exec failed: container_linux.go:346: starting container process caused "read init-p: connection reset by peer"
# ./bin/conn-res mycontainer 7
container process finished successfully
# ./bin/conn-res mycontainer 8
container process finished successfully
# ./bin/conn-res mycontainer 8
container process finished successfully
# ./bin/conn-res mycontainer 8
container process finished successfully
# ./bin/conn-res mycontainer 8
failed to run the container process: runc exec: exit status 1: exec failed: container_linux.go:346: starting container process caused "read init-p: connection reset by peer"

Interesting! It seems that a PID limit of 7 or 8 makes process execution flaky. But why? We are running a single process, i.e. a total of 2 processes (the container init and echo), so a PID limit of 7 or 8 should be pretty much enough, right? After some digging in the runc code and various experimental programs in C we found out that:

  • The PID cgroup limit is applied to threads started by a process (via e.g. fork, clone, pthread_create), not only to processes themselves
  • There is a race in runc
    • the new process is started without being waited for (here)
    • as the process runs, runc puts it into the limited container cgroup (here)

The conditions above create a race: the new process can join the restricted cgroup too early, while the Golang runtime is still initializing and creating its internal threads. Once the pids limit is hit, thread creation fails, the child process dies, and the parent gets the connection reset while reading from its pipe.
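
The claim that the Go runtime creates internal threads right at startup is easy to check; the exact count depends on the Go version and the machine, but a trivial program shows it is comfortably more than one:

// Print how many OS threads the Go runtime has created by the time main runs.
package main

import (
    "fmt"
    "runtime/pprof"
)

func main() {
    fmt.Println("OS threads created so far:", pprof.Lookup("threadcreate").Count())
}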

In order to prove that, we added a 100ms sleep before the process is joined to the cgroup, which significantly reduced the failure rate when the PID limit is 7. Removing the code that joins the cgroup “fixed” it entirely.

  • Hope level: 100
  • Desperation level: 9
  • Satisfaction level: 9999999999999999

The aftermath

Once we got to the bottom of it, we had a closer look at the flake history in our CI. And guess what? All the tests that flaked set the container PID limit to 10 or 20. We should have spotted the pattern earlier, doh!

It is now all clear: you should never set your container limits to ridiculously low values. In the end it turned out that this flake is indeed one of those things that do not happen to good people in production: good people never set their PID limits to 1.

As usual, we wanted to make the world a better place, so we raised this “documentation issue” with runc to share the knowledge with the community.

As for the flake itself, we just had to increase the PID limit to 50 in our tests and kiss it goodbye!

Closing Credits

This write-up would never have been written without the help and support of all the gardeners on the team at the time of this thriller:

  • Claudia Beresford
  • Yulia Nedyalkova
  • Georgi Sabev
  • Tom Godkin
  • Giuseppe Capizzi
  • Oliver Stenbom
  • Danail Branekov

Product Manager: Julz Friedman

A huge THANK YOU! goes to our Diego friends who pointed us to the flake root cause by accident.

 

Danail Branekov, AUTHOR