Swimming in a Pool of Containers: Lifeguards on Duty! - Part II

Read the first installment of this series here: Health Checking Applications in the Cloud – Part I

Introduction

In Part I of this series, we looked into how different cloud platforms check the healthiness of applications, containers, and virtual machines that they run. In particular, we reviewed the types of health checks offered on platforms like AWS, Kubernetes, Apache Mesos, and so on, as well as the knobs and configuration parameters they offer to customize health checking of applications.

Cloud Foundry is no different from other platforms in terms of the types of health checking it offers. Recently, health checks in Cloud Foundry changed from an implementation native to its RunStep model to an approach in which a separate binary process, running in a sidecar container beside the application container, performs the health check operations. In this final part of the series, we compare and contrast the old and new models of health checking in Diego, explain what justified the transition from the old model to the new one, and enumerate the pros and cons of the new strategy.

Quick Introduction to Cloud Foundry Runtime

Diego is one of the primary subsystems of Cloud Foundry. It manages the lifecycle of the containers that run and monitor Cloud Foundry applications. Clients submit tasks and processes to Diego’s Bulletin Board System (BBS). The BBS consults the Auctioneer component, whose primary role is to find, schedule, and then delegate execution of tasks to a Cell Rep. The Cell Rep takes on the responsibility of executing processes by talking to the Cloud Foundry container manager (the Garden server). The Cell Rep’s additional responsibilities are to create containers, deploy the code for the target process into the container, make the process available to the outside world by making it routable through Route-Emitters, and then manage and monitor the execution of the process while updating the status of the process on the BBS.

Process execution in Diego proceeds through steps that carry out the series of instructions required to accomplish a given task. The Diego executor implements around eighteen different types of steps; some of the primary ones are the Run Step, Download Step, Upload Step, Throttle Step, Monitor Step, and Background Step.

Cloud Foundry and Health Checks

The legacy model of health checking in Cloud Foundry uses the execution steps in the Diego runtime to have a monitor step run in parallel to the main application inside the container. The parallel monitor action ensures that, upon provisioning a container to run the application code, Diego also downloads a health check binary and supplies the arguments it needs to perform traffic checks. The long-running process associated with the application is considered ready when the periodic invocation of its monitor action succeeds. Conversely, the long-running process is considered to have failed when the periodic monitor action fails or when the application process terminates. By ensuring successful periodic execution of the monitor action, Diego achieves both readiness and liveness checks of the long-running process. The figure below depicts how the health check is implemented through a monitor action in Diego.

When a monitor action is sent from the Cloud Controller to the Diego API (BBS), the Cell Rep receives the monitor action information for the process to be run, downloads the health check binary, and streams it into the container. The health check process is then run periodically, and its exit status determines the status of the monitored process. However, baking health checks into the execution steps model that Diego implements introduces some limitations. In particular, we have identified the following issues with including health checks as part of Diego’s execution steps:

The overhead of invoking a separate process each time Diego wants to health check an application appears to be unsustainable on resource-constrained systems and can lead to cascading failures as more and more apps fail and place additional stress on the system.
The concept of Steps specifies too many imperative details of how clients need to specify a check and so unnecessarily constrains how Diego implements the desired check.
Without a native network-check action type, the network-based health checks must be specified by running a separate executable, currently supplied by an entirely different component (the Diego file-server). Extending this executable with additional functionality has proven to be challenging to coordinate across release versions.
While the RunAction is used to invoke both the app start command and the health check executable, the logging policies for these two commands are substantially different. App developers expect the entire output of the app start command to be emitted into the application log stream, but expect to see output from the health check executable only in the case when it explains an application instance failure. Expressing these two different policies within the conceptual constraints of defining the execution steps in Diego has been difficult, and at present does not satisfy the health check case adequately.
Both the Diego Cell Rep and Garden, quite reasonably, generate component logs when running a process through the Garden API. Running a process for every app instance every 30 seconds generates a substantial volume of logs that, in practice in a CF environment, have very little value and are expensive to transfer, aggregate, and process. Just as it is beneficial for app developers to have a strong signal about health check activity in application logs, it is beneficial for operators to have a strong signal around health checks in the component logs.
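To make the first and last of these issues concrete, the following is a minimal sketch of the legacy monitor semantics and of the process-invocation volume they imply. It is not Diego code: `legacy_monitor`, `MonitorResult`, and `invocations_per_hour` are hypothetical names, the start timeout is omitted for brevity, and the figure of 250 instances per cell is an assumption chosen only for the arithmetic. The 30-second interval is the one mentioned above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MonitorResult:
    states: List[str]   # state recorded after each check invocation
    invocations: int    # number of separate check processes started


def legacy_monitor(run_check: Callable[[], int], max_checks: int) -> MonitorResult:
    """Simulate a legacy-style monitor action (illustrative only).

    `run_check` stands in for launching one short-lived health check
    process and returns its exit status (0 means the check passed).
    """
    state = "starting"
    states: List[str] = []
    invocations = 0
    for _ in range(max_checks):
        invocations += 1              # every check is a new process
        exit_status = run_check()
        if exit_status == 0:
            state = "healthy"         # first success marks readiness
        elif state == "healthy":
            state = "failed"          # failure after readiness is a liveness failure
        states.append(state)
        if state == "failed":
            break
    return MonitorResult(states, invocations)


def invocations_per_hour(instances: int, interval_seconds: int = 30) -> int:
    """Check processes started per hour on one cell, one check per interval."""
    return instances * (3600 // interval_seconds)


if __name__ == "__main__":
    # Assumed load: 250 app instances on a cell, checked every 30 seconds.
    print(invocations_per_hour(250))  # 250 * 120 = 30000
```

Under these assumptions a single cell starts 30,000 short-lived check processes per hour, each of which also produces component log lines.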

While some of these issues could be addressed by further refinements to the Action language, the problems with process invocation overhead and over-specification of implementation would remain. For these reasons, the Diego team decided to adopt a more declarative approach to health checks in Cloud Foundry, which we review next.

Declarative Health Checks in Cloud Foundry’s Diego Runtime

Although the feature is still experimental, as of Diego version 1.20 the Diego runtime takes a more holistic approach by implementing declarative health checks. Diego’s Cell Rep performs the specified checks by invoking an executable inside the application instance container at the same time that it runs the application start command. The figure below shows how the new declarative health check is implemented in Diego.

This executable is intended to be a relatively long-lived process that performs multiple checks over the course of its lifetime. It operates in two modes:

Starting mode (Readiness Check): The executable runs the specified checks periodically on the specified intervals, with a specified overall timeout. Once each check succeeds individually, it returns with an exit code of 0 to indicate that the instance has succeeded in becoming ready. If the process reaches the overall timeout before all checks have succeeded, it emits output to stdout about those checks and then exits with a non-zero exit code to indicate failure.
Running mode (Liveness Check): The executable runs the specified checks periodically on the specified intervals, with no overall timeout. If one of the checks fails, the process reports that failure on its stdout and then exits with a non-zero status code to indicate failure. If this process exits for any other reason, the Cell Rep considers the check to have failed and stops the instance.
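The two modes above can be sketched as a pair of loops. This is an illustrative sketch rather than the actual health check executable: the function names, the dictionary of named check callables, the injectable `sleep` and `now` parameters, and the `max_rounds` bound (added only so the liveness loop can terminate in a demonstration) are all assumptions.

```python
import time
from typing import Callable, Dict, Optional

Check = Callable[[], bool]  # returns True when the check passes


def run_readiness(checks: Dict[str, Check], interval: float, timeout: float,
                  sleep: Callable[[float], None] = time.sleep,
                  now: Callable[[], float] = time.monotonic) -> int:
    """Starting mode: exit 0 once every check has succeeded at least once,
    or exit 1 if the overall timeout is reached first."""
    deadline = now() + timeout
    pending = dict(checks)
    while pending:
        # Drop each check from the pending set as soon as it succeeds.
        for name in [n for n, check in pending.items() if check()]:
            del pending[name]
        if not pending:
            return 0
        if now() >= deadline:
            # Report the checks that never succeeded, then signal failure.
            print("readiness timed out; still failing: " + ", ".join(pending))
            return 1
        sleep(interval)
    return 0


def run_liveness(checks: Dict[str, Check], interval: float,
                 sleep: Callable[[float], None] = time.sleep,
                 max_rounds: Optional[int] = None) -> int:
    """Running mode: no overall timeout; exit 1 on the first failing check."""
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        for name, check in checks.items():
            if not check():
                print("liveness check failed: " + name)
                return 1
        rounds += 1
        sleep(interval)
    return 0  # reachable only when max_rounds bounds the loop
```

In this sketch the only signal the caller receives is the return value, which mirrors the exit-code contract described above: silence means the checks are still passing.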

Similar to Consul, the new declarative model of health checking in Diego introduces an endpoint in the BBS API that accepts a Check Definition. The definition specifies the type of check to be performed (at the moment, TCP or HTTP) along with the parameters needed to evaluate it. The information in the definition is then passed to the executable binary on the Cell Rep, which conducts the network checks. A Check Definition is passed to Diego as part of the overall description of a long-running process and requires corresponding modifications to the Cloud Controller. In the new declarative health check, unlike the old model, the health check binary is already available on a Diego cell. Upon receiving a request for a container that requires health checking, the Cell Rep bind-mounts the health check binary into the container and invokes a long-running monitor step to perform the required type of check for the long-running process.
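As a rough illustration of what a declarative definition buys, the sketch below describes the checks as data and leaves the execution to a generic runner. The class and field names (`TCPCheck`, `HTTPCheck`, `CheckDefinition`, `connect_timeout_s`, and so on) are hypothetical and are not the BBS API schema; treating only an HTTP 200 response as success is likewise an assumption of this sketch.

```python
import http.client
import socket
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class TCPCheck:
    port: int
    connect_timeout_s: float = 1.0


@dataclass(frozen=True)
class HTTPCheck:
    port: int
    path: str = "/"
    request_timeout_s: float = 1.0


@dataclass(frozen=True)
class CheckDefinition:
    # The client states *what* to check; the runtime decides *how*.
    checks: Tuple[Union[TCPCheck, HTTPCheck], ...]


def run_tcp_check(check: TCPCheck, host: str = "127.0.0.1") -> bool:
    """Pass if a TCP connection to the port can be established."""
    try:
        with socket.create_connection((host, check.port),
                                      timeout=check.connect_timeout_s):
            return True
    except OSError:
        return False


def run_http_check(check: HTTPCheck, host: str = "127.0.0.1") -> bool:
    """Pass if GET <path> answers with status 200 (assumed success criterion)."""
    conn = http.client.HTTPConnection(host, check.port,
                                      timeout=check.request_timeout_s)
    try:
        conn.request("GET", check.path)
        return conn.getresponse().status == 200
    except (OSError, http.client.HTTPException):
        return False
    finally:
        conn.close()


def run_definition(definition: CheckDefinition, host: str = "127.0.0.1") -> bool:
    """Run every check in the definition; all must pass."""
    for check in definition.checks:
        if isinstance(check, TCPCheck):
            ok = run_tcp_check(check, host)
        else:
            ok = run_http_check(check, host)
        if not ok:
            return False
    return True
```

Because the definition carries no commands or binary paths, the runner could be replaced (for example, by an in-process implementation) without any change on the client side.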

The initial launch of the declarative health check was not completely smooth, though. We observed that, in what is likely a bug in the memory-locking structures in the Linux kernel, a container can get some of its processes or threads stuck in a D (disk sleep) state when it is just below its memory limit and just above its memory+swap limit. The inaccessible memory of those processes then causes other processes that read that memory through the `/proc` file system, such as monit, to hang.

Previously, when a new process was invoked for each health check iteration, Diego treated an unresponsive health check process as a failed health check. This stopped the long-running process and eventually terminated the application container, unwinding the problem. With the new and more passive interaction model of the declarative health check, Diego hears from the long-running liveness check only when that process exits. However, the health check process can itself get stuck in a D state, and since no news is good news with the declarative health check, Diego happily lets the container continue running. Other processes then hang while attempting to read its memory, and the problem cascades across the whole system.
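The difference between the two interaction models can be sketched with ordinary child processes. This is only an illustration of the decision logic, under stated assumptions: `legacy_check_once` and `passive_check_status` are hypothetical helpers, and a sleeping child stands in for a hung check. A process truly stuck in a D state cannot be killed, so it would not be cleaned up the way the sleeping child is here.

```python
import subprocess
import sys
from typing import List


def legacy_check_once(cmd: List[str], timeout_s: float) -> bool:
    """Per-invocation model: one short-lived check process per iteration.

    A check that does not finish within timeout_s counts as a failure.
    """
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode == 0
    except subprocess.TimeoutExpired:
        return False


def passive_check_status(proc: subprocess.Popen) -> str:
    """Passive model: the only signal is the long-running check exiting.

    While the process has not exited, nothing is reported, so the
    instance is assumed healthy ("no news is good news").
    """
    if proc.poll() is None:
        return "assumed healthy"
    return "failed"  # in running mode, any exit is treated as a failure


if __name__ == "__main__":
    # Stand-in for a hung health check: a child that just sleeps.
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]

    print(legacy_check_once(hung, timeout_s=0.5))  # hang detected as a failure

    proc = subprocess.Popen(hung)
    print(passive_check_status(proc))              # hang goes unnoticed
    proc.kill()
    proc.wait()
```

The passive model trades per-iteration process overhead for the loss of an implicit watchdog, which is why a stuck check process can go undetected.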

To avoid the problem, Diego uses the concept of sidecars in Garden containers and launches the health check binary in a sidecar container that runs alongside the application container. As a result, when the container for the main application runs out of memory and is stuck in a D state, it does not interfere with the execution of the health check binary, and the health check binary can detect the misbehaving application container and signal its destruction. The new implementation of Diego health checks is now deployed in production environments and is closely monitored to ensure a smooth transition.

Summary

In summary, we discussed how Cloud Foundry is transitioning from an older Action-based model, which runs health checks through Diego’s step-based execution model, to a stand-alone binary running alongside the application. We also discussed the benefits we believe Cloud Foundry will gain from this change to its execution model. However, as mentioned earlier, running a separate binary on an application container introduces challenges in using shared resources, which in turn may affect the overall behavior of the deployed application.

Issues like the D state problem mentioned above reveal that failures in a long-running health check process may have cascading effects that could destabilize the entire system. Furthermore, contention from doubling the number of long-running Garden processes in the container may be unsustainable or may lead to worse resource usage. To keep the overall behavior of the system manageable, it is important to monitor the actual condition of both the individual VMs and the entire deployments that Cloud Foundry manages. While the feature remains experimental, the Diego team is conducting more experiments with the declarative health check to ensure sufficient reliability before making it generally available.

This work is co-authored by Nima Kaviani (IBM), Christopher Piraino (VMware), and Michael Maximilien (IBM).

Special thanks to Eric Malm (VMware) for his significant contributions on the subject.