Health Checking Applications in the Cloud

The primary role of any PaaS is to keep the applications it manages healthy and available. A platform generally achieves this by creating dedicated containers for the applications, deploying application code, running application processes, and continuously monitoring the pool of running containers. To guard applications against outages, the platform must therefore monitor and report on the health and behavior of the containers it manages.

This becomes even more critical once you realize that a PaaS can potentially manage hundreds of thousands of containers and their deployed processes across many execution nodes. The platform must take timely measures to discover failures, recover containers, and notify operators of these failures and recoveries.

While all existing platforms provide discovery, recovery, and notification mechanisms to deal with failing containers, there are subtle differences in how they implement these measures. In Cloud Foundry, we have studied the approaches taken by other platforms and have settled on a series of strategies to ensure the availability of applications and their corresponding containers.

In this first blog post, we review the general methods of doing health checks in various cloud platforms. In a follow-up post, we explain in detail how Cloud Foundry transitioned from an older health check model to a newer one that is more in line with state-of-the-art health checking strategies.

Understanding Health Checks

If you are an optimist hoping that your service will be seamlessly deployed and remain up and running forever, let us be the bearers of bad news. Sooner or later, at some point in its lifecycle, your service will either fail to start or stop functioning. Health checks are among the immediate remedies a cloud provider offers to detect and report a service failure early, so that corrective actions can be taken. Whether or not you know it already, any cloud platform you use offers some sort of health checking. IaaS platforms like Amazon Web Services (AWS) [1] and Google Compute Engine (GCE) [2] offer health checks that monitor the health of VMs placed behind load balancers to ensure routability and availability of services. Platforms like Cloud Foundry [3], Kubernetes [4], Mesos [5], Docker [6], and Consul [7] also offer health checks for the containers they manage and run.

Health checks in an IaaS ensure service availability on a VM, while health checks in a platform ensure service availability in a container. Health checks come in different forms and types, but the kind that IaaS and PaaS have most in common is health checking applications and services that receive network traffic. This is primarily because the most common applications running on a PaaS are twelve-factor apps (e.g., web apps) that receive and respond to network traffic. Most IaaS offerings also provide features (e.g., load balancer configurations) that monitor the health of VMs handling ingress and egress network traffic. Before getting into the details of how health checks are implemented in Cloud Foundry, let’s review how health checks are done in cloud platforms.

Generally, application-level health checks perform either script checks or connection checks. Script checks, as their name indicates, run a script and validate its execution by verifying its exit status code. An exit status code of zero is considered a success, while any other code is considered a failure. Connection checks, on the other hand, attempt to establish a connection to an endpoint and validate the success of the connection. Connection checks test various protocols such as TCP, HTTP, HTTPS, and SSL/TLS. For TCP and SSL/TLS, establishing a connection to the given port counts as a successful health check. For SSL/TLS connections, this also implies successfully verifying the certificate chain against trusted certificate authorities and completing the key exchange. A check may further require that a certain request be sent, or a certain response received, over the established connection before the connection is deemed healthy. For HTTP and HTTPS, the health check can be configured to hit a certain endpoint and expect a specific HTTP status code to determine success or failure.
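As a rough illustration of the check types described above, here is a minimal Python sketch using only the standard library. The function names, default timeouts, and the default expected status of 200 are our own assumptions for illustration; they do not reflect how any particular platform implements its checks.

```python
import http.client
import socket
import ssl
import subprocess
import urllib.error
import urllib.request


def script_check(command, timeout=5.0):
    """Script check: healthy only if the command exits with status 0 in time."""
    try:
        result = subprocess.run(
            command,
            timeout=timeout,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Timed out, or the command could not be started at all.
        return False
    return result.returncode == 0


def tcp_check(host, port, timeout=1.0):
    """TCP check: healthy if a connection to the port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def tls_check(host, port, timeout=1.0):
    """TLS check: healthy if the handshake, including certificate
    verification against the system trust store, completes."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host):
                return True
    except OSError:  # ssl.SSLError and socket timeouts are OSError subclasses
        return False


def http_check(url, expected_status=200, timeout=1.0):
    """HTTP(S) check: healthy if the endpoint answers with the expected status.

    Note: urlopen follows redirects, so 3xx codes are not observed directly.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == expected_status
    except urllib.error.HTTPError as err:
        # Non-2xx responses are raised as HTTPError; compare their code too.
        return err.code == expected_status
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False
```

For example, `tcp_check("127.0.0.1", 8080)` would report whether something is listening on a hypothetical local port 8080, and `http_check("http://127.0.0.1:8080/health")` would additionally require a 200 response from a hypothetical `/health` endpoint.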

Configuring Health Checks

Health checks can be configured with a series of parameters to determine the type of validation, frequency, and tolerance to failures. Different platforms generally use a subset of the configuration parameters below to set up health checks:

Interval: how frequently the health check is performed
Grace period: how long to wait before starting health checks
Unhealthy threshold: number of consecutive failed health checks permitted before the health check is considered to have failed
Healthy threshold: number of consecutive successful health checks expected before the health check is considered to have succeeded
Timeout: duration after which a health check that has not completed is considered to have failed
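To make the interplay of these parameters concrete, the sketch below models them in Python. The class and function names, the default values, and the exact threshold semantics (the state flips on the Nth consecutive contradicting result) are assumptions made for illustration; real platforms differ in how they count and in their defaults.

```python
import time
from dataclasses import dataclass


@dataclass
class HealthCheckConfig:
    interval: float = 30.0        # seconds between checks
    grace_period: float = 60.0    # seconds to wait before the first check
    timeout: float = 1.0          # seconds before a single check counts as failed
    unhealthy_threshold: int = 3  # consecutive failures needed to turn unhealthy
    healthy_threshold: int = 2    # consecutive successes needed to turn healthy


class HealthTracker:
    """Applies the thresholds to a stream of individual check results."""

    def __init__(self, config, initially_healthy=True):
        self.config = config
        self.healthy = initially_healthy
        self._streak = 0  # consecutive results that contradict the current state

    def record(self, success):
        """Record one check result and return the resulting overall state."""
        if success == self.healthy:
            self._streak = 0
            return self.healthy
        self._streak += 1
        needed = (self.config.healthy_threshold if success
                  else self.config.unhealthy_threshold)
        if self._streak >= needed:
            self.healthy = success
            self._streak = 0
        return self.healthy


def monitor(check, config, rounds, sleep=time.sleep):
    """Run `check(timeout)` for a number of rounds; return the state after each.

    `check` is any callable that takes a timeout in seconds and returns a bool.
    `sleep` is injectable so the loop can be exercised without real waiting.
    """
    tracker = HealthTracker(config)
    sleep(config.grace_period)
    states = []
    for _ in range(rounds):
        states.append(tracker.record(check(config.timeout)))
        sleep(config.interval)
    return states
```

With the defaults above, a single failed check does not change the reported state; only three failures in a row do, and two successes in a row are then needed to recover. This is what makes the thresholds a tolerance knob rather than a simple pass/fail switch.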

The table below summarizes which check types and configuration parameters different cloud platforms support for application health checks.

Platform	Script Check	TCP Check	HTTP Check	HTTPS Check	TLS Check	Grace Period	Interval	Unhealthy Threshold	Healthy Threshold	Timeout
AWS	–	✔	✔	✔	✔	–	✔	✔	✔	✔
GCP	–	✔	✔	✔	✔	–	✔	✔	✔	✔
Consul	✔	✔	✔	✔	–	–	✔	–	–	✔
Docker	✔	–	–	–	–	✔	✔	✔	–	✔
Marathon / Mesos	✔	✔	✔	✔	–	✔	✔	–	–	✔
Kubernetes	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔
Cloud Foundry	–	✔	✔	–	–	✔	✔	–	–	✔

The top two rows in the table above are examples of IaaS platforms, while Consul is a more general service discovery tool in which the health of the services in its catalog dictates their availability. The rest of the systems listed in the table are container management platforms and PaaS, primarily responsible for keeping their applications up and running. What is particularly interesting is that Kubernetes implements almost all of the configuration knobs, while Cloud Foundry has so far chosen to support only what is specific to the category of applications it manages.

While the table above shows that IaaS and platforms take similar approaches to application health checks, there are subtle differences in where the health check runs. In particular, in a platform the health check process is often internal to the application container, while in an IaaS it is external to the VM that runs the application. Neither approach is perfect, and while external health checks are achievable in platforms as well, each approach has its pros and cons. In our experience, this difference matters for correctly identifying the health of applications and thus for allowing the control plane software to take the necessary corrective actions.

When running application health checks externally, extra network traffic may be created depending on where the health check process runs relative to the application. This in turn implies that network failures can affect the results of the health check, and scalability may become an issue depending on how many instances of an application or service are checked. On the other hand, running health checks inside a container or VM implies that the health check process consumes extra resources on the host on which it runs, which need to be accounted for when provisioning the application or service. Also, while health checking local to the application host verifies the availability of the service on the loopback interface, some extra effort is still needed to ensure that the service is available from outside the host.

Summary

To sum up, health checks for a complex distributed system may create a certain level of uncertainty, whereby the very process of observing the health of the system causes instability and unhealthy behaviors that are difficult to discount. How to overcome these challenges, especially at scale, is part of what we have learned in Cloud Foundry by trying various health checking tactics. Our experiments with health checks for Cloud Foundry applications revealed requirements that, while lending themselves to simplicity and declarative configuration, demand careful allocation of resources and clear definitions of what success and failure mean for a health check.

This work is co-authored by Nima Kaviani (IBM), Christopher Piraino (VMware), and Michael Maximilien (IBM).

Special thanks to Eric Malm (VMware) for his significant contributions on the subject.