The following is a Cloud Foundry technology blog post by engineer Martin Englund.
Within the Cloud Foundry BOSH team we have been working on the Cloud Provider Interface for Amazon Web Services EC2 for over a year now, and coming from VMware it was a big change moving from the vSphere API to the AWS API.
Many things can be said about vSphere and its API, but it is very consistent in response times, and it hardly ever lies to you. The AWS API on the other hand is much easier to use than vSphere, but has some really annoying peculiarities – it misleads you and can be very fickle.
When you ask for a virtual machine (instance in AWS speak) to be created, you are handed back a reference to the new instance, and then you need to wait until AWS reports that it is running. This can take anything from 20 seconds to 30 minutes, while you keep asking the API “are you done yet?”. The long wait isn’t a problem in itself, but it is not unusual for three consecutive API calls to report different states for the instance: missing, running and then missing again.
This blog post will highlight some of the lessons learned during this year of intense AWS development, and also how we have worked around some of the issues that arise from the eventual consistency of the AWS API.
The typical flow of events when the BOSH director is creating a new virtual machine is:
- Create instance
- Wait for instance to be running
- Create volume
- Wait for volume to be available
- Attach volume to instance
On a number of occasions we have encountered problems at the last step, where AWS reports that either the instance or the volume isn’t available. This is very frustrating as we know it is – we just created it, and waited for it to be running/available. It can also happen in reverse: we ask for a volume to be detached, wait for the detachment to complete, and then proceed to delete the volume – which sometimes results in an error saying the volume is still attached.
This led to really ugly code, as we always need to second guess the AWS API – it might be misleading us, but it could also truly be the case that the instance really isn’t there, so how many times do you retry the operation?
There are a number of other situations like this, where we know the resource is there, but the API will tell us it is not.
The people at Pi Cloud have run into this too, and they documented their experience in a blog post which was very helpful to us as we could validate our findings (and realised that we weren’t the only ones who were frustrated).
We ended up implementing a helper class that waits for resources to become available. Because we know we do things in a certain order, we can take advantage of that when we wait. For example, we know we’ll never ask for an attached disk to be deleted, so if we are told that the disk is still attached when we try to delete it, we assume the AWS API is not being accurate with us, wait a little, and try again.
This is not a very elegant solution, but we haven’t found anything that produces more stable results more elegantly.
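The retry idea described above can be sketched roughly as follows. This is not the BOSH helper class itself; `retry_on`, `VolumeStillAttached`, and the attempt and delay values are hypothetical names and numbers chosen for illustration.

```python
import time


class VolumeStillAttached(Exception):
    """Hypothetical error: the API claims the volume is still attached."""


def retry_on(spurious, action, attempts=5, delay=1.0, sleep=time.sleep):
    """Call action() and retry when it raises an error we believe is stale state.

    `spurious` is an exception class (or tuple of classes) that, given the
    order in which we perform operations, should only mean the API has not
    caught up yet. Any other exception propagates immediately.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except spurious:
            if attempt == attempts:
                raise  # after this many tries, treat the error as real
            sleep(delay)
```

The cap on attempts is what answers "how many times do you retry?": once it is exhausted, the error is allowed to surface instead of being second-guessed forever.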
Unpredictable IaaS operations
Another big difference between vSphere and AWS is the predictability of how long an IaaS operation will take. At first we would keep asking “are you done yet?” every second, but that quickly led to the API telling us to back off during large deployments, especially when there were a lot of concurrent VM creations. We then implemented an exponential back-off algorithm which kept us from being throttled, but that led to waiting longer than we needed to.
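As a rough illustration of that trade-off, here is a minimal polling loop with exponential back-off. The function names and the base, factor, and cap values are assumptions made for this sketch, not the values used in BOSH.

```python
import time


def backoff_delays(base=1.0, factor=2.0, cap=32.0):
    """Yield an endless sequence of delays: base, base*factor, ... up to cap."""
    delay = base
    while True:
        yield delay
        delay = min(delay * factor, cap)


def wait_until(is_ready, timeout, sleep=time.sleep, clock=time.monotonic):
    """Poll is_ready() with growing delays; return False once the timeout would be exceeded."""
    deadline = clock() + timeout
    for delay in backoff_delays():
        if is_ready():
            return True
        if clock() + delay > deadline:
            return False
        sleep(delay)
```

With these defaults the polls happen at 0, 1, 3, 7, 15 and 31 seconds, so a resource that becomes ready at 20 seconds is not noticed until 31 seconds. Fewer API calls are made, at the cost of waiting longer than necessary.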
This made us gather statistical data on AWS API calls so we could reduce the number of calls and also minimize the wait time. After creating a couple of thousand instances we had very useful statistics that showed what our gut was telling us – it is usually fast, but every now and then we’d have an operation that would take orders of magnitude longer than the average. Note that since we experimented with different wait times, a resource might have been ready a little earlier than reported, but the data gives a good indication of how long operations can take.
Normally we limit the instance creation wait to 100 seconds, as if it hasn’t started running by then, chances are it will never run no matter how long we wait. We used to wait for instances to enter the running state for up to 6 hours, and it still never happened, so we now terminate the instance, and try to create it again (and again). Looking at the data, we should probably stop waiting at around 60 seconds, and just retry the creation as it is more likely to be faster than waiting for it to complete.
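The "give up and recreate" strategy could look something like the sketch below. The three callables are hypothetical stand-ins for the real cloud calls, and the limit of three attempts is an assumption; only the 100 second wait limit comes from the text.

```python
def create_running_instance(create, wait_until_running, terminate,
                            wait_limit=100, max_attempts=3):
    """Create an instance, recreating it if it is not running within wait_limit seconds.

    Hypothetical callables supplied by the caller:
      create() -> instance id
      wait_until_running(instance_id, timeout_seconds) -> bool
      terminate(instance_id) -> None
    """
    for _ in range(max_attempts):
        instance_id = create()
        if wait_until_running(instance_id, wait_limit):
            return instance_id
        # Past the limit the instance is unlikely to ever run:
        # discard it and start over with a fresh one.
        terminate(instance_id)
    raise RuntimeError(
        f"no instance reached the running state in {max_attempts} attempts"
    )
```

Lowering `wait_limit` toward 60 seconds, as the data suggests, only changes one argument; the retry structure stays the same.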
The top ten outliers in this series are: 84.071733082, 86.43778181, 87.430262797, 87.980610834, 91.108461213, 100.624374432, 101.625413733, 101.998313417, 197.348654888, 348.940139624
It is surprising that destroying an instance is slower than creating one, and when you create and destroy thousands of instances, you end up waiting a lot! This led us to implement a fast path delete option, which means we don’t wait for the instance to be deleted: we issue the delete instance API call and move on. To be safe we tag the instance with “to be deleted” in case the delete never completes, so we can clean it up by hand if needed.
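A minimal sketch of the fast path delete, run against a toy in-memory stand-in rather than the real AWS API. The `InMemoryCloud` class, the `Name` tag key, and the `leftovers` helper are all assumptions made for illustration; the text only says the instance is tagged as to be deleted.

```python
class InMemoryCloud:
    """Toy stand-in for a cloud API so the sketch runs without AWS access."""

    def __init__(self, stuck=()):
        self.instances = {}      # instance id -> dict of tags
        self.stuck = set(stuck)  # ids whose termination never completes

    def create(self, instance_id):
        self.instances[instance_id] = {}

    def tag(self, instance_id, key, value):
        self.instances[instance_id][key] = value

    def request_terminate(self, instance_id):
        # Returns immediately; "stuck" instances simply never go away.
        if instance_id not in self.stuck:
            del self.instances[instance_id]


def fast_delete(cloud, instance_id):
    """Tag first, then request termination, and do not wait for completion."""
    cloud.tag(instance_id, "Name", "to be deleted")
    cloud.request_terminate(instance_id)


def leftovers(cloud):
    """Instances that were marked for deletion but still exist."""
    return sorted(
        instance_id
        for instance_id, tags in cloud.instances.items()
        if tags.get("Name") == "to be deleted"
    )
```

Tagging before the terminate call means that even if the request is lost or never completes, the instance can still be found later and cleaned up by hand.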
The top ten outliers in this series are: 154.012144863, 154.151371556, 154.163479756, 154.325573232, 154.553040499, 155.635032762, 155.920728728, 157.312958686, 160.3600697, 172.996460025
Creating volumes (10 GB in these tests) is usually a fast operation, but may sometimes take much, much longer. Since we can’t troubleshoot the underlying infrastructure to find the root cause, we should change our strategy to stop waiting after about 10 seconds and try creating a new volume instead of waiting.
The three outliers in this series are: 155.982429008, 156.393396034, 158.123489427
Attaching a volume to a running instance produced the biggest outlier in this experiment: it took over 26 minutes to attach the volume, an operation that on average takes less than 14 seconds.
The five outliers in this series are: 64.783846287, 80.983038301, 100.659785381, 419.018831255, 1583.646977629
Having all this data at hand has allowed us to create better algorithms for waiting for AWS resources. For example, when we create an instance, we know there isn’t any point in checking until around 20 seconds as it is the earliest we see a newly created instance running, and after 90 seconds it is highly unlikely it will ever run, so we can focus the waiting and polling around the sweet spot on the curve.
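One way to focus polling on that window is sketched below. The 20 and 90 second bounds come from the observations above; the 5 second interval and the function names are assumptions, and the sketch treats each status check as taking negligible time.

```python
import time


def poll_offsets(first_check=20.0, give_up=90.0, interval=5.0):
    """Seconds after creation at which to poll, from first_check up to give_up."""
    offsets = []
    t = first_check
    while t <= give_up:
        offsets.append(t)
        t += interval
    return offsets


def wait_for_running(is_running, offsets, sleep=time.sleep):
    """Poll only at the given offsets.

    Returns the offset at which the instance was first seen running,
    or None if it never was (the caller would then recreate it).
    """
    elapsed = 0.0
    for offset in offsets:
        sleep(offset - elapsed)  # sleep until the next scheduled check
        elapsed = offset
        if is_running():
            return elapsed
    return None
```

No API calls are spent during the first 20 seconds, when the instance cannot be running yet, and none after 90 seconds, when waiting is unlikely to pay off.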
We aim to gather more metrics in the future, as there are other oddities with newly created instances we want to investigate further – after an instance reports that it is running it might still never boot the operating system, which usually is caused by the EBS root volume not being available. This is another scenario where having more data will improve our ability to know when we are better off recreating the instance instead of waiting.