Top Cloud API Sins

Here is a short list of the top poor design decisions that I’ve seen in cloud APIs. Let me rephrase that: Here’s a short list of the top API features that got in our way or simply didn’t work for us. There may well be other use cases where these features make sense.

  • Listing of resources without the details, e.g., a list-servers call that doesn’t return all the details for each server. This makes it very expensive to poll for server state changes because the listing doesn’t have enough information and so one has to do a show-server for each individual server. Imagine polling for an account that has several thousand servers – ouch. It’s fine to have a “with details” flag in the request so one can get the bare list, but we’d always set that flag.
  • Not returning a resource ID on creation. Some APIs don’t give you a server ID when you request a server to be launched; the response is just “ok, we’ll launch a server”. This means you end up guessing: “is that new server that just appeared in the list the one I just launched?”
  • Providing a task queue. Several APIs I’ve seen have a task queue that is supposed to provide updates on tasks that are in progress. For instance, you launch a server and you get a handle to a task descriptor. For us that’s just overhead. Include a state field in the resource itself and we’ll keep track of the state changes on the resource. So if attaching a volume takes a while, create the volume resource and set its state to “attaching” (or whatever is appropriate). Having a separate resource to say “that volume you created is attaching” costs extra calls and means that the state of a resource is now in several places.
  • Lacking publishable images or the equivalent of EC2’s user_data (small amount of data that is passed to the launching server via the launch API call). I touched on these in my previous blog post.
  • Not returning deleted resources in a “list resource” call. In particular, terminated servers must be returned in a list-servers call for a certain duration, probably at least an hour. Otherwise the client has to infer that a server self-terminated or failed when it no longer finds it in the result of list-servers calls. That inference is unsafe: we have seen multiple, completely different clouds fail to list servers that were in fact running. In the case of EC2, which lists terminated instances for a good amount of time, this resulted in error emails alerting us to the situation. In another cloud this resulted in servers being marked as terminated, which is an irreversible operation and often triggers alerts and automation. And then the servers “resurrected”. Ouch! Now combine this with the next sin:
  • Pagination that goes page-wise instead of using a marker – for instance where you get page 1 or the first 100 resources and then issue a query for “page 2” or “from 100 on.” Explain to me how a client can get a consistent resource listing when resources can be added and removed concurrently. This is particularly fun if the client has to infer deletion from the absence of a resource in the listing: was it deleted or did it fall through the cracks between pages due to a different resource being deleted concurrently with the listing? The proper way to do pagination is using markers the way Amazon does it, but for a cloud API I actually don’t see the value in pagination. We always retrieve the whole list.
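
To make the pagination problem concrete, here is a small sketch. Everything in it is made up for illustration: FakeCloud is an in-memory stand-in, not any real provider's API, and the method names list_by_offset and list_by_marker are my own. It deletes one server between two page requests and compares what each pagination style returns.

```python
from typing import Callable, List, Optional


class FakeCloud:
    """Hypothetical in-memory stand-in for a cloud API (not a real provider)."""

    def __init__(self, server_ids: List[str]) -> None:
        self.servers = sorted(server_ids)

    def delete(self, server_id: str) -> None:
        self.servers.remove(server_id)

    def list_by_offset(self, offset: int, limit: int) -> List[str]:
        # Page-wise: "give me `limit` servers starting at position `offset`".
        return self.servers[offset:offset + limit]

    def list_by_marker(self, marker: Optional[str], limit: int) -> List[str]:
        # Marker-based: "give me `limit` servers whose ID sorts after `marker`".
        after = [s for s in self.servers if marker is None or s > marker]
        return after[:limit]


def delete_once(cloud: FakeCloud, server_id: str) -> Callable[[], None]:
    """Return a hook that deletes `server_id` the first time it is called."""
    pending = [server_id]

    def hook() -> None:
        if pending:
            cloud.delete(pending.pop())

    return hook


def fetch_all_by_offset(cloud: FakeCloud, limit: int,
                        between_pages: Callable[[], None]) -> List[str]:
    seen: List[str] = []
    offset = 0
    while True:
        page = cloud.list_by_offset(offset, limit)
        if not page:
            return seen
        seen.extend(page)
        offset += len(page)
        between_pages()  # simulates a concurrent change between requests


def fetch_all_by_marker(cloud: FakeCloud, limit: int,
                        between_pages: Callable[[], None]) -> List[str]:
    seen: List[str] = []
    marker: Optional[str] = None
    while True:
        page = cloud.list_by_marker(marker, limit)
        if not page:
            return seen
        seen.extend(page)
        marker = page[-1]  # resume after the last ID we actually saw
        between_pages()  # simulates a concurrent change between requests


ids = [f"srv-{n:02d}" for n in range(1, 7)]

cloud = FakeCloud(ids)
offset_listing = fetch_all_by_offset(cloud, 3, delete_once(cloud, "srv-02"))

cloud = FakeCloud(ids)
marker_listing = fetch_all_by_marker(cloud, 3, delete_once(cloud, "srv-02"))

print("srv-04 seen via offsets:", "srv-04" in offset_listing)
print("srv-04 seen via markers:", "srv-04" in marker_listing)
```

In the offset run, srv-04 is alive the whole time but never shows up: deleting srv-02 shifts everything down by one position, so srv-04 slides into the page the client has already fetched. A client that infers deletion from absence would now declare srv-04 dead. In the marker run every server that stays alive is listed.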

If you’re working on a cloud API, please think twice if you’re doing one of the above. Again, I don’t know all the use cases, just ours.

Now here’s what I’d really like to see – an event-based interface instead of a request-reply interface. Request-reply is fine if you have a system that sends commands to the cloud. It’s a problem when you build a system that reacts to changes in the cloud, because you have to keep polling all these resources. We run a good number of machines that do nothing but chew up 100% CPU polling EC2 to detect changes. Fortunately CPU cycles are cheap :-).
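
For a sense of what that polling amounts to, here is a rough sketch of the core step, not our actual code: take two consecutive detailed listings and diff them into events. The snapshot shape (server ID mapped to state) and the event names are assumptions of mine. With an event-based interface the cloud would hand over this stream directly and the diffing would disappear.

```python
from typing import Dict, List, Tuple

Snapshot = Dict[str, str]          # server ID -> state, from one full listing
Event = Tuple[str, str, str]       # (kind, server ID, detail)


def diff_snapshots(previous: Snapshot, current: Snapshot) -> List[Event]:
    """Turn two consecutive listings into the events that happened between them."""
    events: List[Event] = []
    for server_id, state in current.items():
        if server_id not in previous:
            events.append(("appeared", server_id, state))
        elif previous[server_id] != state:
            events.append(
                ("state_changed", server_id, f"{previous[server_id]}->{state}")
            )
    for server_id, last_state in previous.items():
        if server_id not in current:
            # Deliberately NOT reported as "terminated": absence from a
            # listing is ambiguous, so this only flags it for investigation.
            events.append(("missing", server_id, last_state))
    return events


previous = {"srv-1": "running", "srv-2": "running", "srv-3": "pending"}
current = {"srv-1": "running", "srv-2": "terminated", "srv-4": "pending"}

events = diff_snapshots(previous, current)
for event in events:
    print(event)
```

Note how this only works well when the API avoids the sins above: the listing has to carry the state of every server, and terminated servers have to stay in the listing long enough for the “terminated” transition to be observed rather than guessed.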