Cloud API Requirements

Does a standard cloud API make sense? I’m still of the opinion that there is too much diversity out there and that the time is not yet ripe. There are many indispensable features in Amazon EC2 that no one else has implemented at scale, or for which no alternative approach exists. The whole EBS feature set is one example. Image sharing and publishing is another, and Elastic IPs are a third. Eucalyptus has implementations for all of these, but I can’t point to anyone operating them at scale. In any case, Tom White offers the following “Turing complete set of cloud compute services”:

1. Instance lifecycle. The lifecycle defines the basic commands to provision, start, stop, and terminate instances. The bare-bones of a compute service.

2. Shared images. While it is possible to bootstrap from plain OS images created by the cloud service provider, the ability to build your own customized image, and crucially to share it with others provides a social element that helps drive adoption of a cloud platform.

3. Instance metadata. This feature allows you to inject small amounts of user-specific metadata to each instance at boot time – e.g. secret keys, or custom boot parameters – which allows another level of customization. This feature works well in conjunction with shared images: the common non-user specific code is baked into the shared image, and the user-specific code is supplied at launch time as metadata.

4. Network controls. Cloud providers need to think about the network environment that the user’s instances run in. Offering such services as DNS, firewalling, VPNs (and exposing it via an API) makes it easy for developers to get started quickly without having to build this infrastructure themselves.
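To make the four services concrete, here is a minimal Python sketch of how they might fit together. Everything in it is an assumption for illustration: `ToyCloud`, its method names, and the `img-`/`i-` identifiers are hypothetical and do not correspond to any provider's API, and network controls are reduced to opening a port (DNS and VPNs are left out).

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class Image:
    image_id: str
    owner: str
    shared_with: set = field(default_factory=set)


@dataclass
class Instance:
    instance_id: str
    image_id: str
    metadata: dict  # 3. user-specific data injected at launch time
    state: str = "running"
    open_ports: set = field(default_factory=set)


class ToyCloud:
    """Hypothetical in-memory stand-in for a compute service (not a real API)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.images = {}
        self.instances = {}

    # 2. Shared images: build your own image and share it with others.
    def register_image(self, owner):
        image = Image(f"img-{next(self._ids)}", owner)
        self.images[image.image_id] = image
        return image

    def share_image(self, image_id, user):
        self.images[image_id].shared_with.add(user)

    # 1. Instance lifecycle: provision, start, stop, terminate.
    def provision(self, user, image_id, metadata=None):
        image = self.images[image_id]
        if user != image.owner and user not in image.shared_with:
            raise PermissionError(f"{user} may not launch {image_id}")
        inst = Instance(f"i-{next(self._ids)}", image_id, dict(metadata or {}))
        self.instances[inst.instance_id] = inst
        return inst

    def start(self, instance_id):
        self.instances[instance_id].state = "running"

    def stop(self, instance_id):
        self.instances[instance_id].state = "stopped"

    def terminate(self, instance_id):
        self.instances[instance_id].state = "terminated"

    # 4. Network controls: firewalling only in this sketch.
    def open_port(self, instance_id, port):
        self.instances[instance_id].open_ports.add(port)
```

The common code lives in the shared image, while the `metadata` dictionary passed to `provision` carries the user-specific part, which mirrors the split described in item 3.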

This is a great start and I would wholeheartedly agree that these are requirements, but I don’t think they’re enough. In order to operate interesting services (or acquire interesting customers) a cloud must offer more than this. I’m not yet sure exactly what the list is, but the following come to mind:

  • Security groups or VLANs: users must be able to control the network boundary around their servers, group servers into tiers, and create private communication structures. I believe there are only two differences between security groups and VLANs: a network interface can be in many security groups but only one VLAN, and VLANs can offer layer-2 multicast (and non-IP protocols) while security groups can’t.
  • Private IPs and remappable public IPs: the notion of private IP addresses goes hand in hand with that of private communication structures. Of course publicly routable addresses are required as well, and there has to be some way to remap those IPs so that the failure of a server can be masked or a quick failover can be engineered for other purposes. I believe that in the end NAT (as used by Amazon’s Elastic IPs) is the only scalable choice, but I’m ready to learn new things, and there certainly is room for improvement over EIPs.
  • Mountable storage volumes with snapshot backup: we did operate for a long time without Amazon EBS, and at the time the “we need no expensive SAN” feeling was great, but after operating databases in EC2 with EBS for several years now there’s no way I’m going back. I need to be able to mount a storage volume on a server, operate it, take a snapshot backup, and then create a fresh volume from that snapshot on another server. I’m OK with the volume being a remote filesystem as opposed to a block device, and I’m OK with the snapshot being another volume as opposed to tertiary storage (S3 in Amazon’s case). Oh, but please don’t make it hard to do this across failure zones, so that the two servers above are failure-isolated.
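The last two items can be sketched as a toy model in Python. This is only an illustration under stated assumptions: `Volume`, `ToyStorage`, and `NatTable` are hypothetical names, the volume is modeled as a plain dictionary rather than a block device or filesystem, and the NAT table simply maps a public address to whichever private address currently serves it.

```python
import copy
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Volume:
    zone: str                          # failure zone the volume lives in
    blocks: dict = field(default_factory=dict)
    attached_to: Optional[str] = None  # server the volume is mounted on


class ToyStorage:
    """Hypothetical snapshot store; snapshots are visible from every zone."""

    def __init__(self):
        self.snapshots = {}

    def snapshot(self, name, volume):
        # Point-in-time copy: later writes to the volume do not change it.
        self.snapshots[name] = copy.deepcopy(volume.blocks)

    def create_volume(self, snapshot_name, zone):
        # Fresh volume seeded from the snapshot, possibly in another zone.
        data = copy.deepcopy(self.snapshots[snapshot_name])
        return Volume(zone=zone, blocks=data)


class NatTable:
    """Hypothetical public-to-private address mapping, remappable at will."""

    def __init__(self):
        self.mapping = {}

    def remap(self, public_ip, private_ip):
        self.mapping[public_ip] = private_ip

    def route(self, public_ip):
        return self.mapping[public_ip]
```

A failover under this model would snapshot the primary's volume in one zone, call `create_volume` with a different zone for the replacement server, and then `remap` the public address to the replacement's private address, so clients keep using the same public IP.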

In my opinion we can’t be successful until we can hash out all these features with a reasonable degree of flexibility, so providers can differentiate, and at the same time a reasonable degree of uniformity, so users (like the RightScale cloud management system and its pre-built ServerTemplates) can write portable systems. I have not heard the discussion reach the level of sophistication needed (and I freely admit that I haven’t listened as hard as I could have), and frankly I also feel like we’re all still learning new things all the time. On the RightScale end, we’re in the process of reworking our multi-cloud layer so we can incorporate some of the above feature sets in a standardized manner, so I hope I can make more specific contributions in the near future.