
The first question I should answer is: why did SpatialCloud choose Amazon Web Services (AWS)? When we made that decision almost two years ago, AWS was clearly the leader in what is now commonly referred to as Infrastructure As A Service (IAAS), which very loosely translated means something like, "the same as before, just a lot easier and less expensive." You still have to choose between 2 CPUs and 4 CPUs (or more), but the difference is that you can change your mind at any time, and the economics of making those changes are measured in pennies rather than dollars. The "same as before" part was critical because we needed a command line where we could use tools that we were familiar with, like MapServer and GDAL. And we wanted to be able to continue to use those same tools anywhere: not just in one specific cloud, but on others and on our own systems. The keyword here was "portability." Specifically, the main features of interest were EC2's pay-as-you-go model, including ready-to-roll Linux machine images that we could modify to our own needs, and the ability to create and destroy classic block storage on the fly. We knew we required a lot of on-demand horsepower and flexible storage. With AWS the terms were clear and, better yet, came with no limits.
Now, I am pretty sure that anyone who has done infrastructure work in the past (deciding what hardware to purchase and where to put it, handling software installs, worrying about disaster recovery, etc.) intuitively gets the significance of cloud-based infrastructure. You get it when you first experience the browser as the on/off switch to any number of server instances. You get it when you find yourself able to bring a new TB storage volume online in about a minute, including formatting the disk and mounting the device, and you never even opened a box. I could go on and on here and make the folks over at Amazon happy with all the gushing, but I think you get the point.
Or, at least that's what we thought were the reasons for choosing AWS.
Initially, traditional technical and economic concerns were the deciding factors in our choice of AWS, but over time we realized that while the IAAS stuff is very nice (allowing us to easily scale on demand and to dream of doing something big while staying small), the Platform As A Service (PAAS) parts provided by AWS are just as important, if not more so. By this I mean services such as Simple Queue Service (SQS) and Simple Storage Service (S3). Yes, S3 may look like IAAS to some, but when you consider its other features, such as DevPay and CloudFront, it would be an oversimplification to call it IAAS.
The reason it took us a while is simple. At the beginning you just don't get the implications of PAAS for your application architecture. You may have read about it and thought you knew about it. But like all the really good stuff, you don't get it until you have written some code that actually leverages it and have experienced the benefit of it.
I believe that GIS in the Cloud pushes the envelope of Cloud technology. SpatialCloud specializes in the part of GIS that is traditionally the most resource intensive. In this blog I will share our experiences working with large image-based datasets on AWS. The high-resolution image databases that we are all now familiar with (think of the Satellite layer in Google Maps) are by their nature BIG. Typically, individual compressed image files test the 5GB file size limit of S3. Not your average family photographs. Just moving source files of this size around tests Amazon's local network infrastructure. To further complicate things, the standard for serving this kind of "base layer" image is to create what is called a "tile pyramid." See Maptiler.org for a guide. At the base of the pyramid is the full resolution layer. Each layer above that is half the resolution. The standard size for an individual tile is a tiny image, only 256 x 256 pixels. Using the FSA's NAIP 1m/pixel dataset as an example, each base layer tile covers 256 m x 256 m on the ground, an area of about 1/16 square km. If you take the land area of the contiguous United States (NAIP's coverage area), which is 7,663,942 km², that means you would need to store about 7,663,942 x 16 tiles, or over 122 million tiles, for the base layer alone.
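The tile arithmetic above can be checked with a few lines of Python. This is only a back-of-envelope sketch: the function names are made up for illustration, it assumes the coverage area can be divided evenly into square tiles (real tile grids follow a map projection and a bounding box, so actual counts differ), and it assumes each step up the pyramid halves the resolution in both directions, so each level holds roughly a quarter of the tiles of the level below.

```python
import math

TILE_PX = 256               # tile edge in pixels
GROUND_RES_M = 1.0          # NAIP ground resolution in metres per pixel
LAND_AREA_KM2 = 7_663_942   # contiguous US land area quoted in the text


def base_tile_count(area_km2, tile_px=TILE_PX, res_m=GROUND_RES_M):
    """Estimate how many full-resolution tiles are needed to cover an area."""
    tile_edge_km = tile_px * res_m / 1000.0   # 256 m -> 0.256 km
    tile_area_km2 = tile_edge_km ** 2         # a little over 1/16 km^2
    return math.ceil(area_km2 / tile_area_km2)


def pyramid_tile_count(base_tiles):
    """Estimate tiles across all levels, each level ~1/4 of the one below."""
    total = 0
    level = base_tiles
    while level > 1:
        total += level
        level = math.ceil(level / 4)
    return total + 1   # the single tile at the top of the pyramid


if __name__ == "__main__":
    rounded = LAND_AREA_KM2 * 16          # the 1/16 km^2 rounding used above
    exact = base_tile_count(LAND_AREA_KM2)
    print(f"base tiles (1/16 km^2 rounding): {rounded:,}")
    print(f"base tiles (256 m x 256 m):      {exact:,}")
    print(f"whole pyramid (estimate):        {pyramid_tile_count(exact):,}")
```

With the exact 256 m tile edge the base layer comes out a little lower than the rounded figure, and under these assumptions the upper levels of the pyramid add roughly another third on top of the base layer. Either way, the file count is in the hundreds of millions.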
The point here is that we not only have a lot of data, we have a lot of files. Normally, faced with a file count of 122 million, you need to carefully consider a host of issues, including infrastructure-related problems such as what sector size to use for disk partitioning and what file system to use for very large numbers of files. And once you have all of that data on disks, and maybe even backed up, only then can you go on to figure out how to count something that is in the hundreds of millions. The use of AWS clearly helped us to off-load a large part of these problems, but as it allowed our team to do bigger things, it also gave us new challenges and forced us into new modes of thinking.
We continue to learn by doing, and the exciting thing is that newly announced services increase the realm of opportunity for the agile. More to come.