It has become a bit of a trope on the internet to quip “there is no cloud; it’s just somebody else’s computer.” In fact, there are even stickers to that effect! But while it is true that a a big part of the innovation of the cloud is allow you to rent hardware from somebody else without owning it yourself, what makes the cloud The Cloud is not just the ability to rent hardware, but also all the innovations in software that make it possible for (a) lots of computers in a data center to work together in concert, and (b) the resources of a giant data center to be divvied up among lots of separate users seamlessly.
Why the Cloud is Complicated¶
Perhaps no single innovation has been as central to the rise of the cloud as the software that makes it possible to carve up a single computer into lots of smaller Virtual Machines, each of which acts like a single, normal computer. This technology – known as virtualization – it means that if you go to amazon or microsoft and say that you want to rent a computer with 2 processing cores, 16 gigs of ram, and 100 gigabytes of hard drive space, Amazon/Microsoft doesn’t have to go find a computer in their data center that has exactly those specifications. Instead, they can find any computer that has at least those resources available, and create a virtual machine on the computer that has access to exactly those resources. To you, this computer looks exactly like a single computer with those resources, but Amazon/Microsoft is free to allocate any other resources on that computer to other people by creating more virtual machines. All of these virtual machines can run on the same physical computer, but all of them are protected from one another, so users can only use exactly what they paid for, and have no way to interact with the other virtual machines on the same computer.
But here’s the problem with virtualization: it makes it possible to divide up big computers into little computers, but you can’t combine several small computers into one super virtual machine. The reasons for this are a little bit technical, but the basic idea is that when multiple processes are running on the same computer, they can all see and manipulate the same data in memory; but when processes are running on different physical computers, they can only communicate with one another by sending messages across ethernet cables, a process that works very differently and is much much slower.
As a result, if you want to use 200 processing cores to analyze a huge data set, you can’t just request a single virtual machine with 200 cores, because nobody makes a single physical computer with 200 cores. Instead, you will have to rent lots of smaller virtual machine and set them up so that they can interact with each other to share the work using a tool like
The Cloud is Not Primarily for Data Analysis¶
One thing that’s important to understand about the cloud is that most people using it to do the type of work that you all want to do as a data analyst. Most people using the cloud are using it to provide some type of service to internet users. For these people, the role of the cloud is to serve lots and lots of users, each of whom needs a little bit of processing power and to process a little bit of data. For example, if you have a messaging app, you’ll have lots of people regularly checking for and sending messages, but each of these interactions requires only a tiny amount of processor time, and will only involve moving a tiny amount of data.
To be clear, this time of cloud computing isn’t easy – the really hard part is ensuring that all users can access a common database, so if someone posts a cat picture on Twitter in South Africa, someone in North Carolina can see it when they check their tweet stream. But most of the demands can easily be handled by having lots of very small virtual machines, each of which handles a few requests at a time.
But data analysis is different: instead of lots and lots of users processing small bits of data, data analysis usually entails a small number of users (often 1) trying to process tons of data all at once for a single calculation. This, unfortunately, makes the fact that we can’t use virtualization to create a single giant virtual machine a much, much bigger problem!
That’s not to say data scientists never use the cloud in the first manner described. Many times after a model has been trained, data scientists want that model easily accessible by users. For example, if you write an algorithm that asks users for a few favorite songs and recommends new artists, you probably want that model to be accessible through a webpage or app. In that situation, we’re in the world of “model deployment,” which is back to a world where lots of users simulaneous want to interact in a way that is pretty simple for each user (using a model is easy computationally, unlike training one). To learn about how deployment works on Azure you can read more here. This is, however, not our focus here. In these exercises, we’ll be talking primarily about data analysis – small number of users, lots of data and compute.
And that also means you should be aware then when you google around for resources about cloud services, most cloud users don’t have the same needs you will as a data analyst! So be careful when you pick which internet resources to use for learning about data analysis on the cloud.
Components of the Cloud¶
In the next set of readings and exercises, we’ll be learning how – in a specific sense – we can setup virtual machines and
dask clusters on the Microsoft Azure system (why Azure? Because Duke has a relationship with them, and they’re the quickest growing Cloud service, even if they’re still smaller than Amazon AWS. So not really an endorsement, just a convenient choice.). But for the moment, let’s focus on general principles of cloud computing – concepts that will apply whether you use Microsoft Azure, Amazon AWS, Google Cloud, or any other service.
But if what follows starts to sound overwhelmingly complex, I encourage you to quickly jump ahead to the page where we actually set up a dask cluster and do some computing to see what this looks like in practice, where hopefully you will see that it’s not nearly as bad as it seems like it’s going to be.
As a data analyst, there are two ways that you can approach doing your work on the cloud: the first is just to get a big virtual machine with lots of resources and use it as a single machine; the second is to rent a lot of (usually smaller) machines and use them together in a cluster.
Getting a single large virtual machine is definitely the easier strategy if it works for your purposes. Working on a single virtual machine means you can use the normal programming practices you use on your personal computer. This makes these great choices if they work for your budget and provide the resources you need. I think most virtual machines top out at about 64 cores and about 256 gb of RAM, but bigger VMs do exist: Azure has some virtual machines that will go up to 128 cores and start at 480 GB RAM for about 3.60 an hour (though half those cores from from a process called hyperthreading, so the performance you get won’t quite measure up to that core count). You can see an overview of Azure options and prices here (look under the high performance computing options), where you’ll also see offerings for GPU-centric computers, which can provide as many as four GPUs.
Getting a single large virtual machine is often much easier logistically, and can be faster depending on the exact operation of your code (since different processes on one computer never have to send data across a network connection like in a cluster). Moreover, big virtual machines often seem suprisingly price competitive with clusters. The one downside to be aware of, though, is that as soon as you turn on a big virtual machine, you’re paying for all that hardware, even if you’re not using it. So if you go for a big VM, make sure you code is as close to ready to go as it can be (i.e. test on a smaller computer) before you move up.
The other option is to create a compute cluster that consistents of lots of smaller computers. This has a couple advantages: first, they’re infinitely scalable (no core limits!), and second, you can play around with two or three nodes (cheap!) till you’re ready to go, then scale up the cluster to hundreds of nodes for your actual computations (unlike a single big VM).
The downside, of course, is that working on a cluster requires software that can handle the fact that you need lots of physically distinct computers to work together. I’ll show you how we can do that with
dask, but it is a complexity to be aware of.
There can also be performance penalties to working with a cluster if your code requires moving data around a lot, since different computers have to communicate with one another by network connection.
Live versus Batch
One last thing to consider when picking your compute is whether you want an active computer (or cluster of computers) that just work for you, or whether you’re ok submitting a job to a queue and waiting for it to run till resources are available.
Active, dedicated computers are nice because if you’re doing data exploration, you can try out code, immediately see how it fails and fix it, or adapt based on what you see in the data or from model training. In what follows, I’ll be focused on that type of Cloud computing.
But the other option is what’s called “batch” computing, and anyone coming from a research background will likely find this more familiar (since this is how most research Cloud platforms work). In batch computing, you write a script and submit it, and it then the cluster finds nodes that are free to run the code, but you aren’t interactively connected to things while this happens. And if your job needs more resources than are free, it knows how to put jobs in a line and distribute them when resources become available.
This option can be better if you want to run a LOT of jobs (e.g. thousands) because this kind of compute is generally cheaper, especially if you are ok using “low priority nodes”. Low priority nodes are nodes that Azure could re-claim at any time if someone decides they’ll pay for the priviledge, but which are 80% cheaper than dedicated nodes as a result. This may sound awful, but batch computing systems keep track of when a node is killed for some reason, and puts the job it was doing when it died back in the queue to ensure everything still gets done.
If you’re interested in batch computing, the relevant keywords to find these services at Azure are Azure Batch and Azure CycleCloud (which supports schedulers like SLURM). Here’s a little Batch how-to video.
Storage is one of the coolest parts of the Cloud. That’s because: it’s (effectively) infinite! Not Dropbox file size limits, not harddrive constraints. If you can afford it (details of pricing below), you can store it.
When you get Cloud storage, you should be aware it comes in many flavors. Most importantly for our purposes (and in order of performance) in varies in what it will accept, how quickly it can be accessed, and the level of data redundancy in storage.
Unlike your computer — that just has “files” — cloud storage offers some specialized storage for data with specfic properties. In general, however, you will probably always want to work with Blob storage. Blob is the most flexible type of storage, as it can hold anything (BLOB standands for Binary Large OBject storage). So unless you have reason to do otherwise, just use Blob storage.
BUt so you know, other types of storage include:
File storage: similar to Blob. File storage supports certain file sharing protocols that people may have been using before they came to the cloud, so makes life easy for people moving to the cloud. But fewer features than BLOB, so almost surely not for you. Sounds like the storage that’d be most familiar to a normal computer user, but it’s a trick. Use BLOB.
Queue storage: specialized storage for messaging services.
Table storage: What’s known as a NoSQL database system.
Speed of Access
Storage usually comes in a couple speed tiers (sometimes called “hot”, “cold” and “archive”), the the faster the storage, the more expensive.
If you’re doing active data science, you almost surely want “hot” storage. Archive is for data you hopefully never have to look at again (loading it can take hours), and I guess you could use “cold” to save a little money when you’re storing data you don’t plan to see again for months? But usually use hot.
All cloud storage is backed up / duplicated, but how aggressively varies. For example, most cloud storage offers a “geographic redundancy”, which means it has copies in different data centers, so if one data center goes down you can still access your data. You can pay for this, but really that’s for people who use storage to back a website who never want their customers to be unable to visit their site even if a data center is down for a few hours (which I think is super rare?).
So what’s storage cost? Here’s a full list of prices for Azure, but for what you’re most likely to use (hot, locally redundant Blob storage), probably about 20 dollars per TB per month. So not nothing, but not awful.
The last thing to be aware of is that if you decide to run a cluster (multiple computers networked together) or connect storage in certain ways, you may have to connect them all to a Virtual Network to make it easier for them to see one another.
A Final Note on Complexity¶
A final note on complexity: because the Cloud is used for so many different purposes, and by such different users (everything from small businesses to Twitter and Netflix), it is almost infinitely configurable. If you want, you can just rent dedicated control of a couple physical computers, and not only control them, but also control all the network systems that connect them at the lowest level possible.
In these tutorials, though, I’m going to assume you’re an applied data scientist interested in analysis, and so I’m gonna do my very, very best to provide the smoothest on-ramp to these services as I can. I’ll show you only what I think you need to know to feel comfortable using the tools you’re using and to have a somewhat generalizable understanding of what’s going on (so if things change a little in the future, you don’t feel lost) without getting into the infinite possibilities of the cloud.
In addition, I will do my best to use the tools you already know as much as possible: Python, pandas, and dask. Services like Amazon AWS and Microsoft Azure have all sorts of specialty tools for machine learning, and some day you may decide that you prefer those to what I’m gonna show you. But I’m a firm believer in minimizing the number of tools you need to learn, so I’ll try and keep things as focused as I can.
Finally, if, when you’re done, you say “BUT I WANT TO CONFIGURE MY OWN VIRTUAL NETWORK AND DOCKER IMAGES AND KUBERNETES CONFIGURATIONS!” (don’t worry, none of that should mean anything to you), then good for you! The internet is full of in depth tutorials on those things. But if you just want to start using the Cloud to manipulation your data, then hopefully by the end of this section, you’ll feel good to go!
Let’s Get Going!¶
OK, enough abstract concepts – let’s get setup on Azure and start doing some data science!