Setting Up a Cluster on Azure for Data Science¶
As discussed in our last lesson, getting a single large VM is by far the easiest way to move your work to the cloud, and it’s what I would do 99% of the time. But sometimes no single computer is big enough to do all the compute you want to accomplish. When that happens, we turn to a Virtual Machine Cluster, and in particular we’re going to figure out how to set one up with dask.
Setting up a cluster is, unfortunately, a little more complicated than starting up a pre-configured VM. With that in mind, we’re going to move from setting resources up interactively on the Azure website to using the Azure Command Line Tool (CLI). So before we dive in, go read this and follow the directions for installing the tool. Once it’s installed, come on back!
Install relevant packages.¶
First, run the following:
pip install "dask-cloudprovider[azure]"
pip install "dask-cloudprovider[azure]" --upgrade
pip install adlfs lz4 blosc
pip install dask --upgrade
pip install distributed --upgrade
Now we’ll talk through setting up all the resources you need for your cluster using the Azure command line tool.
Log in to Azure by running az login.
Create a new resource group.
If you have an existing resource group you’re welcome to use it. However, when you create your cluster it’ll create all sorts of ancillary resources (mostly related to networking protocols) so if you put them all in a new resource group, it’s much easier to clean everything up when you’re done by just deleting the resource group.
Note that location refers to which of Azure’s data centers you want to connect to. You can see a list of available regions here. If you’re at Duke, it’s probably best to use “eastus”.
az group create --location <location> \
    --name <resource group name> \
    --subscription <subscription>
Create a virtual network.
In order for the computers in your cluster to see one another, they have to be placed in a virtual network. Here’s a basic configuration recommended by the dask team. This will support up to 255 nodes.
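If you’re curious where numbers like 255 come from, Python’s standard ipaddress module can tell you exactly how many addresses each of those prefixes contains (the node limit comes from the size of the /24 subnet):

```python
import ipaddress

# The vnet spans the whole 10.0.x.x range; the subnet is one /24 slice of it.
vnet = ipaddress.ip_network("10.0.0.0/16")
subnet = ipaddress.ip_network("10.0.0.0/24")

print(vnet.num_addresses)      # 65536 addresses in the vnet
print(subnet.num_addresses)    # 256 addresses in the subnet
print(subnet.subnet_of(vnet))  # True: the subnet fits inside the vnet
```

A /24 holds 256 addresses, a handful of which Azure reserves for its own use, which is why the practical ceiling is about 255 nodes.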
az network vnet create -g <resource group name> \
    -n <vnet name> \
    --address-prefix 10.0.0.0/16 \
    --subnet-name <subnet name> \
    --subnet-prefix 10.0.0.0/24
Create a security rule.
By default, when you create a virtual network it’s locked down for security. So the first thing we have to do is create a security rule that allows you to connect to the virtual network from home. This requires creating a Network Security Group and a new rule.
az network nsg create -g <resource group name> \
    --name <security group name>

az network nsg rule create -g <resource group name> \
    --nsg-name <security group name> \
    -n MyNsgRuleWithAsg \
    --priority 500 \
    --source-address-prefixes Internet \
    --destination-port-ranges 8786 8787 \
    --destination-address-prefixes '*' \
    --access Allow --protocol Tcp \
    --description "Allow Internet to Dask on ports 8786,8787."
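Ports 8786 and 8787 are where the dask scheduler and its dashboard listen, respectively. Once your cluster is up, one quick way to sanity-check that the rule is working is a plain TCP connection attempt from your laptop. Here’s a small helper for that (the host and port you pass in are up to you; the scheduler IP below is a placeholder):

```python
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (placeholder IP): can_reach("<scheduler public IP>", 8786)
```

If this returns False against your scheduler’s public IP, the NSG rule (or the cluster itself) is the first thing to check.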
Start your cluster!¶
Now that we have all resources set up to support the cluster, all that’s left is to get it started! The following code will take a few minutes, but you should get a lot of regular notifications about progress.
%load_ext lab_black

from dask_cloudprovider.azure import AzureVMCluster
# Start cluster
cluster = AzureVMCluster(
    resource_group="nce8rg",
    vnet="nce8vn",
    security_group="nce8nsg",
    n_workers=1,
    location="eastus",
)
Creating scheduler instance
Assigned public IP
Network interface ready
Creating VM
Created VM dask-8eec3fa0-scheduler
Waiting for scheduler to run
Scheduler is running
Creating worker instance
/Users/Nick/miniconda3/lib/python3.7/contextlib.py:119: UserWarning: Creating your cluster is taking a surprisingly long time. This is likely due to pending resources. Hang tight!
  next(self.gen)
Network interface ready
Creating VM
Created VM dask-8eec3fa0-worker-29ac2137
Connect to Your Cluster¶
Finally, we need to connect our personal computer to the cluster so we can start assigning it work to do! Thankfully with dask this is super easy. (Not familiar with dask yet? Check out this intro here!).
from dask.distributed import Client

client = Client(cluster)
/Users/Nick/miniconda3/lib/python3.7/site-packages/distributed/client.py:1129: VersionMismatchWarning: Mismatched versions found

+---------+--------+-----------+---------+
| Package | client | scheduler | workers |
+---------+--------+-----------+---------+
| lz4     | 3.1.0  | 3.1.1     | None    |
+---------+--------+-----------+---------+

  warnings.warn(version_module.VersionMismatchWarning(msg["warning"]))
And that’s it! You now have a running dask cluster you’re connected to from home, and you can use it like any other dask session. Want to open your dask Dashboard to see how your cluster is running? Just run cluster.dashboard_link and click the link!
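Once connected, work is dispatched through the Client exactly as with any other dask deployment. So this example runs without any Azure resources, it uses a LocalCluster on your own machine as a stand-in for the AzureVMCluster above; the Client API is identical:

```python
from dask.distributed import Client, LocalCluster

# LocalCluster stands in for AzureVMCluster here so the example is
# runnable anywhere; with a real cluster you'd reuse the `cluster`
# object created above.
cluster = LocalCluster(n_workers=1, threads_per_worker=1)
client = Client(cluster)

# Submit work to the cluster and gather the results back.
futures = client.map(lambda x: x ** 2, range(10))
results = client.gather(futures)
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

client.close()
cluster.close()
```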
Scaling Your Cluster¶
The great thing about Azure VM Clusters is that you can create a cluster that starts with, say, two computers (nodes), then scale it up to 100 nodes once you’ve debugged your code and are ready to run something computationally intensive. When you’re done, those extra nodes will automatically shut down after they’ve been idle for a set period of time, and you’ll be back to two nodes. That means you only pay for the 100 nodes while you’re actually using them. Pretty amazing, honestly.
To scale your cluster, use the cluster.scale method:
Signature: cluster.scale(n=0, memory=None, cores=None)
Docstring:
Scale cluster to n workers

Parameters
----------
n: int
    Target number of workers

Examples
--------
>>> cluster.scale(10)  # scale cluster to ten workers

File: ~/miniconda3/lib/python3.7/site-packages/distributed/deploy/spec.py
Type: method
# Move from 1 worker to 2
cluster.scale(2)
Creating worker instance
Note that scaling is pretty fast, but not instant, so it may take a little time before you see your new workers on your dask Dashboard.
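Manual scale calls work the same way on any dask cluster object. The automatic shrink-when-idle behavior described above comes from dask’s adaptive mode, which you opt into with cluster.adapt. Here’s a runnable sketch using a LocalCluster stand-in (the worker counts are arbitrary):

```python
from dask.distributed import Client, LocalCluster

# LocalCluster stands in for AzureVMCluster; both expose scale() and adapt().
cluster = LocalCluster(n_workers=1, threads_per_worker=1)
client = Client(cluster)

cluster.scale(2)            # manual: ask for exactly 2 workers
client.wait_for_workers(2)  # block until both workers are up

n_workers = len(client.scheduler_info()["workers"])
print(n_workers)  # 2

# Adaptive: let dask grow the cluster up to 10 workers under load
# and shrink it back down as workers go idle.
cluster.adapt(minimum=1, maximum=10)

client.close()
cluster.close()
```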
Closing your Cluster¶
It’s super important that when you’re done you close your cluster! Otherwise Azure will just keep billing you for it running. Just call cluster.close(). (Reminder: it’s good practice to set up billing alerts! See the last section here for details.)
Network interface ready
Creating VM
Created VM dask-8eec3fa0-worker-99699f0c
Terminated VM dask-8eec3fa0-worker-29ac2137
Removed disks for VM dask-8eec3fa0-worker-29ac2137
Deleted network interface
Terminated VM dask-8eec3fa0-worker-99699f0c
Removed disks for VM dask-8eec3fa0-worker-99699f0c
Deleted network interface
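Forgetting to close is easy, so a nice defensive pattern is to create the cluster in a with block, which closes it automatically even if your code raises. Sketched here with a LocalCluster stand-in; dask cluster objects, including AzureVMCluster, support the same context-manager protocol:

```python
from dask.distributed import Client, LocalCluster

# Both the cluster and the client are closed automatically on exit,
# whether or not an error occurred inside the block.
with LocalCluster(n_workers=1, threads_per_worker=1) as cluster:
    with Client(cluster) as client:
        total = client.submit(sum, range(100)).result()
        print(total)  # 4950
```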