Setting Up a Cluster on Azure for Data Science

As discussed in our last lesson, getting a single large VM is by far the easiest way to move your work to the cloud, and it’s what I would do 99% of the time. But sometimes no single computer is big enough for all the compute you want to accomplish. When that happens, we turn to using a virtual machine cluster, and in this reading we’ll figure out how to set one up with dask.

As this is one of the later readings in a series, I’ll assume you already know the basics of what the cloud is and how Azure Storage works (and that you have an Azure account already!).

Setup

Setting up a cluster is, unfortunately, a little more complicated than starting up a pre-configured VM. With that in mind, we’re going to move from setting up resources interactively on the Azure website to using the Azure Command Line Interface (CLI). So before we dive in, go read this and follow the directions for installing the tool. Once it’s installed, come on back!

Install relevant packages.

First, run the following:

pip install "dask-cloudprovider[azure]"
pip install "dask-cloudprovider[azure]" --upgrade
pip install adlfs lz4 blosc
pip install dask --upgrade
pip install distributed --upgrade
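
Once the installs finish, it’s worth a quick sanity check that everything actually landed before you touch any Azure resources. Here’s a small snippet that does this with only the standard library (importlib.metadata requires Python 3.8+; the package list just mirrors the pip commands above):

```python
# Quick sanity check: print the installed version of each package
# we just installed, or flag it if the install failed.
from importlib import metadata

for pkg in ["dask", "distributed", "dask-cloudprovider", "adlfs", "lz4", "blosc"]:
    try:
        print(f"{pkg}: {metadata.version(pkg)}")
    except metadata.PackageNotFoundError:
        print(f"{pkg}: NOT INSTALLED")
```

If anything shows up as NOT INSTALLED, re-run the corresponding pip command before moving on.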

Setup Resources

Now we’ll walk through setting up all the resources you need for your cluster using the Azure command line tool.

  1. Log in to Azure.

az login
  2. Create a new resource group.

If you have an existing resource group, you’re welcome to use it. However, when you create your cluster it will create all sorts of ancillary resources (mostly related to networking), so if you put them all in a new resource group, it’s much easier to clean everything up when you’re done: you just delete the resource group.

Note that location refers to which of Azure’s data centers you want to connect to. You can see a list of available regions here. If you’re at Duke, it’s probably best to use “eastus”.

az group create --location <location> \
                --name <resource group name> \
                --subscription <subscription>
  3. Create a virtual network.

In order for the computers in your cluster to see one another, they have to be placed in a virtual network. Here’s a basic configuration recommended by the dask team. The /24 subnet gives you roughly 250 usable addresses (Azure reserves a handful in every subnet), which is plenty of room for nodes.

az network vnet create -g <resource group name> \
                       -n <vnet name> \
                       --address-prefix 10.0.0.0/16 \
                       --subnet-name <subnet name> \
                       --subnet-prefix 10.0.0.0/24
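
If you want to double-check the result, something like the following should show the subnet you just created (same placeholders as above; the exact output shape may vary a bit by CLI version):

```shell
az network vnet show -g <resource group name> \
                     -n <vnet name> \
                     --query "subnets[].{name:name, prefix:addressPrefix}" \
                     -o table
```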
  4. Create a security rule.

By default, when you create a virtual network it’s locked down for security, so the first thing we have to do is create a security rule that allows you to connect to the virtual network from home. This requires creating a Network Security Group and a new rule. (Ports 8786 and 8787 are the ones dask uses for the scheduler and the dashboard, respectively.)

az network nsg create -g <resource group name> \
                      --name <security group name>

az network nsg rule create -g <resource group name> \
                           --nsg-name <security group name> \
                           -n MyNsgRuleWithAsg \
                           --priority 500 \
                           --source-address-prefixes Internet \
                           --destination-port-ranges 8786 8787 \
                           --destination-address-prefixes '*' \
                           --access Allow --protocol Tcp \
                           --description "Allow Internet to Dask on ports 8786,8787."
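
To confirm the rule took effect, you can list all the rules in the group (again, same placeholders as above):

```shell
az network nsg rule list -g <resource group name> \
                         --nsg-name <security group name> \
                         -o table
```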

Start your cluster!

Now that we have all resources set up to support the cluster, all that’s left is to get it started! The following code will take a few minutes, but you should get a lot of regular notifications about progress.

[1]:
%load_ext lab_black
from dask_cloudprovider.azure import AzureVMCluster
[2]:
# Start cluster
cluster = AzureVMCluster(
    resource_group="nce8rg",
    vnet="nce8vn",
    security_group="nce8nsg",
    n_workers=1,
    location="eastus",
)
Creating scheduler instance
Assigned public IP
Network interface ready
Creating VM
Created VM dask-8eec3fa0-scheduler
Waiting for scheduler to run
Scheduler is running
Creating worker instance
/Users/Nick/miniconda3/lib/python3.7/contextlib.py:119: UserWarning: Creating your cluster is taking a surprisingly long time. This is likely due to pending resources. Hang tight!
  next(self.gen)
Network interface ready
Creating VM
Created VM dask-8eec3fa0-worker-29ac2137
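
A note on the arguments: resource_group, vnet, and security_group are just the names you chose in the steps above (mine were nce8rg, nce8vn, and nce8nsg). AzureVMCluster also accepts options controlling the VM size and Docker image your workers run. The sketch below uses argument names from the dask-cloudprovider documentation at the time of writing, so double-check them against your installed version:

```python
# A sketch of a more customized cluster -- verify these argument
# names against your installed dask-cloudprovider version.
cluster = AzureVMCluster(
    resource_group="nce8rg",
    vnet="nce8vn",
    security_group="nce8nsg",
    n_workers=2,
    location="eastus",
    vm_size="Standard_DS2_v2",           # size of each worker VM
    docker_image="daskdev/dask:latest",  # image the scheduler and workers run
)
```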

Connect to Your Cluster

Finally, we need to connect our personal computer to the cluster so we can start assigning it work to do! Thankfully with dask this is super easy. (Not familiar with dask yet? Check out this intro here!).

[3]:
from dask.distributed import Client

client = Client(cluster)
/Users/Nick/miniconda3/lib/python3.7/site-packages/distributed/client.py:1129: VersionMismatchWarning: Mismatched versions found

+---------+--------+-----------+---------+
| Package | client | scheduler | workers |
+---------+--------+-----------+---------+
| lz4     | 3.1.0  | 3.1.1     | None    |
+---------+--------+-----------+---------+
  warnings.warn(version_module.VersionMismatchWarning(msg[0]["warning"]))

And that’s it! You now have a running dask cluster that you’re connected to from home, and you can use it like any other dask session. Want to see your dask Dashboard to check how your cluster is running? Just run cluster.dashboard_link and click the link!

[4]:
cluster.dashboard_link
[4]:
'http://52.151.219.69:8787/status'
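
By the way, "like any other dask session" really does mean any dask code. Here’s a tiny made-up dask.delayed computation to illustrate: with a client connected, the tasks run on your Azure workers, and without one, dask falls back to the local scheduler, so the same code works either way.

```python
import dask


@dask.delayed
def square(x):
    # Each call becomes a task dask can schedule on any worker.
    return x ** 2


# Build a task graph for the sum of squares of 0..9, then execute it.
total = dask.delayed(sum)([square(i) for i in range(10)]).compute()
print(total)  # 285
```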

Scaling Your Cluster

The great thing about Azure VM Clusters is that you can create a cluster that starts with, say, two computers (nodes), then scale it up to 100 nodes once you’ve debugged your code and are ready to run something computationally intensive. When you’re done, idle nodes can be shut down again after a set period, leaving you back at two nodes. That means you only pay for the 100 nodes while you’re actually using them! Pretty amazing, honestly.

To scale your cluster, use the .scale() method:

[5]:
cluster.scale?
Signature: cluster.scale(n=0, memory=None, cores=None)
Docstring:
Scale cluster to n workers

Parameters
----------
n: int
    Target number of workers

Examples
--------
>>> cluster.scale(10)  # scale cluster to ten workers
File:      ~/miniconda3/lib/python3.7/site-packages/distributed/deploy/spec.py
Type:      method

[6]:
# Move from 1 worker to 2
cluster.scale(2)
Creating worker instance

Note that scaling is pretty fast, but not instant, so it may take a little time before you see your new workers on your dask Dashboard.
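
The automatic scale-down described above comes from dask’s adaptive mode rather than from .scale() itself. If you’d rather have the cluster grow and shrink on its own based on load, you can use the .adapt() method instead. (This is a sketch against the cluster object created earlier, so it isn’t runnable on its own.)

```python
# Let dask add and remove workers automatically based on load,
# keeping the worker count between the given bounds.
cluster.adapt(minimum=2, maximum=100)
```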

Closing your Cluster

It’s super important that when you’re done you close your cluster! Otherwise Azure will just keep billing you for it running. (Reminder: it’s good practice to set up billing alerts! See the last section here for details.)

[ ]:
cluster.close()
Network interface ready
Creating VM
Created VM dask-8eec3fa0-worker-99699f0c
Terminated VM dask-8eec3fa0-worker-29ac2137
Removed disks for VM dask-8eec3fa0-worker-29ac2137
Deleted network interface
Terminated VM dask-8eec3fa0-worker-99699f0c
Removed disks for VM dask-8eec3fa0-worker-99699f0c
Deleted network interface
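
And if you created a dedicated resource group at the start, this is where it pays off: once you’re completely done with the project, you can delete the whole group, and every ancillary resource in it, with one command (same placeholder as before; the --yes flag skips the confirmation prompt):

```shell
az group delete --name <resource group name> --yes
```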