Big Data Strategies

Step 1: Do I Have “Big Data”?

Oddly, this is not actually a straight-forward question for two reasons:

  1. Programs often copy data in order to manipulate it. When you, say, sort an array, what your computer actually does is create a new empty array, then move over entries from your original array into the new array one entry at a time (in sorted order). While it does then erase the original array, while it’s working it effectively has two copies of your array. As a result, the fact that you have 16gb of RAM doesn’t mean you can easily work with a 14gb file. As a general rule, you need at least two times as much memory as your file takes up when first loaded.

  2. The size of data changes as you change its format. For example, if you’re reading data from an inefficient file format (like a .csv text file), it may be the case that a file that is 10gb on disk may actually be only a few gb once you load it into a program like Python that knows how to store data more efficiently.

So how do I check to see if my data is fitting in memory?

If your program starts using more space than you have main memory, your operating system will usually just start using your harddrive for extra space without telling you (this is called “virtual memory”, and is nice in that it prevents your computer from crashing, though it will slow stuff down a lot). As a result, you won’t always get an error message if you try and load a file to is much bigger than main memory.

To avoid this, use one of the tools provided by your operating system to monitor whether it’s using the harddrive:

  • OSX: OSX comes with a program called Activity Monitor in the Applications > Utilities folder. The memory tab of Activity Monitor has a graph at the bottom called “Memory Pressure”. If this is green, you’re good. If it’s yellow or red, you’re actively going back and forth to your harddrive. (Note your computer may be using “virtual memory” without affecting performance if it’s just storing away things you aren’t actively using).

  • Windows: Windows has a program called “Resource Monitor” that shows data on memory use. If you get near 100% Used Physical Memory, your computer is probably using the harddrive.

Step 2: Pick a Strategy

If you have big data, you basically have four options:

  1. Use chunking to trim and thin out your data so it does fit in memory (this works if the data you were given is huge, but the final analysis dataset you want to work with is small).

  2. Buy more memory. Seriously, consider it.

  3. Minimize the penalties of working off your harddrive (not usually practical for data science)

  4. Break your job into pieces and distribute over multiple machines.

Strategy 1: Trim till it fits

At least in my experience, most of the time people have a dataset that’s too big to load into main memory, they’re in a situation where the raw data they want to work with is too big for main memory, but the dataset they are trying to create for analysis will not be. This can happen when the raw data you’re working with:

  • has lots of variables you don’t need for your analysis,

  • has lots of observations you don’t need for your analysis,

  • the data is at a more granular level then you need for analysis (e.g. it’s individual level, but you only need averages for different groups), or

  • where data is formatted in a really efficient manner where better data management would allow it to fit.

In these situations, the go-to strategy is chunking, which is the practice of:

  • loading only a small part of your data, preprocessing it (dropping extra variables, dropping extra observations, taking averages, and/or storing the data in more efficient formats), and saving it away,

  • then recombining all the now substantially smaller bits into a new analysis dataset that has everything you want and fits into data.

Chunking is pretty easy with most data formats. In pandas, for example, you can specify what rows you want to import using either the nrows (to specify you only want the first n rows read into memory), or you can use the combination of skiprows and nrows to get a specific set of rows (i.e. “start right after the number of rows in skiprows, then take nrows from there). Or if you really want to be fancy, you can actually just use read_csv as an interator using iterator=True and chunksize, which will return an object that you can loop over and which pulls in the number of rows you specify in chunksize each pass of the loop.

Types of thinning you may wish to do during import:

  • Drop unneeded variables

  • Convert strings variables to numeric or categorical/factor datatypes (Strings almost always have the largest memory footprint)

  • Drop unneeeded observations

  • Collapse observations

  • Change numeric datatypes to more compressed formats (e.g. using Float32 instead of Float64 when you know you don’t need all the precision offered by Float64. See numbers in machines for a refresher on tradeoffs from different data types. **Be especially careful compressing Ints from Int64 to Int32: it’s hard to overflow Int64, but very easy to overflow Int32!).

Finally, note that some import functions can do this automatically without you having to chunk your data manually. For example, read_csv has a usecols option (where you can specify a subset of columns you want to keep), and a dtypes option (for specifying the data format in which you want data stored as it’s streaming in from disk).

Note: Memory management in Python: If you’re chunking, you need to know about how Python manages memory.

In general, we never think about memory management. When you create an object, Python allocates it memory, and when no one is using an object any more, or when your code finishes running, Python throws it away. Great.

But when you’re pushing the limits of your computer, it’s sometimes important to intervene to encourage Python to delete things sooner than it might otherwise.

In general, Python will only remove something from memory when there is no longer any way for the programmer to access that data. So for example, if you ran the code:

[1]:
x = [1, 2, 3]
x = 'hello!'

In line 1, Python would create your list in memory. Then in line 2, it would delete that list since the only way you previously had to access that list (the variable x) has been reassigned, leaving you incapable of getting to the list.

Why does this matter to you? Because if you have a big thing that you’re done with, and you want to delete it, the only way to do it is to reassign all the variables that previously pointed to it it something else (by convention, you often reassign them to None), or you use the del command:

[2]:
x = [1, 2, 3]
del x

Just remember that del removes the variable, not the object. The object will only disappear if nothing points to it. So in this situation:

[3]:
x = [1, 2, 3]
y = x
del x

Python won’t delete the list because while you deleted x, the list is still accessible via y!

Strategy 2: Buy More Memory

Don’t overlook this option. You can get 100gb of RAM new for about $300 (and this will only go down over time), which means you can easily buy a desktop PC with all the ram and processing power you want for well under a thousand dollars. That’s not a small amount of money, but your time as a data scientist is also valuable (both to you and to whoever is paying you), and if you can obviate having to deal with all these kinds of problems buy just buying a bigger computer (especially if it helps you avoid the strategies described below), it can be an incredibly good investment.

Strategy 3: Minimize disk-access penalties.

There are a number of tools designed to minimize the penalties associated with moving data off disk. HDF5, for example, is a file format that does things like:

  • compresses data (it’s actually faster to move a small amount of compressed data from the harddrive to main memory and then decompress it than it is to move a large amount of uncompressed memory off disk. That’s how big the penality is to read data off disk).

  • can give you select rows or select columns from a dataset without loading the whole thing into memory first

  • stores data in efficient memory formats.

SQL, similarly, is designed to maximize the efficiency by which data can be pulled from disk, but in general if you aren’t in a situation where (a) lots of people are gonna be accessing the same data at the same time, or (b) you don’t plan to do tons of different analyses using the same dataset over a long period, the up front costs of setting up a SQL database overwhelm the benefits it has at the time you want to run a query.

But it’s also worth emphasizing that the strategies described above are generally only efficient if your goal is to pull a subset of data from disk. This makes them useful for things like medical databases, where doctors want to be able to quickly pull the records of individual patients. But in data science, we usually want to load a lot of data at once, and so the ability to quickly pull subsets of data isn’t always super useful.

Strategy 4: Distributed Computing

If absolutely everything else has failed it’s time to turn to distributed computing!

Distributed computing is the practice of taking your task and your data, breaking it into smaller pieces, sending those pieces to different computers, running the “smaller pieces” on those different computers, then bringing back the results and recombining them into a single output.

Distributed computing is amazing, and makes possible many things that would never be possible otherwise. But it is not without two very important substantial costs:

  • Programmer time: it takes a lot of time to learn to use these systems, to set-up these systems, and to find a way to take code that works easily on one computer and parallelize it and distribute it across lots of computers. Note that not all jobs can be parallelized! See parallelism.

  • Computer time: there is a huge fixed cost to break up jobs, moving those jobs between computers, and recombining the results. As a result, if a job can be done on one computer, it will usually run faster on the one computer than distributed across lots of computers.

For these reasons, you should only use distributed computing if the following apply:

  • Your job is highly parallelizable so you can distribute the job across lots of computers.

  • You can’t do the job on one computer either because there isn’t enough space in memory, or because your job is parallelizable and you need far more processors than you can get on one computer to run the job in a reasonable period of time.

    • One obvious case for this are situations where your data doesn’t even fit on the harddrives of one computer. This is a real problem for companies like Google and Facebook, and is actually one of the main reasons that distributed computing exists.

So if you’re in this domain, what options are open to you?

The first big distributed computing platform in the world was Apache Hadoop, but currently (late-2019), the most popular distributed computing platform is probably Apache Spark, which you can kind of think of as Hadoop v2.0. If you want to use Spark, as a Python person you’ll probably do so using PySpark.

Learning to use these distributed systems is outside the scope of this course, but if you want to get into it, you can find a nice little tutorial here.

While Spark is popular, however, it is far from the only option. Most research institutions (Duke included) run research computing clusters using a system called Slurm, where users simply submit their jobs using a simple bash script along with specification for the number of processors, amount of RAM, etc. they want to use, and the system them runs that script when resources are free. A Slurm system doesn’t do as much as PySpark (which manages distribution of data around a system, and has code that parallelizes lots of common tasks), but if you’re doing simple distributed processes (like you want to run a single simulation millions of times), it has much less overhead, is much easier to learn, and is already running on many systems (whereas Spark has to be set up for you).

Exercises!

Great! Now that you have some context for big data, let’s do some exercises.

Exercises