Big Data Exercises

In these exercises we will work with data from a global network of weather monitoring stations to examine long-term temperature trends for your home locality. This data comes from the Global Historical Climatology Network, and is the actual raw data provided by NOAA. The only changes I have made to this data are a few small formatting changes to help meet the learning goals of this exercise.

To do these exercises, first please download the data for this exercise from here. Note this is a big file (this is a big-data exercise, after all), so be patient.

(1) The data we’ll be working with can be found in the file ghcnd_daily.tar.gz. It includes daily weather data from thousands of weather stations around the world over many decades.

Begin by unzipping the file and checking its size. It should come out to be about 4GB on disk, but will expand to about 12GB in RAM, which means there’s just no way most students (who usually have, at most, 16GB of RAM) can import this dataset into pandas and manipulate it directly.

(Note: what we’re doing can be applied to much bigger datasets, but they sometimes take hours to work with, so we’re working with data that’s just a little big so we can get exercises done in reasonable time).

(2) Thankfully, we aren’t going to be working with all the data today. Instead, everyone should pick two weather stations to examine during this analysis (so each pair should pick 4 – different weather stations have different data availability, so by grabbing two each hopefully at least 1 will have a long time series available…).

To pick your stations, we’ll need to open the ghcnd-stations.txt file in the directory you’ve downloaded. It includes station codes (which is what we’ll find in the ghcnd_daily.csv data), as well as the name and location of each station.

When picking a weather station, make sure to pick one flagged as being in either GSN, HCN, or CRN (these designate more formalized stations that have been around a long time, ensuring you’ll get a station with data that has been recorded over a longer period).

Note that Station IDs start with the two-letter code of the country in which they are located, and the “NAME” column often contains city names.

The ghcnd-stations.txt file is a “fixed-width” dataset, meaning that instead of putting commas or tabs between values, each column occupies a fixed range of character positions on every line. So to import this data you’ll have to (a) read the notes about the data in the project README.txt, and (b) read about how to read in fixed-width data in pandas. When entering column specifications, remember that normal people count from 1 and include end points, while Python counts from 0 and doesn’t include end points (so if the readme says data is in columns 10-20, in Python that’d be 9 through 20).
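
As a rough illustration of the counting rule above, here is a minimal sketch using pandas' read_fwf on a tiny made-up sample. The station IDs, names, and column positions below are invented for the example (the helper to_colspec is hypothetical too); take the real positions from README.txt.

```python
import io

import pandas as pd


def to_colspec(first, last):
    """Convert 1-indexed, inclusive README columns to a pandas colspec
    (0-indexed start, exclusive end)."""
    return (first - 1, last)


# Made-up rows laid out as: ID in columns 1-11, latitude in 13-20, name in 22-40.
# These positions are assumptions for this toy sample, not the real file layout.
rows = [("XX000000001", 40.0, "EXAMPLE CITY"), ("XX000000002", -12.5, "ANOTHER TOWN")]
sample = "\n".join(f"{sid:<11} {lat:>8.4f} {name:<19}" for sid, lat, name in rows)

stations = pd.read_fwf(
    io.StringIO(sample),
    colspecs=[to_colspec(1, 11), to_colspec(13, 20), to_colspec(22, 40)],
    names=["id", "latitude", "name"],
    header=None,  # the sample has no header row
)
print(stations)
```

With the real file you would pass the path to ghcnd-stations.txt instead of the StringIO object and list one colspec per column described in the README.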

(3) Now that we know something about the observations we want to work with, we can turn to our actual weather data.

Our daily weather can be found in ghcnd_daily.csv, which you get by unzipping ghcnd_daily.tar.gz. Note that the README.txt talks about this being a fixed-width file. Since you’ve already dealt with one fixed-width file, I’ve just converted this to a CSV, and dropped all the data that isn’t “daily max temperatures”.

Let’s start with the fun part. SAVE YOUR NOTEBOOK AND ANY OTHER OPEN FILES! Then just try and import the data (ghcnd_daily.csv) while watching your Activity Monitor (Mac) or Resource Monitor (Windows) to see what happens.

If you have 8GB of RAM, this should fail miserably.

If you have 16GB of RAM, you might just get away with this. But if it does load, try sorting the data by year and see how things go.

(If you have 32GB of RAM: you’re actually probably fine with data this size. Sorry – datasets big enough to cause big problems for people with 32GB take a long time to chunk on an 8GB computer, and these exercises have to be fast enough to finish in a class period! There are some exercises at the bottom with a REALLY big dataset you can work with.)

You may have to kill your kernel, kill Jupyter Lab, and start over when this explodes…

(4) Now that we know that we can’t work with this directly, it’s good with these big datasets to just import ~200 lines so you can get a feel for the data. So load just 200 lines of ghcnd_daily.csv.
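
One way to do this is the nrows argument of pandas' read_csv. The sketch below runs on a made-up in-memory CSV standing in for ghcnd_daily.csv; the column names and the helper name peek are assumptions for illustration only.

```python
import io

import pandas as pd


def peek(source, n_rows=200):
    """Read only the first n_rows data rows of a CSV (path or file-like object)."""
    return pd.read_csv(source, nrows=n_rows)


# Stand-in for ghcnd_daily.csv: 1,000 made-up rows with assumed column names.
lines = ["id,year,month,value1,value2"]
lines += [f"XX000000001,{1900 + i // 12},{i % 12 + 1},{i},{i + 1}" for i in range(1000)]
fake_csv = io.StringIO("\n".join(lines))

preview = peek(fake_csv)
print(preview.shape)   # rows read, columns found
print(preview.dtypes)  # a quick look at how pandas typed each column
```

On the real data, peek("ghcnd_daily.csv") should return quickly because pandas stops reading after the requested number of rows.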

(5) Once you have a sense of the data, write code to chunk your data: i.e. code that reads the data in blocks small enough to fit in RAM, keeps only the observations for the weather stations you’ve selected to focus on, and throws away everything else.

In addition to your own 4 weather stations, please also include station USC00050848 (a weather station from near my home!) so you can generate results that we can all compare (to check for accuracy).

Note you will probably have to play with your chunk sizes (ideally while watching your RAM usage). That’s because small chunk sizes, while useful for debugging, are very slow.

Every time Python processes a chunk, there’s a fixed processing cost, so in a dataset with, say, 10,000,000 rows, if you try to do chunks of 100 rows, that fixed processing cost has to be paid 100,000 times. Given that, the larger you can make your chunks the better, so long as your chunks don’t use up all your RAM. Again, picking a chunk size then watching your RAM usage is a good way to see how close you are to the limits of your RAM.
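
The chunking pattern described in this step might be sketched as follows, using the chunksize argument of pandas' read_csv. This is one possible shape rather than the intended solution: the function name keep_stations, the "id" column name, and the toy data are all assumptions.

```python
import io

import pandas as pd


def keep_stations(source, station_ids, chunksize, id_col="id"):
    """Read a CSV chunk by chunk, keeping only rows for the chosen stations."""
    kept = []
    for chunk in pd.read_csv(source, chunksize=chunksize):
        # Only the filtered rows stay in memory; the rest of the chunk is dropped.
        kept.append(chunk[chunk[id_col].isin(station_ids)])
    return pd.concat(kept, ignore_index=True)


# Made-up stand-in for ghcnd_daily.csv: 300 rows cycling through three stations.
ids = ["USC00050848", "XX000000001", "XX000000002"]
lines = ["id,year,month,value1"]
lines += [f"{ids[i % 3]},{1950 + i // 36},{(i // 3) % 12 + 1},{i}" for i in range(300)]
fake_csv = io.StringIO("\n".join(lines))

subset = keep_stations(fake_csv, {"USC00050848", "XX000000001"}, chunksize=50)
print(len(subset), sorted(subset["id"].unique()))
```

The toy chunksize of 50 is only for demonstration; on the real file you would start much larger and adjust while watching RAM usage.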

(6) Now, for each weather station, figure out the earliest year with data. Keep USC00050848 and, for each member of your team, the one weather station with the best data (i.e. each member of your pair should have picked two weather stations: keep the better of the two for each member).
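
A groupby is one natural way to find the earliest year per station. The sketch below uses invented years and assumes columns named "id" and "year"; the numbers say nothing about the real stations.

```python
import pandas as pd

# Made-up rows: the years here are invented purely for illustration.
df = pd.DataFrame(
    {
        "id": ["USC00050848", "USC00050848", "XX000000001", "XX000000001", "XX000000001"],
        "year": [1950, 1901, 1987, 1990, 1988],
    }
)

# One row per station: first and last year present, plus a row count.
coverage = df.groupby("id")["year"].agg(first_year="min", last_year="max", n_rows="count")
print(coverage)
```

Looking at the row count alongside the first year can help, since an early start year does not guarantee a complete record.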

(7) Now calculate the average max temp for each weather station / month in the data. Note that in a few weeks, we’ll have the skills to do this by reshaping our data so each row is a single day, rather than a month. But for the moment, just average across the columns, watching out for weird values.

To average across the value columns, we can combine a column selection that keeps just the columns whose names start with “value” with .mean(axis='columns'), which averages across columns (along rows) rather than the usual averaging across rows (along columns).
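
A minimal sketch of that combination on made-up data follows. It uses DataFrame.filter with a regex to select the value columns, and it assumes the "weird value" is a missing-data sentinel of -9999; confirm the sentinel (and the units) against README.txt before relying on it.

```python
import numpy as np
import pandas as pd

MISSING = -9999  # assumed missing-value sentinel; verify in README.txt

# Made-up rows with three value columns instead of one per day of the month.
df = pd.DataFrame(
    {
        "id": ["XX000000001", "XX000000001"],
        "year": [2000, 2000],
        "month": [1, 2],
        "value1": [100, 50],
        "value2": [MISSING, 70],
        "value3": [120, 90],
    }
)

# Keep only columns whose names start with "value".
value_cols = df.filter(regex="^value")

# Turn the sentinel into NaN so .mean() skips it, then average along each row.
df["avg_max"] = value_cols.replace(MISSING, np.nan).mean(axis="columns")
print(df[["id", "year", "month", "avg_max"]])
```

Without the replace step, the sentinel would be averaged in as if it were a real temperature and drag the first row's mean far below zero.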

(8) Now for each weather station, generate a separate plot of the daily temperatures over time. You should end up with a plot that looks something like this:

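
For the plotting step, one possible sketch with matplotlib is shown here. It plots per-month averages (as computed in the previous step) from invented numbers, and the column names "id", "year", "month", and "avg_max" are assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so this runs outside a notebook; drop it in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Made-up monthly averages for two hypothetical stations.
monthly = pd.DataFrame(
    {
        "id": ["XX000000001"] * 3 + ["XX000000002"] * 3,
        "year": [2000, 2000, 2000] * 2,
        "month": [1, 2, 3] * 2,
        "avg_max": [1.0, 3.5, 9.0, 20.0, 21.5, 19.0],
    }
)

# Build a proper date (first of each month) to use on the x-axis.
monthly["date"] = pd.to_datetime(monthly[["year", "month"]].assign(day=1))

figures = {}
for station, group in monthly.groupby("id"):
    fig, ax = plt.subplots()  # one separate figure per station
    ax.plot(group["date"], group["avg_max"])
    ax.set_title(station)
    ax.set_xlabel("Date")
    ax.set_ylabel("Average max temperature")
    figures[station] = fig
```

In a notebook the figures display on their own; in a script you could call figures[station].savefig(...) for each one.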

Want More Practice?

If you really want a challenge, the file ghcnd_daily_30gb.tar.gz will decompress into ghcnd_daily.dat, the full version of the GHCND daily data. It contains not only daily high temps, but also daily low temps, precipitation, etc. Moreover, it is still in fixed-width format, and is about 30GB in raw form.

Importing and chunking this data (with moderate optimizations) took about 2 hours on my computer.

If you’re up for it, it’s a great dataset for practicing wrestling with data in weird formats and chunking.

Pro-tip: strings take up way more space in RAM than numbers, so some columns can be converted to keep the memory footprint of the data down.
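
To see the effect of that tip, here is a small sketch on made-up data that converts a repetitive string column to pandas' category dtype and downcasts an integer column, comparing memory before and after. The column names are assumptions.

```python
import pandas as pd

# Made-up data: 10,000 rows with a highly repetitive string column.
df = pd.DataFrame(
    {
        "id": ["XX000000001", "XX000000002"] * 5000,
        "year": [1999, 2000] * 5000,
    }
)

before = df.memory_usage(deep=True).sum()  # deep=True counts the string contents too

slim = df.assign(
    id=df["id"].astype("category"),                      # store each distinct string once
    year=pd.to_numeric(df["year"], downcast="integer"),  # smallest integer type that fits
)
after = slim.memory_usage(deep=True).sum()

print(f"before: {before:,} bytes, after: {after:,} bytes")
```

The savings from category depend on how often values repeat, so it tends to suit columns like station IDs or flags better than free text.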

Absolutely positively need the solutions?

Don’t use this link until you’ve really, really spent time struggling with your code! Doing so only results in you cheating yourself.