{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Testing in a Data Analysis Workflow\n", "\n", "\n", "It's hard to do much programming without coming across the concept of *testing*: the practice of writing code whose sole purpose is to verify that the programs you write actually do what you expect them to do. Indeed, testing is such a core part of programming that few people would consider posting their code publicly on a service like GitHub without including a suite of tests. Some programmers have even adopted a practice known as *test-driven development*, in which they *start* a project by writing tests that they want the code they plan to write to eventually pass, and then work until those tests pass.\n", "\n", "\n", "But while testing is deeply ingrained in the practice of writing software, data scientists all too often assume testing isn't relevant when they are writing simple scripts that run linearly to load, clean, merge, reshape, and analyze datasets using canned library functions. After all, if the point of tests is to ensure that the code you write is working correctly, and all you use in your simple scripts are functions that come from libraries like numpy and pandas (which are tested by the library maintainers), why would you need to use tests?\n", "\n", "\n", "The answer is twofold. First, it's easy to make coding mistakes even when you aren't writing complicated programs. Operations like merging, grouping, and reshaping are easy to get wrong, so it's good to verify that the results of those operations are what you expect.\n", "\n", "\n", "But the second and bigger reason is that tests are needed in data analysis workflows to **test that what you think is true about your data is actually correct**. 
In other words, writing tests in a data analysis workflow is often not about ensuring you wrote the code you think you wrote, but rather about verifying that your assumptions about the structure and properties of your data, and thus the behavior of the code you write, are correct.\n", "\n", "\n", "## A Simple Example\n", "\n", "\n", "To illustrate, consider the following simple example: I load a small subset of data from the World Bank World Development Indicators. The dataset covers countries, their GDP per capita, and their Polity scores (a measure of political freedom).\n", "\n", "\n", "In the code, I then try to compare the average Polity score for large oil producers to that of all other countries:\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
| | country | region | gdppcap08 | polityIV |
|---|---|---|---|---|
| 0 | Albania | C&E Europe | 7715 | 17.8 |
| 1 | Algeria | Africa | 8033 | 10.0 |
| 2 | Angola | Africa | 5899 | 8.0 |
| 3 | Argentina | S. America | 14333 | 18.0 |
| 4 | Armenia | C&E Europe | 6070 | 15.0 |
| ... | ... | ... | ... | ... |
| 140 | Venezuela | S. America | 12804 | 16.0 |
| 141 | Vietnam | Asia-Pacific | 2785 | 3.0 |
| 142 | Yemen | Middle East | 2400 | 8.0 |
| 143 | Zambia | Africa | 1356 | 15.0 |
| 144 | Zimbabwe | Africa | 188 | 6.0 |

145 rows × 4 columns
\n", "\n", "
| | country | region | gdppcap08 | polityIV | large_oil_producers |
|---|---|---|---|---|---|
| 16 | Brazil | S. America | 10296 | 18.0 | True |
| 67 | Jordan | Middle East | 5283 | 8.0 | False |
| 78 | Lithuania | C&E Europe | 18824 | 20.0 | False |
| 44 | France | W. Europe | 34045 | 19.0 | False |
| 55 | Haiti | S. America | 1177 | 8.0 | False |
\n", "