Plotting with Altair

Plotting is one of the most important tools of data science, not only for effectively communicating findings to others, but also for exploring and understanding data for oneself. Plotting data allows us to leverage the astounding human ability to recognize patterns visually to help us understand our data. To do so effectively, however, we need to think carefully about how our visual pattern recognition abilities work, and how we can use that knowledge to effectively present our data.

In this reading, we will provide an overview of an approach to thinking about visualizations – the grammar of graphics – as well as the Altair plotting library. As we will see, the way Altair works embodies the logic of how we use plots to communicate information, and as a result the very way the library works will help us organize our thinking when plotting.

The Grammar of Graphics

The Grammar of Graphics is a framework for thinking about visualization developed by Leland Wilkinson. Its core idea is that any visualization can be decomposed into several constituent parts:

The Data

At the core of any visualization, of course, is the data that we’re hoping to visualize.

Marks

To visualize our data, we must of course represent the data with actual marks on our figure. These include not only the axes that give our figure form, but also points, circles, bars, or other geometric shapes that populate our figure.

Encoding

The encoding of a figure is where the magic happens. In order to represent our data in a figure, we must encode different aspects of our data into the various visual features (channels) of our figure. For example, in a simple scatter plot, we encode information about one variable in the location of points along the x-axis, and encode information about another variable in locations along the y-axis. Then in the resulting figure, we would say that we have encoded information about two input variables in the location of points in our figure.

But in the Grammar of Graphics, there are many ways information can be encoded in a figure. In a scatter plot, for example, we can encode information in the location of points, but we can also encode information in the size of points (e.g. making points larger for, say, more populous countries), or the shape and color of points. Each of these different ways information can be encoded is called a “channel.”

Putting these together, we can say that when we make a figure, we are communicating information about our data by encoding information about the value of different variables in the different channels made possible by our marks.

(To be clear, there are some really interesting nuances to Wilkinson’s work that I can’t do justice to in the space I have here, including discussion of scales, which I’m ignoring for now, but those are the basics!)
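
To make the data / marks / encoding decomposition concrete, here is a minimal sketch of a chart written out by hand as a plain Python dictionary, in the style of Vega-Lite (the JSON format that Altair generates). The country names and numbers are made up purely for illustration.

```python
import json

# A hand-written chart description with the three parts discussed above.
# The values are invented; they are not taken from any real dataset.
spec = {
    "data": {
        "values": [
            {"country": "A", "income": 1_000, "mortality": 80.0},
            {"country": "B", "income": 10_000, "mortality": 20.0},
            {"country": "C", "income": 40_000, "mortality": 4.0},
        ]
    },
    "mark": "point",
    "encoding": {
        "x": {"field": "income", "type": "quantitative"},
        "y": {"field": "mortality", "type": "quantitative"},
    },
}

# Each channel ("x", "y") is mapped to exactly one variable in the data.
channel_to_variable = {
    channel: definition["field"]
    for channel, definition in spec["encoding"].items()
}
print(json.dumps(channel_to_variable))
```

Nothing here draws a figure; the point is only that a plot can be fully described by naming the data, the mark, and which variable goes to which channel.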

Altair

In this course, we will be doing our plotting using the Altair plotting library. We use Altair not only because it generates gorgeous figures and has native support for interactive web graphics, but also because the way it works (its API) embodies the idea that the meaning of a figure can be decomposed into data, marks, and an encoding of variables onto distinct channels.

(We’ll talk more about other plotting libraries and the relative merits of Altair in another reading, don’t worry!)

To illustrate the basics of Altair, let’s begin with a toy dataset provided by the vega_datasets library with data on the features of different models of cars, such as horsepower and mileage:

[1]:
# Standard imports
import pandas as pd
import numpy as np
import altair as alt
[2]:
from vega_datasets import data
cars = data.cars()
cars.sample(5)
[2]:
Name Miles_per_Gallon Cylinders Displacement Horsepower Weight_in_lbs Acceleration Year Origin
260 chevrolet malibu 20.5 6 200.0 95.0 3155 18.2 1978-01-01 USA
389 honda accord 36.0 4 107.0 75.0 2205 14.5 1982-01-01 Japan
360 volkswagen jetta 33.0 4 105.0 74.0 2190 14.2 1982-01-01 Europe
167 buick century 17.0 6 231.0 110.0 3907 21.0 1975-01-01 USA
335 mercedes-benz 240d 30.0 4 146.0 67.0 3250 21.8 1980-01-01 Europe

To build an Altair figure, we begin by making a new Chart object and passing it the data we wish to plot with Chart(), then specify the basic marks we want on our plot.

In this case, let’s begin by looking at the relationship between mileage (Miles_per_Gallon) and engine power (Horsepower) in a scatter plot. For that, we’ll want to make a chart with points, so we begin:

[3]:
import altair as alt
alt.Chart(cars).mark_point()
[3]:

While we have now given this Chart data, and specified we’ll be adding points to the chart, we haven’t actually told it which variables will determine the location of points yet. As a result, this code does not generate a useful figure: with no encoding, every point is drawn in the same place. But if we tell it we want to encode Miles_per_Gallon to the x-axis and Horsepower to the y-axis, we get:

[4]:
alt.Chart(cars).mark_point().encode(x="Miles_per_Gallon", y="Horsepower")
[4]:

Voila! It’s as simple as that.

But of course, a simple plot with toy data always seems easy, so let’s explore Altair more with real data and a concrete question.

Altair, GDP per Capita, and Economic Development

To better demonstrate not only the ins-and-outs of the Altair plotting library, but also how plotting can be used for exploratory data analysis, let’s take a look at the relationship between average income and different measures of human development.

Among people who are interested in trying to improve living conditions in developing countries around the world, one canonical debate is about whether the best way to help those most in need is to focus on encouraging economic growth writ large, or whether it would make more sense to focus on specific interventions around things like infant mortality or the education of girls.

Many economists argue that encouraging economic growth may actually be the most efficient way of achieving improvements in human development, not because economic growth itself is intrinsically valuable, but because economic growth tends to be associated with improvements in lots of other things that we care about, presumably because a better functioning economy and wealthier citizens are better able to build the institutions and public service delivery systems necessary to provide access to schools and public health for everyone.

Others argue that efforts to encourage economic growth are targeting an outcome that we don’t actually care about (income) rather than targeting the things that we are actually trying to encourage (higher life expectancy, equal access to education, etc.). In this view, targeting the specific outcomes is much more likely to help us achieve the things that we care about most.

(Of course, it’s worth emphasizing that this way of framing the debate is simplistic – this is not an either/or proposition, and efforts can be directed in both directions. But in a world of scarce resources, practitioners do sometimes have to decide whether they allocate resources to encouraging economic growth through, say, small business loans versus putting money into building schools.)

In an effort to move beyond debating this in the abstract, in this lesson we will use Altair to look at patterns between economic growth (measured in terms of average incomes in different countries, specifically Gross Domestic Product (GDP) per capita) and other development outcomes (life expectancy, literacy, etc.). We will be doing so using data from the World Development Indicators (WDI) database from the World Bank, which provides country-level data on a range of outcomes.

Loading WDI Data

The data we’ll be using can be found here. This is basically straight from the WDI website, though I’ve made a couple small formatting changes to make life a little easier.

Let’s begin by loading the data and getting a quick sense of what we have:

[5]:
wdi_data = (
    "https://raw.githubusercontent.com/nickeubank/"
    "practicaldatascience/master/Example_Data/wdi_plotting.csv"
)
world = pd.read_csv(wdi_data)
world.sample(5)

[5]:
Year Country Name Country Code GDP per capita (constant 2010 US$) Population, total CO2 emissions (metric tons per capita) Mortality rate attributed to household and ambient air pollution, age-standardized (per 100,000 population) PM2.5 air pollution, population exposed to levels exceeding WHO guideline value (% of total) Life expectancy at birth, total (years) Mortality rate, under-5 (per 1,000 live births) Literacy rate, youth female (% of females ages 15-24)
7663 2006 France FRA 40850.358740 63621376.0 5.840804 NaN NaN 80.812195 4.5 NaN
2675 1983 Gambia, The GMB 903.542449 700198.0 0.230432 NaN NaN 48.645000 212.9 NaN
6709 2001 Turkmenistan TKM 2458.503909 4564087.0 8.345590 NaN NaN 63.842000 66.0 NaN
2478 1982 Iran, Islamic Rep. IRN 5322.364617 41869231.0 3.296157 NaN NaN 53.126000 91.8 NaN
6143 1999 Finland FIN 38277.282550 5165474.0 10.939945 NaN NaN 77.291220 4.4 NaN

We can also look at the columns in the dataset a little more systematically:

[6]:
for c in world.columns:
    print(c)
Year
Country Name
Country Code
GDP per capita (constant 2010 US$)
Population, total
CO2 emissions (metric tons per capita)
Mortality rate attributed to household and ambient air pollution, age-standardized (per 100,000 population)
PM2.5 air pollution, population exposed to levels exceeding WHO guideline value (% of total)
Life expectancy at birth, total (years)
Mortality rate, under-5 (per 1,000 live births)
Literacy rate, youth female (% of females ages 15-24)

And finally, we can look at the years that are included in the data:

[7]:
world.Year.describe()
[7]:
count    10850.000000
mean      1995.500000
std         14.431535
min       1971.000000
25%       1983.000000
50%       1995.500000
75%       2008.000000
max       2020.000000
Name: Year, dtype: float64

And the number of countries (although we can already see from the sample above that not all countries have all variables defined in every year):

[8]:
# How many countries?
world["Country Name"].nunique()
[8]:
217
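
Since not every variable is defined for every country in every year, it can help to measure how much is missing before plotting. The following is a minimal sketch on a made-up miniature of the WDI layout (toy_world, its shortened column names, and its values are assumptions for illustration); the same two calls could be applied to world.

```python
import numpy as np
import pandas as pd

# Made-up miniature of the WDI layout: one row per country-year.
toy_world = pd.DataFrame(
    {
        "Year": [2017, 2018, 2017, 2018],
        "Country Name": ["A", "A", "B", "B"],
        "GDP per capita": [1000.0, 1100.0, np.nan, 5200.0],
        "Literacy rate": [np.nan, np.nan, np.nan, 91.0],
    }
)

# Share of missing values in each column, from most to least complete.
missing_share = toy_world.isna().mean().sort_values()
print(missing_share)

# Number of non-missing observations per year for one variable.
gdp_obs_per_year = toy_world.groupby("Year")["GDP per capita"].count()
print(gdp_obs_per_year)
```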

Basic Descriptive Plotting

Given our interest in the relationship between average income and human development, let’s start by looking at the relationship between GDP per capita and the standard measure of the quality of public health in a country: the mortality rate for young children.

To start, let’s also limit our attention to data from a single year – we can look at variation over time later.

[9]:
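# .copy() gives us an independent DataFrame, so adding columns to it
# later does not trigger pandas' SettingWithCopyWarning
world = world[world.Year == 2018].copy()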
[10]:
alt.Chart(world).mark_point().encode(
    x="GDP per capita (constant 2010 US$)",
    y="Mortality rate, under-5 (per 1,000 live births)",
)
[10]:

Here we immediately see one of the great things about plotting – we can instantly recognize that this relationship is not remotely linear, which is important to know because it means any analysis designed to measure linear relationships (like a simple correlation or a linear regression) would be misleading. Moreover, not only do we see it’s non-linear, but we can get a sense of the way in which it’s non-linear!

(Yes, there are ways to learn this without plotting, but those approaches tend not to tell us what functional form the relationship actually takes, and can often be fooled by certain non-linear relationships in a way your eye is not! Indeed, even the standard way of validating the linearity assumption in a linear regression is to plot your residuals for precisely this reason.)

So let’s log our variables to get something more linear and easy to interpret:

[11]:
world["log_gdp_per_cap"] = np.log(world["GDP per capita (constant 2010 US$)"])
world["log_under5_mortality_rate"] = np.log(
    world["Mortality rate, under-5 (per 1,000 live births)"]
)
[12]:
alt.Chart(world).mark_point().encode(
    x="log_gdp_per_cap",
    y="log_under5_mortality_rate",
)
[12]:

Much better. Here we can clearly see a nice, linear relationship between logged GDP per capita and logged child mortality.
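
Why does logging both variables straighten the curve? One common reason is a power-law relationship: if y = a * x**b, then log(y) = log(a) + b * log(x), which is a straight line with slope b. The sketch below checks this numerically; the values of a and b and the income levels are made up for illustration and are not estimates from the WDI data.

```python
import math

# Hypothetical power-law relationship: mortality = a * income ** b.
a, b = 5000.0, -0.8
incomes = [500.0, 2_000.0, 8_000.0, 32_000.0]
mortality = [a * x**b for x in incomes]


def slopes(xs, ys):
    """Slope of the straight line between each pair of neighbouring points."""
    return [
        (y1 - y0) / (x1 - x0)
        for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:]))
    ]


# On the raw scale the slope changes from segment to segment (a curve).
raw_slopes = slopes(incomes, mortality)

# On the log-log scale every segment has the same slope, b (a straight line).
log_slopes = slopes(
    [math.log(x) for x in incomes],
    [math.log(y) for y in mortality],
)

print(raw_slopes)
print(log_slopes)
```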

But we now come to one of the quirks of Altair – by default, it starts quantitative axes at zero! And while there are situations where including zero is important to provide proper context to our data, here it just adds white space to our plot without providing any meaningful information.

Encoding Channel Objects

In order to change our x-axis, we need to introduce Channel Objects.

By default, you can simply use the name of a column to specify what variable should be encoded in a given channel in Altair. Above, for example, we encoded log_gdp_per_cap to the x channel by just entering x = "log_gdp_per_cap". But if we want to start modifying how log_gdp_per_cap is being encoded to the x channel, we need to pass a Channel Object instead of just the name of a column (e.g. x = alt.X("log_gdp_per_cap")). These objects are useful because in addition to accepting the name of a variable as the first argument, they also support lots of options. For example, to tell Altair it doesn’t need to include zero on the x-axis, we could use the following code:

[13]:
alt.Chart(world).mark_point().encode(
    x=alt.X("log_gdp_per_cap", scale=alt.Scale(zero=False)),
    y="log_under5_mortality_rate",
)

[13]:

This, admittedly, does feel a little cumbersome at first, but the advantage is that it has a nice logic to it – we don’t think of the x-axis as a separate object we need to modify, but rather as a facet of how we are encoding data from log_gdp_per_cap to the x-axis location channel.

These objects exist for all channels – so there’s an alt.Y() for the y channel, an alt.Size() for the size channel, etc.

Adding Channels

So far we’ve made great progress on the question we were asking: does average income seem to be strongly correlated with childhood mortality? Yes, they are clearly strongly correlated, though with some dispersion of outcomes at each income level.

But as we saw above, we have more than 200 countries in this dataset, and each country is represented by a single point in that figure. This is equitable, but perhaps misleading – if we’re interested in human welfare, then not all countries are equally important; after all, nearly half of all people alive live in only 7 countries! What if this relationship between income and childhood mortality holds for small countries but not big ones, and we can’t see it because all countries are being plotted the same way?

One way to address this is to encode additional data – the population of each country – to another channel. The location of points on x and y axes are, of course, the most obvious encoding channels in a visualization, but they are far from the only ones available! Indeed, Altair allows information to be encoded in a range of mark features, including:

  • Color

  • Size

  • Shape

  • Stroke (for lines)

  • Opacity

and more.

So let’s add one of these to our figure – let’s encode population in the size of our points, so big countries get bigger points! That way we can see if big countries also seem to be following the trend we see:

[14]:
alt.Chart(world).mark_point().encode(
    x=alt.X("log_gdp_per_cap", scale=alt.Scale(zero=False)),
    y="log_under5_mortality_rate",
    size="Population, total"
)

[14]:

And just like that we have added substantial information to our figure! In this case, what we’ve learned is that the linear relationship we see for all countries is similar to the linear relationship we see if we focus only on big countries (since we see the bigger circles distributed all along the same general line as the smaller points), which should reassure us that the relationship we see holds both for the average country and for the average person.

The Encoding Channel Hierarchy

In the figure above, we encoded GDP per capita to each point’s x-axis location, Mortality to each point’s y-axis location, and population to point size. But why are we encoding GDP and Mortality to x-y locations, and population to size? We could have just as easily encoded population in x-axis locations, and GDP in size:

[15]:
world["log_population"] = np.log(world["Population, total"])
alt.Chart(world).mark_point().encode(
    x=alt.X("log_population", scale=alt.Scale(zero=False)),
    y="log_under5_mortality_rate",
    size=alt.Size("log_gdp_per_cap", scale=alt.Scale(zero=False)),
)

[15]:

But when we look at that figure, we immediately get the sense we’ve done something wrong. Why?

The answer is that not all encoding channels are created equal. That’s because our visual pattern recognition system is more sensitive to some channels (like x-y locations) than others (size). In the figure above, for example, the pattern between GDP and Mortality is still present; the problem is that it manifests as the marks being ever so slightly larger at the bottom of the figure than at the top, a difference that’s really hard to see.

(Why is this the case? While it’s always dangerous to play armchair evolutionary biologist, it’s not hard to imagine a reason we are so sensitive to location. Misperceive the location of an object even a little and your spear throw goes awry, or your foot misses the rock you were trying to step on. Misperceive the size of an apple a little and… well, there are probably no consequences!)

This idea of a hierarchy of encoding channels is explored in detail in Jacques Bertin’s Semiology of Graphics. As with the Grammar of Graphics, I won’t go into all of the thoughtful nuance from the original book, but in short Bertin argues that the top of the channel hierarchy is (in decreasing order of desirability):

  • Position

  • Size

  • Color

  • Shape

(There are all sorts of nuance for data that’s categorical versus cardinal, and different rules for points versus lines versus shapes, but crucially in all cases position is at the top!)

And that’s why this figure seems so wrong – we haven’t used position to encode the relationship we care about most, but instead used a mixture of position (for Mortality) and size (for GDP per capita), obscuring what we most want to understand.

In our original figure, by contrast, we encoded a secondary variable of interest (population) into a secondary channel (size). And given we only wanted to quickly see if the relationship we were seeing was a relationship that was true for both large and small countries, this secondary channel was sufficient. But it’s not sufficient for GDP per capita.

Layering & Transforms

One feature of this plot is that we see a significant relationship between GDP per capita and child mortality, but we also see that there is some dispersion of outcomes around that trend – even at the same income level, some countries are outperforming others in terms of reducing child mortality.

To make this easier to visualize, it may be helpful to add a line-of-best-fit (e.g. a linear regression line) to our plot.

To do so, we need to make use of two Altair features: layering and transformations.

Layering is the process of creating distinct charts and then overlaying them. This sounds complicated, but Altair makes it simple, an important feature given how often we want to do things like place a regression line over a scatter plot of our data – you just create two charts and put them together with a + operator!

The second feature we need to make use of is Altair transforms. There are a lot of these built-in data transformation tools, and in general you are better served manipulating your data with pandas before you pass it to Altair rather than using these convenience functions. But at least in my experience, adding a regression line or a non-parametric regression line (like a LOESS fit) is something that I do so often when exploring data that I think it’s worth using.

First we create our base plot:

[16]:
base = (
    alt.Chart(world)
    .mark_point()
    .encode(
        x=alt.X("log_gdp_per_cap", scale=alt.Scale(zero=False)),
        y="log_under5_mortality_rate",
        size="Population, total",
    )
)

Then we create a new plot of our regression fit. We do this by building off our former plot (by starting with base instead of alt.Chart()), adding a transform_regression() method specifying the x and y variables to model, and then adding a mark_line().

NOTE: This practice of “building off” existing Charts is used a lot in Altair. Basically, the idea is that the new Chart inherits all the data and features of the old Chart unless you explicitly overwrite them. So in this case, transform_regression() modifies the data we’re plotting to be that of a linear regression, and mark_line() overrides the original mark_point(). But other features (like the mapping of x to log_gdp_per_cap with a scale that doesn’t have to start at zero) are preserved.

[17]:
fit = base.transform_regression(
    "log_gdp_per_cap", "log_under5_mortality_rate"
).mark_line()
fit
[17]: