Advanced Plotting with Altair

In our last reading, we introduced Altair, and explored how to make basic charts, layer and facet them, and more. In this reading, we’ll learn more about how Altair works, some of its quirks and hidden features (and how they can make your life easier), and how to generate and share interactive graphics.

What is an Altair Chart?

This may sound like a bit of an odd question to even ask – isn’t it just an image?! – but in the case of Altair, it turns out the answer is a little more complicated than you might think.

Altair sits on top of a rather large stack of software libraries. Altair itself is just a Python wrapper for a visualization library called Vega-Lite, which is itself a simplified interface for Vega, which in turn is built on top of D3, a low-level JavaScript visualization library.

Thankfully, most of that isn’t your problem, but it is helpful to know that when you create an Altair chart, what you’re actually generating is a JSON-formatted Vega-Lite specification for your chart. For example this chart:

[Figure: vegalite_line, a line chart of mean miles per gallon by year with a 95% confidence band]

Is actually represented by this JSON file:

{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "data": {"url": "data/cars.json"},
  "encoding": {
    "x": {
      "field": "Year",
      "timeUnit": "year"
    }
  },
  "layer": [
    {
      "mark": {"type": "errorband", "extent": "ci"},
      "encoding": {
        "y": {
          "field": "Miles_per_Gallon",
          "type": "quantitative",
          "title": "Mean of Miles per Gallon (95% CIs)"
        }
      }
    },
    {
      "mark": "line",
      "encoding": {
        "y": {
          "aggregate": "mean",
          "field": "Miles_per_Gallon"
        }
      }
    }
  ]
}

This specification is then compiled by JavaScript code in your browser (or in your Jupyter Notebook) to create the image you see!

This is important to know for a couple reasons:

The first is that this is the reason we need a second library – altair_saver – to save charts as PNGs, PDFs, or other image formats. If you have a chart and just run chart.save(), you can only save it as a JSON file or as an HTML document – we need a different library to actually render the chart into a savable image. (This describes Altair 4; in Altair 5 and later, chart.save() can also write PNG, SVG, and PDF files if the vl-convert-python package is installed.)

The second reason is that these Vega-Lite specifications include the data used to make the chart either explicitly or by linking to a source for the data. The reason, again, is that the chart is being generated on the fly in the browser, so the data has to be available at the time of compilation.

As a result, if you are making a chart from a large dataset, the resulting chart specification can become very large. Indeed, for this reason Altair will throw an error if you try to make a chart from a dataset with more than 5,000 rows, since doing so has the potential to generate pretty big files.

If you want to make a chart from a large dataset…

There are several workarounds, depending on your use-case.

If you’re just exploring your data:

There are three pretty simple workarounds:

  • The first is to just turn off this error and ignore the problem by running alt.data_transformers.disable_max_rows(). If your data isn’t huge, this is a perfectly good hack. But if you start to find that your Jupyter notebook isn’t working so well any more because your charts are getting big, then…

  • Use the JSON data transformer by running alt.data_transformers.enable('json'). When you make a chart, this will cache the data to disk somewhere, then reference this file in the Vega-Lite JSON rather than writing all your data into the spec. Then when your browser renders the chart, it will just read the data when needed.

  • Install altair_data_server (pip install altair_data_server) and run alt.data_transformers.enable('data_server'). This does the same kind of trick as above (it puts a reference to your data in the Vega-Lite spec rather than a full copy of your data), but without writing anything to disk.

Note that these are good fixes if you just want to see what something looks like, or you want to generate a chart and then save it as a PDF or PNG. But they aren’t great for sharing / hosting online: with the last two, the data isn’t actually saved in your chart, and with the first, the file you share can get very large.

If you want to share your chart / host online:

  • Share a PDF / PNG: As noted above, there are lots of easy tricks if you just want to make and then export a PDF / PNG.

  • Collapse your data first: It’s pretty rare that a figure can really accommodate tens of thousands of data points and retain readability. So one option is to just collapse your data first! For example, if your goal is to make a bar chart showing the number of 911 calls by day of the week, and your dataset has one million 911 calls, just collapse your data first (e.g. one row per day of the week, with the associated number of calls for each day).

  • Host the data at a URL: Vega-Lite specs need to have access to source data, but they don’t have to contain it per se. Indeed, if you look at the Vega-Lite JSON spec above, you can see where it points to the source data ("data": {"url": "data/cars.json"}). So if your data is too big to put into the Vega-Lite spec for some reason, you can also host it elsewhere and provide a URL to the data. Just put the URL you want to use in Chart() when you make your chart!

Altair Quirks & Convenience Features

OK, enough of the nuts and bolts of Altair! Let’s dive into the little quirks and features of Altair that will make your life easier.

Aggregators

As we noted in the last reading, it’s often easier to transform your data before passing it to Altair (in part because of the Altair data size issues noted above). But with that said, in data exploration it’s often really nice to have some quick convenience transformations, and Altair does not disappoint.

In particular, Altair has a small library of aggregators that can be really helpful for data exploration.

To illustrate, let’s look at some bar graphs of our WDI data:

[1]:
import pandas as pd
import numpy as np
import altair as alt

wdi_data = (
    "https://raw.githubusercontent.com/nickeubank/"
    "practicaldatascience/master/Example_Data/wdi_plotting.csv"
)
world = pd.read_csv(wdi_data)
world = world[world.Year == 2018]
for c in world.columns:
    print(c)
Year
Country Name
Country Code
GDP per capita (constant 2010 US$)
Population, total
CO2 emissions (metric tons per capita)
Mortality rate attributed to household and ambient air pollution, age-standardized (per 100,000 population)
PM2.5 air pollution, population exposed to levels exceeding WHO guideline value (% of total)
Life expectancy at birth, total (years)
Mortality rate, under-5 (per 1,000 live births)
Literacy rate, youth female (% of females ages 15-24)

Suppose we wanted to look at the distribution of incomes across countries in the world. We could create bins and run a groupby("bin").count() to create counts-per-bin, then plot that. But Altair will create the bins itself if we pass bin=True, and with y="count()" we can ask Altair to count the number of observations per bin too!

[14]:
c = alt.Chart(world).mark_bar().encode(
    x=alt.X("GDP per capita (constant 2010 US$)", bin=True), y="count()"
)
c

[14]:
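For comparison, the do-it-yourself route mentioned above might look something like this in pandas. The incomes and the bin edges are made up for illustration; Altair picks its own bin edges when you pass bin=True.

```python
import numpy as np
import pandas as pd

# Made-up incomes, standing in for the GDP per capita column.
incomes = pd.DataFrame(
    {"gdp_per_cap": [800, 1500, 4200, 9000, 12000, 31000, 45000, 47000]}
)

# Bin by hand, then count the observations in each bin.
edges = np.arange(0, 60_000, 10_000)  # 0, 10k, ..., 50k
incomes["bin"] = pd.cut(incomes["gdp_per_cap"], bins=edges)
counts = incomes.groupby("bin", observed=False)["gdp_per_cap"].count()
print(counts)
```

That’s not a lot of code, but bin=True and count() get you the same picture in a single line while you’re exploring.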

Similarly, suppose we wanted to know average CO2 emissions per GDP per capita bin. We could do a groupby() ourselves, or we could use mean():

[3]:
alt.Chart(world).mark_bar().encode(
    x=alt.X("GDP per capita (constant 2010 US$)", bin=True),
    y="mean(CO2 emissions (metric tons per capita))",
)

[3]:

The data is a little quirky as we have no data for countries with GDP per capita above 120,000 (I think there’s only one country up there), so we might need to tweak a little:

[4]:
alt.Chart(
    world[world["GDP per capita (constant 2010 US$)"] < 120000]
).mark_bar().encode(
    x=alt.X("GDP per capita (constant 2010 US$)", bin=True),
    y="mean(CO2 emissions (metric tons per capita))",
)

[4]:

Transforms

Similarly, Altair also offers a few in-line data transformations, such as transform_regression() and transform_loess(), which we saw in our last reading. The full set is available here. Many of these are things I can’t imagine you’d ever want in place of a simple pandas data manipulation (things like “filter” (subset) or “sample” (take a random sample), which are so easy to do in pandas), but there are a couple I like for exploratory data analysis.

For example, transform_density() gives us the distribution of CO2 emissions across countries!

[5]:
alt.Chart(world).transform_density(
    "CO2 emissions (metric tons per capita)",
    as_=["CO2 emissions (metric tons per capita)", "density"],
).mark_area().encode(
    x="CO2 emissions (metric tons per capita)",
    y="density:Q",
)

[5]:

Tidy Data: An Implicit Requirement

You may not have noticed it, but there’s also an implicit data requirement for Altair: your dataset must be organized in the tidy data format.

Tidy data is data where:

  • Every column is a variable.

  • Every row is an observation.

  • Every cell is a single value.

As evidenced by the fact I haven’t had to explicitly say this yet, this is how most data scientists expect their data to be organized. But it is worth emphasizing that this is a requirement, and if you have data where, say, your columns are values (not variables), like this example from the paper linked above:

#>   religion  `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
#> 1 Agnostic       27        34        60        81        76       137        122
#> 2 Atheist        12        27        37        52        35        70         73
#> 3 Buddhist       27        21        30        34        33        58         62
#> 4 Catholic      418       617       732       670       638      1116        949
#> 5 Don’t kn…      15        14        15        11        10        35         21
#> 6 Evangeli…     575       869      1064       982       881      1486        949
#> # … with 12 more rows, and 3 more variables: $100-150k <dbl>, >150k <dbl>,

You’ll need to reorganize your data before using Altair so that you have a column for “religion”, a column for “income”, and a column for “number of respondents”, e.g.:

#>   religion income  frequency
#> 1 Agnostic <$10k          27
#> 2 Agnostic $10-20k        34
#> 3 Agnostic $20-30k        60
#> 4 Agnostic $30-40k        81
#> 5 Agnostic $40-50k        76
#> 6 Agnostic $50-75k       137
#> # … with 174 more rows
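The tables above are printed from R. In pandas, one common way to do this kind of reshaping is melt(). Here is a minimal sketch using just the first two rows and three income columns of the wide table:

```python
import pandas as pd

# First two rows and three income columns of the wide table above.
wide = pd.DataFrame(
    {
        "religion": ["Agnostic", "Atheist"],
        "<$10k": [27, 12],
        "$10-20k": [34, 27],
        "$20-30k": [60, 37],
    }
)

# melt() turns the income columns into (income, frequency) pairs.
tidy = wide.melt(id_vars="religion", var_name="income", value_name="frequency")

# Group each religion's rows together, keeping the income order within it.
tidy = tidy.sort_values("religion", kind="stable").reset_index(drop=True)
print(tidy)
```

The result has one row per religion-income pair, which is the shape Altair expects.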

Interactive Charts

As previously noted, one of the more distinctive features of Altair as a plotting library is that it has built-in support for interactive graphics. These range from simple mouseover effects to the ability for users to make selections on one chart and observe the effects of the selections on a second.

Zoom and Mouseover Effects

The simplest but perhaps most useful forms of interaction with Altair are zoom and mouseover effects: the ability for the user to zoom in, and to roll their mouse over points (e.g. in a scatter plot) and see additional information pop up.

To illustrate the value of this, consider the following figure from our past reading in which we wanted to overlay the names of countries on our scatterplot:

[6]:
world["log_gdp_per_cap"] = np.log(world["GDP per capita (constant 2010 US$)"])
world["log_under5_mortality_rate"] = np.log(
    world["Mortality rate, under-5 (per 1,000 live births)"]
)
base = (
    alt.Chart(world)
    .mark_point()
    .encode(
        x=alt.X("log_gdp_per_cap", scale=alt.Scale(zero=False)),
        y="log_under5_mortality_rate",
        size="Population, total",
    )
)
fit = base.transform_regression(
        "log_gdp_per_cap", "log_under5_mortality_rate"
    ).mark_line()

text = (alt.Chart(world)
    .encode(
        x=alt.X("log_gdp_per_cap", scale=alt.Scale(zero=False)),
        y="log_under5_mortality_rate",
        text="Country Code",
    )
    .mark_text(size=5))

base + fit + text
[6]:

I guess this works for exploratory analysis, but it’s incredibly cluttered and almost unreadable. But what if, instead, we could make this chart so that the names of countries only popped up when you roll your mouse over a country, and where you could zoom in or out easily?

Well, with Altair it turns out that’s trivially easy – we add .interactive() to our base chart, and add a tooltip channel to our encode() function!

Try it – roll your scroll wheel over the plot / scroll on your mousepad, and run your mouse over different points!

[7]:
base = (
    alt.Chart(world)
    .mark_point()
    .encode(
        x=alt.X("log_gdp_per_cap", scale=alt.Scale(zero=False)),
        y="log_under5_mortality_rate",
        size="Population, total",
        tooltip="Country Name"
    )
).interactive()

base + fit
[7]:

And of course, that’s only the most basic functionality. For example, tooltip can take any number of fields:

[8]:
base = (
    alt.Chart(world)
    .mark_point()
    .encode(
        x=alt.X("log_gdp_per_cap", scale=alt.Scale(zero=False)),
        y="log_under5_mortality_rate",
        size="Population, total",
        tooltip=[
            "Country Name",
            "Year",
            "Population, total",
            "Life expectancy at birth, total (years)",
        ],
    )
).interactive()

base + fit
[8]:

Another useful feature of interactivity is allowing users to subset the data in the figure to study different sub-populations. To do so, one has to specify two things:

  • A “selector”: how a subset is being chosen

  • A “condition”: how the chart should respond

To illustrate, here we create a selector (brush = alt.selection_interval()) which will allow the user to drag a box over our chart to select a set of points.

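Then we assign this to our plot with .add_selection(brush). (In Altair 5 and later this method is deprecated in favor of .add_params(brush), which is used the same way here.)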

And finally, we add a condition we assign to the color channel. The syntax of the condition is “If [first argument is true], then do [second argument], otherwise do [third argument]”. So here, when we type:

color=alt.condition(brush, alt.value("blue"), alt.value("grey"))

we are saying “if a point is selected by brush, make it blue, otherwise make it grey.”

[9]:
brush = alt.selection_interval()
alt.Chart(world).mark_point().encode(
    x=alt.X("log_gdp_per_cap", scale=alt.Scale(zero=False)),
    y="log_under5_mortality_rate",
    size="Population, total",
    color=alt.condition(
        brush, alt.value("blue"), alt.value("grey")
    ),
).add_selection(brush)
[9]: