Before we diving into actually doing some exercises, a few notes about reshaping in
So many tools: As is so often the case with
pandas, when there are multiple ways that something can be done,
pandas implements all of them, even it if results in a lot of duplication of funcationality. As a result, there are several sets of methods that do very similar things, and which can be hard to keep straight:
pivot: Tools for moving a dataset from a wide-to-long (
pivot) or long-to-wide (
pivot_table: A more powerful and flexible version of
pivotthat allows for multiple index columns, and allows for aggregations if you want.
pivot, but only for use when your variables are stored in indices. I don’t use these much.
Personally, I find melt and pivot_table most useful.
pivot can be nice, but only works if your “index” column is a single column, which is often not the case.
Also, I have to go look at these pictures… every time I reshape. Don’t hesistate to look them up for help!
What SHOULD I want my data to look like? Just a reminder that a natural question to ask is “in what format should I want my data?” Obviously this question is subjective, but as you read in the Hadley Wickham piece I assigned, this is a good standard definition for a “tidy” dataset to which we might aspire:
Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.
In general, I’m a big fan of this approach, and I think it’s a good one to use (especially numbers 1 and 2).
Note that the question of what forms an “observation” depends on the analysis you plan to run (which depends on the answer you seek to ask! See! EVERYTHING ALWAYS COMES BACK TO THE QUESTION YOU ARE SEEKING TO ANSWER!). Suppose you have data on GDP from different countries for many years. If you plan to run a panel analysis where each observation is a country-year, then each row should be a country-year. But if you plan to do a country-level analysis using measures of, say, GDP variation over time, then each row should be a country!
So in this exercise, we’re going to work with data from the World Development Indicators. Unlike in previous exercises, however, where I cleaned up the data before giving it to you, we’ll be working with the raw data just as you would get it from the World Bank download site. You can find the data here.
(1) Download and import the data.
NOTE: This data downloads as a zip file. Try loading it into
pandas without unzipping it…
(2) In this exercise, we’d like to look at the relationship between GDP and inequality across countries. So before anything else, thin down your dataset by getting only the rows related to:
GDP per capita (in Constant US Dollars – the “constant” means values are have been adjusted for inflation, so values from 1970 can be directly compared with 2010)
Information on income shares (what percentage of total income goes to X% of the income distribution).
NOTE: this is non-trivial. This is a big dataset with LOTS of variables, so you’ll have to use your data wrangling skills to see all the variables names. Remember that if you want to see all the entries in something but python compresses it (so it doesn’t take up too much space), you can loop over it and print each item separately to make it show everything!
(3) OK, that makes our data a lot more manageable, but it’s still pretty heinous.
In our analysis, we know that we’ll want to have country-years as our units of observation, so let’s start reshaping. Come up with a strategy that gets our inequality variables as columns, and makes each of our future observations (i.e. each country-year) a row.
Hint: It can sometimes be helpful to drop useless columns before a reshape.
Other hint: Sometimes you get multi-level column names when reshaping, which you can’t just fix with
.reset_index(). As with multi-indexes for rows, they’re a pain. The two easy ways to fix them are:
Just drop a level:
df.columns.drop_level(0)where 0 is which level you want to drop.
Directly manipulate the column names by doing something to the
df.columnsobject and re-assigning to
df.columns. For example, you can just concatenate two columns levels with a dash between them with
df.columns = map(lambda x: str(x) + "-" + str(x), df.columns).
(4) Now that you’ve done your reshapes, subset for observations since 1980.
(5) Subset for observations for which we have GINI index estimates.
(6) Plot the relationship between log GDP per capita and Gini scores for all country years and add a linear fit as well.
(7) It’s important to not always rely too much on a single measure of something like inequality. To check the robustness of our result, let’s be sure to also look at the relationship between inequality and income by using the income share of the richest 10% instead of just the GINI coefficient. Plot that relationship.