Plotting Exercises, Part 1

Exercise 1

Create a pandas dataframe from the “Datasaurus.txt” file using the code:

[1]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/nickeubank/practicaldatascience/master/Example_Data/Datasaurus.txt', delimiter='\t')

Note that the file being downloaded is not actually a CSV file. It is tab-delimited, meaning that within each row, columns are separated by tabs rather than commas. We communicate this to pandas with the delimiter="\t" option ("\t" is how we write a tab, as we will discuss in future lessons).
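To see why the delimiter matters, here is a minimal sketch using a tiny made-up tab-delimited string rather than the real file. The column names `a_x` and `a_y` and the variable names are illustrative only. It reads the same text twice, once with the tab delimiter and once with the default comma.

```python
import io

import pandas as pd

# A tiny, made-up tab-delimited "file" held in memory (not the Datasaurus data).
raw = "a_x\ta_y\n1.5\t2.5\n3.0\t4.0\n"

# With delimiter="\t", pandas splits each row on tabs and finds two columns.
tabbed = pd.read_csv(io.StringIO(raw), delimiter="\t")

# With the default comma delimiter, each row is read as one single column.
wrong = pd.read_csv(io.StringIO(raw))

print(tabbed.shape)
print(wrong.shape)
```

If a dataframe you load ends up with a single, oddly named column, a mismatched delimiter is a likely cause.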

Exercise 2

This dataset actually contains 13 separate example datasets, each with two variables named example[number]_x and example[number]_y.

To get a better sense of what these datasets look like, write a loop that iterates over each example dataset (numbered 1 to 13) and prints the mean and standard deviation of example[number]_x and example[number]_y, as well as the correlation between them.

For example, the first iteration of this loop might print something like:

Example Dataset 1:
Mean x: 23.12321978429576,
Mean y: 98.23980921730972,
Std Dev x: 21.2389710287,
Std Dev y: 32.2389081209832,
Correlation: 0.73892819281

(Though you shouldn’t get those specific values.)
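One way to handle the example[number]_x naming pattern is to build each column name inside the loop with an f-string. The sketch below does this on a small made-up dataframe (`toy`, with only two datasets and invented values) and computes only the means. It is an illustration of the pattern, not the solution.

```python
import pandas as pd

# Made-up stand-in data with the same naming pattern (NOT the Datasaurus values).
toy = pd.DataFrame({
    "example1_x": [1.0, 2.0, 3.0],
    "example1_y": [2.0, 4.0, 6.0],
    "example2_x": [10.0, 20.0, 30.0],
    "example2_y": [3.0, 2.0, 1.0],
})

means = {}
for i in range(1, 3):  # the real data would use range(1, 14)
    # Build the column names for dataset i from the loop counter.
    x_col = f"example{i}_x"
    y_col = f"example{i}_y"

    means[i] = (toy[x_col].mean(), toy[y_col].mean())

    print(f"Toy Dataset {i}:")
    print(f"Mean x: {means[i][0]}")
    print(f"Mean y: {means[i][1]}")
```

The other statistics follow the same shape: pandas Series also provide `.std()`, and `.corr(other)` gives the correlation between two columns.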

Exercise 3

Based only on these results, discuss with your partner what you might conclude about these example datasets. Write down your thoughts.

Exercise 4

Write a loop that iterates over these example datasets, and using the plotnine library, plot a simple scatter plot of each dataset with the x variable on the x-axis and the y variable on the y-axis. Save these plots as PDFs somewhere you can find them.

Hint: When writing this type of code, it is often best to start by writing code to do what you want for the first iteration of the loop. Once you have code that works for the first example dataset, then write the full loop around it.

Exercise 5

Review your plots. How does your impression of these datasets differ from what you wrote down in Exercise 3?

Want More Practice?

Download the ACS data you were working with in our last exercise (US_ACS_2017_10pct_sample.dta, available here), and let’s use it to study how the gender wage gap has changed over generations in the United States.

Once you have downloaded the data, plot income among working women and income among working men against age to see how the gender wage gap has evolved from cohort to cohort.

Then repeat this analysis by race. Note that in the US Census, “Hispanic” is not considered to be a racial category, and so Hispanic Americans are not identified in the race variable but instead in the hispanic variable. For this analysis, therefore, we want to separate our sample into “White, Non-Hispanic”, “White, Hispanic”, “Black”, and “Other”.

Absolutely positively need the solutions?

Don’t use this link until you’ve really, really spent time struggling with your code! Doing so only results in you cheating yourself.

Link