Pandas Lesson 2: DataFrames

In Pandas Lesson 1, we learned about Series: an ordered collection of observations, analogous to a numpy vector but with super-powers.

In this tutorial, we’ll learn about DataFrames, a method of holding tabular data in which each row is an observation, and each column is a variable. (OK, there are other forms of tabular data, but that’s the most common format you’ll encounter.)

To illustrate, here’s a small pandas dataframe (created by importing data from a spreadsheet you can find here):

[1]:
import pandas as pd
smallworld = pd.read_csv('https://raw.githubusercontent.com/nickeubank/practicaldatascience/master/Example_Data/world-very-small.csv')
smallworld
[1]:
country region gdppcap08 polityIV
0 Brazil S. America 10296 18
1 Germany W. Europe 35613 20
2 Mexico N. America 14495 18
3 Mozambique Africa 855 16
4 Russia C&E Europe 16139 17
5 Ukraine C&E Europe 7271 16

As you can see, each of the 6 rows in the DataFrame smallworld is a different country, and each column contains different information about that country: the country’s name, its region, its income level (GDP per capita in 2008), and how close it was to an idealized liberal democracy in 2008 (its Polity IV score).

1. What is a DataFrame

Where a Series was a one-dimensional collection of data, a DataFrame is fundamentally two-dimensional. As a result, it has many of the same types of features as a Series, but generalized to two dimensions.

Index and Columns

For example, like a Series, a DataFrame has an index that labels every row: in this case, it’s the usual default index that labels each row with its initial row number. Unlike a Series, however, DataFrames have a second set of labels: column names!

[2]:
# Here are the row labels
# (Note that a "range index" is just
# another way of labeling each row with its row number)
smallworld.index
[2]:
RangeIndex(start=0, stop=6, step=1)
[3]:
# And here is our column index.
# Note that while we don't call it "index",
# the column names are of type Index.
# They really are the same as row indices,
# just for columns

smallworld.columns
[3]:
Index(['country', 'region', 'gdppcap08', 'polityIV'], dtype='object')
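The row index doesn’t have to be the default row numbers. As a minimal sketch (the pets data and labels below are made up for illustration and are not part of the lesson’s dataset), here is a DataFrame with custom row labels, showing that both sets of labels are Index objects:

```python
import pandas as pd

# Hypothetical toy data, not from the lesson's dataset
pets = pd.DataFrame(
    {"can_swim": [True, False], "has_fur": [True, True]},
    index=["dog", "cat"],  # custom row labels instead of 0, 1
)

print(pets.index)    # row labels
print(pets.columns)  # column labels

# Both sets of labels are pandas Index objects
both_are_index = isinstance(pets.index, pd.Index) and isinstance(pets.columns, pd.Index)
print(both_are_index)
```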

Constructing DataFrames

As with Series, there are many ways to construct a DataFrame. Honestly, by far the most common is that you’ll read in a dataset from a file. Pandas offers lots of tools for doing this depending on the format of the data you’re importing. We’ll discuss this more in future lessons, but here are just a few methods to know about:

  • pd.read_csv: Read in comma-separated-value spreadsheets

  • pd.read_excel: Read in excel (.xls and .xlsx) spreadsheets

  • pd.read_stata: Read stata (.dta) datasets

  • pd.read_hdf: Read HDF (.hdf) datasets

  • pd.read_sql: Read from SQL database

You can find a full list of IO methods here!
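To get a feel for pd.read_csv without needing a file on disk, here is a small sketch that reads CSV text held in memory. The CSV contents are a made-up two-row excerpt, and using io.StringIO is just a convenience for this example; in practice you would usually pass a file path or URL.

```python
import io
import pandas as pd

# Hypothetical CSV contents, held in memory so the example needs no file
csv_text = """country,region,gdppcap08
Brazil,S. America,10296
Germany,W. Europe,35613
"""

# pd.read_csv accepts a path, a URL, or a file-like object
tiny = pd.read_csv(io.StringIO(csv_text))
print(tiny)
print(tiny.dtypes)
```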

But you can also construct DataFrames by hand. The easiest (and most common) way is by passing in a dictionary, where the keys will become column names and the values are column values:

[4]:
df = pd.DataFrame({'animals': ['dog', 'cat', 'bird', 'fish'],
                   'can_swim': [True, False, False, True],
                   'has_fur': [True, True, False, False]})
df
[4]:
animals can_swim has_fur
0 dog True True
1 cat False True
2 bird False False
3 fish True False
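A dictionary of columns isn’t the only option. As a sketch of two other constructions you may run into (the variable names here are made up for illustration), you can also pass a list of dictionaries, one per row, or supply your own row labels with the index argument:

```python
import pandas as pd

# Each dict is one row (observation); keys become column names
rows = [
    {"animals": "dog", "can_swim": True},
    {"animals": "cat", "can_swim": False},
]
df_from_rows = pd.DataFrame(rows)
print(df_from_rows)

# Column-by-column construction, with explicit row labels
df_labeled = pd.DataFrame(
    {"can_swim": [True, False]},
    index=["dog", "cat"],
)
print(df_labeled)
```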

2. Getting To Know Your DataFrame

While our toy smallworld dataset is small enough to easily print out and visualize, most datasets worth working with are too big to just look at. In those situations, we need tools to summarize the contents of our DataFrame.

Let’s load up a version of the smallworld dataset we looked at above that actually has all the countries in the world (instead of just 6). You can find the original dataset here.

[5]:
world = pd.read_csv('https://raw.githubusercontent.com/nickeubank/practicaldatascience/master/Example_Data/world-small.csv')
world
[5]:
country region gdppcap08 polityIV
0 Albania C&E Europe 7715 17.800000
1 Algeria Africa 8033 10.000000
2 Angola Africa 5899 8.000000
3 Argentina S. America 14333 18.000000
4 Armenia C&E Europe 6070 15.000000
5 Australia Asia-Pacific 35677 20.000000
6 Austria W. Europe 38152 20.000000
7 Azerbaijan C&E Europe 8765 3.000000
8 Bahrain Middle East 34605 3.000000
9 Bangladesh Asia-Pacific 1334 16.000000
10 Belarus C&E Europe 12261 3.000000
11 Belgium W. Europe 34493 20.000000
12 Benin Africa 1468 16.200000
13 Bhutan Asia-Pacific 4755 2.000000
14 Bolivia S. America 4278 18.200000
15 Botswana Africa 13392 19.000000
16 Brazil S. America 10296 18.000000
17 Bulgaria C&E Europe 12393 19.000000
18 Burkina Faso Africa 1161 10.000000
19 Cambodia Asia-Pacific 1905 12.000000
20 Cameroon Africa 2215 6.000000
21 Canada N. America 36444 20.000000
22 Central African Republic Africa 736 10.200000
23 Chad Africa 1455 8.000000
24 Chile S. America 14465 19.200000
25 China Asia-Pacific 5962 3.000000
26 Colombia S. America 8885 17.000000
27 Comoros Africa 1169 15.800000
28 Congo Brazzaville Africa 3946 6.000000
29 Congo Kinshasa Africa 321 15.000000
... ... ... ... ...
115 Slovakia C&E Europe 22081 19.000000
116 Slovenia C&E Europe 27605 20.000000
117 Solomon Islands Asia-Pacific 2610 18.000000
118 South Africa Africa 10109 19.000000
119 Spain W. Europe 31954 20.000000
120 Sri Lanka Asia-Pacific 4560 15.333333
121 Sudan Africa 2153 4.000000
122 Swaziland Africa 4928 1.000000
123 Sweden Scandinavia 37383 20.000000
124 Switzerland W. Europe 42536 20.000000
125 Taiwan Asia-Pacific 30881 19.333333
126 Tajikistan C&E Europe 1906 7.666667
127 Tanzania Africa 1263 11.000000
128 Thailand Asia-Pacific 7703 19.000000
129 Togo Africa 829 8.000000
130 Tunisia Middle East 7996 6.000000
131 Turkey Middle East 13920 17.000000
132 Turkmenistan C&E Europe 6641 1.000000
133 UAE Middle East 38830 2.000000
134 Uganda Africa 1165 6.000000
135 Ukraine C&E Europe 7271 16.000000
136 United Kingdom W. Europe 35445 20.000000
137 United States N. America 46716 20.000000
138 Uruguay S. America 12734 20.000000
139 Uzbekistan C&E Europe 2656 1.000000
140 Venezuela S. America 12804 16.000000
141 Vietnam Asia-Pacific 2785 3.000000
142 Yemen Middle East 2400 8.000000
143 Zambia Africa 1356 15.000000
144 Zimbabwe Africa 188 6.000000

145 rows × 4 columns

As you can see, pandas prints out a bunch of the rows, but not all the rows (note the ... in the middle) in an effort to not take over your computer. This DataFrame could theoretically be printed out in its entirety (as noted at the bottom of the output, it only has 145 rows), but in the real world we often work with datasets with hundreds of thousands or millions of rows where printing just isn’t possible. So here are some methods for “getting to know your data”:

[6]:
# Just see first 5 rows:
world.head()
[6]:
country region gdppcap08 polityIV
0 Albania C&E Europe 7715 17.8
1 Algeria Africa 8033 10.0
2 Angola Africa 5899 8.0
3 Argentina S. America 14333 18.0
4 Armenia C&E Europe 6070 15.0
[7]:
# See a random subset of rows (here, 5)
# (the first rows of a dataset aren't always representative!)
world.sample(5)
[7]:
country region gdppcap08 polityIV
134 Uganda Africa 1165 6.0
9 Bangladesh Asia-Pacific 1334 16.0
83 Mali Africa 1128 16.0
54 Guyana S. America 2542 16.0
104 Philippines Asia-Pacific 3510 18.0
[8]:
# Get Number of Rows:
len(world)
[8]:
145
[9]:
# Get number of columns:
len(world.columns)
[9]:
4
[10]:
# Learn the datatype of each column:
world.dtypes
[10]:
country       object
region        object
gdppcap08      int64
polityIV     float64
dtype: object
[11]:
# Get summary statistics for each numeric column (objects are ignored):
world.describe()
[11]:
gdppcap08 polityIV
count 145.000000 145.000000
mean 13251.993103 13.407816
std 14802.581676 6.587626
min 188.000000 0.000000
25% 2153.000000 7.666667
50% 7271.000000 16.000000
75% 19330.000000 19.000000
max 85868.000000 20.000000
[12]:
# List out all the columns. If there are a lot, you can't see them all in the table,
# and if you just do `world.columns`, pandas will often compress that too.
# This will show you all columns:
for c in world.columns:
    print(c)
country
region
gdppcap08
polityIV
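A few other tools can be handy here. The sketch below uses a small hand-typed DataFrame (values copied from the six-country table at the top of the lesson) so it runs without downloading anything; .shape, .value_counts(), and describe(include="all") are standard pandas, but whether they are the best summary for your data is a judgment call.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["Brazil", "Germany", "Mexico", "Ukraine"],
    "region": ["S. America", "W. Europe", "N. America", "C&E Europe"],
    "gdppcap08": [10296, 35613, 14495, 7271],
})

# Number of rows and columns in one step
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# Count how often each value occurs in a column
region_counts = toy["region"].value_counts()
print(region_counts)

# include="all" summarizes non-numeric columns too
print(toy.describe(include="all"))
```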

3. Subsetting a DataFrame

As with Series, one of the most important skills for working with DataFrames is knowing how to subset them. Thankfully, DataFrames work kind of like a two-dimensional generalization of Series when it comes to the use of iloc and loc.

iloc

To subset a DataFrame using iloc, we now have to pass two arguments into iloc separated by a comma. For example, if we wanted the entry in the fourth row of the first column, we would use:

[13]:
world.iloc[3, 0]
[13]:
'Argentina'

Similarly, iloc still supports slices. Here are the first two rows of the first three columns:

[14]:
world.iloc[0:2, 0:3]
[14]:
country region gdppcap08
0 Albania C&E Europe 7715
1 Algeria Africa 8033

If you want to get a subset on one dimension, but all the entries on the other, just pass a : for the dimension on which you want all the data (just like in numpy). Here are the first two rows and all the columns:

[15]:
world.iloc[0:2, :]
[15]:
country region gdppcap08 polityIV
0 Albania C&E Europe 7715 17.8
1 Algeria Africa 8033 10.0

If you ONLY pass one set of arguments, though, those will be applied to the first dimension (rows), just like in numpy. Thus .iloc[0:2] is the same as .iloc[0:2, :].

[16]:
world.iloc[0:2]
[16]:
country region gdppcap08 polityIV
0 Albania C&E Europe 7715 17.8
1 Algeria Africa 8033 10.0
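iloc also accepts negative positions and lists of positions, just as numpy does. Here is a short sketch on a hand-typed four-row DataFrame (values copied from the first rows of world shown above):

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["Albania", "Algeria", "Angola", "Argentina"],
    "gdppcap08": [7715, 8033, 5899, 14333],
    "polityIV": [17.8, 10.0, 8.0, 18.0],
})

last_row_first_col = toy.iloc[-1, 0]   # 'Argentina'
picked = toy.iloc[[0, 2], [0, 2]]      # rows 0 and 2, columns 0 and 2
all_rows_last_col = toy.iloc[:, -1]    # every row, last column

print(last_row_first_col)
print(picked)
print(all_rows_last_col)
```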

loc

The generalization of .loc from Series to DataFrames works the same as iloc. If you pass two arguments, the first will subset rows (though for .loc, the subsetting is on index values, not row numbers), and the second will subset columns (again, on column names, not column order).

[17]:
# Index value 1, column country
world.loc[1, 'country']
[17]:
'Algeria'

And just like in Series, if you pass a range to .loc, the end points will be included (unlike with most Python functions):

[18]:
world.loc[0:1, 'country']
[18]:
0    Albania
1    Algeria
Name: country, dtype: object

Finally, as with .iloc, if you pass a single argument to .loc, it will subset on the first dimension (rows):

[19]:
world.loc[0:3]
[19]:
country region gdppcap08 polityIV
0 Albania C&E Europe 7715 17.8
1 Algeria Africa 8033 10.0
2 Angola Africa 5899 8.0
3 Argentina S. America 14333 18.0
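The difference between labels and positions is easier to see when the index isn’t made of integers. In this sketch (which assumes, purely for illustration, that we use country names as the index), a .loc slice includes both end points, while the comparable .iloc slice excludes the end point:

```python
import pandas as pd

toy = pd.DataFrame(
    {"gdppcap08": [7715, 8033, 5899, 14333],
     "polityIV": [17.8, 10.0, 8.0, 18.0]},
    index=["Albania", "Algeria", "Angola", "Argentina"],
)

# Label slices include BOTH end points
by_label = toy.loc["Algeria":"Angola", "gdppcap08"]
print(by_label)

# Position slices exclude the end point
by_position = toy.iloc[1:2, 0]
print(by_position)
```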

Logical Tests

Subsetting with logical tests also works in a familiar manner for DataFrames:

  • If you pass a single boolean array to .loc, it will subset on rows.

  • If the boolean array has an Index (i.e. if it’s a Series), then alignment will take place on index values

  • If the boolean array does NOT have an index (i.e. it’s a list of booleans), then alignment will take place on row order.

  • To subset columns based on a test, you have to use .loc[:, YOUR_TEST_HERE].

To illustrate, let’s start by shuffling our DataFrame so that index values and row numbers aren’t the same:

[20]:
world = world.sort_values('gdppcap08')
world.head()
[20]:
country region gdppcap08 polityIV
144 Zimbabwe Africa 188 6.0
29 Congo Kinshasa Africa 321 15.0
76 Liberia Africa 388 10.0
53 Guinea-Bissau Africa 538 11.0
40 Eritrea Africa 632 3.0
[21]:
# Test with an index -> subset rows, align on index
relatively_democratic = world.loc[world['polityIV'] > 10]
relatively_democratic.head()
[21]:
country region gdppcap08 polityIV
29 Congo Kinshasa Africa 321 15.000000
53 Guinea-Bissau Africa 538 11.000000
96 Niger Africa 684 15.333333
22 Central African Republic Africa 736 10.200000
113 Sierra Leone Africa 766 15.000000
[22]:
# Subset using list of booleans. Note that
# if the boolean list or boolean array passed to `.loc`
# is not as long as the dataset, you get an exception.

# This is different than how R
# cycles a boolean over the data.

# Note: This behavior *just* changed in the latest pandas release.

relatively_democratic = relatively_democratic.loc[[True, False, True]]
relatively_democratic
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-22-fed8763de0a1> in <module>
      8 # Note: This behavior *just* changed in the latest pandas release.
      9
---> 10 relatively_democratic = relatively_democratic.loc[[True, False, True]]
     11 relatively_democratic

~/miniconda3/lib/python3.7/site-packages/pandas/core/indexing.py in __getitem__(self, key)
   1408
   1409             maybe_callable = com.apply_if_callable(key, self.obj)
-> 1410             return self._getitem_axis(maybe_callable, axis=axis)
   1411
   1412     def _is_scalar_access(self, key: Tuple):

~/miniconda3/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1772             return self._get_slice_axis(key, axis=axis)
   1773         elif com.is_bool_indexer(key):
-> 1774             return self._getbool_axis(key, axis=axis)
   1775         elif is_list_like_indexer(key):
   1776

~/miniconda3/lib/python3.7/site-packages/pandas/core/indexing.py in _getbool_axis(self, key, axis)
   1422         # caller is responsible for ensuring non-None axis
   1423         labels = self.obj._get_axis(axis)
-> 1424         key = check_bool_indexer(labels, key)
   1425         inds, = key.nonzero()
   1426         try:

~/miniconda3/lib/python3.7/site-packages/pandas/core/indexing.py in check_bool_indexer(index, key)
   2397         if len(result) != len(index):
   2398             raise IndexError(
-> 2399                 "Item wrong length {} instead of {}.".format(len(result), len(index))
   2400             )
   2401

IndexError: Item wrong length 3 instead of 96.

And if we want to subset columns on a boolean (admittedly a silly example, but you get the idea):

[23]:
relatively_democratic = relatively_democratic.loc[:, (world.columns == 'country') | (world.columns == 'gdppcap08')]
relatively_democratic
[23]:
country gdppcap08
29 Congo Kinshasa 321
53 Guinea-Bissau 538
96 Niger 684
22 Central African Republic 736
113 Sierra Leone 766
81 Malawi 837
90 Mozambique 855
80 Madagascar 1049
83 Mali 1128
27 Comoros 1169
127 Tanzania 1263
9 Bangladesh 1334
143 Zambia 1356
49 Ghana 1452
12 Benin 1468
75 Lesotho 1588
69 Kenya 1590
112 Senegal 1772
19 Cambodia 1905
97 Nigeria 2082
35 Djibouti 2140
101 Papua New Guinea 2208
54 Guyana 2542
117 Solomon Islands 2610
95 Nicaragua 2682
87 Moldova 2925
58 India 2972
104 Philippines 3510
88 Mongolia 3566
56 Honduras 3965
... ... ...
31 Croatia 19084
57 Hungary 19330
41 Estonia 20662
115 Slovakia 22081
106 Portugal 23074
33 Czech Republic 24712
94 New Zealand 27029
63 Israel 27548
116 Slovenia 27605
70 Korea South 27939
50 Greece 29361
64 Italy 30756
125 Taiwan 30881
119 Spain 31954
44 France 34045
66 Japan 34099
11 Belgium 34493
43 Finland 35427
136 United Kingdom 35445
48 Germany 35613
5 Australia 35677
21 Canada 36444
34 Denmark 36607
123 Sweden 37383
6 Austria 38152
93 Netherlands 40849
124 Switzerland 42536
62 Ireland 44200
137 United States 46716
98 Norway 58138

96 rows × 2 columns
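To make the alignment rules concrete, here is a small sketch using three hand-typed rows from the sorted world data. The mask Series is deliberately built with its index in a different order than the DataFrame to show that a boolean Series is matched on index values, while a plain list is matched on row order. The last line shows one way to combine two tests with &.

```python
import pandas as pd

toy = pd.DataFrame(
    {"country": ["Zimbabwe", "Liberia", "Brazil"],
     "gdppcap08": [188, 388, 10296],
     "polityIV": [6.0, 10.0, 18.0]},
    index=[144, 76, 16],
)

# Boolean Series whose index is in a different order than toy's index
mask = pd.Series([True, False, False], index=[16, 76, 144])
aligned = toy.loc[mask]                   # matched on index values -> Brazil

# Plain list of booleans: matched on row order -> Zimbabwe
by_order = toy.loc[[True, False, False]]

# Combine tests with & (and) or | (or); each test needs its own parentheses
both = toy.loc[(toy["gdppcap08"] > 300) & (toy["polityIV"] > 10)]

print(aligned)
print(by_order)
print(both)
```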

[] Square brackets

As with Series, single square brackets in pandas change their behavior depending on the values you pass them. Again, it is worth emphasizing that there is nothing you can do with square brackets that you can’t do with .loc and .iloc, so if they seem too strange, you don’t have to use them.

With that said, as summarized below, [] is actually much safer on DataFrames than on Series.

The rules of [] in DataFrames are:

  • If your entry is a single column name, or a list of column names, it will return those columns.

  • If your entry is a slice, it will work like iloc and select rows based on row order.

  • If your entry is a boolean array, and of exactly the same length as the number of rows in your data, it will subset rows.

    • Note that, as in the .loc example above, a boolean array that is too short raises an exception; pandas will not assume that any row without an entry in the boolean array should be dropped.

[24]:
# Select one column
world['country'].head()
[24]:
144          Zimbabwe
29     Congo Kinshasa
76            Liberia
53      Guinea-Bissau
40            Eritrea
Name: country, dtype: object
[25]:
# Select multiple columns
world[['country', 'gdppcap08']].head()
[25]:
country gdppcap08
144 Zimbabwe 188
29 Congo Kinshasa 321
76 Liberia 388
53 Guinea-Bissau 538
40 Eritrea 632
[26]:
# Boolean test
world[world['gdppcap08'] > 10000].head()
[26]:
country region gdppcap08 polityIV
79 Macedonia C&E Europe 10041 19.0
118 South Africa Africa 10109 19.0
16 Brazil S. America 10296 18.0
30 Costa Rica S. America 11241 20.0
68 Kazakhstan C&E Europe 11315 4.0
[27]:
# Slice of rows
world[0:3]
[27]:
country region gdppcap08 polityIV
144 Zimbabwe Africa 188 6.0
29 Congo Kinshasa Africa 321 15.0
76 Liberia Africa 388 10.0

My advice on using [] on DataFrames: in short, [] is much safer on DataFrames because the situation where [] might subset on index labels (if your index labels are integers) or it might subset on row order (if your index labels are not integers) doesn’t exist. Moreover, selecting a single column is extremely common, and this is a case where I use single square brackets all the time.

In a Series, if I pass 0, it’s always unclear whether that’s going to get me the first row (row-order-based) or the row with index value 0 (if I have integer index values). On a DataFrame, a single entry or list of entries will only attempt to match column names, and if that fails, it throws an exception rather than defaulting to acting like .iloc:

[28]:
world[0]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2889             try:
-> 2890                 return self._engine.get_loc(key)
   2891             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-28-4b0f3501794b> in <module>
----> 1 world[0]

~/miniconda3/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2973             if self.columns.nlevels > 1:
   2974                 return self._getitem_multilevel(key)
-> 2975             indexer = self.columns.get_loc(key)
   2976             if is_integer(indexer):
   2977                 indexer = [indexer]

~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2890                 return self._engine.get_loc(key)
   2891             except KeyError:
-> 2892                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2893         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2894         if indexer.ndim > 1 or indexer.size > 1:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0

Similarly, boolean subsetting always acts like you’re using .loc (aligning on index values where it can, row order if it can’t), and slices in [] always get behavior like .iloc, making behavior much more predictable.
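The rules for [] can be summarized in one runnable sketch. It uses three hand-typed rows from the sorted world data so that the index values and the row order differ; the try/except is only there so the example can finish running after the failed lookup.

```python
import pandas as pd

toy = pd.DataFrame(
    {"country": ["Zimbabwe", "Liberia", "Brazil"],
     "gdppcap08": [188, 388, 10296]},
    index=[144, 76, 16],
)

one_column = toy["country"]                    # column name -> Series
two_columns = toy[["country", "gdppcap08"]]    # list of names -> DataFrame
first_two_rows = toy[0:2]                      # slice -> rows by row order
rich = toy[toy["gdppcap08"] > 10000]           # boolean Series -> rows

# A bare integer is treated as a column name, so this lookup fails
try:
    toy[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(first_two_rows)
print(rich)
print(lookup_failed)
```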

4. Getting Columns with Dot-Notation

In addition to passing the name of a column into .loc or to [], columns can also sometimes be accessed using dot-notation:

[29]:
world.country.head()
[29]:
144          Zimbabwe
29     Congo Kinshasa
76            Liberia
53      Guinea-Bissau
40            Eritrea
Name: country, dtype: object

This method of getting columns is very easy and intuitive (given how often we use dot-notation in Python more broadly), but it has a few significant pitfalls:

  • Only works for column names without spaces or punctuation

  • You can’t pass a variable to dot-notation; you have to write out the column name explicitly (so you can’t write generalized code).

  • Only works if the column name isn’t the same as an existing method (e.g. df.count will call the count method, even if you have a column named “count”)

  • Causes big problems if you try to put it on the left side of the equals sign.

Of these, the reasons for the first and second aren’t complicated, but the third and fourth concerns bear exploring.
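For completeness, the first two pitfalls can be sketched quickly. The column name "gdp per capita" below is made up for illustration (the lesson’s dataset has no such column); the point is only that square brackets handle both cases where dot-notation can’t.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["Brazil", "Germany"],
    "gdp per capita": [10296, 35613],   # hypothetical name with spaces
})

# `toy.gdp per capita` is not valid Python, so brackets are the only option
with_spaces = toy["gdp per capita"]

# A column name stored in a variable works with [] ...
col = "country"
from_variable = toy[col]

# ... but toy.col looks for a column literally named "col"
try:
    toy.col
    attr_failed = False
except AttributeError:
    attr_failed = True

print(with_spaces)
print(from_variable)
print(attr_failed)
```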

Suppose we added a column to our data called rank that gave each country’s GDP rank (this code is a little convoluted because there is an easier way to do this, but this works):

[30]:
world = world.sort_values('gdppcap08')
world['rank'] = range(0,len(world))
world.head()
[30]:
country region gdppcap08 polityIV rank
144 Zimbabwe Africa 188 6.0 0
29 Congo Kinshasa Africa 321 15.0 1
76 Liberia Africa 388 10.0 2
53 Guinea-Bissau Africa 538 11.0 3
40 Eritrea Africa 632 3.0 4

But if we try to access the rank column with dot-notation, we don’t get that column, we get the method rank:

[31]:
world.rank
[31]:
<bound method NDFrame.rank of             country        region  gdppcap08  polityIV  rank
144        Zimbabwe        Africa        188       6.0     0
29   Congo Kinshasa        Africa        321      15.0     1
76          Liberia        Africa        388      10.0     2
53    Guinea-Bissau        Africa        538      11.0     3
40          Eritrea        Africa        632       3.0     4
..              ...           ...        ...       ...   ...
62          Ireland     W. Europe      44200      20.0   140
137   United States    N. America      46716      20.0   141
114       Singapore  Asia-Pacific      49284       8.0   142
98           Norway   Scandinavia      58138      20.0   143
107           Qatar   Middle East      85868       0.0   144

[145 rows x 5 columns]>

Now if you hit this problem on the right side of an assignment operator, you’ll usually get an exception and will know you have a problem. On the left side, though, the failure is silent. Suppose you want to move up everyone’s rank by 1:

[32]:
world.rank = world['rank'] + 1
[33]:
world.head()
[33]:
country region gdppcap08 polityIV rank
144 Zimbabwe Africa 188 6.0 0
29 Congo Kinshasa Africa 321 15.0 1
76 Liberia Africa 388 10.0 2
53 Guinea-Bissau Africa 538 11.0 3
40 Eritrea Africa 632 3.0 4

It fails silently because what you’ve actually done is overwrite the method rank with the column rank plus 1. Not only has your rank column not changed (see, it still starts with 0), but you’ve also broken the rank method:

[34]:
world.rank()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-34-685de3d339fd> in <module>
----> 1 world.rank()

TypeError: 'Series' object is not callable

When you try to assign values using dot-notation, you also get into trouble if you try to create a new column. For example:

[35]:
world.rank_doubled = range(0,2*len(world), 2)
world.head()
/Users/Nick/miniconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
  """Entry point for launching an IPython kernel.
[35]:
country region gdppcap08 polityIV rank
144 Zimbabwe Africa 188 6.0 0
29 Congo Kinshasa Africa 321 15.0 1
76 Liberia Africa 388 10.0 2
53 Guinea-Bissau Africa 538 11.0 3
40 Eritrea Africa 632 3.0 4

See how rank_doubled wasn’t added to your DataFrame? It just disappears. pandas does now raise a warning, but warnings don’t stop your code from running, so if you miss it, you can corrupt your data.

My advice on dot-notation:

  • Never, just never use dot-notation on the left-side of the assignment operator. It’s just begging for trouble.

  • Try not to use it on the right side of the assignment operator. It’s safer than using it on the left side of the assignment operator, but none of us will ever memorize all the names of methods in pandas, and if your column happens to have the same name as a method, you may not notice the error.
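In practice, this advice amounts to using square brackets for assignment. A minimal sketch on a hand-typed DataFrame (three rows, with a rank column like the one created above) shows that brackets both update an existing column and create a new one, while leaving the rank method untouched:

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["Zimbabwe", "Liberia", "Brazil"],
    "rank": [0, 1, 2],
})

# Update an existing column, even one that shares its name with a method
toy["rank"] = toy["rank"] + 1

# Create a new column
toy["rank_doubled"] = toy["rank"] * 2

# The rank method is still intact
still_callable = callable(toy.rank)

print(toy)
print(still_callable)
```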

5. DataFrames: Collection of Series

While it is natural to think of a DataFrame as a single table (like a numpy matrix), in reality a DataFrame is just a collection of Series.

To see this, let’s pull out an individual column using square bracket notation, and check its type:

[36]:
type(world['country'])
[36]:
pandas.core.series.Series

Tada!

And that means that you can always pull out a column from a DataFrame and manipulate it using the tools you’ve already learned from the Series tutorial. And because you know how to extract the numpy array that underlies a Series, that means you also always know how to move from DataFrames to numpy arrays if you need to.
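As a quick sketch of that round trip (using .to_numpy(), which is one of a few ways to get at the underlying array and may not be the one used in the Series lesson), here is a column treated as a Series and then converted to numpy, along with the whole DataFrame converted to a 2-D array:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "gdppcap08": [188, 388, 10296],
    "polityIV": [6.0, 10.0, 18.0],
})

# Series tools work on a single column
mean_gdp = toy["gdppcap08"].mean()

# One column -> 1-D numpy array
gdp_array = toy["gdppcap08"].to_numpy()

# Whole DataFrame -> 2-D numpy array (one row per observation)
matrix = toy.to_numpy()

print(mean_gdp)
print(gdp_array)
print(matrix.shape)
```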

Selecting Series versus Selecting DataFrames

There is one point of nuance worth exploring: when you extract a single column from a DataFrame, you have the choice of either extracting a Series, or extracting a DataFrame with a single column. What determines this is whether you use one pair of square brackets, or two.

If you use a single set of square brackets (or pass just the name of a column to loc), you get back a Series. But if you pass a list with the column name, you get back a DataFrame:

[37]:
type(world['country'])
[37]:
pandas.core.series.Series
[38]:
type(world[['country']])
[38]:
pandas.core.frame.DataFrame

Note that this is the opposite behavior of R, where double brackets get you a Vector, and single brackets get you a data.frame!

This also holds for rows, by the way. If you ask for a single row, you will actually get back a (newly constructed) Series:

[39]:
type(world.iloc[3])
[39]:
pandas.core.series.Series

(Obviously, if you ask for more than one row, or more than one column, you will always get back a DataFrame, since the object you’re requesting is intrinsically 2-dimensional and can’t be represented as a Series. )
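The same scalar-versus-list pattern applies to .loc and .iloc. Here is a short sketch on a hand-typed three-row DataFrame: passing a single label or position returns a Series, while wrapping it in a list returns a DataFrame with one column or one row.

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["Zimbabwe", "Liberia", "Brazil"],
    "gdppcap08": [188, 388, 10296],
})

col_series = toy.loc[:, "country"]     # single label -> Series
col_frame = toy.loc[:, ["country"]]    # list -> one-column DataFrame
row_series = toy.iloc[1]               # single position -> Series
row_frame = toy.iloc[[1]]              # list -> one-row DataFrame

print(type(col_series), type(col_frame))
print(type(row_series), type(row_frame))
```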

6. Exercises!

If you are enrolled in Practical Data Science at Duke, don’t do these exercises on your own – we’ll do them in class!

DataFrame Exercises