Pandas Lesson 1: Series

This tutorial introduces the fundamental building block of pandas: Series. By the end of this section, you will learn how to create different types of Series, subset them, modify them, and summarize them.

1. What is a Series?

In the simplest terms, a Series is an ordered collection of values, generally all of the same type. For example, you can have a Series that contains the ages of everyone in your class (a numeric Series), or a Series of all the names of people in your family (a string Series).

This may sound familiar: isn’t that how we described numpy vectors (i.e. one-dimensional numpy arrays)? Yes! In fact, Series are basically one-dimensional numpy arrays with lots of extra features added on top of them. As we’ll see, most everything you could do with a numpy array you can do with a Series; Series can just do more.

Series are central to pandas because pandas was designed for statistics, and Series are a perfect way to collect lots of different observations of a variable.

There are lots of ways to create Series, but the easiest is to just pass a list or an array to the pd.Series constructor.

To illustrate, let me tell you about a week at the zoo I wish I owned. Here’s what attendance looked like at my zoo last week:

Day of Week    Attendees
Monday         132 people
Tuesday        94 people
Wednesday      112 people
Thursday       84 people
Friday         254 people
Saturday       322 people
Sunday         472 people

Let’s make a Series for this attendance pattern:

[1]:
import pandas as pd # We have to import pandas to use Series!

attendance = pd.Series([132, 94, 112, 84, 254, 322, 472])
attendance
[1]:
0    132
1     94
2    112
3     84
4    254
5    322
6    472
dtype: int64
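
As a small sketch of the “list or an array” point above, the same constructor also accepts a numpy array. The variable names here (counts, from_list, from_array) are just illustrative and are not used elsewhere in this lesson.

```python
import numpy as np
import pandas as pd

counts = [132, 94, 112, 84, 254, 322, 472]

# Build the same Series twice: once from a plain list, once from a numpy array
from_list = pd.Series(counts)
from_array = pd.Series(np.array(counts))

# Both hold the same values in the same order, with the same default index
print((from_list == from_array).all())
print(list(from_array.index))
```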

Indices

One of the fundamental differences between numpy arrays and Series is that all Series are associated with an index. An index is a set of labels, one for each observation in a Series. If you don’t specify an index when you create a Series, pandas will create a default index that labels each row with its initial row number, but you can specify an index if you want.

In this case, for example, we know that these entries are associated with different days of the week, so let’s specify an index for our attendance Series:

[2]:
attendance = pd.Series([132, 94, 112, 84, 254, 322, 472],
                       index=['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                              'Friday', 'Saturday', 'Sunday'])
attendance
[2]:
Monday       132
Tuesday       94
Wednesday    112
Thursday      84
Friday       254
Saturday     322
Sunday       472
dtype: int64

Now, as we can see, the rows are labeled with days of the week on the left side, rather than with initial row numbers.

Note that you can always access a Series’ index with the .index property:

[3]:
attendance.index
[3]:
Index(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday',
       'Sunday'],
      dtype='object')

An important property of index labels is that they stay with each row, even if you sort your data. So if I sort my Series by attendance, not only will rows re-order, but so will the index labels:

[4]:
attendance = attendance.sort_values()
attendance
[4]:
Thursday      84
Tuesday       94
Wednesday    112
Monday       132
Friday       254
Saturday     322
Sunday       472
dtype: int64

Note: This seems intuitive with days-of-the-week as our index labels, but it can be confusing when your index starts out as row numbers. For example, if we had not changed our index to be days of the week, then the default index labels would look like they were just row numbers. But if we then sort the Series, the labels will shuffle along with their rows, and they will no longer correspond to row numbers:

[5]:
attendance = pd.Series([132, 94, 112, 84, 254, 322, 472])
attendance
[5]:
0    132
1     94
2    112
3     84
4    254
5    322
6    472
dtype: int64
[6]:
attendance = attendance.sort_values()
attendance
[6]:
3     84
1     94
2    112
0    132
4    254
5    322
6    472
dtype: int64
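
Here’s a minimal sketch of that point: after sorting, the index labels are no longer in counting order, and if you actually want fresh row numbers you have to ask for them. The use of reset_index(drop=True) is my suggestion rather than something this lesson depends on, and the names sorted_attendance and renumbered are made up for the example.

```python
import pandas as pd

attendance = pd.Series([132, 94, 112, 84, 254, 322, 472])
sorted_attendance = attendance.sort_values()

# The labels travelled with their rows, so they are now out of order
print(list(sorted_attendance.index))

# reset_index(drop=True) throws away the old labels and renumbers from 0
renumbered = sorted_attendance.reset_index(drop=True)
print(list(renumbered.index))
```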

2. Subsetting Series

Extracting a subset of elements from a Series is an extremely important task, not least because it generalizes nicely to working with bigger datasets (which are at the heart of data science). This process — whether applied to a Series or a dataset — is often referred to as “taking a subset”, “subsetting”, or “filtering”. If there is one skill you need to master as quickly as possible, it’s this.

In pandas, there are three ways to filter a Series: using a separate logical Series, using row-number indexing, and using index labels. I tend to use the first method most, but all three are useful. The first and second of these you will recognize from numpy arrays, while the last one (since it uses index labels, which only exist in pandas) is unique to pandas.

Subsetting using row-number indexing

One way to subset a Series is to specify the row numbers you want to keep using the iloc indexer. (iloc stands for “integer location”, since row numbers are always integers.) This will give you the behavior you’re more familiar with from R or numpy. Just remember that, as in all of Python, the first row is numbered 0!

[7]:
fruits = pd.Series(["apple", "banana"])
fruits.iloc[0]
[7]:
'apple'

You can also subset with lists of rows, or ranges, just like in numpy:

[8]:
fruits.iloc[[0, 1]]
[8]:
0     apple
1    banana
dtype: object
[9]:
fruits.iloc[0:2]
[9]:
0     apple
1    banana
dtype: object
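
To make the “just like in numpy” comparison a bit more concrete, here is a short sketch using the zoo attendance data. It assumes the days-of-the-week index from earlier, and the names last_day, weekend, and midweek are just for illustration.

```python
import pandas as pd

attendance = pd.Series([132, 94, 112, 84, 254, 322, 472],
                       index=['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                              'Friday', 'Saturday', 'Sunday'])

# Negative row numbers count back from the end, as in numpy
last_day = attendance.iloc[-1]

# A range with no end point runs through the last row
weekend = attendance.iloc[-2:]

# A range stops just before its end point: rows 1, 2 and 3 only
midweek = attendance.iloc[1:4]

print(last_day)
print(weekend)
print(midweek)
```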

Subsetting using index values

We can also subset our rows using the index values associated with each row, using the loc indexer.

[10]:
attendance = pd.Series([132, 94, 112, 84, 254, 322, 472],
                       index=['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                              'Friday', 'Saturday', 'Sunday'])
[11]:
attendance.loc["Monday"]
[11]:
132

You can also ask for ranges of index labels. Note that unlike integer ranges (like the 0:2 we used above to get rows 0 and 1), index label ranges include the last item in the range. So, for example, if I ask for .loc["Monday":"Friday"], I will get Friday included, even though .iloc[0:2] doesn’t include row 2.

[12]:
attendance.loc["Monday":"Friday"]
[12]:
Monday       132
Tuesday       94
Wednesday    112
Thursday      84
Friday       254
dtype: int64
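
The difference in how the two kinds of ranges treat their end point is easy to check side by side. This is only a sketch; by_position and by_label are names I made up for the comparison.

```python
import pandas as pd

attendance = pd.Series([132, 94, 112, 84, 254, 322, 472],
                       index=['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                              'Friday', 'Saturday', 'Sunday'])

# Row-number range: rows 0 through 3, row 4 (Friday) is left out
by_position = attendance.iloc[0:4]

# Label range: Monday through Friday, with Friday included
by_label = attendance.loc["Monday":"Friday"]

print(len(by_position), len(by_label))
```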

Subsetting with logicals

Let’s jump right into an example, using our Zoo attendance Series:

[13]:
attendance = pd.Series([132, 94, 112, 84, 254, 322, 472],
                       index=['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                              'Friday', 'Saturday', 'Sunday'])
attendance
[13]:
Monday       132
Tuesday       94
Wednesday    112
Thursday      84
Friday       254
Saturday     322
Sunday       472
dtype: int64

Suppose we want to only get days with more than 100 people attending. We can subset our Series by using a simple test to build a Series of booleans (True and False values), then asking pandas for the rows of our Series for which the entry in our test Series is True:

[14]:
was_busy = attendance > 100
was_busy
[14]:
Monday        True
Tuesday      False
Wednesday     True
Thursday     False
Friday        True
Saturday      True
Sunday        True
dtype: bool
[15]:
busy_days = attendance.loc[was_busy]
busy_days
[15]:
Monday       132
Wednesday    112
Friday       254
Saturday     322
Sunday       472
dtype: int64
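
The two steps above can also be written in one line, since the test itself produces the boolean Series. As a small extra, summing a boolean Series counts its True values. This is a sketch, and n_busy is a name I introduced for the example.

```python
import pandas as pd

attendance = pd.Series([132, 94, 112, 84, 254, 322, 472],
                       index=['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                              'Friday', 'Saturday', 'Sunday'])

# Build the test and subset in a single step
busy_days = attendance.loc[attendance > 100]

# True counts as 1 and False as 0, so the sum is the number of busy days
n_busy = (attendance > 100).sum()

print(busy_days)
print(n_busy)
```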

There is one really important distinction between how subsetting works in pandas and most other languages though, which has to do with indices. Suppose we want to subset a Series with fruits to only get the entry “apple”. We could do the following:

[16]:
fruits = pd.Series(["apple", "banana"])
apple_selector = pd.Series([True, False])
fruits.loc[apple_selector]
[16]:
0    apple
dtype: object

This looks familiar from numpy, but:

A very important difference between pandas and other languages and libraries (like R and numpy) is that when a logical Series is passed into loc, evaluation is done not on the basis of the order of entries, but on the basis of index values. In the case above, because we did not specify indices for either fruits or apple_selector, they both got the usual default index values of their initial row numbers. But let’s see what happens if we change their indices so they don’t match their order:

[17]:
fruits # We can leave fruits as they are
[17]:
0     apple
1    banana
dtype: object
[18]:
apple_selector = pd.Series([True, False], index=[1, 0])
apple_selector
[18]:
1     True
0    False
dtype: bool

Note that we’ve flipped the index order for apple_selector: the first row has index value 1, and the second row has index value 0. Now watch what happens when we pass apple_selector to loc:

[19]:
fruits.loc[apple_selector]
[19]:
1    banana
dtype: object

We get banana! That’s because in apple_selector, the index value associated with the True entry is 1, and the row of fruits that has index value 1 is banana, even though they are in different rows. This is called index alignment, and it is absolutely crucial to keep in mind while using pandas.

But note this only happens if your boolean array is a Series (and thus has an index). If you pass a numpy boolean array or a list of booleans (neither of which has any concept of an index), then despite using loc, matching will be based on row numbers, not index values (because there are no index values to align).

[20]:
fruits.loc[[True, False]]
[20]:
0    apple
dtype: object

UGH I know. If I wrote pandas, this would not work and this would throw an exception. But that’s how it is.
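
One practical consequence, sketched below: if you have a boolean Series but want it matched by row order rather than by index value, you can strip its index first. Using .to_numpy() for this is my suggestion, not something this lesson relies on, and the names aligned and positional are invented for the example.

```python
import pandas as pd

fruits = pd.Series(["apple", "banana"])
apple_selector = pd.Series([True, False], index=[1, 0])

# A boolean Series is matched on index values: True sits at label 1
aligned = fruits.loc[apple_selector]

# A plain numpy array has no index, so it is matched on row order
positional = fruits.loc[apple_selector.to_numpy()]

print(list(aligned))
print(list(positional))
```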

While this distinction between row numbers and index values is important, though, it comes up less often than you’d think. That’s because we usually don’t subset by feeding in a new Series of booleans we made by hand; instead we build a new Series by executing a test on the Series we’re using. And when we do that, the new Series of booleans will have the same index as the old Series, so they align naturally. Look back at our example of was_busy: you’ll see that it automatically got the same index as our original Series, attendance. As a result, the first row of our boolean Series will have the same index value as the first row of our original Series, the second row of our boolean Series will have the same index value as the second row of our original Series, and so on, so there’s no difference between matching on row order and matching on index value. But it does occasionally come up (like if you tried to re-sort one of these), so keep it in mind!
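
Here is a sketch of the re-sorting case mentioned above. Because the boolean Series still carries its index labels, shuffling its rows does not change which rows of attendance get picked. The name shuffled is just for illustration.

```python
import pandas as pd

attendance = pd.Series([132, 94, 112, 84, 254, 322, 472],
                       index=['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                              'Friday', 'Saturday', 'Sunday'])
was_busy = attendance > 100

# Re-sort the booleans: the False rows move to the top, labels come along
shuffled = was_busy.sort_values()

# Matching is on index values, so the selection is unchanged
from_shuffled = attendance.loc[shuffled]
from_original = attendance.loc[was_busy]

print(from_shuffled.equals(from_original))
```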

Single Square Brackets ([])

As discussed above, because Series have both an order of rows and labels for each row, you should always think carefully about which method of subsetting you are invoking. My advice: always use the .loc (for index labels) and .iloc (for row numbers) selectors. If you use these, the only surprising behavior to watch out for is that loc will match on row numbers if you pass a list or array of booleans with no index. But since you can’t align on an index in that case, there’s no alternative behavior you would be expecting in that situation.

However, there is another way to subset Series that is a little… stranger. In an effort to be easier for users, pandas allows subsetting using just square brackets (without a .loc or .iloc). With just square brackets, pandas will do different things depending on what you put in the square brackets. In theory this should always “do what you want it to do”, but in my experience it’s a recipe for errors. With that in mind, I don’t suggest using it, but I will detail how it works here so you know. If this makes your head swim, just remember: you can always use loc and iloc. Single square brackets do not offer any functionality you can’t get with .loc or .iloc.

So, if you pass index values into square brackets, pandas will subset based on index values (as though you were using .loc).

[21]:
attendance
[21]:
Monday       132
Tuesday       94
Wednesday    112
Thursday      84
Friday       254
Saturday     322
Sunday       472
dtype: int64
[22]:
attendance['Sunday']
[22]:
472

Similarly, if you pass booleans to square brackets, then pandas will behave like you are using .loc as well:

[23]:
attendance[attendance > 100]
[23]:
Monday       132
Wednesday    112
Friday       254
Saturday     322
Sunday       472
dtype: int64

(If it’s not clear to you why attendance[attendance > 100] is a test with an index: Python first evaluates attendance > 100. That generates a new Series of booleans with the same index as attendance. Then Python evaluates the attendance[] part of the problem.)

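BUT: if your Series index is not integer based, and if you pass integers into the square brackets, it will act like you’re using iloc. (Recent pandas releases, 2.1 and later, emit a FutureWarning for this fallback and say that integer keys will eventually always be treated as labels, which is one more reason to prefer .iloc.)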

[24]:
attendance[0]
[24]:
132

Most of the time, this works out. But you can get confused if you are working with a Series that has a numeric index. If you pass an integer into [], and you have an index of integers, then [0] will be treated like you’re typing .loc[0], not .iloc[0]:

[25]:
series_w_numeric_index = pd.Series(["dog", "cat", "fish"], index=[2, 1, 0])
series_w_numeric_index
[25]:
2     dog
1     cat
0    fish
dtype: object
[26]:
series_w_numeric_index[0]
[26]:
'fish'

So personally, I try to always use loc or iloc to avoid this kind of confusion. But if you do use [] on its own, just be very careful that you don’t inadvertently select rows based on index values when you think you’re selecting on row numbers.
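
Being explicit removes the ambiguity. A quick sketch with the same Series as above (by_label and by_position are names I made up):

```python
import pandas as pd

series_w_numeric_index = pd.Series(["dog", "cat", "fish"], index=[2, 1, 0])

# .loc[0] asks for the row whose index label is 0
by_label = series_w_numeric_index.loc[0]

# .iloc[0] asks for the first row, whatever its label is
by_position = series_w_numeric_index.iloc[0]

print(by_label, by_position)
```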

Types of Series

Before we dive too far into Series manipulations, it’s important to talk about datatypes. Every Series, as we will see, has a “dtype” (short for datatype). The dtype of a Series is important to understand because a Series’ dtype determines what manipulations you can apply to that series.

There are, broadly, two types of Series:

  • Numeric: these hold numbers that pandas understands are numbers. Specific numeric datatypes include things like int64, and int32 (integers), or float64 and float32 (floating point numbers).
  • Object: these are Series that can hold any Python object, like strings, numbers, Sets, you name it. They have dtype O for “objects”. They are flexible, but also very slow and actually harder to work with.

Numeric Series are by far the easiest to work with, and are generally either integers (int64, int32, etc.) or floating point numbers (float64, float32). We’ll talk more about the differences between these data types later, but for the moment it’s enough to know that integer Series (datatypes that start with int) can only hold… well, integers (whole numbers), while floating point Series (datatypes that start with float) can hold integers, numbers with decimal points, and even missing values.

The numbers at the end of these types (64, 32, etc.) have to do with how many actual bits of data are allocated to each number, something we’ll discuss later in the course. For the moment, the differences between them don’t matter, and in general you’ll likely always see (and should use) the 64 suffix.

You can check the dtype of a Series by typing .dtype. For example, here are some different kinds of Series:

[27]:
s = pd.Series([1, 2, 3])
s.dtype
[27]:
dtype('int64')
[28]:
s = pd.Series([1, 2, 3.14])
s.dtype
[28]:
dtype('float64')
[29]:
s = pd.Series([1, 2, "a string"])
s.dtype
[29]:
dtype('O')

As you can see, integer (int64) Series can only hold integers. If we add one number with a decimal component, the whole thing becomes a float64. Similarly, floating point Series can only hold numbers. If we add a single string, the whole thing becomes an Object (O) type.
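
The earlier remark that floating point Series can hold missing values can be seen the same way. In this sketch I use np.nan as the missing value, and the names ints and with_missing are just for illustration.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])
with_missing = pd.Series([1, 2, np.nan])

# All whole numbers: pandas picks an integer dtype
print(ints.dtype)

# One missing value is enough to make the whole Series floating point
print(with_missing.dtype)
print(with_missing.isna().sum())
```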

Converting datatypes

If you want to change the datatype of a Series, you can do so with the .astype() method… provided a conversion is possible! For example, you can always convert integer Series to floating point Series because a whole number can be represented as a floating point number (just trust me on this for now… we’ll discuss why later!).

[30]:
s = pd.Series([1, 2, 3])
s = s.astype('float64')
s
[30]:
0    1.0
1    2.0
2    3.0
dtype: float64

But be careful: since integers can’t ever hold decimals, if you try and convert a floating point Series to an integer Series, it will just drop the decimal part of numbers with decimals!

[31]:
s = pd.Series([1, 2, 3.14])
s = s.astype('int64')
s
[31]:
0    1
1    2
2    3
dtype: int64

Note that pandas is just doing the same thing regular Python would do:

[32]:
int(3.14)
[32]:
3
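
Note that “dropping the decimal part” is not the same as rounding. If rounding is what you want, one option (my suggestion, not something this lesson requires) is to call .round() before converting. The values below are made up for the sketch.

```python
import pandas as pd

s = pd.Series([1.7, -1.7, 2.2])

# astype just chops off the decimals: 1.7 -> 1 and -1.7 -> -1
truncated = s.astype('int64')

# Rounding first gives the nearest whole number instead
rounded = s.round().astype('int64')

print(list(truncated))
print(list(rounded))
```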

But if you try to convert an “object” Series to numeric and there are values that can’t be converted, pandas will throw an error:

[33]:
s = pd.Series([1, 2, "a string"])
s.astype('float64')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-33-c0312cc33b10> in <module>
      1 s = pd.Series([1, 2, "a string"])
----> 2 s.astype('float64')

~/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors, **kwargs)
   5689             # else, only a single dtype is given
   5690             new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors,
-> 5691                                          **kwargs)
   5692             return self._constructor(new_data).__finalize__(self)
   5693

~/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py in astype(self, dtype, **kwargs)
    529
    530     def astype(self, dtype, **kwargs):
--> 531         return self.apply('astype', dtype=dtype, **kwargs)
    532
    533     def convert(self, **kwargs):

~/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
    393                                             copy=align_copy)
    394
--> 395             applied = getattr(b, f)(**kwargs)
    396             result_blocks = _extend_blocks(applied, result_blocks)
    397

~/anaconda3/lib/python3.7/site-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors, values, **kwargs)
    532     def astype(self, dtype, copy=False, errors='raise', values=None, **kwargs):
    533         return self._astype(dtype, copy=copy, errors=errors, values=values,
--> 534                             **kwargs)
    535
    536     def _astype(self, dtype, copy=False, errors='raise', values=None,

~/anaconda3/lib/python3.7/site-packages/pandas/core/internals/blocks.py in _astype(self, dtype, copy, errors, values, **kwargs)
    631
    632                     # _astype_nansafe works fine with 1-d only
--> 633                     values = astype_nansafe(values.ravel(), dtype, copy=True)
    634
    635                 # TODO(extension)

~/anaconda3/lib/python3.7/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
    700     if copy or is_object_dtype(arr) or is_object_dtype(dtype):
    701         # Explicit copy, or required since NumPy can't view from / to object.
--> 702         return arr.astype(dtype, copy=True)
    703
    704     return arr.view(dtype)

ValueError: could not convert string to float: 'a string'
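
If you would rather not have your code stop on an error like this, here is a sketch of two options. The first simply catches the exception. The second uses pd.to_numeric with errors="coerce", a pandas function not otherwise covered in this lesson, which turns anything it can’t convert into a missing value.

```python
import pandas as pd

s = pd.Series([1, 2, "a string"])

# Option 1: catch the error and decide what to do about it
try:
    s.astype('float64')
except ValueError as err:
    print("conversion failed:", err)

# Option 2: convert what can be converted; the rest becomes NaN (missing)
coerced = pd.to_numeric(s, errors="coerce")
print(coerced)
```

Whether silently turning bad entries into missing values is a good idea depends on your data, so treat the second option with care.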

3. Series Arithmetic

One of the nice things about Series is that, like numpy arrays, we can easily do things like multiply all the values by another number. For example, suppose tickets to my zoo cost $15 per person. What is the total money generated by ticket sales each day? Let’s find out!

[34]:
attendance = pd.Series([132, 94, 112, 84, 254, 322, 472],
                       index=['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                              'Friday', 'Saturday', 'Sunday'])
attendance
[34]:
Monday       132
Tuesday       94
Wednesday    112
Thursday      84
Friday       254
Saturday     322
Sunday       472
dtype: int64
[35]:
revenue = attendance * 15
revenue
[35]:
Monday       1980
Tuesday      1410
Wednesday    1680
Thursday     1260
Friday       3810
Saturday     4830
Sunday       7080
dtype: int64

Now what if we want to know the total amount raised in a week, instead of just the amount on each day? We can use one of pandas’ many helper methods (in this case sum), which adds up all the values of a Series:

[36]:
revenue.sum()
[36]:
22050

Cool!

This is an example of one of the three forms of Series arithmetic:

  1. A Series with more than one element and a Series with only one element.
  2. A Series modified by a function.
  3. Two Series with the same number of elements. When working with two Series, elements are matched based on index values, not row numbers.

But note that the types of things you can do with a Series depend on the Series’ dtype. Math functions, for example, can only be applied to numeric datatypes!
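
Here is a sketch of all three forms in one place. The numbers and names (prices, discounts, and so on) are made up, and I use numpy’s sqrt as the example function.

```python
import numpy as np
import pandas as pd

prices = pd.Series([4, 9, 16], index=["a", "b", "c"])

# 1. A Series and a single value: the value is applied to every element
doubled = prices * 2

# 2. A Series modified by a function: applied element by element
roots = np.sqrt(prices)

# 3. Two Series: matched on index values, not row order.
#    discounts lists its rows in the opposite order on purpose.
discounts = pd.Series([3, 2, 1], index=["c", "b", "a"])
net = prices - discounts

print(doubled)
print(roots)
print(net)
```

In the last case, the entry for "a" is 4 - 1 rather than 4 - 3, because the discount labeled "a" is the one that gets subtracted, even though it sits in the last row of discounts.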

Summarizing with Functions

We often want to get summary statistics from a Series — that is, learn something general about it by looking beyond its constituent elements. If we have a Series in which each element represents a person’s height, we may want to know who the shortest or tallest person is, what the median or mean height is, what the standard deviation is, etc. Here are common summary facts for numeric Series (some also work for object types):

my_numbers = pd.Series([1, 2, 3, 4])

my_numbers.dtype # check the dtype
len(my_numbers) # number of elements
my_numbers.max() # maximum value
my_numbers.min() # minimum value
my_numbers.sum() # sum of all values in the Series
my_numbers.mean() # mean
my_numbers.median() # median
my_numbers.var() # variance
my_numbers.std() # standard deviation
my_numbers.quantile() # return the specified quantile (0.5, the median, if none is specified)
my_numbers.describe() # report many of the summary stats above at once
my_numbers.value_counts() # tabulate all the values; add the argument `normalize=True` to get the share of observations with each value

Of those, two of the most powerful are .describe() (for numeric Series that take on lots of values):

[37]:
my_numbers = pd.Series(range(100))
my_numbers.describe()
[37]:
count    100.000000
mean      49.500000
std       29.011492
min        0.000000
25%       24.750000
50%       49.500000
75%       74.250000
max       99.000000
dtype: float64

and .value_counts() (for numeric Series with only a few unique values):

[38]:
my_numbers = pd.Series([1, 2, 2, 2, 2, 1, 1, -1, -1])
my_numbers.value_counts()
[38]:
 2    4
 1    3
-1    2
dtype: int64

Note that .value_counts() can be combined with the normalize=True argument to get the share of observations that have each unique value, rather than the count:

[39]:
my_numbers.value_counts(normalize=True)
[39]:
 2    0.444444
 1    0.333333
-1    0.222222
dtype: float64

4. Modifying Series Elements

The subsetting logic from above can be used to modify Series. The idea here is that instead of keeping elements that meet a logical condition or occur at a specific index, we can change them. For example, what if we had mis-entered attendance for our zoo? We can fix it using a logical test, row-number indexing (iloc), or by index-value (loc).

[40]:
attendance = pd.Series([132, 94, 112, 84, 254, 322, 472],
                       index=['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                              'Friday', 'Saturday', 'Sunday'])
attendance
[40]:
Monday       132
Tuesday       94
Wednesday    112
Thursday      84
Friday       254
Saturday     322
Sunday       472
dtype: int64

Oops! Turns out Tuesday attendance was 194, not 94! (It was a holiday).

[41]:
# Edit with a test:
attendance[attendance == 94] = 194
[42]:
# Edit with `iloc`:
attendance.iloc[1] = 194
[43]:
# Edit with `loc`:
attendance.loc['Tuesday'] = 194
[44]:
attendance
[44]:
Monday       132
Tuesday      194
Wednesday    112
Thursday      84
Friday       254
Saturday     322
Sunday       472
dtype: int64
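
A logical test can also change several elements at once. In this sketch, every day with fewer than 100 attendees is set to 100. That correction is entirely made up for illustration; it is not part of the zoo story.

```python
import pandas as pd

attendance = pd.Series([132, 94, 112, 84, 254, 322, 472],
                       index=['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                              'Friday', 'Saturday', 'Sunday'])

# Every row where the test is True gets the new value (here, Tuesday and Thursday)
attendance.loc[attendance < 100] = 100

print(attendance)
```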

Element Modification and DataTypes

One of the big differences between Series and numpy arrays is that Series are “dynamically typed”, meaning that if you have an integer Series and you try to add a number like 3.14 (which has a decimal component, and thus cannot be represented as an integer), pandas will just convert your whole Series to floating point numbers so that it can hold 3.14. Similarly, if you try to add a string to a floating point Series, pandas will just convert the whole Series to an Object Series. (Recent pandas releases emit a FutureWarning when an assignment forces this kind of type change, so the exact behavior depends on your pandas version.)

[45]:
attendance.loc['Tuesday'] = 3.14
attendance
[45]:
Monday       132.00
Tuesday        3.14
Wednesday    112.00
Thursday      84.00
Friday       254.00
Saturday     322.00
Sunday       472.00
dtype: float64
[46]:
attendance.loc['Tuesday'] = 'no one showed up on Tuesday! :('
attendance
[46]:
Monday                                   132
Tuesday      no one showed up on Tuesday! :(
Wednesday                                112
Thursday                                  84
Friday                                   254
Saturday                                 322
Sunday                                   472
dtype: object

This is different from numpy, where once an array has a type, it will only change types if you ask numpy to change the type explicitly. If you try to stick 3.14 into an integer array, numpy will just coerce the value 3.14 into an integer by dropping the decimal component:

[47]:
import numpy as np
my_array = np.array([1, 2, 3], dtype='int')
my_array
[47]:
array([1, 2, 3])
[48]:
my_array[0] = 3.14
my_array
[48]:
array([3, 2, 3])
[49]:
my_array.dtype
[49]:
dtype('int64')

And if you try and insert a string, you’ll get an exception:

[50]:
my_array[0] = 'first entry'
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-50-39825a86f88a> in <module>
----> 1 my_array[0] = 'first entry'

ValueError: invalid literal for int() with base 10: 'first entry'
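
If you prefer not to rely on pandas changing a Series’ type for you, one option is to make the conversion explicit with astype before assigning, the way numpy forces you to. This is a sketch of that habit, not a requirement; the shortened Series and the value 94.5 are made up for illustration.

```python
import pandas as pd

attendance = pd.Series([132, 94, 112],
                       index=['Monday', 'Tuesday', 'Wednesday'])

# Convert to floating point first, so the type change is a deliberate step
attendance = attendance.astype('float64')

# Now a value with a decimal component fits without any implicit conversion
attendance.loc['Tuesday'] = 94.5

print(attendance)
```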

5. numpy Under The Hood

As you may recall, at the start of this tutorial I recommended thinking of Series as augmented 1-dimensional numpy arrays. It turns out that’s more than just a metaphor: behind every Series is a numpy array, which you can access with the .values attribute:

[51]:
attendance.values
[51]:
array([132.0, 'no one showed up on Tuesday! :(', 112.0, 84.0, 254.0,
       322.0, 472.0], dtype=object)

This is good to know because every now and then you may find a tool that works with numpy arrays but not pandas. And when that happens, you now know how to pull out the numpy array underlying your Series and use it directly!
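
As a sketch of that workflow, here is a numeric attendance Series handed to a numpy function. I use the .to_numpy() method here, which is another way to get the array out of a Series; the name raw is just for illustration.

```python
import numpy as np
import pandas as pd

attendance = pd.Series([132, 94, 112, 84, 254, 322, 472],
                       index=['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                              'Friday', 'Saturday', 'Sunday'])

# Pull out the underlying numpy array; the index labels are left behind
raw = attendance.to_numpy()
print(type(raw))

# Any function that expects a plain numpy array can now use it
print(np.median(raw))
```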

6. Exercises!

If you are enrolled in Practical Data Science at Duke, don’t do these exercises on your own – we’ll do them in class!

Series Exercises