Pandas Series¶
This tutorial introduces the fundamental building block of pandas, the Series. By the end of this section, you will learn how to create different types of Series, subset them, modify them, and summarize them.
What is a Series?¶
In the simplest terms, a Series is an ordered collection of values, generally all of the same type. For example, you can have a Series that contains the ages of everyone in your class (a numeric Series), or a Series of the names of the people in your family (a string Series).
This may sound familiar: isn’t that how we described numpy vectors (i.e. one-dimensional numpy arrays)? Yes! In fact, Series are basically one-dimensional numpy arrays with lots of extra features added on top of them. As we’ll see, almost everything you can do with a numpy array you can do with a Series; Series can just do more. In particular, the Series provides the option to use an explicit index (rather than only using the row numbers), as well as some querying and analysis tools which we’ll discuss throughout this course.
Series are central to pandas because pandas was designed for statistics, and Series are a perfect way to collect lots of different observations of a variable. We’ll see that the DataFrame, the apex of pandas functionality, is essentially a collection of Series. Understanding Series will help you better understand DataFrames and their value for data science programming.
Creating a series¶
There are lots of ways to create Series, but the easiest is to just pass a list or an array to the pd.Series constructor.
To illustrate, let me tell you about a week at the zoo I wish I owned. Here’s what attendance looked like at my zoo last week:
Day of Week | Attendees |
---|---|
Monday | 132 people |
Tuesday | 94 people |
Wednesday | 112 people |
Thursday | 84 people |
Friday | 254 people |
Saturday | 322 people |
Sunday | 472 people |
Let’s make a Series for this attendance pattern:
[1]:
import pandas as pd # We have to import pandas to use Series!
attendance = pd.Series([132, 94, 112, 84, 254, 322, 472])
attendance
[1]:
0 132
1 94
2 112
3 84
4 254
5 322
6 472
dtype: int64
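As a small sketch of two other ways to build a Series, the block below passes a numpy array and then a Python dict to the constructor. The variable names are made up for illustration, and the dict example assumes you want the keys used as row labels (labels are covered in the next section).

```python
import numpy as np
import pandas as pd

# From a numpy array: the values are kept and a default 0..n-1 index is added
attendance_array = np.array([132, 94, 112, 84, 254, 322, 472])
from_array = pd.Series(attendance_array)

# From a dict: the keys become the row labels, the values become the data
from_dict = pd.Series({"Monday": 132, "Tuesday": 94, "Wednesday": 112})

print(from_array)
print(from_dict)
```

Either form gives you the same kind of object as the list-based example above.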
Indices¶
One of the fundamental differences between numpy arrays and Series is that all Series are associated with an index. An index is a set of labels, one for each observation in a Series. If you don’t specify an index when you create a Series, pandas will create a default index that just labels each row with its initial row number, but you can specify an index if you want. When we explored lists and numpy arrays, we also encountered indices which were used to access individual elements. For arrays of two dimensions or fewer, these were essentially the row numbers. These implicit row-number indices are also available in Series, but you also have the option to explicitly label your rows in ways that can make your data easier to analyze.
In the case below, for example, we know that these entries are associated with different days of the week, so let’s specify an index for our attendance Series:
[2]:
attendance = pd.Series(
[132, 94, 112, 84, 254, 322, 472],
index=[
"Monday",
"Tuesday",
"Wednesday",
"Thursday",
"Friday",
"Saturday",
"Sunday",
],
)
attendance
[2]:
Monday 132
Tuesday 94
Wednesday 112
Thursday 84
Friday 254
Saturday 322
Sunday 472
dtype: int64
As we can see, the rows are now labeled with days of the week on the left side, rather than with initial row numbers.
Note that you can always access a Series’ index with the .index property:
[3]:
attendance.index
[3]:
Index(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday',
'Sunday'],
dtype='object')
An important property of index labels is that they stay with each row, even if you sort your data. So if I sort my Series by attendance, not only will rows re-order, but so will the index labels:
[4]:
attendance = attendance.sort_values()
attendance
[4]:
Thursday 84
Tuesday 94
Wednesday 112
Monday 132
Friday 254
Saturday 322
Sunday 472
dtype: int64
Note: This seems intuitive with days of the week as our index labels, but it can be confusing when your index starts out as row numbers. For example, if we had not changed our index to be days of the week, the default index labels would look just like row numbers. But if we then sort the Series, the labels will shuffle along with their values, and they will no longer correspond to row numbers:
[5]:
attendance = pd.Series([132, 94, 112, 84, 254, 322, 472])
attendance
[5]:
0 132
1 94
2 112
3 84
4 254
5 322
6 472
dtype: int64
[6]:
attendance = attendance.sort_values()
attendance
[6]:
3 84
1 94
2 112
0 132
4 254
5 322
6 472
dtype: int64
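If the shuffled labels get in the way, there are two common ways out, sketched below under the assumption that you either want the original order back or want fresh row numbers. The names by_value, restored, and renumbered are just illustrative.

```python
import pandas as pd

attendance = pd.Series([132, 94, 112, 84, 254, 322, 472])
by_value = attendance.sort_values()

# The labels traveled with the values, so they are no longer 0, 1, 2, ...
print(list(by_value.index))  # [3, 1, 2, 0, 4, 5, 6]

# Sorting by label recovers the original order
restored = by_value.sort_index()

# Alternatively, discard the old labels and number the rows afresh
renumbered = by_value.reset_index(drop=True)
print(renumbered)
```

Which one you want depends on whether the old labels still carry meaning for your analysis.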
Relationship to numpy arrays¶
As you’ve probably noted, there are some similarities between how we construct pandas Series and how we constructed numpy arrays. The biggest difference is the index. In the figure below we can see the code to construct both a numpy array and a pandas Series containing the same entries; with the Series, however, we can also add explicit index labels that may be easier for us to read or to use to access the row contents. For example, if we were representing financial data for three days of a week (‘mon’, ‘tue’, and ‘wed’), we may be able to do this as shown in the figure below. Once we’ve created our pandas Series, we can get a numpy array back from it: if b contains a pandas Series, then b.values returns a numpy array.
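The construction described above can be sketched in code as follows. The numbers are hypothetical placeholders, not real financial data, and to_numpy() is shown only as an alternative accessor available in recent pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical values for three days of the week
a = np.array([1.5, 2.0, 3.2])
b = pd.Series([1.5, 2.0, 3.2], index=["mon", "tue", "wed"])

# Pull the underlying numpy array back out of the Series
values = b.values
also_values = b.to_numpy()  # alternative accessor for the same data

print(type(values))
print(values)
```

The labels are dropped on the way out: the returned array holds only the values.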

While we will discuss how to access the data in these data structures in more detail in the next lesson, we can rather easily access an entry using its row number or its assigned label. There is some nuance in using row numbers: as we saw above with the sort_values() example, the labels of a default row-number index can end up out of order, so the iloc indexer lets us always return the \(i^{th}\) entry regardless of how the Series has been sorted. More on that later, though…
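As a brief preview, and only a sketch of what the next lesson covers properly, the block below contrasts position-based access with iloc against label-based access with loc on the sorted attendance Series.

```python
import pandas as pd

attendance = pd.Series([132, 94, 112, 84, 254, 322, 472]).sort_values()

# iloc counts positions in the current order: the first row after sorting
first_row = attendance.iloc[0]

# loc looks up the label: the row that was originally created as row 0
label_zero = attendance.loc[0]

print(first_row)   # 84
print(label_zero)  # 132
```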
Going forward, we want to be crystal clear about when we’re discussing pandas Series or DataFrames and when we’re talking about numpy arrays. Pictorially, we’ll represent Series and DataFrames as little tables, as shown below with their key characteristics labeled:

Types of Series¶
Before we dive too far into Series manipulations, it’s important to talk about datatypes. Every Series, as we will see, has a “dtype” (short for datatype). The dtype of a Series is important to understand because a Series’ dtype determines what manipulations you can apply to that Series.
There are, broadly, two types of Series:
Numeric: these hold numbers that pandas understands are numbers. Specific numeric datatypes include things like int64 and int32 (integers), or float64 and float32 (floating point numbers).
Object: these are Series that can hold any Python object, like strings, numbers, sets, you name it. They have dtype O for “objects”. They are flexible, but also very slow and actually harder to work with.
Numeric Series are by far the easiest to work with, and are generally either integers (int64, int32, etc.) or floating point numbers (float64, float32). Integer Series (datatypes that start with int) can only hold… well, integers (whole numbers), while floating point Series (datatypes that start with float) can hold integers, numbers with decimal points, and even missing values.
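To see the point about missing values in practice, here is a minimal sketch: the same whole numbers produce an integer Series, but once one entry is missing (written here as None), pandas falls back to a floating point dtype so it can store the gap as NaN.

```python
import pandas as pd

whole = pd.Series([1, 2, 3])
with_missing = pd.Series([1, 2, None])  # None is stored as NaN

print(whole.dtype)                # int64
print(with_missing.dtype)         # float64
print(with_missing.isna().sum())  # 1 missing value
```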
The numbers at the end of these types (64, 32, etc.) have to do with how many actual bits of data are allocated to each number. For the moment, the differences between them don’t matter, and in general you’ll likely always see (and should use) the 64 suffix.
You can check the dtype of a Series by accessing its .dtype attribute. For example, here are some different kinds of Series:
[7]:
s = pd.Series([1, 2, 3])
s.dtype
[7]:
dtype('int64')
[8]:
s = pd.Series([1, 2, 3.14])
s.dtype
[8]:
dtype('float64')
[9]:
s = pd.Series([1, 2, "a string"])
s.dtype
[9]:
dtype('O')
As you can see, integer (int64) Series can only hold integers. If we add one number with a decimal component, the whole thing becomes a float64. Similarly, floating point Series can only hold numbers. If we add a single string, the whole thing becomes an object (O) type.
Converting datatypes¶
If you want to change the datatype of a Series, you can do so with the .astype() method… provided a conversion is possible! For example, you can always convert integer Series to floating point Series because a whole number can be represented as a floating point number (just trust me on this for now… we’ll discuss why later!).
[10]:
s = pd.Series([1, 2, 3])
s = s.astype("float64")
s
[10]:
0 1.0
1 2.0
2 3.0
dtype: float64
But be careful: since integers can’t ever hold decimals, if you try to convert a floating point Series to an integer Series, pandas will just drop the decimal part of any number that has one!
[11]:
s = pd.Series([1, 2, 3.14])
s = s.astype("int64")
s
[11]:
0 1
1 2
2 3
dtype: int64
Note: pandas is just doing the same thing regular Python would do:
[12]:
int(3.14)
[12]:
3
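Because the decimal part is simply discarded, negative numbers move toward zero rather than down. If you want the nearest whole number instead, one option is to round before converting, as in this sketch (the example values are made up):

```python
import pandas as pd

s = pd.Series([-1.7, 2.6, 3.14])

# astype discards the fractional part: -1, 2, 3
truncated = s.astype("int64")

# Rounding first gives the nearest whole number: -2, 3, 3
rounded = s.round().astype("int64")

print(truncated.tolist())
print(rounded.tolist())
```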
But if you try to convert an “object” Series to numeric and there are values that can’t be converted, pandas will throw an error:
s = pd.Series([1, 2, "a string"])
s.astype('float64')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/Users/nce8/github/practicaldatascience/source/10_pandas_series.ipynb Cell 32 in <cell line: 2>()
1 s = pd.Series([1, 2, "a string"])
----> 2 s.astype('float64')
...
File ~/opt/miniconda3/lib/python3.10/site-packages/pandas/core/dtypes/cast.py:1181, in astype_nansafe(arr, dtype, copy, skipna)
1177 raise ValueError(msg)
1179 if copy or is_object_dtype(arr.dtype) or is_object_dtype(dtype):
1180 # Explicit copy, or required since NumPy can't view from / to object.
-> 1181 return arr.astype(dtype, copy=True)
1183 return arr.astype(dtype, copy=copy)
ValueError: could not convert string to float: 'a string'
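If you would rather keep going than stop on an error like this, one alternative (not used elsewhere in this lesson) is pd.to_numeric with errors="coerce", which replaces anything it cannot parse with NaN. A minimal sketch:

```python
import pandas as pd

s = pd.Series([1, 2, "a string"])

# errors="coerce" turns values that cannot be parsed into NaN
converted = pd.to_numeric(s, errors="coerce")

print(converted)
print(converted.dtype)  # float64, because NaN needs a floating point dtype
```

Use this with care: the unparseable values are silently replaced, so it is worth checking how many NaN entries you end up with.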
Computation with Series¶
One of the nice things about Series is that, like numpy arrays, we can easily do things like multiply all the values by another number. For example, suppose tickets to my zoo cost $15 per person. What is the total money generated by ticket sales each day? Let’s find out!
[ ]:
attendance = pd.Series(
[132, 94, 112, 84, 254, 322, 472],
index=[
"Monday",
"Tuesday",
"Wednesday",
"Thursday",
"Friday",
"Saturday",
"Sunday",
],
)
attendance
Monday 132
Tuesday 94
Wednesday 112
Thursday 84
Friday 254
Saturday 322
Sunday 472
dtype: int64
[ ]:
revenue = attendance * 15
revenue
Monday 1980
Tuesday 1410
Wednesday 1680
Thursday 1260
Friday 3810
Saturday 4830
Sunday 7080
dtype: int64
Now what if we want to know the total amount raised in a week, instead of just the amount on each day? We can use one of pandas’s many helper methods (in this case sum), which adds up all the values of a Series:
[ ]:
revenue.sum()
22050
Cool!
This is an example of one of the three forms of Series arithmetic:
Operations between a Series with more than one element and a single value (as shown above).
A Series modified by a function (here, raising each element to a power).
[ ]:
revenue**2
Monday 3920400
Tuesday 1988100
Wednesday 2822400
Thursday 1587600
Friday 14516100
Saturday 23328900
Sunday 50126400
dtype: int64
Two Series with the same number of elements. When working with two Series, elements are matched based on index values, not row numbers.
This third form is a bit trickier, so let’s create a pair of Series to demonstrate:
[ ]:
a = pd.Series(data=[1, 2, 3.2], index=["mon", "tue", "wed"])
a
mon 1.0
tue 2.0
wed 3.2
dtype: float64
[ ]:
b = pd.Series(data=[4.0, 5, 6], index=["tue", "wed", "fri"])
b
tue 4.0
wed 5.0
fri 6.0
dtype: float64
Note that the indices are not the same, although ‘tue’ and ‘wed’ are in both Series. Let’s multiply them together:
[ ]:
c = a * b
c
fri NaN
mon NaN
tue 8.0
wed 16.0
dtype: float64
When two Series with explicit index labels are multiplied together, the operation is performed by matching the index labels. Where a label appears in only one of the Series, the result is a NaN value in its place. Of course, with the default row-number indices, if the data have not been sorted, they multiply element-wise as you would expect:
[ ]:
a = pd.Series(data=[1, 2, 3.2])
a
0 1.0
1 2.0
2 3.2
dtype: float64
[ ]:
b = pd.Series(data=[4.0, 5, 6])
b
0 4.0
1 5.0
2 6.0
dtype: float64
[ ]:
c = a * b
c
0 4.0
1 10.0
2 19.2
dtype: float64
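Returning to the labeled example, where unmatched labels produced NaN: the sketch below shows two ways you might handle those gaps. Both are assumptions about what you want from the result, not the only options. dropna() keeps only the labels found in both Series, while the mul method with fill_value substitutes a stand-in value for a label missing from one side.

```python
import pandas as pd

a = pd.Series([1, 2, 3.2], index=["mon", "tue", "wed"])
b = pd.Series([4.0, 5, 6], index=["tue", "wed", "fri"])

# Keep only the labels present in both Series
matched = (a * b).dropna()

# Or treat a label missing from one Series as 1 before multiplying
filled = a.mul(b, fill_value=1)

print(matched)
print(filled)
```

With fill_value=1, 'mon' keeps its value from a and 'fri' keeps its value from b.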
But note that the types of things you can do with a Series depend on the Series’ dtype. Math functions, for example, can only be applied to numeric datatypes!
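As one illustration of a math function applied to a numeric Series, numpy functions such as np.sqrt work element by element and keep the index labels. This is a sketch using the first three days of the revenue figures from above:

```python
import numpy as np
import pandas as pd

revenue = pd.Series([1980, 1410, 1680], index=["Monday", "Tuesday", "Wednesday"])

# The function is applied to every element; the labels are preserved
root = np.sqrt(revenue)

print(root)
```

The result is itself a Series, so you can keep chaining Series operations on it.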
Summarizing with Functions¶
We often want to get summary statistics from a Series—that is, learn something general about it by looking beyond its constituent elements. If we have a Series in which each element represents a person’s height, we may want to know who the shortest or tallest person is, what the median or mean height is, what the standard deviation is, etc. Here are common summary facts for numeric Series (some also work for object types):
my_numbers = pd.Series([1, 2, 3, 4])
my_numbers.dtype # check the dtype
len(my_numbers) # number of elements
my_numbers.max() # maximum value
my_numbers.min() # minimum value
my_numbers.sum() # sum of all values in the Series
my_numbers.mean() # mean
my_numbers.median() # median
my_numbers.var() # variance
my_numbers.std() # standard deviation
my_numbers.quantile() # return the specified quantile, 0.5 if none specified
my_numbers.describe() # report many of the summary stats above at once
my_numbers.value_counts() # tabulate the values; add normalize=True to get the share of observations with each value
Of those, two of the most powerful are .describe() (for numeric Series that take on lots of values):
[ ]:
my_numbers = pd.Series(range(100))
my_numbers.describe()
count 100.000000
mean 49.500000
std 29.011492
min 0.000000
25% 24.750000
50% 49.500000
75% 74.250000
max 99.000000
dtype: float64
and .value_counts() for numeric Series with only a few unique values:
[ ]:
my_numbers = pd.Series([1, 2, 2, 2, 2, 1, 1, -1, -1])
my_numbers.value_counts()
2 4
1 3
-1 2
dtype: int64
Note that .value_counts() can be combined with the normalize=True argument to get the share (i.e. proportion) of observations that have each unique value, rather than the count:
[ ]:
my_numbers.value_counts(normalize=True)
2 0.444444
1 0.333333
-1 0.222222
dtype: float64
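The height example at the start of this section asked who the shortest or tallest person is. max() and min() return the values themselves; to get the matching labels, one option is idxmax() and idxmin(). The sketch below uses made-up names and heights (in centimeters), and also shows quantile() with an explicit list of quantiles.

```python
import pandas as pd

# Hypothetical heights in centimeters, labeled by name
heights = pd.Series([162.0, 175.5, 181.0, 158.5], index=["Ana", "Ben", "Cy", "Dee"])

tallest = heights.idxmax()   # label of the largest value
shortest = heights.idxmin()  # label of the smallest value

# Passing a list returns several quantiles at once, labeled by quantile
quartiles = heights.quantile([0.25, 0.75])

print(tallest, shortest)  # Cy Dee
print(quartiles)
```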
Summary¶
pandas Series are dynamic tools for representing a wider array of data types than (in general) numpy arrays. While numpy is the preferred tool for highly specialized numerical processing, pandas Series provide an important step towards understanding the representational and querying power of pandas for analyzing tabular data, which is an extremely common and important type of data for data scientists to work with. The Series introduces the idea of labeled and sortable indices, which will enhance our ability to query our data. In the next lesson, we will explore how to work with the contents of a pandas Series.