# Pandas Series#

This tutorial introduces the fundamental building block of `pandas`

, the Series. By the end of this section, you will learn how to create different types of Series, subset them, modify them, and summarize them.

## What is a Series?#

In the simplest terms, a Series is an ordered collection of values, generally all the same type. As examples of what you can do with a series, you can have a Series that contains the ages of everyone in your class (a numeric Series), or a Series of all the names of people in your family (a string Series).

This may sound familiar: isn’t that how we described `numpy`

vectors (i.e. one-dimensional numpy arrays)? Yes! In fact, Series are basically one-dimensional `numpy`

arrays with lots of extra features added on top of them. As we’ll see, almost everything you could do with a `numpy`

array you can do with a Series; Series can just do *more*. In particular, the Series provides the option to use an explicit index (rather than only using the row numbers), as well as some querying and analysis tools which we’ll discuss through this course.

Series are central to `pandas`

because `pandas`

was designed for statistics, and Series are a perfect way to collect lots of different observations of a variable. We’ll see that the apex of `pandas`

functionality which is found in DataFrames is essentially a collection of Series. Understanding Series will enable you to better understand DataFrames and their value for programming for data science.

## Creating a series#

There are lots of ways to create Series, but the easiest is to just pass a list or an array to the `pd.Series`

constructor.

To illustrate, let me tell you about a week at the zoo I wish I owned. Here’s what attendance looked like at my zoo last week:

Day of Week |
Attendees |
---|---|

Monday |
132 people |

Tuesday |
94 people |

Wednesday |
112 people |

Thursday |
84 people |

Friday |
254 people |

Saturday |
322 people |

Sunday |
472 people |

Let’s make a Series for this attendance pattern:

```
import pandas as pd # We have to import pandas to use Series!
attendance = pd.Series([132, 94, 112, 84, 254, 322, 472])
attendance
```

```
0 132
1 94
2 112
3 84
4 254
5 322
6 472
dtype: int64
```

## Indices#

One of the fundamental differences between `numpy`

arrays and Series is that all Series are associated with an `index`

. An index is a set of labels for each observation in a Series. If you don’t specify an `index`

when you create a Series, `pandas`

will create a default index that just labels each row with its initial row number, but you can specify an index if you want. When we explored lists and `numpy`

arrays, we also encountered indices that were used to access individual elements. For 2-dimensional arrays or less, these were essentially the row numbers. These implicit indices of the row numbers are also options in Series, but there’s the option to explicitly label your indices in ways that can make your data easier to analyze.

In the case below, for example, we know that these entries are associated with different days of the week, so let’s specify an index for our `attendance`

Series:

```
attendance = pd.Series(
[132, 94, 112, 84, 254, 322, 472],
index=[
"Monday",
"Tuesday",
"Wednesday",
"Thursday",
"Friday",
"Saturday",
"Sunday",
],
)
attendance
```

```
Monday 132
Tuesday 94
Wednesday 112
Thursday 84
Friday 254
Saturday 322
Sunday 472
dtype: int64
```

Now as we see the rows are labeled with days of the week on the left side, rather than with initial row numbers.

Note that you can always access a Series’ index with the `.index`

property:

```
attendance.index
```

```
Index(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday',
'Sunday'],
dtype='object')
```

An important property of index labels is that they stay with each row, even if you sort your data. So if I sort my Series by attendance, not only will rows re-order, but so will the index labels:

```
attendance = attendance.sort_values()
attendance
```

```
Thursday 84
Tuesday 94
Wednesday 112
Monday 132
Friday 254
Saturday 322
Sunday 472
dtype: int64
```

**Note:** This seems intuitive with days-of-the-week as our index labels, but it can be confusing when your `index`

starts its life as the default index — the Series’ initial row numbers. For example, if you had not changed our index to be days of the week, then the default index would look like the index labels were just row numbers. But if we then *sort* the Series, the numbers will shuffle, and they will no longer correspond to row numbers:

```
attendance = pd.Series([132, 94, 112, 84, 254, 322, 472])
attendance
```

```
0 132
1 94
2 112
3 84
4 254
5 322
6 472
dtype: int64
```

```
attendance = attendance.sort_values()
attendance
```

```
3 84
1 94
2 112
0 132
4 254
5 322
6 472
dtype: int64
```

# Relationship to `numpy`

Arrays#

As you’ve probably noted, there are some similarities in the construction of `pandas`

Series with how we constructed `numpy`

arrays. The biggest difference is the index. In the figure below we can see the code to construct both a `numpy`

array and a `pandas`

Series containing the same entries, however, we can also add a1 explicit index labels that may be easier for us to read or to use to access the row contents. For example, if we were representing financial data for three days of a week: ‘mon’, tue’, and ‘wed’, we may be able to do this as shown in the figure below. Once we’ve created our `pandas`

Series, we can return a `numpy`

array. If `b`

contains a `pandas`

Series, then we can return a `numpy`

array with `b.values`

.

While we will discuss how to access the data in these data structures in more detail in the next lesson, we can rather easily access an entry using the index or its assigned label. There is some nuance in using row numbers since as we saw above with the `sort_values()`

example, the default ordering of row number-based indices can change, so the `iloc`

method will allow us to always return the \(i^{th}\) entry regardless of how the Series has been sorted. More on that later, though…

Going forward, we want to be crystal clear when we’re discussing `pandas`

Series or DataFrames instead of when we are talking about `numpy`

arrays. Pictorially, we’ll represent those as little tables as shown below with their key characteristics labeled:

Note: Starting with the release of pandas version 2.0 in 2023,`numpy`

arrays are no longer the only way data can be stored on the back-end by`pandas`

. However,`numpy`

arrays are still the default way data is stored in`pandas`

, so it is what we will use in these lessons. We will discuss the alternate back-end —`arrow`

arrays — added in pandas 2.0 in a future reading.

## Types of Series#

Before we dive too far into Series manipulations, it’s important to talk about datatypes. Every Series, as we will see, has a “dtype” (short for datatype). The `dtype`

of a Series is important to understand because a Series’ `dtype`

determines what manipulations you can apply to that series.

There are, broadly, three types of Series:

Numeric: these hold numbers that

`pandas`

understands are numbers. Specific numeric datatypes include things like`int64`

, and`int32`

(integers), or`float64`

and`float32`

(floating point numbers).Object: these are Series that can hold any Python object, like strings, numbers, sets, you name it. They have dtype

`O`

for “objects”. They are flexible, but also very slow and actually harder to work with.Categorical: this is a special data type (analogous to factors in R, if that means anything to you) that we’ll discuss in a future reading. Don’t worry about them for now.

Numeric Series are by far the easiest to work with and are generally either *integers* (`int64`

, `int32`

, etc.) or *floating point numbers* (`float64`

, `float32`

). *Integer* Series (datatypes that start with `int`

) can *only* hold… well, integers (whole numbers), while *floating point numbers* Series (datatypes that start with `float`

) can hold integers, numbers with decimal points, and even missing values.

The numbers at the end of these types (`64`

, `32`

, etc.) have to do with how many actual bits of data are allocated to each number. For the moment, the differences between them don’t matter, and in general, you’ll likely always see (and should use) the `64`

suffix.

You can check the `dtype`

of a Series by typing `.dtype`

. For example, here are some different kinds of Series:

```
s = pd.Series([1, 2, 3])
s.dtype
```

```
dtype('int64')
```

```
s = pd.Series([1, 2, 3.14])
s.dtype
```

```
dtype('float64')
```

```
s = pd.Series([1, 2, "a string"])
s.dtype
```

```
dtype('O')
```

As you can see, integer (`int64`

) Series can *only* hold integers. If we add one number with a decimal component, the whole thing becomes a `float64`

. Similarly, floating point Series can only hold numbers. If we add a single String, the whole thing becomes an Object (`O`

) type.

### Converting datatypes#

If you want to change the datatype of a Series, you can do so with the `.astype()`

method… provided a conversion is possible! For example, you can always convert integer arrays to floating point Series because a whole number can be represented as a floating point number (just trust me on this for now… we’ll discuss why later!).

```
s = pd.Series([1, 2, 3])
s = s.astype("float64")
s
```

```
0 1.0
1 2.0
2 3.0
dtype: float64
```

But be careful: since integers can’t ever hold decimals, if you try and convert a floating point Series to an integer Series, it will just drop the decimal part of numbers with decimals!

```
s = pd.Series([1, 2, 3.14])
s = s.astype("int64")
s
```

```
0 1
1 2
2 3
dtype: int64
```

Note Pandas is just doing the same thing regular Python would do:

```
int(3.14)
```

```
3
```

But if you try and convert an “object” Series to numeric and there are numbers that can’t be converted, `pandas`

will throw an error:

```
s = pd.Series([1, 2, "a string"])
s.astype('float64')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/Users/nce8/github/practicaldatascience/source/10_pandas_series.ipynb Cell 32 in <cell line: 2>()
1 s = pd.Series([1, 2, "a string"])
----> 2 s.astype('float64')
...
File ~/opt/miniconda3/lib/python3.10/site-packages/pandas/core/dtypes/cast.py:1181, in astype_nansafe(arr, dtype, copy, skipna)
1177 raise ValueError(msg)
1179 if copy or is_object_dtype(arr.dtype) or is_object_dtype(dtype):
1180 # Explicit copy, or required since NumPy can't view from / to object.
-> 1181 return arr.astype(dtype, copy=True)
1183 return arr.astype(dtype, copy=copy)
ValueError: could not convert string to float: 'a string'
```

## Computation with Series#

One of the nice things about Series is that, like `numpy`

arrays, we can easily do things like multiply *all* the values by another number easily. For example, suppose tickets to my zoo cost $15 per person. What is the total money generated by ticket sales each day? Let’s find out!

```
attendance = pd.Series(
[132, 94, 112, 84, 254, 322, 472],
index=[
"Monday",
"Tuesday",
"Wednesday",
"Thursday",
"Friday",
"Saturday",
"Sunday",
],
)
attendance
```

```
Monday 132
Tuesday 94
Wednesday 112
Thursday 84
Friday 254
Saturday 322
Sunday 472
dtype: int64
```

```
revenue = attendance * 15
revenue
```

```
Monday 1980
Tuesday 1410
Wednesday 1680
Thursday 1260
Friday 3810
Saturday 4830
Sunday 7080
dtype: int64
```

Now what if we want to know the total amount raised in a week, instead of just the amount on each day? We can use one of `pandas`

’s many helper functions — in this case `sum`

— which adds up all the values of a Series:

```
revenue.sum()
```

```
22050
```

Cool!

This is an example of one of the three forms of Series arithmetic:

Operations involving a Series with more than one element and a single element (like multiplying all values of a Series by 15, as shown above).

Modifying a Series with a function.

```
revenue**2
```

```
Monday 3920400
Tuesday 1988100
Wednesday 2822400
Thursday 1587600
Friday 14516100
Saturday 23328900
Sunday 50126400
dtype: int64
```

Operations involving two Series.

**Note that when working with two Series, elements are matched based on index values,**.*not*row numbers (the behavior of numpy arrays)

### Index Alignment: Matching Entries Using Index Values#

When doing an operation between two Series, pandas matches up entries based not on row number (like `numpy`

or other programming languages like R); rather, it matches up entries based on index values, a practice called **index alignment**.

This can be a bit tricky, so let’s create a pair of Series to demonstrate:

```
a = pd.Series(data=[1, 2, 3.2], index=["mon", "tue", "wed"])
a
```

```
mon 1.0
tue 2.0
wed 3.2
dtype: float64
```

```
b = pd.Series(data=[4.0, 5, 6], index=["tue", "wed", "fri"])
b
```

```
tue 4.0
wed 5.0
fri 6.0
dtype: float64
```

Note that the indices are not the same, although `'tue'`

and `'wed'`

are in both Series. Let’s multiply them together:

```
c = a * b
c
```

```
fri NaN
mon NaN
tue 8.0
wed 16.0
dtype: float64
```

Two Series multiplied together that have explicit index labels perform the operation by matching the index values. `tue`

is the second entry in `a`

and the first in `b`

, but pandas matches them up to compute the final entry associated with `tue`

: `2 * 4 = 8`

.

If an observation has an index value that doesn’t exist in the other Series (e.g., `fri`

and `mon`

), a `NaN`

value is returned.

Of course, with the default row number-based indices, if the data have not been sorted, they multiply element-wise as you would expect:

```
a = pd.Series(data=[1, 2, 3.2])
a
```

```
0 1.0
1 2.0
2 3.2
dtype: float64
```

```
b = pd.Series(data=[4.0, 5, 6])
b
```

```
0 4.0
1 5.0
2 6.0
dtype: float64
```

```
c = a * b
c
```

```
0 4.0
1 10.0
2 19.2
dtype: float64
```

Note that the types of things you can do with a Series depends on the Series `dtype`

. Math functions, for example, can only be applied to numeric datatypes!

### Summarizing with Functions#

We often want to get summary statistics from a Series—that is, learn something general about it by looking beyond its constituent elements. If we have a Series in which each element represents a person’s height, we may want to know who the shortest or tallest person is, what the median or mean height is, what the standard deviation is, etc. Here are common summary facts for numeric Series (some also work for object types):

```
my_numbers = pd.Series([1, 2, 3, 4])
my_numbers.dtype #check the dtype
len(my_numbers) #number of elements
my_numbers.max() #maximum value
my_numbers.min() #minimum value
my_numbers.sum() #sum of all values in the Series
my_numbers.mean() #mean
my_numbers.median() #median
my_numbers.var() #variance
my_numbers.std() #standard deviation
my_numbers.quantile() #return specified quantile, 0.5 if none specified
my_numbers.describe() #function that contains many summary stats from above
my_numbers.value_counts() # Tabulate out all the values. Add the argument `normalize=True` to get shares in each big.
```

Of those, two of the most powerful are `.describe()`

(for numeric Series that take on lots of values):

```
my_numbers = pd.Series(range(100))
my_numbers.describe()
```

```
count 100.000000
mean 49.500000
std 29.011492
min 0.000000
25% 24.750000
50% 49.500000
75% 74.250000
max 99.000000
dtype: float64
```

and `.value_counts()`

for numeric series with only a few unique values:

```
my_numbers = pd.Series([1, 2, 2, 2, 2, 1, 1, -1, -1])
my_numbers.value_counts()
```

```
2 4
1 3
-1 2
Name: count, dtype: int64
```

Note that `.value_counts()`

can be combined with the `normalize=True`

argument to get the share (i.e. proportion) of observations that have each unique value, rather than the count:

```
my_numbers.value_counts(normalize=True)
```

```
2 0.444444
1 0.333333
-1 0.222222
Name: proportion, dtype: float64
```

## Summary#

`pandas`

Series are dynamic tools for representing a wide array of data types than (in general) `numpy`

arrays. While `numpy`

is the preferred tool for highly specialized numerical processing, `pandas`

Series provide an important step towards understanding the representational and querying power of `pandas`

for analyzing tabular data, which is an extremely common and important type of data for data scientists to work with. The Series introduces the idea of labeled and sortable indices which will enhance our ability to query our data. In the next lesson, we will explore how to work with the contents of a `pandas`

Series.