# Subsetting and indexing Series¶

## Subsetting Series¶

Extracting a subset of elements from a Series is an extremely important task, not least because it generalizes nicely to working with ever-larger datasets (which are at the heart of data science). This process—whether applied to a Series or a dataset—is often referred to as “taking a subset”, “subsetting”, or “filtering”. If there is one skill that is key for enhancing your data science skills quickly, it’s this, because this allows you to get your data into the right format for processing as quickly as possible.

In `pandas`

, there are three ways to filter a Series: using a separate logical Series, using row-number indexing, and using index labels. I tend to use the first method most, but all three are useful. The first and second of these you will recognize from `numpy`

arrays, while the last one (since it uses index labels which only exist in `pandas`

) is unique to `pandas`

.

### Subsetting using row-number indexing¶

A different way to subset a Series is to specify the row-numbers you want to keep using the `iloc`

function. (`iloc`

stands for “integer location”, since row numbers are always integers). This will give you the behavior you’re more familiar with from `numpy`

. Just remember that, as in all of Python, the first row is numbered `0`

!

```
[1]:
```

```
import pandas as pd
fruits = pd.Series(["apple", "banana"])
fruits.iloc[0]
```

```
[1]:
```

```
'apple'
```

You can also subset with lists of rows, or ranges, just like in `numpy`

:

```
[8]:
```

```
fruits.iloc[[0, 1]]
```

```
[8]:
```

```
0 apple
1 banana
dtype: object
```

```
[9]:
```

```
fruits.iloc[0:2]
```

```
[9]:
```

```
0 apple
1 banana
dtype: object
```

### Subsetting using index values¶

Lastly, we can subset our rows using the index values associated with each row using the `loc`

function.

```
[10]:
```

```
attendance = pd.Series(
[132, 94, 112, 84, 254, 322, 472],
index=[
"Monday",
"Tuesday",
"Wednesday",
"Thursday",
"Friday",
"Saturday",
"Sunday",
],
)
```

```
[11]:
```

```
attendance.loc["Monday"]
```

```
[11]:
```

```
132
```

You can also ask for ranges of index labels. Note that unlike in integer ranges (like the `0:2`

we used above to get rows 0 and 1), index label ranges *include* the last item in the range. So for example if I ask for `.loc["Monday":"Friday"]`

, I will get Friday included, even if `.iloc[0:2]`

doesn’t include 2.

```
[12]:
```

```
attendance.loc["Monday":"Friday"]
```

```
[12]:
```

```
Monday 132
Tuesday 94
Wednesday 112
Thursday 84
Friday 254
dtype: int64
```

### Subsetting with logicals¶

Let’s jump right into an example, using our Zoo attendance Series:

```
[13]:
```

```
attendance = pd.Series(
[132, 94, 112, 84, 254, 322, 472],
index=[
"Monday",
"Tuesday",
"Wednesday",
"Thursday",
"Friday",
"Saturday",
"Sunday",
],
)
attendance
```

```
[13]:
```

```
Monday 132
Tuesday 94
Wednesday 112
Thursday 84
Friday 254
Saturday 322
Sunday 472
dtype: int64
```

Suppose we want to only get days with at least 100 people attending. We can subset our Series by using a simple test to build a Series of booleans (True and False values), then asking `pandas`

for the rows of our Series for which the entry in our test Series is `True`

:

```
[14]:
```

```
was_busy = attendance > 100
was_busy
```

```
[14]:
```

```
Monday True
Tuesday False
Wednesday True
Thursday False
Friday True
Saturday True
Sunday True
dtype: bool
```

```
[15]:
```

```
busy_days = attendance.loc[was_busy]
busy_days
```

```
[15]:
```

```
Monday 132
Wednesday 112
Friday 254
Saturday 322
Sunday 472
dtype: int64
```

We can summarize these methods in the figure below:

There is one really important distinction between how subsetting works in `pandas`

and most other languages though, which has to do with indices. Suppose we want to subset a Series with fruits to only get the entry “apple”. Would could do the following:

```
[16]:
```

```
fruits = pd.Series(["apple", "banana"])
apple_selector = pd.Series([True, False])
fruits.loc[apple_selector]
```

```
[16]:
```

```
0 apple
dtype: object
```

This looks familiar from `numpy`

, *but*:

A very important difference between `pandas`

and other languages and libraries (like `numpy`

) is that when a logical Series is passed into `loc`

, evaluation is done *not* on the basis of the order of entries, but on the basis of index values. In the case above, because we did not specify indices for either `fruits`

or `apple_selector`

, they both got the usual default index values of their initial row numbers. But let’s see what happens if we change their indices so they don’t match their order:

```
[17]:
```

```
fruits # We can leave fruits as they are
```

```
[17]:
```

```
0 apple
1 banana
dtype: object
```

```
[18]:
```

```
apple_selector = pd.Series([True, False], index=[1, 0])
apple_selector
```

```
[18]:
```

```
1 True
0 False
dtype: bool
```

Note that we’ve *flipped* the index order for `apple_selector`

: the first row has index value 1, and the second row has value 2. Now watch what happens when we put `apple_selector`

in square brackets:

```
[19]:
```

```
fruits.loc[apple_selector]
```

```
[19]:
```

```
1 banana
dtype: object
```

We get `banana`

! That’s because in `apple_selector`

, the index value associated with the `True`

entry as 1, and the row of `fruit`

that had index value 1 was `banana`

, even though they are in different rows. This is called `index alignment`

, and is absolutely crucial to keep in mind while using `pandas`

.

But note this only happens *if* your boolean array is a Series (and thus has an index). If you pass a `numpy`

boolean array or a list of booleans (neither of which have a concept of an index), then despite using `loc`

, alignment will be based on row numbers not index values (because there *are* no index values to align).

```
[20]:
```

```
fruits.loc[[True, False]]
```

```
[20]:
```

```
0 apple
dtype: object
```

**UGH** I know. If I wrote pandas, this would not work and this would throw an exception. But that’s how it is.

While this distinction between row numbers and index values is important, though, it comes up less often than you’d think. That’s because usually we subset by feeding in a new Series of booleans we made by hand; instead we build a new Series by executing a test on the Series we’re using. And when we do that, the new Series of booleans will have the same index as the old Series, so they align naturally. Look back at our example of `was_busy`

: you’ll see that it automatically got the same index as our original Series, `attendance`

. As a result, the first row of our boolean Series will have the same index value as the first row of our original Series, the second row of our boolean Series will have the same index value as the second row of our original Series, and so on. As a result, there’s no difference between matching on row order and matching on index value. But it does *occassionally* come up (like if you tried to re-sort one of these), so keep it in mind!

### Single Square Brackets (`[]`

)¶

As discussed above, because Series have both an order of rows, and labels for each row, you should always think carefully about which method of subsetting you are invoking. My advice: **Always using the ``.loc`` (for index labels) and ``.iloc`` (for row numbers) selectors. If you use these, the only surprising behavior to watch out for is that ``loc`` will align on row numbers if you pass a list or array of booleans with no index.** But since you *can’t* align on an index in that case, there’s no alternative behavior you would be expecting in that situation.

However, there is another way to subset Series that is a little… stranger. In an effort to be easier for users, `pandas`

allows subsetting using *just* square brackets (without a `.loc`

or `.iloc`

). With just square brackets, `pandas`

will do different things depending on what you put in the square brackets. *In theory* this should always “do what you want it to do”, but in my experience it’s a recipe for errors. With that in mind, I don’t suggest using it, but I will detail how it works here so you know. If this makes your head swim, just remember: **you can always use ``loc`` and ``iloc``. Single square brackets do not offer any functionality you can’t get with ``.loc`` or ``.iloc``.**

So, if you pass an index values into square brackets, `pandas`

will subset based on index values (as though you were using `.loc`

).

```
[21]:
```

```
attendance
```

```
[21]:
```

```
Monday 132
Tuesday 94
Wednesday 112
Thursday 84
Friday 254
Saturday 322
Sunday 472
dtype: int64
```

```
[22]:
```

```
attendance["Sunday"]
```

```
[22]:
```

```
472
```

Similarly, if you pass booleans to square brackets, then `pandas`

will behave like you are using `.loc`

as well:

```
[23]:
```

```
attendance[attendance > 100]
```

```
[23]:
```

```
Monday 132
Wednesday 112
Friday 254
Saturday 322
Sunday 472
dtype: int64
```

(If it’s not clear to you why `attendance[attendance > 100]`

is a test with an index: Python first evaluates `attendance > 100`

. That generates a new Series of booleans with the same index as `attendance`

. Then Python evaluates the `attendance[]`

part of the problem.)

**BUT**: *if* your Series index is not integer based, *and* if you pass integers into the square brackets, it will act like you’re using `iloc`

:

```
[24]:
```

```
attendance[0]
```

```
[24]:
```

```
132
```

*Most* of the time, this works out. But you can get confused you are working with a Series that has a numeric index. If you pass an integer into `[]`

, *and* you have an index of integers, then `[0]`

will be treated like your typing `.loc[0]`

, not `.iloc[0]`

:

```
[25]:
```

```
series_w_numeric_index = pd.Series(["dog", "cat", "fish"], index=[2, 1, 0])
series_w_numeric_index
```

```
[25]:
```

```
2 dog
1 cat
0 fish
dtype: object
```

```
[26]:
```

```
series_w_numeric_index[0]
```

```
[26]:
```

```
'fish'
```

So personally, I try to always use `loc`

or `iloc`

to avoid this kind of confusion. But if you do use `[]`

on their own, just be very careful that you don’t inadvertently select row based on index values when you think you’re selecting on

## Summary¶

Being able to select the data you need for a given analysis is a foundational skill to develop. Having the programming proficiency to be able to do this quickly will significantly reduce the time you need to prepare your data for analysis. There are three primary methods of accessing and filtering data: logical indexing, row-number indexing (e.g. `iloc`

), and index labels, and together this toolkit can enhance your ability to access and filter data. Next, you’ll explore an exercise for trying out this skills yourself.