# Working with tabular data¶

In the last class, we discussed how to work with numerical data through arrays and matrices in cases when *all the data were of the same type*. You encounter this often, so `numpy`

is a powerful tool for working with such homogeneous data (data of the same type) and in such cases is often the preferred tool if you want to do numeric computations because of its speed. However, there are a number of frequently encountered situations in which `numpy`

tools may not be ideal and where instead the `pandas`

library may be best. These include:

If you do not have purely numerical data (i.e. if your data are heterogeneous), and therefore have a mix of data types,

`numpy`

may not be appropriate, while`pandas`

can easily handle and analyze mixed data typesIf you have tabular data,

`pandas`

makes life much easier to**describe**,**summarize**,**query**, and**visualize**the data than the equivalent processes with`numpy`

.

## Mixed data types¶

Often times our data are of mixed types (e.g. integers and strings). This happens all the time. Imagine that you are collecting basic medical information from a patient. You may ask for height and weight (numerical, floating point numbers), age (integer), and blood type (categorical, string). While numpy can store these together in an array, there’s not much you’re going to be able to do with it computationally. Consider the following example where `numpy`

throws an error:

```
[1]:
```

```
import numpy as np
a = np.array([6.1, 150.0, 25, "A-"])
b = np.array([5.6, 122.0, 29, "B+"])
c = a + b
```

```
---------------------------------------------------------------------------
UFuncTypeError Traceback (most recent call last)
c:\Users\kjb17\Dropbox\Code\mids_coursera\class_3\week_2\10_intro_to_pandas.ipynb Cell 2' in <module>
<a href='vscode-notebook-cell:/c%3A/Users/kjb17/Dropbox/Code/mids_coursera/class_3/week_2/10_intro_to_pandas.ipynb#ch0000001?line=1'>2</a> a = np.array([6.1,150.0,25,'A-'])
<a href='vscode-notebook-cell:/c%3A/Users/kjb17/Dropbox/Code/mids_coursera/class_3/week_2/10_intro_to_pandas.ipynb#ch0000001?line=2'>3</a> b = np.array([5.6,122.0,29,'B+'])
----> <a href='vscode-notebook-cell:/c%3A/Users/kjb17/Dropbox/Code/mids_coursera/class_3/week_2/10_intro_to_pandas.ipynb#ch0000001?line=3'>4</a> c = a + b
UFuncTypeError: ufunc 'add' did not contain a loop with signature matching types (dtype('<U32'), dtype('<U32')) -> None
```

`pandas`

, on the other hand, doesn’t have any problem with mixed data types, on the other hand. It will use the appropriate operation for each data type, adding the numbers and concatenating the strings:

```
[2]:
```

```
import pandas as pd
a = pd.Series([6.1, 150.0, 25, "A-"])
b = pd.Series([5.6, 122.0, 29, "B+"])
c = a + b
c
```

```
[2]:
```

```
0 11.7
1 272.0
2 54
3 A-B+
dtype: object
```

While we will see how `pandas`

ability to use mixed types really shines later in this lesson, let’s also take a look at some motivating examples of what makes `DataFrames`

so useful

## A hierarchy of data types: from lists to `numpy`

arrays to `pandas`

Series and DataFrames¶

We’ve encountered a few data types already in Python including lists and dictionaries, and more recently `numpy`

arrays. Numpy arrays were essentially ways of creating 1, 2, or N-dimensional matrices and the computational tools to work with them efficiently. As a practice, it’s best to use the simplest tool for the job. If all we need is a collection and we’re not going to perform much computation on it, lists and dictionaries may be just fine. If we need to perform computation, then using the `numpy`

arrays may make more sense. In this section, we’re introducing `pandas`

objects/data types including the series and the dataframe. We can think of a `pandas`

series as a 1-dimensional numpy array with more functionality for selecting and querying the data. Data frames are then a collection of series with even more querying tools build into the objects themselves.

## Effectively working with tabular data¶

Let’s walk through 3 examples of what you can do easily with `pandas`

that can be rather complicated without it:

Quickly reading your data into a structured tabular form

Quickly describe/summarize your data

Quickly querying your dataset

Quickly plotting your data

### Example 1: Quickly reading your data into a structured tabular form¶

Using the convenient `pandas`

methods that hide away the tricky bits, loading in tabular data is trivially easy. Let’s load in a dataset to demonstrate (we’ll talk more about how these work throughout this week):

```
[5]:
```

```
import pandas as pd
smallworld = pd.read_csv("../Example_data/world-very-small.csv")
smallworld
```

```
[5]:
```

country | region | gdppcap08 | polityIV | |
---|---|---|---|---|

0 | Brazil | S. America | 10296 | 18 |

1 | Germany | W. Europe | 35613 | 20 |

2 | Mexico | N. America | 14495 | 18 |

3 | Mozambique | Africa | 855 | 16 |

4 | Russia | C&E Europe | 16139 | 17 |

5 | Ukraine | C&E Europe | 7271 | 16 |

It really doesn’t get much easier than that. We have text content (under ‘country’ and ‘region’) and numerical content (‘gdp_per_capita_2008’) and column headings are even included!

### Example 2: Quickly describe / summarize your data¶

With the use of a single `pandas`

method, we can summarize the statistics of any fully-numerical columns of data:

```
[6]:
```

```
smallworld.describe()
```

```
[6]:
```

gdp_per_capita_2008 | |
---|---|

count | 6.000000 |

mean | 14111.500000 |

std | 11863.031683 |

min | 855.000000 |

25% | 8027.250000 |

50% | 12395.500000 |

75% | 15728.000000 |

max | 35613.000000 |

### Example 3 Quickly querying your data¶

Not only can we load and describe our data quickly, but we can query our data quickly, too. Let’s say we wanted to find the countries on the list with per-capita GDP below $10,000. This also becomes extremely simple:

```
[6]:
```

```
smallworld.loc[smallworld.gdppcap08 < 10000]
```

```
[6]:
```

country | region | gdppcap08 | polityIV | |
---|---|---|---|---|

3 | Mozambique | Africa | 855 | 16 |

5 | Ukraine | C&E Europe | 7271 | 16 |

### Example 4: Quickly plotting your data¶

And if you want/need a quick look at your data, plotting is also extremely straightforward with `pandas`

as it builds on the `matplotlib`

ecosystem for plotting. Let’s create a bar plot of the GDP per capita for each of the countries in our list:

```
[7]:
```

```
smallworld.plot.bar(x="country", y="gdppcap08")
```

```
[7]:
```

```
<AxesSubplot:xlabel='country'>
```

### Drawbacks of `pandas`

¶

Despite its many advantages, and as with all tools, `pandas`

also has its drawbacks. First of all, the syntax of `pandas`

is a bit different from what we’ve discussed previously with base Python and with `numpy`

, making it a bit challenging to learn. Personally, I still find myself regularly consulting the documentation for `pandas`

when I’m using a method I haven’t used in awhile. The other drawback is that `pandas`

is particularly well-designed for 1D series/arrays and 2D matrices. It’s not suitable for handling 3D or N-D matrices (where N > 3). In such cases `numpy`

would be preferred, or the higher dimensional analogue of pandas, xarray.