# Working with tabular data#

Tabular data is any collection of data that can be organized into rows and columns, essentially a two-dimensional matrix, as opposed to unstructured text, network data, geospatial data, image data, and video data, each of which have distinct methods and file formats for the data.

In the last class, we discussed how to work with tabular numerical data through arrays and matrices in cases when *all the data were of the same type*. You encounter this often, so `numpy`

is a powerful tool for working with such homogeneous data (data of the same type) and in such cases is often the preferred tool if you want to do numeric computations because of its speed. However, there are several frequently encountered situations in which `numpy`

tools may not be ideal and where instead the `pandas`

library may be best. These include:

Your data includes not only numerical data, but also text, images, geometric objects, or any other type of data (i.e., your data are

*heterogeneous*). In these situations,`numpy`

may not be appropriate, while`pandas`

can easily organize, manipulate, and analyze this kind of “mixed data.”If you have tabular data — that is, data organized into rows and columns, as in a spreadsheet — it is often easier in

`pandas`

to**describe**,**summarize**,**query**, and**visualize**the data than the equivalent processes with`numpy`

.

## Mixed data types#

Often our data are of mixed types (e.g. integers and strings). This happens all the time. Imagine that you are collecting basic medical information from a patient. You may ask for height and weight (numerical, floating point numbers), age (integer), and blood type (categorical, string). While numpy can store these together in an array, there’s not much you’re going to be able to do with it computationally. Consider the following example where `numpy`

throws an error:

```
import numpy as np
a = np.array([6.1,150.0,25,'A-'])
b = np.array([5.6,122.0,29,'B+'])
c = a + b
```

```
---------------------------------------------------------------------------
UFuncTypeError Traceback (most recent call last)
c:\Users\kjb17\Dropbox\Code\mids_coursera\class_3\week_2\10_intro_to_pandas.ipynb Cell 2' in <module>
2 a = np.array([6.1,150.0,25,'A-'])
3 b = np.array([5.6,122.0,29,'B+'])
----> 4 c = a + b
UFuncTypeError: ufunc 'add' did not contain a loop with signature matching types (dtype('<U32'), dtype('<U32')) -> None
```

`pandas`

, on the other hand, doesn’t have any problem with mixed data types. It will use the appropriate operation for each data type, adding the numbers and concatenating the strings:

```
import pandas as pd
a = pd.Series([6.1, 150.0, 25, "A-"])
b = pd.Series([5.6, 122.0, 29, "B+"])
c = a + b
c
```

```
0 11.7
1 272.0
2 54
3 A-B+
dtype: object
```

In the summer of 2023, the

`pandas`

package was updated to`pandas 2.0`

. Inmostcases,`pandas 2.0`

operates in the same manner as`pandas 1.0`

, but there are a few important differences.Allmaterials in this course reflect the behavior of`pandas 2.0`

, and where the behavior of`pandas 2.0`

is notably different from the behavior of previous versions of`pandas`

, we will note that explicitly.It is important you understand this change took place recently because the various versions of

`pandas 1.x`

(e.g.,`pandas 1.0, 1.1, 1.2, ...`

) were around for over three years, and much of the behavior of`pandas 1.x`

was also present before`pandas 1.0`

was released. As a result, alotof answers you find on resources like StackOverflow or Google, or code you get from generative AI tools like`chatGPT`

or`Github Copilot`

will be based on behavior that, in places, may now be out of date!

## A hierarchy of data types: from lists to `numpy`

arrays to `pandas`

Series and DataFrames#

In this section, we’re introducing `pandas`

objects/data types including the `Series`

and the `DataFrames`

.

We can think of a `pandas`

`Series`

as a 1-dimensional numpy array with more functionality for selecting and querying data than we get in a 1-dimensional `numpy`

array. `DataFrames`

, similarly, are a collection of `Series`

analogous to a 2-D numpy array, but with even more querying tools built into the objects themselves.

## Effectively working with tabular data#

Let’s walk through 3 examples of what you can do easily with `pandas`

that can be rather complicated without it:

Quickly reading your data into a structured tabular form

Quickly describe/summarize your data

Quickly querying your dataset

Quickly plotting your data

### Example 1: Quickly reading your data into a structured tabular form#

Using the convenient `pandas`

methods that hide away the tricky bits, loading tabular data is trivially easy. Let’s load in a dataset to demonstrate (we’ll talk more about how these work throughout this week):

```
import pandas as pd
smallworld = pd.read_csv("./data/world-very-small.csv")
smallworld
```

country | region | gdp_per_capita_2008 | |
---|---|---|---|

0 | Brazil | S. America | 10296 |

1 | Germany | W. Europe | 35613 |

2 | Mexico | N. America | 14495 |

3 | Mozambique | Africa | 855 |

4 | Russia | C&E Europe | 16139 |

5 | Ukraine | C&E Europe | 7271 |

It really doesn’t get much easier than that. We have text content (under `country`

and `region`

) and numerical content (`gdp_per_capita_2008``) and column headings are even included!

### Example 2: Quickly describe / summarize your data#

With the use of a single `pandas`

method, we can summarize the statistics of any fully-numerical columns of data:

```
smallworld.describe()
```

gdp_per_capita_2008 | |
---|---|

count | 6.000000 |

mean | 14111.500000 |

std | 11863.031683 |

min | 855.000000 |

25% | 8027.250000 |

50% | 12395.500000 |

75% | 15728.000000 |

max | 35613.000000 |

### Example 3 Quickly querying your data#

Not only can we load and describe our data quickly, but we can query our data quickly, too. Let’s say we wanted to find the countries on the list with per-capita GDP below $10,000. This also becomes extremely simple:

```
smallworld.loc[smallworld.gdp_per_capita_2008 < 10000]
```

country | region | gdp_per_capita_2008 | |
---|---|---|---|

3 | Mozambique | Africa | 855 |

5 | Ukraine | C&E Europe | 7271 |

### Example 4: Quickly plotting your data#

If you want/need a quick look at your data, plotting is extremely straightforward with `pandas`

, as it builds on the `matplotlib`

ecosystem for plotting. Let’s create a bar plot of the GDP per capita for each of the countries in our list:

```
smallworld.plot.bar(x="country", y="gdp_per_capita_2008")
```

```
<Axes: xlabel='country'>
```

### Drawbacks of `pandas`

#

Despite its many advantages, and as with all tools, `pandas`

also has its drawbacks. First of all, the syntax of `pandas`

is a bit different from what we’ve discussed previously with base Python and with `numpy`

, making it a bit challenging to learn. Personally, I still find myself regularly consulting the documentation for `pandas`

when I’m using a method I haven’t used in a while. The other drawback is that `pandas`

is particularly well-designed for 1D series/arrays and 2D matrices. It’s not suitable for handling 3D or N-D matrices (where N > 3). In such cases, there are two main options. The first is to simply revert to our trustworthy friend `numpy`

— it’s extremely flexible, powerful, and you already know how to use it! The second option is to use the xarray package, which is to higher-dimensional numpy arrays what pandas is to 1 or 2D numpy arrays — essentially a wrapper around numpy arrays that adds an ability to label your dimensions in a more intuitive manner!

### The week ahead#

With a sense of the benefits of `pandas`

, this week we will dive into how `pandas`

can be effectively deployed as a key programming tool in working with tabular data and used to construct targeted, and sometimes complicated, queries of tabular data that can be used to answer key data science questions.