{
"cells": [
{
"cell_type": "markdown",
"id": "5b7dda17",
"metadata": {},
"source": [
"# Working with Vectors\n",
"\n",
"## Why Should I Care?\n",
"\n",
"The year is 2028, and you have just been approached by the newly elected President of the United States. She hands you a flash drive and says:\n",
"\n",
"\"On this drive is data from the U.S. Census Bureau of the total incomes of over one million households from last year. As you know for my campaign, I am deeply concerned about income inequality, and so I would like you to use this data to answer several questions for me:\n",
"\n",
"- What is the average income of US households?\n",
"- How many households are currently living below the Federal poverty line of $28,000?\n",
"- Of those households, what share are near the poverty line (say, earning more than $20,000), and how many are in extreme poverty (below $20,000)?\n",
"- If I provided a tax credit of $10,000 to those making less than $10,000, what impact would that have on income inequality?\"\n",
"\n",
"What would you do? Using some of the tools we've learned about previously—like lists—you could *maybe* load that data and do some simple calculations (like getting the average income of households), but how might you do these more complicated analyses?\n",
"\n",
"With numpy of course! As we will see in the following several readings, these types of questions—in which we seek to characterize various properties of a collection of individual measurements of income—are precisely what numpy was designed to answer. Indeed, later this week you will be provided with real US household income data from the US Census Bureau and you'll be able to conduct analyses to answer these exact types of questions!\n",
"\n",
"(And don't worry if income inequality isn't a topic that you care about—these same skills are relevant to *lots* of business questions, income inequality is just a fun illustrative example. Feel free to also imagine you're starting a business and want to use this income data to estimate the number of households that have the right income to be potential customers for your business!)\n"
]
},
{
"cell_type": "markdown",
"id": "5586d717",
"metadata": {},
"source": [
"\n",
"## Vectors in Context\n",
"\n",
"In this reading, we'll begin our introduction to numpy with the most basic form of numpy array: the vector! We'll start by helping to contextualize and explain why we use vectors, then we'll talk about how to create a vector and use it to do mathematical operations. \n",
"\n",
"As we mentioned in our last reading, the fundamental workhorse of data science in python is the numpy array. While all numpy arrays are similar, they do come in a range of flavors depending on the number of dimensions along which they organize the data they contain:\n",
"\n",
"\n",
"![numpy array example image](img/arrays_w_axis_3d.png)\n",
"\n",
"The simplest form of the numpy array is a one-dimensional array, also known as a *vector*. Vectors are a building block of data science because they are often used to represent a collection of different *measurements* or *observations* of the same thing. For example, one may use a vector to hold the heights of everyone in a classroom, or a series of measurements of one's heart rate taken over time.\n",
"\n",
"When we move from one dimension to two, the resulting array is also known as a *matrix*. Matrices (the plural of matrix) are commonly used to represent data in two different ways. \n",
"\n",
"The first way we can represent data in a matrix is by placing lots of vectors side by side so that each column becomes a different property being measured, and each row becomes a single entity whose properties are being measured. For example, you could imagine storing data about customers in a matrix, where each row is a different individual customer, and each column is a different type of data being collected (days since customer's last purchase, total dollars customer has spent at a store, customer age, etc.).\n",
"\n",
"The second common way of representing data in a matrix is by using the matrix to represent fundamentally two-dimensional data, like a picture. A simple black and white image, for example, can be represented by a matrix where the value in each cell is the darkness of the corresponding pixel, and a color image can be created by combining multiple matrices—one matrix for the amount of blue in each pixel, one for the amount of red, and one for the amount of green.\n",
"\n",
"While vectors and matrices are probably the most used types of arrays in data science, arrays can be extended into as many dimensions as one wants! For example, we could represent that color image we just described not with three matrices, but with one three-dimensional array composed of the three stacked matrices. Or we might also want to work with a three-dimensional array to represent three-dimensional data, like the results of an MRI scan of a brain with MRI signal strength in each cell, or a climate model with temperatures at a given location and altitude in each cell. Indeed, even higher dimensional arrays are commonly used, even if they're harder to visualize—for example, if we wanted to model how a three-dimensional climate model changes over time, we could think of that as a series of three-dimensional arrays (each representing the world at a given time) stacked along a fourth dimension (time)!\n"
]
},
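{
"cell_type": "markdown",
"id": "c3d9a4e1",
"metadata": {},
"source": [
"To make these dimensions a bit more concrete, here is a minimal sketch that builds one array of each kind and checks its `.ndim` attribute (the number of dimensions) and its `.shape` attribute (the length along each dimension). The values and the variable names (`heart_rates`, `customers`, `image`) are made up purely for illustration, and we haven't covered how to create arrays yet, so don't worry about the syntax for now:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8f2b6d07",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# A vector: one dimension (three hypothetical heart-rate readings)\n",
"heart_rates = np.array([62, 71, 68])\n",
"\n",
"# A matrix: two dimensions (each row is a hypothetical customer,\n",
"# each column a property: days since purchase, dollars spent, age)\n",
"customers = np.array([[12, 250.0, 34], [3, 980.5, 51]])\n",
"\n",
"# A three-dimensional array: a tiny 4x4 image with three color layers\n",
"image = np.zeros((4, 4, 3))\n",
"\n",
"# Print the number of dimensions and the shape of each array\n",
"print(heart_rates.ndim, heart_rates.shape)\n",
"print(customers.ndim, customers.shape)\n",
"print(image.ndim, image.shape)"
]
},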
{
"cell_type": "markdown",
"id": "ea55d6e5",
"metadata": {},
"source": [
"\n",
"## Vectors\n",
"\n",
"All the flexibility that makes arrays so powerful can also be **really** overwhelming, so while it's helpful for you to know a little about why arrays are so powerful, we'll start our lesson by just getting a firm grasp on how to work with the simplest form of arrays—the vector. \n",
"\n",
"Then, once we feel really comfortable with vectors, we'll talk about how everything you've learned about manipulating vectors can be easily generalized to these higher-dimensional arrays. Because as we'll see, the real magic of arrays isn't that they come in so many flavors—it's that arrays follow the same logic whether they're simple one-dimensional vectors or 10-dimensional tensors.\n",
"\n",
"### Creating a Vector\n",
"\n",
"Vectors are one-dimensional arrays, which means they have two key properties: first, they organize all their data in a line along one dimension (like the lists you saw in your previous readings), and second, they are *homogeneously typed*, meaning each vector only holds data of one type (integer, floating point number, etc.)."
]
},
{
"cell_type": "markdown",
"id": "503b0ecb",
"metadata": {},
"source": [
"The simplest way to create a vector is with the `np.array()` function and a list:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "7285fecd",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1, 2, 3])"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"\n",
"# A vector of ints\n",
"an_integer_vector = np.array([1, 2, 3])\n",
"an_integer_vector"
]
},
{
"cell_type": "markdown",
"id": "1e2789c4",
"metadata": {},
"source": [
"When you create a vector this way, numpy will do its best to infer the type of data you want the vector to store based on the data you provided it. You can see what it guessed by checking the `.dtype` attribute of your array:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "21cee23c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dtype('int64')"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"an_integer_vector.dtype"
]
},
{
"cell_type": "markdown",
"id": "26b9e103",
"metadata": {},
"source": [
"We'll talk more about numpy data types, but for now it's sufficient to know that `int64` is a kind of integer. So in this case, you passed `np.array()` a list of three integers, so it chose to create an array of integers!"
]
},
{
"cell_type": "markdown",
"id": "47225520",
"metadata": {},
"source": [
"Vectors aren't limited to integers, or course -- we can also create vectors of floating point numbers (numbers with decimal components), Booleans, or strings!"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "7b767eb7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1.7 , 2. , 3.14])"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# A vector of floats\n",
"a_float_vector = np.array([1.7, 2, 3.14])\n",
"a_float_vector"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "a7d5926e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dtype('float64')"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"a_float_vector.dtype"
]
},
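{
"cell_type": "markdown",
"id": "e7a41b92",
"metadata": {},
"source": [
"Notice that the list we passed above contained the integer `2`, yet the output shows it as `2.` with a decimal point. Because vectors are homogeneously typed, numpy picks a single type that can represent every entry in the list. Here is a minimal sketch of the same behavior with made-up values (the name `mixed` is just for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4d6f0c35",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# A list that mixes two integers with one float (made-up values)\n",
"mixed = np.array([1, 2, 3.5])\n",
"\n",
"# numpy settles on one type that can hold every entry\n",
"print(mixed.dtype)\n",
"\n",
"# Even the entries we typed as integers are stored as floats\n",
"print(type(mixed[0]))"
]
},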
{
"cell_type": "code",
"execution_count": 5,
"id": "d275df48",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dtype('bool')"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# A vector of booleans\n",
"a_boolean_vector = np.array([True, False, True])\n",
"a_boolean_vector\n",
"\n",
"a_boolean_vector.dtype"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "69b43582",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['Lassie', '盼盼', 'Hachi', 'Flipper', '🐄'], dtype='