{
"cells": [
{
"cell_type": "markdown",
"id": "30f88b4b",
"metadata": {},
"source": [
"# Doing Math With Vectors\n",
"\n",
"If vectors were just for storing data, they wouldn't be super useful. Thankfully, they're not! \n",
"\n",
"## Math with Scalars\n",
"\n",
"One of the great things about vectors is that we can use them to do math in a very concise manner. For example, if you try and do a math mathematical operation between a vector and either a scalar number or another vector of length one, numpy will just repeat the mathematical operation with each entry in the longer vector (a behavior called \"broadcasting\"). This is a very valuable trick -- since we often use vectors to store a collection of measurements of the same phenomenon (e.g. the salaries of employees, the dollar value of sales, temperature measurements, etc.), it is also the case that we often want to apply the same mathematical function to all of the entries in a vector. \n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "c4f22e47",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([34255, 27222, 42250, 12000])"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"\n",
"# Suppose we are working with data on car sales,\n",
"# and we have the value of all the cars we sold last year\n",
"\n",
"sales = np.array([34_255, 27_222, 42_250, 12_000])\n",
"sales\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "6f244ccc",
"metadata": {},
"outputs": [],
"source": [
"# Now suppose that for every car we sell,\n",
"# we have to pay a sales tax of 10%.\n",
"# How could we calculate the after-tax revenue\n",
"# from each of the sales?\n",
"\n",
"# Simple!\n",
"after_tax = sales * 0.90"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "a02f568a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([30329.5, 23999.8, 37525. , 10300. ])"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# And suppose we also had to pay a\n",
"# flat fee of 500 dollars to process each\n",
"# sale. Now what would our final revenue be?\n",
"\n",
"final = after_tax - 500\n",
"final"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In addition to working with obvious math functions (e.g. `+`, `-`, `*`), this logic also applies to logical comparisons like `>`, `<`, `==`, etc. For example, suppose we wanted to identify sales for more than $30,000. We could do:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ True, False, True, False])"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"final > 30_000"
]
},
{
"cell_type": "markdown",
"id": "ca9adf16",
"metadata": {},
"source": [
"### Functions\n",
"\n",
"The same thing happens with most functions -- the function gets applied to each entry. Suppose we wanted to round off all of these numbers to the nearest dollar:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "cc8fc949",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([30330., 24000., 37525., 10300.])"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Round to the nearest dollar\n",
"np.round(final)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Math with Equal-Length Vectors"
]
},
{
"cell_type": "markdown",
"id": "7b983dca",
"metadata": {},
"source": [
"But we can do more than just repeat the same operation for each entry. If you have two vectors of the same length, mathematical operations will occur \"element-wise\", meaning the mathematical operation will be applied to the two 1st entries, then the two 2nd entries, then the two 3rd entries, etc. For example, if we were to add our vector of the values 0 through 4 to a vector with two 0s, then two 1s, then a 0 numpy would do the following:\n",
"\n",
"```\n",
"0 + 0 = 0 + 0 = 0 \n",
"1 + 0 = 1 + 0 = 1 \n",
"2 + 1 = 2 + 1 = 3 \n",
"3 + 1 = 3 + 1 = 4 \n",
"4 + 0 = 4 + 0 = 4 \n",
"```\n",
"\n",
"(Obviously, numpy likes to print out vectors sideways, but personally, we think of them as column vectors, so have written them out like that here).\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "1fd8c68a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0, 1, 2, 3, 4])"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Two vectors with the same number of elements \n",
"numbers = np.arange(5)\n",
"numbers\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0, 0, 1, 1, 0])"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"numbers2 = np.array([0, 0, 1, 1, 0])\n",
"numbers2\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0, 1, 3, 4, 4])"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"numbers3 = numbers2 + numbers\n",
"numbers3\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How might this be helpful? Suppose that in addition to information about the sale price of all of the cars we sold last year, we also had data on what those cars cost us (the dealership):"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([27750, 23500, 39200, 6700])"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"prices = np.array([27_750, 23_500, 39_200, 6_700])\n",
"prices"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can combine the after-tax revenue we made from each sale with what the cars we sold cost the dealership to estimate the net profit from each sale:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 2579.5, 499.8, -1675. , 3600. ])"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"final - prices"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Cool! As we can see, we made substantially more on some of those sales than others. In fact, from this, we can see that we need to have a discussion with whoever negotiated the third sale since we ended up *losing* $1,675 on that sale!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Other Shapes"
]
},
{
"cell_type": "markdown",
"id": "da70de46",
"metadata": {},
"source": [
"We've now seen how we can do math with numpy vectors + scalars, numpy vectors + other vectors of length 1, and numpy vectors + other vectors of the same length. But if you try an operation with two vectors of different lengths, and one *isn't* of size one, you get an error that, for the moment, will feel a little cryptic but which we'll dive into in detail soon:\n",
"\n",
"```python\n",
"vect1 = np.array([1, 2, 3])\n",
"vect2 = np.array([1, 2, 3, 4, 5, 6])\n",
"vect1 + vect2\n",
"\n",
"---------------------------------------------------------------------------\n",
"ValueError Traceback (most recent call last)\n",
"/var/folders/tj/s8f2_ks15h315z5thvtnhz8r0000gp/T/ipykernel_30350/1706447136.py in \n",
" 1 vect1 = np.array([1, 2, 3])\n",
" 2 vect2 = np.array([1, 2, 3, 4, 5, 6])\n",
"----> 3 vect1 + vect2\n",
"\n",
"ValueError: operands could not be broadcast together with shapes (3,) (6,) \n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Vectorized Code / Vectorization\n",
"\n",
"It's worth quickly noting that the way that numpy broadcasts mathematical operations across vectors results in a style of programming that is relatively unique to data science: vectorization/writing vectorized code.\n",
"\n",
"If you ask the average programmer to take two vectors `1, 2, 3, 4, 5` and `6, 7, 8, 9, 10` and add the first number of the first vector to the first number of the second vector, then add the second number of the first vector to the second number of the second factor, etc., you would probably get a loop that looks something like this:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# Either this or you'd get lists\n",
"vector1 = np.array([1, 2, 3, 4, 5])\n",
"vector2 = np.array([6, 7, 8, 9, 10])\n",
"\n",
"results = list()\n",
"for i in range(len(vector1)):\n",
" # Note you can pull items from a vector like \n",
" # items from a list with `[]`. We'll talk more\n",
" # about that in an upcoming reading.\n",
" summation = vector1[i] + vector2[i]\n",
" results.append(summation)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And this code isn't *wrong*. But hopefully, it's easy to see how much more verbose this is than `vector1 + vector2`. And because we do this type of operation *all the time* in data science, this ability to avoid writing explicit loops over the entries in a vector is a *really* important feature of numpy. It makes code much, much easier to read and understand. In fact, as we'll discuss at length in a later reading, it is also a style of programming that allows numpy to run much more quickly than it would if we wrote for loops all the time.\n",
"\n",
"So if you're reading all this and saying \"But I just want to write a loop!\", please bear with us and try to embrace this style of programming when working with numpy arrays -- we promise, there's a reason it's how nearly all data science code is written!"
]
},
{
"cell_type": "markdown",
"id": "ba046896",
"metadata": {},
"source": [
"## Summarizing Vectors \n",
"\n",
"In the previous examples, we did mathematical operations that acted on the individual entries in a vector, but another type of mathematical operation we sometimes do with vectors is to calculate a property of the entries *as a group*. For example, if we had a vector where each element was a person's height, we might want to know the height of the tallest person in the vector, the height of the shortest person in the vector, the median or mean height of the group, or even the standard deviation of heights. \n",
"\n",
"For that, numpy provides a *huge* range of numeric functions:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"171"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Toy height vector\n",
"# (Obviously you could easily find the tallest, shortest, etc.\n",
"# in this data set without numpy -- this is just an example!)\n",
"heights_in_cm = np.array([155, 171, 162, 170, 171])\n",
"\n",
"# Tallest\n",
"np.max(heights_in_cm)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"155"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Shortest\n",
"np.min(heights_in_cm)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "e56b858a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"170.0"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Median\n",
"np.median(heights_in_cm)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "7da84ce3",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"6.368673331236263"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Standard deviation\n",
"np.std(heights_in_cm)"
]
},
{
"cell_type": "markdown",
"id": "b1bc2081",
"metadata": {},
"source": [
"Here's a short (very incomplete!) list of these kinds of functions:"
]
},
{
"cell_type": "markdown",
"id": "be1a8ee8",
"metadata": {},
"source": [
"```python\n",
"len(numbers) # number of elements \n",
"np.max(numbers) # maximum value\n",
"np.min(numbers) # minimum value\n",
"np.sum(numbers) # sum of all entries\n",
"np.mean(numbers) # mean\n",
"np.median(numbers) # median\n",
"np.var(numbers) # variance\n",
"np.std(numbers) # standard deviation\n",
"np.quantile(numbers, q) # the qth quintile of numbers\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "754fcb29",
"metadata": {},
"source": [
"**Don't** worry about memorizing these or anythingâ€”basically, you just need to have a sense of the kinds of things you can do with functions, and if you ever need one but can't remember the name of the function, you can google \"numpy [what you want to do]\" to get the specific function name, or check out the [numpy documentation](https://numpy.org/doc/stable/reference/routines.math.html)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And of course, these different types of manipulation can also be combined! For example, suppose we wanted to know the number of sales that generated more than $30,000 in revenue. First, we could do the same manipulation we did up top:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ True, False, True, False])"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"large_sales = final > 30_000\n",
"large_sales"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then we can sum up that vector (remember: `True` in Python is treated like `1` and `False` is treated like `0` when passed to functions like `np.sum()` and `np.mean()`):"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.sum(large_sales)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercises\n",
"\n",
"1. Suppose the following were the heart rates reported by your Fitbit over the day: `68, 65, 77, 110, 160, 161, 162, 161, 160, 161, 162, 163, 164, 163, 162, 100, 90, 97, 72, 60, 70`. Put these numbers into a numpy array.\n",
"2. A commonly used measure of health is a person's *resting heart rate* (basically, how low your heart rate goes when you aren't doing anything). Find the minimum heart rate you experienced over the day.\n",
"3. One measure of exercise intensity is your maximum heart rateâ€”suppose that during the day these data were collected, you are deliberately exercising. Find your maximum heart rate.\n",
"4. Let's try to calculate the share of readings that were taken when you were exercising. First, create a new vector that takes on the value of `True` when your heart rate is above 120, and `False` when your heart rate is below 120.\n",
"5. Now use a summarizing function to calculate the share of observations for which your heart rate was above 120!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.10.6 ('base')",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "718fed28bf9f8c7851519acf2fb923cd655120b36de3b67253eeb0428bd33d2d"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}