# Doing Math With Vectors

If vectors were just for storing data, they wouldn’t be super useful. Thankfully, they can do much more than that!

## Math with Scalars

One of the great things about vectors is that we can use them to do math very concisely. For example, if you do a mathematical operation between a vector and either a scalar or another vector of length one, numpy will repeat the operation for each entry in the longer vector (a behavior called “broadcasting”). This is a very valuable trick: since we often use vectors to store a collection of measurements of the same phenomenon (e.g., the salaries of employees, the dollar value of sales, temperature measurements), we often want to apply the same mathematical function to every entry in a vector.

import numpy as np

# Suppose we are working with data on car sales,
# and we have the value of all the cars we sold last year

sales = np.array([34_255, 27_222, 42_250, 12_000])
sales

array([34255, 27222, 42250, 12000])

# Now suppose that for every car we sell,
# we have to pay a sales tax of 10%.
# How could we calculate the after-tax revenue
# from each of the sales?

# Simple!
after_tax = sales * 0.90

# And suppose we also had to pay a
# flat fee of 500 dollars to process each
# sale. Now what would our final revenue be?

final = after_tax - 500
final

array([30329.5, 23999.8, 37525. , 10300. ])
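
As a small sketch of the broadcasting behavior described above, the snippet below checks that a scalar and a vector of length one give the same result, and then combines two vectors of the same length entry by entry. The `processing_fees` vector is a made-up illustration, not data from this reading.

```python
import numpy as np

sales = np.array([34_255, 27_222, 42_250, 12_000])

# A scalar and a length-one vector are broadcast the same way
with_scalar = sales * 0.90
with_length_one = sales * np.array([0.90])
same_result = np.array_equal(with_scalar, with_length_one)
print(same_result)

# Two vectors of the same length are combined entry by entry.
# processing_fees is a hypothetical vector invented for this example.
processing_fees = np.array([500, 450, 600, 300])
net = sales - processing_fees
print(net)
```

Here the first entry of `net` is the first sale minus the first fee, the second entry is the second sale minus the second fee, and so on.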


### Other Shapes

We’ve now seen how we can do math with numpy vectors and scalars, numpy vectors and other vectors of length one, and numpy vectors and other vectors of the same length. But if you try an operation with two vectors of different lengths, and neither has length one, you get an error that will feel a little cryptic for the moment, but which we’ll dive into in detail soon:

vect1 = np.array([1, 2, 3])
vect2 = np.array([1, 2, 3, 4, 5, 6])
vect1 + vect2

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/tj/s8f2_ks15h315z5thvtnhz8r0000gp/T/ipykernel_30350/1706447136.py in <module>
1 vect1 = np.array([1, 2, 3])
2 vect2 = np.array([1, 2, 3, 4, 5, 6])
----> 3 vect1 + vect2

ValueError: operands could not be broadcast together with shapes (3,) (6,)
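
One way to make this error less cryptic is to look at each vector’s `.shape` before combining them. The sketch below is only an illustration of that idea: it prints both shapes and then catches the `ValueError` so the rest of a script could keep running.

```python
import numpy as np

vect1 = np.array([1, 2, 3])
vect2 = np.array([1, 2, 3, 4, 5, 6])

# The shapes in the error message are just the lengths of the two vectors
print(vect1.shape, vect2.shape)

try:
    vect1 + vect2
    broadcast_failed = False
except ValueError as err:
    # Lengths 3 and 6 differ and neither is 1, so numpy refuses to guess
    broadcast_failed = True
    print(err)
```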


## Vectorized Code / Vectorization

It’s worth noting that the way numpy broadcasts mathematical operations across vectors leads to a style of programming that is especially characteristic of data science: vectorization, or writing vectorized code.

If you ask the average programmer to take two vectors, [1, 2, 3, 4, 5] and [6, 7, 8, 9, 10], and add the first number of the first vector to the first number of the second vector, then add the second number of the first vector to the second number of the second vector, etc., you would probably get a loop that looks something like this:

# An average programmer might also use plain lists here instead of arrays
vector1 = np.array([1, 2, 3, 4, 5])
vector2 = np.array([6, 7, 8, 9, 10])

results = []
for i in range(len(vector1)):
    # Note you can pull items from a vector like
    # items from a list with []. We'll talk more
    # about this later.
    summation = vector1[i] + vector2[i]
    results.append(summation)


And this code isn’t wrong. But hopefully, it’s easy to see how much more verbose this is than vector1 + vector2. And because we do this type of operation all the time in data science, this ability to avoid writing explicit loops over the entries in a vector is a really important feature of numpy. It makes code much, much easier to read and understand. In fact, as we’ll discuss at length in a later reading, it is also a style of programming that allows numpy to run much more quickly than it would if we wrote for loops all the time.
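
To make the comparison concrete, here is a minimal sketch that runs both versions side by side and checks that they agree. It only illustrates the difference in style; it does not measure speed.

```python
import numpy as np

vector1 = np.array([1, 2, 3, 4, 5])
vector2 = np.array([6, 7, 8, 9, 10])

# Loop version: add the entries one pair at a time
looped = []
for i in range(len(vector1)):
    looped.append(int(vector1[i] + vector2[i]))

# Vectorized version: one expression, no explicit loop
vectorized = vector1 + vector2

print(looped)
print(vectorized)
```

Both approaches produce the same five sums, but the vectorized line states the intent directly.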

So if you’re reading all this and saying “But I just want to write a loop!”, please bear with us and try to embrace this style of programming when working with numpy arrays – we promise, there’s a reason it’s how nearly all data science code is written!

## Summarizing Vectors

In the previous examples, we did mathematical operations that acted on the individual entries in a vector, but another type of mathematical operation we sometimes do with vectors is to calculate a property of the entries as a group. For example, if we had a vector where each element was a person’s height, we might want to know the height of the tallest person in the vector, the height of the shortest person in the vector, the median or mean height of the group, or even the standard deviation of heights.

For that, numpy provides a huge range of numeric functions:

# Toy height vector
# (Obviously you could easily find the tallest, shortest, etc.
# in this data set without numpy -- this is just an example!)
heights_in_cm = np.array([155, 171, 162, 170, 171])

# Tallest
np.max(heights_in_cm)

171

# Shortest
np.min(heights_in_cm)

155

# Median
np.median(heights_in_cm)

170.0

# Standard deviation
np.std(heights_in_cm)

6.368673331236263


Here’s a short (very incomplete!) list of these kinds of functions:

len(numbers) # number of elements
np.max(numbers) # maximum value
np.min(numbers) # minimum value
np.sum(numbers) # sum of all entries
np.mean(numbers) # mean
np.median(numbers) # median
np.var(numbers) # variance
np.std(numbers) # standard deviation
np.quantile(numbers, q) # the qth quantile of numbers (q between 0 and 1)
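
As a quick sketch, here are a few more of these functions applied to the toy height vector from above. One detail worth knowing: by default `np.var()` and `np.std()` divide by the number of entries (the population formulas); passing `ddof=1` gives the sample versions instead.

```python
import numpy as np

heights_in_cm = np.array([155, 171, 162, 170, 171])

total = np.sum(heights_in_cm)            # sum of all entries
mean = np.mean(heights_in_cm)            # arithmetic mean
variance = np.var(heights_in_cm)         # population variance (ddof=0)
q25 = np.quantile(heights_in_cm, 0.25)   # 25th percentile

# Sample standard deviation divides by n - 1 instead of n
sample_std = np.std(heights_in_cm, ddof=1)

print(total, mean, variance, q25, sample_std)
```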


Don’t worry about memorizing these. You just need a sense of the kinds of things these functions can do; if you ever need one but can’t remember its name, you can google “numpy [what you want to do]” to get the specific function name, or check out the numpy documentation.

And of course, these different types of manipulation can also be combined! For example, suppose we wanted to know the number of sales that generated more than \$30,000 in revenue. First, we can compare each entry of final to a scalar, which broadcasts across the vector just like the arithmetic up top:

large_sales = final > 30_000
large_sales

array([ True, False,  True, False])


Then we can sum up that vector (remember: True in Python is treated like 1 and False is treated like 0 when passed to functions like np.sum() and np.mean()):

np.sum(large_sales)

2
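
The same trick works with `np.mean()`: because each True counts as 1 and each False as 0, the mean of a True/False vector is the share of entries that are True. A short sketch, re-creating `final` from the values shown earlier:

```python
import numpy as np

# Final revenue per sale, copied from the output earlier in this reading
final = np.array([30329.5, 23999.8, 37525.0, 10300.0])

large_sales = final > 30_000

count_large = np.sum(large_sales)    # how many sales exceeded 30,000
share_large = np.mean(large_sales)   # what fraction of sales exceeded 30,000

print(count_large, share_large)
```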


## Exercises

1. Suppose the following were the heart rates reported by your Fitbit over the day: 68, 65, 77, 110, 160, 161, 162, 161, 160, 161, 162, 163, 164, 163, 162, 100, 90, 97, 72, 60, 70. Put these numbers into a numpy array.

2. A commonly used measure of health is a person’s resting heart rate (basically, how low your heart rate goes when you aren’t doing anything). Find the minimum heart rate you experienced over the day.

3. One measure of exercise intensity is your maximum heart rate. Suppose that on the day these data were collected, you were deliberately exercising. Find your maximum heart rate.

4. Let’s try to calculate the share of readings that were taken when you were exercising. First, create a new vector that takes on the value of True when your heart rate is above 120, and False otherwise.

5. Now use a summarizing function to calculate the share of observations for which your heart rate was above 120!