Modifying Subsets of Vectors¶
The subsetting logic from the previous reading isn’t just for extracting subsets of vectors to analyze – it’s also useful for modifying vectors. The idea is that instead of keeping elements that meet a logical condition or occur at a specific index, we can change them!
For example, let’s consider the vector with the salaries of everyone in my company. Suppose we wanted to give a raise to one of our workers, the person earning $80,000, bringing them up to $90,000. How would we make that change without re-creating the full vector?
The answer is that we can fix it using indexing, or a logical statement.
[1]:
import numpy as np
# Create a vector with salaries of employees
salaries = np.array([105_000, 50_000, 55_000, 80_000])
salaries
[1]:
array([105000, 50000, 55000, 80000])
[2]:
salaries[salaries == 80_000] = 90_000 # using a logical statement
salaries
[2]:
array([105000, 50000, 55000, 90000])
[3]:
salaries[3] = 90_000 # using indexing
salaries
[3]:
array([105000, 50000, 55000, 90000])
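The two approaches give the same result here only because exactly one person earns $80,000. As a minimal sketch, assuming a hypothetical variation of the salary vector in which two people earn that amount, the logical statement changes every match while indexing changes a single position:

```python
import numpy as np

# Hypothetical variation (not the vector above): two employees earn 80,000
salaries = np.array([105_000, 80_000, 55_000, 80_000])

# The logical test produces a boolean mask with one entry per element
mask = salaries == 80_000  # [False, True, False, True]

# Assigning through the mask updates every matching element
by_mask = salaries.copy()
by_mask[mask] = 90_000  # [105000, 90000, 55000, 90000]

# Assigning through an index updates only that one position
by_index = salaries.copy()
by_index[3] = 90_000  # [105000, 80000, 55000, 90000]

print(by_mask)
print(by_index)
```

Which one you want depends on whether you mean "this particular person" or "everyone who meets this condition."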
Note that we can also make modifications to subsets of a vector by using subsets on BOTH sides of the assignment operator.
For example, suppose we wanted to give a raise to everyone in the company who made less than $75,000.
If we were giving a raise to everyone, we could just do:
salaries = salaries + 10_000
But then we’d be giving a raise to some of the folks who make more than 75,000 dollars. So we have to (a) pull out the salaries that are less than 75,000 dollars, (b) increment them up by 10,000 dollars, and (c) re-insert them, replacing the old salaries:
[4]:
salaries = np.array([105_000, 50_000, 55_000, 80_000])
# Get lower salaries
lower_salaries = salaries[salaries < 75_000]
lower_salaries
[4]:
array([50000, 55000])
[5]:
# Increase them all by ten thousand
new_salaries = lower_salaries + 10_000
new_salaries
[5]:
array([60000, 65000])
[6]:
# Re-insert
salaries[salaries < 75_000] = new_salaries
salaries
[6]:
array([105000, 60000, 65000, 80000])
Note that this last operation worked because the vector on the left side of the assignment operator had a length of two, and the new vector on the right-hand side was also of length two, so numpy could match the entries being subset on the left to entries on the right one-to-one.
But while we can do this in all these separate steps, we can also collapse them into a single line:
[7]:
# Re-create the original salary vector
salaries = np.array([105_000, 50_000, 55_000, 80_000])
[8]:
salaries[salaries < 75_000] = salaries[salaries < 75_000] + 10_000
salaries
[8]:
array([105000, 60000, 65000, 80000])
Again, note this only worked because we were careful to ensure that the vector on the right of the assignment operator “fit” into the space being subset on the left! This is a trick we use a lot in data science, so make sure you’re comfortable with it before proceeding.
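To make the idea of "fitting" concrete, here is a small sketch of three cases: a single value (which numpy broadcasts to every selected element), the in-place shorthand `+=`, and a right-hand side of the wrong length. The variable names `flat`, `raised`, and `mismatch_error` are just for illustration.

```python
import numpy as np

salaries = np.array([105_000, 50_000, 55_000, 80_000])
mask = salaries < 75_000  # two entries are True

# A single value is broadcast to every selected element
flat = salaries.copy()
flat[mask] = 60_000  # [105000, 60000, 60000, 80000]

# In-place addition is shorthand for the collapsed form above
raised = salaries.copy()
raised[mask] += 10_000  # [105000, 60000, 65000, 80000]

# A right-hand side of the wrong length does not fit and raises an error
try:
    salaries[mask] = np.array([1, 2, 3])  # three values for two slots
    mismatch_error = None
except ValueError as err:
    mismatch_error = err

print(flat)
print(raised)
print(type(mismatch_error).__name__)
```

So the right-hand side needs to be either a single value or a vector with exactly as many entries as are being selected on the left.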
Modifying Vectors and Data Types¶
You may not have noticed, but up till now we’ve only been doing “like-for-like” substitutions. For example, when we changed an entry in salaries, we were always replacing one int with another.
This is important, because as we discussed in our last reading, vectors are homogeneously typed, meaning that unlike lists, you can’t put different types of data in an array.
Now when we’re creating a vector, numpy will use type promotion to pick a type that accommodates everything you’re putting into an array. For example, if I pass both bools and integers to np.array(), it will just type promote everything to be integers:
[9]:
np.array([True, False, 7])
[9]:
array([1, 0, 7])
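If you want to check which type numpy settled on, you can inspect the `.dtype` attribute of the result. The sketch below also uses `np.result_type`, which reports the promoted type for a set of inputs without building an array. The exact integer width numpy picks by default can vary by platform, so the comments avoid promising one.

```python
import numpy as np

mixed_bool_int = np.array([True, False, 7])
mixed_int_float = np.array([1, 2, 3.5])

# bools and ints promote to an integer type (width depends on platform)
print(mixed_bool_int.dtype)

# ints and floats promote to a floating point type
print(mixed_int_float.dtype)

# Ask numpy what bool and a 64-bit integer would promote to
promoted = np.result_type(np.bool_, np.int64)
print(promoted)
```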
But once a vector has been created, numpy stops being so considerate: if you try and cram data of a different type into a vector of a given type, it will try to coerce the data into the established type of the array.
For example, if we try and cram 7 into an array that’s already of type bool, numpy will coerce 7 into type bool (i.e. run bool(7)), which will turn 7 into True even though this causes information to be lost:
[10]:
bool_vector = np.array([True, False])
bool_vector
[10]:
array([ True, False])
[11]:
bool_vector[1] = 7
bool_vector
[11]:
array([ True, True])
Similarly, if you try and put a floating point number into an integer vector, that float will be type coerced into an integer, which is accomplished by just truncating any information after the decimal:
[12]:
int_vector = np.array([1, 2, 3])
int_vector
[12]:
array([1, 2, 3])
[13]:
int_vector[0] = 42.989723798729874
int_vector
[13]:
array([42, 2, 3])
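Note that truncating is not the same as rounding: the fractional part is simply dropped, so values move toward zero. A short sketch, using a throwaway vector named `demo` (a name chosen here only for illustration):

```python
import numpy as np

demo = np.array([0, 0, 0])

demo[0] = 2.7          # stored as 2, not 3
demo[1] = -2.7         # stored as -2 (toward zero, not down to -3)
demo[2] = round(2.7)   # rounding yourself first stores 3

print(demo)
```

If you want rounded values in an integer vector, you have to round before assigning.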
This is why, as we mentioned in the last reading, you might not always want to let numpy pick your datatypes for you. Suppose, in the example above, you know you might later need to put a floating point number into int_vector. You could instead tell numpy to make it a floating point vector at creation:
[14]:
no_longer_an_int_vector = np.array([1, 2, 3], dtype="float")
no_longer_an_int_vector[0] = 42.989723798729874
no_longer_an_int_vector
[14]:
array([42.9897238, 2. , 3. ])
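If the vector already exists as integers, one option (an alternative to setting `dtype` at creation, not something covered above) is the `.astype()` method, which returns a new array of the requested type and leaves the original untouched. A minimal sketch:

```python
import numpy as np

int_vector = np.array([1, 2, 3])

# astype returns a new float array; int_vector itself is unchanged
float_vector = int_vector.astype(float)
float_vector[0] = 42.989723798729874

print(int_vector)
print(float_vector)
```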
I know this can be a little confusing, so here’s a recap:
When creating a vector, numpy will do everything it can to ensure that you don’t lose any information by type promoting your data to the lowest type that preserves all the information in your data.
Once a vector has been created, numpy’s hands are tied, so it will use type coercion to force the data you’re trying to put into your existing vector into the established type, even if that causes information loss.
Exercises¶
Create the following vector of salaries: 50_000, 105_250, 55_000, 89_000. What is the total payroll (sum of all salaries for the company)?
Now suppose our evil CEO has decided to give herself a raise. Take your salary vector and modify it so that the CEO (the person making 105,250 dollars) gets a raise of 15%.
115% of 105,250 dollars is 121,037.50 dollars. Is that the value in your array? If not, can you tell why not?
Recreate your vector, and do something with the dtype argument so that when you give the CEO a raise of 15%, she ends up with a salary of 121,037.50 dollars.
Now suppose this has so annoyed the lowest paid employee (the woman earning 50,000 dollars) that she demands a raise as well. Increase her salary by 20%.
This has so irritated the other two employees you must now give them 10% raises. Increase their salaries by 10%.
Now calculate the total payroll for the company. In the end, what did the CEO’s ~16,000 raise end up costing the company?