Numpy Exercises

Note: Most students taking this class are Duke MIDS students who have worked with numpy previously. As a result, these exercises are very light on basic numpy array manipulations. If you are new to numpy, I would advise looking into some additional practice opportunities with numpy, as discussed in the Advice for Non-MIDS Students page.

Exercise 1

First, let’s make a common array to work with.

[1]:
import numpy as np
np.random.seed(21)  # This guarantees the code will generate the same set of random numbers whenever executed
random_integers = np.random.randint(1, high=500000, size=(20, 5))
random_integers
[1]:
array([[ 80842, 333008, 202553, 140037,  81969],
       [ 63857,  42105, 261540, 481981, 176739],
       [489984, 326386, 110795, 394863,  25024],
       [ 38317,  49982, 408830, 485118,  16119],
       [407675, 231729, 265455, 109413, 103399],
       [174677, 343356, 301717, 224120, 401101],
       [140473, 254634, 112262,  25063, 108262],
       [375059, 406983, 208947, 115641, 296685],
       [444899, 129585, 171318, 313094, 425041],
       [188411, 335140, 141681,  59641, 211420],
       [287650,   8973, 477425, 382803, 465168],
       [  3975,  32213, 160603, 275485, 388234],
       [246225,  56174, 244097,   9350, 496966],
       [225516, 273338,  73335, 283013, 212813],
       [ 38175, 282399, 318413, 337639, 379802],
       [198049, 101115, 419547, 260219, 325793],
       [148593, 425024, 348570, 117968, 107007],
       [ 52547, 180346, 178760, 305186, 262153],
       [ 11835, 449971, 494184, 472031, 353049],
       [476442,  35455, 191553, 384154,  29917]])

Exercise 2

What is the average value of the second column (to two decimal places)?

Exercise 3

What is the average value of the first 5 rows of the third and fourth columns?

Exercise 4

Close Python. On a piece of paper, write down the final result of the following code:

Exercise 5

Keep Python Closed! Write down the final result of the following code:

Exercise 6

Now open Python and check your answers to Exercises 4 and 5.

Working with Views

One of the nuances of numpy that can easily lead to problems is that when you take a slice of an array, you do not actually get a new array; rather, you are given a “view” on the original array, meaning the slice and the original share the same underlying data.

This is similar to the idea that variables are just pointers, and that different variables may point to the same object (discussed in the Python v. R / Variables as Pointers tutorial). But it is slightly different: if two variables both point to the same object, the two variables behave the same way. If instead one variable points to an array and a second variable is a slice of that array, they both access the same data in the same array, but they present it differently. For example:

[10]:
import numpy as np
my_array = np.array([1, 2, 3, 4])
my_array
[10]:
array([1, 2, 3, 4])
[11]:
my_slice = my_array[1:3]
my_slice
[11]:
array([2, 3])

Since my_array and my_slice are both pointing to the same data, changes to one will propagate to the other. For example, if I modify the entry holding 2 in my_slice, the change will appear in my_array:

[12]:
my_slice[0] = -1
my_slice
[12]:
array([-1,  3])
[13]:
my_array
[13]:
array([ 1, -1,  3,  4])

But while my_array and my_slice are accessing the same underlying data, they are indexed differently. We changed the first item (index 0) in my_slice, but that change impacted the entry in the second position of my_array (index 1):

[14]:
my_array[1]
[14]:
-1
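
To make the contrast with plain aliasing concrete, here is a minimal sketch (the names alias and view are illustrative, not taken from the examples above). An alias is the very same object, so it is indexed identically; a slice is a different object that shares the data but has its own indexing.

```python
import numpy as np

my_array = np.array([1, 2, 3, 4])

alias = my_array      # a second name for the same array object
view = my_array[1:3]  # a new array object that shares my_array's data

print(alias is my_array)  # True: same object
print(view is my_array)   # False: different object, same data

alias[0] = 100  # index 0 of alias is index 0 of my_array
view[0] = 200   # index 0 of view is index 1 of my_array
print(my_array.tolist())  # [100, 200, 3, 4]
```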

It is also worth emphasizing at this point that while slicing a numpy array gets you a view, slicing a Python list gets you a new object. The view behavior described here applies to numpy arrays, not to built-in lists.

[15]:
x = [1, 2, 3]
y = x[0:2]
y[0] = "a change"
y
[15]:
['a change', 2]
[16]:
x
[16]:
[1, 2, 3]

When do you get a view, and when do you get a copy?

OK, now the really annoying thing: when do I get a view, and when do I get a copy?

Generally speaking:

  • you get a view if you do a plain, basic slice of an array, and
  • the view remains a view if you modify it using basic indexing (i.e. you use [] on the left side of the assignment operator).

Outside of those two behaviors, you will usually get a copy.

So, for example, this slice will get you a view:

[17]:
my_array = np.array([1, 2, 3])
my_slice = my_array[1:3]
my_slice[0] = -1
my_array
[17]:
array([ 1, -1,  3])

But if you use “fancy indexing” (where you pass a list when making your slice), you will NOT get a view:

[18]:
my_array = np.array([1, 2, 3])
my_slice = my_array[[1,2]]
my_slice[0] = -1
my_array
[18]:
array([1, 2, 3])
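
Indexing with a boolean mask behaves like fancy indexing in this respect: the result is a copy. A small sketch, where mask and selected are illustrative names:

```python
import numpy as np

my_array = np.array([1, 2, 3])
mask = my_array > 1        # boolean array: [False, True, True]

selected = my_array[mask]  # boolean indexing returns a copy
selected[0] = -1
print(my_array.tolist())   # [1, 2, 3], unchanged

# Assigning *through* the mask on the left side of = does modify my_array.
my_array[mask] = 0
print(my_array.tolist())   # [1, 0, 0]
```

The difference is which side of the assignment the indexing sits on: on the right it builds a new array, on the left it writes into the existing one.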

Similarly, if you edit using basic indexing (like we did above), those edits will propagate from the slice back to the original array (or the other way around). But if you modify a slice without using basic indexing, you get a copy, so changes won’t propagate:

[19]:
my_array = np.array([1, 2, 3])
my_slice = my_array[1:3]
my_slice = my_slice * 2
my_slice
[19]:
array([4, 6])
[20]:
my_array
[20]:
array([1, 2, 3])

If you want to do a full-array manipulation and preserve your view, always use square brackets on the left side of the assignment operator (=):

[21]:
my_array = np.array([1, 2, 3])
my_slice = my_array[1:3]
my_slice[:] = my_slice * 2
my_slice
[21]:
array([4, 6])
[22]:
my_array
[22]:
array([1, 4, 6])
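
In-place operators such as *= are another way to keep the view, since they write into the existing data rather than binding the name to a new array. A minimal sketch:

```python
import numpy as np

my_array = np.array([1, 2, 3])
my_slice = my_array[1:3]

my_slice *= 2             # in-place: writes into the shared data
print(my_array.tolist())  # [1, 4, 6]

my_slice = my_slice * 2   # rebinds my_slice to a brand-new array
print(my_slice.tolist())  # [8, 12]
print(my_array.tolist())  # [1, 4, 6], unchanged by the second step
```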

How to Manage Views In Your Work

Views exist because they are more memory efficient (a view doesn’t require making a new copy of data) and faster (again, no copying required). And if you’re doing super-computer simulations where every millisecond counts, or working with truly huge datasets, this is important. But for most data scientists, I tend to see it as a trap waiting to get you in trouble.

This is especially true since there’s no reliable way to check if two arrays are views of one another except by modifying one and seeing if the other changes. (You may find people saying otherwise; don’t trust them!) The way we use `is` in regular Python to see if two variables point at the same object doesn’t work for numpy arrays. Thus it’s on you to remember the rules.
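
As a quick illustration of why the Python identity check does not help here (a sketch; my_view and my_copy are illustrative names): a slice is always a different Python object from the array it came from, so the check reports False whether or not the data is shared.

```python
import numpy as np

my_array = np.array([1, 2, 3])
my_view = my_array[1:3]     # basic slice: shares data with my_array
my_copy = my_array[[1, 2]]  # fancy indexing: independent data

# `is` compares object identity, so it cannot tell these two cases apart.
print(my_view is my_array)  # False
print(my_copy is my_array)  # False

# Modifying each one and then checking my_array does tell them apart.
my_copy[0] = -1
print(my_array.tolist())    # [1, 2, 3]
my_view[0] = -1
print(my_array.tolist())    # [1, -1, 3]
```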

My advice on copies: UNLESS YOU REALLY NEED A VIEW AND ARE BEING SUPER CAREFUL: don’t use views for anything but looking at data. If you ever want to modify or work with a sub-array, just make a copy to be safe. Computers are fast enough and RAM is plentiful enough that, for most applications, the extra copy is almost never a problem.
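
In code, this advice amounts to calling .copy() on the slice before working with it. A minimal sketch, with sub_array as an illustrative name:

```python
import numpy as np

my_array = np.array([10, 20, 30, 40])

sub_array = my_array[1:3].copy()  # explicit copy: owns its own data
sub_array[0] = -1

print(sub_array.tolist())  # [-1, 30]
print(my_array.tolist())   # [10, 20, 30, 40], untouched
```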

Exercise 7

Close your computer / laptop. Let’s try and work out a few problems in our heads to test our understanding of numpy views. Let’s start with the following array:

Now, on a piece of paper write down the value of my_slice = my_array[:, 1:3].

Exercise 8

Now suppose we run the code my_array[:, :] = my_array * 2. Now what does my_slice look like?

Exercise 9

Now suppose we run my_array = my_array * 2. What does my_slice look like?

Exercise 10

Stop, open Python, and try running these examples. Were your predictions correct? If not, why not?

Exercise 11

OK, let’s close Python again and go back to pen and paper. Let’s also reset my_array and start over with the following code:

[27]:
my_array = np.array([[1, 2, 3], [4, 5, 6]])
print(my_array)
[[1 2 3]
 [4 5 6]]
[28]:
my_slice = my_array[:, 1:3].copy()
print(my_slice)
[[2 3]
 [5 6]]

Now suppose we run the following code: my_array[:, :] = my_array * 2. What does my_slice look like?

Note: Don’t trust my_array.base

You will find some tutorials online that suggest you can test if one array is a view of another with the code my_slice.base is my_array. The problem is… this doesn’t always work. It does sometimes:

[29]:
my_array = np.array([1, 2, 3])
my_slice = my_array[1:3]
my_slice.base is my_array
[29]:
True

But not always. Here’s an example where my_array and my_slice point to the same data, but my_slice.base is my_array returns False.

[30]:
my_array = np.array([1, 2, 3])
my_array = my_array[1:4]
my_slice = my_array[1:3]
my_slice.base is my_array
[30]:
False
[31]:
my_slice
[31]:
array([3])
[32]:
my_array
[32]:
array([2, 3])
[33]:
# But a change to `my_slice` still impacts `my_array`.
my_slice[0] = -1
my_array
[33]:
array([ 2, -1])

The reason is that the .base property can be defined recursively. In this case, the slicing of my_array made my_array a view on data you can no longer access by name, so they actually do both point to the same data, but that data is not my_array, it’s my_array.base. So:

[34]:
my_slice.base is my_array.base
[34]:
True
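
To see what that hidden data looks like, here is a small sketch that grabs my_array.base directly (hidden_original is an illustrative name, not something numpy defines). It assumes the same setup as the example above.

```python
import numpy as np

my_array = np.array([1, 2, 3])
my_array = my_array[1:4]         # my_array is now a view: [2, 3]

hidden_original = my_array.base  # the array that actually owns the data
print(hidden_original.tolist())  # [1, 2, 3]
print(hidden_original.base is None)  # True: it owns its data

# A change through the view shows up in the hidden original.
my_array[0] = -1
print(hidden_original.tolist())  # [1, -1, 3]
```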

In practice, you can get infinite chains of .base.base....

And yes, if this is making your head hurt, that’s because you’re doing it right. :)