# Views and Copies

Before we dive into matrices, there’s one nuance to how vectors (and indeed all numpy arrays) work that we need to cover: how numpy manages memory allocation when we take subsets of arrays.

Because this reading relates to the nuances of how numpy is actually managing how it writes 1s and 0s in memory, it may seem a little intimidating at first. Don’t worry! This topic is definitely a little mind-bending: many learners will need to read this more than once to develop a good understanding of what’s going on, and most won’t develop an intuitive sense of it until they have been working with numpy for a while. But this isn’t the start of a big transition of the course into abstract computer science theory; we will get back to working with data quickly. This is just one kinda esoteric topic that numpy users have to wrestle with, and we can’t avoid introducing it around here.

## Memory Allocation & Subsetting

In our previous reading, we talked about how we could not just look at subsets of vectors, but also store those subsets in a new variable. For example, we could pull the middle entry of one vector and assign it to a new variable:

[43]:

import numpy as np
a = np.array([42, 47, -1])
a

[43]:

array([42, 47, -1])

[44]:

new = a[1]
new

[44]:

47


At the time, we illustrated what was happening with this set of pictures:

And that was close to the truth about what was going on, but it wasn’t quite the full truth.

The reality is that when we create a subset in numpy and assign it to a new variable, what is actually happening is not that the variable is being assigned a copy of the values in the subset, but rather the variable is being assigned a reference to the subset, something that looks more like this:

When numpy creates a reference to a subset of an existing array, that reference is called a view, because it’s not a copy of the data in the original array, but an easy way of referring back to the original array: it provides a view onto a subset of the original array.

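Why is this distinction important? It’s important because it means that two variables can both be referencing the same data, and so changes made through one variable will propagate to the other. (One caveat: pulling out a single entry, as in new = a[1], actually hands back a standalone number rather than a view. The shared-data behavior applies when the subset is itself an array, as with the slice in the next example.)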

To illustrate in more detail, let’s create two new vectors: my_vector and my_subset, where my_subset (as the name implies) is just a subset of my_vector:

[45]:

my_vector = np.array([1, 2, 3, 4])
my_vector

[45]:

array([1, 2, 3, 4])

[46]:

my_subset = my_vector[1:3]
my_subset

[46]:

array([2, 3])


Now suppose we change the first entry of my_subset to be -99:

[47]:

my_subset[0] = -99


Since the first entry in my_subset is just a reference to the second entry in my_vector, the change I made to my_subset will also propagate to my_vector:

[48]:

my_vector

[48]:

array([  1, -99,   3,   4])


And just as edits to my_subset will propagate to my_vector, so too will edits to my_vector propagate forward to my_subset:

[49]:

my_vector[2] = 42
my_subset

[49]:

array([-99,  42])


### Language and Symmetry

It’s worth pausing for a moment to point out a bit of a problem with the language of views and copies. It is common, in numpy circles, to look at the example above and talk about my_vector being the original data, and my_subset as a view. And it is true that, because my_vector came first, there is a difference between my_vector and my_subset in terms of how numpy is creating and managing these objects.

But from your perspective as a user, it is important to recognize that there is a symmetric dependency between my_vector and my_subset in the example above. Yes, one may be “the original,” but once a view has been created, changes to either array have the potential to propagate to the other: changes to my_subset may result in changes to my_vector, and changes to my_vector can impact my_subset (if they affect the portion of the array referenced by the subset).

So when you think about views, always remember that what we’re talking about is multiple objects sharing the same data, even if we tend to only talk about one of our arrays as “a view.”
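
One way to check this shared-data relationship for yourself is sketched below. It uses np.shares_memory and the .base attribute, neither of which appears in the reading above, so treat it as an illustrative check rather than part of the original example:

```python
import numpy as np

my_vector = np.array([1, 2, 3, 4])
my_subset = my_vector[1:3]  # simple indexing, so this is a view

# np.shares_memory reports whether two arrays overlap in memory
shares = np.shares_memory(my_vector, my_subset)
print(shares)  # True

# A view records the array it was created from in its .base attribute
print(my_subset.base is my_vector)  # True

# Edits flow in both directions
my_subset[0] = -99
my_vector[2] = 42
print(my_vector)  # [  1 -99  42   4]
print(my_subset)  # [-99  42]
```

Note that np.shares_memory is symmetric: it does not care which array came first, which matches the point about symmetry made above.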

### Why? Why Would numpy Do This?!

It is not uncommon, when they are first introduced to this behavior, for students to feel a little betrayed by numpy. “Why,” they ask, “why would numpy do something that makes it so much harder to keep track of the consequences of changes I make to my data?”

The short answer, as with most things in numpy, is that it’s all about speed. Creating a new copy of the data contained in the subset of a vector takes time (it literally requires your computer to write lots of 1s and 0s to memory), and so creating views instead of copies makes numpy faster.

How much faster? The short answer is: a lot faster. The longer answer? Well, let’s talk a little more about how views and copies work, then we can do an experiment to measure the speed difference below.

## When do I get a view, and when do I get a copy?

Because numpy will usually create views when you subset a vector, and changes to views will propagate to the vectors associated with other variables, it’s really important to keep track of when the object you’re working with is a view and when it is a copy.

Which brings us to the next slightly frustrating thing about numpy: the way that you ask for a subset will determine whether you get a view or a copy.

### Views and Copies from Subsetting

Generally speaking, numpy will give you a view if you use simple indexing to get a subset of an array, but it will provide a copy if you use any other method. Recall that simple indexing is when you pass a single index, or a range of indices separated by a :. So my_vector[2] is simple indexing, and so is my_vector[2:4]. (A single index into a vector returns one number rather than an array, so in practice it is ranges like my_vector[2:4] that give you views.)

So, for example, this simple indexing returns a view:

[55]:

my_array = np.array([1, 2, 3])
my_subset = my_array[1:3]
my_subset

[55]:

array([2, 3])

[56]:

my_subset[0] = -1
my_array

[56]:

array([ 1, -1,  3])


But if you ask for a subset any other way—such as with “fancy indexing” (where you pass a list when making your slice) or Boolean subsetting—you will NOT get a view, you will get a copy. As a result, changes made to your subset will not propagate back to my_array:

[57]:

my_array = np.array([1, 2, 3])
my_subset = my_array[[1,2]]
my_subset[0] = -1
my_array

[57]:

array([1, 2, 3])

[58]:

my_array = np.array([1, 2, 3])
my_slice = my_array[my_array >= 2]
my_slice[0] = -1
my_array

[58]:

array([1, 2, 3])
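
If you would rather test for shared data directly than infer it from whether an edit propagates, a minimal sketch follows. It compares the three subsetting styles using np.shares_memory; the names slice_subset, fancy_subset, and bool_subset are made up for this illustration:

```python
import numpy as np

my_array = np.array([1, 2, 3])

slice_subset = my_array[1:3]           # simple indexing
fancy_subset = my_array[[1, 2]]        # fancy indexing with a list
bool_subset = my_array[my_array >= 2]  # Boolean subsetting

# All three subsets hold the same values...
print(slice_subset, fancy_subset, bool_subset)  # [2 3] [2 3] [2 3]

# ...but only the slice shares memory with my_array
slice_shares = np.shares_memory(my_array, slice_subset)
fancy_shares = np.shares_memory(my_array, fancy_subset)
bool_shares = np.shares_memory(my_array, bool_subset)
print(slice_shares, fancy_shares, bool_shares)  # True False False
```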


### Views and Copies When Editing

We established above that numpy will only return a view when you subset with simple indexing, but not when you use fancy indexing or Boolean subsetting.

It’s also important to understand what types of modifications of a view will result in changes that propagate back to the original array.

If you modify a view using simple indexing on the left-hand side of the assignment operator (e.g., my_subset[0] = ... or my_subset[0:2] = ...), that change will propagate back to the original array (my_array).

But if we modify our vector and assign it to my_subset without that simple indexing on the left-hand side of the assignment operator, numpy will actually just create a new vector and assign it to the variable, not modify entries in our current vector. So in the following example, when numpy sees my_subset * 2 it just creates a new vector with values equal to double the values in my_subset, then assigns that vector to the variable my_subset—it doesn’t modify the data originally associated with my_subset (which is the same data underlying my_array):

[59]:

my_array = np.array([1, 2, 3])
my_subset = my_array[1:3]
my_subset = my_subset * 2
my_subset

[59]:

array([4, 6])

[60]:

my_array

[60]:

array([1, 2, 3])


If you ever do want to do a full-array manipulation and preserve your view, you can just use square brackets containing only a colon (my_subset[:]) on the left side of the assignment operator:

[61]:

my_array = np.array([1, 2, 3])
my_subset = my_array[1:3]
my_subset[:] = my_subset * 2
my_subset

[61]:

array([4, 6])

[62]:

my_array

[62]:

array([1, 4, 6])
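
A related option that this reading does not cover is an in-place operator such as *=. The sketch below assumes the same my_array and my_subset setup as above. Because numpy arrays apply augmented assignment to their existing data, the view is preserved and the change propagates:

```python
import numpy as np

my_array = np.array([1, 2, 3])
my_subset = my_array[1:3]

# Augmented assignment modifies the existing data in place,
# so my_subset stays a view of my_array
my_subset *= 2

print(my_subset)  # [4 6]
print(my_array)   # [1 4 6]

still_shared = np.shares_memory(my_array, my_subset)
print(still_shared)  # True
```

Contrast this with my_subset = my_subset * 2, which builds a new array and rebinds the variable name to it.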


## Making a Copy

Of course, this type of propagating behavior is not always desirable, and so if one wishes to pull a subset of a vector (or array) that is a full copy and not a view, one can just use the .copy() method:

[63]:

my_vector = np.array([1, 2, 3, 4])
my_subset = my_vector[1:3].copy()
my_subset

[63]:

array([2, 3])

[64]:

my_subset[0] = -99
my_subset

[64]:

array([-99,   3])

[65]:

my_vector

[65]:

array([1, 2, 3, 4])
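
To confirm that .copy() really has cut the connection, one possible check is sketched here. The names view_subset and copy_subset are illustrative only:

```python
import numpy as np

my_vector = np.array([1, 2, 3, 4])
view_subset = my_vector[1:3]         # view: shares data with my_vector
copy_subset = my_vector[1:3].copy()  # copy: has its own data

view_shares = np.shares_memory(my_vector, view_subset)
copy_shares = np.shares_memory(my_vector, copy_subset)
print(view_shares)  # True
print(copy_shares)  # False

# A copy owns its data, so it has no base array
print(copy_subset.base is None)  # True
```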


## How Much Faster Are Views?

As previously noted, the reason that numpy uses views is because of the speed. But how much speed are we talking about?

Let’s create a little example to find out.

Suppose you work for an electric car company and are interested in understanding whether the performance of your energy recovery system declines after the car has been running for a while. To test this, you pull data on how efficiently the energy recovery system has been operating that’s been collected every couple milliseconds over a long drive:

[ ]:

# Generate 1 million observations of
# fake efficiency data
efficiency_data = np.random.normal(100, 50, 1_000_000)


Now to see whether efficiency is changing over time, suppose that we want to measure average efficiency for the first 300,000 observations in our data and compare it to average efficiency for the last 300,000 observations. We could do this with code that looks something like:

[ ]:

degradation_over_time = np.mean(efficiency_data[700_000:1_000_000]) - np.mean(efficiency_data[0:300_000])


But notice that nested in this are two subsets of our data—one that subsets the first 300,000 observations, and one that subsets the last 300,000 observations. These are precisely the type of operations for which numpy doesn’t want to spend time creating a full copy of those data, since we never actually want a new copy for future manipulations!

So let’s see how much faster those subsets are using views as opposed to copies:

[ ]:

import time
start = time.time()

# Let's do the subset 10,000 times and divide
# the overall time taken by 10,000
# so any small fluctuations in speed average out

# First with simple indexing to get views
for i in range(10_000):
    initial_data = efficiency_data[0:300_000]
    final_data = efficiency_data[700_000:1_000_000]

stop = time.time()
duration_with_views = (stop - start) / 10_000
print(f"Subsets with views took {duration_with_views * 1_000:.4f} milliseconds")

Subsets with views took 0.0005 milliseconds

[ ]:

# np.arange excludes its stop value, just like simple indexing,
# so these cover the same entries as the slices above
first_subset = np.arange(0, 300_000)
second_subset = np.arange(700_000, 1_000_000)

start = time.time()

# Now do the subset using fancy indexing
# to ensure that we get copies

for i in range(10_000):
    initial_data = efficiency_data[first_subset]
    final_data = efficiency_data[second_subset]

stop = time.time()
duration_with_copies = (stop - start) / 10_000
print(f"Subsets with copies took {duration_with_copies * 1_000:.4f} milliseconds")

Subsets with copies took 1.5027 milliseconds

[ ]:

print(f"Subsets with copies took {duration_with_copies / duration_with_views:,.0f} times as long as with views")

Subsets with copies took 3,022 times as long as with views


So that’s why, despite being kinda a pain, numpy does this views / copies trick: because the speed up is more than 1,000x.
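
If you want to rerun this comparison yourself, the sketch below is one alternative way to do it using Python's standard timeit module instead of manual time.time() calls. The helper functions subset_with_view and subset_with_copy, the seed, and the number of runs are assumptions made for this illustration. Your timings will depend on your machine, so no expected numbers are shown:

```python
import timeit

import numpy as np

# Fake efficiency data, seeded so the array is reproducible
rng = np.random.default_rng(0)
efficiency_data = rng.normal(100, 50, 1_000_000)

# Index array for fancy indexing (covers the same entries as 0:300_000)
fancy_index = np.arange(0, 300_000)


def subset_with_view():
    # Simple indexing: returns a view, no data is copied
    return efficiency_data[0:300_000]


def subset_with_copy():
    # Fancy indexing: returns a copy of the 300,000 entries
    return efficiency_data[fancy_index]


n_runs = 200
view_seconds = timeit.timeit(subset_with_view, number=n_runs) / n_runs
copy_seconds = timeit.timeit(subset_with_copy, number=n_runs) / n_runs

print(f"view: {view_seconds * 1_000:.4f} ms per subset")
print(f"copy: {copy_seconds * 1_000:.4f} ms per subset")
```

timeit handles the repetition and uses a high-resolution clock, which makes it a reasonable choice for timing very fast operations like creating a view.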

Now, does that mean that you should never use .copy() or fancy indexing? Let’s not get ahead of ourselves. Even on my several-year-old Intel-based MacBook Pro, creating those subsets with fancy indexing (and thus using copies) may have been a lot slower than with simple indexing, but even with a vector with one million entries, each subset still took less than a millisecond.

Personally, I don’t think I’ve ever had occasion to worry about whether numpy is going too slowly because something I’m doing is generating copies, and honestly I’m much more worried about corrupting my data accidentally at some point because I’m working with a view instead of a copy than I am about this performance penalty. So if I’m ever uncertain about whether I should use a view or a copy in a given circumstance, I will almost always just throw in a .copy().

But I do probably benefit from the fact that views are being used behind the scenes in the high performance libraries I use, like when I use numpy library functions like np.sum(), or statistical modeling functions in machine learning libraries. And some of you students may very well end up doing high performance work (say, climate modelling, or high-frequency trading) where this type of performance difference does matter, and so that’s why it’s there!