Speed in Python#

First, the good news: as a Data Scientist working in Python, most of the time whatever libraries you want to use have already been carefully engineered to deliver amazing performance, meaning working about code speed is something you usually won’t have to worry about. But understanding why programs like Python can sometimes be slow is still important for being able to avoid accidentally writing code that runs so slowly you can’t get your work done, and for trouble shooting if you end up in a situation where you really need something to go faster.

The Python Ease of Use / Speed Tradeoff#

There are two primary factors that determine the speed of a program:

  • The number of computations computer has to complete (i.e. adding, multiplying, etc.)

  • The amount it has to access and move data around (as discussed in What Is Big Data?, memory access is slow).

This is obviously kinda abstract, so let’s look at some real code and discuss what you computer is actually doing when it runs Python code. In particular, let’s focus on what happens in a loop like this:

summation = 0
for i in range(1000):
    summation = summation + i

Now, you might think that all Python has to do to run this loop is add a bunch of numbers. But unfortunately, that’s not the case.

At each step of this loop, Python has to do the following:

  1. Dereference summation (find the place in memory that is associated with the variable summation and read the data there)

  2. Read that data to figure out the type of summation (in this case, it’s an integer).

  3. Dereference i

  4. Read that data to figure out the type of i (in this case, it’s an integer).

  5. Lookup how it should interpret + given that summation is an integer and i is an integer.

    • Remember: if summation and i were strings, it wouldn’t be doing addition, it would be concatenating! And because integers and floating point numbers are different things from the perspective of a computer, if i were an integer and summation were a floating point number, + would actually require (a) converting i to a floating point number, then (b) adding the floating point version of i to summation.

  6. Compile low-level binary code (referred to as “machine code”) that can be read by the computer’s processor to tell it to execute an addition of two integers AND to also check and make sure that the operation won’t cause an integer overflow (remember: Python checks for integer overflows! That’s another operation!).

  7. Run that code.

As you can see, there’s a lot going on in that simple line of code, and that process has to repeat every step of the loop.

Now, you might ask, is all of that necessary?

The answer is both yes and no. There are languages that don’t require all those steps (like C and C++, the standard-bearers for speed), but if Python didn’t do all those things when it ran, you wouldn’t be able to use Python the way you do.

Python is designed to be easy for you, the programmer. Unlike in some languages (like C), in Python you don’t have to declare that i is an integer explicitly in your code, and you are allowed to change i from an integer to a floating point number whenever you want. That makes Python what’s called a dynamically typed language, and it makes it very easy to use. If you wrote this in C, every time you made a new variable you’d have to declare it’s type. But the fact it’s dynamically typed means the language is never quite sure of the type of a variable until it explicitly checks that type, because it could change at any time.

Moreover, Python is a language you can use interactively. You can’t open a terminal and write and execute C code one line at a time. For languages like C (what are called “compiled languages”), you write all of your code, then hit compile, and your computer reads all your code and compiles low-level binary code for that entire program at once. Then the only way to use your C code is to run that compiled program. But the advantage of that is that step 6 (compiling your code) is done once in advance, and you don’t have to do that every time you run your code.

But to make Python interactive, it was decided that Python should evaluate each line of code when it reaches that code. That massively increases it’s flexibility and allows you to program interactively, but for loops like this, it means to do 1,000 additions, it also has to compile low-level machine code 1,000 times. (Note: there are newer languages like Julia that have are still interactive, but which are also smart enough to recognize these situations and only compile machine code once. Moreover, these languages also reduce the amount of “dereferencing” required. But that’s not Python.)

Rules for Going Fast in Python#

So now that you have some sense of why Python can be slow, let’s talk about how to keep your Python code fast (note to R users: basically everything I say here about Python also applies to R):

Rule 1: Use other people’s functions#

Let’s start with the dirty secret of Python: when you use a library in Python, the code you’re running was almost never actually written in Python. Instead, the functions you’re calling have a “thin” layer of Python at the top (which is what you’re interacting with), but when it comes to doing serious computations, the functions are actually calling tools written in C, which is blazing fast. (This is also true of R, by the way).

Thus the first rule of speed in Python: when you need speed, use tools that other people have written and distributed in big libraries (like pandas, sci-py, statsmodels, etc.). Because what you’re really doing is using C, and C is 10s or 100s of times faster than Python.

Rule 2: Vectorize, Don’t write loops#

When you use “vectorized” code (meaning code where you interact two full vectors or matrices in one command instead of looping over elements), you get huge performance gains. For example, consider the performance differences between the following two blocks of code, both of which do exactly the same thing:

import numpy as np

numbers = np.arange(1000000)
array([     0,      1,      2, ..., 999997, 999998, 999999])
# Make a new array where each number is doubled
# done with a loop
b = numbers.copy()
for i in range(len(b)):
    b[i] = numbers[i] * 2
107 ms ± 2.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Now vectorized
b = numbers * 2
690 µs ± 9.32 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

That’s a 300x speedup! For the exact same code!!!!

OK, so why does this happen?

The answer is twofold.

First, when Python dereferences a variable (i.e. get the object associated with a variable), there are actually two steps to that process: first, Python must look up where to find the object in memory; then second it must go get the object.

Because Python evaluates every line of code when it reaches that line, in the loop example, in ever step of the loop Python has to look up where to find the array, then go get the array value.

But in the vectorized function, Python (or rather, the numpy code written in C that gets called) is designed to understand that it’s about to do something to every entry of an array, so it remembers where the array is located, and so only has to look up where to find the array once.

In addition, arrays are typed, meaning that Python also knows that every entry of the array it’s modifying is an integer. As a result, it doesn’t have to check the type of every entry in the array when the operation is vectorized, it checks once and knows that it’s working with an array of integers.

(You may say “but can’t it do the same in the loop?” In theory it could, and there are languages like Julia that are designed to be smarter about this kind of stuff. But currently, because Python only evaluates a line of code when it is reached means that it can be frustratingly myopic.)

A Note On Object Vectors#

You now also know enough to understand why object vectors in numpy and pandas are less performant.

When you’re working with a pandas Series or numpy array of integers, all of the entries of those arrays live in the same place in memory. This is because it is easy for a computer to put, say, 100 integers side by side, since it knows that each integer requires exactly 64 bits and so it can put them all end to end and tell the computer that each number starts 64 bits after the last.

But an object array is a little different. Instead of laying all the contents of the array in one place, an object array puts a collection of memory addresses in one place. This is because objects can be of arbitrary size, so you can’t put them all in one place in memory reliably. For example, suppose you had an array that had the following strings: 'this', 'is', 'an', 'array'. The first entry is longer than the second, is the same as the third, and is shorter than the third. You can’t just put them end-to-end and tell the computer “each number starts X bits after the last”.

But you can do that with memory addresses. So an object array is a collection of memory addresses (each the same size) laid end to end.

But that means that to look at the second entry of an object array, you have to go to the second location in the array, read the address, then go to that address. Moreover, because object arrays can hold anything, you could have an integer in the first entry, and a string in the second entry, so you can’t assume that when you see code like:

my_array * 2

that * will mean the same thing for each entry. In one entry, you might be doing integer multiplication; in another you might be doing floating point multiplication, in another you may be doubling up a list!

Indeed we can see this is we make a numpy object array full of ints and compare it to a numpy integer array. The both have the same content, but they are organized in memory differently:

import numpy as np

object_numbers = np.array(np.arange(1000000), dtype="object")
numbers = np.arange(1000000)
%timeit object_numbers * 2
15.5 ms ± 156 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit numbers * 2
667 µs ± 11.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

See? Same operation (doubling each entry of arrays with the integers from 0 to 1,000,000), but the object array operation is ~25x slower.