Solving Performance Issues

So: you’ve done all the things suggested in the last page on Performance basics that you can do within the constraints of your project, and you still have a performance problem. Now what?

Do you need to optimize?

[ADD MORE!]

First, ask yourself if this is a problem you really need to solve.

[Figure: xkcd comic on optimization]

Profiling Code

If you take nothing else away from this page, please read and remember this section!

There’s no reason to tune a line of code that is only responsible for 1/100 of your running time, so before you invest in speeding up your code, figure out exactly what in your code is causing it to be slow – a process known as “profiling”.
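To put rough numbers on this, here is a minimal sketch of the arithmetic (often called Amdahl's law). The function name overall_speedup and the example figures are made up for illustration; they don't come from a real project.

```python
def overall_speedup(fraction_of_runtime, local_speedup):
    """Overall speedup when one part of a program is made faster.

    fraction_of_runtime: share of total running time spent in that part (0 to 1)
    local_speedup: how many times faster that part becomes
    """
    remaining = (1 - fraction_of_runtime) + fraction_of_runtime / local_speedup
    return 1 / remaining


# Hypothetical cases: a line that accounts for 1/100 of running time versus
# a function that accounts for 5/6 of it, each made 10x faster.
small_win = overall_speedup(1 / 100, 10)
big_win = overall_speedup(5 / 6, 10)

print(f"tuning the 1/100 line: {small_win:.3f}x faster overall")
print(f"tuning the 5/6 function: {big_win:.3f}x faster overall")
```

Even an infinitely fast version of the 1/100 line could never make the whole program more than about 1% faster, which is why it pays to find the big pieces first.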

Thankfully, because this is so important, there are lots of tools (called profilers) for measuring exactly how long your computer is spending doing each step in a block of code. Here are a couple, with some demonstrations below:

  • Profiling in R: the two tools I’ve seen used most are Rprof (which is built into R) and the lineprof package.

  • Profiling in Python: if you use Jupyter Notebooks or JupyterLab, you can use the %prun magic. If for some reason you’re not using Jupyter, here’s a guide to a few other tools.

Profiling Example

To illustrate, let’s write a function (called my_analysis) which we can pretend is a big analysis that’s causing us problems. Within this analysis we’ll place several functions, most of which are fast, but one of which is slow. To make it really easy to see what is fast and what is slow, these functions will just call the time.sleep() function, which literally just tells the computer to pause for a given number of seconds (e.g. time.sleep(10) makes execution pause for 10 seconds).

[2]:
import time

def a_slow_function():
    time.sleep(5)
    return 1

def a_medium_function():
    time.sleep(1)
    return 1

def a_fast_function():
    return 1

def my_analysis():
    x = 0
    x = x + a_slow_function()
    x = x + a_medium_function()
    x = x + a_fast_function()
    print(f'the result of my analysis is: {x}')

my_analysis()
the result of my analysis is: 3

Now we can profile this code with the IPython magic %prun:

[3]:
%prun my_analysis()
the result of my analysis is: 3

      44 function calls in 6.009 seconds

Ordered by: internal time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2    6.009    3.004    6.009    3.004 {built-in method time.sleep}
     3    0.000    0.000    0.000    0.000 socket.py:337(send)
     1    0.000    0.000    6.009    6.009 {built-in method builtins.exec}
     1    0.000    0.000    6.009    6.009 <ipython-input-2-2718bcdb1d57>:14(my_analysis)
     3    0.000    0.000    0.000    0.000 iostream.py:197(schedule)
     2    0.000    0.000    0.000    0.000 iostream.py:384(write)
     1    0.000    0.000    0.000    0.000 {built-in method builtins.print}
     3    0.000    0.000    0.000    0.000 threading.py:1080(is_alive)
     1    0.000    0.000    1.004    1.004 <ipython-input-2-2718bcdb1d57>:7(a_medium_function)
     3    0.000    0.000    0.000    0.000 threading.py:1038(_wait_for_tstate_lock)
     2    0.000    0.000    0.000    0.000 iostream.py:309(_is_master_process)
     1    0.000    0.000    5.005    5.005 <ipython-input-2-2718bcdb1d57>:3(a_slow_function)
     3    0.000    0.000    0.000    0.000 {method 'acquire' of '_thread.lock' objects}
     2    0.000    0.000    0.000    0.000 iostream.py:322(_schedule_flush)
     3    0.000    0.000    0.000    0.000 iostream.py:93(_event_pipe)
     2    0.000    0.000    0.000    0.000 {built-in method posix.getpid}
     3    0.000    0.000    0.000    0.000 {method 'append' of 'collections.deque' objects}
     1    0.000    0.000    6.009    6.009 <string>:1(<module>)
     3    0.000    0.000    0.000    0.000 threading.py:507(is_set)
     2    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
     1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
     1    0.000    0.000    0.000    0.000 <ipython-input-2-2718bcdb1d57>:11(a_fast_function)

The output shows a number of things, but the most important are tottime and cumtime.

From tottime we can see that 6 seconds was dedicated to running time.sleep().

From cumtime, you can also see which functions were running while time.sleep() used up that time. Note that tottime and cumtime measure different things: tottime is the time spent in a function itself, excluding any functions it calls, while cumtime is the total time from when a function is called until it returns, including the functions it calls. time.sleep() has a cumtime of 6.009 because a total of 6 seconds was spent while that function ran, but it is also the case that a_slow_function (listed as <ipython-input-2-2718bcdb1d57>:3(a_slow_function)) has a cumtime of 5 seconds (because that function was in the process of executing when time.sleep() paused for 5 seconds), even though its tottime is 0.

From this, we can deduce that time.sleep() was slowing down our code, and that the occurrence of time.sleep() that slowed down our code the most was in a_slow_function.
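If you’re not working in Jupyter, much the same report can be produced with the cProfile and pstats modules from Python’s standard library. What follows is only a rough sketch: it reuses the function names from the example above, but the sleep durations are shortened (an assumption made here so it runs quickly), and sorting by cumulative time is just one reasonable choice.

```python
import cProfile
import io
import pstats
import time


def a_slow_function():
    time.sleep(0.05)  # shortened stand-in for the 5-second sleep above
    return 1


def a_fast_function():
    return 1


def my_analysis():
    return a_slow_function() + a_fast_function()


# Collect timing data only while my_analysis() runs
profiler = cProfile.Profile()
profiler.enable()
result = my_analysis()
profiler.disable()

# Write the report into a string, sorted by cumulative time (cumtime)
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(10)  # show at most 10 rows
print(report.getvalue())
```

The columns in the printed table (ncalls, tottime, cumtime) are the same ones %prun shows, so the report is read in the same way.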

Speeding Code with Cython

There are two libraries designed to allow you to massively speed up Python code. The first is called Cython, and it is a way of writing code that is basically Python with type declarations. For example, if you wanted to average all the whole numbers below a million in Python, you could write something like the following and call it with N=1000000 (obviously not the most concise way to do it, but you get the idea):

[10]:
def avg_numbers_up_to(N):
    adding_total = 0
    for i in range(N):
        adding_total = adding_total + i

    avg = adding_total / N

    return avg

But in Cython, you would write:

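def avg_numbers_up_to(int N):
    # Declare C types up front so the loop runs without Python objects
    # (64-bit total: the sum up to a million would overflow a 32-bit int)
    cdef long long adding_total = 0
    cdef int i
    cdef double avg

    for i in range(N):
        adding_total = adding_total + i

    # Cast so the division is floating-point whatever the language level
    avg = <double>adding_total / N

    return avg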

Then to integrate this into your Python code, you would save this function definition into a new file with the suffix .pyx (say, avg_numbers.pyx), and put this code in a file called setup.py in the same folder:

from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize('avg_numbers.pyx'))

Once you’ve compiled it (by running python setup.py build_ext --inplace from the command line), you can import your Cythonized function (from avg_numbers import avg_numbers_up_to) and call it in your normal Python script, and you’ll often find it runs ~10x - 100x faster! (Note that this speedup is only likely when compared to pure Python code. If you’re comparing Cython to a library function that was already written in C, your Cythonized Python is unlikely to be any faster (and may be slower) than that library function.)

Also, note that in Cythonized code with typed variables, loops are typically about as fast as vectorized code!

Cython won’t help, though, if your bottleneck is a single library function that is already compiled.
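To get a feel for why compiling your own code rarely beats a library function that is already written in C, you can time a hand-written loop against Python’s built-in sum() using the standard-library timeit module. This is only a sketch: the function names loop_total and builtin_total are invented for the comparison, and the timings will vary from machine to machine.

```python
import timeit


def loop_total(n):
    # Pure-Python loop: every addition goes through the interpreter
    total = 0
    for i in range(n):
        total = total + i
    return total


def builtin_total(n):
    # Built-in sum() does its looping in compiled C code
    return sum(range(n))


N = 100_000

# Take the best of 3 repeats of 10 calls each to reduce timing noise
loop_seconds = min(timeit.repeat(lambda: loop_total(N), number=10, repeat=3))
builtin_seconds = min(timeit.repeat(lambda: builtin_total(N), number=10, repeat=3))

print(f"hand-written loop: {loop_seconds:.4f} s for 10 calls")
print(f"built-in sum():    {builtin_seconds:.4f} s for 10 calls")
```

The hand-written loop is the kind of code where Cython has room to help; the built-in version is already running compiled code, so there is much less left to gain.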

Cython Advantages

[ADD MORE!]

  • Can make C libraries directly accessible from Python

  • Robust

Cython limitations

There are a few limitations to be aware of, however:

  • Cython only really works with (a) native Python and (b) NumPy (numpy instructions here). Some other Python libraries are / can be supported, but it’s not nearly as straightforward as the example above.

  • The function you write will not be dynamically typed, so if you said the function would accept integers, you can only give it integers.

  • Distributing code you write with Cython can be tricky.

Speeding Code with Numba

Another tool you can use is numba. Numba is a library that, when it works, is super easy and kinda magic, but can also be rather finicky.

The idea of numba is that it treats each function like its own little program, and tries to compile it to make it faster.

It can operate in two modes. In the first (“python mode”, which numba’s documentation calls “object mode”), it achieves its speed-up by saving the machine code that was used the first time a function is run. The speed benefits of this aren’t huge, since Python still has to do all the work of type checking and de-referencing, but it only has to actually convert what it’s doing to machine code once, so if you plan on using a function over and over, it can be beneficial.

The second mode (“nopython mode”) is blazing fast. In nopython mode, numba analyzes your function, and then makes inferences about the types of variables that it will encounter. For example, if you gave the following code to numba:

def my_big_loop(N):
    accumulator = 0
    for i in range(N):
        accumulator = accumulator + i
    return accumulator

And you ran that code with N=1000000, then numba would look at the function and think to itself “ok, so N is an integer. And accumulator starts as an integer. And if I add integers, they will always stay integers. So accumulator + i will always be integer addition. So I don’t have to think about types at every step of this loop!”

The only catch is that numba can’t always do nopython mode. For example, numba isn’t compatible with pandas, so if you put pandas code in a function you pass to numba, it can’t work in nopython mode.

But when it does work, it’s magic, because instead of making a new file that has to be compiled separately and which won’t “just work” on other computers, to make numba work you just add a “decorator” to the function you want to speed up. For example:

[17]:
# Without numba
def my_big_loop(N):
    accumulator = 0
    for i in range(N):
        accumulator = accumulator + i
    return accumulator
[18]:
%timeit my_big_loop(1000000)
59.4 ms ± 2.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
[19]:
# With numba
from numba import jit

# We just add this "decorator" (a line that starts with @ just above a function).
# The "nopython=True" option says to jit "tell me if you can't work in nopython mode,
# don't just silently revert to Python mode."
@jit(nopython=True)
def my_big_loop(N):
    accumulator = 0
    for i in range(N):
        accumulator = accumulator + i
    return accumulator

%timeit my_big_loop(1000000)
174 ns ± 5.36 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

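Not bad, huh? Add one line of code, and this speeds up by a factor of more than 300,000 (59.4 ms versus 174 ns)! A gain that large is more than faster looping alone can explain: it suggests the compiler optimized the loop away altogether, so expect smaller (though often still large) speedups on real code.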

But as I said, it doesn’t work with everything. Here’s an intro to numba, and here’s a full list of things it can handle in nopython mode, and things it can’t.

Also, as with Cython, note that in nopython Numba functions, loops are just as fast as vectorized code!
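One practical consequence of numba compiling a function the first time it is called is that the first call is noticeably slower than the ones that follow. The sketch below times two calls in a row to show this. It assumes numba is installed; the do-nothing fallback decorator is an assumption added here purely so the sketch still runs without numba (in which case both calls are ordinary Python and take about the same time).

```python
import time

try:
    from numba import njit  # njit is shorthand for jit(nopython=True)
except ImportError:
    # Fallback for illustration only: a decorator that leaves the function unchanged
    def njit(func):
        return func


@njit
def my_big_loop(N):
    accumulator = 0
    for i in range(N):
        accumulator = accumulator + i
    return accumulator


# First call: with numba, this includes the time spent compiling
start = time.perf_counter()
first_result = my_big_loop(1000000)
first_call_seconds = time.perf_counter() - start

# Second call: reuses the machine code produced during the first call
start = time.perf_counter()
second_result = my_big_loop(1000000)
second_call_seconds = time.perf_counter() - start

print(f"first call:  {first_call_seconds:.6f} s")
print(f"second call: {second_call_seconds:.6f} s")
```

This is also why %timeit, which runs the function many times, reports the fast steady-state number rather than the one-off compilation cost.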

Type Stability

[ADD MORE]

Use Julia

I would be remiss at this point to not mention one other option for getting more speed: use the programming language Julia. Julia is a very new language that has syntax that is very similar to Python, but which often runs tens or hundreds of times faster out of the box. Basically, it’s kinda like an entire language built around the technology also used by numba, but where numba is kind of finicky because it’s been tacked on to a language that was never built for speed, Julia was designed from the ground up for speed.

If you want to know why I love Julia, you can find a talk I gave on it here. It’s a little old (I refer to Julia 1.0 not being out yet, but Julia’s up to 1.2 now), but the core arguments all still apply.

To be clear, I wouldn’t recommend jumping languages if you just have one function you need to speed up, but if you’re doing work that causes you to have performance issues regularly, consider Julia.

[ADD NOTE ON TYPE STABILITY ALSO APPLYING HERE]

Parallelization

If you’ve done all this and your code is still too slow, it’s time to look into parallelization, which we’re doing next!