Numbers in Computers

As a Data Scientist, you will spend a lot more time playing with numbers than most programmers. As a result, it pays to understand how numbers are represented in computers, and how those representations can get you into trouble.

This lesson is divided into two parts. In the first portion, we’ll cover the basics of how computers think about numbers, and what issues can potentially arise with the two main numerical representations you’ll use. In the second portion, we’ll discuss when you need to worry about these hazards both (a) when using vanilla Python, and (b) when using numpy and pandas.

The Two Classes of Numbers: Integers and Floating Point Numbers

Broadly speaking, computers have two ways of representing numbers: integers and floating point numbers. In most intro computer science courses, students are taught that integers are for… well, integers (whole numbers), and floating point numbers are for numbers with decimal points, and that is true up to a point. But under the hood, integers and floating point numbers work in very different ways, and there are distinct hazards associated with each.

To learn the ins-and-outs of how integers and floating point numbers work, please review the following materials (these explanations are very good, and there’s no reason to try and write my own explanations when these exist). Then continue to the section below on Python-specific hazards.

Integers

To see a great discussion of integers (and their major pitfall: integer overflow), please watch this video.

If after watching you feel you would like to learn more, then Chapters 7 and 8 of Code: The Hidden Language of Computer Hardware and Software by Charles Petzold will get into integers in great detail.

Floating Point Numbers

Integers, as a data type, are wonderful. They are precise and pretty intuitive. But they also have their weaknesses: namely, they can’t represent numbers with decimal points (which we use all the time), and they can’t represent really big numbers.

So how do we deal with decimals and really big numbers? Floating point numbers!

To learn about floating point numbers, please review the following materials:

Numeric Hazards in Python, Numpy, and Pandas

So, in general terms, the hazards of working with integers and floating point numbers are:

  • Integers can overflow, resulting in situations where adding two big numbers results in a … negative number.
  • Floating point numbers are always imprecise, resulting in situations where apparently simple math breaks (e.g. in Python 0.1 + 0.1 + 0.1 == 0.3 returns False)
  • Floating point numbers can only keep track of so many significant digits, meaning that you can’t work with BOTH very large and very small floating point numbers in the same calculation (e.g. in Python, 2.32781**55 + 1 == 2.32781**55 returns True).

But when do we need to worry about these issues?

The answer is that it depends on whether you’re using regular, vanilla Python, or numpy / pandas.

Integer Overflows in Python

Python is meant to be a friendly language, and one manifestation of that is that in vanilla Python, you can’t overflow your integers! That’s because whenever Python does an integer computation, it stops to check whether the integer in question has been allocated enough bits to store the result, and if not, it just allocates more bits. So if the result of an integer computation won’t fit in 64 bits, Python simply gives that integer more bits instead of overflowing!

[6]:
# Here's a really big integer
x = 2**63
[7]:
# Now let's make it bigger so it can't fit in 64 bits!
x = x ** 4
x
[7]:
7237005577332262213973186563042994240829374041602535252466099000494570602496

See? No problem!
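If you want to see this extra allocation happening, one way to peek at it (a quick sketch, using Python’s built-in sys module) is sys.getsizeof, which reports how many bytes an object occupies. The exact numbers vary across Python versions, but the bigger integer will always report a larger footprint:

import sys

small = 2**10    # fits comfortably in a machine word
big = 2**1000    # needs far more bits than any machine word

# Both calls print a size in bytes; the big integer reports a much
# larger number because Python allocated extra bits to hold it.
print(sys.getsizeof(small))
print(sys.getsizeof(big))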

Integer Overflows in numpy and pandas

The problem with what Python does with integers is that, while convenient, it’s slow. Asking Python to add two integers doesn’t just require the computer to add two integers; it also requires it to check the size of the result and, if that result won’t fit in the bits that have already been allocated, to allocate more bits. This makes adding integers in Python much, much slower than it could be. Like… 10x slower.

That’s why libraries like numpy and pandas – which are designed for performance when working with huge datasets – don’t check for integer overflows. This makes them much faster, but if you add two really big integers in numpy (or add even small numbers to a really big number) and the result is bigger than what fits in the available bits, you’ll just end up with a negative number.

How much faster? Here’s a comparison of adding up all the integers from 1 to 1,000,000 using regular Python integers (which check for overflows) and using numpy tools (which do not). Some of this difference comes from things other than overflow checking, but it gives you a sense of the performance cost of making integers safer in regular Python:

[46]:
# Regular Python:
%timeit sum(range(1000000))
22.9 ms ± 464 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
[47]:
# Numpy
%timeit np.sum(np.arange(1000000))
1.23 ms ± 16.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

But as I said, while it may be fast, it can also be dangerous:

[3]:
import numpy as np
a = np.array([2**63-1, 2**63-1], dtype='int')
a
[3]:
array([9223372036854775807, 9223372036854775807])
[4]:
a + 1
[4]:
array([-9223372036854775808, -9223372036854775808])

It’s also important to understand that with numpy and pandas, you control the size of the integers you use, and thus how big an integer can get before you have overflow problems. By default, numpy will make your integers the size your system processor works with natively (usually 64 bits on a modern computer, but sometimes 32 bits on an older one). But numpy also lets you make arrays of 16-bit (int16), 32-bit (int32), or 64-bit (int64) integers. This can be very useful when working with big datasets: smaller integers take up less memory, and calculations with smaller integers can sometimes be faster due to some intricacies of how computers use memory. But if you do use smaller integer sizes, then you really need to be careful with your overflows: int16 can only store numbers up to 32,767!

[5]:
x = np.array(32767, dtype='int16')
x + 1
[5]:
-32768
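To get a feel for the memory savings that motivate smaller integer types, you can compare the nbytes attribute of two arrays holding the same values. This is just an illustrative sketch; the arrays here are arbitrary:

# one million integers stored two ways
big_ints = np.ones(1_000_000, dtype='int64')
small_ints = np.ones(1_000_000, dtype='int16')

big_ints.nbytes    # 8,000,000 bytes -- 8 bytes per integer
small_ints.nbytes  # 2,000,000 bytes -- 2 bytes per integer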

Also, note that numpy and pandas have “unsigned” integers (uint16, uint32, uint64). These are like regular integers, except they don’t allocate half of their possible values to negative numbers, so their upper limit is twice that of the same-sized signed integer. In general, though, it’s good to avoid uints, as it’s too easy to underflow by hitting the bottom of the range they can tolerate (i.e. by going below zero):

[6]:
x = np.array([2], dtype='uint64')
x
[6]:
array([2], dtype=uint64)
[8]:
x - 3
[8]:
array([18446744073709551615], dtype=uint64)
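If you’re ever unsure of the exact limits of an integer type, numpy’s iinfo function will report them, which makes it easier to pick a size that is safely larger than anything in your data. A few illustrative calls:

np.iinfo('int16')       # min = -32768, max = 32767
np.iinfo('uint16')      # min = 0, max = 65535
np.iinfo('int64').max   # 9223372036854775807 -- the value we overflowed above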

Floating Point Precision

Unfortunately, while vanilla Python can protect you from integer overflows, it can’t do anything about floating point precision. Whether you’re using numpy or not, you’re stuck with these types of things:

[9]:
0.1 + 0.1 + 0.1 == 0.3
[9]:
False

and

[10]:
2.32781**55 + 1 == 2.32781**55
[10]:
True
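If you want to see why these comparisons go wrong, it helps to look at what the computer actually stored. A quick illustration: the three 0.1s don’t quite sum to 0.3, and a number on the order of 10**20 is too large for a float to register a change of 1:

0.1 + 0.1 + 0.1                      # 0.30000000000000004
(2.32781**55 + 1) - 2.32781**55      # 0.0 -- the +1 was lost entirely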

But you also get surprising results, like 7.5 rounding up while 10.5 rounds down (Python’s round sends exact halves to the nearest even number, a rule sometimes called “banker’s rounding”):

[21]:
round(7.5)
[21]:
8
[22]:
round(10.5)
[22]:
10
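To see the pattern behind this, here’s what round does with a run of halves; each one goes to the nearest even integer:

round(0.5), round(1.5), round(2.5), round(3.5)
# (0, 2, 2, 4)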

So just remember: whatever you’re doing with floating point numbers, exact, knife-edge tests may do weird things.

If you want protection against this, consider using the isclose function from the numpy library, which returns True if the two arguments it is passed are really close. (By really close, I mean that np.isclose(a, b) checks whether \(|a - b| \leq \mathrm{atol} + \mathrm{rtol} \cdot |b|\), where the relative tolerance (\(rtol\)) is \(10^{-5}\) and the absolute tolerance (\(atol\)) is \(10^{-8}\) by default. You can also change these tolerances if you want, as shown in the help file.)

[64]:
np.isclose(0.1 + 0.1 + 0.1, 0.3)
[64]:
True
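And if the default tolerances don’t suit your problem, you can pass your own rtol and atol. For example (purely illustrative), tightening the absolute tolerance far enough makes the same comparison fail again, because the tiny gap between the two values is now larger than what we are willing to tolerate:

np.isclose(0.1 + 0.1 + 0.1, 0.3, rtol=0, atol=1e-20)
# False -- the gap (about 5.6e-17) exceeds the tolerance we allowed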

Exercises!

Please complete the exercises for this lesson.