# Numbers in Computers

As a Data Scientist, you will spend a *lot* more time playing with numbers than most programmers. As a result, it pays to understand how numbers are represented in computers, and how those representations can get you into trouble.

This lesson is divided into two parts. In the first portion, we’ll cover the basics of how computers think about numbers, and what issues can potentially arise with the two main numerical representations you’ll use. In the second portion, we’ll discuss when you need to worry about these hazards both (a) when using vanilla Python, and (b) when using `numpy` and `pandas`.

## The Two Classes of Numbers: Integers and Floating Point Numbers

Broadly speaking, computers have two ways of representing numbers: integers and floating point numbers. In most intro computer science courses, students are taught that integers are for… well, integers (whole numbers), and floating point numbers are for numbers with decimal points, and that is true up to a point. But under the hood, integers and floating point numbers work in very different ways, and there are distinct hazards when working with each.
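As a quick illustration (a side demonstration, not part of the materials below), you can peek at the raw bits with Python’s standard `struct` module: the integer 1 and the float 1.0 are stored as completely different bit patterns, even though they represent the “same” value.

```
import struct

# Pack the value one into 8 bytes two ways and inspect the bits.
as_int = struct.pack('>q', 1).hex()      # 64-bit two's-complement integer
as_float = struct.pack('>d', 1.0).hex()  # 64-bit IEEE 754 float

print(as_int)    # 0000000000000001
print(as_float)  # 3ff0000000000000 (sign, exponent, and fraction fields)
```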

To learn the ins and outs of how integers and floating point numbers work, please review the following materials (these explanations are *very* good, and there’s no reason to write my own when these exist). Then continue to the section below on Python-specific hazards.

### Integers

To see a great discussion of integers (and their major pitfall: integer overflow), please watch this video.

If after watching you feel you would like to learn more, then Chapters 7 and 8 of *Code: The Hidden Language of Computer Hardware and Software* by Charles Petzold will get into integers in great detail.

### Floating Point Numbers

Integers, as a data type, are wonderful. They are precise and pretty intuitive. But they also have their weaknesses: namely, they can’t represent numbers with decimal points (which we use all the time), and they can’t represent really big numbers.

So how do we deal with decimals and really big numbers? Floating point numbers!

To learn about floating point numbers, please:

## Numeric Hazards in Python, Numpy, and Pandas

So in general terms, the dangers with integers and floating point numbers are:

- Integers can overflow, resulting in situations where adding two big numbers produces a… negative number.
- Floating point numbers are imprecise, resulting in situations where apparently simple math breaks (e.g., in Python, `0.1 + 0.1 + 0.1 == 0.3` returns `False`).
- Floating point numbers can only keep track of so many significant digits, meaning that you can’t work with BOTH very large and very small floating point numbers at the same time (e.g., in Python, `2.32781**55 + 1 == 2.32781**55` returns `True`).
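To see why that last comparison comes out `True`, you can ask Python for the gap between a big float and the next float the computer can represent, using `math.ulp` (a standard library function, available in Python 3.9+):

```
import math

x = 2.32781 ** 55   # roughly 1.5 * 10**20

# math.ulp(x) is the distance from x to the next representable float.
# Near 1.5e20 that gap is in the tens of thousands, so adding 1
# can't budge the value at all.
print(math.ulp(x) > 1)   # True
print(x + 1 == x)        # True
```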

But when do we need to worry about these issues?

The answer is that it depends on whether you’re using regular, vanilla Python, or `numpy` / `pandas`.

### Integer Overflows *in Python*

Python is meant to be a friendly language, and one manifestation of that is that in vanilla Python, you can’t overflow your integers! That’s because whenever Python does an integer computation, it stops to check whether the integer in question has been allocated enough bits to store the result, and if not, it just allocates more bits. So if the result of your math won’t fit in 64 bits, Python simply gives the integer more bits rather than letting it overflow.

```
[6]:
```

```
# Here's a really big integer
x = 2**63
```

```
[7]:
```

```
# Now let's make it bigger so it can't fit in 64 bits!
x = x ** 4
x
```

```
[7]:
```

```
7237005577332262213973186563042994240829374041602535252466099000494570602496
```

See? No problem!
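You can actually watch Python allocate those extra bits: the standard library’s `sys.getsizeof` reports how many bytes an object occupies, and for integers that number grows with the value (the exact sizes vary slightly by Python version).

```
import sys

# Bigger integers simply occupy more bytes.
for value in (1, 2**62, 2**252, 2**1000):
    print(value.bit_length(), sys.getsizeof(value))
```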

### Integer Overflows *in numpy and pandas*

The problem with what Python does with integers is that, while convenient, it’s slow. Asking Python to add two integers doesn’t just require the computer to add two integers; it requires it to *also* check the size of the result, and if that size is so big it won’t fit in the existing number of bits that have been allocated, it has to allocate more bits. This makes adding integers in Python much, much slower than it could be. Like… 10x slower.

That’s why libraries like `numpy` and `pandas` – which are designed for performance when working with huge datasets – don’t check for integer overflows. This makes them *much* faster, but if you add two really big integers in `numpy` (or add even small numbers to a *really* big number) and the result is bigger than what fits in the available bits, you’ll just end up with a negative number.

How much faster? Here’s a comparison of adding up the first million integers using regular Python integers (which check for overflows) and using `numpy` tools (which do not). Some of this difference comes from things other than overflow checking, but it gives you a sense of the performance cost of making integers safer in regular Python:

```
[46]:
```

```
# Regular Python:
%timeit sum(range(1000000))
```

```
22.9 ms ± 464 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

```
[47]:
```

```
# Numpy
import numpy as np
%timeit np.sum(np.arange(1000000))
```

```
1.23 ms ± 16.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

But as I said, while it may be fast, it can also be dangerous:

```
[3]:
```

```
import numpy as np
a = np.array([2**63-1, 2**63-1], dtype='int')
a
```

```
[3]:
```

```
array([9223372036854775807, 9223372036854775807])
```

```
[4]:
```

```
a + 1
```

```
[4]:
```

```
array([-9223372036854775808, -9223372036854775808])
```

It’s also important to understand that with `numpy` and `pandas`, you control the size of your integers, and thus how big an integer you can make before you have overflow problems. By default, `numpy` will make your integers the size your system’s processor works with natively (usually 64 bits on a modern computer, but sometimes 32 bits on an older one). But `numpy` also lets you make arrays that are 16 bits (`int16`), 32 bits (`int32`), or 64 bits (`int64`). This can be very useful when working with big datasets: smaller integers take up less memory, and sometimes calculations with smaller integers are faster due to some intricacies of how computers use memory. But if you do use smaller integer sizes, then you really need to be careful with your overflows: `int16` can only store numbers up to 32,767!

```
[5]:
```

```
x = np.array(32767, dtype='int16')  # the largest value int16 can hold
x + 1
```

```
[5]:
```

```
-32768
```
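If you’re ever unsure which sizes are safe, `np.iinfo` reports the exact range of each integer type, and `.nbytes` confirms the memory savings from smaller types (a small sketch; the array values here are arbitrary):

```
import numpy as np

# The representable range of each integer dtype:
for dtype in ('int16', 'int32', 'int64'):
    info = np.iinfo(dtype)
    print(dtype, info.min, info.max)

# Smaller dtypes really do use less memory:
big = np.arange(1_000_000, dtype='int64')
small = big.astype('int32')       # every value here fits in int32
print(big.nbytes, small.nbytes)   # 8000000 4000000
```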

Also, note that `numpy` and `pandas` have “unsigned” integers (`uint16`, `uint32`, `uint64`). These are like regular integers, except they don’t allocate half their values to negative numbers, so their upper limit is twice that of the same-sized regular integer. In general, though, it’s good to avoid unsigned integers, as it’s too easy to *underflow* by hitting the *bottom* of the range they can tolerate (i.e. going below zero):

```
[6]:
```

```
x = np.array([2], dtype='uint64')
x
```

```
[6]:
```

```
array([2], dtype=uint64)
```

```
[8]:
```

```
x - 3
```

```
[8]:
```

```
array([18446744073709551615], dtype=uint64)
```
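One defensive pattern, sketched below under the assumption that negative results are possible in your data: check that the result will stay non-negative *before* subtracting, and cast to a signed type when it won’t.

```
import numpy as np

x = np.array([2], dtype='uint64')
n = 3

# Check before subtracting instead of discovering the underflow afterwards.
if (x >= n).all():
    result = x - n
else:
    result = x.astype('int64') - n   # signed, so the answer can be negative
print(result)   # [-1]
```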

## Floating Point Precision

Unfortunately, while vanilla Python can protect you from integer overflows, it can’t do anything about floating point precision. Whether you’re using `numpy` or not, you’re stuck with these types of things:

```
[9]:
```

```
0.1 + 0.1 + 0.1 == 0.3
```

```
[9]:
```

```
False
```

and

```
[10]:
```

```
2.32781**55 + 1 == 2.32781**55
```

```
[10]:
```

```
True
```
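If you need exact decimal arithmetic (when handling money, for example), Python’s standard `decimal` module sidesteps this particular problem by storing base-10 digits directly, at the cost of speed:

```
from decimal import Decimal

# Decimals are stored in base 10, so 0.1 is represented exactly.
print(Decimal('0.1') + Decimal('0.1') + Decimal('0.1') == Decimal('0.3'))  # True
print(0.1 + 0.1 + 0.1 == 0.3)                                             # False
```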

But you also get surprising results, like 7.5 rounding up and 10.5 rounding down:

```
[21]:
```

```
round(7.5)
```

```
[21]:
```

```
8
```

```
[22]:
```

```
round(10.5)
```

```
[22]:
```

```
10
```
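That pair of results is actually Python’s documented behavior rather than floating point error: `round` sends exact halves to the nearest *even* integer (“banker’s rounding”), which keeps repeated rounding from biasing results upward:

```
# Exact halves round to the nearest even integer.
for value in (0.5, 1.5, 2.5, 7.5, 10.5):
    print(value, round(value))   # 0, 2, 2, 8, 10
```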

So just remember: whatever you’re doing with floating point numbers, exact, knife-edge tests may do weird things.

**If you want protections against this**, consider using the `isclose`

function from `numpy`

library, which will return `True`

if the two arguments it is passed are *really* close. (by *really* close, I mean that `np.isclose(a, b)`

checks for whether \(absolute(a - b) <= (atol + rtol * absolute(b))\) where the relative tolerance (\(rtol\)) is \(10^{-5}\), and the absolute tolerance (\(atol\)) is \(10^{-8}\) by default. You can also change these tolerances if you want,
as shown in the help file).

```
[64]:
```

```
np.isclose(0.1 + 0.1 + 0.1, 0.3)
```

```
[64]:
```

```
True
```
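The standard library offers `math.isclose` as well; note its defaults differ from `numpy`’s (`rel_tol=1e-09` and `abs_tol=0.0`), so comparisons against zero need an explicit `abs_tol`:

```
import math

print(math.isclose(0.1 + 0.1 + 0.1, 0.3))        # True
print(math.isclose(1e-12, 0.0))                  # False: abs_tol defaults to 0
print(math.isclose(1e-12, 0.0, abs_tol=1e-8))    # True
```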

## Exercises!

do exercises