In Exercise 4, the Cities: Temperatures and Density
question had very different running times, depending on how you approached the haversine calculation.
Why? You were doing the same basic computation either way.
Here's a reduced version of the problem: we'll create this DataFrame:
```python
import numpy as np
import pandas as pd

n = 100000000
df = pd.DataFrame({
    'a': np.random.randn(n),
    'b': np.random.randn(n),
    'c': np.random.randn(n),
})
```
… and (for some reason) we want to calculate

\[ \sin(a-1) + 1\,. \]

The underlying problem: DataFrames (in Pandas here, but also Spark) are an abstraction of what's really going on. Underneath, there's some memory being moved around and computation happening.
The abstraction is leaky, as they all are.
Having a sense of what's happening behind the scenes will help us use the tools effectively.
One fact to notice: each Pandas Series is stored as a NumPy array, i.e. an array that is already in memory, so referring to it is basically free:

```python
df['col'].values  # a column
```

This isn't in memory (in this form) and must be constructed:

```python
df.iloc[0]  # a row
```
So, any time we operate on a Pandas Series as a unit, it's probably going to be fast.
Pandas is column-oriented: it stores columns in contiguous memory.
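If you want to see that asymmetry directly, here's a minimal sketch (the small 10000-row DataFrame and the `timeit` scaffolding are my additions, just for illustration):

```python
import timeit

import numpy as np
import pandas as pd

small_df = pd.DataFrame({'a': np.random.randn(10_000)})

# The column's backing NumPy array already exists; .values just exposes it.
print(type(small_df['a'].values))  # <class 'numpy.ndarray'>

# Returning the existing array vs. constructing a row Series on demand:
print(timeit.timeit(lambda: small_df['a'].values, number=10_000))
print(timeit.timeit(lambda: small_df.iloc[0], number=10_000))
```

Expect the `iloc[0]` timing to be much larger per call: each access builds a new Series object.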
This is (analogous to) the solution I was hoping for:
```python
def do_work_numpy(a: np.ndarray) -> np.ndarray:
    return np.sin(a - 1) + 1

result = do_work_numpy(df['a'])
```
The arithmetic is done as single operations on NumPy arrays.
The `np.sin` and the `+`/`-` operations are done by NumPy at C speeds (with SSE instructions).

Running time: 0.75 s.
This wasn't an option on E4 because you needed two columns, but for this problem:

```python
import math

def do_work(a: float) -> float:
    return math.sin(a - 1) + 1

result = df['a'].apply(do_work)
```
The `do_work` function gets called \(n\) times (once for each element in the Series). Arithmetic done in Python.
Running time: 11.6 s.
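To convince yourself that `.apply()` really makes one Python-level call per element, here's a small counting sketch (the `counted` function and the 1000-element Series are hypothetical, just for illustration):

```python
import math

import numpy as np
import pandas as pd

calls = 0

def counted(a: float) -> float:
    global calls
    calls += 1
    return math.sin(a - 1) + 1

small = pd.Series(np.random.randn(1000))
small.apply(counted)
print(calls)  # 1000: one Python function call per element
```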
I saw something like this a few times:
```python
def do_work(a: float) -> float:
    return math.sin(a - 1) + 1

do_work_vector = np.vectorize(do_work, otypes=[np.float64])
result = do_work_vector(df['a'])
```
The `do_work` function still gets called \(n\) times, but it's hidden by `vectorize`, which makes it look like a NumPy function. Arithmetic still done in Python.
Running time: 10.0 s.
If you used `.apply()` twice in E4, it was like:
```python
def do_work_row(row: pd.Series) -> float:
    return math.sin(row['a'] - 1) + 1

result = df.apply(do_work_row, axis=1)
```
This is a by-row application: `do_work_row` is called on every row in the DataFrame. But the rows don't exist in memory, so they must be constructed. Then the function is called, and the arithmetic is done in Python.
Running time: 194 s.
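For the two-column case in E4, the by-row `.apply()` wasn't forced on you: a function of two columns can still take whole columns as arrays. A sketch (using our toy formula rather than the real haversine):

```python
# Hypothetical two-column version: both columns passed as whole arrays,
# so the arithmetic stays in NumPy instead of per-row Python calls.
def do_work_two(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    return np.sin(a - b) + 1

result_two = do_work_two(df['a'], df['b'])
```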
This is what the "no loops" restriction prevented:
```python
def do_work_python(a: np.ndarray) -> np.ndarray:
    result = np.empty(a.shape)
    for i in range(a.size):
        result[i] = math.sin(a[i] - 1) + 1
    return result

result = do_work_python(df['a'])
```
The loop is done in Python; the arithmetic is done in Python.
Running time: 95 s.
Let's look again at the best-so-far version:
```python
def do_work_numpy(a: np.ndarray) -> np.ndarray:
    return np.sin(a - 1) + 1

result = do_work_numpy(df['a'])
```
NumPy has to calculate and store each intermediate result, which creates overhead. This is a limitation of Python & the NumPy API: NumPy calculates `a - 1`, then calls `np.sin` on the result, then adds 1 to that result.
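Spelled out step by step, the expression behaves like this sketch (the temporary names `tmp1`/`tmp2` are mine):

```python
# Each step allocates and fills a full n-element temporary array.
tmp1 = df['a'].values - 1  # first temporary: a - 1
tmp2 = np.sin(tmp1)        # second temporary: sin(a - 1)
result = tmp2 + 1          # final array: sin(a - 1) + 1
```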
The NumExpr package overcomes this: it has its own expression syntax that gets compiled internally. Then you can apply that compiled expression to your local variables.
```python
import numexpr

def do_work_numexpr(a: np.ndarray) -> np.ndarray:
    expr = 'sin(a - 1) + 1'
    return numexpr.evaluate(expr, local_dict=locals())

result = do_work_numexpr(df['a'])
```
This way, the whole expression can be calculated (on each element, in some C code somewhere, using the SSE instructions), and the result stored in a new array.
Running time: 0.10 s.
NumExpr also powers the Pandas `eval` function: `pd.eval('sin(a - 1) + 1', engine='numexpr')`.
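If the values live in a DataFrame, `DataFrame.eval` resolves bare names as column names; a sketch (assuming NumExpr is installed, since Pandas only uses it when available):

```python
# 'a' is resolved as the column df['a'].
result = df.eval('sin(a - 1) + 1')
```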
| Method | Time | Relative Time |
|---|---|---|
| NumPy expression | 0.75 s | 1.00 |
| Series.apply | 11.6 s | 15.34 |
| Vectorized | 10.0 s | 13.18 |
| DataFrame.apply | 194.0 s | 256.52 |
| Python loop | 94.8 s | 125.70 |
| NumExpr | 0.1 s | 0.14 |
But are any of these *fast*?
The same operation in C code took 0.625 s (`gcc -O3`). A little faster than NumPy, but several times slower than NumExpr.
It's not obvious, but NumExpr does the calculations in parallel by default: it used all of my CPU cores automatically. It got a 7.3× speedup over plain NumPy.
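You can check the parallelism claim yourself; this sketch uses NumExpr's `set_num_threads` and `detect_number_of_cores` functions, with my own timing scaffolding around them:

```python
import time

import numexpr
import numpy as np

a = np.random.randn(100_000_000)

def timed() -> float:
    start = time.perf_counter()
    numexpr.evaluate('sin(a - 1) + 1')
    return time.perf_counter() - start

numexpr.set_num_threads(1)  # restrict NumExpr to a single thread
print('1 thread: ', timed())

numexpr.set_num_threads(numexpr.detect_number_of_cores())  # default: all cores
print('all cores:', timed())
```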
Lessons:

- Operate on whole Series/arrays at once, so the arithmetic happens in compiled code, not in Python.
- Anything that calls a Python function once per element (`.apply`, `np.vectorize`, explicit loops) will be slow; by-row `.apply` is the worst, because the rows must be constructed first.
- NumExpr can beat plain NumPy by avoiding intermediate arrays and running in parallel.
Don't believe me? Notebook with the code.
Don't believe me even more? A Beginner’s Guide to Optimizing Pandas Code for Speed. [I did it first, I swear!]