In Exercise 4, the Cities: Temperatures and Density question had very different running times, depending on how you approached the haversine calculation.

Why? You were doing the same basic computation either way.

The answer will lead nicely into problems we'll see again in the Big Data topic.

Here's a reduced version of the problem: we'll create this DataFrame:

```python
import numpy as np
import pandas as pd

n = 100000000
df = pd.DataFrame({
    'a': np.random.randn(n),
    'b': np.random.randn(n),
    'c': np.random.randn(n),
})
```

… and (for some reason) we want to calculate

\[ \sin(a-1) + 1\,. \]

The underlying problem: DataFrames (in Pandas here, but also Spark) are an abstraction of what's really going on. Underneath, there's memory being moved around and computation happening.

The abstraction is leaky, as they all are.

Having a sense of what's happening behind the scenes will help us use the tools effectively.

One fact to notice: each Pandas `Series` is stored as a NumPy array:

```python
df['col'].values
```

This is an array that *is already in memory*, so referring to it is basically free.

This isn't in memory (in this form) and must be constructed:

```python
df.iloc[0]  # a row object
```

So, any time we operate on a **Pandas series** as a unit, it's probably going to be fast.

Pandas is column-oriented: it stores columns in contiguous memory.
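A minimal sketch of that difference, using a small illustrative DataFrame (not the slides' 10⁸-row one):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(5), 'b': np.random.randn(5)})

col = df['a'].values  # the NumPy array backing the column: no per-element work
row = df.iloc[0]      # a new Series, assembled on demand from each column

print(type(col))  # <class 'numpy.ndarray'>
print(type(row))  # <class 'pandas.core.series.Series'>
```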

The solution I was hoping for:

```python
def do_work_numpy(a):
    return np.sin(a - 1) + 1

result = do_work_numpy(df['a'])
```

The arithmetic is done as single operations on NumPy arrays.

The `np.sin` and the `+`/`-` operations are done by NumPy at C speeds (with SSE instructions).

Running time: 1.30 s.

This wasn't an option on E4 because you needed *two* columns, but for this problem:

```python
import math

def do_work(a):
    return math.sin(a - 1) + 1

result = df['a'].apply(do_work)
```

The `do_work` function gets called \(n\) times (once for each element in the series). Arithmetic done in Python.

Running time: 20.6 s.
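To see the gap without waiting on 10⁸ rows, here's a hedged timing sketch (a much smaller `n`, so the absolute numbers will differ from the slides'):

```python
import math
import timeit

import numpy as np
import pandas as pd

n = 100_000
df = pd.DataFrame({'a': np.random.randn(n)})

# whole-array NumPy expression vs per-element .apply()
t_numpy = timeit.timeit(lambda: np.sin(df['a'] - 1) + 1, number=5)
t_apply = timeit.timeit(lambda: df['a'].apply(lambda x: math.sin(x - 1) + 1),
                        number=5)
print(f'NumPy: {t_numpy:.4f} s   apply: {t_apply:.4f} s')
```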

I saw something like this a few times:

```python
def do_work(a):
    return math.sin(a - 1) + 1

do_work_vector = np.vectorize(do_work, otypes=[float])
result = do_work_vector(df['a'])
```

The `do_work` function *still* gets called \(n\) times, but it's hidden by `vectorize`, which makes it *look* like a NumPy function. Arithmetic still done in Python.

Running time: 17.1 s.

If you used `.apply()` twice in E4, it was like:

```python
def do_work_row(row):
    return math.sin(row['a'] - 1) + 1

result = df.apply(do_work_row, axis=1)
```

This is a by-row application: `do_work_row` is called on *every row* in the DataFrame. But the rows don't exist in memory, so they must be constructed. Then the function is called, and the arithmetic is done in Python.

Running time: 736 s.

This is what the no-loops restriction prevented:

```python
def do_work_python(a):
    result = np.empty(a.shape)
    for i in range(a.size):
        result[i] = math.sin(a[i] - 1) + 1
    return result

result = do_work_python(df['a'])
```

The loop is done in Python; the arithmetic is done in Python.

Running time: 1230 s.

Let's look again at the best-so-far version:

```python
def do_work_numpy(a):
    return np.sin(a - 1) + 1

result = do_work_numpy(df['a'])
```

NumPy has to calculate and store each intermediate result, which creates overhead. This is a limitation of Python & the NumPy API: NumPy calculates `a - 1`, then calls `np.sin` on the result, then adds 1 to that result.
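Spelled out, that one-liner is really three separate NumPy operations, each allocating a full-size array (`tmp1` and `tmp2` are names I've made up for the temporaries):

```python
import numpy as np

a = np.random.randn(1000)

tmp1 = a - 1          # temporary array #1 is allocated and filled
tmp2 = np.sin(tmp1)   # temporary array #2
result = tmp2 + 1     # the final array

# same values as np.sin(a - 1) + 1, but the memory traffic is explicit
```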

The NumExpr package overcomes this: it has its own expression syntax that gets compiled internally. Then you can apply that expression (to the local variables).

```python
import numexpr

def do_work_numexpr(a):
    expr = 'sin(a - 1) + 1'
    return numexpr.evaluate(expr, local_dict=locals())

result = do_work_numexpr(df['a'])
```

This way, the whole expression can be calculated (on each element, in some C code somewhere, using the SSE instructions), and the result stored in a new array.

Running time: 0.193 s.

NumExpr also powers the Pandas `eval` function: `pd.eval('sin(a - 1) + 1', engine='numexpr')`.
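Column names also resolve automatically in `DataFrame.eval`, so the same expression can be applied directly to a DataFrame. A small sketch (Pandas uses the numexpr engine when the package is installed):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(1000)})

# 'a' refers to the column; math functions like sin are understood
result = df.eval('sin(a - 1) + 1')
```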

| Method | Time | Relative Time |
|---|---|---|
| NumPy expression | 1.30 s | 1.00 |
| Series.apply | 20.6 s | 15.85 |
| Vectorized | 17.1 s | 13.10 |
| DataFrame.apply | 736 s | 565.02 |
| Python loop | 1230 s | 944.32 |
| NumExpr | 0.193 s | 0.15 |

But are any of these *fast*?

The same operation in C code took 0.994 s (`gcc -O3`). Faster than NumPy, but **several times slower** than NumExpr.

It's not obvious, but NumExpr does the calculations in parallel by default. This was a six-core processor, and it got a 6.74× speedup over plain NumPy.
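You can inspect and change that parallelism: NumExpr exposes its thread count. A quick check, assuming the numexpr package is available (`set_num_threads` returns the previous setting):

```python
import numexpr

print(numexpr.detect_number_of_cores())  # how many cores NumExpr can see

old = numexpr.set_num_threads(1)  # force single-threaded; returns old count
numexpr.set_num_threads(old)      # restore the previous setting
```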

Lessons:

- The abstractions you're using need to be in the back of your head somewhere.
- Moving data around in memory is expensive.
- Python is still slow, but NumPy (and friends) do a good job insulating us from that.

Don't believe me? Notebook with the code.

Don't believe me even more? A Beginner’s Guide to Optimizing Pandas Code for Speed. [I did it first, I swear!]

Extra bonus late addition to these slides: a notebook that times Pandas vs Dask on haversine calculations.

For a large enough data set, Dask *should* speed things up. (8.8M rows in this test.)

Times on my office desktop (4 core/8 thread):

| Method | Time | Relative Time |
|---|---|---|
| Pandas | 1.24 s | 1.00 |
| NumExpr | 2.03 s | 1.64 |
| Dask, single file | 1.43 s | 1.15 |
| Dask, partitioned | 0.76 s | 0.61 |

Times in CSIL, giving Dask 10 workstations as workers:

| Method | Time | Relative Time |
|---|---|---|
| Pandas | 0.85 s | 1.00 |
| NumExpr | 1.30 s | 1.53 |
| Dask, single file | 0.60 s | 0.71 |
| Dask, partitioned | 0.53 s | 0.62 |

The visualization of the task graph that Dask generates (for the partitioned input case) gives a lot of info about what it's going to do.