Python doesn't seem like an obvious choice for data analysis.

Fun to write, but not noted for being fast, and somehow bad at data…

There is one obvious data structure for storing data: the Python list.

`values = [10, 20, 30, 40]`

This is represented (in CPython) as an array of pointers to Python objects.

- Not memory efficient.
- Bad locality of reference (i.e. poor memory cache performance).
- Each value is a Python objects with extra memory overhead.

So, Python is a bad language to represent collections of data?

No, only the built-in options. The solution is to not use Python lists for large array-like data.

The standard solution for array-like data in Python.

Gives a data types for fixed-type \(n\)-dimensional arrays. Represented as a C-style array internally.

Also includes many useful tools for working with those arrays efficiently.

Importing:

import numpy as np

By convention, import as `np`

(not `numpy`

) to save a few characters (since it's used a lot).

a = np.array([10, 20, 30, 40], dtype=np.int)

This allocates an array equivalent to the C:

int values[4] = {10, 20, 30, 40}

[Initializing from a Python list is usually a bad idea: requires memory allocation for both.]

NumPy arrays have a fixed type for all elements. Chosen from the NumPy types (or custom types). Some examples:

`np.byte`

`np.int32`

`np.int64`

`np.uint64`

`np.double`

`np.complex`

`np.str_`

`np.object_`

They also have a fixed shape with any number of dimensions:

print(a.shape) twodim = np.array([[1,2,3],[4,5,6]], dtype=np.float) print(twodim.ndim, twodim.dtype, twodim.shape) threedim = np.array([[[1,2], [3,4]], [[5,6], [6,7]]]) print(threedim.ndim, threedim.dtype, threedim.shape)

Output:

(4,) 2 float64 (2, 3) 3 int64 (2, 2, 2)

Creating from Python lists is okay for toy examples, but you should usually construct more efficiently.

print( np.arange(6, 14) ) print( np.linspace(0, 5, 11, dtype=np.float32) ) print( np.empty((2, 3), dtype=np.int) ) # uninitialized! print( np.zeros((2, 3)) )

[ 6 7 8 9 10 11 12 13] [ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5 5. ] [[140506921020328 140506921020328 140506806120376] [140506776227352 632708812243096 0]] [[ 0. 0. 0.] [ 0. 0. 0.]]

NumPy is happy with large arrays (up to your memory limits):

print( np.zeros((1000, 10000)) )

[[ 0. 0. 0. ..., 0. 0. 0.] [ 0. 0. 0. ..., 0. 0. 0.] [ 0. 0. 0. ..., 0. 0. 0.] ..., [ 0. 0. 0. ..., 0. 0. 0.] [ 0. 0. 0. ..., 0. 0. 0.] [ 0. 0. 0. ..., 0. 0. 0.]]

There is a collection of array creation functions for convenience:

print( np.identity(4) ) print( np.random.rand(3,4) )

[[ 1. 0. 0. 0.] [ 0. 1. 0. 0.] [ 0. 0. 1. 0.] [ 0. 0. 0. 1.]] [[ 0.50385525 0.85417757 0.02541011 0.83578425] [ 0.44444898 0.6818858 0.8581728 0.09154115] [ 0.86676983 0.39580819 0.52601014 0.83193706]]

Result: memory efficiency like C. Programming with Python.

😄NumPy includes many tricks to operate on arrays… mostly the whole array at once.

If you're writing a loop, you're probably doing it wrong.

**Course rule**: no loops. (Except where specified, or in the project if you really need them.)

some_ints = np.array([[1,2], [3,4]], dtype=np.int64) ident = np.identity(2, dtype=np.int32) print(some_ints + ident) print(some_ints * ident) # element-wise multiplication print(some_ints @ ident) # matrix multiplication, or np.dot() print(some_ints * 3) # multiplication by a scalar

[[2 2] [3 5]] [[1 0] [0 4]] [[1 2] [3 4]] [[ 3 6] [ 9 12]]

NumPy has broadcasting rules that determine how operands are combined. Roughly:

- Identically-shaped arrays: element-wise.
- Array and scalar: scalar applied to each element.
- \(n\)-dim and vector: stretch the vector.
- Higher-dimensional analogue of that.

There are many methods in arrays and functions in NumPy that operate on arrays as a unit.

print(some_ints) print(some_ints.sum()) print(some_ints.sum(axis=0)) print(some_ints.sum(axis=1)) print(np.sum(some_ints, axis=1))

[[1 2] [3 4]] 10 [4 6] [3 7] [3 7]

The `np.vectorize`

function turns an existing function into a ufunc that can operate on an entire array by implicitly iterating the function.

def double_plus_single(x): return 2*x + 1 double_plus = np.vectorize(double_plus_single, otypes=[np.float]) res = double_plus(some_ints) print(res) print(res.dtype)

[[3. 5.] [7. 9.]] float64

A ufunc that returns tuples of values creates *multiple arrays*, not an array containing tuples.

def smaller_bigger_single(x): return (x-1, x+1) smaller_bigger = np.vectorize(smaller_bigger_single, otypes=[np.int, np.int]) sm, bg = smaller_bigger(some_ints) print(sm) print(bg)

[[0 1] [2 3]] [[2 3] [4 5]]

Using `np.vectorize`

is flexible but not always fast, because it requires calling the Python function many times (which comes with some overhead).

e.g. first calculation of `res`

here takes about 6 times longer than the second (with identical results).

def double_plus_single(x): return 2*x + 1 double_plus = np.vectorize(double_plus_single, otypes=[np.float]) res = double_plus(some_ints) res = 2 * some_ints + 1

Indexing into a many-dimensional array can be intricate.

Let's experiment with indexing with NumPy (and Jupyter).

… and then with ways to iterate over arrays and how fast they are.

Warning: combinations of array dimensions as tuples, and array methods vs functions can be inconsistent.

a1a = np.zeros((3, 4)) # a1b = np.zeros(2, 3) # error: must be a tuple a2a = a1a.reshape((6, 2)) a2b = a1a.reshape(6, 2) a2c = np.reshape(a1a, (6, 2)) # a2d = np.reshape(a1a, 6, 2) # error: 2 is some other parameter a3a = np.random.rand(2, 3) #a3b = np.random.rand((2, 3)) # error: can't be a tuple a3c = np.random.random((2, 3)) # a3d = np.random.random(2, 3) # error: must be a tuple

Summary:

- NumPy is very good at storing arrays: single-type, many-dimensional, etc.
- Operating on arrays is best done with single NumPy functions/methods, which are fast.

NumPy is very good at what it does: arrays.

Single-type arrays and operations on them aren't all there is to data storage and manipulation.

Pandas is a Python library that gives richer tools for manipulating data.

- A Series is a 1D array, internally stored as a NumPy array.
- A DataFrame is a collection of columns (Series); the Series data forms rows.
- Entries in Series and DataFrames are labeled.

In some ways, a DataFrame is a lot like a SQL database table:

- DF Series ≈ SQL columns. Each column has a fixed type.
- DF row ≈ SQL row. One entry from each column.
- DF label ≈ SQL primary key.

… but the way you work with the data is usually quite different.

The Pandas module is usually imported as a short-form, like NumPy:

import numpy as np import pandas as pd

DataFrames can be created many ways, including a Python dict:

df = pd.DataFrame( { 'value': [1, 2, 3, 4], 'word': ['one', 'two', 'three', 'four'] } ) print(df)

value word 0 1 one 1 2 two 2 3 three 3 4 four

In a notebook, DataFrames are even nicer looking:

A label was added to the rows: an integer incrementing from zero.

Usually that's fine: indexing individual elements is rare anyway. We usually work on entire DataFrames/Series at once.

You can read/write entire series with a single operation:

df['double'] = df['value'] * 2 print(df) print(df[df['value'] % 2 == 0])

value word double 0 1 one 2 1 2 two 4 2 3 three 6 3 4 four 8 value word double 1 2 two 4 3 4 four 8

Note: the labels aren't an index: they are really labels that stay with the rows as they are manipulated.

Rows are identified by their label, not their order: that's usually very convenient.

french_words = pd.Series( ['trois', 'deux', 'un'], index=[2, 1, 0] ) df['french'] = french_words print(df)

value word double french 0 1 one 2 un 1 2 two 4 deux 2 3 three 6 trois 3 4 four 8 NaN

Also note: missing values are okay. Represented as not a number

.

Like with NumPy arrays, you almost always want to work with DataFrames as a unit, not row-by-row. Another example:

cities = pd.DataFrame( [[2463431, 2878.52], [1392609, 5110.21], [5928040, 5905.71]], columns=['population', 'area'], index=pd.Index(['Vancouver','Calgary','Toronto'], name='city') ) print(cities)

population area city Vancouver 2463431 2878.52 Calgary 1392609 5110.21 Toronto 5928040 5905.71

You can apply Python operators, or methods on a Series, or a NumPy ufunc, or ….

def double_plus(x): return 2*x + 1 double_plus = np.vectorize(double_plus, otypes=[np.float]) cities['density'] = cities['population'] / cities['area'] cities['pop-rank'] = cities['population'].rank(ascending=False) cities['something'] = double_plus(cities['population']) print(cities)

population area density pop-rank something city Vancouver 2463431 2878.52 855.797771 2.0 4926863.0 Calgary 1392609 5110.21 272.515024 3.0 2785219.0 Toronto 5928040 5905.71 1003.781086 1.0 11856081.0

It's occasionally useful to change the index to a data column, or a data column to the index.

del cities['something'] cities = cities.reset_index() cities = cities.set_index('pop-rank') print(cities)

city population area density pop-rank 2.0 Vancouver 2463431 2878.52 855.797771 3.0 Calgary 1392609 5110.21 272.515024 1.0 Toronto 5928040 5905.71 1003.781086