Python doesn't seem like an obvious choice for data analysis.
Fun to write, but not noted for being fast, and seemingly bad at handling data…
There is one obvious data structure for storing data: the Python list.
values = [10, 20, 30, 40]
This is represented (in CPython) as an array of pointers to Python objects.
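A quick, hedged illustration of that cost (exact sizes vary by platform and Python version):
import sys

values = [10, 20, 30, 40]
# The list itself holds only pointers; each int is a separate boxed object.
print(sys.getsizeof(values))     # size of the list's pointer array + header
print(sys.getsizeof(values[0]))  # each element costs a full object on top of that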
So, Python is a bad language to represent collections of data?
No, only the built-in options. The solution is to not use Python lists for large array-like data.
NumPy is the standard solution for array-like data in Python.
Gives a data type for fixed-type \(n\)-dimensional arrays, represented internally as a C-style array.
Also includes many useful tools for working with those arrays efficiently.
Importing:
import numpy as np
By convention, import as np (not numpy) to save a few characters (since it's used a lot).
a = np.array([10, 20, 30, 40], dtype=np.int64)
This allocates an array equivalent to the C:
long values[4] = {10, 20, 30, 40};
[Initializing from a Python list is usually a bad idea: it requires allocating memory for both the list and the array.]
NumPy arrays have a fixed type for all elements. Chosen from the NumPy types (or custom types). Some examples:
np.byte
np.int32
np.int64
np.uint64
np.double
np.complex128
np.str_
np.object_
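Since every element shares the array's dtype, anything you store is coerced to that type. A small sketch of the behaviour, using the int64 array from above:
a = np.array([10, 20, 30, 40], dtype=np.int64)
a[0] = 3.7        # silently truncated to the array's integer dtype
print(a)          # [ 3 20 30 40]
# a[1] = 'hello'  # ValueError: strings can't be coerced to int64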
They also have a fixed shape with any number of dimensions:
print(a.shape)
twodim = np.array([[1,2,3],[4,5,6]], dtype=np.float64)
print(twodim.ndim, twodim.dtype, twodim.shape)
threedim = np.array([[[1,2], [3,4]], [[5,6], [6,7]]])
print(threedim.ndim, threedim.dtype, threedim.shape)
Output:
(4,)
2 float64 (2, 3)
3 int64 (2, 2, 2)
Creating from Python lists is okay for toy examples, but you should usually construct arrays more efficiently.
print( np.arange(6, 14) )
print( np.linspace(0, 5, 11, dtype=np.float32) )
print( np.empty((2, 3), dtype=np.int64) )  # uninitialized!
print( np.zeros((2, 3)) )
[ 6  7  8  9 10 11 12 13]
[ 0.   0.5  1.   1.5  2.   2.5  3.   3.5  4.   4.5  5. ]
[[140506921020328 140506921020328 140506806120376]
 [140506776227352 632708812243096               0]]
[[ 0.  0.  0.]
 [ 0.  0.  0.]]
NumPy is happy with large arrays (up to your memory limits):
print( np.zeros((1000, 10000)) )
[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 ...,
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]
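A quick way to check what such an array costs in memory (float64 is 8 bytes per element):
big = np.zeros((1000, 10000))
print(big.nbytes)   # 80000000 bytes: 10 million float64s at 8 bytes each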
There is a collection of array creation functions for convenience:
print( np.identity(4) )
print( np.random.rand(3,4) )
[[ 1.  0.  0.  0.]
 [ 0.  1.  0.  0.]
 [ 0.  0.  1.  0.]
 [ 0.  0.  0.  1.]]
[[ 0.50385525  0.85417757  0.02541011  0.83578425]
 [ 0.44444898  0.6818858   0.8581728   0.09154115]
 [ 0.86676983  0.39580819  0.52601014  0.83193706]]
Result: memory efficiency like C, with the convenience of programming in Python.
NumPy includes many tricks to operate on arrays… mostly on the whole array at once.
If you're writing a loop, you're probably doing it wrong.
Course rule: no loops. (Except where specified, or in the project if you really need them.)
some_ints = np.array([[1,2], [3,4]], dtype=np.int64)
ident = np.identity(2, dtype=np.int32)
print(some_ints + ident)
print(some_ints * ident)  # element-wise multiplication
print(some_ints @ ident)  # matrix multiplication, or np.dot()
print(some_ints * 3)      # multiplication by a scalar
[[2 2]
 [3 5]]
[[1 0]
 [0 4]]
[[1 2]
 [3 4]]
[[ 3  6]
 [ 9 12]]
NumPy has broadcasting rules that determine how operands of different shapes are combined. Roughly: shapes are compared from the trailing dimension backwards; two dimensions are compatible if they are equal or one of them is 1; size-1 dimensions are stretched to match the other operand.
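A small sketch of those rules in action, combining a row with a column (shapes chosen just for illustration):
row = np.arange(3)                 # shape (3,)
col = np.arange(3).reshape(3, 1)   # shape (3, 1)
# (3,) and (3, 1) broadcast to (3, 3): the size-1 axes are stretched.
print(row + col)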
There are many methods on arrays, and functions in NumPy, that operate on an array as a unit.
print(some_ints)
print(some_ints.sum())
print(some_ints.sum(axis=0))
print(some_ints.sum(axis=1))
print(np.sum(some_ints, axis=1))
[[1 2]
 [3 4]]
10
[4 6]
[3 7]
[3 7]
The np.vectorize function turns an existing function into a ufunc that can operate on an entire array by implicitly iterating the function.
def double_plus_single(x):
    return 2*x + 1

double_plus = np.vectorize(double_plus_single, otypes=[np.float64])
res = double_plus(some_ints)
print(res)
print(res.dtype)
[[3. 5.]
 [7. 9.]]
float64
A ufunc that returns tuples of values creates multiple arrays, not an array containing tuples.
def smaller_bigger_single(x):
    return (x-1, x+1)

smaller_bigger = np.vectorize(smaller_bigger_single, otypes=[np.int64, np.int64])
sm, bg = smaller_bigger(some_ints)
print(sm)
print(bg)
[[0 1]
 [2 3]]
[[2 3]
 [4 5]]
Using np.vectorize is flexible but not always fast, because it calls the Python function once per element (which comes with some overhead).
E.g., the first calculation of res here takes about 6 times longer than the second (with identical results).
def double_plus_single(x):
    return 2*x + 1

double_plus = np.vectorize(double_plus_single, otypes=[np.float64])
res = double_plus(some_ints)   # slower: one Python call per element
res = 2 * some_ints + 1        # faster: whole-array operations
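If you want to check the difference yourself, a rough timing sketch with timeit (the array size here is illustrative, and the ~6× figure depends on your machine):
import timeit

big_ints = np.arange(1_000_000)   # a larger input, so the timing is visible
print(timeit.timeit(lambda: double_plus(big_ints), number=1))
print(timeit.timeit(lambda: 2 * big_ints + 1, number=1))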
Indexing into a multi-dimensional array can be intricate.
Let's experiment with indexing with NumPy (and Jupyter).
… and then with ways to iterate over arrays and how fast they are.
Warning: whether array dimensions are given as a tuple or as separate arguments, and whether an operation is an array method or a NumPy function, can be inconsistent.
a1a = np.zeros((3, 4))
# a1b = np.zeros(3, 4)           # error: must be a tuple
a2a = a1a.reshape((6, 2))
a2b = a1a.reshape(6, 2)
a2c = np.reshape(a1a, (6, 2))
# a2d = np.reshape(a1a, 6, 2)    # error: 2 is some other parameter
a3a = np.random.rand(2, 3)
# a3b = np.random.rand((2, 3))   # error: can't be a tuple
a3c = np.random.random((2, 3))
# a3d = np.random.random(2, 3)   # error: must be a tuple
Summary:
NumPy is very good at what it does: arrays.
Single-type arrays and operations on them aren't all there is to data storage and manipulation.
Pandas is a Python library that gives richer tools for manipulating data.
In some ways, a DataFrame is a lot like a SQL database table: named columns, each with a single type, holding rows of records…
… but the way you work with the data is usually quite different.
The Pandas module is usually imported as a short-form, like NumPy:
import numpy as np
import pandas as pd
DataFrames can be created many ways, including a Python dict:
df = pd.DataFrame({
    'value': [1, 2, 3, 4],
    'word': ['one', 'two', 'three', 'four'],
})
print(df)
   value   word
0      1    one
1      2    two
2      3  three
3      4   four
In a notebook, DataFrames are even nicer looking:
A label was added to the rows: an integer incrementing from zero.
Usually that's fine: indexing individual elements is rare anyway. We usually work on entire DataFrames/Series at once.
You can read/write entire series with a single operation:
df['double'] = df['value'] * 2
print(df)
print(df[df['value'] % 2 == 0])
   value   word  double
0      1    one       2
1      2    two       4
2      3  three       6
3      4   four       8
   value  word  double
1      2   two       4
3      4  four       8
Note: the labels aren't positions: they are really labels that stay with the rows as the rows are manipulated.
Rows are identified by their label, not their order: that's usually very convenient.
french_words = pd.Series(['trois', 'deux', 'un'], index=[2, 1, 0])
df['french'] = french_words
print(df)
   value   word  double french
0      1    one       2     un
1      2    two       4   deux
2      3  three       6  trois
3      4   four       8    NaN
Also note: missing values are okay. They are represented as NaN (not a number).
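If you need to find them, a minimal sketch (using the df built above):
print(df['french'].isnull())    # True for the row with no French word
print(df['french'].notnull())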
Like with NumPy arrays, you almost always want to work with DataFrames as a unit, not row-by-row. Another example:
cities = pd.DataFrame(
    [[2463431, 2878.52], [1392609, 5110.21], [5928040, 5905.71]],
    columns=['population', 'area'],
    index=pd.Index(['Vancouver', 'Calgary', 'Toronto'], name='city')
)
print(cities)
           population     area
city
Vancouver     2463431  2878.52
Calgary       1392609  5110.21
Toronto       5928040  5905.71
You can apply Python operators, or methods on a Series, or a NumPy ufunc, or ….
def double_plus(x):
    return 2*x + 1

double_plus = np.vectorize(double_plus, otypes=[np.float64])

cities['density'] = cities['population'] / cities['area']
cities['pop-rank'] = cities['population'].rank(ascending=False)
cities['something'] = double_plus(cities['population'])
print(cities)
           population     area      density  pop-rank   something
city
Vancouver     2463431  2878.52   855.797771       2.0   4926863.0
Calgary       1392609  5110.21   272.515024       3.0   2785219.0
Toronto       5928040  5905.71  1003.781086       1.0  11856081.0
It's occasionally useful to change the index to a data column, or a data column to the index.
del cities['something']
cities = cities.reset_index()
cities = cities.set_index('pop-rank')
print(cities)
               city  population     area      density
pop-rank
2.0       Vancouver     2463431  2878.52   855.797771
3.0         Calgary     1392609  5110.21   272.515024
1.0         Toronto     5928040  5905.71  1003.781086
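Note that reset_index and set_index return new DataFrames rather than modifying in place. A hypothetical continuation, restoring 'city' as the index:
cities = cities.reset_index().set_index('city')
print(cities.index.name)   # back to 'city'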