We'll be returning to material in the Python Data Science Handbook:
Another good treatment of ML: Hands-On Machine Learning, Aurélien Géron. [Hands-On ML: SFU Library]
Compared to CMPT 410:
I'm less concerned about machine learning as a topic of study, and more concerned about using it as a tool to make sense of data.
Obviously, 2–3 weeks will cover a lot less detail than a whole course.
We will generally use the scikit-learn libraries which implements common ML algorithms for Python.
Usual parts we'll have (for supervised
ML):
better?
We want to take the sample inputs and build a model that captures the calculation that is needed to produce correct
output.
If we have done it right, the model should be able to predict output for new inputs that we haven't seen before.
These models
of the problem won't be completely designed by us.
We will create the overall design of the model: choose the technique and set some parameters that need to be decided by a human, hyperparamters. Then we will train it on the known inputs/outputs to determing parameters in the model.
We just talked about linear regression as a statistical technique. It turns out that the machine learning people claim it too…
What we did before:
from scipy import stats
reg = stats.linregress(x, y)
print(reg.slope, reg.intercept)
0.523895506988829 -1.503281343279347
The same basic operation with scikit-learn to get the same results:
from sklearn.linear_model import LinearRegression
X = np.stack([x], axis=1)
model = LinearRegression(fit_intercept=True)
model.fit(X, y)
print(model.coef_[0], model.intercept_)
0.5238955069888295 -1.5032813432793506
Things to note…
The np.stack
function is used to join 1D arrays (in this case, only one) into a 2D array.
arr = np.array([1,2,3])
print(arr)
print(np.stack([arr], axis=1))
[1 2 3] [[1] [2] [3]]
In this example, .reshape(-1, 1)
would also work.
If you already have 2D arrays, np.concatenate
will put them together into a single 2D array.
Or just select appropriate columns from a Pandas DataFrame.
features = data[['x1', 'x2', 'x3']]
target = data['y']
model.fit(features, target)
The ML convention is to have 2D arrays (matrices) in capitalized variable names (X
vs x
), which isn't standard Python style.
X = np.stack([x], axis=1)
I hate it, but I'll go with it.
Another convention: scikit-learn uses trailing underscores to indicate values that were estimated/fitted/learned. *
print(model.coef_[0], model.intercept_)
Also not very Pythonic, but there it is.
Style aside, the operations we're doing are:
model = LinearRegression(fit_intercept=True)
model.fit(X, y)
Fit: compute the parameters that do the best job matching the model to the given data.
Also called training the model. The known correct inputs/outputs are the training data.
For this model, the parameters are slope and intercept. Best
is determined by the smallest sum of square error.
Once we have a model, we can use it to make predictions.
X_fit = [[15], [-19]]
y_fit = model.predict(X_fit)
print(y_fit)
[ 6.35515126 -11.45729598]
i.e. when the input is [15]
, the model predicts that the correct output is 6.35515126
.
Or we can predict the entire range and draw it:
plt.plot(x, y, 'b.')
plt.plot(X_fit, y_fit, 'g.', markersize=25)
plt.plot(x, model.predict(X), 'r-')
plt.legend(['training data', 'specific predictions from prev slide', 'predicted line'])
Summary:
When setting up the linear fit, we could have done the intercept
manually like this:
X_with = np.concatenate([np.ones_like(X), X], axis=1)
print(X_with[:4])
[[ 1. 29.16862815] [ 1. 12.8018865 ] [ 1. 19.74485581] [ 1. 34.89071839]]
Then we don't need the fit to worry about an intercept…
Then we don't need the regression to worry about an intercept.
model = LinearRegression(fit_intercept=False)
model.fit(X_with, y)
print(model.coef_)
[-1.50328134 0.52389551]
Same intercept and slope. We basically just fit two a_i parameters in:
y = a_0 1 + a_1 x\,.What do we do when we get non-linear values, where a line won't be a good model?
Obvious answer: fit something with more freedom than just a line. We're going to need a different function with more parameters.
If we can create an array with 1=x^0 and x=x^1 values for a linear fit, we can do more of the same.
And, we don't even have to do it manually.
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=3, include_bias=True)
X = np.array([[2], [3], [4]])
print(poly.fit_transform(X))
[[ 1. 2. 4. 8.] [ 1. 3. 9. 27.] [ 1. 4. 16. 64.]]
[[ 1. 2. 4. 8.] [ 1. 3. 9. 27.] [ 1. 4. 16. 64.]]
If we can do a linear fit to that input, we're finding a_i parameters to fit this to the data:
y = a_0 x^0 + a_1 x^1 + a_2 x^2 + a_3 x^3\,.We can use this to predict with a polynomial:
poly = PolynomialFeatures(degree=3, include_bias=True)
X_poly = poly.fit_transform(X)
model = LinearRegression(fit_intercept=False)
model.fit(X_poly, y)
What did the fit really decide?
print(model.coef_)
[ 5.16190050e-01 1.29319129e-01 -4.21668370e-04 -1.06965776e-03]
In other words, our predictions are,
y_{\mathrm{pred}} = 0.516 + 0.129 x - 0.000422 x^2 - 0.00107 x^3\,.That seems useful: we can use higher-degree polynomial inputs and simple linear fitting to allow more flexibility.
A higher-degree polynomial allows the .fit()
process more freedom to fit the data better.
If some is good, more must be better.
poly = PolynomialFeatures(degree=11, include_bias=True)
X_poly = poly.fit_transform(X)
model = LinearRegression(fit_intercept=False)
model.fit(X_poly, y)
The fit line does an excellent job fitting the points, but a bad job fitting what I think is going on.
What we did: take an x value and transform it to several f_1(x), f_2(x), \ldots, f_k(x) values, then do a linear fit to find a_i that give the best prediction,
y_{\mathrm{predicted}} = a_1 f_1(x) + a_2 f_2(x) + \cdots + a_k f_k(x)\,.The f_i functions are basis functions.
There's nothing special about the powers (except that they're often useful).
These could have been any functions: Gaussian curves, sine/cosine curves, logarithms/exponentials, … . You can choose whatever makes sense for your data.
It's common to have a situation like this: transform the data (in one, two, or more ways). Then fit some model to the result.
That becomes cumbersome: constantly must transform the X
values when trying to use the model.
poly = PolynomialFeatures(degree=3, include_bias=True)
X_poly = poly.fit_transform(X)
model = LinearRegression(fit_intercept=False)
model.fit(X_poly, y)
And later,
plt.plot(X_range[:, 0], model.predict(poly.transform(X_range)), 'r-')
Scikit-learn can combine several steps into a pipeline model:
from sklearn.pipeline import make_pipeline
model = make_pipeline(
PolynomialFeatures(degree=3, include_bias=True),
LinearRegression(fit_intercept=False)
)
This model takes our X
values, creates polynomial features, and does a linear fit on them.
Then we can treat this like a single model and not explicitly worry about the steps.
model = make_pipeline(
PolynomialFeatures(degree=3, include_bias=True),
LinearRegression(fit_intercept=False)
)
model.fit(X, y)
plt.plot(X_range[:, 0], model.predict(X_range), 'r-')
Pipeline steps are all scikit-learn estimator objects.
Steps 1 to n-1 will be transformers that can .fit()
and .transform()
to modify the values (or .fit_transform()
which combines the two).
The last step is an estimator which can .fit()
and .predict()
.
It's easy to end up with several steps, and nice to not have to deal with each one every time you work with X
values.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
model = make_pipeline(
SimpleImputer(strategy='mean'), # impute missing values
MinMaxScaler(), # scale each feature to 0-1
PolynomialFeatures(degree=3, include_bias=True),
LinearRegression(fit_intercept=False)
)
Course rule: if it can be a pipeline, it must be.
We can create a pipeline that uses a custom basis function for linear regression. Our transformation function takes the X
array and returns a new one.
def sigmoid_basis(X): # [[x]] -> [[1, x, 1/(1 + e^x)]]
one = np.ones_like(X)
sigmoid = 1 / (1 + np.exp(X))
return np.concatenate((one, X, sigmoid), axis=1)
from sklearn.preprocessing import FunctionTransformer
model = make_pipeline(
FunctionTransformer(sigmoid_basis, validate=True),
LinearRegression(fit_intercept=False)
)
model.fit(X, y)
print(model.named_steps['linearregression'].coef_)
Results: excellent (particularly since I knew the data source when designing the basis functions)
[ 1.01205597 -0.00198257 -1.03079678]
The high-degree polynomial had a lot of freedom to fit the points, but somehow followed them too closely. How can we decide that is the wrong thing (for a real data set)?
The problem: we're relying too much on this exact data set. We train on it, and then claim it fits the data perfectly. Of course it does.
We are overfitting the training data.
We need to look at the model on some other data where we can get a more honest evaluation.
Common solution: break the available data up.
The usual thing to do is to split into training data which is used to fit the model, and validation data which is used to evaluate how well that went.
If the model does well on the validation data (which it hasn't seen before), that's a good sign.
As usual, we can do that in one line:
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y)
print(X_train.shape, X_valid.shape)
print(y_train.shape, y_valid.shape)
(45, 1) (15, 1) (45,) (15,)
We can use the validation data to produce a score
: some value where larger = better fit.
model.fit(X_train, y_train)
print(model.score(X_valid, y_valid))
0.9742339805873135
We have some confidence that the score is related to how well the model will predict for new never-before-seen values.
It also makes some sense to compare the scores on the training and validation data. It will usually be a little lower on the validation data, but being much smaller is a sign of overfitting.
print(model.score(X_train, y_train))
print(model.score(X_valid, y_valid))
0.9837583175794484 0.9742339805873135
The .score()
method is vaguely equivalent to:
y_predicted = model.predict(X_valid) score = somehow_score_predictions(y_valid, y_predicted)
But was the value we found good? We could explore…
def score_polyfit(n):
model = make_pipeline(
PolynomialFeatures(degree=n, include_bias=True),
LinearRegression(fit_intercept=False)
)
model.fit(X_train, y_train)
print('n=%i: score=%.5g' % (n, model.score(X_valid, y_valid)))
score_polyfit(1)
score_polyfit(5)
score_polyfit(9)
score_polyfit(13)
score_polyfit(17)
n=1: score=0.8202 n=5: score=0.97423 n=9: score=0.98873 n=13: score=0.98741 n=17: score=0.42737
Apparently degree 9 is pretty good.
n=1: score=0.8202 n=5: score=0.97423 n=9: score=0.98873 n=13: score=0.98741 n=17: score=0.42737
But (as is often the case) there are several values that are all quite good: it's hard to pick exactly one correct
value. Probably degree 5–13 would be pretty good for this data.
It can also be useful to look at the scores on both the training and validation data.
trainscore = model.score(X_train, y_train) validscore = model.score(X_valid, y_valid)
n=1: trainscore=0.8479 validscore=0.8202 n=5: trainscore=0.98376 validscore=0.97423 n=9: trainscore=0.99286 validscore=0.98873 n=13: trainscore=0.99321 validscore=0.98741 n=17: trainscore=0.67966 validscore=0.42737
If there's a big drop training to validation, it's usually a sign of overfitting.
This is the prediction of the degree 9 model, with training (blue) and validation points (green).
In case you're interested, the degree 17 model: