Let's take a quick look at a few other ML tools we could use. We won't cover these in-depth, but it might be useful later to know they exist, and explore more when you need to.
One distinction we haven't made so far: supervised learning, where we train on examples with known correct answers (the \(y\) values), versus unsupervised learning, where there is no right answer known but the algorithm tries to find some structure in the data.
One we have made, but let's review:
Suppose we want to do regression, but we're fitting data where a linear model doesn't make sense.
Several of the techniques we have seen for classification (i.e. predict a category) also lend themselves to regression (i.e. predict a number) if we look at them the right way.
The score is no longer accuracy (fraction correct); it's now the coefficient of determination (\(r^2\)).
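For reference, the coefficient of determination compares the model's squared error to the variance of the data: \(r^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}\), where \(\hat{y}_i\) are the predictions and \(\bar{y}\) is the mean of the true values. A value of 1 means perfect predictions; 0 means no better than always predicting the mean.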
A \(k\)-nearest neighbours regressor: find the \(k\) nearest training points and use their values to make a prediction. Take the mean of those \(k\) values as the prediction (or the weighted mean, median, etc.).
from sklearn.neighbors import KNeighborsRegressor

model = KNeighborsRegressor(5)
model.fit(X_train, y_train)
print(model.score(X_valid, y_valid))
0.9093807231761435
A random forest regressor: instead of the decision at each leaf being a category, put a numeric value at each leaf of each tree, and that's the tree's prediction. Average the values given by the trees in the forest to make the final prediction.
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(30, max_depth=4)
model.fit(X_train, y_train)
print(model.score(X_valid, y_valid))
0.9535893369669444
The other tree-based classifiers adapt similarly.
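For example, gradient-boosted trees have a regressor counterpart as well. A minimal sketch (the hyperparameter values here are just illustrative):

from sklearn.ensemble import GradientBoostingRegressor

# gradient-boosted trees: each tree predicts a correction to the previous trees' output
model = GradientBoostingRegressor(n_estimators=50, max_depth=3)
model.fit(X_train, y_train)
print(model.score(X_valid, y_valid))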
A neural network regressor: build a neural network; instead of a softmax activation, use the identity function on the last layer, i.e. use the sum of the inputs as the prediction. Train the network so that it matches the training data.
from sklearn.neural_network import MLPRegressor

model = MLPRegressor(hidden_layer_sizes=(8, 6),
    activation='logistic', solver='lbfgs')
model.fit(X_train, y_train)
print(model.score(X_valid, y_valid))
0.9925958528254171
Clustering is an unsupervised problem: find observations that are similar. We don't have any correct \(y\) values to train on. We need the algorithm to discover which points can be grouped together.
I generated some points in five groups:
… but we pretend we don't know the structure:
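Something like scikit-learn's make_blobs can produce data of this kind. A sketch of how the points might have been generated (the exact parameters here are made up):

from sklearn.datasets import make_blobs

# five Gaussian clusters in 2D; keep the points X but discard the true group labels
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=1.0, random_state=1)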
Different clustering algorithms will give different results, and take different parameters…
With k-means clustering (KMeans):
from sklearn.cluster import KMeans

model = KMeans(n_clusters=5)
y = model.fit_predict(X)
With agglomerative clustering (AgglomerativeClustering):
from sklearn.cluster import AgglomerativeClustering

model = AgglomerativeClustering(n_clusters=5)
y = model.fit_predict(X)
With affinity propagation (AffinityPropagation):
from sklearn.cluster import AffinityPropagation

model = AffinityPropagation(random_state=None)
y = model.fit_predict(X)
With DBSCAN (DBSCAN):

from sklearn.cluster import DBSCAN

model = DBSCAN(eps=1.1, min_samples=4)
y = model.fit_predict(X)
A not-very-obvious application of clustering: image palette reduction. Suppose we have an image and would like to reduce it to a smaller colour palette.
Possible strategy: take all of the image's pixel values, find \(n\) clusters, and declare that each cluster centre will be a colour in the output image.
e.g. we have this image and want to reduce to 32 colours. image source
We can use MiniBatchKMeans (a K-Means variant that's faster on large training sets) to find colour clusters, and then assign each pixel the colour of its cluster centre. [complete code]
from sklearn.cluster import MiniBatchKMeans
import numpy as np

# imgdata: array of pixel RGB values (one row per pixel); shape: the original image shape
clusterer = MiniBatchKMeans(N_COLOURS, batch_size=10000)
clusterer.fit(imgdata)
pixel_cluster = clusterer.predict(imgdata)

# map each pixel back to its cluster centre (an RGB value)
colours = clusterer.cluster_centers_.astype(np.uint8)
imgdata = colours[pixel_cluster, :].reshape(shape)
Results, original (left) and reduced palette (right):
Maybe not the best palette for this image, but pretty good for a few lines of code.
Another unsupervised technique: anomaly detection (or outlier detection or novelty detection).
The idea: find observations that are unusual in some way. Use that to try to identify fraudulent credit card charges, attackers on your server, etc.
Of course, scikit-learn has many anomaly detection tools.
Some data generated with clusters and outliers:
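One way to build a data set like this (a sketch; the actual generation may have been different) is to combine a few tight clusters with some uniformly scattered points:

import numpy as np
from sklearn.datasets import make_blobs

# a few dense clusters...
X_clusters, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=1)

# ...plus some widely scattered points that should look anomalous
rng = np.random.default_rng(1)
X_outliers = rng.uniform(low=-10, high=10, size=(15, 2))
X = np.concatenate([X_clusters, X_outliers])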
One anomaly detection algorithm and its results:
from sklearn.neighbors import LocalOutlierFactor

model = LocalOutlierFactor()
y = model.fit_predict(X)
There's going to be some tradeoff between false positives and false negatives. Which algorithm you choose and how you parameterize it depends on the application.
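For example, LocalOutlierFactor takes a contamination parameter that sets roughly what fraction of points get flagged, so the threshold can be tuned to the application (the value below is just illustrative):

from sklearn.neighbors import LocalOutlierFactor

# expect about 5% of the points to be outliers; larger values flag more points
model = LocalOutlierFactor(contamination=0.05)
y = model.fit_predict(X)  # -1 for outliers, 1 for inliers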
When is machine learning a tool we should reach for?
Compared to statistics, there is less certainty in results: ML techniques don't give a nice comforting p-value.
They make predictions, and you do some testing to convince yourself the predictions are good. How certain can you be that they'll be correct on never-before-seen data?
On the other hand, no other techniques at our disposal can do what machine learning can.
A T-test isn't going to distinguish between hand-written digits; in ML, that's a hello-world problem.