Let's take a quick look at a few other ML tools we could use. We won't cover these in-depth, but it might be useful later to know they exist, and explore more when you need to.
One distinction we haven't made so far: supervised learning, where we train on examples with known correct answers (the \(y\) values), versus unsupervised learning, where there is no right answer known but the algorithm tries to find some structure in the data.
One we have made, but let's review:
Suppose we want to do regression, but we're fitting data where a linear model doesn't make sense.
Several of the techniques we have seen for classification (i.e. predict a category) also lend themselves to regression (i.e. predict a number) if we look at them the right way.
The score is no longer accuracy (fraction correct); it's now the coefficient of determination (\(r^2\)).
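For reference, the coefficient of determination compares the model's squared error to the variance of the data: \(r^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}\), where \(\hat{y}_i\) are the predictions and \(\bar{y}\) is the mean of the true values. A value of 1 means perfect predictions; 0 means no better than always predicting the mean.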
A \(k\)-nearest neighbours regressor: find the \(k\) nearest training points and use their values to make a prediction. Take the mean of those \(k\) values as the prediction (or the weighted mean, median, etc.).
from sklearn.neighbors import KNeighborsRegressor

model = KNeighborsRegressor(5)
model.fit(X_train, y_train)
print(model.score(X_valid, y_valid))
0.9093807231761435
A random forest regressor: instead of the decision at each leaf being a category, put a numeric value at each leaf of each tree, and that's the tree's prediction. Average the values given by the trees in the forest to make the final prediction.
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(30, max_depth=4)
model.fit(X_train, y_train)
print(model.score(X_valid, y_valid))
0.9535893369669444
The other tree-based classifiers adapt similarly.
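For example, gradient-boosted trees have a regressor counterpart as well. A minimal sketch (the hyperparameter values here are just illustrative):

from sklearn.ensemble import GradientBoostingRegressor

# gradient-boosted trees: each tree predicts a correction to the previous trees' output
model = GradientBoostingRegressor(n_estimators=50, max_depth=3)
model.fit(X_train, y_train)
print(model.score(X_valid, y_valid))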
A neural network regressor: build a neural network; instead of a softmax activation, use the identity function on the last layer, i.e. use the sum of the inputs as the prediction. Train the network so that it matches the training data.
from sklearn.neural_network import MLPRegressor

model = MLPRegressor(hidden_layer_sizes=(8, 6),
    activation='logistic', solver='lbfgs')
model.fit(X_train, y_train)
print(model.score(X_valid, y_valid))
0.9925958528254171
Clustering is an unsupervised problem: find observations that are similar. We don't have any correct \(y\) values to train on. We need the algorithm to discover which points can be grouped together.
I generated some points in five groups:
… but we pretend we don't know the structure:
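Something like scikit-learn's make_blobs can produce data of this kind. A sketch of how the points might have been generated (the exact parameters here are made up):

from sklearn.datasets import make_blobs

# five Gaussian clusters in 2D; keep the points X but discard the true group labels
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=1.0, random_state=1)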
Different clustering algorithms will give different results, and take different parameters…
With k-means clustering (KMeans):
from sklearn.cluster import KMeans

model = KMeans(n_clusters=5)
y = model.fit_predict(X)
With agglomerative clustering (AgglomerativeClustering):
from sklearn.cluster import AgglomerativeClustering

model = AgglomerativeClustering(n_clusters=5)
y = model.fit_predict(X)
With affinity propagation (AffinityPropagation):
from sklearn.cluster import AffinityPropagation

model = AffinityPropagation(random_state=None)
y = model.fit_predict(X)
With DBSCAN (DBSCAN):

from sklearn.cluster import DBSCAN

model = DBSCAN(eps=1.1, min_samples=4)
y = model.fit_predict(X)
A not-very-obvious application of clustering: image palette reduction. Suppose we have an image and would like to reduce it to a smaller colour palette.
Possible strategy: take all of the image's pixel values, find \(n\) clusters, and declare that each cluster centre will be a colour in the output image.
e.g. we have this image and want to reduce to 32 colours. image source
We can use MiniBatchKMeans (a K-Means variant that's faster on large training sets) to find colour clusters, and then assign each pixel the colour of its cluster centre. [complete code]
from sklearn.cluster import MiniBatchKMeans
import numpy as np

# imgdata: array of pixel RGB values (one row per pixel); shape: the original image shape
clusterer = MiniBatchKMeans(N_COLOURS, batch_size=10000)
clusterer.fit(imgdata)
pixel_cluster = clusterer.predict(imgdata)

# map each pixel back to its cluster centre (an RGB value)
colours = clusterer.cluster_centers_.astype(np.uint8)
imgdata = colours[pixel_cluster, :].reshape(shape)
Results, original (left) and reduced palette (right):
Maybe not the best palette for this image, but pretty good for a few lines of code.
Another unsupervised technique: anomaly detection (or outlier detection or novelty detection).
The idea: find observations that are unusual in some way. Use that to try to identify fraudulent credit card charges, attackers on your server, etc.
Of course, scikit-learn has many anomaly detection tools.
Some data generated with clusters and outliers:
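One way to build a data set like this (a sketch; the actual generation may have been different) is to combine a few tight clusters with some uniformly scattered points:

import numpy as np
from sklearn.datasets import make_blobs

# a few dense clusters...
X_clusters, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=1)

# ...plus some widely scattered points that should look anomalous
rng = np.random.default_rng(1)
X_outliers = rng.uniform(low=-10, high=10, size=(15, 2))
X = np.concatenate([X_clusters, X_outliers])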
One anomaly detection algorithm and its results:
from sklearn.neighbors import LocalOutlierFactor

model = LocalOutlierFactor()
y = model.fit_predict(X)
There's going to be some tradeoff between false positives and false negatives. Which algorithm you choose and how you parameterize it depends on the application.
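For example, LocalOutlierFactor takes a contamination parameter that sets roughly what fraction of points get flagged, so the threshold can be tuned to the application (the value below is just illustrative):

from sklearn.neighbors import LocalOutlierFactor

# expect about 5% of the points to be outliers; larger values flag more points
model = LocalOutlierFactor(contamination=0.05)
y = model.fit_predict(X)  # -1 for outliers, 1 for inliers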
When is machine learning a tool we should reach for?
Compared to statistics, there is less certainty in results: ML techniques don't give a nice comforting p-value.
They make predictions, and you do some testing to convince yourself the predictions are good. How certain can you be that they'll be correct on never-before-seen data?
On the other hand, no other techniques at our disposal can do what machine learning can.
A T-test isn't going to distinguish between hand-written digits; in ML, that's a hello-world problem.