Machine Learning: Other Techniques

CMPT 353, Fall 2019

Let's take a quick look at a few other ML tools we could use. We won't cover these in depth, but it might be useful later to know they exist, and to explore them further when you need to.

One distinction we haven't made so far:

  • Supervised learning: techniques where we have training examples for which we know the correct result (the y values).
  • Unsupervised learning: techniques where no right answer is known, but the algorithm tries to find some structure in the data.
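
In scikit-learn, the distinction shows up in what gets passed to fit. A minimal sketch, assuming made-up data (the points, the y values, and the choice of the two models are illustrative only):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Made-up data for illustration: two well-separated groups of 2D points.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)  # the known "correct" answers

# Supervised: the model is given both the inputs and the correct results.
supervised = KNeighborsClassifier(3)
supervised.fit(X, y)

# Unsupervised: the model sees only the inputs and must find structure itself.
unsupervised = KMeans(n_clusters=2, n_init=10, random_state=0)
groups = unsupervised.fit_predict(X)

print(supervised.predict([[0, 0], [10, 10]]))
print(groups[:5], groups[-5:])
```

The cluster numbers from the unsupervised model are arbitrary labels: they identify groups, but nothing ties them to the values in y.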

Clustering

Clustering is an unsupervised problem: find observations that are similar.

We don't have any correct values to train on. We need the algorithm to discover which points can be grouped together.

I generated some points in five groups:

clusters to discover
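
The code that generated these points isn't shown here. One way to produce similar data, assuming scikit-learn's make_blobs and made-up sizes, might be:

```python
from sklearn.datasets import make_blobs

# Assumed parameters: 500 two-dimensional points around 5 random centres.
X, true_groups = make_blobs(n_samples=500, centers=5, n_features=2,
                            random_state=0)

print(X.shape)           # one row per point, one column per coordinate
print(set(true_groups))  # the group each point was generated from
```

A clustering algorithm would be given only X; true_groups is the structure we pretend not to know.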

… but we pretend we don't know the structure:

unlabelled clusters

Different clustering algorithms will give different results, and take different parameters…

from sklearn.cluster import KMeans
model = KMeans(n_clusters=5)
y = model.fit_predict(X)
KMeans clusters

from sklearn.cluster import AgglomerativeClustering
model = AgglomerativeClustering(n_clusters=5)
y = model.fit_predict(X)
Agglomerative clusters

from sklearn.cluster import AffinityPropagation
model = AffinityPropagation()
y = model.fit_predict(X)
AffinityPropagation clusters
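
To compare the three side by side, here is a rough sketch that runs each on the same stand-in data (generated with make_blobs, an assumption, as are the sizes and random_state values) and counts the clusters each one reports:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation, AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs

# Assumed stand-in data: 200 points around 5 centres.
X, _ = make_blobs(n_samples=200, centers=5, random_state=0)

models = {
    'KMeans': KMeans(n_clusters=5, n_init=10, random_state=0),
    'AgglomerativeClustering': AgglomerativeClustering(n_clusters=5),
    # AffinityPropagation takes no cluster count; it picks one itself.
    'AffinityPropagation': AffinityPropagation(random_state=0),
}

labels = {name: model.fit_predict(X) for name, model in models.items()}
for name, y in labels.items():
    print(name, 'found', len(np.unique(y)), 'clusters')
```

The first two are told how many clusters to find. AffinityPropagation decides for itself, so its count may well differ from five.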

Clustering Colours

A not-very-obvious application of clustering: image palette reduction. Suppose we have an image and would like to reduce it to a smaller colour palette.

Possible strategy: take all of the image's pixel values, find \(n\) clusters, and declare that each cluster centre will be a colour in the output image.

For example, we have this image and want to reduce it to 32 colours.

input image

We can use K-Means (here a mini-batch variant that's faster on large training sets) to find colour clusters, and then replace each pixel with the centre of its cluster.

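import numpy as np
from sklearn.cluster import MiniBatchKMeans

# N_COLOURS, imgdata (one row per pixel) and shape are set up earlier in the program
clusterer = MiniBatchKMeans(N_COLOURS, batch_size=10000)
clusterer.fit(imgdata)
pixel_cluster = clusterer.predict(imgdata)

# map each pixel back to the cluster centre (an RGB value)
colours = clusterer.cluster_centers_.astype(np.uint8)
imgdata = colours[pixel_cluster, :].reshape(shape)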
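
To try the same steps without an image file, here is a self-contained sketch. The random "image", its size, and the palette size of 8 are all assumptions made for illustration; a real photo would be loaded from disk instead.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

N_COLOURS = 8  # assumed palette size for this small example

# Assumed input: a random 60x80 RGB image standing in for a real photo.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(60, 80, 3), dtype=np.uint8)
shape = img.shape

# One row per pixel, one column per colour channel.
imgdata = img.reshape(-1, 3)

clusterer = MiniBatchKMeans(N_COLOURS, batch_size=1000, n_init=3,
                            random_state=0)
pixel_cluster = clusterer.fit_predict(imgdata)

# Replace every pixel with the centre of the cluster it was assigned to.
colours = clusterer.cluster_centers_.astype(np.uint8)
reduced = colours[pixel_cluster, :].reshape(shape)

n_distinct = len(np.unique(reduced.reshape(-1, 3), axis=0))
print(n_distinct, 'distinct colours in the output')
```

The output has the same shape as the input, but at most N_COLOURS different pixel values.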

Results, original (left) and reduced palette (right):

input image (left), reduced-palette image (right)

Maybe not the best palette for this image, but pretty good for a few lines of code.

Anomaly Detection

Another unsupervised technique: anomaly detection (or outlier detection or novelty detection).

The idea: find observations that are unusual in some way. Use that to try to identify fraudulent credit card charges, attackers on your server, etc.

Some data generated with clusters and outliers:

data with some anomalies

One anomaly detection algorithm and its results:

from sklearn.neighbors import LocalOutlierFactor
model = LocalOutlierFactor()
y = model.fit_predict(X)
anomalies detected
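
The y that comes back holds 1 for points judged normal and -1 for points judged anomalous. A small sketch, assuming made-up data with one obviously unusual point planted in it:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Assumed data: one tight cluster plus a single far-away point.
rng = np.random.default_rng(0)
cluster = rng.normal(0, 1, (100, 2))
outlier = np.array([[20.0, 20.0]])
X = np.concatenate([cluster, outlier])

model = LocalOutlierFactor()
y = model.fit_predict(X)  # 1 for inliers, -1 for outliers

print('label of the far-away point:', y[-1])
print('points flagged as anomalies:', np.sum(y == -1))
```

Some points inside the cluster may be flagged too: the algorithm only knows how unusual each point looks relative to its neighbours.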

There's going to be some tradeoff between false positives and false negatives. Which algorithm you choose and how you parameterize it depends on the application.
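
As one illustration of that parameterization, LocalOutlierFactor takes a contamination argument: roughly, the fraction of points we expect to be anomalies. A sketch on made-up data with no planted outliers (the data and the three values tried are assumptions):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Assumed data: 200 points from one normal distribution, nothing planted.
rng = np.random.default_rng(0)
X = rng.normal(0, 1, (200, 2))

flagged = {}
for contamination in [0.01, 0.05, 0.2]:
    model = LocalOutlierFactor(contamination=contamination)
    y = model.fit_predict(X)
    flagged[contamination] = int(np.sum(y == -1))
    print(contamination, '->', flagged[contamination], 'points flagged')
```

A larger value flags more points, so it catches more real anomalies but produces more false positives. A smaller value does the opposite.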

More Regression

Suppose we want to do regression, but on data where a linear model doesn't make sense.

non-linear data

Several of the techniques we have seen for classification (i.e. predict a category) also lend themselves to regression (i.e. predict a number) if we look at them the right way.

The score is no longer the accuracy (fraction correct); it is now the coefficient of determination (\(r^2\)).
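
For reference, the coefficient of determination is one minus the ratio of the model's squared error to the squared error of always predicting the mean. A sketch with made-up numbers, checked against scikit-learn's r2_score:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true values and predictions, purely for illustration.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

# Squared error of the model's predictions.
ss_res = np.sum((y_true - y_pred) ** 2)
# Squared error we would get by always predicting the mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(r2, r2_score(y_true, y_pred))
```

A value of 1 means perfect predictions, and 0 means no better than predicting the mean every time.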

A \(k\)-nearest neighbours regressor: find the \(k\) nearest training points and use their values to make a prediction. Usually we take the mean of the \(k\) values, but it could be a weighted mean, median, etc.

from sklearn.neighbors import KNeighborsRegressor
model = KNeighborsRegressor(5)
model.fit(X_train, y_train)
print(model.score(X_valid, y_valid))
0.9093807231761435

kNN regressor fit
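
To see that the default prediction really is the mean of the neighbours' values, we can look the neighbours up and average by hand. A sketch, assuming made-up training data (noisy samples of a sine curve) and an arbitrary query point:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Assumed data: noisy samples of a sine curve.
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, (200, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(0, 0.1, 200)

model = KNeighborsRegressor(5)
model.fit(X_train, y_train)

# Find the 5 nearest training points to a query and average their y values.
query = np.array([[2.0]])
_, neighbour_idx = model.kneighbors(query)
by_hand = y_train[neighbour_idx[0]].mean()

print(by_hand, model.predict(query)[0])

# weights='distance' gives closer neighbours more influence (a weighted mean).
weighted = KNeighborsRegressor(5, weights='distance').fit(X_train, y_train)
print(weighted.predict(query)[0])
```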

A random forest regressor: instead of each leaf holding a category, put a predicted numeric value at each leaf of each tree. Average the values given by the trees in the forest to make a prediction.

from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(30, max_depth=4)
model.fit(X_train, y_train)
print(model.score(X_valid, y_valid))
0.9535893369669444

The other tree-based classifiers adapt similarly.

random forest regressor fit
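
The averaging across trees can be checked directly, since a fitted forest keeps its individual trees in estimators_. A sketch, assuming made-up training data and a few arbitrary query points:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Assumed data: noisy samples of a sine curve.
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, (200, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(0, 0.1, 200)

model = RandomForestRegressor(30, max_depth=4, random_state=0)
model.fit(X_train, y_train)

X_new = np.array([[1.0], [4.0], [7.5]])
# Ask each of the 30 trees separately, then average their answers.
tree_predictions = np.array([tree.predict(X_new)
                             for tree in model.estimators_])
by_hand = tree_predictions.mean(axis=0)

print(by_hand)
print(model.predict(X_new))
```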

A neural network regressor: build a neural network; instead of softmax activation, use the identity function on the last layer, i.e. use the weighted sum of its inputs as the prediction. Train the network so that its output matches the training data.

from sklearn.neural_network import MLPRegressor
model = MLPRegressor(hidden_layer_sizes=(8, 6),
                     activation='logistic', solver='lbfgs')
model.fit(X_train, y_train)
print(model.score(X_valid, y_valid))
0.9925958528254171

NN regressor fit
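
The identity output layer can be seen in a fitted MLPRegressor: out_activation_ names the output function, and the prediction can be rebuilt from the stored weights. A sketch, assuming made-up training data; max_iter and random_state are additions for this example:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Assumed data: noisy samples of a sine curve.
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, (200, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(0, 0.1, 200)

model = MLPRegressor(hidden_layer_sizes=(8, 6), activation='logistic',
                     solver='lbfgs', max_iter=2000, random_state=0)
model.fit(X_train, y_train)

# Hidden layers use the logistic function; the output layer does not.
print(model.activation, model.out_activation_)


def logistic(z):
    return 1 / (1 + np.exp(-z))


# Rebuild the prediction by hand: logistic hidden layers, then a plain
# weighted sum (plus intercept) on the last layer.
X_new = np.array([[2.0], [5.0]])
layer = X_new
for W, b in zip(model.coefs_[:-1], model.intercepts_[:-1]):
    layer = logistic(layer @ W + b)
by_hand = (layer @ model.coefs_[-1] + model.intercepts_[-1]).ravel()

print(by_hand, model.predict(X_new))
```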

When Machine Learning?

When is machine learning a tool we should reach for?

Compared to statistics, there is less certainty in results: ML techniques don't give a nice comforting p-value.

They make predictions, and you do some testing to convince yourself the predictions are good. How certain can you be that they'll be correct on never-before-seen data?
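
One common form of that testing is to score the model on data held back from training, and to repeat this over several splits. A sketch, assuming made-up data and a kNN regressor chosen only as an example:

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Assumed data: noisy samples of a sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 300)

# Hold back data the model never sees during training.
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)
model = KNeighborsRegressor(5).fit(X_train, y_train)
valid_score = model.score(X_valid, y_valid)
print('validation score:', valid_score)

# Repeat with 5 different splits to see how much the score varies.
scores = cross_val_score(KNeighborsRegressor(5), X, y, cv=5)
print('cross-validation scores:', scores)
```

Neither number is a guarantee about future data, but scores that hold up across splits are some evidence the model isn't just memorizing its training set.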

On the other hand, no other techniques at our disposal can do what machine learning can.

A T-test isn't going to distinguish between hand-written digits. In ML, it's a hello-world.

handwritten digits
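
For a sense of how short that hello-world can be, here is a sketch using scikit-learn's bundled digits dataset. The dataset, the kNN classifier, and its parameter are choices made for this example, not necessarily what produced the image above.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# scikit-learn's bundled set of 8x8 handwritten digit images.
digits = load_digits()
X_train, X_valid, y_train, y_valid = train_test_split(
    digits.data, digits.target, random_state=0)

model = KNeighborsClassifier(5)
model.fit(X_train, y_train)

accuracy = model.score(X_valid, y_valid)
print(accuracy)
```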