CMPT 732, Fall 2021
We have input columns (features), \(x\), and parameters \(\theta\).
Functions \(y(x;\theta)\) produce predictions, \(y\).
We want parameters \(\theta\) that minimize (some measure of) error between the predictions \(y\) and target labels \(t\).
We train the model on labeled data to estimate good values of \(\theta\). We can check performance on other labeled data to check how the model will (probably) behave on future predictions where we don't have
truth: validation and testing.
If we have a discrete set of labels/predictions, then we have a classification problem: we are trying to predict the class or category.
If we have a continuous set of labels/predictions, then we have a regression problem: we are trying to predict a numeric value.
These are supervised learning, where we know
right answers and want to make predictions to match.
Maybe we don't have labels are are trying to partition our inputs into to-be-discovered categories: clustering.
Or we want to find data point that don't fit in with the rest: outlier detection.
These are unsupervised learning.
Spark has some ML tools built in.
The collection of models/tools in Spark aren't as complete as Scikit-Learn, but they're pretty good. Roughly: the most common ML models that can be adapted to big data sets are in Spark.
As with other components: there's an older RDD-based API, and a newer DataFrame-based API. We'll talk only about the DataFrame ML tools: the
See also the Spark ML Guide.
Complete code for the example to follow:
Pieces we'll have:
[Concepts may be familiar to you from Scikit-Learn, but also maybe not…]
We are often going to want to take the data we have, and manipulate it before passing it into the model: feature engineering, but also just reformatting and tidying the data.
That's what the
Transformers are for. Common pattern: data →
Transformer → ⋯ →
Estimator → predictions.
You could apply transformations manually, but that's going to be tedious and error-prone: you have to apply the same ones for training, validation, testing, predicing.
A pipeline describes a series of transformations to its input, finishing with an estimator to make predictions. Implemented with
A pipeline can be trained as a unit: some transforms need training (PCA, indexer, etc), and the estimator certainly does.
Estimators need a single column of all features put together into a vector, so minimal pipeline might be:
assemble_features = VectorAssembler( inputCols=['length', 'width', 'height'], outputCol='features') regressor = GBTRegressor( featuresCol='features', labelCol='volume') pipeline = Pipeline(stages=[assemble_features, regressor])
If we have some training data:
model = pipeline.fit(training)
Once trained, we can predict, possibly on some validation data:
predictions = model.transform(validation) predictions.show()
Once you have a trained model, you probably want to come up with a score to see how it's working. This is done with
Regression and classification are evaluated differently, but the API is the same.
r2_evaluator = RegressionEvaluator( predictionCol='prediction', labelCol='volume', metricName='r2') r2 = r2_evaluator.evaluate(predictions) print(r2)
There are lots of ML algorithms to choose from. All of them implement the
Outside of the core Spark ML tools, there are other ML things where Spark might help.
Spark Deep Learning: deep learning tools implemented in Spark. Particularly nice API for transfer learning where you adapt a pre-trained model to your specific problem.
The toy problem from last week: generate random numbers for length, width, height. Calculate volume as their product. Use ML to predict the volume.
Let's see if we can do better. And let's compare-and-contrast with Scikit-Learn.
Some notes on the Spark regression:
model.transform(validation)) are the full DataFrame: original data, transformer output, targets, predictions.
Let's do the same task with Scikit-Learn to see how they are different.
We could try different models or hyperparameters, but instead let's change the problem.
Maybe we don't care about the numeric value of these values, only their sign: are they positive or negative? That's a classification problem, much easier, and an excuse for some more transformations.
For some more advanced ML in Spark, see the Databricks blog post Fine-Grained Time Series Forecasting.