Extract-Transform-Load

CMPT 353

Extract-Transform-Load

As we have seen, when you get data, it's generally not in the format/shape you want.

The step in your pipeline that reshapes it can be expensive: loading from an inefficient format and doing the initial processing.

Extract-Transform-Load

It would be useful to start by getting your data into a nice format so you can move on from there.

That's extract-transform-load (or ETL).

The monthly_totals.py question on Exercise 1 is an ETL task.

Extract

The extract step is what we have been talking about: taking the data from the format you find it, and reading it so you can work with it.
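
As a rough sketch with pandas (the file names and layout here are invented, just to show the shape of an extract step):

    import glob
    import pandas as pd

    # Hypothetical input: a directory of gzipped JSON-lines files, one record per line.
    parts = [pd.read_json(f, lines=True) for f in sorted(glob.glob('raw-data/*.json.gz'))]
    raw = pd.concat(parts, ignore_index=True)

The result is a single DataFrame in memory, probably still with the wrong types and shape: that's what the transform step is for.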

Transform

The transform step encompasses whatever steps are needed to get the data into a useful format.

Could be nothing. Could be a lot of work.

Transform

Some things you might do as part of the transform:

  • Fixing data types, splitting strings, etc.
  • Reshaping the data (e.g. a pivot) to what you'll need later.
  • Aggregating data.
  • Filtering out data you don't care about.
  • Joining data from multiple sources.
  • De-identification.
  • Some data cleaning (more later).
  • assert is_sane(record)
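
A sketch of what a transform step could look like in pandas. The column names (timestamp, location, reading) are invented for illustration; the point is that several of the items above often end up as a line or two each:

    import pandas as pd

    def transform(raw):
        data = raw.copy()
        # Fix data types: timestamps arrived as strings.
        data['timestamp'] = pd.to_datetime(data['timestamp'], utc=True)
        # Split a combined string field into the pieces we actually want.
        data[['city', 'province']] = data['location'].str.split(', ', n=1, expand=True)
        # Filter out records we don't care about.
        data = data[data['reading'].notnull()]
        # Sanity-check before going further.
        assert (data['reading'] >= 0).all()
        # Aggregate: one row per city per day.
        data['date'] = data['timestamp'].dt.date
        return data.groupby(['city', 'date'])['reading'].mean().reset_index()

Joins with other sources and de-identification would slot into the same function in the same way.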

Load

Once you have some useful data, you'll probably want to save it so it can be loaded in the next pipeline step (i.e. load it into your main data store).

Load

You'd like this to be as efficient as possible.

You could save as CSV or JSON, but they're not very efficient: they're text-based, have limited data types, and need to be parsed again.

… but at least you know they will work.
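
A quick illustration of the "parsed again" problem, with a made-up two-row DataFrame:

    import pandas as pd

    df = pd.DataFrame({'when': pd.to_datetime(['2024-01-01', '2024-01-02']),
                       'count': [1, 2]})
    df.to_csv('demo.csv', index=False)
    back = pd.read_csv('demo.csv')
    print(back['when'].dtype)   # object: the datetimes came back as plain strings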

Load

Could store to a database (relational or NoSQL): they're good at data.

Could save as HDF5: a format designed to efficiently represent data sets.

In the big data world, Parquet is a common choice.
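
A sketch of the load step with pandas. Parquet needs pyarrow (or fastparquet) installed, HDF5 needs PyTables, and the DataFrame and file names here are invented:

    import sqlite3
    import pandas as pd

    daily = pd.DataFrame({'city': ['Vancouver', 'Burnaby'], 'reading': [7.1, 6.4]})

    # Parquet: columnar, compressed, keeps the dtypes.
    daily.to_parquet('daily-readings.parquet')

    # HDF5: also keeps dtypes; 'key' names the table inside the file.
    daily.to_hdf('daily-readings.h5', key='daily', mode='w')

    # Or load into a relational database (SQLite here, just as an example).
    con = sqlite3.connect('readings.db')
    daily.to_sql('daily', con, if_exists='replace', index=False)
    con.close()

    # The next pipeline step can read the data back cheaply:
    daily = pd.read_parquet('daily-readings.parquet')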

Summary

Don't work with stupidly-formatted data all the time. ETL to create something nice to work with.