CMPT 353


As we have seen, when you get data, it's generally not in the format/​shape you want.

The step in your pipeline to reshape it can be expensive: loading from an inefficient format and doing initial processing.


It would be useful to start by getting your data into a nice format so you can move on from there.

That's extract-transform-load (or ETL).

The monthly_totals.py question on Exercise 1 is an ETL task.


The extract step is what we have been talking about: taking the data from the format you find it, and reading it so you can work with it.


The transform step encompasses whatever steps are needed to get the data into a useful format.

Could be nothing. Could be a lot of work.


Some things you might do as part of the transform:

  • Fixing data types, splitting strings, etc.
  • Reshaping the data (e.g. a pivot) to what you'll need later.
  • Aggregating data.
  • Filtering out data you don't care about.
  • Joining data from multiple sources.
  • De-identification.
  • Some data cleaning (more later).
  • assert is_sane(record)


Once you have some useful data, you're probably going to want to save it, so it can be loaded in the next pipeline step. (i.e. load it into your main data store)


You'd like this to be as efficient as possible.

You could save as CSV or JSON, but they're not extremely efficient: text-based, limited data types, need to be parsed again.

… but at least you know they will work.


Could store to a database (relational or NoSQL): they're good at data.

Could save HDF5: a format designed to efficiently represent data sets.

In the big data world, Parquet is a common choice.


Don't work with stupidly-formatted data all the time. ETL to create something nice to work with.