Data Management

CMPT 732, Fall 2023

The V's

Remember “The Four V's”?

  • Volume: the amount of data.
  • Velocity: the data arrives quickly and constantly.
  • Variety: many differently-structured (or less-structured) input data sets.
  • Veracity: some data might be incorrect, or of unknown correctness.

We have spent most of our time on volume. What about the others?

The V's

Velocity can be addressed by Spark Streaming or other streaming tools.

Veracity (a.k.a. correctness) is generally solved with data cleaning… more coverage in CMPT 733.

What about variety? Let's get there…

OLAP vs OLTP

Most traditional databases are focussed on OLTP (OnLine Transaction Processing) work: processing queries quickly to back a web site or similar.

In contrast, OLAP (OnLine Analytical Processing) focusses on less-real-time analytics: reporting, business intelligence, data mining, etc.

OLAP vs OLTP

The categories aren't mutually exclusive: the same database server could do both.

But complex reporting queries can slow down transactions, so it might also be reasonable to separate them. Easy solution: master database for OLTP and read-only replica for OLAP.

OLAP vs OLTP

In this course, we are (I am?) generally thinking of OLAP-like workloads. Big data can be the result of aggregating smaller OLTP systems and doing analysis.

Extract-Transform-Load

Sometimes, the data you get isn't in the format you want.

e.g. schema for OLTP might not be what you need for OLAP: maybe it makes sense to denormalize or aggregate before doing any OLAP work.

e.g. data from multiple sources need to be combined for analysis.
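
For the denormalization case, a minimal Spark sketch might look like this (the orders/customers tables and their column names are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('denormalize example').getOrCreate()

orders = spark.read.parquet('orders')        # order_id, customer_id, amount, ...
customers = spark.read.parquet('customers')  # customer_id, name, region, ...

# One row per order with the customer attributes copied in: redundant for
# OLTP, but convenient for OLAP-style reporting queries.
orders.join(customers, on='customer_id', how='left') \
    .write.parquet('orders-denormalized', mode='overwrite')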

Extract-Transform-Load

e.g. data arrived in a slow-to-process format (like monolithic .json.bz2) but you want something faster (like partitioned .parquet.lz4). This is easy in Spark:

spark.read.json('horrible-download.json.bz2', schema=schema) \
    .repartition(120) \
    .write.parquet('usable-data', compression='lz4')

e.g. that, but split at the command line:

bunzip2 -c horrible.json.bz2 | split -a 5 -d -C 1024M - part-

e.g. load the data into Cassandra before further processing.
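
Continuing the Spark snippet above (so spark and schema are assumed), a sketch of the Cassandra case, assuming the spark-cassandra-connector package is on the classpath and the keyspace/table already exist:

spark.read.json('horrible-download.json.bz2', schema=schema) \
    .write.format('org.apache.spark.sql.cassandra') \
    .options(table='mytable', keyspace='mykeyspace') \
    .mode('append') \
    .save()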

Extract-Transform-Load

Generally, the solution is to take the format you have, transform to the format you need, and save.

In other words: Extract-Transform-Load, or ETL.

Can also include extracting, validating, cleaning, joining, aggregating, ….
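
A hedged sketch of that kind of pipeline (the paths and column names are made up): extract, validate/clean, aggregate, then save.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName('etl example').getOrCreate()

events = spark.read.json('raw-events')                    # extract
cleaned = events.filter(events['user_id'].isNotNull()) \
    .withColumn('ts', F.to_timestamp('ts'))               # transform: validate/clean
daily = cleaned.groupBy(F.to_date('ts').alias('day')) \
    .agg(F.count('*').alias('events'))                     # transform: aggregate
daily.write.parquet('daily-events', mode='overwrite')      # load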

Extract-Transform-Load

ETL can be done with tools you know: Spark, Spark Streaming, just programming.

There are also dedicated tools to define data processing steps: Apache Flume, Apache NiFi, AWS Data Pipeline, ….

Data Warehousing

It's common for an organization to have data stored in many places: an SQL database for the web site, another SQL database for the HR system, a spreadsheet with budgets, ….

For analysis, it probably needs to all be in one system: a data warehouse, e.g. AWS Redshift.

This addresses at least one kind of data variety.

Data Warehousing

The idea: take (all of the) OLTP system(s) and other data sources, and use ETL to get them all into one OLAP system.

Call that a data warehouse. Use it for analytics, reporting, etc.

e.g. it would be nice for SFU to have a data warehouse. Possible inputs: databases from goSFU, the financial system, the HR system, computer accounts, ID photos. If that existed, it would be much easier to answer questions about what's happening around here.

Data Warehousing

The data for a warehouse could be stored anywhere that makes sense: an SQL database, a NoSQL database, HDFS+Spark, a dedicated warehouse tool.

Some data warehousing tools: Amazon Redshift, Google BigQuery, Teradata.

Data Warehousing

When importing data into a warehouse, it may be useful to reshape it:

  • denormalize or re-normalize for easier querying
  • add indices that aren't necessary in OLTP
  • unify identifiers (entity resolution/record linkage); see the sketch after this list
  • keep history of previous values
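
A crude sketch of the identifier-unification step (hypothetical column names, and assuming a SparkSession as spark, as in the earlier snippets). Real entity resolution is harder, but even normalizing the keys before matching helps:

from pyspark.sql import functions as F

crm = spark.read.parquet('crm-contacts') \
    .withColumn('email', F.lower(F.trim('email')))
billing = spark.read.parquet('billing-accounts') \
    .withColumn('email', F.lower(F.trim('email')))

# With the key normalized, records for the same person in the two systems
# can be matched on it.
linked = crm.join(billing, on='email', how='full_outer')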

Data Warehousing

Make sure any data import task is idempotent.

It's going to fail one day: make sure it can be re-started safely and heal itself. i.e. never blindly insert data.

  1. Begin transaction.
  2. Is data present?
  3. If yes, update.
  4. If not, insert.
  5. Commit.
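
One way to implement those steps, sketched with psycopg2 against a hypothetical PostgreSQL warehouse table (the table and column names are made up). The upsert collapses steps 2-4 into one atomic statement:

import psycopg2

conn = psycopg2.connect('dbname=warehouse')
rows = [(1, 42.0), (2, 17.5)]   # whatever this import run produced

with conn:                       # 1. begin transaction; 5. commit on success
    with conn.cursor() as cur:
        for key, value in rows:
            cur.execute(
                'INSERT INTO measurements (id, value) VALUES (%s, %s) '
                'ON CONFLICT (id) DO UPDATE SET value = EXCLUDED.value',
                (key, value))    # 2.-4. update if present, insert if not

Re-running the same import just rewrites the same rows: no duplicates, no manual cleanup.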

Data Lakes

It might not actually be necessary to copy the data into one system: maybe it can be queried in-place.

Typically a data warehouse involves copying data into a structured central repository for reporting and analytics. It's also possible to do OLAP work on data in-place or in its raw format: a data lake.

Data Lakes

There are several tools to query data from different sources in place: Spark, Hive, Apache Impala, Apache Drill, Amazon Redshift Spectrum, and Amazon Athena.
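
A sketch of the in-place approach with Spark (the paths, view names, and columns are hypothetical): read the data where it already lives and query it directly.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('data lake example').getOrCreate()

# Register the raw files as temporary views; nothing gets copied anywhere.
spark.read.parquet('s3a://logs/pageviews').createOrReplaceTempView('pageviews')
spark.read.json('s3a://exports/users').createOrReplaceTempView('users')

report = spark.sql('''
    SELECT u.region, COUNT(*) AS views
    FROM pageviews p JOIN users u ON p.user_id = u.user_id
    GROUP BY u.region
''')
report.show()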

Data Lakes

With either a warehouse or a lake, the goal is to answer some questions.

The choice between them is going to depend on the number/type of questions, their frequency, the amount/variety of data, etc.