Data Warehouses

CMPT 353

Data Warehouses

All of the DataFrame tools we have looked at do some kind of "read all of the data and calculate on it" task.

Most of the data we worked with was from files, but there are other options.

Data Warehouses

Reading files from disk is fine, but if you have 100 calculations/​summaries/​reports to do on a data set, the work of reading+parsing the files adds up.

Databases

Another data tool category: relational databases like PostgreSQL, MySQL, Microsoft SQL Server.

We saw DuckDB doing DataFrame-like things, but starting to feel like a database: work was specified with SQL queries, using syntax you (probably) already know.
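
As a quick reminder of what that looked like, a minimal sketch with a made-up stand-in DataFrame:

import duckdb
import pandas as pd

# a tiny stand-in DataFrame, just to show the idea: DuckDB can run
# SQL directly over a DataFrame that's in scope
weather_df = pd.DataFrame({'station': ['CA001234567', 'US007654321'],
    'tmax': [21.5, 30.2]})
hottest = duckdb.sql('''SELECT station, MAX(tmax) AS max_tmax
    FROM weather_df GROUP BY station''').df()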

Databases

Are we converging in the middle? Maybe we should just do the work in a database? That might be where the data lives anyway.

Think about a web site with a database backend: that's the authoritative copy of the data. Why not just query it there for data science things?

Databases

Problem #1: the database is probably already busy.

The web site (or whatever) probably does a lot of queries and expects them to complete quickly. Everybody's going to be mad if the data science team throws a bunch of huge queries in, slowing down the web site.

Databases

Problem #2: it's not really what most relational databases are designed for.

The common relational databases (and the data schemas people usually design for them) are optimized for OLTP (OnLine Transaction Processing). e.g. insert one row; query to find purchases for one customer.
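
As a rough sketch of that OLTP shape (the connection details, table, and column names here are all hypothetical), each statement touches only a row or two and is expected to return almost instantly:

import psycopg2

conn = psycopg2.connect('dbname=sales host=dbsvr')  # hypothetical production DB
cur = conn.cursor()

# insert one row…
cur.execute(
    'INSERT INTO purchases (customer_id, product_id, amount) VALUES (%s, %s, %s)',
    (12345, 678, 19.99))
conn.commit()

# …or look up the handful of rows for one customer.
cur.execute('SELECT * FROM purchases WHERE customer_id = %s', (12345,))
purchases = cur.fetchall()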

Databases

For data science work, we're usually imagining OLAP (OnLine Analytical Processing) workloads. e.g. process all of the purchases and aggregate by day to find the days with the most sales from customers who aren't part of the rewards program.

Common relational databases can do that, but doing it alongside the production OLTP work is probably a bad idea.
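
For contrast, the OLAP-style question above might look something like this (continuing with the hypothetical cursor and made-up schema from the OLTP sketch): one query that scans and aggregates the whole purchases table.

# scan all purchases, keep only non-rewards customers, aggregate by day,
# and rank the days: an OLAP-shaped query
cur.execute('''
    SELECT CAST(p.purchased_at AS DATE) AS day, SUM(p.amount) AS total_sales
    FROM purchases p
    JOIN customers c ON c.customer_id = p.customer_id
    WHERE NOT c.rewards_member
    GROUP BY CAST(p.purchased_at AS DATE)
    ORDER BY total_sales DESC''')
busiest_days = cur.fetchall()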

Databases

Problem #3: the data you need might not all be in the same place.

It's common to have multiple systems running within an organization, so it's not going to be possible to query across them in one place. e.g. at SFU, goSFU contains student/course information, HAP has HR and payroll stuff, KUALI keeps track of research grants, … .

Data Warehouse

So, we invent a data warehouse.

The idea: take all of the data you need to query for analytics, and get it all into one place (the data warehouse) so you can query efficiently.

Data Warehouse

A data warehouse could be a regular relational database that mirrors the original data (e.g. a read-only replica or nightly data dump) and is used for analytic queries.

Or it could be a distinct data warehouse system that aggregates data from many sources and lets them be queried. e.g. Amazon Redshift can load data from various sources, and then allows SQL queries on it.

Data Warehouse

Getting data from its home to a data warehouse could be a straight copy, or could be an ETL task that reshapes the data into something that's better for analytic queries.
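
A minimal Polars sketch of what a small reshape might look like; the file name, columns, and the choice to pre-aggregate are all hypothetical:

import polars as pl

# extract: read the raw export
raw = pl.read_csv('sales_export.csv')

# transform: keep only the rows we analyse, and pre-aggregate to the
# level the analytic queries actually need
daily = (raw
    .filter(pl.col('status') == 'completed')
    .group_by('day', 'region')
    .agg(pl.col('amount').sum().alias('total_sales')))

# load: write something the warehouse can ingest
daily.write_parquet('daily_sales.parquet')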

The decision to have a data warehouse (vs just leaving the data in its original location) probably depends on how much data, how many data sources, how many analytic queries are being done, etc.

ClickHouse

As an example, let's look at ClickHouse, which is described as a fast open-source OLAP database.

It seems like it would be a good basis for a data warehouse. I'm going to do the same weather query as in the previous slide deck, but with a data warehouse workflow.

ClickHouse

First, we would have to get the data from its original location (files in this case) into the data warehouse.

I did an appropriate CREATE TABLE weather… in ClickHouse so the data has a destination.
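
The exact statement isn't reproduced here, but based on the GHCN-style columns used in the query later, it would be something roughly like this (the column types and engine choice are guesses):

client.command('''CREATE TABLE weather (
        station String,
        date String,
        observation String,
        value Int32,
        mflag String,
        qflag String,
        sflag String,
        obstime String
    ) ENGINE = MergeTree ORDER BY (station, date)''')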

ClickHouse

Then I loaded the data into a Polars DataFrame and converted it to Arrow, which can be inserted into a ClickHouse table.

weather_df = pl.read_csv(input, has_header=False,
    new_columns=colnames, schema_overrides={'date': pl.Utf8})
client.insert_arrow('weather', weather_df.to_arrow())

ClickHouse

That took about 18 s, which is slow, but there's almost certainly a better way I don't know. It worked, and now the data is in ClickHouse to be explored.

We imagine that I now have many queries to do on that data, but I only have one…

ClickHouse

Then the whole task can be done as one big query, materializing it into a table.

client.command('''CREATE TABLE maxdate Engine=Log
    AS SELECT
        formatDateTime(date, '%Y-%m-%d') AS date,
        MAX(tmax) AS max_temp
    FROM (SELECT
        station,
        parseDateTime(date, '%Y%m%d') AS date,
        value/10 AS tmax
    FROM weather
    WHERE observation='TMAX'
        AND station LIKE 'CA%'
        AND qflag='')
    GROUP BY date ORDER BY date''')

ClickHouse

Then I can query into a Pandas DataFrame and output files like the others.

maxtemp = client.query_df('SELECT * FROM maxdate')
maxtemp.to_json(output, orient='records', lines=True)

Total time to do the ETL task and output: 0.1 s. Not a fair comparison with the others, since the data had already been loaded into ClickHouse, but very fast.

ClickHouse

And that's kind of the point of a data warehouse…

If you have a lot of analytic queries to do, the cost of setting up a data warehouse and (regularly) ingesting data is worth it, because it's amortized over lots of data science work.

Data Lake

Or maybe instead of moving the data into a data warehouse, it makes sense to leave it where it already naturally is, and query from there: a data lake.

That can make sense if you have relatively few queries, or the data is in a good place/​format already.

e.g. Amazon Athena can use data from many sources in the AWS ecosystem and work on it with either SQL or Spark.

Data Lake

We could imagine using Spark as the core of a data lake: just read the data from wherever it lives and do the analysis in Spark.

customers = spark.read.jdbc('jdbc:postgresql://dbsvr/sales',
    table='customers')
sales = spark.read.format('jdbc') \
    .option('url', 'jdbc:postgresql://dbsvr/sales') \
    .option('query', 'SELECT * FROM sales JOIN products ON …') \
    .load()
employees = spark.read.csv('//hr_share/employees.csv',
    schema=employee_schema)
logs = spark.read.text('s3://our_webserver_logs/current/')
⋮
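
Once everything is a Spark DataFrame, data that never lived in the same system can be combined in one job, e.g. (with hypothetical column names):

from pyspark.sql import functions

# join the customers table (from PostgreSQL) with the sales data and
# aggregate; the column names here are made up
sales_by_region = (sales
    .join(customers, on='customer_id')
    .groupBy('region')
    .agg(functions.sum('amount').alias('total_sales')))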

Data Lake

The data lake approach can make sense if (some combination of) …

  • you have relatively few analytic queries to do;
  • the data is in a good format naturally;
  • the data is big enough it's a pain to move somewhere else;
  • the data is small enough it's easy to load as needed;

Summary

The costs of a data warehouse: another system to run, another copy of the data to store, data must be ingested regularly to keep it fresh, etc.

But the benefit: it's now in a system designed for fast OLAP work.

Summary

Some people definitely have the idea that analytics tool == data warehouse: don't fall into that trap. There's a tradeoff to be made when introducing any new piece of technology.

You know many ways to query data. Choose the right one for the situation you're in.