All of the DataFrame tools we have looked at do some kind of read-all-of-the-data-and-calculate-on-it task.
Most of the data we worked with was from files, but there are other options.
Reading files from disk is fine, but if you have 100 calculations/summaries/reports to do on a data set, the work of reading+parsing the files adds up.
Another data tool category: relational databases like PostgreSQL, MySQL, Microsoft SQL Server.
We saw DuckDB doing DataFrame-like things, but starting to feel like a database: work was specified with SQL queries, using syntax you (probably) already know.
Are we converging in the middle? Maybe we should just do the work in a database? That might be where the data lives anyway.
Think about a web site with a database backend: that's the authoritative copy of the data. Why not just query it there for data science things?
Problem #1: the database is probably already busy.
The web site (or whatever) probably does a lot of queries and expects them to complete quickly. Everybody's going to be mad if the data science team throws a bunch of huge queries in, slowing down the web site.
Problem #2: it's not really what most relational databases are designed for.
The common relational databases (and the data schemas people usually design for them) are optimized for OLTP (OnLine Transaction Processing). e.g. insert one row; query to find purchases for one customer.
For data science work, we're usually imagining OLAP (OnLine Analytical Processing) workloads. e.g. process all of the purchases and aggregate by day to find the days with the most sales from customers who aren't part of the rewards program.
Common relational databases can do that, but doing it alongside the production OLTP work is probably a bad idea.
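To make the contrast concrete, here is a rough sketch of the two query shapes. The table and column names (purchases, customers, rewards_member, …) are invented for illustration, not from any real schema.

# OLTP: touch a handful of rows and return quickly.
oltp_query = "SELECT * FROM purchases WHERE customer_id = 12345"

# OLAP: scan and aggregate a large fraction of the table.
olap_query = '''
    SELECT CAST(purchased_at AS DATE) AS day, COUNT(*) AS n_sales
    FROM purchases p
    JOIN customers c ON p.customer_id = c.customer_id
    WHERE NOT c.rewards_member
    GROUP BY day
    ORDER BY n_sales DESC
'''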
Problem #3: the data you need might not all be in the same place.
It's common to have multiple systems running within an organization, so it's not going to be possible to query across them. e.g. at SFU, goSFU contains student/course information, HAP has HR and payroll stuff, KUALI keeps track of research grants, …
So, we invent a data warehouse.
The idea: take all of the data you need to query for analytics, and get it all into one place (the data warehouse) so you can query efficiently.
A data warehouse could be a regular relational database that mirrors the original data (e.g. a read-only replica or nightly data dump) and is used for analytic queries.
Or it could be a distinct data warehouse system that aggregates data from many sources and lets them be queried. e.g. Amazon Redshift can load data from various sources, and then allows SQL queries on it.
Getting data from its home to a data warehouse could be a straight copy, or could be an ETL task that reshapes the data into something that's better for analytic queries.
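As a sketch of what such an ETL job might look like: everything here is hypothetical (the source database, the tables, the warehouse destination), and it assumes Polars with connectorx/SQLAlchemy available for the database reads and writes.

import polars as pl

# Extract: pull yesterday's orders from the production (OLTP) database.
orders = pl.read_database_uri(
    'SELECT * FROM orders WHERE order_date = current_date - 1',
    'postgresql://readonly@dbsvr/sales',
)

# Transform: aggregate into a shape that's convenient for analytic queries.
daily_sales = orders.group_by('product_id').agg(
    pl.col('quantity').sum().alias('units_sold'),
    pl.col('total').sum().alias('revenue'),
)

# Load: append to a table in the (hypothetical) warehouse database.
daily_sales.write_database(
    'daily_sales',
    'postgresql://etl@warehouse-svr/analytics',
    if_table_exists='append',
)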
The decision to have a data warehouse (vs just leaving the data in its original location) probably depends on how much data, how many data sources, how many analytic queries are being done, etc.
As an example, let's look at ClickHouse, which is described as a fast open-source OLAP
database.
It seems like it would be a good basis for a data warehouse. I'm going to do the same weather query as in the previous slide deck, but with a data warehouse workflow.
First, we would have to get the data from its original location (files in this case) into the data warehouse.
I did an appropriate CREATE TABLE weather… in ClickHouse so the data has a destination.
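Something along these lines, using the clickhouse-connect client. The host is made up, the columns are a guess at the GHCN-style layout (the queries below only rely on station, date, observation, value, and qflag), and the engine is just one reasonable choice, not necessarily what was actually used.

import clickhouse_connect

# Connect to the warehouse (hypothetical host).
client = clickhouse_connect.get_client(host='warehouse.example.com')

# Guessed column layout; date is kept as a string and parsed later.
client.command('''
    CREATE TABLE weather (
        station String,
        date String,
        observation String,
        value Int32,
        mflag String,
        qflag String,
        sflag String,
        obstime String
    ) ENGINE = MergeTree ORDER BY (station, date)
''')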
Then I loaded the data into a Polars DataFrame, converted that to Arrow, and that can be loaded into a ClickHouse table.
weather_df = pl.read_csv(input, has_header=False, new_columns=colnames,
    schema_overrides={'date': pl.Utf8})
client.insert_arrow('weather', weather_df.to_arrow())
That took about 18 s, which is slow, but there's almost certainly a better way that I don't know about. It worked, and now the data is in ClickHouse, ready to be explored.
We imagine that I now have many queries to do on that data, but I only have one…
Then the whole task can be done as one big query, materializing it into a table.
client.command('''
    CREATE TABLE maxdate Engine=Log AS
    SELECT formatDateTime(date, '%Y-%m-%d') AS date, MAX(tmax) AS max_temp
    FROM (
        SELECT station, parseDateTime(date, '%Y%m%d') AS date, value/10 AS tmax
        FROM weather
        WHERE observation='TMAX' AND station LIKE 'CA%' AND qflag=''
    )
    GROUP BY date
    ORDER BY date
''')
Then I can query into a Pandas DataFrame and output files like the others.
maxtemp = client.query_df('SELECT * FROM maxdate')
maxtemp.to_json(output, orient='records', lines=True)
Total time to do the ETL task and write the output: 0.1 s. That's not a fair comparison with the others, since the data had already been loaded into ClickHouse, but it is very fast.
And that's kind of the point of a data warehouse…
If you have a lot of analytic queries to do, the cost of setting up a data warehouse and (regularly) ingesting data is worth it, because it's amortized over lots of data science work.
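For example, any follow-up question is now just another quick query against the warehouse. This one is made up to illustrate the point, but it uses the same weather table and columns as above.

# Another analytic question, answered directly from the warehouse.
hottest_stations = client.query_df('''
    SELECT station, MAX(value)/10 AS max_temp
    FROM weather
    WHERE observation = 'TMAX' AND qflag = ''
    GROUP BY station
    ORDER BY max_temp DESC
    LIMIT 10
''')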
Or maybe instead of moving the data into a data warehouse, it makes sense to leave it where it already naturally
is, and query from there: a data lake.
That can make sense if you have relatively few queries, or the data is in a good place/format already.
e.g. Amazon Athena can use data from many sources in the AWS ecosystem and work on it with either SQL or Spark.
We could imagine using Spark as the core of a data lake: just read the data from wherever it lives and do the analysis in Spark.
customers = spark.read.format('jdbc') \
    .option('url', 'jdbc:postgresql://dbsvr/sales') \
    .option('dbtable', 'customers').load()
sales = spark.read.format('jdbc') \
    .option('url', 'jdbc:postgresql://dbsvr/sales') \
    .option('query', 'SELECT * FROM sales JOIN products ON …').load()
employees = spark.read.csv('//hr_share/employees.csv', schema=employee_schema)
logs = spark.read.text('s3://our_webserver_logs/current/')
⋮
The data lake approach can make sense if (some combination of): you have relatively few queries to do; the data is already stored in a good place/format naturally; …
The costs of a data warehouse: another system to run, another copy of the data to store, data must be ingested regularly to keep it fresh, etc.
But the benefit: it's now in a system designed for fast OLAP work.
Some people definitely have the idea that analytics tool == data warehouse: don't fall into that trap. There's a tradeoff to be made when introducing any new piece of technology.
You know many ways to query data. Choose the right one for the situation you're in.