The term data science includes a lot of stuff.
We have covered a reasonable sampling of it this semester, but what have we missed?
A notable data science technology we ignored all semester: the R language.
R is an open-source language/toolchain designed for statistical analysis. It's the other common choice for data science work (alongside Python).
Short summary: it's probably worse as a programming language, but better as a stats tool.
We also didn't use Scala, which is a shame because it's the implementation language for Spark.
Or visualization tools like Tableau, which is also a shame: it's a good tool, and they hire a lot of people in their Vancouver office.
We skipped talking about databases because there's a separate course for that (which you should take), but they're critical data science tools.
Databases allow us to access particular (subsets of) records very quickly and easily. The data-in-files approach we have been using doesn't: it's usually all-or-nothing, reading the whole file (or most of it) even when we only need a few records.
Relational databases (and SQL) are the most common database tools. Both Pandas and Spark can read/write SQL data (or just… y'know… programming).
As mentioned earlier: relational databases and SQL have been the first-choice data storage tool for decades. That isn't by accident: they're good at what they do.
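Just to make that concrete, here's a minimal sketch of the Pandas side of it, assuming a local SQLite file and a hypothetical cities table (the filename, table, and columns are made up for the example):

```python
import sqlite3
import pandas as pd

# Connect to a (hypothetical) SQLite database file.
conn = sqlite3.connect('data.db')

# A database lets us ask for exactly the records we need, instead of
# reading an entire file and filtering it ourselves.
query = "SELECT name, population FROM cities WHERE population > 100000"
subset = pd.read_sql(query, conn)

# DataFrames can be written back as tables just as easily.
subset.to_sql('big_cities', conn, if_exists='replace', index=False)
conn.close()
```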
There are also many modern non-relational databases: NoSQL databases. These tend to be designed to scale across many servers, so can handle larger volume and velocity.
They usually don't allow joins (i.e. they don't deal with relations), since joins are (in general) hard to scale across servers.
In a data science context, it's probably reasonable to think of these as big data tools.
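As a rough illustration of that document-style, join-free access pattern, here's a small sketch using pymongo against a hypothetical MongoDB collection; the database name, collection, and fields are invented for the example:

```python
from pymongo import MongoClient

# Assumes a MongoDB server running locally (just for illustration).
client = MongoClient('mongodb://localhost:27017')
db = client['example']

# Records are stored as documents; related data is usually embedded
# (denormalized) rather than joined at query time.
db.comments.insert_one({
    'author': 'someuser',
    'score': 42,
    'subreddit': {'name': 'canada', 'subscribers': 1000000},
})

# Lookups by field are fast and scale across servers; there is no JOIN.
for doc in db.comments.find({'subreddit.name': 'canada'}):
    print(doc['author'], doc['score'])
```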
We had limited time for the topics we covered. Each one of them could have been (at least) a course all by itself.
Hopefully this course gave enough introduction that you know the basics and can learn more as necessary.
If the semester were a little longer, there are a few more things I would have wanted to talk about.
Some things were hard to fit into a course like this for practical reasons. Courses are always artificial…
Understanding the problem: If you're actually doing data science work, you'll find yourself being asked questions about whatever field you're in, which probably isn't computer science. Actually understanding that field is going to be a big part of any data analysis job.
Weather, colours, Reddit comments, and GPS positions were just convenient topics that you could understand without spending time on non-data-related stuff.
Understanding the question: Questions aren't usually handed to you as well-defined assignment questions. Much more likely is a request like "we want to know if there's anything interesting in the data we have about X."
Figuring out what they mean by "interesting" is not a skill to be ignored.
Again, it would have been off-topic to do this in exercises, but in the project, you need to understand and/or create the question.
Getting data: Using APIs to fetch data comes up a lot. So does cleaning up the arbitrarily weird data that you get.
We used data from an API (e.g. the Twitter data), but you were never actually required to collect it yourself. Using an API has logistical problems: registering to get API keys, having \(n\) students all hitting an API when the exercise is due, etc.
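The fetching part usually looks something like this sketch with the requests library; the endpoint and parameters here are made up, and a real API would also want a key and some respect for its rate limits:

```python
import requests

# A hypothetical JSON API endpoint and query parameters.
url = 'https://api.example.com/v1/observations'
params = {'station': 'YVR', 'limit': 100}

resp = requests.get(url, params=params, timeout=30)
resp.raise_for_status()  # fail loudly on HTTP errors

records = resp.json()  # typically a list of dicts we can hand to Pandas
```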
Data cleaning: The cleaning tasks in the exercises were all too well-defined to be realistic.
The project should open this door, at least a little.
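A typical cleanup pass might look like this Pandas sketch, where the filename, columns, and sentinel value are invented for the example; real data is rarely this polite about telling you what's wrong with it:

```python
import numpy as np
import pandas as pd

# A hypothetical messy CSV of observations.
obs = pd.read_csv('observations.csv')

# Typical steps: fix types, flag sentinel values as missing, drop junk rows.
obs['date'] = pd.to_datetime(obs['date'], errors='coerce')
obs['temperature'] = pd.to_numeric(obs['temperature'], errors='coerce')
obs.loc[obs['temperature'] < -90, 'temperature'] = np.nan  # e.g. -99 means "missing"
obs = obs.dropna(subset=['date', 'temperature'])
```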
Cloud computing resources: Having a Spark cluster sitting around is unusual, and maybe doing all Pandas/stats/ML analysis on your laptop isn't realistic either.
It's common to do this kind of work on a cloud computing platform: AWS, Google Cloud, Azure, Databricks.
Again, logistics: expecting everybody to set up an AWS account is difficult.
Communication: Communicating your results to people is also critical. As I have said before: results don't do any good if nobody knows about them.
I'm hoping your W courses point you in the right direction.
Some other courses that might be good to take if you're interested in data science…