The term data science includes a lot of stuff.
We have covered a reasonable sampling of it this semester, but what have we missed?
A notable data science technology we ignored all semester: the R language.
R is an open-source language/toolchain designed for statistical analysis. It's the other common choice for data science work (alongside Python).
Short summary: it's probably worse as a programming language, but better as a stats tool.
We also didn't use Scala, which is a shame because it's the implementation language for Spark.
Or visualization tools like Tableau, which is also a shame: it's a good tool, and they hire a lot of people in their Vancouver office.
We skipped talking about databases because there's a separate course for that (which you should take), but they're critical data science tools.
Databases allow us to access particular (subsets of) records very quickly and easily. The data-in-files approach we have been using doesn't: it's usually all-or-nothing, reading the whole file (or most of it) even when we only need a few records.
Relational databases (and SQL) are the most common database tools. Both Pandas and Spark can read/write SQL data (or just… y'know… programming).
As mentioned earlier: relational databases and SQL have been the first-choice data storage tool for decades. That isn't by accident: they're good at what they do.
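Just to make that concrete, here's a minimal sketch of the Pandas side of it, assuming a local SQLite file and a hypothetical cities table (the filename, table, and columns are made up for the example):

```python
import sqlite3
import pandas as pd

# Connect to a (hypothetical) SQLite database file.
conn = sqlite3.connect('data.db')

# A database lets us ask for exactly the records we need, instead of
# reading an entire file and filtering it ourselves.
query = "SELECT name, population FROM cities WHERE population > 100000"
subset = pd.read_sql(query, conn)

# DataFrames can be written back as tables just as easily.
subset.to_sql('big_cities', conn, if_exists='replace', index=False)
conn.close()
```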
There are also many modern non-relational databases: NoSQL databases. These tend to be designed to scale across many servers, so can handle larger volume and velocity.
They usually don't allow joins (i.e. they don't deal with relations), since joins are (in general) hard to scale across servers.
In a data science context, it's probably reasonable to think of these as big data tools.
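As a rough illustration of that document-style, join-free access pattern, here's a small sketch using pymongo against a hypothetical MongoDB collection; the database name, collection, and fields are invented for the example:

```python
from pymongo import MongoClient

# Assumes a MongoDB server running locally (just for illustration).
client = MongoClient('mongodb://localhost:27017')
db = client['example']

# Records are stored as documents; related data is usually embedded
# (denormalized) rather than joined at query time.
db.comments.insert_one({
    'author': 'someuser',
    'score': 42,
    'subreddit': {'name': 'canada', 'subscribers': 1000000},
})

# Lookups by field are fast and scale across servers; there is no JOIN.
for doc in db.comments.find({'subreddit.name': 'canada'}):
    print(doc['author'], doc['score'])
```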
We had limited time for the topics we covered. Each one of them could have been (at least) a course all by itself.
Hopefully this course gave enough introduction that you know the basics and can learn more as necessary.
If the semester were a little longer, there are a few more things I would have wanted to talk about.
Some things were hard to fit into a course like this for practical reasons. Courses are always artificial…
Understanding the problem: If you're actually doing data science work, you'll find yourself being asked questions about whatever field you're in, which probably isn't computer science. Actually understanding that field is going to be a big part of any data analysis job.
Weather, colours, Reddit comments, and GPS positions were just convenient topics that you could understand without spending time on non-data-related stuff.
Understanding the question: Questions aren't usually handed to you as well-defined assignment questions. Much more likely is a request like "we want to know if there's anything interesting in the data we have about X."
Figuring out what they mean by "interesting" is not a skill to be ignored.
Again, it would have been off-topic to do this in exercises, but in the project, you need to understand and/or create the question.
Getting data: Using APIs to fetch data comes up a lot. So does cleaning up the arbitrarily weird data that you get.
We used data from an API (e.g. the Twitter data), but you were never actually required to collect it yourself. Using an API has logistical problems: registering to get API keys, having \(n\) students all hitting an API when the exercise is due, etc.
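The fetching part usually looks something like this sketch with the requests library; the endpoint and parameters here are made up, and a real API would also want a key and some respect for its rate limits:

```python
import requests

# A hypothetical JSON API endpoint and query parameters.
url = 'https://api.example.com/v1/observations'
params = {'station': 'YVR', 'limit': 100}

resp = requests.get(url, params=params, timeout=30)
resp.raise_for_status()  # fail loudly on HTTP errors

records = resp.json()  # typically a list of dicts we can hand to Pandas
```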
Data cleaning: The cleaning tasks in the exercises were all too well-defined to be realistic.
The project should open this door, at least a little.
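A typical cleanup pass might look like this Pandas sketch, where the filename, columns, and sentinel value are invented for the example; real data is rarely this polite about telling you what's wrong with it:

```python
import numpy as np
import pandas as pd

# A hypothetical messy CSV of observations.
obs = pd.read_csv('observations.csv')

# Typical steps: fix types, flag sentinel values as missing, drop junk rows.
obs['date'] = pd.to_datetime(obs['date'], errors='coerce')
obs['temperature'] = pd.to_numeric(obs['temperature'], errors='coerce')
obs.loc[obs['temperature'] < -90, 'temperature'] = np.nan  # e.g. -99 means "missing"
obs = obs.dropna(subset=['date', 'temperature'])
```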
Cloud computing resources: Having a Spark cluster sitting around is unusual, and maybe doing all Pandas/stats/ML analysis on your laptop isn't realistic either.
It's common to do this kind of work on a cloud computing platform: AWS, Google Cloud, Azure, Databricks.
Again, logistics: expecting everybody to set up an AWS account is difficult.
Communication: Communicating your results to people is also critical. As I have said before: results don't do any good if nobody knows about them.
I'm hoping your W courses point you in the right direction.
Some other courses that might be good to take if you're interested in data science…