Returning to the big question for the course: You get some data. Then what do you do?
The answer is long, but we can get started…
There is a common collection of steps that are necessary.
People starting data science usually imagine spending their time analyzing the data.
The reality: it's not unusual to spend 90% of the time and effort on the other steps, especially cleaning.
Remember this? (The data science Venn diagram from earlier.)
Refining and understanding the question often falls into the substantive expertise category: you have to know the problem domain to understand what question is being asked, and how it must be answered.
e.g. The seemingly well-formed question: how many of our students have a GPA below 2.4?
"Our students"? CMPT majors? Any CMPT program? Anyone taking many CMPT courses? Anyone who sees FAS advisors? Withdrawn or graduated?
Practical reality: we won't be spending much time on interpreting the question.
I'll try to stick with problem domains you understand, or are easy to figure out.
Sometimes you'll be handed the perfect data to answer your question.
Sometimes you'll have to find it.
Sometimes you'll have to collect it.
Sometimes you'll have to give the best answer you can with imperfect data.
There are many ways you might get the data you need into your program.
More later.
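As a small preview, a few common options in Python might look like the sketch below. The file names and URL are made-up placeholders, not anything we'll actually use.

    import pandas as pd
    import requests

    # Read a local file someone handed you.
    grades = pd.read_csv('student-grades.csv')

    # Fetch data from a web API that returns JSON.
    response = requests.get('https://api.example.com/students')
    students = pd.DataFrame(response.json())

    # Read a spreadsheet someone exported for you.
    enrolment = pd.read_excel('enrolment.xlsx')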
Just because you have data doesn't mean it's the data you need. It's very common to have data arrive with… the same thing represented inconsistently (is SFU == Simon Fraser University == Simon Fraser?).
It's also possible to have data that isn't as correct as you might like. You can have…
Getting the data you found into the form you need can be a huge amount of work.
We will talk about some common techniques, but it's often just a lot of work + programming + manual repair.
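A hedged sketch of what a bit of that work can look like in pandas; the column names and inconsistent values here are invented for illustration:

    import pandas as pd
    import numpy as np

    data = pd.DataFrame({
        'school': ['SFU', 'Simon Fraser University', 'Simon Fraser', 'UBC'],
        'gpa': [3.1, np.nan, 2.2, 3.7],
    })

    # Map different spellings of the same thing onto one canonical value.
    data['school'] = data['school'].replace({
        'Simon Fraser University': 'SFU',
        'Simon Fraser': 'SFU',
    })

    # Decide what to do with missing values: drop them, fill them, or flag them.
    cleaned = data.dropna(subset=['gpa'])

Real cleaning is usually messier than this: even building the mapping of canonical values can take real digging.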
This is the part you thought you'd spend a lot of time doing.
The question here: how do you take the data, and get out the answers you want?
There are many ways to analyze data to get answers.
We will focus on inferential statistics and machine learning.
That leaves out some important techniques, but it's a good place to start.
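As a tiny preview of what those two look like in Python, on made-up data (the group means and the classifier choice are just for illustration):

    import numpy as np
    from scipy import stats
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(seed=1)
    group_a = rng.normal(loc=3.0, scale=0.5, size=100)   # e.g. GPAs from one program
    group_b = rng.normal(loc=2.8, scale=0.5, size=100)   # ... and another

    # Inferential statistics: is the difference in means real, or just noise?
    result = stats.ttest_ind(group_a, group_b)
    print(result.pvalue)

    # Machine learning: learn to predict a label from features.
    X = np.concatenate([group_a, group_b]).reshape(-1, 1)
    y = np.array([0] * 100 + [1] * 100)
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X, y)
    print(model.score(X, y))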
Analysis results aren't much use if you don't know what they mean or tell anybody about them. It's not always easy to (correctly) interpret results, or explain them to others.
This is another question best addressed by another course. We will talk a little about it.
The full data analysis is often broken up into many steps.
These could be separate functions in a single program that does the whole thing.
This is often impractical: it calls a remote API every time you test it; test runs might be time-prohibitive; intermediate results might be meaningful to examine on their own.
Maybe separate programs for each step (or collection of steps)?
Re-running takes more work. Looks a little sloppy.
A well-structured pipeline could be as simple as organizing your code well.
01-scrape-data.py
02-clean-missing-fields.py
03-generate-features.py
04-train-model.py
05-score-model.py
06-generate-plots.py
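Each step reads what the previous one wrote and writes something for the next. A minimal sketch of one such step, assuming the file names above and a hypothetical input called scraped-data.csv:

    # 02-clean-missing-fields.py: one step of the (hypothetical) pipeline above.
    # It reads what the previous step wrote and writes a file for the next step.
    import pandas as pd

    def main():
        scraped = pd.read_csv('scraped-data.csv')
        cleaned = scraped.dropna()   # whatever "cleaning" means for this data
        cleaned.to_csv('cleaned-data.csv', index=False)

    if __name__ == '__main__':
        main()

Because each step leaves its output on disk, you can inspect intermediate results and re-run only the steps that changed.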
It's sometimes preferable to have a very interactive way to experiment with your data. Usually called a notebook.
Ones you may have used or heard of: MATLAB, Maple, RStudio, most stats programs.
With Python (and other languages), Jupyter (formerly IPython Notebook) is very useful.
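A quick exploration in a notebook might look like this sketch; each chunk would be its own cell, and the file and column names are made-up examples:

    import pandas as pd
    import matplotlib.pyplot as plt

    data = pd.read_csv('cleaned-data.csv')

    # One cell: summary statistics, displayed right below the cell.
    data.describe()

    # Another cell: a quick histogram to eyeball one column's distribution.
    plt.hist(data['gpa'])
    plt.show()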
For this course, I will ask for some combination of these. Always using the same one would be dishonest.
There are perfectly good reasons to have a manual step in the pipeline.
Some analysis may be best performed by a person and manually entered before running step \(n+1\).
Maybe a step would take 10 minutes to do by hand, or hours to automate.
Maybe it's easy to automatically calculate in 90% of the cases, leaving 10% "don't know" cases for manual intervention.
In this course, we'll automate. In reality, it's not always obvious what's easiest.
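One plausible shape for that 90%/10% split, with invented file names, column names, and rule:

    import pandas as pd

    records = pd.read_csv('records.csv')

    # Suppose we can only categorize confidently when this field is present.
    can_automate = records['category_code'].notna()

    auto = records[can_automate].copy()
    auto['category'] = auto['category_code'].map({1: 'undergrad', 2: 'grad'})
    auto.to_csv('auto-categorized.csv', index=False)

    # Everything else goes to a file for a person to look at and fill in.
    records[~can_automate].to_csv('needs-manual-review.csv', index=False)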
The data pipeline isn't always obvious: it's easy to end up with several somewhat-related programs that need to be run a certain way to get answers.
My suggestion: remember that you're creating some kind of workflow for your data. Organize your code so you can easily document/remember how to get things done.
Don't forget the "document" part.
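One way to document and automate the workflow at once is a small driver script that records the steps in the order they run; this sketch assumes the numbered scripts from earlier:

    # run-all.py: run (and document) every step of the pipeline, in order.
    import subprocess

    STEPS = [
        '01-scrape-data.py',
        '02-clean-missing-fields.py',
        '03-generate-features.py',
        '04-train-model.py',
        '05-score-model.py',
        '06-generate-plots.py',
    ]

    for step in STEPS:
        print('Running', step)
        subprocess.run(['python3', step], check=True)

Even if you rarely run it end-to-end, the list itself records how to get from raw data to answers.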