Computational Data Science

CMPT 353, Fall 2024

Greg Baker

https://coursys.sfu.ca/2024fa-cmpt-353-d1/pages/

This Course

It's Computational Data Science. We'll come back to what that is.

Course web site: in CourSys, https://coursys.sfu.ca/2024fa-cmpt-353-d1/pages/ .

This Course

Whoever you are, I'm glad you're here.

[Images: Progress Pride flag, Every Child Matters logo, Black Lives Matter logo]

Offering Strategy

My plan is to keep the parts of this course that worked best in-person pre-pandemic, and the parts that worked best online. Basically:

  • Lectures by video.
  • Hands-on help (office hours, etc) in-person.
  • Quizzes/Exams in-person.

Offering Strategy

Lectures will be pre-recorded.

During the lecture time, they will be available as a YouTube Premiere (≈ watch party) in ≈50 minute chunks. Greg will be available in YouTube chat to answer questions during that time.

They will be available as regular YouTube videos for viewing later.

Grades

  • Weekly exercises: 12 × 3.5% = 42%
  • Project: 26%
  • Quizzes: 2 × 10% = 20%
  • Final quiz: 12%

Late penalty: 20%/day.
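
As an illustration of how those weights combine, here is a minimal Python sketch. Only the weights come from the breakdown above; the function name `course_grade` and the sample scores are hypothetical, and the late penalty is deliberately left out because the slides do not say exactly how it is applied.

```python
# Weights from the grade breakdown above (percent of the final grade).
WEIGHTS = {
    "exercises": 12 * 3.5,   # 12 weekly exercises at 3.5% each
    "project": 26.0,
    "quizzes": 2 * 10.0,     # 2 quizzes at 10% each
    "final_quiz": 12.0,
}


def course_grade(scores):
    """Combine component scores (each a fraction in 0..1) into a percent."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# The weights should account for the whole grade.
assert abs(sum(WEIGHTS.values()) - 100.0) < 1e-9

# Hypothetical student, for illustration only.
example = {"exercises": 0.9, "project": 0.8, "quizzes": 0.7, "final_quiz": 0.75}
print(round(course_grade(example), 1))
```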

Exercises

Due Fridays. My goal: make sure you actually try out the things we have talked about and see the reality of applying them.

Will contain some short problems to get you used to the tools, expanding to a more interesting real problem.

Late penalty: 20%/day.

Project

In the lectures/​exercises, I intend to explore what I consider the core of data science.

The project will let you integrate those techniques, and explore ideas on the edges of that, depending on what interests you.

Project

I will post project topic ideas that are intended as starting points for your thinking about a project (not as ready-to-go project topics).

We can discuss project ideas in the lab or discussion forum.

Project

A few details:

  • Groups of 2–3.
  • Take the problem. Use the techniques from the course, and explore others to sensibly attack the problem.
  • In a report, summarize your methods, findings, and what worked/​didn't.

Quizzes/Exam

Quizzes: 10% each, in-person. Dates may change if necessary, but planned during lecture times:

  • Oct 16 (Wednesday of week 7),
  • Nov 13 (Wednesday of week 11).

Final Quiz: whenever our final exam is scheduled.

All closed book, on paper. Missing/​excusing exams requires medical documentation.

Us

Instructor: Greg Baker <[email protected]>.

Office hours: Tuesdays and Thursdays 11:00–12:00 in CSIL (ASB 9804).

Us

TAs: to be announced.

Office hours: details later.

Us

Greg's honest order of priority when dealing with student queries:

  1. In-person questions in CSIL, etc.
  2. Public questions in the discussion forum.
  3. Anything else that helps >1 student.
  4. Private questions in the discussion forum.
  5. Email.

Lectures and Labs

Mondays: lectures premiere as scheduled, with live chat. Regular YouTube videos after that.

Wednesdays: usually no lectures. The TAs and I will all be available for consultation during the lecture time in CSIL (ASB 9804+9840).

References

Textbooks: none.

Possible reference material:

References

Possible reference material (continued):

Programming

Python 3 will be the primary programming language used in the course. If you aren't comfortable with it, you need to be (very) soon.

StackExchange Data Science tags (as of April 2023):

Language    Tagged Qs
Python      6613
R           1478
Matlab      156
Java        58
Scala       48

Programming

This will be a programming-heavy course. If you don't really like programming, this might not be the course for you.

The programming style will be very library-heavy, which is realistic in the modern world. We will use many libraries: NumPy, Pandas, matplotlib, scikit-learn, statsmodels, ….

Programming

That means you'll spend a lot of time reading the docs and fighting to make the tools do what you want them to, and less time implementing the logic yourself. That's also realistic.

The code you would have written would almost certainly have been slower and worse.
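
A small sketch of that trade-off, assuming NumPy is installed: the hand-written loop and the library call compute the same mean, but the library version is a single line and typically runs much faster on large arrays. The function name `mean_by_hand` and the sample data are made up for illustration.

```python
import numpy as np


def mean_by_hand(values):
    """Arithmetic mean computed with an explicit Python loop."""
    total = 0.0
    count = 0
    for v in values:
        total += v
        count += 1
    return total / count


# Illustrative data: 1.0, 2.0, ..., 100.0
data = np.arange(1, 101, dtype=float)

by_hand = mean_by_hand(data)   # loop runs in the Python interpreter
by_library = data.mean()       # one call into NumPy's compiled code

print(by_hand, by_library)
```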

Programming

I will feel free to increase the amount of assignment work a little from my usual level because of the missing hour of lecture.

Expectations

To get credit for this course, I expect you to demonstrate that you know how to use programming techniques to manipulate and analyse data. That means:

  • A pass on the weighted average of the stuff where you demonstrate programming ability: exercises + project.
  • A pass on the weighted average of the quizzes.

Failure to do these may result in failing the course.
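
One possible reading of those two conditions, sketched in Python. Several things here are assumptions rather than course policy: the 50% pass mark, the inclusion of the final quiz in the "quizzes" average, and the names `weighted_average` and `pass_flags`. The weights reuse the grade breakdown (exercises 42, project 26, quizzes 20, final quiz 12).

```python
PASS_MARK = 50.0  # assumed threshold, not stated on this slide


def weighted_average(parts):
    """parts: list of (percent_score, weight) pairs."""
    total_weight = sum(weight for _, weight in parts)
    return sum(score * weight for score, weight in parts) / total_weight


def pass_flags(exercises, project, quizzes, final_quiz):
    """Return whether each of the two components is at or above PASS_MARK.

    All arguments are percent scores (0..100) for that component.
    """
    programming = weighted_average([(exercises, 42), (project, 26)])
    # Assumption: the final quiz counts toward the quiz average.
    tests = weighted_average([(quizzes, 20), (final_quiz, 12)])
    return {
        "programming_pass": programming >= PASS_MARK,
        "quizzes_pass": tests >= PASS_MARK,
    }


# Hypothetical student, for illustration only.
print(pass_flags(exercises=80, project=70, quizzes=45, final_quiz=60))
```

A check like this would only flag cases for a closer look; it does not decide the outcome on its own.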

Expectations

That rule isn't intended to fail someone just because they get 49% on the exams: it will be applied on an individual basis, with a judgement call on the question "has this student demonstrated that they understand the basic concepts of the course?"

Expectations

Academic Honesty: it's important, as always.

If you're using an online source, leave a comment.

def this_function(p1, p2):
    # adapted from http://stackoverflow.com/a/21623206/1236542
    ...

That's all I ask, but remember to do it.

Expectations

You are expected to do the work in this course yourself (or as a group for the project). Whenever you submit any work at the University, you're implicitly certifying "this is my own independent work."

Expectations

Examples of things that are not okay and will be treated as academic dishonesty:

  • Using a tool to create some code, cleaning it up a little, and submitting it.
  • Finding a solution (online or from your friend), looking at it until you really, really understand it, changing it enough that you think I won't notice the similarity, and submitting it.
  • Sitting beside your friend and creating a single solution together, even though you're touching different keyboards.

Expectations

The quizzes are regular tests: individual work, closed book.

I will be asking for a grade of FD in the course for any academic dishonesty on quizzes.

Computational Data Science?

Computational Data Science: data science, but with computation as the focus.

But what is data science?

Data Science?

According to Wikipedia: an interdisciplinary academic field that uses… [various disciplines] to extract or extrapolate knowledge and insights from… data.

According to Pat Hanrahan, Tableau Software: [The combination of] business knowledge, analytical skills, and computer science.

According to Daniel Tunkelang, LinkedIn: [The ability to] obtain, scrub, explore, model and interpret data, blending hacking, statistics and machine learning.

Data Science?

According to Joel Grus: There's a joke that says a data scientist is someone who knows more statistics than a computer scientist and more computer science than a statistician.… We'll say that a data scientist is someone who extracts insights from messy data.

Data Science?

According to Drew Conway, Alluvium:

Data Science?

My definitions:

Data Science
    You get some data. Then what do you do to get answers from it? Whatever that is, that's data science.
Computational Data Science
    You get some data. You know how to program. Then what do you do?

Why Data Science?

Why is data science so popular? Is it new?

There's more data being collected: web access logs, purchase history, click-through rates, location history, sensor data, ….

Sometimes the volume of data is big: too big to manage easily. That's where big data starts.

Why Data Science?

People want answers/​insights from that data: Is the marketing campaign working? Is the UI actually usable? What if we did X instead of Y?

New techniques: Machine learning lets us attack questions that were previously unanswerable. Computer scientists are realizing that statistics is important; statisticians are realizing that computer science is important.

Topics (1)

  • Data science: what is it? How does data become useful?
  • Data processing tools: Python + NumPy + Pandas; analysis tools in Python.
  • Data acquisition. Or where do we find data?
  • Getting data into shape: cleaning; extract/​transform/​load.

Topics (2)

  • Making sense of data: statistics. Or it turns out that stats course was useful.
  • Making sense of data: machine learning. Or it's like AI, except it works.
  • Data analysis strategies.

Topics (3)

  • Big data tools: Apache Spark and a compute cluster.
  • Data visualization and communicating results.