Programming for Big Data I

CMPT 732, Fall 2024

Greg Baker

https://coursys.sfu.ca/2024fa-cmpt-732-g1/pages/
https://ggbaker.ca/732/

Us

Instructor:

  • Greg Baker

TAs, all previous-cohort Big Data students:

  • Fayad Chowdhury
  • Gitanshu
  • Monica Hao
  • Sai Manthena

Us

Whoever you are, I'm glad you're here.

[Images: Progress Pride flag, Every Child Matters logo, Black Lives Matter logo]

This Course

It's “Programming for Big Data I”. So it's about:

  1. programming,
  2. big data.

What is Big Data?

A quote you have probably seen before:

“Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it…” (Dan Ariely)

It's a buzzword. But a useful one.

How big is “Big Data”?

Answer 1: Big enough that traditional techniques can't handle it.

The “traditional techniques” strawman is usually (1) highly structured data, (2) in a relational database, (3) on a single database server.

PostgreSQL can handle single tables up to 32 TB (but maybe not as quickly as you'd like). Big data must be bigger than that?

How big is “Big Data”?

Answer 2: Big enough that one computer can't store/​process it.

“One computer” can mean dozens of cores, TBs of RAM, and many TBs of storage. So bigger than that?

“Big data” isn't always big.

Many describe it with “The Four V's” (or 5 V's, or 7 V's…).

  • Volume: the amount of data.
  • Velocity: the data arrives quickly and constantly.
  • Variety: many differently-structured (or less-structured) input data sets.
  • Veracity: some data might be incorrect, or of unknown correctness.

Honestly, the term big data is often used to mean “modern data processing, after about 2013”.

“Big data” isn't always big.

Even if most people don't work with truly-big data most of the time, it's nice to have the tools to do it when necessary.

Sometimes it's nice to know your computation can scale if a megabyte of data becomes many gigabytes.

“Big data” isn't always big.

Or maybe “can't be processed on one computer” should be “can't be processed in a time I'm willing to wait on one computer”.

An overnight calculation that doesn't complete until noon isn't very useful.

“Big data” isn't always big.

Greg's functional definition: people say they're doing Big Data when they have so much data, it's annoying.

What am I going to do with all this data?

Clusters

If our data is going to be too big for one computer, we presumably need many. Each one can store some of the data and do some of the processing, and they can work together to generate the “final” results.

This is a cluster.

Clusters

Actually managing work on a cluster sucks. You have all of the problems from an OS course (concurrency, interprocess communication, scheduling, …) except magnified by being a distributed system (some computers fail, network latency, …).

The MPI tools are often used to help with this, but are still very manual.
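To give a taste of how manual that is, here is a minimal sketch of summing squares across processes with mpi4py. (This is just for contrast: mpi4py and the file name are assumptions for illustration, not course tools.)

    # A minimal mpi4py sketch (mpi4py is an assumption here, not a course tool).
    # Run with something like: mpiexec -n 4 python sum_squares.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()   # this process's ID within the job
    size = comm.Get_size()   # total number of processes

    # We must manually decide which slice of the work each process does...
    local_sum = sum(x * x for x in range(1_000_000) if x % size == rank)

    # ...and manually combine the partial results on one process.
    total = comm.reduce(local_sum, op=MPI.SUM, root=0)
    if rank == 0:
        print(total)

Even this toy example makes us hand-partition the work and hand-combine the results; scheduling and fault tolerance are entirely our problem.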

Do you want to worry about all that? Me neither. Obvious solution: let somebody else do it.

Hadoop

We will be using (mostly) the Apache Hadoop ecosystem for storing and processing data on a cluster. This includes:

  • YARN: managing jobs in the cluster.
  • HDFS: Hadoop Distributed File System, for storing data on the cluster's nodes.
  • MapReduce: system to do parallel computation on YARN.
  • Spark: another system to do computation on YARN (or elsewhere).
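For a first taste of what computation on this stack looks like, here is a minimal word count in Spark's Python API. (A sketch only: the HDFS paths are hypothetical placeholders.)

    # A minimal PySpark word count (the paths are hypothetical).
    from pyspark import SparkContext

    sc = SparkContext(appName='wordcount-sketch')
    counts = (
        sc.textFile('hdfs:///user/you/input.txt')     # read lines from HDFS
          .flatMap(lambda line: line.split())         # one record per word
          .map(lambda word: (word, 1))                # pair each word with a 1
          .reduceByKey(lambda a, b: a + b)            # sum the 1s per word, cluster-wide
    )
    counts.saveAsTextFile('hdfs:///user/you/output')  # write results back to HDFS

The cluster machinery (partitioning the input, scheduling tasks, retrying failures) is handled by Spark and YARN, not by our code.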

Our Environment

We have a cluster for use in this course: 4 nodes, each with 16 cores and 32 GB of memory. We will explore it in the assignments.

Not a big cluster, but big enough that we can exercise the Hadoop tools and do calculations that aren't realistic for one computer.

In many ways, a two- or three-node cluster is enough to learn with: you have many of the same issues as with 1000 nodes.

Things you will do

  • Lectures (2 hours/week).
  • Labs (4 hours/week).
  • Assignments: complete on your own, with help in the labs/consultation times. 10 × 7%.
  • Quizzes: during the lecture time. 3 × 3%.
  • Project: a more open-ended big data analysis. 21%.

Lecture and Labs

Lecture: Mondays 10:30–12:20 in AQ3003. (Audio-only recording.)

Labs in Southeast Classroom Block 1010:

  • Tues/Thurs 12:30–14:20.
  • Tues/Thurs 14:30–16:20.

Greg and one TA will be in each lab (or perhaps occasionally two TAs).

Lecture and Labs

Go to the lab section you're registered for.

There are computers in the lab room, and whatever laptop you have should also be workable. The compute cluster can be accessed remotely at any time.

Assignments

The assignments will take up most of your time in the course.

Work on the assignments as you like, and talk to us during the lab/consultation times about the assignments/course.

Quizzes

There will be three quizzes, tentatively:

  • Oct 21 (lecture time, week 8)
  • Nov 4 (lecture time, week 10)
  • Nov 25 (lecture time, week 13)

They aren't worth a lot of marks, and their length will reflect that. (Tentatively, 30 minutes at the end of the lecture time.)

Expectations

  • The assignments and project are your chance to learn things, not a thing you have to do for marks.
  • You are expected to be in the labs during your lab times, and working on the assignments.
  • Everybody be nice to each other. [Code Of Conduct]

Expectations

You are expected to do the work in this course yourself (or as a group for the project). Whenever you submit any work at the University, you're implicitly certifying “this is my own independent work”.

You cannot copy your solution from somewhere else, even if you understand it, even if you change it to try to make it unrecognizable. If you work with another student, we shouldn't be able to tell from the results.

Expectations

I will actively look for academic dishonesty in this course, and deal with it according to University policy when found.

Course Topics

The current plan for the assignments is a good outline:

  1. YARN, HDFS, MapReduce
  2. MapReduce vs Spark
  3. Spark RDDs
  4. RDDs vs DataFrames
  5. Cloud Deployments
  6. Spark DataFrames
  7. DataFrames and ML
  8. NoSQL and Cassandra
  9. Spark + Cassandra
  10. Spark Streaming

Course Topics

The “express computation on a cluster” tools we'll see: MapReduce, Spark RDDs, Spark DataFrames.

There is a progression from lower-level and more explicit to higher-level and more declarative.
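As a rough sketch of that progression, here is the same little aggregation written twice, once against RDDs and once against DataFrames. (The data and column names are made up for illustration.)

    # The same sum-by-key, at two levels of the progression (toy data).
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName('progression-sketch').getOrCreate()
    pairs = [('a', 1), ('b', 2), ('a', 3)]

    # RDDs: we spell out the mechanics of the computation ourselves.
    print(spark.sparkContext.parallelize(pairs)
               .reduceByKey(lambda x, y: x + y).collect())

    # DataFrames: we declare the result we want; Spark plans how to compute it.
    df = spark.createDataFrame(pairs, ['key', 'value'])
    df.groupBy('key').agg(F.sum('value').alias('total')).show()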

Course Topics

Programming languages:

MapReduce will use Java; we will use it relatively little. Suggestion: don't bother setting up an IDE, and go command-line-only for the few times you need it.

We'll use Spark with Python. Most of your programming in this course will be in Python.