https://coursys.sfu.ca/2024fa-cmpt-732-g1/pages/
https://ggbaker.ca/732/
Instructor:
TAs, all previous-cohort Big Data students:
Whoever you are, I'm glad you're here.
It's “Programming for Big Data I”. So it's about:
A quote you have probably seen before:
Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it…
Dan Ariely
It's a buzzword. But a useful one.
Answer 1: Big enough that traditional techniques can't handle it.
The “traditional techniques” strawman is usually (1) highly structured data (2) in a relational database (3) on a single database server.
PostgreSQL can handle single tables up to 32 TB (but maybe not as quickly as you'd like). Big data must be bigger than that?
Answer 2: Big enough that one computer can't store/process it.
“One computer” can mean dozens of cores, TBs of RAM and many TB of storage. So bigger than that?
Many describe it with “The Four V's”: volume, velocity, variety, veracity (or 5 V's, or 7 V's…).
Honestly, the term big data is often used to mean “modern data processing, after about 2013”.
Even if most people don't work with truly-big data most of the time, it's nice to have the tools to do it when necessary.
Sometimes it's nice to know your computation can scale if a megabyte of data becomes many gigabytes.
Or maybe “can't be processed on one computer” should be “can't be processed in a time I'm willing to wait on one computer”. An overnight calculation that doesn't complete until noon isn't very useful.
Greg's functional definition: people say they're doing Big Data when they have so much data, it's annoying.
If our data is going to be too big for one computer, we presumably need many. Each one can store some of the data and do some of the processing and they can work together to generate “final” results.
This is a cluster.
Actually managing work on a cluster sucks. You have all of the problems from an OS course (concurrency, interprocess communication, scheduling, …) except magnified by being a distributed system (some computers fail, network latency, …).
The MPI tools are often used to help with this, but are still very manual.
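To see how manual that gets, here is a minimal sketch using mpi4py (an illustration only, not one of the course tools; the task and numbers are made up): you decide which process holds which chunk of data, move it around explicitly, and combine the results yourself.

# Sum of squares split across MPI processes: all coordination is done by hand.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # which process am I?
size = comm.Get_size()   # how many processes are running?

if rank == 0:
    # The root process splits the data into one chunk per process...
    data = list(range(1000))
    chunks = [data[i::size] for i in range(size)]
else:
    chunks = None

my_chunk = comm.scatter(chunks, root=0)            # ...and explicitly ships them out.
partial = sum(x * x for x in my_chunk)             # each process computes its piece
total = comm.reduce(partial, op=MPI.SUM, root=0)   # results are explicitly combined

if rank == 0:
    print('sum of squares:', total)

You would launch it with something like mpiexec -n 4 python yourscript.py, and anything beyond the happy path (a node dies, the data doesn't fit in memory) is also your problem.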
Do you want to worry about all that? Me neither. Obvious solution: let somebody else do it.
We will be using (mostly) the Apache Hadoop ecosystem for storing and processing data on a cluster. This includes:
We have a cluster for use in this course: 4 nodes, each 16 cores and 32 GB memory. We will explore in the assignments.
Not a big cluster, but big enough we can exercise the Hadoop tools, and do calculations that aren't realistic for one computer.
In many ways, a two or three node cluster is enough to learn with: you hit many of the same issues as with 1000 nodes.
Lecture: Mondays 10:30–12:20 in AQ3003. (Audio-only recording.)
Labs in Southeast Classroom Block 1010:
Greg and one TA will be in each lab (or perhaps occasionally two TAs).
Go to the lab section you're registered for.
There are computers in the lab room, and whatever laptop you have should also be workable. The compute cluster can be accessed remotely any time.
The assignments will be most of your time in the course.
Work on the assignments as you like; come into the lab during your lab time to work, and talk to us during the lab/consultation times about the assignments/course.
There will be three quizzes, tentatively:
They aren't worth a lot of marks, and their length will reflect that. (Tentatively, 30 minutes at the end of the lecture time.)
You are expected to do the work in this course yourself (or as a group for the project). Whenever you submit any work at the University, you're implicitly certifying “this is my own independent work”.
You cannot copy your solution from somewhere else, even if you understand it, even if you change it to try to make it unrecognizable. If you work with another student, we shouldn't be able to tell from the results.
I will actively look for academic dishonesty in this course, and deal with it according to University policy when found.
The current plan for the assignments is a good outline:
The “express computation on a cluster” tools we'll see: MapReduce, Spark RDDs, Spark DataFrames.
There is a progression from lower-level and more explicit to higher-level and more declarative.
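To make the progression concrete, here is a rough sketch (not an assignment solution; the input path and the word-count task are made up for illustration) of the same computation written with RDDs and with DataFrames:

# Word count two ways. Nothing runs until an action (e.g. .collect() or a write) is called.
from pyspark.sql import SparkSession, functions

spark = SparkSession.builder.appName('word count sketch').getOrCreate()
sc = spark.sparkContext

# Spark RDDs: lower-level and explicit. You spell out each transformation step.
counts_rdd = (
    sc.textFile('wordcount-input')                  # hypothetical input directory
    .flatMap(lambda line: line.split())
    .map(lambda w: (w, 1))
    .reduceByKey(lambda a, b: a + b)
)

# Spark DataFrames: higher-level and declarative. You describe the result you want,
# much like SQL, and Spark's optimizer decides how to execute it.
lines = spark.read.text('wordcount-input')
counts_df = (
    lines.select(functions.explode(functions.split(lines['value'], ' ')).alias('word'))
    .groupBy('word')
    .count()
)

MapReduce sits one step lower still: the same logic as the RDD version, but written as explicit map and reduce classes in Java.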
Programming languages:
MapReduce will use Java: we will use it relatively little. Suggestion: don't bother setting up an IDE and go command-line-only for the few times you need it.
We'll use Spark with Python. Most of your programming in this course will be in Python.
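As a rough preview (a generic sketch, not the official assignment template; the file names and arguments here are placeholders), most of your Spark programs will have roughly this shape and run on the cluster with spark-submit:

import sys
from pyspark.sql import SparkSession

def main(inputs, output):
    # read input, transform it, write results; the details vary by assignment
    spark = SparkSession.builder.appName('example').getOrCreate()
    data = spark.read.csv(inputs, header=True)
    data.write.csv(output, mode='overwrite')

if __name__ == '__main__':
    # run as: spark-submit example.py input-dir output-dir
    main(sys.argv[1], sys.argv[2])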