All of the TAs are previous-cohort Big Data students.
Greg's orientation summary:
It's “Programming for Big Data I”, so it's about programming, for big data.
A quote you have probably seen before:
“Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it…” (Dan Ariely)
It's a buzzword, but a useful one. So how big does data have to be before it counts as “big”?
Answer 1: Big enough that traditional techniques can't handle it.
The “traditional techniques” strawman is usually (1) highly structured data (2) in a relational database (3) on a single database server.
PostgreSQL can handle single tables up to 32 TB (but maybe not as quickly as you'd like). Big data must be bigger than that?
Answer 2: Big enough that one computer can't store/process it.
“One computer” can mean dozens of cores, TBs of RAM, and many TBs of storage. So bigger than that?
Honestly, the term big data is often used to mean “modern data processing, 2013–2018 edition”.
Even if most people don't work with truly-big data most of the time, it's nice to have the tools to do it when necessary.
Sometimes it's nice to know your computation can scale if a megabyte of data becomes many gigabytes.
Maybe “can't be processed on one computer” should really be “can't be processed in a time I'm willing to wait on one computer.” An overnight calculation that doesn't complete until noon isn't very useful.
Greg's functional definition: people say they're doing Big Data when they have so much data, it's annoying.
If our data is going to be too big for one computer, we presumably need many. Each one can store some of the data and do some of the processing and they can work together to generate “final” results.
This is a cluster.
Actually managing work on a cluster sucks. You have all of the problems from an OS course (concurrency, interprocess communication, scheduling, …) except magnified by being a distributed system (some computers fail, network latency, …).
Do you want to worry about all that? Me neither. Obvious solution: let somebody else do it.
We will be using (mostly) the Apache Hadoop ecosystem for storing and processing data on a cluster. This includes HDFS for distributed storage, YARN for scheduling work across the cluster, and frameworks like MapReduce and Spark for computation.
We have a cluster for use in this course: 6 nodes, each with 16 cores and 110 GB of memory. We will explore it in the assignments.
Not a big cluster, but big enough that we can exercise the Hadoop tools and do calculations that aren't realistic for one computer.
In many ways, a two- or three-node cluster is enough to learn with: you hit many of the same issues as with 1000 nodes.
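As a taste of what “letting somebody else do it” looks like: a minimal sketch, assuming a working PySpark installation (the HDFS path is a placeholder, not course data). The code reads like an ordinary calculation; Spark quietly handles splitting the data, scheduling the work, and retrying failures.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('cluster-taste').getOrCreate()
sc = spark.sparkContext

# Looks like an ordinary single-machine calculation...
total = (
    sc.textFile('hdfs:///user/you/numbers.txt')  # placeholder path
      .map(int)                                  # parse one number per line
      .sum()
)

# ...but Spark split the file into partitions, scheduled tasks on
# whichever executors the cluster offered, and retried any that failed.
print(total)
spark.stop()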
Lecture: Mondays 12:30–14:20.
Labs in ASB 10928:
One instructor and one TA will be in each lab.
There are computers in the lab room, but whatever laptop you have should also work. Go to the lab section you're registered for.
The current plan for the assignments is a good outline:
We will learn to express computation on a cluster. Tools we'll see: MapReduce, Spark RDDs, Spark DataFrames.
There is a progression from lower-level and more explicit to higher-level and more declarative (contrasted in the sketch at the end of this section).
MapReduce will use Java, but we will use it relatively little. Suggestion: don't bother setting up an IDE; go command-line-only for the few times you need it.
We'll use Spark with Python. Most of your programming in this course will be in Python.
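To make the progression concrete, here is a hedged word-count sketch in both Spark APIs. This is not assignment code: it assumes a working PySpark installation, and 'input.txt' is a placeholder. The RDD version spells out every transformation by hand; the DataFrame version declares the result and lets Spark's optimizer plan the work. (The equivalent MapReduce job would be several dozen lines of Java.)

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName('wordcount-sketch').getOrCreate()
sc = spark.sparkContext

# Lower-level and explicit: Spark RDDs. Every step is written out.
counts_rdd = (
    sc.textFile('input.txt')                     # placeholder path
      .flatMap(lambda line: line.split())        # one record per word
      .map(lambda word: (word, 1))               # pair each word with a 1
      .reduceByKey(lambda a, b: a + b)           # sum the 1s per word
)
print(counts_rdd.take(5))

# Higher-level and declarative: Spark DataFrames. Say what you want;
# the optimizer decides how to compute it.
lines = spark.read.text('input.txt')             # one column, named 'value'
counts_df = (
    lines.select(F.explode(F.split(lines['value'], r'\s+')).alias('word'))
         .groupBy('word')
         .count()
)
counts_df.show(5)

spark.stop()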