CMPT 732, Fall 2023
Central/master/core stuff:
These can be on one computer or separated. Whether to separate them (and the hardware choices) will depend on the number of nodes in the cluster.
Important: NameNode is a single point of failure, unless configured with high availability.
Worker nodes:
As an example, for our cluster:
gateway: you log in and start jobs. No Hadoop infrastructure running here.
master: YARN ResourceManager, HDFS NameNode, ….
Config files: /etc/hadoop/conf/, /home/envmodules/lib/spark-*/conf/.
Are they perfect examples of Hadoop configuration? Probably not, but they work.
In theory: any collection of computers.
In practice: there's no point having slow processors, minimal memory, or a slow network.
In reality: probably a cloud provider like Amazon Web Services (EMR or EC2) or Google Cloud (Hadoop on GCP) or Azure (Azure HDInsight).
Depends on the tasks you're doing, obviously.
Lesson from assignment 10: it's easy to add worker nodes (or replace them with faster ones) if necessary.
In general: bigger nodes are better, until cost becomes prohibitive.
Fewer large nodes means less network usage, a better ability to run less-parallel tasks, and more threads per process (sharing working sets).
e.g. on our cluster, if you want to start workers that need 100GB of memory and 17 cores, you're out of luck.
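A sketch of how executor sizes like that would be requested in PySpark; the numbers just echo the example above, and each executor has to fit within a single worker node's memory and cores for YARN to ever schedule it.

    # Requesting large executors; the values here are illustrative only.
    # A request bigger than any one worker node can never be scheduled.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .appName('big executors') \
        .config('spark.executor.memory', '100g') \
        .config('spark.executor.cores', '17') \
        .getOrCreate()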
But note: EC2 instance prices are almost linear in compute power: a c5.18xlarge has 18 times the CPU and memory of a c5.xlarge, and costs almost exactly 18 times as much.
Maybe the easiest thing is one huge node, with Spark running locally on it, and using S3 for input and output.
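A minimal sketch of that setup: Spark in local mode on the one big node, reading and writing S3. The bucket names and the 'event_type' column are made up for illustration, and it assumes the hadoop-aws (s3a) connector and AWS credentials are already configured.

    # Spark running locally on one large machine, using S3 for input and output.
    # Bucket names and the 'event_type' column are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .master('local[*]') \
        .appName('one big node') \
        .getOrCreate()

    events = spark.read.json('s3a://some-input-bucket/events/')
    counts = events.groupBy('event_type').count()
    counts.write.parquet('s3a://some-output-bucket/event-counts/', mode='overwrite')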
Or have a small number of permanent nodes, but add spot instances when they are cheap.
There's a lot of stuff to install: why not let somebody else worry about some of the details?