CMPT 732, Fall 2021
We have had a decent look at HDFS, MapReduce, Cassandra, and Spark tools.
That's a drop in the bucket. See:
We can't hope to really see a significant fraction of these in a semester (or in a degree, or even in a career). What follows is a summary of some more.
Proposed take-away messages:
Various technologies, organized by category, with some explanation of what they are for. Trying to balance a reasonable number of technologies with completeness.
The goal: get the compute work to some computer with processor/memory to do it, and get the results back.
Or any other way you can get compute work done…
The goal: Describe your computation in a way that can be run on a cluster.
The goal: Take lots of data from many sources, and do reporting/analysis on it.
The goal: Store files or file-like things in a distributed way.
The goal: store records and access/update them quickly. I don't need SQL/relations.
The goal: store records and access/update them quickly. I want SQL and/or relations.
The goal: data ↔ bits, for memory/disk/network.
The goal: deal with a continuously-arriving stream of data.
The goal: use machine learning algorithms (at scale) without having to implement them.
The goal: take the data you worked so hard to get, and let people understand it and interact with it.
The goal: Extract data from the source(s); transform it into the format we want; load it into the database/data warehouse.
The goal: pass messages between nodes/processes and have somebody else worry about reliability, queues, who will send/receive, etc.
All designed to scale out and handle high volume.
Or other interactions with the queues. Freely switch languages between publisher/consumer too.
These things are fast: RabbitMQ Hits One Million Messages Per Second.
Realistic streaming scenario: Spark streaming takes in the data stream, filters/processes minimally, and puts each record into a queue for more processing. Then many nodes subscribe to the queue and handle the data out of it.
Or without Hadoop, just generate a bunch of work that needs to be done, queue it all, then start consumer processes on each computer you have.
Either way: you can move around the bottleneck (and hopefully then fix it).
Message passing example with RabbitMQ:
Let's try it…
window1> python3 rabbit-receiver.py
window2> python3 rabbit-receiver.py
window3> python3 rabbit-source.py
window4> ruby rabbit-source.rb
# kill/restart some and see what happens
Message passing example with Kafka:
Let's try it…
window1> python3 kafka-consumer.py
window2> python3 kafka-consumer.py
window3> python3 kafka-producer.py
The goal: get some work on a distributed queue. Maybe wait for results, or maybe don't.
With a task queue, you get to just call a function (maybe with slightly different syntax). You can then retrieve the result (or just move on and let the work happen later).
Where the work happened is transparent.
A task with Celery: tasks.py.
Let's try it…
window1> celery -A tasks worker --loglevel=info --hostname=worker1@%h
window2> celery -A tasks worker --loglevel=info --hostname=worker2@%h
window3> ipython3

from tasks import add
result = add.delay(4, 4)
result.get(timeout=1)
Need a lot of work done without Hadoop? Run task queue workers on many nodes; make all the asynchronous calls you want; let the workers handle it.
Need nightly batch analysis done? Have a scheduled task start a Spark task.
Have a spike in usage? Let tasks queue up and process them as capacity allows. Or add more workers.
The goal: index lots of data so you (or your users) can search for records they want.
All of these are designed to scale out across many nodes.
Indexing and searching with Elasticsearch:
python3 elastic-index.py
python3 elastic-search.py
curl -XGET 'http://localhost:9200/comments/_search?q=comment' \
    | python3 -m json.tool