Other Big Data Tools

CMPT 732, Fall 2021

A Look Back

We have had a decent look at HDFS, MapReduce, Cassandra, and Spark tools.

That's a drop in the bucket. See:

What Else Is There?

We can't hope to really see a significant fraction of these in a semester (or in a degree, or even in a career). What follows is a summary of some more.

Proposed take-away messages:

  • There are tools for big data outside the Hadoop universe.
  • The big data world is evolving extremely fast.
  • There are many things out there, and you should know enough about them to think “I'm pretty sure I heard about a tool for that already…”.

The Plan…

Various technologies, organized by category, with some explanation of what they are for. Trying to balance keeping the list to a reasonable length against completeness.

Part 1: We Have Seen An Example

Doing Computation

The goal: get the compute work to some computer with processor/​memory to do it, and get the results back.

  • Apache YARN: Hadoop's resource manager.
  • Amazon EMR: EC2 + Hadoop + automatic setup.
  • Mesos: resource manager more closely aligned with Spark.

Doing Computation

Or any other way you can get compute work done…

Expressing Computation

The goal: Describe your computation in a way that can be run on a cluster.

Expressing Computation

  • Hive: take varied data and do SQL-like queries.
  • Pig: high-level language to analyze data. Produces MapReduce jobs.
  • Or plain programming: write your own distributed system.

Data Warehousing

The goal: Take lots of data from many sources, and do reporting/​analysis on it.

Storing Files

The goal: Store files or file-like things in a distributed way.

Databases

The goal: store records and access/update them quickly. I don't need SQL/​relations.

Databases

The goal: store records and access/update them quickly. I want SQL and/or relations.

Serialization/Storage

The goal: data ↔ bits, for memory/​disk/​network.

  • Parquet: efficient columnar storage representation. Supported by Spark, Pandas, Impala. (See the sketch after this list.)
  • HDF5: on-disk storage for columnar data.
  • CSV, JSON: well-understood interchange formats.
  • Arrow: in-memory representation for fast processing. Available in Spark 2.3+.
  • Delta Lake: a storage layer on top of Parquet that adds ACID transactions.
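
For example, a minimal round trip through Parquet from pandas (a sketch: it assumes the pandas and pyarrow packages are installed, and the column and file names are just for illustration):

import pandas as pd

# build a small DataFrame and write it as a compressed, columnar Parquet file
df = pd.DataFrame({'lang': ['python', 'scala', 'ruby'], 'count': [3, 1, 1]})
df.to_parquet('counts.parquet')

# read it back: column names and types survive the round trip
same_df = pd.read_parquet('counts.parquet')
print(same_df)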

Streaming

The goal: deal with a continuously-arriving stream of data.

ML Libraries

The goal: use machine learning algorithms (at scale) without having to implement them.

Or at smaller scale, scikit-learn, PyTorch, etc.
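
At the small scale, the pattern looks like this with scikit-learn (a sketch on a built-in toy dataset; the point is that you call fit/predict instead of implementing the algorithm yourself):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# load a toy dataset and fit an off-the-shelf classifier
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100)
model.fit(X, y)
print(model.predict(X[:5]))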

Part 2: New (to us) Categories

Visualization

The goal: take the data you worked so hard to get, and let people understand it and interact with it.

Extract-Transform-Load

The goal: Extract data from the source(s); transform it into the format we want; load it into the database/​data warehouse.

Message Queues

The goal: pass messages between nodes/​processes and have somebody else worry about reliability, queues, who will send/receive, etc.

All designed to scale out and handle high volume.

Message Queues

The idea:

  • Some nodes publish messages into a queue.
  • The message queue keeps them queued until they can be processed and, depending on the tool, ensures each message is processed exactly once.
  • Some nodes subscribe to the queue(s) and consume messages.

Other interactions with the queues are possible too. The publisher and consumer can also freely be written in different languages.

Message Queues

These things are fast: RabbitMQ Hits One Million Messages Per Second.

Message Queues

Realistic streaming scenario: Spark streaming takes in the data stream, filters/processes it minimally, and puts each record into a queue for more processing. Then many nodes subscribe to the queue and process the data from it.

Or without Hadoop, just generate a bunch of work that needs to be done, queue it all, then start consumer processes on each computer you have.

Either way: you can move the bottleneck around (and hopefully then fix it).

Message Queues

Message passing example with RabbitMQ:

Let's try it…


window1> python3 rabbit-receiver.py
window2> python3 rabbit-receiver.py
window3> python3 rabbit-source.py
window4> ruby rabbit-source.rb
# kill/restart some and see what happens
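
The source/receiver scripts above might look roughly like this (a sketch using the pika client; the queue name and message contents are assumptions):

# rabbit-source.py (sketch): publish messages into a queue.
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.queue_declare(queue='demo')
for i in range(100):
    channel.basic_publish(exchange='', routing_key='demo', body='message %d' % i)
connection.close()

# rabbit-receiver.py (sketch): consume messages from the same queue.
import pika

def callback(ch, method, properties, body):
    print('received', body)

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.queue_declare(queue='demo')
channel.basic_consume(queue='demo', on_message_callback=callback, auto_ack=True)
channel.start_consuming()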

Message Queues

Message passing example with Kafka:

Let's try it…


window1> python3 kafka-consumer.py
window2> python3 kafka-consumer.py
window3> python3 kafka-producer.py
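
Again, the producer/consumer scripts might look roughly like this (a sketch using the kafka-python client; the topic name, broker address, and group id are assumptions):

# kafka-producer.py (sketch): send messages to a topic.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='localhost:9092')
for i in range(100):
    producer.send('demo-topic', ('message %d' % i).encode())
producer.flush()

# kafka-consumer.py (sketch): read messages from the topic as part of a consumer group.
from kafka import KafkaConsumer

consumer = KafkaConsumer('demo-topic', bootstrap_servers='localhost:9092',
        group_id='demo-group')
for message in consumer:
    print(message.value)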

Task Queues

The goal: put some work onto a distributed queue. Maybe wait for the results, or maybe don't.

Task Queues

With a task queue, you get to just call a function (maybe with slightly different syntax). You can then retrieve the result (or just move on and let the work happen later).

Where the work happened is transparent.

Task Queues

A task with Celery: tasks.py.
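
It might contain something like this (a sketch; the broker URL and result backend are assumptions, and any supported broker would do):

# tasks.py (sketch): a Celery app with one task.
from celery import Celery

app = Celery('tasks', broker='amqp://localhost', backend='rpc://')

@app.task
def add(x, y):
    return x + y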

Let's try it…


window1> celery -A tasks worker --loglevel=info --hostname=worker1@%h
window2> celery -A tasks worker --loglevel=info --hostname=worker2@%h
window3> ipython3

from tasks import add
result = add.delay(4, 4)
result.get(timeout=1)

Task Queues

Need a lot of work done without Hadoop? Run task queue workers on many nodes; make all the asynchronous calls you want; let the workers handle it.

Need nightly batch analysis done? Have a scheduled task start a Spark job.

Have a spike in usage? Let tasks queue up and be processed as capacity allows. Or add more workers.

Text Search

Indexing and searching with Elasticsearch:

Let's try it… (See also the CourSys search, if you're an instructor.)


python3 elastic-index.py
python3 elastic-search.py
curl -XGET 'http://localhost:9200/comments/_search?q=comment' \
  | python3 -m json.tool
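
The indexing and searching scripts might look roughly like this (a sketch with the official Python client against a 7.x cluster; the documents are made up, but the index name matches the curl query above):

# elastic-index.py (sketch): index a few documents.
from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])
comments = ['a first comment', 'another comment', 'something else entirely']
for i, text in enumerate(comments):
    es.index(index='comments', id=i, body={'comment': text})

# elastic-search.py (sketch): full-text query against the same index.
results = es.search(index='comments', body={'query': {'match': {'comment': 'comment'}}})
for hit in results['hits']['hits']:
    print(hit['_score'], hit['_source'])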