# Other Big Data Tools

CMPT 732, Fall 2021

## A Look Back

We have had a decent look at HDFS, MapReduce, Cassandra, and Spark tools.

That's a drop in the bucket compared to what's out there.

## What Else Is There?

We can't hope to really see a significant fraction of these in a semester (or in a degree, or even in a career). What follows is a summary of some more.

Proposed take-away messages:

• There are tools for big data outside the Hadoop universe.
• The big data world is evolving extremely fast.
• There are many things out there, and you should know about them well enough to think “I'm pretty sure I heard about a tool for that already…”.

## The Plan…

Various technologies, organized by category, with some explanation of what they are for. Trying to balance a reasonable number of technologies with completeness.

# Part 1: We Have Seen An Example

## Doing Computation

The goal: get the compute work to some computer with processor/​memory to do it, and get the results back.

• Apache YARN: Hadoop's resource manager.
• Amazon EMR: EC2 + Hadoop + automatic setup.
• Mesos: resource manager more closely aligned with Spark.

## Doing Computation

Or any other way you can get compute work done…

## Expressing Computation

The goal: Describe your computation in a way that can be run on a cluster.

## Expressing Computation

• Hive: take varied data and do SQL-like queries.
• Pig: high-level language to analyze data. Produces MapReduce jobs.
• Programming. Distributed systems.

## Data Warehousing

The goal: Take lots of data from many sources, and do reporting/​analysis on it.

## Storing Files

The goal: Store files or file-like things in a distributed way.

## Databases

The goal: store records and access/update them quickly. I don't need SQL/​relations.

## Databases

The goal: store records and access/update them quickly. I want SQL and/or relations.

## Serialization/Storage

The goal: data ↔ bits, for memory/​disk/​network.

• Parquet: efficient columnar storage representation. Supported by Spark, Pandas, Impala.
• HDF5: on-disk storage for columnar data.
• CSV, JSON: well-understood interchange formats.
• Arrow: in-memory representation for fast processing. Available in Spark 2.3+.
• Delta Lake: table storage layer on top of Parquet, adding ACID transactions and versioning.
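As a small illustration of the “data ↔ bits” idea, here is a round-trip through the two well-understood interchange formats using only Python's standard library (the records are made-up example data):

```python
import csv
import io
import json

# Some records to serialize (hypothetical example data).
records = [{"name": "alice", "score": 10}, {"name": "bob", "score": 7}]

# JSON: data -> bytes -> data. Types survive the round trip.
blob = json.dumps(records).encode("utf-8")
assert json.loads(blob.decode("utf-8")) == records

# CSV: data -> text -> data. CSV is untyped: the integers come
# back as strings unless we convert them ourselves.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "score"])
writer.writeheader()
writer.writerows(records)

rows = list(csv.DictReader(io.StringIO(buf.getvalue())))
assert rows[0] == {"name": "alice", "score": "10"}
```

Formats like Parquet and Arrow address what this illustrates: they carry a schema (so types survive) and store data by column, so readers don't re-parse text record by record.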

## Streaming

The goal: deal with a continuously-arriving stream of data.

## ML Libraries

The goal: use machine learning algorithms (at scale) without having to implement them.

Or at smaller scale, scikit-learn, PyTorch, etc.

# Part 2: New (to us) Categories

## Visualization

The goal: take the data you worked so hard to get, and let people understand it and interact with it.

## ETL

The goal: Extract data from the source(s); transform it into the format we want; load it into the database/​data warehouse.

## Message Queues

The goal: pass messages between nodes/​processes and have somebody else worry about reliability, queues, who will send/receive, etc.

All designed to scale out and handle high volume.

## Message Queues

The idea:

• Some nodes publish messages into a queue.
• The message queue keeps them queued until they can be processed, and ensures each message is processed once (exactly-once vs. at-least-once delivery depends on the tool).
• Some nodes subscribe to the queue(s) and consume messages.

Or other interactions with the queues. Freely switch languages between publisher/consumer too.
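The publish/queue/consume pattern above can be sketched in-process, with Python's thread-safe `queue.Queue` standing in for the broker. (A real message queue like RabbitMQ or Kafka does the same thing across machines, with persistence and delivery guarantees.)

```python
import queue
import threading

broker = queue.Queue()  # stands in for the message queue/broker
results = []

def consumer():
    # Subscribe: block until a message arrives, process it, repeat.
    while True:
        msg = broker.get()
        if msg is None:      # sentinel: shut down this worker
            break
        results.append(msg.upper())
        broker.task_done()

workers = [threading.Thread(target=consumer) for _ in range(2)]
for w in workers:
    w.start()

# Publish: producers just enqueue messages and move on.
for word in ["hello", "queue", "world"]:
    broker.put(word)

broker.join()            # wait until every message has been processed
for _ in workers:
    broker.put(None)     # tell each worker to stop
for w in workers:
    w.join()

assert sorted(results) == ["HELLO", "QUEUE", "WORLD"]
```

Note that neither side knows or cares which worker handled which message: that decoupling is the point.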

## Message Queues

These things are fast: see “RabbitMQ Hits One Million Messages Per Second”.

## Message Queues

Realistic streaming scenario: Spark streaming takes in the data stream, filters/processes minimally, and puts each record into a queue for more processing. Then many nodes subscribe to the queue and handle the data out of it.

Or without Hadoop, just generate a bunch of work that needs to be done, queue it all, then start consumer processes on each computer you have.

Either way: you can move around the bottleneck (and hopefully then fix it).

## Message Queues

Message passing example with RabbitMQ:

Let's try it…


```shell
window3> python3 rabbit-source.py
window4> ruby rabbit-source.rb
# kill/restart some and see what happens
```


## Message Queues

Message passing example with Kafka:

Let's try it…


```shell
window1> python3 kafka-consumer.py
window2> python3 kafka-consumer.py
window3> python3 kafka-producer.py
```


## Task Queues

The goal: get some work on a distributed queue. Maybe wait for results, or maybe don't.

With a task queue, you get to just call a function (maybe with slightly different syntax). You can then retrieve the result (or just move on and let the work happen later).

Where the work happened is transparent.
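The “just call a function” feel can be imitated in-process with the standard library's `concurrent.futures`: `submit()` plays the role of Celery's `delay()`, and the returned future is the result handle. (Celery does the same across machines, through a message broker; the `add` task here is a made-up example.)

```python
from concurrent.futures import ThreadPoolExecutor

def add(x, y):
    # In Celery this would be decorated as a task; here it's a plain function.
    return x + y

with ThreadPoolExecutor(max_workers=2) as pool:
    # Like add.delay(4, 4): returns immediately with a handle, not the answer.
    result = pool.submit(add, 4, 4)
    # ...do other work, then retrieve the result when you need it.
    assert result.result(timeout=1) == 8
```

Whether the work ran on this thread, another thread, or (with a real task queue) another machine is invisible to the caller.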

Let's try it…


```shell
window1> celery -A tasks worker --loglevel=info --hostname=worker1@%h
window2> celery -A tasks worker --loglevel=info --hostname=worker2@%h
window3> ipython3
```


```python
# In the ipython session: fetch the result of an asynchronous task call
# (assuming result came from a task call, e.g. result = sometask.delay(...)).
result.get(timeout=1)
```


Need a lot of work done without Hadoop? Run task queue workers on many nodes; make all the asynchronous calls you want; let the workers handle it.

Need nightly batch analysis done? Have a scheduled task start a Spark task.

Have a spike in usage? Let tasks queue up and process as possible. Or add more workers.

## Text Search

Indexing and searching with Elasticsearch: