# Other Big Data Tools

CMPT 732, Fall 2021

## A Look Back

We have had a decent look at HDFS, MapReduce, Cassandra, and Spark tools.

That's a drop in the bucket compared to what's out there.

## What Else Is There?

We can't hope to really see a significant fraction of these in a semester (or in a degree, or even in a career). What follows is a summary of some more.

Proposed take-away messages:

• There are tools for big data outside the Hadoop universe.
• The big data world is evolving extremely fast.
• There are many things out there, and you should know about them well enough to think “I'm pretty sure I heard about a tool for that already…”.

## The Plan…

Various technologies, organized by category, with some explanation of what they are for. Trying to balance a reasonable number of technologies with completeness.

# Part 1: We Have Seen An Example

## Doing Computation

The goal: get the compute work to some computer with processor/​memory to do it, and get the results back.

• Apache YARN: Hadoop's resource manager.
• Amazon EMR: EC2 + Hadoop + automatic setup.
• Mesos: resource manager more closely aligned with Spark.

## Doing Computation

Or any other way you can get compute work done…

## Expressing Computation

The goal: Describe your computation in a way that can be run on a cluster.

## Expressing Computation

• Hive: take varied data and do SQL-like queries.
• Pig: high-level language to analyze data. Produces MapReduce jobs.
• Programming. Distributed systems.

## Data Warehousing

The goal: Take lots of data from many sources, and do reporting/​analysis on it.

## Storing Files

The goal: Store files or file-like things in a distributed way.

## Databases

The goal: store records and access/update them quickly. I don't need SQL/​relations.

## Databases

The goal: store records and access/update them quickly. I want SQL and/or relations.

## Serialization/Storage

The goal: data ↔ bits, for memory/​disk/​network.

• Parquet: efficient columnar storage representation. Supported by Spark, Pandas, Impala.
• HDF5: on-disk storage for columnar data.
• CSV, JSON: well-understood interchange formats.
• Arrow: in-memory representation for fast processing. Available in Spark 2.3+.
• Delta Lake: table storage layer on top of Parquet, adding ACID transactions and versioning.
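As a small illustration of the “data ↔ bits” idea, here is a round-trip through the two well-understood interchange formats using only Python's standard library (the records are made-up example data):

```python
import csv
import io
import json

# Some records to serialize (hypothetical example data).
records = [{"name": "alice", "score": 10}, {"name": "bob", "score": 7}]

# JSON: data -> bytes -> data. Types survive the round trip.
blob = json.dumps(records).encode("utf-8")
assert json.loads(blob.decode("utf-8")) == records

# CSV: data -> text -> data. CSV is untyped: the integers come
# back as strings unless we convert them ourselves.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "score"])
writer.writeheader()
writer.writerows(records)

rows = list(csv.DictReader(io.StringIO(buf.getvalue())))
assert rows[0] == {"name": "alice", "score": "10"}
```

Formats like Parquet and Arrow address what this illustrates: they carry a schema (so types survive) and store data by column, so readers don't re-parse text record by record.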

## Streaming

The goal: deal with a continuously-arriving stream of data.

## ML Libraries

The goal: use machine learning algorithms (at scale) without having to implement them.

Or at smaller scale, scikit-learn, PyTorch, etc.

# Part 2: New (to us) Categories

## Visualization

The goal: take the data you worked so hard to get, and let people understand it and interact with it.

## ETL

The goal: Extract data from the source(s); transform it into the format we want; load it into the database/​data warehouse.

## Message Queues

The goal: pass messages between nodes/​processes and have somebody else worry about reliability, queues, who will send/receive, etc.

All designed to scale out and handle high volume.

## Message Queues

The idea:

• Some nodes publish messages into a queue.
• The message queue keeps them queued until they can be processed, and ensures each message is processed once (exactly-once vs. at-least-once delivery depends on the tool).
• Some nodes subscribe to the queue(s) and consume messages.

Or other interactions with the queues. Freely switch languages between publisher/consumer too.
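The publish/queue/consume pattern above can be sketched in-process, with Python's thread-safe `queue.Queue` standing in for the broker. (A real message queue like RabbitMQ or Kafka does the same thing across machines, with persistence and delivery guarantees.)

```python
import queue
import threading

broker = queue.Queue()  # stands in for the message queue/broker
results = []

def consumer():
    # Subscribe: block until a message arrives, process it, repeat.
    while True:
        msg = broker.get()
        if msg is None:      # sentinel: shut down this worker
            break
        results.append(msg.upper())
        broker.task_done()

workers = [threading.Thread(target=consumer) for _ in range(2)]
for w in workers:
    w.start()

# Publish: producers just enqueue messages and move on.
for word in ["hello", "queue", "world"]:
    broker.put(word)

broker.join()            # wait until every message has been processed
for _ in workers:
    broker.put(None)     # tell each worker to stop
for w in workers:
    w.join()

assert sorted(results) == ["HELLO", "QUEUE", "WORLD"]
```

Note that neither side knows or cares which worker handled which message: that decoupling is the point.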

## Message Queues

These things are fast: see “RabbitMQ Hits One Million Messages Per Second”.

## Message Queues

Realistic streaming scenario: Spark streaming takes in the data stream, filters/processes minimally, and puts each record into a queue for more processing. Then many nodes subscribe to the queue and handle the data out of it.

Or without Hadoop, just generate a bunch of work that needs to be done, queue it all, then start consumer processes on each computer you have.

Either way: you can move around the bottleneck (and hopefully then fix it).

## Message Queues

Message passing example with RabbitMQ:

Let's try it…


```shell
window3> python3 rabbit-source.py
window4> ruby rabbit-source.rb
# kill/restart some and see what happens
```


## Message Queues

Message passing example with Kafka:

Let's try it…


```shell
window1> python3 kafka-consumer.py
window2> python3 kafka-consumer.py
window3> python3 kafka-producer.py
```


## Task Queues

The goal: get some work on a distributed queue. Maybe wait for results, or maybe don't.

With a task queue, you get to just call a function (maybe with slightly different syntax). You can then retrieve the result (or just move on and let the work happen later).

Where the work happened is transparent.
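The “just call a function” feel can be imitated in-process with the standard library's `concurrent.futures`: `submit()` plays the role of Celery's `delay()`, and the returned future is the result handle. (Celery does the same across machines, through a message broker; the `add` task here is a made-up example.)

```python
from concurrent.futures import ThreadPoolExecutor

def add(x, y):
    # In Celery this would be decorated as a task; here it's a plain function.
    return x + y

with ThreadPoolExecutor(max_workers=2) as pool:
    # Like add.delay(4, 4): returns immediately with a handle, not the answer.
    result = pool.submit(add, 4, 4)
    # ...do other work, then retrieve the result when you need it.
    assert result.result(timeout=1) == 8
```

Whether the work ran on this thread, another thread, or (with a real task queue) another machine is invisible to the caller.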

Let's try it…


```shell
window1> celery -A tasks worker --loglevel=info --hostname=worker1@%h
window2> celery -A tasks worker --loglevel=info --hostname=worker2@%h
window3> ipython3
```


```python
# In the ipython session: fetch the result of an asynchronous task call
# (assuming result came from a task call, e.g. result = sometask.delay(...)).
result.get(timeout=1)
```


Need a lot of work done without Hadoop? Run task queue workers on many nodes; make all the asynchronous calls you want; let the workers handle it.

Need nightly batch analysis done? Have a scheduled task start a Spark task.

Have a spike in usage? Let tasks queue up and process as possible. Or add more workers.

## Text Search

Indexing and searching with Elasticsearch: