Dask & Ray

CMPT 732, Fall 2019

Python Parallel Computing Trends

  • Celery - Distributed task queue
    • Results flow back to central authority
  • Dask
    • Pandas, Sklearn on multiple cores or cluster
    • Workers hold state and can communicate with each other
    • Central scheduler
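
A minimal sketch of how Dask builds a task graph and hands it to a scheduler. It assumes the default local threaded scheduler rather than a cluster, and the helper name square is illustrative only.

```python
from dask import delayed


def square(x):
    return x * x


# Wrapping calls in delayed() only records them in a task graph; nothing runs yet.
squares = [delayed(square)(i) for i in range(4)]
total = delayed(sum)(squares)

# compute() hands the graph to the scheduler, which can run the
# independent square() tasks in parallel before summing them.
result = total.compute()
print(result)
```

On a cluster the same graph would typically be submitted through a dask.distributed Client, where the central scheduler assigns tasks to workers.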

Python Parallel Computing Trends

  • Ray
    • Functions and Classes become Tasks and Actors
    • An Actor's own scheduler runs on some worker
  • Modin
    • Pandas for Ray

Dask

Docs: https://docs.dask.org/en/latest/

Ray

What is the difference?

  • Ray Actors
    • share mutable state (such as ANN activations)
    • can avoid expensive initializations
  • distributed bottom-up scheduling
    • workers submit tasks to local schedulers, which then assign tasks to workers
    • improved task latency and throughput

What is the difference?

  • Ray: the state of the system is kept in a sharded database
  • Dask has high-level collections: dataframes, distributed arrays, etc.

[Ref: Devin Petersohn]
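
A minimal sketch of one of Dask's high-level collections, a chunked array. The shape and chunk size are arbitrary choices for illustration.

```python
import dask.array as da

# A 1000 x 1000 array of ones, split into 250 x 250 blocks.
x = da.ones((1000, 1000), chunks=(250, 250))

# NumPy-style expressions build a lazy graph over the blocks.
total = (x + x.T).sum()

# compute() evaluates the graph block by block.
value = total.compute()
print(value)
```

Each block is an ordinary NumPy array, so the collection scales by adding blocks rather than by changing the code.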

Modin

Lab Setup

  • Link your scratch space into your home directory:
    ln -s /usr/shared/CMPT/scratch/<username> ~/scratch
  • Create a virtual environment:
    python -m venv ~/scratch/daskray
  • Activate: source ~/scratch/daskray/bin/activate
  • Install in environment: pip install ...
    e.g. dask, ray, modin, jupyterlab, etc.

Remote lab access

  • ssh -p 24 asb10928-<seatID>.csil.sfu.ca
  • Can use ngrok to make notebook available remotely
  • For that, generate a Jupyter config with allow_remote_access set to True
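
A sketch of the relevant config line, assuming the classic Notebook server and the usual default config path. Newer Jupyter Server releases use c.ServerApp.allow_remote_access instead.

```python
# ~/.jupyter/jupyter_notebook_config.py
# (typically created with: jupyter notebook --generate-config)
c = get_config()  # injected by Jupyter when it loads this file

# Accept requests whose Host header is not localhost, e.g. through a tunnel.
c.NotebookApp.allow_remote_access = True
```

Exposing a notebook remotely should be combined with a token or password.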