CMPT 353 Lecture Notes

  1. Course Introduction [“Course Introduction” slides]
    1. This Course [This Course slides]
    2. Offering Strategy [Offering Strategy slides]
    3. Grades [Grades slides]
    4. Exercises [Exercises slides]
    5. Project [Project slides]
    6. Quizzes/Exam [Quizzes/Exam slides]
    7. Us [Us slides]
    8. Lectures and Labs [Lectures and Labs slides]
    9. References [References slides]
    10. Programming [Programming slides]
    11. Expectations [Expectations slides]
    12. Computational Data Science? [Computational Data Science? slides]
    13. Data Science? [Data Science? slides]
    14. Why Data Science? [Why Data Science? slides]
    15. Topics (1) [Topics (1) slides]
  2. Data Analysis Pipeline [“Data Analysis Pipeline” slides]
    1. Your Question [Your Question slides]
    2. Getting Data [Getting Data slides]
    3. Preparing Data [Preparing Data slides]
    4. Analyzing Data [Analyzing Data slides]
    5. Presenting Results [Presenting Results slides]
    6. Creating a Pipeline [Creating a Pipeline slides]
    7. Manual Pipeline Steps [Manual Pipeline Steps slides]
    8. The Pipeline [The Pipeline slides]
  3. Data In Python [“Data In Python” slides]
    1. Built-In Data Structures [Built-In Data Structures slides]
    2. NumPy [NumPy slides]
    3. Operating on Arrays [Operating on Arrays slides]
    4. Pandas [Pandas slides]
    5. Working With Pandas [Working With Pandas slides]
  4. Getting Data [“Getting Data” slides]
    1. Where Data Comes From [Where Data Comes From slides]
    2. Data from Files [Data from Files slides]
    3. Databases [Databases slides]
    4. Web APIs [Web APIs slides]
    5. Scraping HTML [Scraping HTML slides]
    6. File Formats [File Formats slides]
    7. CSV [CSV slides]
    8. JSON [JSON slides]
    9. XML [XML slides]
    10. Others [Others slides]
  5. Extract-Transform-Load [“Extract-Transform-Load” slides]
    1. Extract [Extract slides]
    2. Transform [Transform slides]
    3. Load [Load slides]
    4. Summary [Summary slides]
  6. Noise Filtering [“Noise Filtering” slides]
    1. Noise [Noise slides]
    2. LOESS Smoothing [LOESS Smoothing slides]
    3. LOESS in Python [LOESS in Python slides]
    4. Kalman Filtering [Kalman Filtering slides]
    5. Probability Distributions [Probability Distributions slides]
    6. Kalman Operation [Kalman Operation slides]
    7. Kalman Predictions [Kalman Predictions slides]
    8. Kalman Variances [Kalman Variances slides]
    9. pykalman [pykalman slides]
    10. Kalman Example [Kalman Example slides]
    11. Kalman Parameters [Kalman Parameters slides]
    12. Kalman Summary [Kalman Summary slides]
    13. Kalman Links [Kalman Links slides]
    14. Other Filtering [Other Filtering slides]
  7. Cleaning Data [“Cleaning Data” slides]
    1. Validity [Validity slides]
    2. Outliers [Outliers slides]
    3. Finding Outliers [Finding Outliers slides]
    4. Handling Outliers [Handling Outliers slides]
    5. Imputation [Imputation slides]
    6. Noise Filtering [Noise Filtering slides]
    7. Entity Resolution [Entity Resolution slides]
    8. Regular Expressions [Regular Expressions slides]
    9. Python re [Python re slides]
    10. Regex Summary [Regex Summary slides]
  8. Stats Review [“Stats Review” slides]
    1. Context [Context slides]
    2. Types of Data [Types of Data slides]
    3. Population and Samples [Population and Samples slides]
    4. Probability Distributions [Probability Distributions slides]
    5. Central Tendency [Central Tendency slides]
    6. Dispersion [Dispersion slides]
    7. Relationships [Relationships slides]
    8. Plotting Data [Plotting Data slides]
    9. Specific Distributions [Specific Distributions slides]
    10. Normal Distribution [Normal Distribution slides]
  9. Inferential Stats [“Inferential Stats” slides]
    1. Hypotheses [Hypotheses slides]
    2. T-Test [T-Test slides]
    3. p-values [p-values slides]
    4. Failure to Reject [Failure to Reject slides]
    5. Test Assumptions [Test Assumptions slides]
    6. Testing Normality [Testing Normality slides]
    7. Equal Variance Test [Equal Variance Test slides]
    8. Transforming Data [Transforming Data slides]
  10. Statistical Tests [“Statistical Tests” slides]
    1. Multiple Groups [Multiple Groups slides]
    2. ANOVA [ANOVA slides]
    3. Post Hoc Analysis [Post Hoc Analysis slides]
    4. One- vs Two-Tailed Tests [One- vs Two-Tailed Tests slides]
    5. Hacking p-values [Hacking p-values slides]
    6. Central Limit Theorem [Central Limit Theorem slides]
    7. It's Probably Okay [It's Probably Okay slides]
    8. Mann–Whitney U-test [Mann–Whitney U-test slides]
    9. Chi-Square [Chi-Square slides]
    10. Regression [Regression slides]
    11. Stats Summary [Stats Summary slides]
  11. Machine Learning [“Machine Learning” slides]
    1. What is ML? [What is ML? slides]
    2. Linear Regression [Linear Regression slides]
    3. The Intercept [The Intercept slides]
    4. Polynomial Regression [Polynomial Regression slides]
    5. ML Pipelines [ML Pipelines slides]
    6. Training and Validation [Training and Validation slides]
  12. ML: Classification [“ML: Classification” slides]
    1. Naïve Bayes [Naïve Bayes slides]
    2. Bayesian Classifier [Bayesian Classifier slides]
    3. Checking the Classifier [Checking the Classifier slides]
    4. Bayesian Priors [Bayesian Priors slides]
    5. Bayesian Failures [Bayesian Failures slides]
    6. Nearest Neighbours [Nearest Neighbours slides]
    7. More Than Points [More Than Points slides]
    8. Feature Scaling [Feature Scaling slides]
    9. Feature Engineering [Feature Engineering slides]
    10. Decision Trees [Decision Trees slides]
    11. Decisions [Decisions slides]
    12. Limiting the Tree [Limiting the Tree slides]
    13. Ensembles [Ensembles slides]
    14. Random Forests [Random Forests slides]
    15. Boosting [Boosting slides]
    16. Higher Dimensions [Higher Dimensions slides]
    17. PCA [PCA slides]
    18. Imbalanced Data [Imbalanced Data slides]
    19. Motivating Neural Nets [Motivating Neural Nets slides]
    20. Perceptrons [Perceptrons slides]
    21. Neural Networks [Neural Networks slides]
    22. Deep Learning [Deep Learning slides]
  13. ML: Other Techniques [“ML: Other Techniques” slides]
    1. More Regression [More Regression slides]
    2. Clustering [Clustering slides]
    3. Clustering Colours [Clustering Colours slides]
    4. Anomaly Detection [Anomaly Detection slides]
    5. When Machine Learning? [When Machine Learning? slides]
  14. Big Data and Spark [“Big Data and Spark” slides]
    1. Big Data [Big Data slides]
    2. Compute Clusters [Compute Clusters slides]
    3. Hadoop [Hadoop slides]
    4. Small-Data Spark [Small-Data Spark slides]
    5. First Spark Program [First Spark Program slides]
    6. Spark DataFrames [Spark DataFrames slides]
    7. Inspecting DataFrames [Inspecting DataFrames slides]
    8. Operating on DataFrames [Operating on DataFrames slides]
    9. DataFrames are Partitioned [DataFrames are Partitioned slides]
    10. Spark Input & Output [Spark Input & Output slides]
    11. Hadoop + Spark [Hadoop + Spark slides]
    12. HDFS [HDFS slides]
    13. YARN [YARN slides]
  15. How Spark Calculates [“How Spark Calculates” slides]
    1. Controlling Partitions [Controlling Partitions slides]
    2. Shuffle Operations [Shuffle Operations slides]
    3. Grouping Data [Grouping Data slides]
    4. Execution Plans [Execution Plans slides]
    5. Lazy Evaluation [Lazy Evaluation slides]
    6. Too Lazy [Too Lazy slides]
    7. Caching [Caching slides]
    8. Spark Optimizer [Spark Optimizer slides]
    9. Spark DAG [Spark DAG slides]
    10. Spark Join [Spark Join slides]
  16. Working With Spark [“Working With Spark” slides]
    1. Moving Data [Moving Data slides]
    2. Column Expressions [Column Expressions slides]
    3. Column Functions [Column Functions slides]
    4. Who Calculates? [Who Calculates? slides]
    5. User-Defined Functions [User-Defined Functions slides]
    6. SQL? [SQL? slides]
    7. RDDs [RDDs slides]
    8. Row-Oriented Data [Row-Oriented Data slides]
    9. Spark ↔ Python [Spark ↔ Python slides]
    10. More Spark I/O [More Spark I/O slides]
    11. Big Data is annoying. [Big Data is annoying. slides]
    12. Being Less Annoying [Being Less Annoying slides]
    13. How To Big Data? [How To Big Data? slides]
    14. When to Big Data? [When to Big Data? slides]
    15. More Big Data? [More Big Data? slides]
  17. Other DataFrame Tools [“Other DataFrame Tools” slides]
    1. Pandas [Pandas slides]
    2. Spark [Spark slides]
    3. Polars [Polars slides]
    4. DuckDB [DuckDB slides]
    5. Dask [Dask slides]
    6. Rapids' cuDF [Rapids' cuDF slides]
    7. Summary [Summary slides]
  18. Data Warehouses [“Data Warehouses” slides]
    1. Databases [Databases slides]
    2. Data Warehouse [Data Warehouse slides]
    3. ClickHouse [ClickHouse slides]
    4. Data Lake [Data Lake slides]
    5. Summary [Summary slides]
  19. Aside: NumPy/Pandas Speed [“Aside: NumPy/Pandas Speed” slides]
    1. Why So Slow? [Why So Slow? slides]
    2. NumPy Expression [NumPy Expression slides]
    3. Applying to a Series [Applying to a Series slides]
    4. Vectorizing [Vectorizing slides]
    5. Applying By Row [Applying By Row slides]
    6. Using Python [Using Python slides]
    7. With NumExpr [With NumExpr slides]
    8. Summary [Summary slides]
  20. Communicating [“Communicating” slides]
    1. Asking the Right Question [Asking the Right Question slides]
    2. Communicating Results [Communicating Results slides]
    3. Visualizing Data [Visualizing Data slides]
    4. More Resources [More Resources slides]
  21. More Data Science [“More Data Science” slides]
    1. Other Technologies [Other Technologies slides]
    2. Databases [Databases slides]
    3. Topics We Covered [Topics We Covered slides]
    4. Practical Omissions [Practical Omissions slides]
    5. Other Courses [Other Courses slides]

Course home page.

Schedule, Fall 2024

Week  Deliverables (*)   Lecture Hour  Lecture Date  First Slide  Video Link
1                        1             Sep 4
2     Exer 1             2             Sep 9
                         3             Sep 9
                         4             Sep 11
3     Exer 2             5             Sep 16
                         6             Sep 16
                         7             Sep 18
4     Exer 3             8             Sep 23
                         9             Sep 23
                         10            Sep 25
5     Exer 4             11            Sep 30
                         12            Sep 30
                         13            Oct 2
6     Exer 5             14            Oct 7
                         15            Oct 7
                         16            Oct 9
7     Exer 6, Quiz 1     17            Oct 15
                         18            Oct 15
                         19            Oct 16
8     Exer 7             20            Oct 21
                         21            Oct 21
                         22            Oct 23
9     Exer 8             23            Oct 28
                         24            Oct 28
                         25            Oct 30
10    Exer 9             26            Nov 4
                         27            Nov 4
                         28            Nov 6
11    Exer 10, Quiz 2    29            Nov 11
                         30            Nov 11
                         31            Nov 13
12    Exer 11            32            Nov 18
                         33            Nov 18
                         34            Nov 20
13    Exer 12            35            Nov 25
                         36            Nov 25
                         37            Nov 27
14+   Quiz 3, Project    38            Dec 2
                         39            Dec 2

* Check CourSys for the actual due dates and times.

Quiz instruction slide.