CMPT 353 Lecture Notes

  1. Course Introduction [“Course Introduction” slides]
    1. This Course [This Course slides]
    2. Computational Data Science? [Computational Data Science? slides]
    3. Data Science? [Data Science? slides]
    4. Why Data Science? [Why Data Science? slides]
    5. Topics (1) [Topics (1) slides]
    6. Grades [Grades slides]
    7. Exercises [Exercises slides]
    8. Project [Project slides]
    9. Quizzes/Exam [Quizzes/Exam slides]
    10. Programming [Programming slides]
    11. Lectures and Labs [Lectures and Labs slides]
    12. References [References slides]
    13. Expectations [Expectations slides]
  2. Data Analysis Pipeline [“Data Analysis Pipeline” slides]
    1. Your Question [Your Question slides]
    2. Getting Data [Getting Data slides]
    3. Preparing Data [Preparing Data slides]
    4. Analyzing Data [Analyzing Data slides]
    5. Presenting Results [Presenting Results slides]
    6. Creating a Pipeline [Creating a Pipeline slides]
    7. Manual Pipeline Steps [Manual Pipeline Steps slides]
    8. The Pipeline [The Pipeline slides]
  3. Data In Python [“Data In Python” slides]
    1. Built-In Data Structures [Built-In Data Structures slides]
    2. NumPy [NumPy slides]
    3. Operating on Arrays [Operating on Arrays slides]
    4. Pandas [Pandas slides]
    5. Working With Pandas [Working With Pandas slides]
  4. Getting Data [“Getting Data” slides]
    1. Where Data Comes From [Where Data Comes From slides]
    2. Data from Files [Data from Files slides]
    3. Databases [Databases slides]
    4. Web APIs [Web APIs slides]
    5. Scraping HTML [Scraping HTML slides]
    6. File Formats [File Formats slides]
    7. CSV [CSV slides]
    8. JSON [JSON slides]
    9. XML [XML slides]
    10. Others [Others slides]
  5. Extract-Transform-Load [“Extract-Transform-Load” slides]
    1. Extract [Extract slides]
    2. Transform [Transform slides]
    3. Load [Load slides]
    4. Summary [Summary slides]
  6. Noise Filtering [“Noise Filtering” slides]
    1. Noise [Noise slides]
    2. LOESS Smoothing [LOESS Smoothing slides]
    3. LOESS in Python [LOESS in Python slides]
    4. Kalman Filtering [Kalman Filtering slides]
    5. Probability Distributions [Probability Distributions slides]
    6. Kalman Operation [Kalman Operation slides]
    7. Kalman Predictions [Kalman Predictions slides]
    8. Kalman Variances [Kalman Variances slides]
    9. pykalman [pykalman slides]
    10. Kalman Example [Kalman Example slides]
    11. Kalman Parameters [Kalman Parameters slides]
    12. Kalman Summary [Kalman Summary slides]
    13. Kalman Links [Kalman Links slides]
    14. Other Filtering [Other Filtering slides]
  7. Cleaning Data [“Cleaning Data” slides]
    1. Validity [Validity slides]
    2. Outliers [Outliers slides]
    3. Finding Outliers [Finding Outliers slides]
    4. Handling Outliers [Handling Outliers slides]
    5. Imputation [Imputation slides]
    6. Noise Filtering [Noise Filtering slides]
    7. Entity Resolution [Entity Resolution slides]
    8. Regular Expressions [Regular Expressions slides]
    9. Python re [Python re slides]
    10. Regex Summary [Regex Summary slides]
  8. Stats Review [“Stats Review” slides]
    1. Context [Context slides]
    2. Types of Data [Types of Data slides]
    3. Population and Samples [Population and Samples slides]
    4. Probability Distributions [Probability Distributions slides]
    5. Central Tendency [Central Tendency slides]
    6. Dispersion [Dispersion slides]
    7. Relationships [Relationships slides]
    8. Plotting Data [Plotting Data slides]
    9. Specific Distributions [Specific Distributions slides]
    10. Normal Distribution [Normal Distribution slides]
  9. Inferential Stats [“Inferential Stats” slides]
    1. Hypotheses [Hypotheses slides]
    2. T-Test [T-Test slides]
    3. p-values [p-values slides]
    4. Failure to Reject [Failure to Reject slides]
    5. Test Assumptions [Test Assumptions slides]
    6. Testing Normality [Testing Normality slides]
    7. Equal Variance Test [Equal Variance Test slides]
    8. Transforming Data [Transforming Data slides]
  10. Statistical Tests [“Statistical Tests” slides]
    1. Multiple Groups [Multiple Groups slides]
    2. ANOVA [ANOVA slides]
    3. Post Hoc Analysis [Post Hoc Analysis slides]
    4. One- vs Two-Tailed Tests [One- vs Two-Tailed Tests slides]
    5. Hacking p-values [Hacking p-values slides]
    6. Central Limit Theorem [Central Limit Theorem slides]
    7. It's Probably Okay [It's Probably Okay slides]
    8. Mann–Whitney U-test [Mann–Whitney U-test slides]
    9. Chi-Square [Chi-Square slides]
    10. Regression [Regression slides]
    11. Stats Summary [Stats Summary slides]
  11. Machine Learning [“Machine Learning” slides]
    1. What is ML? [What is ML? slides]
    2. Linear Regression [Linear Regression slides]
    3. The Intercept [The Intercept slides]
    4. Polynomial Regression [Polynomial Regression slides]
    5. ML Pipelines [ML Pipelines slides]
    6. Training and Validation [Training and Validation slides]
  12. ML: Classification [“ML: Classification” slides]
    1. Naïve Bayes [Naïve Bayes slides]
    2. Bayesian Classifier [Bayesian Classifier slides]
    3. Checking the Classifier [Checking the Classifier slides]
    4. Bayesian Priors [Bayesian Priors slides]
    5. Bayesian Failures [Bayesian Failures slides]
    6. Nearest Neighbours [Nearest Neighbours slides]
    7. More Than Points [More Than Points slides]
    8. Feature Scaling [Feature Scaling slides]
    9. Feature Engineering [Feature Engineering slides]
    10. Decision Trees [Decision Trees slides]
    11. Decisions [Decisions slides]
    12. Limiting the Tree [Limiting the Tree slides]
    13. Ensembles [Ensembles slides]
    14. Random Forests [Random Forests slides]
    15. Boosting [Boosting slides]
    16. Higher Dimensions [Higher Dimensions slides]
    17. PCA [PCA slides]
    18. Motivating Neural Nets [Motivating Neural Nets slides]
    19. Perceptrons [Perceptrons slides]
    20. Neural Networks [Neural Networks slides]
    21. Deep Learning [Deep Learning slides]
  13. ML: Other Techniques [“ML: Other Techniques” slides]
    1. Clustering [Clustering slides]
    2. Clustering Colours [Clustering Colours slides]
    3. Anomaly Detection [Anomaly Detection slides]
    4. More Regression [More Regression slides]
    5. When Machine Learning? [When Machine Learning? slides]
  14. Big Data and Spark [“Big Data and Spark” slides]
    1. Big Data [Big Data slides]
    2. Compute Clusters [Compute Clusters slides]
    3. Hadoop [Hadoop slides]
    4. Small-Data Spark [Small-Data Spark slides]
    5. First Spark Program [First Spark Program slides]
    6. Spark DataFrames [Spark DataFrames slides]
    7. Inspecting DataFrames [Inspecting DataFrames slides]
    8. Operating on DataFrames [Operating on DataFrames slides]
    9. DataFrames are Partitioned [DataFrames are Partitioned slides]
    10. Spark Input & Output [Spark Input & Output slides]
    11. Hadoop + Spark [Hadoop + Spark slides]
    12. HDFS [HDFS slides]
    13. YARN [YARN slides]
  15. How Spark Calculates [“How Spark Calculates” slides]
    1. Controlling Partitions [Controlling Partitions slides]
    2. Shuffle Operations [Shuffle Operations slides]
    3. Grouping Data [Grouping Data slides]
    4. Execution Plans [Execution Plans slides]
    5. Lazy Evaluation [Lazy Evaluation slides]
    6. Too Lazy [Too Lazy slides]
    7. Caching [Caching slides]
    8. Spark Optimizer [Spark Optimizer slides]
    9. Spark DAG [Spark DAG slides]
    10. Spark Join [Spark Join slides]
  16. Working With Spark [“Working With Spark” slides]
    1. Moving Data [Moving Data slides]
    2. Column Expressions [Column Expressions slides]
    3. Column Functions [Column Functions slides]
    4. Who Calculates? [Who Calculates? slides]
    5. User-Defined Functions [User-Defined Functions slides]
    6. SQL? [SQL? slides]
    7. RDDs [RDDs slides]
    8. Row-Oriented Data [Row-Oriented Data slides]
    9. Spark ↔ Python [Spark ↔ Python slides]
    10. More Spark I/O [More Spark I/O slides]
    11. Big Data is annoying. [Big Data is annoying. slides]
    12. Being Less Annoying [Being Less Annoying slides]
    13. How To Big Data? [How To Big Data? slides]
    14. When to Big Data? [When to Big Data? slides]
    15. More Big Data? [More Big Data? slides]
  17. Aside: Dask [“Aside: Dask” slides]
    1. Python Data Tools [Python Data Tools slides]
    2. Dask [Dask slides]
    3. Working With Dask [Working With Dask slides]
    4. Scheduling Dask [Scheduling Dask slides]
    5. More Dask Features [More Dask Features slides]
    6. Dask Summary [Dask Summary slides]
  18. Aside: NumPy/Pandas Speed [“Aside: NumPy/Pandas Speed” slides]
    1. Why So Slow? [Why So Slow? slides]
    2. NumPy Expression [NumPy Expression slides]
    3. Applying to a Series [Applying to a Series slides]
    4. Vectorizing [Vectorizing slides]
    5. Applying By Row [Applying By Row slides]
    6. Using Python [Using Python slides]
    7. With NumExpr [With NumExpr slides]
    8. Summary [Summary slides]
    9. Pandas vs Dask [Pandas vs Dask slides]
  19. Communicating [“Communicating” slides]
    1. Asking the Right Question [Asking the Right Question slides]
    2. Communicating Results [Communicating Results slides]
    3. Visualizing Data [Visualizing Data slides]
    4. More Resources [More Resources slides]
  20. More Data Science [“More Data Science” slides]
    1. Other Technologies [Other Technologies slides]
    2. Databases [Databases slides]
    3. Topics We Covered [Topics We Covered slides]
    4. Practical Omissions [Practical Omissions slides]
    5. Other Courses [Other Courses slides]

Course home page.


Week  Lecture Hour  Date     Starting Point
1     1             Sept 3   Intro
1     2, 3          Sept 5   Pipeline: Getting Data
2     4             Sept 10  CSIL time
2     5, 6          Sept 12  File Formats
3     7             Sept 17  CSIL time
3     8, 9          Sept 19  pykalman
4     10            Sept 24  CSIL time
4     11, 12        Sept 26  Specific Distributions
5     13            Oct 1    CSIL time
5     14, 15        Oct 3    Hacking p-values
6     16            Oct 8    Quiz #1, followed by CSIL time
6     17, 18        Oct 10   Machine Learning
7     19            Oct 15   CSIL time
7     20, 21        Oct 17   Bayesian Priors
8     22            Oct 22   CSIL time
8     23, 24        Oct 24   PCA
9     25            Oct 29   CSIL time
9     26, 27        Oct 31   Operating on [Spark] DataFrames
10    28            Nov 5    Quiz #2, followed by CSIL time
10    29, 30        Nov 7    Too Lazy
11    31            Nov 12   CSIL time
11    32, 33        Nov 14   Spark RDDs
12    34            Nov 19   CSIL time
12    35, 36        Nov 21   NumPy/Pandas Speed
13    37            Nov 26   CSIL time
13    38, 39        Nov 28   no lecture