Using Machine Learning Tools

24/2/2023

These are my notes for the course COMP SCI 3317 - Using Machine Learning Tools.

Week 3

Data pipeline
- A sequence of data processing components
- There is a lot of data to manipulate and many data transformations to apply
- Components typically run asynchronously
- Each component is self-contained, which makes the architecture quite robust
- Data pipelines are easy to understand and enable teams to focus on different components
- On the other hand, without proper monitoring a broken component can go unnoticed for a while, so the data gets stale and the overall system’s performance drops
Classifier
- Linear model as a binary classifier
  - Use a linear regression model and apply a threshold to its output
- Stochastic gradient descent classifier
  - Binary classifier using a linear model
  - SGD is a fitting algorithm that iteratively steps along the negative gradient of the loss function, estimated from one training instance (or a small batch) at a time
- Decision tree
  - Iterative splitting
  - Maximize class separation
  - One feature at a time
  - Up to maximum depth
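The two linear classifiers above can be sketched in a few lines of NumPy. This is a minimal illustration, not what sklearn's `SGDClassifier` does internally: the toy data, the log loss, the learning rate and the function names (`sgd_logistic_fit`, `predict_linear`) are all assumptions made for the example.

```python
import numpy as np

# Hypothetical toy data: one centred feature, two classes.
X = np.array([[-2.5], [-1.5], [-0.5], [0.5], [1.5], [2.5]])
y = np.array([0, 0, 0, 1, 1, 1])

# 1) Linear regression + threshold: fit a line to the 0/1 labels by least
#    squares, then call everything above 0.5 "positive".
A = np.column_stack([X[:, 0], np.ones(len(X))])      # columns: feature, bias
coef, *_ = np.linalg.lstsq(A, y.astype(float), rcond=None)
regression_pred = (A @ coef > 0.5).astype(int)

# 2) SGD on a linear model: update the weights after every single instance,
#    stepping against the gradient of the log loss (an assumed loss choice).
def sgd_logistic_fit(X, y, lr=0.1, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):            # visit instances in random order
            z = X[i] @ w + b
            p = 1.0 / (1.0 + np.exp(-z))             # predicted probability of class 1
            grad = p - y[i]                          # d(log loss)/dz for this instance
            w -= lr * grad * X[i]
            b -= lr * grad
    return w, b

def predict_linear(X, w, b):
    # Positive class when the linear score is above zero (probability above 0.5).
    return (X @ w + b > 0.0).astype(int)

w, b = sgd_logistic_fit(X, y)
sgd_pred = predict_linear(X, w, b)
print(regression_pred, sgd_pred)
```

Both approaches end up with a linear decision boundary; they differ in the loss being minimised and in how the parameters are found.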
Inner workings
- Gini impurity measures how mixed the class distribution in a node is (G = 0 for a pure node, which is best)
- Decision boundaries
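A small sketch of the Gini impurity, G = 1 - sum of p_k squared over the classes k, and of how a tree could pick a threshold on one feature by minimising the weighted impurity of the two child nodes. The helper names (`gini_impurity`, `split_impurity`, `best_threshold`) and the toy data are made up for illustration; a real implementation handles many features and recurses up to the maximum depth.

```python
from collections import Counter

def gini_impurity(labels):
    """G = 1 - sum(p_k^2); 0 means the node contains a single class."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)

def best_threshold(values, labels):
    """Try midpoints between sorted distinct values of ONE feature."""
    distinct = sorted(set(values))
    best = (None, gini_impurity(labels))             # (threshold, impurity) of "no split"
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = split_impurity(left, right)
        if score < best[1]:
            best = (t, score)
    return best

feature = [1, 2, 3, 10, 11, 12]
classes = [0, 0, 0, 1, 1, 1]
print(gini_impurity(classes))            # mixed node
print(best_threshold(feature, classes))  # threshold that separates the classes
```

The chosen threshold is the decision boundary for that node: instances at or below it go left, the rest go right.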
Performance measure
- Regression
  - RMSE (L² norm): gives an idea of how much error the system typically makes in its predictions, with a higher weight given to large errors
  - MAE (L¹ norm): also called the average absolute deviation
    - The higher the norm index, the more it focuses on large values and neglects small ones. This is why the RMSE is more sensitive to outliers than MAE
  - Median / Max absolute error
  - R² / correlation coefficient
- Classification
  - Precision / Recall
    - Precision: the fraction of positive predictions that are correct, TP / (TP + FP)
    - Recall: the fraction of actual positives that are detected, TP / (TP + FN)
    - Trade-off
      - Low threshold -> almost everything is predicted positive -> recall approaches 1, precision drops
      - High threshold -> almost everything is predicted negative -> precision usually rises, recall drops
  - Accuracy
    - (TP + TN) / (TP + TN + FP + FN)
    - Accuracy is not reliable with imbalanced classes: a model that always predicts the majority class still scores high
  - F1-score
    - It is the harmonic mean of precision and recall.
    - Whereas the regular mean treats all values equally, the harmonic mean gives much more weight to low values.
    - As a result, the classifier will only get a high F1-score if both recall and precision are high.
    - The F1-score may not be the best metric to use when precision and recall are not equally important
  - ROC curve
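    - Plots the true positive rate (recall) against the false positive rate (FPR) for every threshold
    - The FPR is the ratio of negative instances that are incorrectly classified as positive
    - Prefer the precision/recall curve whenever the positive class is rare or when we care more about the false positives than the false negatives; otherwise use the ROC curve
  - AUC (area under the ROC curve): 1 for a perfect classifier, 0.5 for a purely random one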
  - Confusion matrix
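The metrics above are easy to compute by hand, which helps to see the trade-offs. The sketch below uses only the standard library; the function names (`confusion_counts`, `binary_metrics`, `rmse`, `mae`) and all the numbers are hypothetical examples. Returning 0 for precision when nothing is predicted positive is a convention chosen here, not a definition.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, tn, fp, fn

def binary_metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0   # assumed convention for 0/0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean: dragged down by whichever of the two is low.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Threshold trade-off on hypothetical classifier scores.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 0.7, 0.9]
low = binary_metrics(y_true, [int(s >= 0.0) for s in scores])     # everything positive
high = binary_metrics(y_true, [int(s >= 0.65) for s in scores])   # only confident cases

# Imbalanced classes: always predicting "negative" looks accurate but finds nothing.
always_negative = binary_metrics([0] * 95 + [1] * 5, [0] * 100)

# One large error moves the RMSE more than the MAE.
targets = [0.0, 0.0, 0.0, 0.0]
predictions = [1.0, 1.0, 1.0, 9.0]

print(low, high, always_negative, sep="\n")
print(rmse(targets, predictions), mae(targets, predictions))
```

In practice `sklearn.metrics` provides these measures; the hand-written versions are only meant to make the definitions concrete.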

Week 2

A typical workflow for supervised learning
- Check and clean data
- Choose some candidate models based on data and task
- Split data into training and test sets
- Split training data into (reduced) training and validation sets
- Train candidate models on (reduced) training sets
- Select best model based on validation set errors
- Retrain best model on full training set
- Apply the best model to the test data once, which gives an unbiased estimate of the generalization error
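The workflow above could look roughly like this in sklearn. It is a sketch under assumptions: the data is synthetic (`make_regression`), the two candidate models and the split sizes are arbitrary choices, and RMSE is used as the error measure.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for a checked and cleaned dataset.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Split off the test set, then split the rest into (reduced) training and validation sets.
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.25, random_state=0)

# Candidate models (an arbitrary pair for illustration).
candidates = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(max_depth=3, random_state=0),
}

# Train each candidate on the reduced training set, score it on the validation set.
val_rmse = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_rmse[name] = mean_squared_error(y_val, model.predict(X_val)) ** 0.5

# Select the best candidate, retrain it on the full training set, test it once.
best_name = min(val_rmse, key=val_rmse.get)
best_model = candidates[best_name].fit(X_train_full, y_train_full)
test_rmse = mean_squared_error(y_test, best_model.predict(X_test)) ** 0.5

print(val_rmse, best_name, test_rmse)
```

The test set is touched exactly once, at the very end, so it plays no part in choosing the model.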
Missing data
- If there are missing feature values
  - can remove the affected rows or columns, but this is bad if large amounts of data are missing
  - can replace them with estimated / non-informative values (e.g. the mean or median)
- In sklearn, use imputers (e.g. SimpleImputer)
Scaling data
- Why
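  - Many ML methods rely on distances, and without scaling the features with the largest values would dominate
  - Generally safe to apply, since it rarely hurts
- Using MinMaxScaler: rescales each feature to a fixed range, [0, 1] by default
- Using StandardScaler: subtracts the mean and divides by the standard deviation, so each feature has zero mean and unit variance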
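To make the two scalers concrete, here is a NumPy sketch of the arithmetic they perform with default settings. The helper names and the toy data (rooms and income) are hypothetical, and the code assumes no column is constant, since that would divide by zero. The statistics are computed on the training data only and then reused for new data.

```python
import numpy as np

def fit_min_max(X_train):
    return X_train.min(axis=0), X_train.max(axis=0)

def apply_min_max(X, col_min, col_max):
    # Maps the training minimum to 0 and the training maximum to 1.
    return (X - col_min) / (col_max - col_min)

def fit_standard(X_train):
    return X_train.mean(axis=0), X_train.std(axis=0)

def apply_standard(X, mean, std):
    # Zero mean and unit variance on the training data.
    return (X - mean) / std

# Hypothetical features: number of rooms, yearly income.
X_train = np.array([[2.0, 30000.0], [3.0, 60000.0], [4.0, 90000.0]])
X_new = np.array([[3.0, 45000.0]])

# Without scaling, the distance is almost entirely the income difference.
raw_distance = np.linalg.norm(X_train[0] - X_train[2])

mean, std = fit_standard(X_train)
X_train_std = apply_standard(X_train, mean, std)
X_new_std = apply_standard(X_new, mean, std)       # reuse the training statistics

col_min, col_max = fit_min_max(X_train)
X_train_mm = apply_min_max(X_train, col_min, col_max)
X_new_mm = apply_min_max(X_new, col_min, col_max)

print(raw_distance)
print(X_train_std, X_new_std, sep="\n")
print(X_train_mm, X_new_mm, sep="\n")
```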
Pipelines in sklearn
- Best practice: chain the preprocessing steps and the model together as a Pipeline
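A minimal sklearn `Pipeline` that chains an imputer, a scaler and a model might look like this. The step names, the median strategy, the choice of `LinearRegression` and the tiny dataset are assumptions for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data with one missing value in the second feature.
X = np.array([[1.0, 200.0],
              [2.0, np.nan],
              [3.0, 600.0],
              [4.0, 800.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

model = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),   # fill missing values
    ("scaler", StandardScaler()),                    # zero mean, unit variance
    ("regressor", LinearRegression()),               # final estimator
])

# fit() learns the imputation values and scaling statistics from X,
# then trains the regressor on the transformed data.
model.fit(X, y)

# predict() applies the same learned transformations before predicting.
pred = model.predict(np.array([[5.0, np.nan]]))
print(model.named_steps["imputer"].statistics_, pred)
```

Because every step is fitted inside the pipeline, the same object can be passed to model selection tools without the preprocessing seeing validation data during fitting.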
Linear regression
- When we fit the model, we optimize the parameter vector to minimize an error/loss/cost function over data points
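- For a linear model with squared-error loss, there is an analytic formula for the optimal parameters; otherwise use an iterative method to find them
- The model is linear in each parameter, but not necessarily in the data values (features can be nonlinear functions of the inputs, e.g. x²)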
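As an illustration of the analytic solution, the sketch below fits a quadratic curve with ordinary least squares. The model is linear in its three parameters even though one feature is x². The data is synthetic and the "true" coefficients (1, 2, -0.5) are made up; the closed form used is the normal equation, theta = (AᵀA)⁻¹Aᵀy.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
# Synthetic targets from assumed coefficients plus a little noise.
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=0.1, size=x.shape)

# Design matrix: one column per parameter (bias, x, x^2).
A = np.column_stack([np.ones_like(x), x, x**2])

# Normal equation, solved as a linear system rather than by explicit inversion.
theta_normal = np.linalg.solve(A.T @ A, A.T @ y)

# Same least-squares problem via lstsq, which is numerically more robust.
theta_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print(theta_normal)
print(theta_lstsq)
```

Both calls minimise the same sum of squared errors, so they should agree up to rounding.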
Parameters and hyper-parameters
- Models have parameters that are fit to the data internally
- Hyper-parameters are set before fitting and are not changed during fitting
- The best hyper-parameter values are usually chosen by validation error

Week 1

Machine learning: programming a computer to learn from data how to solve a problem.
Characteristics of machine learning:
- Avoid hand tuning of parameters
- Adapt to new environments and settings (the model may forget things)
- Complex problems with no known solution
- Analysis and mining of large datasets
Types of machine learning (Supervised or not)
- Supervised learning: your training data includes labels or values (the desired result)
  - Classification: predicts discrete labels (integers, set of possible outcomes)
  - Regression: predicts a real number from the features, based on training data that contains the target values
  - Supervised learning algorithms
    - K-nearest-neighbors
    - Linear regression
    - Logistic regression
    - Support Vector Machines (SVMs)
    - Decision Trees and Random Forests
    - Neural networks
- Unsupervised learning: your training data DOES NOT include labels or values
  - Clustering data: market research / Image segmentation
  - Dimension reduction: visualization / preprocessing (discard uninformative features)
  - Anomaly detection: fraud detection / security / pathology in medical data
    - To detect unusual instances based on mostly normal instances during training
  - Association rule learning
    - Dig into large amounts of data and discover interesting relations between attributes
  - Unsupervised learning algorithms
    - Clustering
      - K-Means
      - DBSCAN
      - Hierarchical Cluster Analysis (HCA)
    - Anomaly detection and novelty detection
      - One-class SVM
      - Isolation Forest
- Semi-supervised learning: a lot of unlabelled data and a little bit of labelled data
- Self-supervised learning
  - Self-supervised learning generates a fully labelled dataset from an unlabelled one
  - Example: a model is trained to recover the original image, using masked images as input and the original images as labels
  - It generally targets the same tasks as supervised learning: classification and regression
- Reinforcement learning: algorithm actively collects data, by interacting with the environment (environment provides reward signals, algorithm learns a policy to maximize reward over time)
Batch and online learning
- Batch learning: system cannot learn incrementally (AKA offline learning)
  - Model rot or data drift: a model’s performance tends to decay slowly over time, simply because the world continues to evolve while the model remains unchanged.
- Online learning: system can be trained incrementally, which suits data arriving as a continuous flow; it can also be used for out-of-core learning (datasets too large to fit in memory)
Approaches to machine learning
- Instance-based learning
  - K-nearest-neighbors (KNN): predict from the nearby training data points
- Model-based learning
  - Fit a model to existing data
    - Fit: optimize model parameters
    - Model: linear model, support vector machine, deep neural network
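To show what instance-based learning means in practice, here is a tiny k-nearest-neighbors classifier written with the standard library. There is no fitting step: the training points are simply stored and compared with the query. The function name `knn_predict`, the Euclidean distance, the majority vote and the toy points are choices made for this sketch.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict the majority label among the k training points closest to query."""
    by_distance = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),   # Euclidean distance
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2D training points in two well-separated groups.
train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_y = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_X, train_y, (2, 2)))
print(knn_predict(train_X, train_y, (7, 7)))
```

A model-based approach would instead compress the training data into a few parameters and discard the points afterwards.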
Problems with data
- Quantity of data
- Range or domain of data (Try to cover more)
- Real data contains noise, errors and outliers
  - Detect, discard, fix, fill
- Some features are less useful than others (Try to select and combine features)
  - Feature engineering
    - Feature selection
    - Feature extraction
    - Create new features
Problems with models
- Overfitting: model is too complex for the data, essentially fitting to noise in the data
  - Solve:
    - Try to collect more data or reduce noise (averaging)
    - Choose a less complex model with fewer parameters
    - Regularization
      - “soft” reduction in parameters
      - penalize parameter values that are far from zero
- Underfitting: model is too simple for the data
  - Solve:
    - Feature engineering (feed better features to the model), preprocessing
    - Choose a more powerful model
    - Reduce regularization penalty
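Regularization can be illustrated with ridge regression, which adds a penalty alpha times the squared norm of the parameters to the squared-error loss. The sketch below uses its closed-form solution and leaves out the intercept for brevity; the data, the true coefficients and the alpha values are invented for the example.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Minimise ||X theta - y||^2 + alpha * ||theta||^2 (no intercept, for brevity)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
true_theta = np.array([3.0, -2.0, 0.0, 0.0, 1.0])    # assumed for the demo
y = X @ true_theta + rng.normal(scale=0.5, size=20)

alphas = (0.0, 1.0, 10.0, 100.0)
norms = [float(np.linalg.norm(ridge_fit(X, y, alpha))) for alpha in alphas]

# alpha = 0 is plain least squares; larger alpha pulls the parameters towards zero.
for alpha, norm in zip(alphas, norms):
    print(alpha, norm)
```

Raising alpha makes the model effectively simpler (helps against overfitting); lowering it gives the model more freedom (helps against underfitting).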
Selecting a model
- Pick a few likely candidates
- Select the best model by comparing their estimated generalization errors
Training and test sets
- We split data into
  - Training set: used to fit model parameters
  - Test set: used to measure prediction accuracy after fitting
- Train model on training set, measure prediction error on both training and test sets, but separately
  - High training set error: underfitting
  - Low training set error, high test set error: overfitting
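A quick way to see this diagnosis is to fit polynomials of increasing degree and compare the two errors. The sketch below uses synthetic data (a noisy sine curve) and arbitrary degrees 1, 3 and 9, all assumptions for illustration. The training error can only go down as the degree grows; the test error is the one to watch.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Hypothetical noisy samples of sin(3x) on [-1, 1]."""
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

def rmse(coeffs, x, y):
    return float(np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2)))

train_rmse, test_rmse = {}, {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)    # least-squares polynomial fit
    train_rmse[degree] = rmse(coeffs, x_train, y_train)
    test_rmse[degree] = rmse(coeffs, x_test, y_test)

for degree in train_rmse:
    # High train error: underfitting. Low train error but high test error: overfitting.
    print(degree, train_rmse[degree], test_rmse[degree])
```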
Model validation
- If we repeatedly train and test on the same training / test split, we end up selecting for best performance on this particular split, so the test error is no longer an unbiased estimate.
- We further split our dataset into 3 parts: training set, validation set, test set.
  - Train candidate models on reduced training set, measure error on validation set.
  - Choose model/settings with lowest validation error
  - Estimate generalization error using test set
Cross validation (K-fold CV, leave-one-out CV)
- Select multiple validation sets (test set is fixed) and train multiple times
- Take the model with the lowest mean validation error
- Re-train chosen model on the full training set (training set + validation set)
- Test it once to estimate the generalization error
- Trade-off
  - Better for smaller training sets, not dependent on a single (small) split.
  - More time consuming
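The mechanics of K-fold cross validation can be sketched with NumPy: shuffle the training indices, cut them into K folds, and let each fold act as the validation set once. The helper names (`k_fold_indices`, `cv_rmse`), the polynomial candidates and the synthetic data are assumptions for the example. Setting K equal to the number of samples gives leave-one-out CV.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

def cv_rmse(x, y, degree, k=5):
    """Mean validation RMSE of a polynomial of the given degree over k folds."""
    errors = []
    for train_idx, val_idx in k_fold_indices(len(x), k):
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        residuals = np.polyval(coeffs, x[val_idx]) - y[val_idx]
        errors.append(np.sqrt(np.mean(residuals ** 2)))
    return float(np.mean(errors))

# Hypothetical training data (the test set is kept aside and not used here).
rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 40)
y = x**2 + rng.normal(scale=0.05, size=x.shape)

# Candidate models: polynomial degrees. Pick the lowest mean validation error.
cv_scores = {degree: cv_rmse(x, y, degree) for degree in (1, 2, 6)}
best_degree = min(cv_scores, key=cv_scores.get)

# Re-train the chosen model on the full training set.
final_coeffs = np.polyfit(x, y, best_degree)
print(cv_scores, best_degree)
```

sklearn offers the same idea through `KFold` and `cross_val_score`; the hand-written version only shows what happens to the indices.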
