Using Machine Learning Tools

These are my notes for the course COMP SCI 3317 - Using Machine Learning Tools.


Week 3

  • Data pipeline

    • A sequence of data processing components
    • There is a lot of data to manipulate and many data transformations to apply
    • Components typically run asynchronously
    • Each component is self-contained, which makes the architecture quite robust
    • Data pipelines are easy to understand and enable teams to focus on different components
    • On the other hand, a broken component can go unnoticed so the data gets stale and the overall system’s performance drops
  • Classifier

    • Linear model as a binary classifier
      • Use a linear model for regression and apply a threshold to its output
    • Stochastic gradient descent classifier
      • Binary classifier using a linear model
      • SGD is a fitting algorithm that iteratively steps against the gradient of the loss function, estimated from one instance (or a small batch) at a time
    • Decision tree
      • Iterative splitting
      • Maximize class separation
      • One feature at a time
      • Up to maximum depth
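
The three classifiers above can be sketched roughly as follows. This is a minimal illustration, assuming scikit-learn is installed; the synthetic dataset, the 0.5 threshold and `max_depth=3` are my own choices, not values from the course.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression, SGDClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem (an assumption; any labelled dataset would do).
X, y = make_classification(
    n_samples=400, n_features=4, n_informative=2,
    n_redundant=0, class_sep=2.0, random_state=0,
)

# 1) Linear model as a binary classifier: regress on the 0/1 labels,
#    then threshold the real-valued output (0.5 is an assumed threshold).
lin = LinearRegression().fit(X, y)
lin_pred = (lin.predict(X) >= 0.5).astype(int)

# 2) SGD classifier: a linear model fitted by stochastic gradient descent.
sgd = SGDClassifier(random_state=0).fit(X, y)
sgd_pred = sgd.predict(X)

# 3) Decision tree: splits on one feature at a time, up to max_depth.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_pred = tree.predict(X)

for name, pred in [("linear+threshold", lin_pred),
                   ("sgd", sgd_pred),
                   ("tree", tree_pred)]:
    print(f"{name}: training accuracy = {np.mean(pred == y):.3f}")
```

The printed numbers are training accuracies only, so they say nothing yet about generalization.
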
  • Inner workings

    • The Gini impurity metric measures how mixed the class distribution in a node is: G = 1 - Σ p_k², where p_k is the fraction of class k in the node (G = 0 for a pure node, best)
    • Decision boundaries
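
To make the Gini impurity concrete, here is a small sketch in NumPy. The helper names `gini_impurity` and `split_impurity` are hypothetical (mine, not from any library), and the size-weighted split score is the usual CART-style criterion, which I am assuming is what the course has in mind.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity G = 1 - sum_k p_k^2 of the class labels in one node."""
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def split_impurity(left, right):
    """Size-weighted average impurity of the two children of a split."""
    n = len(left) + len(right)
    return (len(left) * gini_impurity(left)
            + len(right) * gini_impurity(right)) / n

pure = gini_impurity([1, 1, 1, 1])              # one class only
mixed = gini_impurity([0, 0, 1, 1])             # 50/50 mix of two classes
perfect_split = split_impurity([0, 0], [1, 1])  # both children are pure
print(pure, mixed, perfect_split)
```

A tree-growing algorithm would evaluate `split_impurity` for many candidate thresholds on each feature and keep the split with the lowest value.
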
  • Performance measure

    • Regression
      • RMSE (L2 norm): gives an idea of how much error the system typically makes in its predictions, with a higher weight given to large errors.
      • MAE (L1 norm): also called the average absolute deviation
        • The higher the norm index, the more it focuses on large values and neglects small ones. This is why the RMSE is more sensitive to outliers than the MAE
      • Median / Max absolute error
      • R² (coefficient of determination) / correlation coefficient
    • Classification
      • Precision / Recall
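        • Precision: the fraction of positive predictions that are correct, TP / (TP + FP)
        • Recall: the fraction of actual positives that are detected, TP / (TP + FN)
        • Trade-Off
          • Low threshold -> everything is predicted positive -> recall = 1, but precision falls to the share of positives in the data
          • High threshold -> almost everything is predicted negative -> precision is typically high, but recall approaches 0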
      • Accuracy
        • (TP + TN) / (TP + TN + FP + FN)
        • Accuracy is not reliable when dealing with imbalanced classes because it can be misleading and biased towards the majority class
      • F1-score
        • It is the harmonic mean of precision and recall.
        • Whereas the regular mean treats all values equally, the harmonic mean gives much more weight to low values.
        • As a result, the classifier will only get a high F1-score if both recall and precision are high.
        • The F1-score may not be the best metric to use when precision and recall are not equally important
      • ROC curve
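        • Plots the true positive rate (recall) against the false positive rate as the threshold varies
        • The FPR (false positive rate) is the fraction of negative instances that are incorrectly classified as positive, FP / (FP + TN)
        • Prefer the precision/recall curve whenever the positive class is rare or when we care more about the false positives than the false negatives
      • AUC (area under the ROC curve): 1 for a perfect classifier, 0.5 for a purely random one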
      • Confusion matrix
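
The classification metrics above can be checked on a tiny hand-made example. This is only a sketch using `sklearn.metrics`; the label vectors are made up for illustration.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up labels: 4 real positives, 6 real negatives.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, sklearn lays the matrix out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two
accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all
print(tn, fp, fn, tp, precision, recall, f1, accuracy)

# Imbalanced case: 5 positives out of 100, classifier always says "negative".
y_imb = np.array([1] * 5 + [0] * 95)
always_negative = np.zeros_like(y_imb)
imb_accuracy = accuracy_score(y_imb, always_negative)  # looks good
imb_recall = recall_score(y_imb, always_negative)      # finds no positives
print(imb_accuracy, imb_recall)
```

The second part shows why accuracy alone is misleading on imbalanced data: the useless classifier scores high accuracy while its recall is zero.
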

Week 2

  • A typical workflow for supervised learning

    • Check and clean data
    • Choose some candidate models based on data and task
    • Split data into training and test sets
    • Split training data into (reduced) training and validation sets
    • Train candidate models on (reduced) training sets
    • Select best model based on validation set errors
    • Retrain best model on full training set
    • Apply the best model to the test data, which gives an unbiased estimate of the generalization error
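
The workflow above might look like this in scikit-learn. It is a sketch under assumptions: the synthetic regression data, the two candidate models and the split sizes are placeholders I picked, and the cleaning step is skipped because the data is generated.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Split into training and test sets, then split the training data again
# into a reduced training set and a validation set.
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.25, random_state=0)

# Candidate models (assumed choices for illustration).
candidates = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(max_depth=4, random_state=0),
}

# Train on the reduced training set, compare on the validation set.
val_rmse = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_rmse[name] = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))

# Retrain the winner on the full training set, then test exactly once.
best_name = min(val_rmse, key=val_rmse.get)
best_model = clone(candidates[best_name]).fit(X_train_full, y_train_full)
test_rmse = np.sqrt(mean_squared_error(y_test, best_model.predict(X_test)))
print(best_name, val_rmse, test_rmse)
```
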
  • Missing data

    • If some feature values are missing
      • can remove the affected rows or features, but this is bad if a large amount of data is missing
      • can replace them with estimated / non-informative values (e.g. the mean or median)
    • Using imputers (e.g. sklearn's SimpleImputer)
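
A small sketch of imputation with `SimpleImputer`; the toy array and the median strategy are my own choices for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data with one missing value in each column.
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [5.0, np.nan]])

# Replace each missing value with the median of its column.
imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X)

print(imputer.statistics_)  # the per-column medians learned during fit
print(X_filled)
```

The imputer should be fitted on the training data only and then reused to transform validation and test data.
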
  • Scaling data

    • Why
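      • Many ML methods use distances between instances, and without scaling the features with the largest values would dominate
      • Generally safe to apply: it rarely hurts, and models that do not need it (e.g. decision trees) are unaffected
    • Using MinMaxScaler: rescales each feature to a fixed range, [0, 1] by default
    • Using StandardScaler: subtracts the mean and divides by the standard deviation, giving each feature zero mean and unit variance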
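
The two scalers can be compared on a toy array whose columns differ in scale by a factor of 100 (made-up numbers, just for illustration).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

# MinMaxScaler: (x - min) / (max - min), so each column ends up in [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

# StandardScaler: (x - mean) / std, so each column has mean 0 and std 1.
X_std = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_std)
```

After either transform both columns are on the same scale, so neither dominates a distance computation.
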
  • Pipelines in sklearn

    • Best practice: chain the preprocessing steps and the final model together as a Pipeline
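
A minimal sketch of such a chain, assuming an imputer, a scaler and a linear model as the steps (the step names and the toy data are arbitrary choices of mine).

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data: the target equals the first feature; one value is missing.
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 30.0],
              [4.0, 40.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])

# fit() runs fit_transform on each preprocessing step, then fits the model.
pipe.fit(X, y)

# predict() pushes new data through the same fitted transforms.
y_pred = pipe.predict(np.array([[2.5, np.nan]]))
print(y_pred)
```

Because the transforms are fitted inside the pipeline, statistics such as medians and means come only from the data passed to `fit`, which helps avoid leaking test data into preprocessing.
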
  • Linear regression

    • When we fit the model, we optimize the parameter vector to minimize an error/loss/cost function over the data points
    • For a linear model with squared-error loss, there is an analytic formula (the normal equation) for the optimal parameters; otherwise use an iterative method to find them
    • The model is linear in its parameters, but not necessarily in the data values (e.g. it can use polynomial features)
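
As a sketch of the analytic solution, the snippet below fits a quadratic with the normal equation. The model is still "linear" because it is linear in the parameters θ, even though the features include x². The noise-free data and the coefficients 1, 2, 3 are assumptions chosen so the result is easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free quadratic data (assumed true parameters: 1, 2, 3).
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 + 2.0 * x + 3.0 * x ** 2

# Design matrix with columns [1, x, x^2]: non-linear in x, linear in theta.
A = np.column_stack([np.ones_like(x), x, x ** 2])

# Normal equation: solve (A^T A) theta = A^T y.
theta = np.linalg.solve(A.T @ A, A.T @ y)

# Cross-check with sklearn (the intercept is already a column of A).
sk = LinearRegression(fit_intercept=False).fit(A, y)
print(theta, sk.coef_)
```
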
  • Parameters and hyper-parameters

    • Models have parameters that are fit to the data internally
    • Hyper-parameters are set before fitting and are not changed during it
    • The best hyper-parameter values are usually chosen by lowest validation error
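
A quick way to see the difference in scikit-learn, using `Ridge` as an assumed example model: `alpha` is a hyper-parameter passed to the constructor, while `coef_` and `intercept_` are parameters that only exist after fitting.

```python
import numpy as np
from sklearn.linear_model import Ridge

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

model = Ridge(alpha=1.0)                    # alpha: hyper-parameter, set by us
alpha_before = model.get_params()["alpha"]
fitted_before = hasattr(model, "coef_")     # no parameters learned yet

model.fit(X, y)                             # fitting learns coef_ and intercept_

alpha_after = model.get_params()["alpha"]   # unchanged by fitting
print(alpha_before, alpha_after, model.coef_, model.intercept_)
```
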

Week 1

  • Machine learning: program a computer to learn from data how to solve a problem.

  • Machine learning characteristics:

    • Avoid hand tuning of parameters
    • Adapt to new environments and settings (the model may forget things)
    • Complex problems with no known solution
    • Analysis and mining of large datasets
  • Types of machine learning (Supervised or not)

    • Supervised learning: your training data includes labels or values (the desired result)
      • Classification: uses discrete labels (integers, set of possible outcomes)
      • Regression: predicts a real number from the features, based on training data that contains the target values
      • Supervised learning algorithms
        • K-nearest-neighbors
        • Linear regression
        • Logistic regression
        • Support Vector Machines (SVMs)
        • Decision Trees and Random Forests
        • Neural networks
    • Unsupervised learning: your training data DOES NOT include labels or values
      • Clustering data: market research / Image segmentation
      • Dimension reduction: visualization / preprocessing (discard uninformative features)
      • Anomaly detection: fraud detection / security / pathology in medical data
        • To detect unusual instances based on mostly normal instances during training
      • Association rule learning
        • Dig into large amounts of data and discover interesting relations between attributes
      • Unsupervised learning algorithms
        • Clustering
          • K-Means
          • DBSCAN
          • Hierarchical Cluster Analysis (HCA)
        • Anomaly detection and novelty detection
          • One-class SVM
          • Isolation Forest
    • Semi-supervised learning: a lot of unlabelled data and a little bit of labelled data
    • Self-supervised learning
      • Example: a model is trained to recover the original image, using masked images as input and the original images as labels
      • Self-supervised learning is generally used for classification and regression tasks
      • Self-supervised learning generates a fully labelled dataset from an unlabeled one
    • Reinforcement learning: algorithm actively collects data, by interacting with the environment (environment provides reward signals, algorithm learns a policy to maximize reward over time)
  • Batch and online learning

    • Batch learning: the system cannot learn incrementally (AKA offline learning)
      • Model rot or data drift: a model’s performance tends to decay slowly over time, simply because the world continues to evolve while the model remains unchanged.
    • Online learning: the system can be trained incrementally, which suits data arriving as a continuous flow; it can also be used for out-of-core learning (datasets too large to fit in memory)
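
Online learning can be sketched with `SGDClassifier.partial_fit`, feeding the data in mini-batches as if it arrived as a stream. The synthetic data and the batch size of 100 are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=600, n_features=5, random_state=0)
classes = np.unique(y)  # partial_fit needs the full set of labels up front

clf = SGDClassifier(random_state=0)
n_batches = 0
for start in range(0, len(X), 100):
    X_batch = X[start:start + 100]
    y_batch = y[start:start + 100]
    # Each call updates the existing weights instead of retraining from scratch.
    clf.partial_fit(X_batch, y_batch, classes=classes)
    n_batches += 1

print(n_batches, clf.coef_.shape)
```

For out-of-core learning the same loop would read each batch from disk, so the full dataset never has to fit in memory.
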
  • Approaches to machine learning

    • Instance-based learning
      • K-nearest-neighbors (KNN): follow the example of nearby training data points, i.e. predict from the k closest training instances
    • Model-based learning
      • Fit a model to existing data
        • Fit: optimize model parameters
        • Model: linear model, support vector machine, deep neural network
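
A tiny instance-based example with KNN, on made-up one-dimensional data with two well-separated groups. Nothing is "fitted" in the model-based sense: the classifier stores the training points and votes among the 3 nearest ones at prediction time.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two groups on a line: class 0 near 0-2, class 1 near 10-12.
X_train = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Each query point takes the majority label of its 3 nearest neighbours.
knn_pred = knn.predict(np.array([[1.5], [10.5]]))
print(knn_pred)
```
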
  • Problems with data

    • Quantity of data
    • Range or domain of data (Try to cover more)
    • Real data contains noise, errors and outliers
      • Detect, discard, fix, fill
    • Some features are less useful than others (Try to select and combine features)
      • Feature engineering
        • Feature selection
        • Feature extraction
        • Create new features
  • Problems with models

    • Overfitting: model is too complex for the data, essentially fitting to noise in the data
      • Solve:
        • Try to collect more data or reduce noise (averaging)
        • Choose a less complex model with fewer parameters
        • Regularization
          • “soft” reduction in parameters
          • penalize parameter values that are far from zero
    • Underfitting: model is too simple for the data
      • Solve:
        • Feature engineering, preprocessing (feed better features to the model)
        • Choose a more powerful model
        • Reduce regularization penalty
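
The effect of the regularization penalty can be seen with ridge regression, used here as an assumed example of L2 regularization (the two `alpha` values are arbitrary). A larger penalty pulls the coefficients towards zero, which is the "soft" reduction in parameters mentioned above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=50, n_features=10, noise=5.0, random_state=0)

weak = Ridge(alpha=0.01).fit(X, y)     # almost no penalty
strong = Ridge(alpha=100.0).fit(X, y)  # heavy penalty on large coefficients

weak_norm = np.linalg.norm(weak.coef_)
strong_norm = np.linalg.norm(strong.coef_)
print(weak_norm, strong_norm)
```

Raising `alpha` is a remedy for overfitting; lowering it is a remedy for underfitting.
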
  • Selecting a model

    • Pick a few likely candidates
    • Compare them by their estimated generalization error and select the best one
  • Training and test sets

    • We split data into
      • Training set: used to fit model parameters
      • Test set: used to measure prediction accuracy after fitting
    • Train model on training set, measure prediction error on both training and test sets, but separately
      • High training set error: underfitting
      • Low training set error, high test set error: overfitting
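
Both failure modes can be reproduced with decision trees of different depth on noisy data. This is a sketch under assumptions: the sine-plus-noise data and the two depths are my own choices.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy samples of a sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# errors[name] = (training MSE, test MSE), measured separately.
errors = {}
for name, depth in [("stump", 1), ("unlimited", None)]:
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    errors[name] = (
        mean_squared_error(y_train, model.predict(X_train)),
        mean_squared_error(y_test, model.predict(X_test)),
    )

for name, (train_mse, test_mse) in errors.items():
    print(f"{name}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

The depth-1 stump is too simple and keeps a noticeable training error, while the unlimited tree memorizes the training set (training error near zero) and does worse on the test set than on the training set.
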
  • Model validation

    • If we repeatedly train and compare models on the same training / test split, we end up selecting for best performance on this particular split, so the test error becomes an optimistic estimate.
    • We further split our dataset into 3 parts: training set, validation set, test set.
      • Train candidate models on reduced training set, measure error on validation set.
      • Choose model/settings with lowest validation error
      • Estimate generalization error using test set
  • Cross validation (k-fold CV, leave-one-out CV)

    • Select multiple validation sets (test set is fixed) and train multiple times
    • Take the model with the lowest mean validation error
    • Re-train chosen model on the full training set (training set + validation set)
    • Test it once to estimate the generalization error
    • Trade off
      • Better for smaller training sets, not dependent on a single (small) split.
      • More time consuming
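
The cross-validation steps above can be sketched with `cross_val_score`. The assumptions here are mine: ridge regression as the model, three candidate `alpha` values, 5 folds and synthetic data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)

# The test set is fixed and held out; CV only ever sees the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
mean_val_rmse = {}
for alpha in [0.1, 1.0, 10.0]:
    # sklearn scorers are "higher is better", hence the negated RMSE.
    scores = cross_val_score(Ridge(alpha=alpha), X_train, y_train,
                             cv=cv, scoring="neg_root_mean_squared_error")
    mean_val_rmse[alpha] = -scores.mean()

# Lowest mean validation error wins; retrain on the full training set.
best_alpha = min(mean_val_rmse, key=mean_val_rmse.get)
final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)

# Test once to estimate the generalization error.
test_rmse = np.sqrt(np.mean((final_model.predict(X_test) - y_test) ** 2))
print(mean_val_rmse, best_alpha, test_rmse)
```

Each candidate is trained 5 times here, which is the extra cost mentioned in the trade-off.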