Using Machine Learning Tools
These are my notes for the course COMP SCI 3317 - Using Machine Learning Tools.
Week 3
-
Data pipeline
- A sequence of data processing components
- There is a lot of data to manipulate and many data transformations to apply
- Components typically run asynchronously
- Each component is self-contained, which makes the architecture quite robust
- Data pipelines are easy to understand and enable teams to focus on different components
- On the other hand, a broken component can go unnoticed so the data gets stale and the overall system’s performance drops
-
Classifier
- Linear model as a binary classifier
- Use a linear model for regression and apply a threshold to its output
- Stochastic gradient descent classifier
- Binary classifier using a linear model
- SGD is a fitting algorithm which iteratively steps along the negative gradient of the loss function, estimated from one sample (or a small batch) at a time
- Decision tree
- Iterative splitting
- Maximize class separation
- One feature at a time
- Up to maximum depth
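A minimal sketch of a linear binary classifier fitted with SGD, using only the standard library. The log loss, learning rate, number of epochs and the toy data are assumptions made for illustration; the notes do not say which loss is used.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_fit(xs, ys, lr=0.1, epochs=200):
    """Fit p(y=1|x) = sigmoid(w*x + b) by SGD on the log loss (assumed loss)."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):        # one sample per update
            p = sigmoid(w * x + b)
            w -= lr * (p - y) * x       # step against the gradient w.r.t. w
            b -= lr * (p - y)           # step against the gradient w.r.t. b
    return w, b

def predict(w, b, x, threshold=0.5):
    """Turn the linear model's output into a class by applying a threshold."""
    return 1 if sigmoid(w * x + b) >= threshold else 0

xs = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]   # toy 1-D feature
ys = [0, 0, 0, 1, 1, 1]                  # toy labels
w, b = sgd_fit(xs, ys)
predictions = [predict(w, b, x) for x in xs]
```

Each update uses a single sample, which is what makes the descent "stochastic"; moving the threshold changes the decision boundary without refitting.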
-
Inner workings
- The Gini impurity metric measures how mixed the class distribution in a node is (G = 0 means the node is pure, which is best)
- Decision boundaries
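A small sketch of the Gini impurity and of one splitting step on a single feature. The function names and the toy data are hypothetical, and candidate thresholds are taken midway between sorted feature values, which is one common choice rather than the only one.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2; 0 means the node is pure."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(xs, ys):
    """Return (threshold, weighted impurity) of the best split on one feature."""
    values = sorted(set(xs))
    best = (None, float("inf"))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        # impurity of the two children, weighted by their sizes
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best

xs = [1, 2, 3, 10, 11, 12]
ys = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(xs, ys)   # threshold 6.5, impurity 0.0
```

A full tree would repeat this search over every feature and recurse into each child until the maximum depth is reached.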
-
Performance measure
- Regression
- RMSE (L2 norm): gives an idea of how much error the system typically makes in its predictions, with a higher weight given to large errors.
- MAE (L1 norm): the mean absolute error, also called the average absolute deviation
- The higher the norm index, the more it focuses on large values and neglects small ones. This is why the RMSE is more sensitive to outliers than MAE
- Median / Max absolute error
- R² / correlation coefficient
- Classification
- Precision / Recall
- Precision: the fraction of positive predictions that are correct, TP / (TP + FP)
- Recall: the fraction of the actual positive class that is detected, TP / (TP + FN)
- Trade-Off
- Low threshold -> everything is positive -> recall = 1 (but precision drops)
- High threshold -> almost everything is negative -> precision usually rises towards 1 (but recall drops)
- Accuracy
- (TP + TN) / (TP + TN + FP + FN)
- Accuracy is not reliable when dealing with imbalanced classes because it may be misleading and biased towards the majority class
- F1-score
- It is the harmonic mean of precision and recall.
- Whereas the regular mean treats all values equally, the harmonic mean gives much more weight to low values.
- As a result, the classifier will only get a high F1-score if both recall and precision are high.
- The F1-score may not be the best metric to use when precision and recall are not equally important
- ROC curve
- The FPR (false positive rate) is the ratio of negative instances that are incorrectly classified as positive
- Should use Precision / Recall curve whenever the positive class is rare or when we care more about the false positives than the false negatives
- AUC (area under the ROC curve)
- Confusion matrix
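The measures above can be sketched in plain Python. The function names, scores and labels are made up for illustration; the formulas follow the definitions in the notes, with F1 written in terms of counts (equivalent to the harmonic mean of precision and recall).

```python
import math

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def classification_metrics(y_true, scores, threshold):
    """Threshold the scores, count the confusion matrix cells, derive the metrics."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    nan = float("nan")
    return {
        "precision": tp / (tp + fp) if tp + fp else nan,
        "recall": tp / (tp + fn) if tp + fn else nan,
        "accuracy": (tp + tn) / len(pairs),
        "f1": 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else nan,
    }

# Regression: a single large error raises RMSE more than MAE.
reg_mae = mae([1, 2, 3, 4], [1, 2, 3, 8])     # 1.0
reg_rmse = rmse([1, 2, 3, 4], [1, 2, 3, 8])   # 2.0

# Classification: the same scores evaluated at three thresholds.
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
low = classification_metrics(y_true, scores, 0.0)     # recall 1.0, precision 0.5
mid = classification_metrics(y_true, scores, 0.5)     # all four metrics 0.75
high = classification_metrics(y_true, scores, 0.85)   # precision 1.0, recall 0.25
```

Sweeping the threshold over all score values and recording the resulting pairs is how a precision / recall curve (or, with TPR and FPR, a ROC curve) is traced.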
Week 2
-
A typical workflow for supervised learning
- Check and clean data
- Choose some candidate models based on data and task
- Split data into training and test sets
- Split training data into (reduced) training and validation sets
- Train candidate models on (reduced) training sets
- Select best model based on validation set errors
- Retrain best model on full training set
- Apply best model to test data which gives an unbiased estimate of the generalization error
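The workflow above can be sketched end to end with two toy candidate models. Everything here is an assumption for illustration: the synthetic data, the split sizes, and the two candidates (a constant "mean" model and a least-squares line).

```python
import math
import random

def fit_mean(xs, ys):
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return lambda x: slope * x + (my - slope * mx)

def fit_on(fit, rows):
    xs, ys = zip(*rows)
    return fit(xs, ys)

def rmse(model, rows):
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in rows) / len(rows))

rng = random.Random(0)   # fixed seed so the sketch is repeatable
data = [(i / 10, 2 * (i / 10) + 1 + rng.gauss(0, 0.1)) for i in range(100)]
rng.shuffle(data)

# Split into test and training sets, then into validation and reduced training sets.
test, train_full = data[:20], data[20:]
val, train = train_full[:20], train_full[20:]

# Train candidates on the reduced training set and compare validation errors.
candidates = {"mean": fit_mean, "line": fit_line}
val_errors = {name: rmse(fit_on(fit, train), val) for name, fit in candidates.items()}
best_name = min(val_errors, key=val_errors.get)

# Retrain the winner on the full training set, then touch the test set once.
final_model = fit_on(candidates[best_name], train_full)
test_error = rmse(final_model, test)
```

The test set is used exactly once, after model selection, which is what keeps `test_error` an unbiased estimate.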
-
Missing data
- If there are missing features
- can remove data, but bad if large amounts of missing data
- can replace them with estimated / non-informative values
- Using imputers
-
Scaling data
- Why
- Often distance is used by the ML method, and without scaling the features with the largest values would dominate
- Generally safe to apply by default, as it rarely does any harm
- Using MinMaxScaler
- Using StandardScaler
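A plain-Python sketch of what the two kinds of scaling compute for one feature. The function names are hypothetical stand-ins, not sklearn's API; the population standard deviation is used for standardization, which is my understanding of what StandardScaler does.

```python
import statistics

def min_max_scale(values):
    """Map values linearly onto [0, 1] (the idea behind MinMaxScaler)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift to zero mean and scale to unit variance (the idea behind StandardScaler)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)   # population standard deviation
    return [(v - mean) / std for v in values]

incomes = [10, 20, 30, 40]
scaled = min_max_scale(incomes)       # first value 0.0, last value 1.0
standardized = standardize(incomes)   # mean 0, standard deviation 1
```

In practice the minimum, maximum, mean and standard deviation should come from the training data only and then be reused on validation and test data.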
-
Pipelines in sklearn
- Best practice: chain them together as a Pipeline
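To show why chaining helps, here is a library-free sketch of the idea behind a pipeline: each step learns its statistics in `fit` and applies them in `transform`, and the chain calls them in order. The classes `MedianImputer`, `Standardizer` and `Chain` are hypothetical illustrations, not sklearn's API; in sklearn the same role is played by `Pipeline` with its own imputer and scaler classes.

```python
import statistics

class MedianImputer:
    def fit(self, values):
        self.median_ = statistics.median(v for v in values if v is not None)
        return self
    def transform(self, values):
        return [self.median_ if v is None else v for v in values]

class Standardizer:
    def fit(self, values):
        self.mean_ = statistics.mean(values)
        self.std_ = statistics.pstdev(values)
        return self
    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]

class Chain:
    """Fit each step on the output of the previous one; replay them on new data."""
    def __init__(self, steps):
        self.steps = steps
    def fit(self, values):
        for step in self.steps:
            values = step.fit(values).transform(values)
        return self
    def transform(self, values):
        for step in self.steps:
            values = step.transform(values)
        return values

train = [1.0, None, 3.0, 5.0]   # None marks a missing value
new = [None, 7.0]
chain = Chain([MedianImputer(), Standardizer()]).fit(train)
train_out = chain.transform(train)
new_out = chain.transform(new)   # uses the median, mean and std learned from train
```

Because the statistics are learned once from the training data, new data is transformed consistently and nothing leaks from it into the fit.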
-
Linear regression
- When we fit the model, we optimize the parameter vector to minimize an error/loss/cost function over data points
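- For a linear model with squared error, there is an analytic formula for the optimal parameters; otherwise use an iterative method to find them
- The model is linear in each parameter, but not necessarily in the data values (features can be nonlinear functions of the inputs)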
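A small sketch of both points: the closed-form least-squares fit for one feature, applied to a model that is linear in its parameters but quadratic in the data. The model y = a + b * x^2 and the noise-free toy data are assumptions chosen so the result is easy to check by hand.

```python
def fit_line(zs, ys):
    """Closed-form least squares for y = intercept + slope * z."""
    mz, my = sum(zs) / len(zs), sum(ys) / len(ys)
    slope = sum((z - mz) * (y - my) for z, y in zip(zs, ys)) / sum((z - mz) ** 2 for z in zs)
    intercept = my - slope * mz
    return intercept, slope

xs = [-2, -1, 0, 1, 2]
ys = [1 + 3 * x ** 2 for x in xs]   # generated from a = 1, b = 3

# The model is nonlinear in x but linear in (a, b): transform the feature first.
zs = [x ** 2 for x in xs]
intercept, slope = fit_line(zs, ys)   # recovers a = 1, b = 3
```

The same analytic formula works because the optimisation is over the parameters, and the squared feature is just another fixed input column.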
-
Parameters and hyper-parameters
- Models have parameters that are fit to data internally
- Hyper-parameters are not changed during fitting
- The best hyper-parameter values are usually chosen by comparing validation errors
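A sketch of the distinction, assuming a one-feature ridge regression without intercept: the weight `w` is a parameter fitted to the data, while the penalty strength `lam` is a hyper-parameter that stays fixed during fitting and is picked by validation error. The data, the grid of `lam` values and the function names are made up for illustration.

```python
import random

def fit_ridge(xs, ys, lam):
    """Minimise sum((y - w*x)^2) + lam * w^2; closed form for a single weight."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(w, xs, ys):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(1)   # fixed seed so the sketch is repeatable
xs = [rng.uniform(0, 1) for _ in range(30)]
ys = [2 * x + rng.gauss(0, 0.3) for x in xs]
train_x, train_y = xs[:20], ys[:20]
val_x, val_y = xs[20:], ys[20:]

lams = [0.0, 0.1, 1.0, 10.0]   # candidate hyper-parameter values
weights = {lam: fit_ridge(train_x, train_y, lam) for lam in lams}       # parameters
val_errors = {lam: mse(w, val_x, val_y) for lam, w in weights.items()}
best_lam = min(val_errors, key=val_errors.get)
```

Larger `lam` values pull the fitted weight towards zero, so the grid trades off fit against penalty; only the validation error decides which value is kept.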
Week 1
-
Machine learning: program a computer to learn from the data how to solve the problem.
-
Machine learning characteristics:
- Avoid hand tuning of parameters
- Adapt to new environments and settings (Model may forget things)
- Complex problem with no known solution
- Analysis and mining of large datasets
-
Types of machine learning (Supervised or not)
- Supervised learning: your training data includes labels or values (the desired result)
- Classification: uses discrete labels (integers, set of possible outcomes)
- Regression: uses real values to predict a real number, based on training data that contains values and possibly other features.
- Supervised learning algorithms
- K-nearest-neighbors
- Linear regression
- Logistic regression
- Support Vector Machines (SVMs)
- Decision Trees and Random Forests
- Neural networks
- Unsupervised learning: your training data DOES NOT include labels or values
- Clustering data: market research / Image segmentation
- Dimension reduction: visualization / preprocessing (discard less useful features)
- Anomaly detection: fraud detection / security / pathology in medical data
- To detect unusual instances based on mostly normal instances during training
- Association rule learning
- Dig into large amount of data and discover interesting relations between attributes
- Unsupervised learning algorithms
- Clustering
- K-Means
- DBSCAN
- Hierarchical Cluster Analysis (HCA)
- Anomaly detection and novelty detection
- One-class SVM
- Isolation Forest
- Semi-supervised learning: a lot of unlabelled data and a little bit of labelled data
- Self-supervised learning
- For example, a model is trained to recover the original image from a masked version, using masked images as input and the original images as labels
- Self-supervised learning is generally used for classification and regression tasks
- Self-supervised learning generates a fully labelled dataset from an unlabeled one
- Reinforcement learning: algorithm actively collects data, by interacting with the environment (environment provides reward signals, algorithm learns a policy to maximize reward over time)
-
Batch and online learning
- Batch learning: system cannot learn incrementally (AKA offline learning)
- Model rot or data drift: a model's performance tends to decay slowly over time, simply because the world continues to evolve while the model remains unchanged.
- Online learning: system can be trained incrementally, great to receive data as a continuous flow, also can do out-of-core learning
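A minimal sketch of online learning, assuming a one-weight linear model updated by a gradient step per incoming sample. The class `OnlineLinearModel`, its `learn_one` method, the learning rate and the simulated stream are all hypothetical.

```python
class OnlineLinearModel:
    """y ~= w * x, updated one sample at a time (hypothetical class)."""
    def __init__(self, lr=0.05):
        self.w = 0.0
        self.lr = lr

    def learn_one(self, x, y):
        error = self.w * x - y
        self.w -= self.lr * 2 * error * x   # gradient step on (w*x - y)^2

    def predict(self, x):
        return self.w * x

model = OnlineLinearModel()

# Phase 1: the stream follows y = 3x.
for step in range(60):
    x = 1 + step % 3
    model.learn_one(x, 3 * x)
w_before_drift = model.w

# Phase 2: the world changes to y = -x; the model keeps learning and adapts.
for step in range(60):
    x = 1 + step % 3
    model.learn_one(x, -1 * x)
w_after_drift = model.w
```

Each sample can be discarded after its update, which is why the same scheme suits data that does not fit in memory. A batch model trained only on phase 1 would keep predicting with a weight near 3 until it is retrained from scratch.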
-
Approaches to machine learning
- Instance-based learning
- K-nearest-neighbors (KNN): follow the labels of nearby training data points
- Model-based learning
- Fit a model to existing data
- Fit: optimize model parameters
- Model: linear model, support vector machine, deep neural network
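A short sketch of instance-based learning with K-nearest-neighbors: nothing is fitted, the training points are kept and a query is labelled by majority vote among its nearest neighbours. Squared Euclidean distance, k = 3 and the toy points are assumptions for illustration.

```python
from collections import Counter

def knn_predict(points, labels, query, k=3):
    """Label the query by majority vote among the k closest training points."""
    by_distance = sorted(
        zip(points, labels),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], query)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]

near_origin = knn_predict(points, labels, (0.5, 0.5))   # "a"
near_corner = knn_predict(points, labels, (5.5, 5.5))   # "b"
```

A model-based approach would instead compress the training data into a few parameters and discard the points after fitting.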
-
Problems with data
- Quantity of data
- Range or domain of data (Try to cover more)
- Real data contains noise, errors and outliers
- Detect, discard, fix, fill
- Some features are less useful than others (Try to select and combine features)
- Feature engineering
- Feature selection
- Feature extraction
- Create new features
-
Problems with models
- Overfitting: model is too complex for the data, essentially fitting to noise in the data
- Solve:
- Try to collect more data or reduce noise (averaging)
- Choose a less complex model with fewer parameters
- Regularization
- “soft” reduction in parameters
- penalize parameter values that are far from zero
- Underfitting: model is too simple for the data
- Solve:
- Feature engineering, preprocessing (feed better features to the model)
- Choose a more powerful model
- Reduce regularization penalty
-
Selecting a model
- Pick a few likely candidates
- Compare them by their estimated generalization error and select the best one
-
Training and test sets
- We split data into
- Training set: used to fit model parameters
- Test set: used to measure prediction accuracy after fitting
- Train model on training set, measure prediction error on both training and test sets, but separately
- High training set error: underfitting
- Low training set error, high test set error: overfitting
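A sketch of reading training and test errors side by side, using three toy models on synthetic data: a constant model (too simple), a least-squares line, and a 1-nearest-neighbour regressor (memorises the training set). The data, the noise level and the model choices are assumptions for illustration.

```python
import math
import random

def fit_mean(rows):
    mean_y = sum(y for _, y in rows) / len(rows)
    return lambda x: mean_y

def fit_line(rows):
    mx = sum(x for x, _ in rows) / len(rows)
    my = sum(y for _, y in rows) / len(rows)
    slope = sum((x - mx) * (y - my) for x, y in rows) / sum((x - mx) ** 2 for x, _ in rows)
    return lambda x: slope * x + (my - slope * mx)

def fit_1nn(rows):
    # Predict the target of the single closest training point.
    return lambda x: min(rows, key=lambda r: abs(r[0] - x))[1]

def rmse(model, rows):
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in rows) / len(rows))

rng = random.Random(0)   # fixed seed so the sketch is repeatable
data = [(i / 10, 2 * (i / 10) + 1 + rng.gauss(0, 0.5)) for i in range(60)]
rng.shuffle(data)
train, test = data[:40], data[40:]

errors = {}
for name, fit in {"mean": fit_mean, "line": fit_line, "1-NN": fit_1nn}.items():
    model = fit(train)   # fit on the training set only
    errors[name] = {"train": rmse(model, train), "test": rmse(model, test)}
```

The constant model has a high training error (underfitting), while 1-NN has zero training error but a clearly higher test error (overfitting to the noise).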
-
Model validation
- If we repeatedly train and test on the same training / test split, we end up selecting for best performance on this particular split, so the test error is no longer an unbiased estimate.
- We further split our dataset into 3 parts: training set, validation set, test set.
- Train candidate models on reduced training set, measure error on validation set.
- Choose model/settings with lowest validation error
- Estimate generalization error using test set
-
Cross validation (K-fold CV, leave-one-out CV)
- Select multiple validation sets (test set is fixed) and train multiple times
- Take the model with the lowest mean validation error
- Re-train chosen model on the full training set (training set + validation set)
- Test it once to estimate the generalization error
- Trade off
- Better for smaller training sets, not dependent on a single (small) split.
- More time consuming
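A sketch of K-fold cross validation on a training set, assuming interleaved folds and two toy candidates (a constant model and a least-squares line). The data and function names are made up; the fixed test set is kept aside and not shown here.

```python
import math

def fit_mean(rows):
    mean_y = sum(y for _, y in rows) / len(rows)
    return lambda x: mean_y

def fit_line(rows):
    mx = sum(x for x, _ in rows) / len(rows)
    my = sum(y for _, y in rows) / len(rows)
    slope = sum((x - mx) * (y - my) for x, y in rows) / sum((x - mx) ** 2 for x, _ in rows)
    return lambda x: slope * x + (my - slope * mx)

def rmse(model, rows):
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in rows) / len(rows))

def k_fold_splits(rows, k):
    """Yield (train, validation) pairs; each row is validated on exactly once."""
    for fold in range(k):
        val = rows[fold::k]
        train = [r for i, r in enumerate(rows) if i % k != fold]
        yield train, val

def cv_error(fit, rows, k=5):
    """Mean validation error over the k folds."""
    fold_errors = [rmse(fit(train), val) for train, val in k_fold_splits(rows, k)]
    return sum(fold_errors) / len(fold_errors)

# Toy training set: a line plus a small alternating wiggle.
training_set = [(i / 10, 2 * (i / 10) + 1 + 0.1 * (-1) ** i) for i in range(40)]

candidates = {"mean": fit_mean, "line": fit_line}
cv_errors = {name: cv_error(fit, training_set) for name, fit in candidates.items()}
best_name = min(cv_errors, key=cv_errors.get)

# Re-train the chosen model on the full training set before the single test run.
final_model = candidates[best_name](training_set)
```

Setting `k` to `len(training_set)` gives leave-one-out CV; the cost grows with `k` because every candidate is trained `k` times.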