Machine Learning: definitions
This is a simple list of terms and definitions that are useful when studying machine learning. I created the list for myself, but it may be useful to others studying the topic.
Term | Definition |
--- | --- |
Definition of machine learning (Tom Mitchell) | A computer program is said to learn from experience $E$ with respect to some task $T$ and some performance measure $P$, if its performance on $T$, as measured by $P$, improves with experience $E$. |
Training set | Collection of examples the system uses to learn. |
Training instance or sample | Each example in a training set. |
Test set | Collection of examples used to test the system. The available data is typically divided into two sets: a training set and a test set. A reasonable split is 80% training set and 20% test set (a minimal split example is sketched after the table). |
Attribute | Data type associated with each example in the training set (e.g. mileage of a car). |
Feature | Data type associated with each example in the training set, together with its value (e.g. mileage = 15000). |
Data mining | Using machine learning techniques to dig into large amounts of data in order to discover patterns that were not previously apparent. |
Supervised machine learning | Machine learning system that is trained by providing training data with the desired solutions (labels). |
Unsupervised machine learning | Machine learning system that is trained through unlabeled data. |
Semisupervised machine learning | Machine learning system that is trained using partially labeled data. |
Reinforcement learning | Machine learning system capable of learning by receiving rewards or penalties for the actions it performs. Its aim is to find the best strategy (policy) to maximize the reward. |
Batch learning or offline learning | Training technique where the training set is entirely provided ahead of time. The system cannot learn incrementally. |
Online learning | Training technique where learning is carried out incrementally. This technique is useful in contexts where data arrives as a continuous flow (e.g. stock prices); see the sketch after the table. |
Learning rate | Measure of how quickly the system adapts to changes. This measure only makes sense in online learning systems. |
Instance-based learning | Generalization technique where the system learns the examples by heart and generalizes to new cases by measuring their distance to the learned examples. |
Model-based learning | Generalization is based on a model. The model makes the predictions (e.g. linear regression). |
Sampling noise | Undesirable effect of training in which the training set is too small, so nonrepresentative data sneaks in as a result of chance. |
Sampling bias | Undesirable effect of training in which the training set may be large, but is nonrepresentative of the entire context. |
Feature engineering | The process of producing a good set of relevant and representative features to train the system on; it includes feature selection and feature extraction. |
Feature selection | The process of selecting the most relevant features to train the machine on. |
Feature extraction | The process of combining, producing, or reducing features (e.g. through dimensionality reduction). |
Overfitting | The situation in which the model performs well on the training data, but does not generalize well to new cases. Constraining a model to reduce the risk of overfitting is called regularization. |
Underfitting | The situation in which the model is too simple to learn the underlying structure of the data, and therefore predicts poorly. |
Hyperparameter | Parameter of the learning algorithm, not of the model itself. It is set before the training process and remains constant during it (e.g. the amount of regularization to apply). |
Degrees of freedom | Parameters of the model. |
Generalization error or out-of-sample error | Error rate on new cases on which the system is tested. |
Holdout validation | Technique used to find a proper regularization hyperparameter. Part of the training set is held out to evaluate several candidate models; the held-out set is called the validation set (or development set, or dev set). Multiple models are trained on the reduced training set, the one performing best on the validation set is selected, and that model is then trained on the full training set. |
Cross-validation | Technique used to find a proper regularization hyperparameter. Holdout validation may not be practical if the validation set is too small or if the reduced training set is much smaller than the full training set. Cross-validation uses multiple validation sets: each model is evaluated once per validation set, after being trained on the rest of the data, and its evaluations on all the validation sets are then averaged (see the sketch after the table). |
Pipeline | Sequence of processing components. |
$m$ | Number of examples in the dataset being considered (e.g. the test set). |
$\boldsymbol{x}^{\left(i\right)}$ | Column vector containing all the feature values, excluding the label, of the $i$th example in the dataset. |
$\boldsymbol{X}$ | Matrix containing all the values of $\boldsymbol{x}^{\left(i\right)}$, one example per row: $$\boldsymbol{X}=\left[\begin{array}{c} \left(\boldsymbol{x}^{\left(1\right)}\right)^{T}\\ \left(\boldsymbol{x}^{\left(2\right)}\right)^{T}\\ \vdots\\ \left(\boldsymbol{x}^{\left(m\right)}\right)^{T} \end{array}\right]$$ |
$h$ | Prediction function, or hypothesis. Given an example $\boldsymbol{x}^{\left(i\right)}$, the prediction is: $$\hat{y}^{\left(i\right)}=h\left(\boldsymbol{x}^{\left(i\right)}\right)$$ $\hat{y}^{\left(i\right)}-y^{\left(i\right)}$ is the prediction error. |
Root mean square error | Typical performance measure for regression problems. It gives an idea of how much error the system typically makes in its predictions, with a higher weight given to large errors (see the code sketch after the table): $$\mathrm{RMSE}\left(\boldsymbol{X},h\right)=\sqrt{\frac{1}{m}\cdot\sum_{i=1}^{m}\left(h\left(\boldsymbol{x}^{\left(i\right)}\right)-y^{\left(i\right)}\right)^{2}}$$ |
Mean absolute error | Another performance measure for regression problems, less sensitive to outliers than the RMSE: $$\mathrm{MAE}\left(\boldsymbol{X},h\right)=\frac{1}{m}\cdot\sum_{i=1}^{m}\left|h\left(\boldsymbol{x}^{\left(i\right)}\right)-y^{\left(i\right)}\right|$$ |
Confusion matrix | A matrix used to measure the performance of a classification model. In a confusion matrix $C=\left[c_{i,j}\right]_{i,j=0}^{K-1}$, where $K$ is the number of classes, $c_{i,j}$ counts how many times an instance of class $i$ is predicted as class $j$. The metrics below are translated into code after the table. |
Accuracy (of a classifier) | The fraction of samples correctly classified out of all the samples in the test set. Referring to the confusion matrix $C$: $$A=\frac{\sum_{i=0}^{K-1}c_{i,i}}{\sum_{i=0}^{K-1}\sum_{j=0}^{K-1}c_{i,j}}$$ |
Precision (of a classifier) | The fraction of samples predicted to be of class $k$ that actually belong to class $k$. Referring to the confusion matrix $C$: $$P=\frac{c_{k,k}}{\sum_{i=0}^{K-1}c_{i,k}}$$ |
Recall (of a classifier) | The fraction of samples actually belonging to class $k$ that are correctly predicted as class $k$. Referring to the confusion matrix $C$: $$R=\frac{c_{k,k}}{\sum_{j=0}^{K-1}c_{k,j}}$$ |
$F_1$ score | $F_1$ is the harmonic mean of recall and precision: $$F_{1}=\frac{2}{\frac{1}{R}+\frac{1}{P}}$$ |
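
A minimal sketch of the 80/20 train/test split mentioned above, assuming scikit-learn and NumPy are available; the data here is synthetic and only illustrates the shapes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 3)  # 100 examples, 3 features each (synthetic)
y = np.random.rand(100)     # one label per example (synthetic)

# 80% training set, 20% test set, with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)
```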
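A sketch of online learning, assuming scikit-learn's `SGDRegressor`: its `partial_fit` method lets the model learn incrementally from mini-batches, which simulates data arriving as a flow. The batch sizes and the linear target are made up:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# eta0 is the (constant) learning rate: how quickly the model adapts
model = SGDRegressor(learning_rate="constant", eta0=0.01)

rng = np.random.default_rng(42)
for _ in range(100):                     # simulate data arriving as a flow
    X_batch = rng.random((10, 3))        # a mini-batch of 10 examples
    y_batch = X_batch @ np.array([1.0, 2.0, 3.0])  # synthetic linear target
    model.partial_fit(X_batch, y_batch)  # incremental update, no retraining
```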
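A sketch of cross-validation with scikit-learn's `cross_val_score`, again on synthetic data: each candidate model is trained on 4 of the 5 folds and evaluated on the remaining one, and the scores are averaged; the model with the best average score would then be selected:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.random((100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + 0.1 * rng.standard_normal(100)

# 5-fold cross-validation: train on 4 folds, validate on the 5th, repeat
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring="neg_root_mean_squared_error")
print(-scores.mean())  # average RMSE over the 5 validation folds
```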
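The RMSE and MAE formulas translated directly into NumPy; `y_pred` plays the role of $h\left(\boldsymbol{x}^{\left(i\right)}\right)$ and `y_true` of $y^{\left(i\right)}$, with made-up values:

```python
import numpy as np

def rmse(y_true, y_pred):
    # square root of the mean of the squared prediction errors
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def mae(y_true, y_pred):
    # mean of the absolute prediction errors
    return np.mean(np.abs(y_pred - y_true))

y_true = np.array([3.0, -0.5, 2.0, 7.0])  # made-up labels
y_pred = np.array([2.5,  0.0, 2.0, 8.0])  # made-up predictions
print(rmse(y_true, y_pred))  # ~0.612
print(mae(y_true, y_pred))   # 0.5
```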
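Finally, accuracy, precision, recall, and the $F_1$ score computed from a confusion matrix exactly as in the formulas above; the matrix values are made up for illustration:

```python
import numpy as np

# Made-up 3-class confusion matrix: rows = actual class, columns = predicted
C = np.array([[50,  2,  3],
              [ 4, 40,  6],
              [ 1,  5, 45]])

accuracy = np.trace(C) / C.sum()     # correct predictions over all samples

k = 1                                # class of interest
precision = C[k, k] / C[:, k].sum()  # column sum: everything predicted as k
recall    = C[k, k] / C[k, :].sum()  # row sum: everything actually of class k
f1 = 2 / (1 / recall + 1 / precision)  # harmonic mean of recall and precision

print(accuracy, precision, recall, f1)
```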