Machine Learning: definitions
This is a simple list of terms and definitions that are useful when studying machine learning. I created the list for myself, but it may be useful to others studying the topic.
Term | Definition |
--- | --- |
Definition of machine learning (Tom Mitchell) | A computer program is said to learn from experience $E$ with respect to some task $T$ and some performance measure $P$, if its performance on $T$, as measured by $P$, improves with experience $E$. |
Training set | Collection of examples the system uses to learn. |
Training instance or sample | Each example in a training set. |
Test set | Collection of examples used to test the system. The available data is typically divided into two sets: a training set and a test set. A reasonable split is 80% training set and 20% test set (a minimal split example is sketched after the table). |
Attribute | Data type associated with each example in the training set (e.g. mileage of a car). |
Feature | Data type associated with each example in the training set, together with its value (e.g. mileage = 15000). |
Data mining | Using machine learning techniques to dig into large amounts of data in order to discover patterns that were not previously apparent. |
Supervised machine learning | Machine learning system that is trained by providing training data with the desired solutions (labels). |
Unsupervised machine learning | Machine learning system that is trained through unlabeled data. |
Semisupervised machine learning | Machine learning system that is trained using partially labeled data. |
Reinforcement learning | Machine learning system capable of learning by receiving rewards or penalties for the actions it performs. Its aim is to find the best strategy (policy) to maximize the reward. |
Batch learning or offline learning | Training technique where the training set is entirely provided ahead of time. The system cannot learn incrementally. |
Online learning | Training technique where learning is carried out incrementally. This technique is useful in contexts where data arrives as a continuous flow (e.g. stock prices); see the sketch after the table. |
Learning rate | Measure of how quickly the system adapts to changes. This measure only makes sense in online learning systems. |
Instance-based learning | Generalization technique where the system learns the examples by heart and generalizes to new cases by measuring their distance to the learned examples. |
Model-based learning | Generalization is based on a model. The model makes the predictions (e.g. linear regression). |
Sampling noise | Undesirable effect of training in which the training set is too small, so nonrepresentative data sneaks in as a result of chance. |
Sampling bias | Undesirable effect of training in which the training set may be large, but is nonrepresentative of the entire context. |
Feature engineering | The process of producing a good set of relevant and representative features to train the system on; it includes feature selection and feature extraction. |
Feature selection | The process of selecting the most relevant features to train the machine on. |
Feature extraction | The process of combining, producing, or reducing features (e.g. through dimensionality reduction). |
Overfitting | The situation in which the model performs well on the training data, but does not generalize well to new cases. Constraining a model to reduce the risk of overfitting is called regularization. |
Underfitting | The situation in which the model is too simple to learn the underlying structure of the data, and therefore predicts poorly. |
Hyperparameter | Parameter of the learning algorithm, not of the model itself. It is set before the training process and remains constant during it (e.g. the amount of regularization to apply). |
Degrees of freedom | Parameters of the model. |
Generalization error or out-of-sample error | Error rate on new cases on which the system is tested. |
Holdout validation | Technique used to find a proper regularization hyperparameter. Part of the training set is held out to evaluate several candidate models; the held-out set is called the validation set (or development set, or dev set). Multiple models are trained on the reduced training set, the one performing best on the validation set is selected, and that model is then trained on the full training set. |
Cross-validation | Technique used to find a proper regularization hyperparameter. Holdout validation may not be practical if the validation set is too small or if the reduced training set is much smaller than the full training set. Cross-validation uses multiple validation sets: each model is evaluated once per validation set, after being trained on the rest of the data, and its evaluations on all the validation sets are then averaged (see the sketch after the table). |
Pipeline | Sequence of processing components. |
$m$ | Number of examples in the dataset being considered (e.g. the test set). |
$\boldsymbol{x}^{\left(i\right)}$ | Column vector containing all the feature values, excluding the label, of the $i$th example in the dataset. |
$\boldsymbol{X}$ | Matrix containing all the values of $\boldsymbol{x}^{\left(i\right)}$, one example per row: $$\boldsymbol{X}=\left[\begin{array}{c} \left(\boldsymbol{x}^{\left(1\right)}\right)^{T}\\ \left(\boldsymbol{x}^{\left(2\right)}\right)^{T}\\ \vdots\\ \left(\boldsymbol{x}^{\left(m\right)}\right)^{T} \end{array}\right]$$ |
$h$ | Prediction function, or hypothesis. Given an example $\boldsymbol{x}^{\left(i\right)}$, the prediction is: $$\hat{y}^{\left(i\right)}=h\left(\boldsymbol{x}^{\left(i\right)}\right)$$ $\hat{y}^{\left(i\right)}-y^{\left(i\right)}$ is the prediction error. |
Root mean square error | Typical performance measure for regression problems. It gives an idea of how much error the system typically makes in its predictions, with a higher weight given to large errors (see the code sketch after the table): $$\mathrm{RMSE}\left(\boldsymbol{X},h\right)=\sqrt{\frac{1}{m}\cdot\sum_{i=1}^{m}\left(h\left(\boldsymbol{x}^{\left(i\right)}\right)-y^{\left(i\right)}\right)^{2}}$$ |
Mean absolute error | Another performance measure for regression problems, less sensitive to outliers than the RMSE: $$\mathrm{MAE}\left(\boldsymbol{X},h\right)=\frac{1}{m}\cdot\sum_{i=1}^{m}\left|h\left(\boldsymbol{x}^{\left(i\right)}\right)-y^{\left(i\right)}\right|$$ |
Confusion matrix | A matrix used to measure the performance of a classification model. In a confusion matrix $C=\left[c_{i,j}\right]_{i,j=0}^{K-1}$, where $K$ is the number of classes, $c_{i,j}$ counts how many times an instance of class $i$ is predicted as class $j$. The metrics below are translated into code after the table. |
Accuracy (of a classifier) | The fraction of samples correctly classified out of all the samples in the test set. Referring to the confusion matrix $C$: $$A=\frac{\sum_{i=0}^{K-1}c_{i,i}}{\sum_{i=0}^{K-1}\sum_{j=0}^{K-1}c_{i,j}}$$ |
Precision (of a classifier) | The fraction of samples predicted to be of class $k$ that actually belong to class $k$. Referring to the confusion matrix $C$: $$P=\frac{c_{k,k}}{\sum_{i=0}^{K-1}c_{i,k}}$$ |
Recall (of a classifier) | The fraction of samples actually belonging to class $k$ that are correctly predicted as class $k$. Referring to the confusion matrix $C$: $$R=\frac{c_{k,k}}{\sum_{j=0}^{K-1}c_{k,j}}$$ |
$F_1$ score | $F_1$ is the harmonic mean of recall and precision: $$F_{1}=\frac{2}{\frac{1}{R}+\frac{1}{P}}$$ |
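
A minimal sketch of the 80/20 train/test split mentioned above, assuming scikit-learn and NumPy are available; the data here is synthetic and only illustrates the shapes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 3)  # 100 examples, 3 features each (synthetic)
y = np.random.rand(100)     # one label per example (synthetic)

# 80% training set, 20% test set, with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)
```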
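A sketch of online learning, assuming scikit-learn's `SGDRegressor`: its `partial_fit` method lets the model learn incrementally from mini-batches, which simulates data arriving as a flow. The batch sizes and the linear target are made up:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# eta0 is the (constant) learning rate: how quickly the model adapts
model = SGDRegressor(learning_rate="constant", eta0=0.01)

rng = np.random.default_rng(42)
for _ in range(100):                     # simulate data arriving as a flow
    X_batch = rng.random((10, 3))        # a mini-batch of 10 examples
    y_batch = X_batch @ np.array([1.0, 2.0, 3.0])  # synthetic linear target
    model.partial_fit(X_batch, y_batch)  # incremental update, no retraining
```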
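A sketch of cross-validation with scikit-learn's `cross_val_score`, again on synthetic data: each candidate model is trained on 4 of the 5 folds and evaluated on the remaining one, and the scores are averaged; the model with the best average score would then be selected:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.random((100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + 0.1 * rng.standard_normal(100)

# 5-fold cross-validation: train on 4 folds, validate on the 5th, repeat
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring="neg_root_mean_squared_error")
print(-scores.mean())  # average RMSE over the 5 validation folds
```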
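The RMSE and MAE formulas translated directly into NumPy; `y_pred` plays the role of $h\left(\boldsymbol{x}^{\left(i\right)}\right)$ and `y_true` of $y^{\left(i\right)}$, with made-up values:

```python
import numpy as np

def rmse(y_true, y_pred):
    # square root of the mean of the squared prediction errors
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def mae(y_true, y_pred):
    # mean of the absolute prediction errors
    return np.mean(np.abs(y_pred - y_true))

y_true = np.array([3.0, -0.5, 2.0, 7.0])  # made-up labels
y_pred = np.array([2.5,  0.0, 2.0, 8.0])  # made-up predictions
print(rmse(y_true, y_pred))  # ~0.612
print(mae(y_true, y_pred))   # 0.5
```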
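Finally, accuracy, precision, recall, and the $F_1$ score computed from a confusion matrix exactly as in the formulas above; the matrix values are made up for illustration:

```python
import numpy as np

# Made-up 3-class confusion matrix: rows = actual class, columns = predicted
C = np.array([[50,  2,  3],
              [ 4, 40,  6],
              [ 1,  5, 45]])

accuracy = np.trace(C) / C.sum()     # correct predictions over all samples

k = 1                                # class of interest
precision = C[k, k] / C[:, k].sum()  # column sum: everything predicted as k
recall    = C[k, k] / C[k, :].sum()  # row sum: everything actually of class k
f1 = 2 / (1 / recall + 1 / precision)  # harmonic mean of recall and precision

print(accuracy, precision, recall, f1)
```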