Machine Learning and Data Mining
2 May 2022
Probability vs Statistics
Sampling is a way to draw conclusions about the population without measuring the whole population. The characteristics of an ideal sampling model include
In statistics, estimation refers to the process by which one makes inferences about a population.
The arithmetic mean is $m = \frac{1}{n}\sum_{i=1}^n x_i$, where the observations are $x_1, x_2, ..., x_n$. If $x_i$ occurs $f_i$ times, and we define relative frequencies as $p_i = \frac{f_i}{n}$, then the mean is $m = \sum_i x_i p_i$. Therefore the expected value of a discrete random variable $X$ is:
$$E(X) = \sum_i x_i P(X=x_i)$$
Variance measures how far a set of random numbers are spread out from their average value. Standard deviation (square root of variance) is estimated as:
$$s = \sqrt{\frac{1}{n-1} \sum_i (x_i - m)^2}$$
We can write this in terms of expected values $E(X)$ as
$$Var(X) = E((X - E(X))^2) = E(X^2) - [E(X)]^2$$
Variance is the "mean of the squares minus the square of the mean"
Covariance is a measure of relationship between two random variables:
$$cov(x,y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{n-1} = \frac{(\sum_i x_i y_i) - n \bar{x} \bar{y}}{n-1}$$
Correlation is a measure to show how strongly a pair of random variables are related:
$$r = \frac{cov(x,y)}{\sqrt{var(x)}\sqrt{var(y)}}$$
This is also called Pearson's correlation coefficient.
Bias is the difference between the average prediction of our model and the correct value.
Variance is the variability of the model's prediction for a given data point, i.e. how much the predictions spread out across different training sets.
When we assume $y = f + \epsilon$ and we estimate $f$ with $\hat{f}$, the expected squared error decomposes as:
$$E[(y - \hat{f})^2] = (f - E[\hat{f}])^2 + Var(\hat{f}) + Var(\epsilon)$$
So, the mean squared error (MSE) can be written as
$$MSE = Bias^2 + Variance + \text{Irreducible Error}$$
When comparing unbiased estimators, we would like to select the one with minimum variance
Regression can be used to predict numeric values from numeric attributes. In linear models, the outcome is a linear combination of attributes:
$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + ... + \theta_n x_n = \sum_{i=0}^n \theta_i x_i = h(x)$$
where $x_0 = 1$ by convention.
There is:
The most popular estimation model is "Least Squares" for fitting a linear regression. Error is defined as difference between the predicted and actual value, i.e.
$$J(\theta) = \sum_{j=1}^m \left(y_j - \sum_{i=0}^n \theta_i x_{ji} \right)^2 = \sum_{j=1}^m \left(y_j - x_j^T \theta \right)^2 = (y - X \theta)^T (y - X \theta)$$
We want to minimize the error over all samples.
The goal of Linear Regression is to minimise $J(\theta)$. If we compute $J(\theta)$ for different values of $\theta$, we get a convex function with one global minimum. To find the value of $\theta$ that provides this minimum, we can use gradient descent.
Gradient descent starts with some initial $\theta$, and repeatedly performs an update:
$$\theta_i^{(t+1)} := \theta_i^{(t)} - \alpha \frac{\partial}{\partial \theta_i} J \left( \theta^{(t)} \right)$$
where $\alpha$ is the learning rate / step size, and each iteration it takes a step in the direction of the steepest decrease in $J(\theta)$.
For a single training example $(x_j, y_j)$, the partial derivative term of $J(\theta)$ can be determined as
$$\frac{\partial}{\partial \theta_i} J(\theta) = -2 \left( y_j - h_\theta (x_j) \right) x_{ji}$$
So, for a single training sample, the update Least Mean Squares (LMS) rule is:
$$\theta_i := \theta_i + 2 \alpha \left( y_j - h_\theta (x_j) \right) x_{ji}$$
To use all training samples, there are two methods: batch gradient descent, which sums the updates over every sample before changing $\theta$, and stochastic gradient descent.
In stochastic gradient descent, $\theta$ is updated after each sample separately, so each step is faster to compute, but the algorithm may never converge exactly to the minimum.
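As a concrete illustration of the update rules above, here is a minimal numpy sketch of stochastic gradient descent with the LMS rule (the function name `lms_sgd`, the learning rate, and the epoch count are illustrative choices):

```python
import numpy as np

def lms_sgd(X, y, alpha=0.01, epochs=100):
    """Stochastic gradient descent with the LMS update rule.
    X: (m, n) design matrix whose first column is all ones (x_0 = 1).
    y: (m,) target values.  Returns the learned parameter vector theta."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        for j in range(m):                      # update after every sample
            error = y[j] - X[j] @ theta         # y_j - h_theta(x_j)
            theta += 2 * alpha * error * X[j]   # LMS rule from above
    return theta
```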
Gradient descent is an iterative algorithm; however, you can also find the minimum of $J(\theta)$ directly by setting its derivative to 0. This is also called the exact or closed-form solution.
$$ \frac{\partial}{\partial \theta} J(\theta) = 0 \\ J(\theta) = ( \mathbb{y} - X \theta )^T (\mathbb{y} - X \theta) \\ \frac{\partial}{\partial \theta} J(\theta) = - 2X^T (\mathbb{y} - X \theta) = 0 \\ X^T (\mathbb{y} - X \theta) = 0 \\ \theta = (X^TX)^{-1}X^T \mathbb{y} $$
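The closed-form solution maps directly onto a few lines of numpy; a minimal sketch (using `np.linalg.solve` rather than an explicit inverse, which is the numerically safer way to evaluate $(X^TX)^{-1}X^Ty$):

```python
import numpy as np

def least_squares_closed_form(X, y):
    """Exact least-squares solution theta = (X^T X)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# tiny usage example with x_0 = 1 prepended to each row
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
theta = least_squares_closed_form(X, y)   # approximately [1, 1]
```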
We can write the relationship between input variable $x$ and output variable $y$ as:
$$y_j = x_j^T \theta + \epsilon_j$$
And $\epsilon_j$ is an error term which might be unmodeled effect or random noise. Assume $\epsilon_j$'s are independent and identically distributed (i.i.d.) according to a Gaussian distribution, i.e.:
$$ \epsilon_j \sim N \left( 0, \sigma^2 \right) \\ p(\epsilon_j) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( - \frac{\epsilon_j^2}{2 \sigma^2} \right) $$
This implies that:
$$p(y_j | x_j; \theta) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( - \frac{\left( y_j - x_j^T \theta \right)^2}{2 \sigma^2} \right)$$
So we want to estimate $\theta$ s.t. we maximise the probability of output $y$ given input $x$ over all $m$ training samples:
$$ \begin{align*} \mathcal{L}(\theta) &= p(\mathbb{y} | X, \theta) \\ &= \prod_{j=1}^m \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( - \frac{\left( y_j - x_j^T \theta \right)^2}{2 \sigma^2} \right) \end{align*} $$
Note this is called the Likelihood function.
However, to find the $\theta$ that maximises $\mathcal{L}(\theta)$, we can equivalently maximise any strictly increasing function of $\mathcal{L}(\theta)$. In this case, we maximise the log-likelihood $\mathcal{l}(\theta)$:
$$ \begin{align*} \mathcal{l}(\theta) &= \log \mathcal{L}(\theta) \\ &= \log \prod_{j=1}^m \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( - \frac{\left( y_j - x_j^T \theta \right)^2}{2 \sigma^2} \right) \\ &= m \log \left( \frac{1}{\sqrt{2 \pi \sigma^2}} \right) - \frac{1}{\sigma^2} \frac{1}{2} \sum_{j=1}^m \left( y_j - x_j^T \theta \right)^2 \end{align*} $$
So maximising $\mathcal{l}(\theta)$ is equivalent to minimising $\sum_{j=1}^m \left( y_j - x_j^T \theta \right)^2$. This means that, under these assumptions, least-squares regression is equivalent to finding the maximum likelihood estimate of $\theta$.
In univariate regression we aim to find the relationship between $y$ and one independent variable $x$.
As an example, if we want to investigate the relationship between height and weight, we could collect $m$ measurements
$$(h_j, w_j), \quad j = 1, ..., m$$
Univariate linear regression assumes a linear relation, $\hat{w} = \theta_0 + \theta_1 h$.
Here, we model the relationship of $y$ to several other variables.
As an example, say we now want to predict people's weight from their height and body frame size (i.e. wrist size). We collect $m$ height, weight and body frame measurements:
$$(h_j, f_j, w_j), \quad j = 1, ..., m$$
The linear regression model is now:
$$\hat{w} = \theta_0 + \theta_1 h + \theta_2 f$$
We can also apply some transformations to produce curves with linear regression, e.g.
To model:
$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2$$
we can treat $x_2 = x_1^2$ and use the model
$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$
These models can still be treated like linear regression and can fit curvature, because they are still linear in the parameters. Nonlinear regression is not linear in the parameters, e.g. $\hat{y} = \frac{\theta_1 x}{\theta_2 + x}$
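A small sketch of this feature-expansion idea for a quadratic model in one input; because the model stays linear in $\theta$, ordinary least squares recovers the coefficients (the helper name and toy data are illustrative):

```python
import numpy as np

def quadratic_features(x1):
    """Expand a single input x1 into [1, x1, x1^2] so a curve can be
    fitted with ordinary linear regression (still linear in theta)."""
    x1 = np.asarray(x1, dtype=float)
    return np.column_stack([np.ones_like(x1), x1, x1 ** 2])

x1 = np.array([0.0, 1.0, 2.0, 3.0])
y = 1 + 2 * x1 + 0.5 * x1 ** 2                 # true curve
X = quadratic_features(x1)
theta = np.linalg.lstsq(X, y, rcond=None)[0]   # recovers [1, 2, 0.5]
```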
However, increasing the degree of the model can result in overfitting.
Regularisation is a method to avoid overfitting by adding constraints to the weight vector. One approach is to keep the weights small in magnitude: this is called shrinkage.
The cost function, given data $(x_1, y_1), ..., (x_m, y_m)$ is
$$J(\theta) = \sum_j (y_j - h_\theta (x_j))^2 + \lambda \sum_i \theta_i^2$$
The multiple least-squares regression problem is an optimisation problem, and can be written as:
$$\theta^* = \arg \min_\theta (y - X \theta)^T (y - X\theta)$$
The regularised version of this is then as follows:
$$\theta^* = \arg \min_\theta (y - X \theta)^T (y - X \theta) + \lambda || \theta ||^2$$
Where $||\theta||^2 = \sum_i \theta_i^2$ is the squared norm of the vector $\theta$, or equivalently, the dot product $\theta^T\theta$; $\lambda$ is a scalar determining the amount of regularisation.
This still has a closed form solution
$$\theta = (X^TX + \lambda I)^{-1} X^T y$$
where $I$ is the identity matrix. Regularisation amounts to adding $\lambda$ to the diagonal of $X^TX$, which improves the numerical stability of the matrix inversion. This form of least-squares regression is known as ridge regression.
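A minimal sketch of the ridge closed form (illustrative only; in practice the bias term $\theta_0$ is often excluded from the penalty):

```python
import numpy as np

def ridge_regression(X, y, lam=1.0):
    """Closed-form ridge solution theta = (X^T X + lambda I)^{-1} X^T y."""
    n = X.shape[1]
    # note: this penalises every weight, including the bias column if present
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
```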
An alternative form of regularised regression is provided by the lasso. It replaces the term $\sum_i \theta_i^2$ with $\sum_i |\theta_i |$. The result is that some weights are shrunk but others are set to 0; hence the lasso favours sparse solutions.
Suppose there are a lot of variables / features $(x)$. Taking all the features will lead to an overly complex model. There are 3 ways to reduce complexity.
Class labels do not have the convenient mathematical properties of numeric values, so in classification we train a classifier, which is usually a function that maps an input data point to one of a set of discrete outputs (i.e. the classes).
For a linear classifier of 2 features and 2 classes
If $p$ and $n$ are the respective centres of mass of the positive and negative points, the basic linear classifier is described by the equation $x^Tw = t$ with $w = p - n$.
As we know, $\frac{p+n}{2}$ is on the decision boundary, so we have:
$$t = \left( \frac{p+n}{2} \right)^T \cdot (p-n) = \frac{||p||^2 - ||n||^2}{2}$$
where $||x||$ denotes the length of vector $x$.
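A short sketch of this basic (centroid-based) linear classifier, assuming the positive and negative training points are given as numpy arrays with one point per row:

```python
import numpy as np

def basic_linear_classifier(pos, neg):
    """Centroid classifier: w = p - n, t = (||p||^2 - ||n||^2) / 2."""
    p, n = pos.mean(axis=0), neg.mean(axis=0)   # centres of mass
    w = p - n
    t = (p @ p - n @ n) / 2
    return w, t

def predict(x, w, t):
    return 1 if x @ w > t else -1
```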
Generalisation means how well a trained model can classify or forecast unseen data. There are 3 basic assumptions for generalisation
Cross-validation (CV) is a technique to assess how the results of a model generalise to an independent data set. A portion of the dataset is used to train the model, and another portion is used to test it. Types of CV include:
There are certain parameters (hyperparameters) that need to be chosen during learning. We use data for this, but NOT the training set OR the test set; instead we use a separate validation or development set.
Validation set: To make the hyperparameter tuning and model selection independent from the test set, we define another set within the train set.
If we have a binary classification, we can have a contingency table, AKA confusion matrix:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
Classification Accuracy on a sample of labelled pairs $(x,c(x))$ given a learned classification model that predicts, for each instance $x$, a class value $\hat{c}(x)$:
$$acc = \frac{1}{|Test|} \sum_{x \in Test} I[ \hat{c}(x) = c(x) ]$$
where $Test$ is a test set, and $I[]$ is the indicator function which is 1 iff its argument evaluates to true and 0 otherwise. The classification Error $= 1 - acc$.
Other evaluation metrics:
Precision / correctness is the number of objects correctly classified as positive divided by the total number of objects classified as positive
$$Precision = \frac{TP}{TP + FP}$$
Recall / sensitivity / completeness / true positive rate (TPR) is the number of relevant objects classified correctly divided by total number of relevant / correct objects
$$Recall = \frac{TP}{TP + FN}$$
F1 score is the harmonic mean of precision and recall and is defined as:
$$F_1 = \frac{2 \times precision \times recall}{precision + recall}$$
This measure gives equal importance to precision and recall, which is sometimes undesirable; so we have to decide which metric to use depending on the task and what is important for that task.
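A small helper showing how these metrics are computed from confusion-matrix counts (the example counts are made up for illustration):

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# e.g. 40 TP, 10 FP, 5 FN, 45 TN  ->  acc 0.85, precision 0.80, recall ~0.89
print(classification_metrics(40, 10, 5, 45))
```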
AUC-ROC (Area Under the Curve - Receiver Operating Characteristics) is an important curve for performance of classification models. It evaluates the model at different threshold settings and can inform the capability of the model in distinguishing between classes.
Oftentimes data may be incomplete or missing labels. To handle this, there are several strategies:
Nearest Neighbour is a regression or classification algorithm that predicts whatever is the output value of the nearest data point to some query.
To find the nearest data point, we have to find the distance between the query and other points. So we have to decide how to define the distance.
If $\mathcal{X} = \mathbb{R}^d$ and $x, y \in \mathcal{X}$, the Minkowski distance of order $p > 0$ is defined as:
$$Dist_p(x,y) = \left( \sum_{j=1}^d |x_j - y_j|^p \right)^{\frac{1}{p}} = ||x-y||_p$$
Where $||z||_p = \left( \sum_{j=1}^d |z_j|^p \right)^{\frac{1}{p}}$ is the p-norm (sometimes denoted $L_p$ norm) of the vector $z$.
If we let $p$ grow larger, the distance will be more dominated by the largest coordinate-wise distance, from which we can infer that $Dist_\infty = max_j | x_j - y_j|$; this is also called Chebyshev distance.
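A minimal sketch of the Minkowski distance, including the $p = \infty$ (Chebyshev) limit:

```python
import numpy as np

def minkowski_distance(x, y, p=2):
    """Minkowski distance of order p: p=1 Manhattan, p=2 Euclidean,
    p=np.inf gives the Chebyshev distance max_j |x_j - y_j|."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    if np.isinf(p):
        return diff.max()
    return (diff ** p).sum() ** (1.0 / p)

minkowski_distance([0, 0], [3, 4], p=2)   # 5.0 (Euclidean)
minkowski_distance([0, 0], [3, 4], p=1)   # 7.0 (Manhattan)
```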
If the data is not in $\mathbb{R}^d$, but we can turn it into Boolean features / character sequences, we can still apply distance measures, e.g.
Data type | Distance Measure |
---|---|
Equal Length Binary String | Hamming Distance |
Unequal Length Non-Binary String | Levenshtein Distance |
Given an instance space $\mathcal{X}$, a distance metric is a function $Dist : \mathcal{X} \times \mathcal{X} \rightarrow [0, \infty)$ such that for any $x,y,z \in \mathcal{X}$
If the second condition is weakened to allow $Dist(x,y) = 0$ for $x \neq y$, the function $Dist$ is called a pseudo-metric.
The arithmetic mean $\mu$ of a set of data points $D$ in a Euclidean space is the unique point that minimises the sum of squared Euclidean distances to those data points.
This is a classifier based on minimum distance principle, where the class exemplars are just the centroids (or means).
Pros:
Cons:
This is related to the simplest form of learning
Nearest neighbour:
$k$-Nearest neighbour:
Pros:
Cons:
Note that the data needs to be normalized because different attributes may be measured on different scales, i.e. $$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$
where $x$ is the actual value of attribute / feature, and $x'$ is the normalized value.
Nearest Neighbour should be considered when
We might want to use the distance function to construct a weight $w_i$. The final line of the classification algorithm can be changed to be
$$\hat{f}(x_q) \leftarrow \arg \max_{v \in V} \sum_{i=1}^k w_i \delta(v, f(x_i))$$
where
$$w_i = \frac{1}{Dist(x_q, x_i)^2}$$
For real-valued target functions, the final line can be changed to be
$$\hat{f}(x_q) \leftarrow \frac{\sum_{i=1}^k w_i f(x_i)}{\sum_{i=1}^k w_i}$$
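A compact sketch of distance-weighted $k$-NN covering both the voting and the real-valued (regression) variants above; the small constant added to the squared distance is an illustrative guard against division by zero when the query coincides with a training point:

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x_query, k=3, regression=False):
    """Distance-weighted k-NN with w_i = 1 / Dist(x_q, x_i)^2 (Euclidean).
    X_train, y_train are numpy arrays; returns a weighted vote or average."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] ** 2 + 1e-12)
    if regression:
        return np.sum(weights * y_train[nearest]) / np.sum(weights)
    votes = {}
    for w, label in zip(weights, y_train[nearest]):
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)
```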
Lazy learners like this do not construct a model, i.e.
The solution is to use leave-one-out cross-validation (LOOCV), i.e. leave out one example and predict it given the rest.
As dimension increases, the effectiveness of distance metrics decrease, and the concept of proximity may not be qualitatively meaningful as all points look equidistant.
Some other problems include:
One approach to overcome the curse of dimensionality:
Inductive Bias is the combination of assumptions and restrictions placed on the models and algorithms used to solve a learning problem. It means the algorithm and model combination you are using to solve the learning problem is appropriate for the task.
Provides practical learning algorithms:
Provides useful conceptual framework
Bayes theorem is stated as
$$P(h|D) = \frac{P(D|h)P(h)}{P(D)}$$
where
If the output belongs to a set of $k$ classes, $y \in \{ C_1, C_2, ..., C_k \}$, then in the Bayesian framework, for $1 \leq i \leq k$:
$$P(y = C_i | x) = \frac{P(x|C_i)P(C_i)}{P(x)}$$
where $P(x) = \sum_i P(x|C_i)P(C_i)$
The decision rule is to select a class which maximises the posterior probability for the prediction.
Generally we want the most probable hypothesis given the training data, Maximum a posteriori hypothesis $h_{MAP}$:
$$ \begin{align*} h_{MAP} &= \arg \max_{h \in H} P(h|D) \\ &= \arg \max_{h \in H} \frac{P(D|h)P(h)}{P(D)} \\ &= \arg \max_{h \in H} P(D|h)P(h) \end{align*} $$
To get the posterior probability of a hypothesis $h$, divide by the probability of the data to normalise the result for $h$:
$$P(h|D) = \frac{P(D|h) P(h)}{\sum_{h_i \in H}P(D|h_i)P(h_i)}$$
The denominator ensures we obtain posterior probabilities that sum to 1; we sum over all possible numerator values, since the hypotheses are mutually exclusive.
Theorem of total probability: if events $A_1, ..., A_n$ are mutually exclusive with $\sum_{i=1}^n P(A_i) = 1$, then:
$$P(B) = \sum_{i=1}^n P(B|A_i)P(A_i)$$
So far, we decide $h_1$ if $P(h_1|D) > P(h_2|D)$ else $h_2$. Alternatively, we can use a loss function $L(h)$, where $L(h)$ is the loss that occurs when decision $h$ is made.
For example, if the cost of misclassifying a patient who has cancer as "not cancer" is 10 times more than classifying a patient who doesn't have cancer as "cancer", how will that affect our decision?
If the cost of misclassification is not the same for different classes, then instead of maximizing a posteriori, we have to minimize the expected loss:
An optimal Bayesian decision strategy is to minimize the expected loss.
Consider any real-valued target function $f$, with training examples $\langle x_i, y_i \rangle$ where $y_i$ is a noisy training value.
Then the maximum likelihood hypothesis $h_{ML}$ is the one that minimizes the sum of squared errors
$$ \begin{align*} h_{ML} &= \arg \max_{h \in H} P(D|h) \\ &= \arg \max_{h \in H} \prod_{i=1}^m P(y_i|\hat{f}) \\ &= \arg \max_{h \in H} \prod_{i=1}^m \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( -\frac{1}{2} \left(\frac{y_i - \hat{f}(x_i)}{\sigma} \right)^2 \right) \end{align*} $$
where $\hat{f} = h_{ML}$.
We can maximise the natural log to give a simpler expression:
$$ \begin{align*} h_{ML} &= \arg \max_{h \in H} \sum_{i=1}^m \ln \frac{1}{\sqrt{2 \pi \sigma^2}} - \frac{1}{2} \left( \frac{y_i - \hat{f}(x_i)}{\sigma} \right)^2 \\ &= \arg \max_{h \in H} \sum_{i=1}^m - \frac{1}{2} \left( \frac{y_i - \hat{f}(x_i)}{\sigma} \right)^2 \\ &= \arg \max_{h \in H} \sum_{i=1}^m - \left(y_i - \hat{f}(x_i)\right)^2 \\ \end{align*} $$
Equivalently, we can minimise the positive version of the expression:
$$h_{ML} = \arg \min_{h \in H} \sum_{i=1}^m \left( y_i - \hat{f}(x_i) \right)^2$$
Discriminative models model the posterior probability distribution $P(y|x)$. That is, given $x$ they return a probability distribution over $y$.
Generative models model the joint distribution $P(y,x)$. Once we have the joint distribution, we can derive any conditional or marginal distribution involving the same variables.
Such models are called 'generative' because we can sample from the joint distribution to obtain new data points together with their labels.
In Bayesian optimal classification, the most probable classification is obtained by combining the predictions of ALL hypotheses, weighted by their posterior probabilities:
$$\arg \max_{v_j \in V} \sum_{h_i \in H} P(v_j|h_i)P(h_i|D)$$
where $v_j$ is a class value.
Pros:
Cons:
Bayes rule depends on unknown quantities so we need to use the data to find some approximation of those quantities
The Bayes optimal classifier is very inefficient. An alternative is the Gibbs algorithm.
Assuming target concepts are drawn at random from $H$ $$E[error_{Gibbs}] \leq 2 \times E[error_{BayesOptimal}]$$
When to use
Successful applications:
Assume target function $f : X \rightarrow V$, where each instance $x$ described by attributes $(x_1, x_2, ..., x_n)$.
The most probable value of $f(x)$ is:
$$v_{MAP} = \arg \max_{v_j \in V} P(x_1, x_2, ..., x_n|v_j)P(v_j)$$
Naive Bayes assumes attributes are statistically independent i.e.
$$P(x_1, x_2, ..., x_n|v_j) = \prod_i P(x_i|v_j)$$
Which gives Naive Bayes classifier
$$v_{NB} = \arg \max_{v_j \in V} P(v_j) \prod_i P(x_i|v_j)$$
Pseudocode:
$\text{for each target value }v_j:$
If none of the training instances with target value $v_j$ have attribute $x_i$, then:
$$ \hat{P}(x_i|v_j) = 0 \\ \therefore \hat{P}(v_j) \prod_i \hat{P}(x_i|v_j) = 0 $$
Pseudo-counts add 1 to each count (a version of the Laplace Estimator).
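A minimal sketch of a categorical Naive Bayes learner with add-one (Laplace) pseudo-counts; the closure-returning structure and names are illustrative choices:

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(X, y):
    """X: list of feature tuples, y: list of class labels.
    Returns a predict(xs) function implementing v_NB with add-one smoothing."""
    m, n = len(y), len(X[0])
    class_counts = Counter(y)
    value_counts = defaultdict(Counter)        # (class, feature index) -> value counts
    feature_values = [set() for _ in range(n)]
    for xs, label in zip(X, y):
        for i, v in enumerate(xs):
            value_counts[(label, i)][v] += 1
            feature_values[i].add(v)

    def predict(xs):
        def log_posterior(label):
            score = math.log(class_counts[label] / m)          # log P(v_j)
            for i, v in enumerate(xs):                         # + sum_i log P(x_i | v_j)
                num = value_counts[(label, i)][v] + 1          # add-one pseudo-count
                den = class_counts[label] + len(feature_values[i])
                score += math.log(num / den)
            return score
        return max(class_counts, key=log_posterior)

    return predict
```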
Usual assumption: attributes have a normal or Gaussian probability distribution (given the class) $$x|v_j \sim N(\mu,\sigma^2)$$
the probability density function for the normal distribution is defined by two parameters
This gives the density function $f(x)$:
$$f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( -\frac{(x - \mu)^2}{2 \sigma^2} \right)$$
Categorical variables appear often in ML, i.e. text classification
Pros:
Cons:
Logistic regression can be used for binary classification, where we transform the $y$ values into probability values (in the range $[0,1]$).
We can model this with a sigmoid curve.
Now $f(x)$ can take any value between $-\infty$ and $+\infty$, and in Logistic Regression we estimate $f(x)$ with a line.
$$\log \frac{P(y=1|x)}{1-P(y=1|x)} = \hat{f}(x) = x^T\beta$$
Solving for the probability, logistic regression therefore estimates
$$\hat{P}(y=1|x) = \frac{1}{1+e^{-x^T\beta}}$$
If $P(y=1|x) \geq 0.5$ (same as saying $x^T\beta \geq 0$) then predict as class 1
If $P(y=1|x) < 0.5$ (same as saying $x^T\beta < 0$) then predict as class 0
This is equivalent to having a linear decision boundary separating the two classes (the decision boundary is linear in $x$, which is what makes Logistic Regression a linear model)
We cannot use the cost function from Linear Regression because it would result in a non-convex function with many local minima. Instead, the following cost function is used:
Let's define $\hat{P}(y=1|x) = h_\beta(x)$
$$ cost \left( h_\beta(x),y \right) = \begin{cases} -\log \left( h_\beta(x) \right) & \text{if $y = 1$} \\ -\log \left( 1-h_\beta(x) \right) & \text{if $y = 0$} \end{cases} $$
$$J(\beta) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log \left( h_\beta \left( x^{(i)} \right) \right) + \left( 1 - y^{(i)} \right) \log \left( 1 - h_\beta \left( x^{(i)} \right) \right) \right]$$
The values of the parameters that minimise $J(\beta)$ can be found using gradient descent.
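A minimal sketch of logistic regression trained with batch gradient descent on the cross-entropy cost $J(\beta)$ (learning rate and iteration count are illustrative):

```python
import numpy as np

def train_logistic_regression(X, y, alpha=0.1, epochs=1000):
    """X: (m, n) with a leading column of ones, y: (m,) labels in {0, 1}."""
    m, n = X.shape
    beta = np.zeros(n)
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ beta))    # h_beta(x) for every sample
        beta -= alpha * (X.T @ (p - y)) / m    # gradient of J(beta)
    return beta

def predict(X, beta):
    return (X @ beta >= 0).astype(int)         # P(y=1|x) >= 0.5  ->  class 1
```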
Pros:
Cons:
Probably the single most popular tool
Decision tree representation:
Decision Trees can be used to represent Boolean functions like AND ($\land$), OR ($\lor$), XOR ($\oplus$)
$$X \land Y$$
if X == True:
if Y == True: return True
if Y == False: return False
if X == False:
return False
$$X \lor Y$$
if X == True: return True
if X == False:
if Y == True: return True
if Y == False: return False
$$X \oplus Y$$
if X == True:
if Y == True: return False
if Y == False: return True
if X == False:
if Y == True: return True
if Y == False: return False
In general, decisions trees represent a disjunction of conjunctions
When to use decision trees:
The main loop for top-down induction of decision trees (TDIDT)
If we want to determine yes or no, is outlook or wind a better attribute?
def use_outlook(outlook: str) -> dict:
    match outlook:
        case "sunny": return {"yes": 2, "no": 3}
        case "overcast": return {"yes": 4, "no": 0}
        case "rain": return {"yes": 3, "no": 2}

def use_wind(wind: str) -> dict:
    match wind:
        case "weak": return {"yes": 6, "no": 2}
        case "strong": return {"yes": 3, "no": 3}
Entropy
Entropy measures the "impurity" of $S$.
$$Entropy(S) = H(S) = -p_\oplus \log_2 p_\oplus - p_\ominus \log_2 p_\ominus$$
Interpretation: if item $x$ belongs to $S$, how many bits are needed to tell if $x$ is positive or negative?
$Gain(S,A)$ is the expected reduction in entropy due to sorting on $A$
$$Gain(S,A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}Entropy(S_v)$$
We want to find the attribute which maximizes the gain (this is also called the "mutual information" between attribute $A$ and the class labels of $S$).
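A small sketch that reproduces the gain comparison for the outlook/wind counts above (gain is roughly 0.247 for outlook versus roughly 0.048 for wind, so outlook is the better attribute):

```python
import math

def entropy(pos, neg):
    """Entropy of a set with pos positive and neg negative examples."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def information_gain(parent, splits):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v).
    parent and each split are (pos, neg) counts."""
    total = sum(parent)
    return entropy(*parent) - sum(
        (p + n) / total * entropy(p, n) for p, n in splits)

print(information_gain((9, 5), [(2, 3), (4, 0), (3, 2)]))  # outlook ~ 0.247
print(information_gain((9, 5), [(6, 2), (3, 3)]))          # wind    ~ 0.048
```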
To select the best attribute at each branch:
Limitations
Gain Ratio
$$SplitEntropy(S,A) = - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$$ Where:
$$GainRatio(S,A) = \frac{Gain(S,A)}{SplitEntropy(S,A)}$$
Consider the error of hypothesis $h$ over the training data, $error_{train}(h)$, and over the entire distribution $D$ of data, $error_D(h)$.
Definition Hypothesis $h \in H$ overfits training data if there is an alternative hypothesis $h' \in H$ such that
$$error_{train}(h) < error_{train}(h') \land error_D(h) > error_D(h')$$
To avoid overfitting in decision trees, we can use pruning
Pros:
Cons:
Task | Label and Output Space | Learning Problem |
---|---|---|
Classification | $\mathcal{L} = \mathcal{C}$, $\mathcal{Y} = \mathcal{C}$ | Learn an approximation $\hat{c} : \mathcal{X} \rightarrow \mathcal{C}$ to the true labelling function $c$ |
Scoring and ranking | $\mathcal{L} = \mathcal{C}$, $\mathcal{Y} = \mathbb{R}^{|\mathcal{C}|}$ | Learn a model that outputs a score vector over classes |
Probability estimation | $\mathcal{L} = \mathcal{C}$, $\mathcal{Y} = [0,1]^{|\mathcal{C}|}$ | Learn a model that outputs a probability vector over classes |
Regression | $\mathcal{L} = \mathbb{R}$, $\mathcal{Y} = \mathbb{R}$ | Learn an approximation $\hat{f} : \mathcal{X} \rightarrow \mathbb{R}$ to the true labelling function $f$ |
Perceptron is an algorithm for binary classification that uses a linear prediction function. Note the perceptron predicts a binary class label, but linear regression predicts a real value.
For a general case with $n$ attributes
$$ f(x) = \begin{cases} +1 & \text{if $w_0 + w_1x_1 + ... + w_nx_n > 0$} \\ -1 & \text{otherwise} \end{cases} $$
If we add $x_0 = 1$ to the feature vector:
$$ f(x) = \begin{cases} +1 & \text{if $\sum_{i=0}^n w_ix_i > 0$} \\ -1 & \text{otherwise} \end{cases} $$
$$\hat{y} = f(x) = sign(w \cdot x)$$
where $sign$ is the sign function. We now need to find a good set of weights using our training set.
The perceptron algorithm initializes all weights $w_i$ to zero, and learns the weights using the following update rule:
$$w := w + \frac{1}{2} \left( y_j - f(x_j) \right)x_j$$
There are 4 cases:
Algorithm Perceptron($D$) / perceptron training for linear classification
Input labelled training data $D$ in homogeneous coordinates
Output weight vector $w$
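A minimal sketch of the perceptron training loop in homogeneous coordinates, with labels in $\{-1, +1\}$ and the mistake-driven update described above (the epoch limit is an illustrative choice):

```python
import numpy as np

def train_perceptron(X, y, epochs=100):
    """X: (m, n) in homogeneous coordinates (first column all ones),
    y: (m,) labels in {-1, +1}.  Weights start at zero, updated on mistakes."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        converged = True
        for xj, yj in zip(X, y):
            if np.sign(w @ xj) != yj:   # misclassified (sign(0) also counts)
                w += yj * xj            # w := w + (1/2)(y_j - f(x_j)) x_j
                converged = False
        if converged:
            break
    return w
```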
Linear classification cannot model nonlinear class boundaries. However, we can map the attributes into a new space consisting of combinations of attribute values.
E.g. for 2 attributes
$$y = w_1x_1^3 + w_2x_1^2x_2 + w_3x_1x_2^2 + w_4x_2^3$$
$y$ is predicted output for instances with two attributes $x_1$ and $x_2$.
However, there are issues with this approach
Given an optimisation problem, we can construct another optimisation problem, called the dual problem, which is related to our original problem (the primal problem).
After training a perceptron, each example has been misclassified zero or more times. Denoting this number as $\alpha_i$ for example $x_i$, the weight vector for $m$ observations can be expressed as
$$w = \sum_{i=1}^m \alpha_iy_ix_i$$
In the dual instance-based view, we are learning instance weights $\alpha_i$ rather than feature weights $w_j$. An instance $x$ is classified as
$$\hat{y} = f(x) = sign(w \cdot x)$$ $$\hat{y} = sign \left( \sum_{i=1}^m \alpha_iy_i(x_i \cdot x) \right)$$
Algorithm Dual-Perceptron($D$) / perceptron training for linear classification in dual form
Input labelled training data $D$ in homogeneous coordinates
Output coefficients $\alpha_i$ defining weight vector $W = \sum_{i=1}^{|D|} \alpha_i y_i x_i$
Let $x = (x_1, x_2)$ and $x' = (x_1', x_2')$ be two data points, and consider the following mapping to a three-dimensional feature space:
$$(x_1, x_2) \rightarrow (x_1^2, x_2^2, \sqrt{2}x_1x_2)$$
$$\text{(original feature space) } \mathcal{X} \rightarrow \mathcal{Z} \text{ (new feature space)}$$
The points in feature space corresponding to $x$ and $x'$ are
$z = (x_1^2, x_2^2, \sqrt{2}x_1x_2)$ and $z' = (x_1'^2, x_2'^2, \sqrt{2}x_1'x_2')$
The dot product of these two feature vectors is
$$z \cdot z' = x_1^2x_1'^2 + x_2^2x_2'^2 + 2x_1x_1'x_2x_2' = (x_1x_1' + x_2x_2')^2 = (x \cdot x')^2$$
By squaring the dot product in the original space, we obtain the dot product in the new space. A function that directly calculates the dot product in the new feature space from vectors in the original space is called a kernel; here the kernel is $K(x,x') = (x \cdot x')^2$.
A valid kernel function is equivalent to a dot product in some space.
$$K(x,x') = \varphi(x) \cdot \varphi(x')$$
Using the kernel trick, the nonlinear perceptron can be solved using the dual form
$$\hat{y} = sign\left( \sum_{i=1}^m \alpha_i y_i (\varphi(x_i) \cdot \varphi(x)) \right) = sign\left( \sum_{i=1}^m \alpha_i y_i K(x_i, x) \right)$$
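A minimal sketch of the dual (kernelised) perceptron using the quadratic kernel from the example above; the training loop and epoch count are illustrative:

```python
import numpy as np

quadratic_kernel = lambda u, v: np.dot(u, v) ** 2      # K(x, x') = (x . x')^2

def train_kernel_perceptron(X, y, kernel=quadratic_kernel, epochs=100):
    """X, y: numpy arrays; y holds labels in {-1, +1}.
    Learns one coefficient alpha_i per training example."""
    m = len(y)
    alpha = np.zeros(m)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])   # Gram matrix
    for _ in range(epochs):
        for j in range(m):
            if np.sign(np.sum(alpha * y * K[:, j])) != y[j]:
                alpha[j] += 1                                  # another mistake on x_j
    return alpha

def kernel_predict(x, X, y, alpha, kernel=quadratic_kernel):
    return np.sign(sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X)))
```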
Let $x_s$ be the closest point to the separating hyperplane (line in 2D) with the following equation:
$$w \cdot x = t$$
Let's make two minor technical adjustments to simplify the math later:
Pull out $w_0 : w = [w_0, ..., w_n]$ and $w_0 = -t$, therefore we will have: $$w \cdot x - t = 0$$
Normalize $w$:
We know that $|w \cdot x_s - t| > 0$, and we know that we can scale $w$ and $t$ together without having any effect on the hyperplane, so we choose the scale such that: $$|w \cdot x_s - t| = 1$$
This fixes the scale so that the (functional) margin is 1.
$w$ is perpendicular to the line (hyperplane).
For every two points $x'$ and $x''$ on the line (hyperplane), we can write:
$$ w \cdot x' - t = 0 \text{ and } w \cdot x'' - t = 0 \\ \Rightarrow \quad w \cdot (x' - x'') = 0 $$
Since their dot product is 0, it means they are perpendicular.
So, the distance between the point $x_s$ and the hyperplane can be found as
$$distance = \frac{|w \cdot x_s - t|}{||w||}$$
And we have $|w \cdot x_s - t|=1$, so:
$$distance = \frac{1}{||w||}$$
This distance is the margin of our classifier, which we want to maximise:
$$\max \frac{1}{||w||} \text{ subject to } \min_{i=1,...,m} |w \cdot x_i - t| = 1$$
(This is not a friendly optimisation as it has "min" in the constraint).
We can transform the maximization problem into the following minimization problem:
$$ \min_w \frac{1}{2} ||w||^2 \text{ subject to } y_i(w \cdot x_i - t) \geq 1 \\ \text{ for } i = 1,...,m, w \in \mathbb{R}^n, t \in \mathbb{R} $$
This can be solved using Lagrangian multipliers
In Lagrangian form, the optimization problem becomes:
$$\max_{\alpha_1,...,\alpha_m} \min_{x_1,...,x_n} \mathcal{L}(x_1,...,x_n,\alpha_1,...,\alpha_m) \text{ s.t. } \alpha_j \geq 0 \forall j$$
Adding the constraints with multiplier $\alpha_i$ for each training example gives the Lagrange function:
$$ \begin{align*} \mathcal{L}(w,t,\alpha_1,...,\alpha_m) &= \frac{1}{2} ||w||^2 - \sum_{i=1}^m \alpha_i (y_i (w \cdot x_i - t) - 1) \\ &= \frac{1}{2} ||w||^2 - \sum_{i=1}^m \alpha_iy_i(w \cdot x_i) + \sum_{i=1}^m \alpha_i y_i t + \sum_{i=1}^m \alpha_i \\ &= \frac{1}{2} w \cdot w - w \cdot \left( \sum_{i=1}^m \alpha_iy_ix_i \right) + t \left( \sum_{i=1}^m \alpha_i y_i \right) + \sum_{i=1}^m \alpha_i \end{align*} $$
The dual optimization problem for SVMs is to maximize the dual Lagrangian under positive constraints and one equality constraint:
$$\alpha_1^*, ..., \alpha_m^* = \arg \max_{\alpha_1, ..., \alpha_m} - \frac{1}{2}\sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j x_i \cdot x_j + \sum_{i=1}^m \alpha_i $$
$$\text{subject to } \alpha_i \geq 0, 1 \leq i \leq m, \sum_{i=1}^m \alpha_i y_i = 0$$
Solving for $\alpha_1^*, ..., \alpha_m^*$, most of them will be zero, except for the points that are closest to the hyperplane. These points are called the support vectors.
$$w = \sum_{x_i \in \{\text{support vectors}\}} \alpha_i y_i x_i$$
Solve for $t$ using any of the support vectors:
$$y_i (w \cdot x_i - t) = 1$$
Pros:
Cons:
Ensemble learning is a form of multi-level learning: learning a number of base-level models from the data, and learning to combine these models as an ensemble.
It can help to reduce bias (expected error due to mismatch between learner's hypothesis space and space of target concepts) and variance (expected error due to differences in the training sets used).
Three commonly used methods to find a good bias-variance tradeoff are:
However, it can be difficult to compute these efficiently on big data.
The following scenarios happen in real-world AI
For some given data $\mathcal{D}$, train an algorithm $L$ on training sets $S_1$, $S_2$ sampled from $\mathcal{D}$. If the model from $L$ is very similar on both $S_1$ and $S_2$, $L$ is a stable learning algorithm, otherwise it is unstable.
| | typical stable algorithm | typical unstable algorithm |
|---|---|---|
| Example model | $k$NN (for some $k$) | decision-tree learning |
| Cons | high bias | high variance |
Note that parameters have an effect on stability, e.g. in $k$NN:
Ensemble methods are meta-algorithms that combine different models into one model and they can:
This idea relates to the "wisdom of the crowd" phenomenon.
There are different ways of combining predictors
Simple ensembles like majority vote or unweighted average
Weighted averages / weighted votes: Every model gets a weight (i.e. depending on its performance)
In practice:
Learning algorithms may not be independent
Some fit the data better and so make fewer errors
We can define weights in different ways:
Decrease weight of correlated learners
Increase weight of good learners
Treat the output of each model as a feature and train a model on that
Mixture of experts
"Bagging" method: ("Bootstrap Aggregation")
Bootstrap:
Bagging:
Error of any model has two components:
Bagging is applied on a collection of low-bias high-variance models and by averaging them, the bias is unaffected, but the variance reduces.
To calculate bagging error:
The probability that $k'$ out of $k$ learners make an error is:
$$\binom{k}{k'}p^{k'}(1-p)^{k-k'}$$
If we use majority voting to decide the output, then the error happens if more than $\frac{k}{2}$ of learners make an error, so the error for majority voting is:
$$\sum_{k' > \frac{k}{2}} \binom{k}{k'}p^{k'}(1-p)^{k-k'}$$
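A short sketch of this majority-voting error calculation, assuming the $k$ learners make independent errors with the same rate $p$:

```python
from math import comb

def majority_vote_error(k, p):
    """Probability that more than k/2 of k independent learners,
    each with error rate p, are wrong at the same time."""
    return sum(comb(k, i) * p**i * (1 - p)**(k - i)
               for i in range(k // 2 + 1, k + 1))

# 11 independent learners, each 30% wrong -> ensemble error ~ 0.078
print(majority_vote_error(11, 0.3))
```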
Advantage of bagging:
Tree models can be bagged.
Pros
Cons
The solution for ensemble decision trees with lots of data is random forests.
To produce a random forest
A problem with parallel learners is that they can all be mistaken in the same region. Boosting tries to solve this:
We always aim to minimize some cost function:
Unweighted average loss | Weighted average loss |
---|---|
$J(\theta) = \frac{1}{N} \sum_i J_i (\theta, x_i)$ | $J(\theta) = \sum_i w_iJ_i (\theta, x_i)$ |
Boosting works well as long as we use weak learners (a weak learner is a model that is only slightly better than random guessing), i.e.
For boosting:
Pros:
Cons:
Gradient boosting applies similar ideas to regression (but can be used for classification as well).
Simple linear regression or simple regression trees can be used as weak learners.
| | Bagging | Boosting |
|---|---|---|
| Purpose | Variance reduction | Bias reduction |
| Models Used | Used with high variance models | Used with high bias models |
| Model Examples | Random Forests | Linear classifiers / univariate decision trees (decision stumps) |
Artificial Neural Networks (NN) are inspired by human nervous systems. NNs are composed of a large number of interconnected processing elements known as neurons. They use supervised error correcting rules with back-propagation to learn a specific task.
A single perceptron has multiple inputs $x_1, ..., x_n$ and a binary output, where the output $o$ can be modeled as
$$ o(x_1, ..., x_n) = \begin{cases} +1 & \text{if } w_0 + w_1x_1 + ... + w_nx_n > 0 \\ -1 & \text{otherwise} \end{cases} $$
Or in vector notation:
$$ o(\mathbf{x}) = \begin{cases} +1 & \text{if } \mathbf{w} \cdot \mathbf{x} > 0 \\ -1 & \text{otherwise} \end{cases} $$
As a result a perceptron is able to represent some useful functions which are linearly separable
Perceptron learning is simply an iterative weight-update scheme to find a good set of weights.
$$w_{i+1} \leftarrow w_i + \Delta w_i$$
where the weight update $\Delta w_i$ depends only on misclassified examples and is modulated by a "smoothing" parameter $\eta$ AKA learning rate.
For example, the update rule can be written as
$$w_{i+1} \leftarrow w_i + \eta y_ix_i$$
where
Perceptron training will converge (under some mild assumptions) for linearly separable classification problems.
However, with a relatively minor modification many perceptrons can be combined together to form one model
Gradient descent / ascent is an optimisation algorithm that seeks to minimize / maximize a function.
For a simple linear unit, where
$$o = w_0 + w_1x_1 + ... + w_nx_n$$
Let's learn $w_i$ that minimizes the squared error
$$E[w] = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$$
where $D$ is the set of training samples.
The gradient is given by
$$\nabla E[w] = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, ..., \frac{\partial E}{\partial w_n} \right]$$
The gradient vector gives the direction of steepest increase in error $E$. Negative of the gradient, i.e. steepest decrease, is what we want.
Training rule: $\Delta \mathbf{w} = -\eta \nabla E[\mathbf{w}]$, i.e. $\Delta w_i = - \eta \frac{\partial E}{\partial w_i}$.
$$ \begin{align*} \frac{\partial E}{\partial w_i} &= \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 \\ &= \frac{1}{2} \sum_d \frac{\partial}{\partial w_i} (t_d - o_d)^2 \\ &= \frac{1}{2} \sum_d 2(t_d - o_d) \frac{\partial}{\partial w_i}(t_d - o_d) \\ &= \sum_d (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - \mathbf{w} \cdot \mathbf{x}_d) \\ &= \sum_d (t_d - o_d)(-x_{i,d}) \end{align*} $$
Perceptron training rule guaranteed to succeed if:
Linear unit training rule uses gradient descent
Batch mode Gradient Descent
Stochastic (incremental) mode Gradient Descent
SGD can approximate Batch Gradient Descent arbitrarily closely if $\eta$ is made small enough
Multilayer networks can represent arbitrary functions. It typically consists of an input, hidden, and output layer, each fully connected to the next. The weights determine the function computed. Given an arbitrary number of hidden units, any boolean function can be computed with a single hidden layer.
Properties of Artificial Neural Networks (ANN's):
Consider when
Examples include speech recognition, image classification, and many others.
Same as a perceptron except step function is replaced by a nonlinear sigmoid function.
$$\sigma(x) = \frac{1}{1+e^{-x}}$$ Nonlinearity makes it easier for the model to generalise and adapt to a variety of data, and to differentiate between outputs.
The sigmoid function is used because its derivative has a nice property
$$\frac{d \sigma(x)}{dx} = \sigma(x)(1 - \sigma(x))$$
We can derive gradient descent rules to train
Note: in practice, particularly for deep networks, sigmoid functions are less common than other non-linear activation functions that are easier to train, however sigmoids are mathematically convenient.
To determine the error gradient of a sigmoid unit
Start by assuming we want to minimize squared error over a set of training examples $D$.
$$ \begin{align*} \frac{\partial E}{\partial w_i} &= \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D}(t_d - o_d)^2 \\ &= \frac{1}{2} \sum_{d \in D} \frac{\partial}{\partial w_i} (t_d - o_d)^2 \\ &= \frac{1}{2} \sum_{d \in D} 2(t_d - o_d)\frac{\partial}{\partial w_i} (t_d - o_d) \\ &= -\sum_{d \in D} (t_d - o_d) \frac{\partial o_d}{\partial w_i} \\ &= -\sum_{d \in D} (t_d - o_d) \frac{\partial o_d}{\partial net_d} \frac{\partial net_d}{\partial w_i} \end{align*} $$
We know:
$$\frac{\partial o_d}{\partial net_d} = \frac{\partial \sigma(net_d)}{\partial net_d} = o_d(1 - o_d)$$
$$\frac{\partial net_d}{\partial w_i} = \frac{\partial(\mathbf{w} \cdot \mathbf{x}_d)}{\partial w_i} = x_{i,d}$$
so:
$$\frac{\partial E}{\partial w_i} = - \sum_{d \in D}(t_d - o_d) o_d (1 - o_d) x_{i,d}$$
A solution for learning highly complex models:
Minimises error over all training examples
Will converge to a local, not necessarily global, error minimum
Many ways to regularise network, making it less likely to overfit
Minimizing square error does not work so well for classification.
If we take the output $o(x)$ as the probability of the class of $\mathbf{x}$ being 1, the preferred loss function is the cross-entropy
$$- \sum_{d \in D} \left[ t_d \log o_d + (1 - t_d) \log (1 - o_d) \right]$$
where:
$t_d \in \{0,1\}$ is the class label for training example $d$, and $o_d$ is the output of the sigmoid unit, interpreted as the probability of the class of training example $d$ being 1.
To train sigmoid units for classification using this setup, one can use gradient descent with the backpropagation algorithm; this yields the maximum likelihood solution.
CNNs are very similar to regular Neural Networks
CNN architecture assumes that inputs are images
The problem with regular NNs is that they do not scale well with input dimension. In contrast, CNNs consider 3D volumes of neurons and use a parameter-sharing scheme that minimises the number of parameters required by the network.
CNN neurons are arranged in 3 dimensions: Width, Height, and Depth. Neurons in a layer are only connected to a small region of the layer before it (hence not fully connected).
Main layers:
Supervised learning
Unsupervised learning
Clustering algorithms form two broad categories:
Let $X = \{x_1, ..., x_m \}$ be a set of instances
Let $C = (C_1, ..., C_k)$ be a partition of $m$ elements into $k$ subsets.
Each subset is called a cluster, and $C$ is called a clustering.
Set a value for $k$, the number of clusters
Initialize: choose points for centres (means) of $k$ clusters (at random)
Procedure:
Despite weaknesses (suffering from outliers, or trapped in local minimum), $k$-means is still the most popular algorithm due to simplicity and efficiency. There is no clear evidence that any other clustering algorithm performs better in general.
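A minimal numpy sketch of the $k$-means procedure (random initial centres, then alternating assignment and mean-update steps until the centres stop moving):

```python
import numpy as np

def k_means(X, k, iterations=100, seed=0):
    """X: (m, d) data matrix.  Returns the final centres and cluster labels."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iterations):
        # assignment step: each point goes to its nearest centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each centre moves to the mean of its assigned points
        new_centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centres[j] for j in range(k)])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return centres, labels
```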
For $k = 2$, think of the full description of each instance as $y_i = (x_i, z_{i1}, z_{i2})$, where
Given:
Determine Maximum likelihood estimates of
Initialize: Pick random initial $h = (\mu_1, ..., \mu_k)$
Iterate: E step: Calculate the expected value $E[z_{ij}]$ of each hidden variable $z_{ij}$, assuming the current hypothesis $h = (\mu_1, ..., \mu_k)$ holds:
$$ \begin{align*} E[z_{ij}] &= \frac{P(x=x_i|\mu=\mu_j)}{\sum_{n=1}^k p(x=x_i|\mu=\mu_n)} \\ &= \frac{\exp \left( - \frac{1}{2 \sigma^2} (x_i - \mu_j)^2 \right)}{\sum_{n=1}^k \exp \left( - \frac{1}{2 \sigma^2} (x_i - \mu_n)^2 \right)} \end{align*} $$
M step: Calculate the new maximum likelihood hypothesis $h' = (\mu_1', ..., \mu_k')$, assuming the value taken on by each hidden variable $z_{ij}$ is the expected value $E[z_{ij}]$ calculated above. Then replace $h = (\mu_1, ..., \mu_k)$ by $h' = (\mu_1', ..., \mu_k')$.
$$\mu_j \leftarrow \frac{\sum_{i=1}^m E[z_{ij}] x_i}{\sum_{i=1}^m E[z_{ij}]}$$
Converges to local maximum likelihood $h$ and provides estimates of hidden variables $z_{ij}$.
In fact, local maximum in $E[\ln P(Y|h)]$
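A minimal sketch of this EM procedure for a one-dimensional mixture of $k$ Gaussians with a known, shared variance (only the means are estimated, as above); initialisation and iteration count are illustrative:

```python
import numpy as np

def em_gaussian_means(x, k, sigma=1.0, iterations=50, seed=0):
    """x: 1-D data array.  Returns maximum likelihood estimates of the k means."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False).astype(float)
    for _ in range(iterations):
        # E step: expected value of each hidden indicator z_ij
        resp = np.exp(-((x[:, None] - mu[None, :]) ** 2) / (2 * sigma ** 2))
        resp /= resp.sum(axis=1, keepdims=True)
        # M step: new means are responsibility-weighted averages of the data
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return mu
```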
Bottom up: at each step join two closest clusters (starting with single-instance clusters)
Top down: find two clusters and then proceed recursively for the two subsets
Elbow method:
Gap statistics:
Key idea: look for features in a transformed space so that each dimension in the new space captures the most variation in the original data when it is projected onto that dimension.
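A minimal sketch of this idea (PCA) via an eigendecomposition of the covariance matrix of the centred data:

```python
import numpy as np

def pca(X, n_components):
    """Project centred data onto the directions of greatest variance."""
    X_centred = X - X.mean(axis=0)
    cov = np.cov(X_centred, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]  # top principal directions
    return X_centred @ eigvecs[:, order]
```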
Given: labelled data $(x,y)$ and unlabelled data $(x)$
Repeat:
Assumes: classifications by $h$ will tend to be correct (especially high probability ones)
Key idea: two views of an instance, $f_1$ and $f_2$
Multi-view learning
Machine learning: Have a computer solve problems by learning from data rather than being explicitly programmed.
Inductive learning: learning from examples. All machine learning algorithms are a kind of inductive learning, and there are some questions that we are interested in answering in such a framework:
Instead of focusing on particular algorithms, learning theory aims to characterize classes of algorithms.
Probably Approximately Correct (PAC) is a framework for mathematical analysis of learning which was proposed by Valiant in 1984.
The framework of PAC learning can be used to address questions such as:
Given:
Learner observes a sequence $D$ of training examples of form $(x,c(x))$, for some target concept $c \in C$
Learner must output a hypothesis $h$ estimating $c$
Note: randomly drawn instances
True Error of a Hypothesis: The true error (denoted $error_{\mathcal{D}}(h)$) of hypothesis $h$ with respect to target concept $c$ and distribution $\mathcal{D}$ is the probability that $h$ will misclassify an instance drawn at random according to $\mathcal{D}$.
$$error_{\mathcal{D}}(h) \equiv \text{Pr}_{x \sim \mathcal{D}} [c(x) \neq h(x)]$$