COMP9417

Machine Learning and Data Mining

2 May 2022

Statistical Techniques for Data Analysis

Probability vs Statistics

  • Probability: reasons from populations to samples (deductive reasoning)
  • Statistics: reasons from samples to populations (inductive reasoning)

Sampling is a way to draw conclusions about the population without measuring the whole population. The characteristics of an ideal sampling model include

  • No systematic bias
  • The chance of obtaining an unrepresentative sample can be calculated
  • The chance of obtaining an unrepresentative sample decreases with the size of the sample

Estimation

In statistics, estimation refers to the process by which one makes inferences about a population.

Mean

The arithmetic mean is $m = \frac{1}{n}\sum_{i=1}^n x_i$, where the observations are $x_1, x_2, ..., x_n$. If $x_i$ occurs $f_i$ times, and we define relative frequencies as $p_i = \frac{f_i}{n}$, then the mean is $m = \sum_i x_i p_i$. Therefore the expected value of a discrete random variable $X$ is:

$$E(X) = \sum_i x_i P(X=x_i)$$

Variance

Variance measures how far a set of random numbers are spread out from their average value. Standard deviation (square root of variance) is estimated as:

$$s = \sqrt{\frac{1}{n-1} \sum_i (x_i - m)^2}$$

We can write this in terms of expected values $E(X)$ as

$$Var(X) = E((X - E(X))^2) = E(X^2) - [E(X)]^2$$

Variance is the "mean of the squares minus the square of the mean".
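
As a quick numerical check of this identity, here is a minimal NumPy sketch (the sample values are illustrative only):

import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])   # illustrative sample

mean = x.mean()
var_identity = (x ** 2).mean() - mean ** 2    # mean of squares minus square of the mean
var_direct = ((x - mean) ** 2).mean()         # population form, divides by n

print(mean, var_identity, var_direct)         # 5.0 4.0 4.0
print(x.var(ddof=1))                          # unbiased sample estimate, divides by n-1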

Covariance and Correlation

Covariance is a measure of relationship between two random variables:

$$cov(x,y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{n-1} = \frac{(\sum_i x_i y_i) - n \bar{x} \bar{y}}{n-1}$$

Correlation is a measure to show how strongly a pair of random variables are related:

$$r = \frac{cov(x,y)}{\sqrt{var(x)}\sqrt{var(y)}}$$

This is also called Pearson's correlation coefficient.

  • The correlation coefficient is a number between $-1$ and $+1$ that shows whether a pair of variables $x$ and $y$ are associated or not and whether their scatter in the association is high or low:
    • A value close to $1$ signifies a strong positive association between $x$ and $y$, while a value close to $-1$ signifies a strong inverse association
    • A value near 0 indicates there is no association and there is a large scatter
  • This is only approximate when $x$ and $y$ are roughly linearly associated (it does not work well when the association is curved)
  • Do not use correlation to imply $x$ causes $y$ or the other way around
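
A minimal NumPy sketch of the covariance and Pearson correlation formulas above (the data is made up and roughly linear in $x$, so $r$ comes out close to $+1$):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])     # roughly linear in x

n = len(x)
cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)
r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

print(cov_xy, r)                            # r is close to +1
print(np.corrcoef(x, y)[0, 1])              # NumPy's Pearson correlation agrees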

Bias and Variance

Bias is the difference between the average prediction of our model and the correct value.

  • Models with high bias pay very little attention to the training data and oversimplify the model.
  • This leads to high error on both training and test data.

Variance is the variability of model prediction for a given data point or a value which tells us spread of our data.

  • Models with high variance pay a lot of attention to the training data and do not generalise to unseen data.
  • As a result, such models perform very well on training data but have high error rates on test data.

Bias-Variance Decomposition

When we assume $y = f + \epsilon$ and we estimate $f$ with $\hat{f}$, the expectation of error:

$$E[(y - \hat{f})^2] = (f - E[\hat{f}])^2 + Var(\hat{f}) + Var(\epsilon)$$

So, the mean of squared error (MSE) can be written as

$$MSE = Bias^2 + Variance + Irreducible Error$$

  • Irreducible error is associated with a natural variability in a system (noise). It can not be reduced since it is due to unknown factors or due to chance.
  • Reducible error, as the name suggests, can be and should be minimized further by adjustments to the model.

Bias-Variance Tradeoff

When comparing unbiased estimators, we would like to select the one with minimum variance

  • If our model is too simple and has very few parameters, then it may have high bias and low variance.
  • If it has a large number of parameters, it may have high variance and low bias.

Regression

Regression can be used to predict numeric values from numeric attributes. In linear models, the outcome is a linear combination of attributes:

$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + ... + \theta_n x_n = \sum_{i=0}^n \theta_i x_i = h(x)$$

where we define $x_0 = 1$ so the intercept $\theta_0$ fits the same form.

There is:

  • Univariate regression: one input variable/feature/attribute is used to predict one output variable
  • Multiple regression / multivariable regression: more than one variable/feature/attribute is used to predict one output variable

Least Squares

The most popular estimation method for fitting a linear regression is "Least Squares", where the error is defined as the squared difference between the predicted and actual values, i.e.

$$J(\theta) = \sum_{j=1}^m \left(y_j - \sum_{i=0}^n \theta_i x_{ji} \right)^2 = \sum_{j=1}^m \left(y_j - x_j^T \theta \right)^2 = (y - X \theta)^T (y - X \theta)$$

We want to minimize the error over all samples.

Gradient Descent

The goal of Linear Regression is to minimise $J(\theta)$. If we compute $J(\theta)$ for different values of $\theta$, we get a convex function with one global minimum. To find the value of $\theta$ that provides this minimum, we can use gradient descent.

Gradient descent starts with some initial $\theta$, and repeatedly performs an update:

$$\theta_i^{(t+1)} := \theta_i^{(t)} - \alpha \frac{\partial}{\partial \theta_i} J \left( \theta^{(t)} \right)$$

where $\alpha$ is the learning rate / step size, and each iteration it takes a step in the direction of the steepest decrease in $J(\theta)$.

For a single training sample $(x_j, y_j)$, the partial derivative of $J(\theta)$ can be determined as

$$\frac{\partial}{\partial \theta_i} J(\theta) = -2 \left( y_j - h_\theta (x_j) \right) x_{ji}$$

So, for a single training sample, the update Least Mean Squares (LMS) rule is:

$$\theta_i := \theta_i + 2 \alpha \left( y_j - h_\theta (x_j) \right) x_{ji}$$

To use all $m$ training samples, there are two methods.

  1. Batch Gradient Descent: $$\theta_i^{(t+1)} = \theta_i^{(t)} + \alpha \frac{2}{m} \sum_{j=1}^m \left( y_j - h_{\theta^{(t)}} (x_j) \right) x_{ji}$$ Replace the gradient with the sum of gradients over all samples and continue until convergence (when the estimated $\theta$ has stabilised).
  2. Stochastic Gradient Descent: $$\text{for } j = 1 \text{ to } m: \left\{ \theta_i := \theta_i + 2 \alpha \left( y_j - h_\theta (x_j) \right) x_{ji} \text{ for every } i \right\}$$

In stochastic gradient descent, $\theta$ gets updated at any sample separately, so it is faster to calculate, but may never converge to the minimum.
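
A minimal NumPy sketch of both update rules for a univariate linear regression, assuming made-up data generated from $y \approx 2 + 3x$ (the learning rates and iteration counts are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)
y = 2 + 3 * x + rng.normal(0, 0.1, size=50)
X = np.column_stack([np.ones_like(x), x])          # x_0 = 1 column for the intercept

def batch_gd(X, y, alpha=0.1, iters=5000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = -(2 / m) * X.T @ (y - X @ theta)    # gradient summed over all samples
        theta -= alpha * grad
    return theta

def stochastic_gd(X, y, alpha=0.01, epochs=200):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        for j in rng.permutation(m):               # update on one sample at a time
            theta += 2 * alpha * (y[j] - X[j] @ theta) * X[j]
    return theta

print(batch_gd(X, y))        # approximately [2, 3]
print(stochastic_gd(X, y))   # approximately [2, 3], but noisier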

Minimizing Squared Error (normal equations)

Gradient descent is an iterative algorithm; however, you may find the minimum of $J(\theta)$ by setting its derivative to 0. This is also called the exact or closed-form solution.

$$ \frac{\partial}{\partial \theta} J(\theta) = 0 \\ J(\theta) = ( y - X \theta )^T (y - X \theta) \\ \frac{\partial}{\partial \theta} J(\theta) = - 2X^T (y - X \theta) = 0 \\ X^T (y - X \theta) = 0 \\ \theta = (X^TX)^{-1}X^T y $$
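
A minimal sketch of the closed-form solution, solving the linear system with `np.linalg.solve` rather than forming the inverse explicitly (the data is a tiny made-up example where $y = 1 + 2x$ exactly):

import numpy as np

def normal_equations(X, y):
    # theta = (X^T X)^{-1} X^T y, solved as a linear system for numerical stability
    return np.linalg.solve(X.T @ X, X.T @ y)

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # first column is the intercept
y = np.array([1.0, 3.0, 5.0, 7.0])                               # exactly y = 1 + 2x
print(normal_equations(X, y))                                    # [1. 2.]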

Minimizing Squared Error (maximum likelihood)

We can write the relationship between input variable $x$ and output variable $y$ as:

$$y_j = x_j^T \theta + \epsilon_j$$

And $\epsilon_j$ is an error term which might be unmodeled effect or random noise. Assume $\epsilon_j$'s are independent and identically distributed (i.i.d.) according to a Gaussian distribution, i.e.:

$$ \epsilon_j \sim N \left( 0, \sigma^2 \right) \\ p(\epsilon_j) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( - \frac{\epsilon_j^2}{2 \sigma^2} \right) $$

This implies that:

$$p(y_j | x_j; \theta) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( - \frac{\left( y_j - x_j^T \theta \right)^2}{2 \sigma^2} \right)$$

So we want to estimate $\theta$ s.t. we maximise the probability of output $y$ given input $x$ over all $m$ training samples:

$$ \begin{align*} \mathcal{L}(\theta) &= p(\mathbb{y} | X, \theta) \\ &= \prod_{j=1}^m \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( - \frac{\left( y_j - x_j^T \theta \right)^2}{2 \sigma^2} \right) \end{align*} $$

Note this is called the Likelihood function.

However, to find $\theta$ that maximises $\mathcal{L}(\theta)$, we can also maximise any strictly increasing function of $\mathcal{L}(\theta)$. In this case, we can maximise the log likelihood $\mathcal{l}(\theta)$:

$$ \begin{align*} \mathcal{l}(\theta) &= \log \mathcal{L}(\theta) \\ &= \log \prod_{j=1}^m \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( - \frac{\left( y_j - x_j^T \theta \right)^2}{2 \sigma^2} \right) \\ &= m \log \left( \frac{1}{\sqrt{2 \pi \sigma^2}} \right) - \frac{1}{\sigma^2} \frac{1}{2} \sum_{j=1}^m \left( y_j - x_j^T \theta \right)^2 \end{align*} $$

So maximizing $\mathcal{l}(\theta)$ is equivalent to minimising $\sum_{j=1}^m \left( y_j - x_j^T \theta \right)^2$. This means that, under these assumptions, least squares regression is equivalent to finding the maximum likelihood estimate of $\theta$.

Linear Regression Assumptions

  • Linearity: The relationship between $x$ and the mean of $y$ is linear
  • Homoscedasticity: The variance of the residuals is the same for any value of $x$
  • Independence: Observations are independent of each other
  • Normality: For any fixed value of $x$, $y$ is normally distributed

Univariate Linear Regression

In univariate regression we aim to find the relationship between $y$ and one independent variable $x$.

As an example, if we want to investigate the relationship between height and weight, we could collect $m$ measurements

$$(h_j, w_j), \quad j = 1, ..., m$$

Univariate linear regression assumes a linear relation, $\hat{w} = \theta_0 + \theta_1 h$.

Multiple Linear Regression

Here, we model the relationship of $y$ to several other variables.

As an example, say we now want to predict people's weight from their height and body frame size (e.g. wrist size). We collect $m$ height, weight and body frame measurements:

$$(h_j, f_j, w_j), \quad j = 1, ..., m$$

The linear regression model is now:

$$\hat{w} = \theta_0 + \theta_1 h + \theta_2 f$$

Linear Regression for curves

We can also apply some changes to produce curves with linear regression, i.e.

To model:

$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2$$

we can treat $x_2 = x_1^2$ and use the model

$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$

These models can still be treated like linear regression and can fit curvature, because they are still linear in the parameters. Nonlinear regression is not linear in the parameters, e.g. $\hat{y} = \frac{\theta_1 x}{\theta_2 + x}$
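
A minimal sketch of fitting a quadratic curve with ordinary linear regression by adding $x^2$ as an extra feature (the data is made up from $y = 1 + 0.5x + 2x^2$ plus noise):

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-2, 2, 40)
y = 1.0 + 0.5 * x + 2.0 * x ** 2 + rng.normal(0, 0.2, size=x.shape)

# treat x^2 as a second feature; the model is still linear in the parameters
X = np.column_stack([np.ones_like(x), x, x ** 2])
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)   # approximately [1.0, 0.5, 2.0]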

However, increasing the degree of the model can result in overfitting.

Regularisation

Regularisation is a method to avoid overfitting by adding constraints to the weight vector. One approach is to ensure the weights are small in magnitude: this is called shrinkage.

The cost function, given data $(x_1, y_1), ..., (x_m, y_m)$ is

$$J(\theta) = \sum_j (y_j - h_\theta (x_j))^2 + \lambda \sum_i \theta_i^2$$

  • Parameter estimation by optimisation will attempt to find values for $\theta_0, ..., \theta_n$ s.t. $J(\theta)$ is a minimum
  • Similar to before, this can be solved by gradient descent or finding the closed form solution

The multiple least-squares regression problem is an optimisation problem, and can be written as:

$$\theta^* = \arg \min_\theta (y - X \theta)^T (y - X\theta)$$

The regularised version of this is then as follows:

$$\theta^* = \arg \min_\theta (y - X \theta)^T (y - X \theta) + \lambda || \theta ||^2$$

Where $||\theta||^2 = \sum_i \theta_i^2$ is the squared norm of the vector $\theta$, or equivalently, the dot product $\theta^T\theta$; $\lambda$ is a scalar determining the amount of regularisation.

This still has a closed form solution

$$\theta = (X^TX + \lambda I)^{-1} X^T y$$

where $I$ is the identity matrix. Regularisation amounts to adding $\lambda$ to the diagonal of $X^TX$, which improves the numerical stability of matrix inversion. This form of least-squares regression is known as ridge regression.
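
A minimal sketch of the ridge closed-form solution; note that, exactly as the formula above is written, it regularises every coefficient including $\theta_0$ (the data and $\lambda$ values are illustrative):

import numpy as np

def ridge(X, y, lam):
    n = X.shape[1]
    # add lambda to the diagonal of X^T X before solving
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
print(ridge(X, y, lam=0.0))   # ordinary least squares: [1. 2.]
print(ridge(X, y, lam=1.0))   # coefficients shrunk towards zero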

An alternative form of regularised regression is provided by the lasso. It replaces the term $\sum_i \theta_i^2$ with $\sum_i |\theta_i |$. The result is that some weights are shrunk, but others are set to 0; hence this favours sparse solutions.

Model Selection

Suppose there are a lot of variables / features $(x)$. Taking all the features will lead to an overly complex model. There are 3 ways to reduce complexity.

  1. Subset-selection, by search over the subset lattice. Each subset results in a new model, and the problem is to select one of the models.
  2. Shrinkage, or regularization of coefficients to zero, by optimization. There is a single model, and unimportant variables have near-zero coefficients.
  3. Dimensionality-reduction, by projecting points into a lower dimensional space (this is different to subset-selection)

Classification

Classification doesn't have the same convenient mathematical properties as regression. In classification we train a classifier, which is usually a function that maps an input data point to one of a set of discrete outputs (i.e. the classes).

  • Generative algorithm: builds a model for each of the classes and then makes classification predictions by checking which of the models the test example is most similar to
    • Learns $p(x|y)$
    • So, we can get $p(x,y) = p(x|y)p(y)$
    • It learns the mechanism by which the data has been generated
  • Discriminative algorithm: does not build models for the different classes, but rather focuses on finding a decision boundary that separates the classes
    • Learns $p(y|x)$

Linear Classification

For a linear classifier of 2 features and 2 classes

  • We seek a line that separates the two classes: $ax_1 + bx_2 + c = 0$.
  • We define a weight vector $w^T = [a,b]$, $x^T = [x_1,x_2]$.
  • So, the line can be defined by $x^Tw = -c = t$
  • $w$ is perpendicular to the decision boundary
  • $t$ is the decision threshold (if $x^Tw > t$, $x$ belongs to $P$ but if $x^Tw < t$, $x$ belongs to $N$)

If $p$ and $n$ are the respective centres of mass of the positive and negative points, the basic linear classifier is described by the equation $x^Tw = t$ with $w = p - n$.

As we know, $\frac{p+n}{2}$ is on the decision boundary, so we have:

$$t = \left( \frac{p+n}{2} \right)^T \cdot (p-n) = \frac{||p||^2 - ||n||^2}{2}$$

where $||x||$ denotes the length of vector $x$.
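
A minimal sketch of this basic linear classifier built from the two class centroids (the 2-D points are made up):

import numpy as np

pos = np.array([[2.0, 3.0], [3.0, 3.5], [2.5, 4.0]])   # positive class points
neg = np.array([[0.0, 0.5], [1.0, 0.0], [0.5, 1.0]])   # negative class points

p, n = pos.mean(axis=0), neg.mean(axis=0)   # centres of mass
w = p - n
t = (np.dot(p, p) - np.dot(n, n)) / 2       # (||p||^2 - ||n||^2) / 2

def classify(x):
    return "P" if np.dot(x, w) > t else "N"

print(classify(np.array([2.0, 3.0])))   # P
print(classify(np.array([0.5, 0.5])))   # N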

Generalisation

Generalisation means how well a trained model can classify or forecast unseen data. There are 3 basic assumptions for generalisation

  1. Examples are drawn independently and identically (i.i.d.) at random from the distribution
  2. The distribution is stationary; that is, the distribution doesn't change within the data set
  3. We always pull from the same distribution (for training, validation and test samples)

Cross-validation

CV is a validation technique to assess the results of a model to an independent data set. A portion of the dataset is used to train the model, and another is used to test. Types of CV include:

  1. Holdout method Randomly assign data points to two sets $d_0$ (training set) and $d_1$ (test set).
  2. Leave-One-Out Cross Validation (LOOCV) A variation of the leave-p-out cross validation where 1 sample is left out for testing, whilst the rest is used for training.
  3. K-fold Cross Validation Partition the dataset into $k$ equally sized subsamples. Of the $k$ subsamples, one subsample is used for testing, and the rest for training. Repeat so that each subsample is used once for testing, then aggregate to produce a single estimate (see the sketch below).
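
A minimal sketch of $k$-fold cross-validation; `fit` and `score` are hypothetical placeholder callables supplied by the user (a training routine and an evaluation metric), not a specific library API:

import numpy as np

def k_fold_cv(X, y, k, fit, score, seed=0):
    # partition the shuffled indices into k roughly equal folds
    m = len(y)
    idx = np.random.default_rng(seed).permutation(m)
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])                   # train on k-1 folds
        scores.append(score(model, X[test], y[test]))     # evaluate on the held-out fold
    return np.mean(scores)                                # aggregate into a single estimate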

There are certain parameters (hyperparameters) that need to be estimated during learning. To estimate them we use data, but NOT the training set OR the test set. Instead we use a separate validation or development set.

Validation set: To make the hyperparameter tuning and model selection independent from the test set, we define another set within the train set.

Evaluation of error

If we have a binary classification, we can have a contingency table, AKA confusion matrix:

|                 | Predicted Positive  | Predicted Negative  |
|-----------------|---------------------|---------------------|
| Actual Positive | True Positive (TP)  | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN)  |

Classification Accuracy on a sample of labelled pairs $(x,c(x))$ given a learned classification model that predicts, for each instance $x$, a class value $\hat{c}(x)$:

$$acc = \frac{1}{|Test|} \sum_{x \in Test} I[ \hat{c}(x) = c(x) ]$$

where $Test$ is a test set, and $I[]$ is the indicator function which is 1 iff its argument evaluates to true and 0 otherwise. The classification Error $= 1 - acc$.

Other evaluation metrics:

Precision / correctness

  • is the number of objects correctly classified as relevant (TP) divided by the total number of objects classified as relevant (TP + FP)

$$Precision = \frac{TP}{TP + FP}$$

Recall / sensitivity / completeness / true positive rate (TPR) is the number of relevant objects classified correctly (TP) divided by the total number of relevant objects (TP + FN)

$$Recall = \frac{TP}{TP + FN}$$

F1 score is the harmonic mean of precision and recall and is defined as:

$$F_1 = \frac{2 \times precision \times recall}{precision + recall}$$

This measure gives equal importance to precision and recall, which is sometimes undesirable; so we have to decide which metric to use depending on the task and what's important for it.
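
A small worked example computing these metrics from illustrative confusion-matrix counts:

TP, FP, FN, TN = 40, 10, 20, 30     # made-up counts

accuracy  = (TP + TN) / (TP + FP + FN + TN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)          # also the true positive rate
f1        = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)   # 0.7 0.8 0.666... 0.727...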

AUC-ROC (Area Under the Curve - Receiver Operating Characteristics) is an important curve for performance of classification models. It evaluates the model at different threshold settings and can inform the capability of the model in distinguishing between classes.

  • $TPR = \frac{TP}{TP+FN}$
  • $FPR = \frac{FP}{FP+TN}$
  • A good model has AUC close to 1, a very poor model has AUC close to 0, if the AUC = 0.5, it means there is no class separation

Missing Values

Often, data may be incomplete or have missing values. To handle this, there are several strategies:

  • Delete samples with missing values
    • Pros: A robust and probably more accurate model.
    • Cons: Loss of information and data. Becomes a poorly trained model if percentage of missing values is high.
  • Replace missing value with mean/median/mode
    • Pros: When data size is small, it is better than deleting and can prevent data loss.
    • Cons: Imputing the approximations adds bias to the model (it reduces the variance). Also works poorly compared to other methods.
  • If categorical, assigning a unique category or the most frequent category
    • Pros: Works well with small datasets and easy to implement and results in no loss of data.
    • Cons: Unique category works only for categorical features. Adding another feature (e.g., a new unique category) to the model may result in high variance in the model. Adding the most frequent category can increase the bias in the model.
  • Predicting the missing values
    • Pros: Imputing the missing values is an improvement as long as the bias from it is smaller than the omitted-variable bias. Also yields unbiased estimates of the model parameters.
    • Cons: Bias also arises when an incomplete conditioning set is used for a categorical variable. Considered only as a proxy for the true values.
  • Using algorithms that support missing values
    • Pros: Does not require the creation of a predictive model (however, correlation in the data is neglected).
    • Cons: Some of these algorithms are very time-consuming and it can be critical in data mining where large databases are being extracted.

Nearest Neighbour

Nearest Neighbour is a regression or classification algorithm that predicts, for a query point, the output value of the nearest data point.

To find the nearest data point, we have to find the distance between the query and other points. So we have to decide how to define the distance.

Minkowski Distance

If $\mathcal{X} = \mathbb{R}^d$, then for $x, y \in \mathcal{X}$ the Minkowski distance of order $p > 0$ is defined as:

$$Dist_p(x,y) = \left( \sum_{j=1}^d |x_j - y_j|^p \right)^{\frac{1}{p}} = ||x-y||_p$$

Where $||z||_p = \left( \sum_{j=1}^d |z_j|^p \right)^{\frac{1}{p}}$ is the p-norm (sometimes denoted $L_p$ norm) of the vector $z$.

  • The 2-norm refers to the Euclidean distance
  • The 1-norm denotes Manhattan distance, also called cityblock distance

If we let $p$ grow larger, the distance will be more dominated by the largest coordinate-wise distance, from which we can infer that $Dist_\infty(x,y) = \max_j | x_j - y_j|$; this is also called Chebyshev distance.
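
A minimal sketch of the Minkowski distance for a couple of values of $p$ (the two points are made up; the $p = 2$ case is the familiar 3-4-5 triangle):

import numpy as np

def minkowski(x, y, p):
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])

print(minkowski(x, y, 1))        # 7.0  Manhattan (1-norm)
print(minkowski(x, y, 2))        # 5.0  Euclidean (2-norm)
print(np.max(np.abs(x - y)))     # 4.0  Chebyshev (limit as p grows)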

If the data is not in $\mathbb{R}^d$, but we can turn it into Boolean features / character sequences, we can still apply distance measures, e.g.

| Data type                         | Distance Measure     |
|-----------------------------------|----------------------|
| Equal-length binary strings       | Hamming distance     |
| Unequal-length non-binary strings | Levenshtein distance |

Distance Metrics

Given an instance space $\mathcal{X}$, a distance metric is a function $Dist : \mathcal{X} \times \mathcal{X} \rightarrow [0, \infty)$ such that for any $x,y,z \in \mathcal{X}$

  1. Distances between a point and itself are zero
  2. All other distances are larger than zero
  3. Distances are symmetric
  4. Detours can not shorten the distance (triangle inequality)

If the second condition is weakened to a non-strict inequality, i.e. $Dist(x,y) = 0$ is allowed for some $x \neq y$, the function $Dist$ is called a pseudo-metric.

The arithmetic mean $\mu$ of a set of data points $D$ in a Euclidean space is the unique point that minimises the sum of squared Euclidean distances to those data points.

  • Note minimising the sum of squared Euclidean distances of a given set of points is the same as minimising the average squared Euclidean distance
  • If we drop the squares, the point is known as the geometric median
    • For univariate data, it corresponds to the median / middle value of a set of numbers
    • For multivariate data, there is no closed-form expression, and needs to be calculated by successive approximation

Nearest Centroid Classifier

This is a classifier based on the minimum-distance principle, where the class exemplars are just the class centroids (or means).

Pros:

  • Simple and fast
  • Works well when classes are compact and far from each other

Cons:

  • For complex classes (e.g. multimodal or non-spherical), it may give very poor results
  • Cannot handle outliers or noisy data well and cannot handle missing data

Nearest Neighbour Classification

This is related to the simplest form of learning

  • Training instances are searched for the instance that most closely resembles the new (query) instance
  • The instances themselves represent the knowledge
  • Called: instance based, memory based learning or case based learning; often a form of local learning

Nearest neighbour:

  • Given query instance $x_q$, first locate nearest training example $x_n$, then estimate $\hat{f}(x_q) \leftarrow f(x_n)$

$k$-Nearest neighbour:

  • Given $x_q$, take a vote among its $k$ nearest neighbours (if discrete-valued target function)
  • Take the mean of the $f$ values of its $k$ nearest neighbours (if real-valued), as in the sketch below: $$\hat{f}(x_q) \leftarrow \frac{\sum_{j=1}^kf(x_j)}{k}$$
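
A minimal sketch of $k$-NN prediction covering both the majority-vote and mean variants (the tiny dataset is made up):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_q, k, classification=True):
    dists = np.linalg.norm(X_train - x_q, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]                 # indices of the k closest points
    if classification:
        return Counter(y_train[nearest]).most_common(1)[0][0]   # majority vote
    return y_train[nearest].mean()                               # mean of the k target values

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1]), k=3))   # 0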

Pros:

  • Can be very accurate
  • Training is very fast
  • can learn complex target functions

Cons:

  • Slow at query time: basic algorithm scans entire training data to derive a prediction, and suffers from "curse of dimensionality"
  • Assumes all attributes are equally important, so easily fooled by irrelevant attributes (can be remedied by attribute selection or weights)
  • Problem of noisy instances (can be remedied by removing those from data set, but it's difficult to determine which are noisy)
  • Finding the optimal $k$ can be challenging
    • 1NN perfectly separates training data, so low bias but high variance
    • By increasing $k$, we increase bias and decrease variance
  • Needs homogeneous feature type and scale

Note that the data needs to be normalized because different attributes may be measured on different scales, i.e. $$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

where $x$ is the actual value of attribute / feature, and $x'$ is the normalized value.

Nearest Neighbour should be considered when

  • Instances map to points in $\mathbb{R}^d$
  • There are less than 20 attributes per instance (or number of attributes can be reduced)
  • There is lots of training data
  • No requirement for "explanatory" model to be learned

Distance-Weighted KNN

We might want to use the distance function to construct a weight $w_i$. The final line of the classification algorithm can be changed to be

$$\hat{f}(x_q) \leftarrow \arg \max_{v \in V} \sum_{i=1}^k w_i \delta(v,f(x_i))$$

where $\delta(a,b)$ is 1 if $a = b$ and 0 otherwise, and

$$w_i = \frac{1}{Dist(x_q, x_i)^2}$$

For real-valued target functions, the final line can be changed to be

$$\hat{f}(x_q) \leftarrow \frac{\sum_{i=1}^k w_i f(x_i)}{\sum_{i=1}^k w_i}$$

  • Note the denominator normalizes the contribution of individual weights
  • Now we can consider using all the training examples.
    • Using all examples (i.e. when $k = m$) with the rule above is called Shepard's method
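
A minimal sketch of the distance-weighted rule for a real-valued target (made-up 1-D data; the small constant guards against a zero distance when the query coincides with a training point). Setting $k = m$ gives Shepard's method:

import numpy as np

def weighted_knn_regress(X_train, y_train, x_q, k):
    dists = np.linalg.norm(X_train - x_q, axis=1)
    nearest = np.argsort(dists)[:k]
    w = 1.0 / (dists[nearest] ** 2 + 1e-12)          # w_i = 1 / Dist(x_q, x_i)^2
    return np.sum(w * y_train[nearest]) / np.sum(w)  # denominator normalises the weights

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 2.0, 3.0])
print(weighted_knn_regress(X_train, y_train, np.array([1.2]), k=3))   # close to 1.0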

Lazy learners like this do not construct a model, i.e.

  • 1-NN: training set error is always 0
  • $k$-NN: overfitting may be hard to detect

The solution is to use leave-one-out cross-validation (LOOCV), i.e. leave out one example and predict it given the rest.

Curse of Dimensionality

As dimension increases, the effectiveness of distance metrics decrease, and the concept of proximity may not be qualitatively meaningful as all points look equidistant.

Some other problems include:

  • It becomes polynomially harder to estimate many parameters (e.g. covariances)
  • It becomes more difficult to visualize data
  • Enormous amount of data is needed to train a model
  • Number of "cells" / data points in the instance space grows exponentially in the number of features

One approach to overcome the curse of dimensionality:

  • Stretch $j^{th}$ axis by weight $z_j$, where $z_1, ..., z_d$ is chosen to minimize prediction error
  • Use cross-validation to automatically choose weights $z_1, ..., z_d$
  • Note setting $z_j$ to zero eliminates this dimension altogether

Inductive Bias

Inductive bias is the combination of assumptions and restrictions placed on the models and algorithms used to solve a learning problem. A good inductive bias means the algorithm and model combination you are using is appropriate for the task.

Bayesian Methods

Provides practical learning algorithms:

  • Naive Bayes classifier learning
  • Bayesian network learning, etc.
  • Combines prior knowledge (prior probabilities) with observed data

Provides useful conceptual framework

  • Provides a "gold standard" for evaluating other learning algorithms

Bayes theorem is stated as

$$P(h|D) = \frac{P(D|h)P(h)}{P(D)}$$

where

  • $P(h)$ = prior probability of hypothesis $h$
  • $P(D)$ = prior probability of training data $D$
  • $P(h|D)$ = probability of $h$ given $D$
  • $P(D|h)$ = probability of $D$ given $h$

Choosing Hypotheses

Suppose the output belongs to a set of $k$ classes: $y \in \{ C_1, C_2, ..., C_k \}$.

Then in the Bayesian framework, for $1 \leq i \leq k$:

$$P(y = C_i | x) = \frac{P(x|C_i)P(C_i)}{P(x)}$$

where $P(x) = \sum_i P(x|C_i)P(C_i)$

The decision rule is to select a class which maximises the posterior probability for the prediction.

Generally we want the most probable hypothesis given the training data, Maximum a posteriori hypothesis $h_{MAP}$:

$$ \begin{align*} h_{MAP} &= \arg \max_{h \in H} P(h|D) \\ &= \arg \max_{h \in H} \frac{P(D|h)P(h)}{P(D)} \\ &= \arg \max_{h \in H} P(D|h)P(h) \end{align*} $$

To get the posterior probability of a hypothesis $h$, divide by $P(D)$ (the probability of the data) to normalise the result for each $h$:

$$P(h|D) = \frac{P(D|h) P(h)}{\sum_{h_i \in H}P(D|h_i)P(h_i)}$$

Denominator ensures we obtain posterior probabilities that sum to 1. Sum for all possible numerator values, since hypotheses are mutually exclusive.

  • Product Rule: probability $P(A \land B)$ of a conjunction of two events $A$ and $B$: $$P(A \land B) = P(A|B)P(B) = P(B|A)P(A)$$
  • Sum rule: probability of disjunction of two events $A$ and $B$: $$P(A \lor B) = P(A) + P(B) - P(A \land B)$$

Theorem of total probability: if events $A_1, ..., A_n$ are mutually exclusive with $\sum_{i=1}^n P(A_i) = 1$, then:

$$P(B) = \sum_{i=1}^n P(B|A_i)P(A_i)$$
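
A small worked example of Bayes' rule together with the theorem of total probability; all the numbers are made up for illustration (a rare condition with a fairly accurate test):

p_h = 0.008               # P(h): prior probability of the hypothesis
p_D_given_h = 0.98        # P(D|h): probability of the observed data given h
p_D_given_not_h = 0.03    # P(D|not h)

p_D = p_D_given_h * p_h + p_D_given_not_h * (1 - p_h)   # theorem of total probability
p_h_given_D = p_D_given_h * p_h / p_D                   # Bayes theorem
print(round(p_h_given_D, 3))                            # about 0.21: the posterior stays small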

Bayesian Expected Loss

So far, we decide $h_1$ if $P(h_1|D) > P(h_2|D)$ else $h_2$. Alternatively, we can use a loss function $L(h)$, where $L(h)$ is the loss that occurs when decision $h$ is made.

For example, if the cost of misclassifying a patient who has cancer as "not cancer" is 10 times more than classifying a patient who doesn't have cancer as "cancer", how will that affect our decision?

If the cost of misclassification is not the same for different classes, then instead of maximizing a posteriori, we have to minimize the expected loss:

  • So, if we define the loss associated with action $\alpha_i$ as $\lambda(\alpha_i|h)$
  • Then the expected loss associated with action $\alpha_i$ is: $$E[L(\alpha_i)] = R(\alpha_i | x) = \sum_{h \in H} \lambda (\alpha_i|h) P(h|x)$$

An optimal Bayesian decision strategy is to minimize the expected loss.

Learning a real valued function

Consider any real-valued target function $f$. Training examples $\langle x_i, y_i \rangle$ where $y_i$ is noisy training value

  • $y_i = \hat{f}(x_i) + \epsilon_i$
  • $\epsilon_i$ is a random variable (noise) drawn independently for each $x_i$ according to some Gaussian (normal) distribution with mean zero

Then the maximum likelihood hypothesis $h_{ML}$ is the one that minimizes the sum of squared errors:

$$ \begin{align*} h_{ML} &= \arg \max_{h \in H} P(D|h) \\ &= \arg \max_{h \in H} \prod_{i=1}^m P(y_i|\hat{f}) \\ &= \arg \max_{h \in H} \prod_{i=1}^m \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( -\frac{1}{2} \left(\frac{y_i - \hat{f}(x_i)}{\sigma} \right)^2 \right) \end{align*} $$

where $\hat{f} = h_{ML}$.

We can maximise the natural log to give a simpler expression:

$$ \begin{align*} h_{ML} &= \arg \max_{h \in H} \sum_{i=1}^m \ln \frac{1}{\sqrt{2 \pi \sigma^2}} - \frac{1}{2} \left( \frac{y_i - \hat{f}(x_i)}{\sigma} \right)^2 \\ &= \arg \max_{h \in H} \sum_{i=1}^m - \frac{1}{2} \left( \frac{y_i - \hat{f}(x_i)}{\sigma} \right)^2 \\ &= \arg \max_{h \in H} \sum_{i=1}^m - \left(y_i - \hat{f}(x_i)\right)^2 \\ \end{align*} $$

Equivalently, we can minimise the positive version of the expression:

$$h_{ML} = \arg \min_{h \in H} \sum_{i=1}^m \left( y_i - \hat{f}(x_i) \right)^2$$

Discriminative vs Generative Probabilistic Models

Discriminative models model the posterior probability distribution $P(y|x)$. That is, given $x$ they return a probability distribution over $y$.

Generative models model the joint distribution $P(y,x)$. Once we have the joint distribution, we can derive any conditional or marginal distribution involving the same variables.

Such models are called 'generative' because we can sample from the joint distribution to obtain new data points together with their labels.

Bayesian Optimal Classifier

In Bayesian optimal classification, the most probable classification is obtained by combining the predictions of ALL hypotheses, weighted by their posterior probabilities:

$$\arg \max_{v_j \in V} \sum_{h_i \in H} P(v_j|h_i)P(h_i|D)$$

where $v_j$ is a class value.

Pros:

  • Mathematically shown no other classification method using the same hypothesis space and prior knowledge outperforms the Bayes Optimal Classifier method on average.

Cons:

  • Bayes rule depends on unknown quantities so we need to use the data to find some approximation of those quantities

  • Is very inefficient. An alternative is the Gibbs algorithm.

    1. Choose one hypothesis at random, according to $P(h|D)$
    2. Use this to classify new instance

    Assuming target concepts are drawn at random from $H$ $$E[error_{Gibbs}] \leq 2 \times E[error_{BayesOptimal}]$$

Naive Bayes Classifier

When to use

  • Moderate or large training set available
  • Attributes that describe instances are conditionally independent given classification

Successful applications:

  • Classifying text documents
  • Gaussian Naive Bayes for real-valued data

Assume target function $f : X \rightarrow V$, where each instance $x$ described by attributes $(x_1, x_2, ..., x_n)$.

The most probable value of $f(x)$ is:

$$v_{MAP} = \arg \max_{v_j \in V} P(x_1, x_2, ..., x_n|v_j)P(v_j)$$

Naive Bayes assumes attributes are statistically independent i.e.

$$P(x_1, x_2, ..., x_n|v_j) = \prod_i P(x_i|v_j)$$

Which gives Naive Bayes classifier

$$v_{NB} = \arg \max_{v_j \in V} P(v_j) \prod_i P(x_i|v_j)$$

Pseudocode:

$\text{for each target value }v_j:$

  • $\hat{P}(v_j) \leftarrow \text{estimate } P(v_j)$
  • $\text{for each attribute value } x_i:$
    • $\hat{P}(x_i|v_j) \leftarrow$ $\text{ estimate }$ $P(x_i|v_j)$

If none of the training instances with target value $v_j$ have attribute $x_i$, then:

$$ \hat{P}(x_i|v_j) = 0 \\ \therefore \hat{P}(v_j) \prod_i \hat{P}(x_i|v_j) = 0 $$

To avoid this, use pseudo-counts: add 1 to each count (a version of the Laplace estimator), as in the sketch below.
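
A minimal sketch of Naive Bayes with add-one pseudo-counts for categorical attributes. The tiny dataset and attribute values are made up; the denominator adds the number of distinct values of the attribute so the smoothed estimates still sum to 1:

import numpy as np
from collections import defaultdict

def train_nb(X, y):
    classes = sorted(set(y))
    priors = {v: y.count(v) / len(y) for v in classes}
    counts = defaultdict(lambda: defaultdict(int))        # counts[(i, value)][class]
    for xs, v in zip(X, y):
        for i, xi in enumerate(xs):
            counts[(i, xi)][v] += 1
    n_values = [len({xs[i] for xs in X}) for i in range(len(X[0]))]
    def cond(i, xi, v):                                   # P(x_i | v) with pseudo-counts
        return (counts[(i, xi)][v] + 1) / (y.count(v) + n_values[i])
    return priors, cond

def predict_nb(priors, cond, xs):
    scores = {v: np.log(p) + sum(np.log(cond(i, xi, v)) for i, xi in enumerate(xs))
              for v, p in priors.items()}
    return max(scores, key=scores.get)                    # v_NB: the most probable class

X = [("sunny", "weak"), ("sunny", "strong"), ("rain", "weak"), ("overcast", "weak")]
y = ["no", "no", "yes", "yes"]
priors, cond = train_nb(X, y)
print(predict_nb(priors, cond, ("rain", "weak")))         # yes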

Naive Bayes: numeric attributes

  • Usual assumption: attributes have a normal or Gaussian probability distribution (given the class) $$x|v_j \sim N(\mu,\sigma^2)$$

  • the probability density function for the normal distribution is defined by two parameters

    • The sample mean $\mu$: $$\mu = \frac{1}{n}\sum_{i=1}^n x_i$$
    • The standard deviation $\sigma$: $$\sigma = \sqrt{\frac{1}{n-1} \sum_{i=1}^n(x_i - \mu)^2}$$

This gives the density function $f(x)$:

$$f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( -\frac{(x - \mu)^2}{2 \sigma^2} \right)$$

Categorical Random Variables

Categorical variables appear often in ML, i.e. text classification

  • The most common form is the Bernoulli distribution model; whether or not a word occurs in a document.
    • For the $i^{th}$ word, we have a random variable $X_i$ governed by a Bernoulli distribution.
    • The joint distribution over the bit vector $X = (X_1, ..., X_k)$ is called a multivariate Bernoulli distribution which shows whether a word occurs or not
  • Variables with more than two outcomes are also common.
    • The multinomial distribution manifests itself as a count vector: a histogram of the number of occurrences of all vocabulary words

Pros:

  • Easy and fast at prediction. Also does well in multi class prediction.
  • When assumption of independence holds, performs better than other models like logistic regression with less training data.
  • Performs well in case of categorical input variables compared to numerical.

Cons:

  • If the zero-frequency case occurs, it is unable to make a prediction (solution: use a smoothing technique such as pseudo-counts).
  • On the other hand, Naive Bayes is known to be a bad probability estimator, so its probability outputs should not be taken too seriously.
  • Strong assumption of independent attributes. In real life, it is almost impossible to get a set of attributes which are completely independent.

Logistic Regression

Logistic regression can be used for binary classification, where we transform the $y$ values into probability values (in the range $[0,1]$).

We can model this with a sigmoid curve.


Now $f(x)$ can have a value between $-\infty$ and $+\infty$ and in Logistic Regression we estimate $f(x)$ with a line.

$$\hat{f}(x) = x^T\beta = \log \frac{P(y=1|x)}{1-P(y=1|x)}$$

Logistic regression seeks to

  • Model the probability of a class given the values of independent input variables
  • Estimate the probability that a class occurs for a random observation
  • Classify an observation based on the probability estimations

$$\hat{P}(y=1|x) = \frac{1}{1+e^{-x^T\beta}}$$

  • If $P(y=1|x) \geq 0.5$ (same as saying $x^T\beta \geq 0$) then predict as class 1

  • If $P(y=1|x) < 0.5$ (same as saying $x^T\beta < 0$) then predict as class 0

  • This is equivalent of having a linear decision boundary separating the two classes (because we have a linear solution to our problem, this is what makes Logistic Regression a linear model)

Logistic Regression Parameter Estimation

We cannot use the cost function from Linear Regression because it would result in a non-convex function with many local minima. Instead, the following cost function is used:

Let's define $\hat{P}(y=1|x) = h_\beta(x)$

$$ cost \left( h_\beta(x),y \right) = \begin{cases} -\log \left( h_\beta(x) \right) & \text{if $y = 1$} \\ -\log \left( 1-h_\beta(x) \right) & \text{if $y = 0$} \end{cases} $$

$$J(\beta) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log \left( h_\beta \left( x^{(i)} \right) \right) + \left( 1 - y^{(i)} \right) \log \left( 1 - h_\beta \left( x^{(i)} \right) \right) \right]$$

The values of the parameters that minimise $J(\beta)$ can be found using gradient descent.
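
A minimal sketch of fitting $\beta$ by gradient descent on the cost above; the 1-D data and learning rate are made up, and the first column of $X$ is a constant 1 for the intercept:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gd(X, y, alpha=0.1, iters=5000):
    m, n = X.shape
    beta = np.zeros(n)
    for _ in range(iters):
        p = sigmoid(X @ beta)          # predicted P(y=1|x)
        grad = X.T @ (p - y) / m       # gradient of J(beta)
        beta -= alpha * grad
    return beta

X = np.array([[1, 0.1], [1, 0.4], [1, 0.6], [1, 0.9], [1, 1.2], [1, 1.5]])
y = np.array([0, 0, 0, 1, 1, 1])
beta = logistic_gd(X, y)
print(sigmoid(X @ beta) >= 0.5)        # expected: classes 0, 0, 0, 1, 1, 1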

Pros:

  • Relatively easy to implement and interpret and relatively fast at training and testing
  • Can easily extend to multi-classes
  • Provides probabilistic predictions

Cons:

  • Prone to overfitting in high-dimensional data (one remedy: regularization)
  • Provides linear decision boundary
  • Requires low or no correlation (collinearity) between input variables, otherwise it may lead to a poor model; also sensitive to outliers

Tree Learning

Probably the single most popular tool

  • Easy to understand, implement and use
  • Computationally cheap (efficient, even on big data)

Decision tree representation:

  • Each internal node tests an attribute
  • Each branch corresponds to attribute value
  • Each leaf node assigns a classification

Decision Tree: Expressiveness

Decision Trees can be used to represent Boolean functions like AND ($\land$), OR ($\lor$), XOR ($\oplus$)

$$X \land Y$$

if X == True:
    if Y == True: return True
    if Y == False: return False
if X == False:
    return False

$$X \lor Y$$

if X == True: return True
if X == False:
    if Y == True: return True
    if Y == False: return False

$$X \oplus Y$$

if X == True:
    if Y == True: return False
    if Y == False: return True
if X == False:
    if Y == True: return True
    if Y == False: return False

In general, decisions trees represent a disjunction of conjunctions

When to use decision trees:

  • Instance described by a mix of numeric features and discrete attribute value pairs
  • Target function is discrete valued (otherwise use regression trees)
  • Possibly noisy training data
  • Interpretability is an advantage

The main loop for top-down induction of decision trees (TDIDT)

  1. $A \leftarrow$ the "best" decision attribute for the next node to split examples
  2. Assign $A$ as decision attribute for node
  3. For each value of $A$, create new descendant of node (child node)
  4. Split training examples to child nodes
  5. If training examples perfectly classified (pure subset), then STOP, else iterate over new child nodes

Entropy

Example: to decide between "yes" and "no", is outlook or wind the better attribute to split on?

def use_outlook(outlook: str) -> dict:
    # class counts of the examples reaching each branch of an "outlook" split
    match outlook:
        case "sunny": return {"yes": 2, "no": 3}
        case "overcast": return {"yes": 4, "no": 0}
        case "rain": return {"yes": 3, "no": 2}

def use_wind(wind: str) -> dict:
    # class counts of the examples reaching each branch of a "wind" split
    match wind:
        case "weak": return {"yes": 6, "no": 2}
        case "strong": return {"yes": 3, "no": 3}
  • We want to use the split with higher "purity"
  • We need to measure the purity of the split
    • More certain about our classes after a split
      • A set with all examples belonging to one class is 100% pure
      • A set with 50% examples in one class and 50% in the other is 100% uncertain and impure

Entropy

  • From a statistical point of view
  • From information theory point of view: The amount of information (in the Shannon sense) needed to specify the full state of a system

Entropy measures the "impurity" of $S$.

$$Entropy(S) = H(S) = -p_\oplus \log_2 p_\oplus - p_\ominus \log_2 p_\ominus$$

  • $S$ is subset of training examples
  • $p_\oplus$, $p_\ominus$ are the proportions of positive and negative examples in $S$

Interpretation: if item $x$ belongs to $S$, how many bits are needed to tell if $x$ is positive or negative?

TikZ Diagram
  • "High Entropy" / "impure set" means $X$ is very uniform and boring
    • E.g. (3 samples from $\oplus$, 3 samples from $\ominus$)
    • $H(S) = - \frac{3}{6}\log_2\frac{3}{6} - \frac{3}{6}\log_2\frac{3}{6} = 1$ (can be interpreted as 1 bit)
  • "Low Entropy" / "pure set" means $X$ is not uniform and interesting
    • E.g. (6 samples from $\oplus$, 0 samples from $\ominus$)
    • $H(S) = - \frac{6}{6}\log_2\frac{6}{6} - \frac{0}{6}\log_2 \frac{0}{6} = 0$ (can be interpreted as 0 bits)

Information Gain

$Gain(S,A)$ is the expected reduction in entropy due to sorting on $A$

$$Gain(S,A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}Entropy(S_v)$$

  • $v$ is the possible values of attribute $A$
  • $S$ is the set of examples we want to split
  • $S_v$ is the subset of examples where $X_A = v$

We want to find the attribute which maximizes the gain (this is also called the "mutual information" between attribute $A$ and the class labels of $S$).

To select the best attribute at each branch:

  • Take each remaining attribute in your data
  • Compute information gain for that attribute
  • Select the attribute that has the highest information gain
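
A minimal sketch that computes entropy and information gain, using counts matching the outlook example above (9 "yes" and 5 "no" labels overall):

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(attr_values, labels):
    gain = entropy(labels)
    for v in np.unique(attr_values):
        subset = labels[attr_values == v]
        gain -= len(subset) / len(labels) * entropy(subset)   # weighted child entropy
    return gain

# sunny: 2 yes / 3 no, overcast: 4 yes / 0 no, rain: 3 yes / 2 no
outlook = np.array(["sunny"] * 5 + ["overcast"] * 4 + ["rain"] * 5)
labels = np.array(["yes", "yes", "no", "no", "no"] + ["yes"] * 4
                  + ["yes", "yes", "yes", "no", "no"])
print(information_gain(outlook, labels))   # about 0.247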

Limitations

  • Information gain is more biased towards attributes with large numbers of values / categories
    • Subsets are more likely to be pure if there is a large number of values
    • This can result in overfitting which does not generalize well to unseen data
  • Suggested solution: gain ratio
    • A modification of information gain that reduces the bias
    • Takes number and size of branches into account

Gain Ratio

$$SplitEntropy(S,A) = - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$$ Where:

  • $A$: candidate attribute
  • $v$: possible values of $A$
  • $S$: Set of examples ($X$) at the node
  • $S_v$: subset where $X_A = v$

$$GainRatio(S,A) = \frac{Gain(S,A)}{SplitEntropy(S,A)}$$

Overfitting in Decision Trees

  • Can always classify training examples perfectly
    • If necessary, keep splitting until each node contains 1 example
    • Singleton subset (leaf nodes with one example) which is by definition pure
  • But this is not always ideal because on leaf nodes with singleton subsets, you have no confidence in your decision

Consider error of hypothesis $h$ over

  • Training data: $error_{train}(h)$
  • Entire distribution $D$ of data $error_{D}(h)$

Definition Hypothesis $h \in H$ overfits training data if there is an alternative hypothesis $h' \in H$ such that

$$error_{train}(h) < error_{train}(h') \land error_D(h) > error_D(h')$$

To avoid overfitting in decision trees, we can use pruning

  • pre-pruning: Stop growing when data split not statistically significant.
    • Some stopping conditions:
      • Lower than some lower-bound on the number of examples in a leaf
      • Stop when the change in entropy is smaller than a lower-bound
  • post-pruning: Grow full tree, then remove sub-trees which are overfitting (based on validation set). This avoids the problem of "early stopping"
    • Methods of finding subtrees:
      • Split data into training and validation set, and prune subtrees until further pruning is harmful

Pros:

  • Easy to interpret
  • Can handle irrelevant attributes (Gain = 0)
  • Can handle categorical and numerical data, along with missing data
  • Very compact (number of nodes << number of examples)
  • Very fast at testing

Cons:

  • Only axis-aligned splits of the data
  • Tend to overfit
  • Greedy (may not find the best tree)
    • Exponentially many possible trees

Regression Tree

Regression trees are decision trees for numeric targets: each leaf predicts a numeric value (typically the mean of the training examples that reach it), and splits are chosen to minimise the variance (sum of squared errors) within the resulting subsets.

Kernel Methods

| Task | Label and Output Space | Learning Problem |
|------|------------------------|------------------|
| Classification | $\mathcal{L} = \mathcal{C}$, $\mathcal{Y} = \mathcal{C}$ | Learn an approximation $\hat{c} : \mathcal{X} \rightarrow \mathcal{C}$ to the true labelling function $c$ |
| Scoring and ranking | $\mathcal{L} = \mathcal{C}$, $\mathcal{Y} = \mathbb{R}^{\lvert \mathcal{C} \rvert}$ | Learn a model that outputs a score vector over classes |
| Probability estimation | $\mathcal{L} = \mathcal{C}$, $\mathcal{Y} = [0,1]^{\lvert \mathcal{C} \rvert}$ | Learn a model that outputs a probability vector over classes |
| Regression | $\mathcal{L} = \mathbb{R}$, $\mathcal{Y} = \mathbb{R}$ | Learn an approximation $\hat{f} : \mathcal{X} \rightarrow \mathbb{R}$ to the true labelling function $f$ |

Perceptron

Perceptron is an algorithm for binary classification that uses a linear prediction function. Note the perceptron predicts a binary class label, but linear regression predicts a real value.

For a general case with $n$ attributes

$$ f(x) = \begin{cases} +1 & \text{if $w_0 + w_1x_1 + ... + w_nx_n > 0$} \\ -1 & \text{otherwise} \end{cases} $$

If we add $x_0 = 1$ to the feature vector:

$$ f(x) = \begin{cases} +1 & \text{if $\sum_{i=0}^n w_ix_i > 0$} \\ -1 & \text{otherwise} \end{cases} $$

$$\hat{y} = f(x) = sign(w \cdot x)$$

where $sign$ is the sign function. Now we need to find a good set of weights using our training set.

The perceptron algorithm initializes all weights $w_i$ to zero, and learns the weights using the following update rule:

$$w := w + \frac{1}{2} \left( y_j - f(x_j) \right)x_j$$

There are 4 cases:

  1. $y=+1, f(x)=+1 \Rightarrow (y-f(x))=0$
  2. $y=+1, f(x)=-1 \Rightarrow (y-f(x))=+2$
  3. $y=-1, f(x)=+1 \Rightarrow (y-f(x))=-2$
  4. $y=-1, f(x)=-1 \Rightarrow (y-f(x))=0$

Perceptron training algorithm

Algorithm Perceptron($D$) / perceptron training for linear classification

Input labelled training data $D$ in homogeneous coordinates

Output weight vector $w$

  • $w \leftarrow 0$
  • $converged \leftarrow false$
  • $\textbf{while } converged = false \textbf{ do}$
    • $converged \leftarrow true$
    • $\textbf{for } i = 1 ..|D| \textbf{ do}$
      • $\textbf{if } y_iw \cdot x_i \leq 0 \textbf{ then}$
        • $w \leftarrow w + y_ix_i$
        • $converged \leftarrow false$
      • $\textbf{end}$
    • $\textbf{end}$
  • $\textbf{end}$
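
A minimal sketch of the same training loop in Python; the data is made up and a `max_epochs` cap is added as a safeguard, since the loop above only terminates when the data is linearly separable:

import numpy as np

def perceptron_train(X, y, max_epochs=100):
    # X in homogeneous coordinates (first column of ones), y in {-1, +1}
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        converged = True
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:     # misclassified (or on the boundary)
                w += yi * xi
                converged = False
        if converged:
            break
    return w

X = np.array([[1, 2.0, 2.0], [1, 3.0, 3.0], [1, -1.0, -2.0], [1, -2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron_train(X, y)
print(np.sign(X @ w))                  # [ 1.  1. -1. -1.]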

Extending Linear Classification

Linear classification cannot model nonlinear class boundaries. However, we can map the attributes into a new space consisting of combinations of attribute values.

E.g. for 2 attributes

$$y = w_1x_1^3 + w_2x_1^2x_2 + w_3x_1x_2^2 + w_4x_2^3$$

$y$ is predicted output for instances with two attributes $x_1$ and $x_2$.

However, there are issues with this approach

  • Efficiency:
    • With 10 attributes and polynomial function with order 5, more than 2000 coefficients have to be learned
  • Overfitting:
    • "Too nonlinear" - number of coefficients large relative to number of training instances
    • Curse of dimensionality

Dual Form

With an optimisation problem, we can construct another optimisation problem which is called dual problem and is related to our original problem (primal problem).

After training a perceptron, each example has been misclassified zero or more times. Denoting this number as $\alpha_i$ for example $x_i$, the weight vector for $m$ observations can be expressed as

$$w = \sum_{i=1}^m \alpha_iy_ix_i$$

In the dual instance-based view, we are learning instance weights $\alpha_i$ rather than feature weights $w_j$. An instance $x$ is classified as

$$\hat{y} = f(x) = sign(w \cdot x)$$ $$\hat{y} = sign \left( \sum_{i=1}^m \alpha_iy_i(x_i \cdot x) \right)$$

Perceptron training algorithm in dual form

Algorithm Dual-Perceptron($D$) / perceptron training for linear classification in dual form

Input labelled training data $D$ in homogeneous coordinates

Output coefficients $\alpha_i$ defining weight vector $w = \sum_{i=1}^{|D|} \alpha_i y_i x_i$

  • $\alpha_i \leftarrow 0$
  • $converged \leftarrow false$
  • $\textbf{while } converged = false \textbf{ do}$
    • $converged \leftarrow true$
    • $\textbf{for } i = 1 ..|D| \textbf{ do}$
      • $\textbf{if } y_i \sum_{j=1}^{|D|} \alpha_j y_j x_j \cdot x_i\leq 0 \textbf{ then}$
        • $\alpha_i \leftarrow \alpha_i + 1$
        • $converged \leftarrow false$
      • $\textbf{end}$
    • $\textbf{end}$
  • $\textbf{end}$

Nonlinear dual perceptron

  • We can use nonlinear mapping to map attributes into new space consisting of combinations of attribute values. $$x \rightarrow \varphi(x)$$
  • The perceptron decision will be: $$\hat{y} = sign \left( \sum_{i=1}^m \alpha_i y_i (\varphi(x_i) \cdot \varphi(x))\right)$$
  • So the only thing we need is the dot product in the new feature space $(\varphi(x_i) \cdot \varphi(x))$ or $\langle \varphi(x_i), \varphi(x) \rangle$

Let $x = (x_1, x_2)$ and $x' = (x_1', x_2')$ be two data points, and consider the following mapping to a three-dimensional feature space:

$$(x_1, x_2) \rightarrow (x_1^2, x_2^2, \sqrt{2}x_1x_2)$$

$$\text{(original feature space) } \mathcal{X} \rightarrow \mathcal{Z} \text{ (new feature space)}$$

The points in feature space corresponding to $x$ and $x'$ are

$z = (x_1^2, x_2^2, \sqrt{2}x_1x_2)$ and $z' = (x_1'^2, x_2'^2, \sqrt{2}x_1'x_2')$

The dot product of these two feature vectors is

$$z \cdot z' = x_1^2x_1'^2 + x_2^2x_2'^2 + 2x_1x_1'x_2x_2' = (x_1x_1' + x_2x_2')^2 = (x \cdot x')^2$$

By squaring the dot product in the original space, we obtain the dot product in the new space. A function that directly calculates the dot product in the new feature space from vectors in the original space is called a kernel - here the kernel is $K(x,x') = (x \cdot x')^2$.
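
A quick numerical check that the kernel $(x \cdot x')^2$ really does equal the dot product after the explicit mapping (the two points are arbitrary):

import numpy as np

def phi(x):
    # explicit mapping (x1, x2) -> (x1^2, x2^2, sqrt(2) x1 x2)
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def kernel(x, xp):
    return np.dot(x, xp) ** 2          # same dot product, computed in the original space

x, xp = np.array([1.0, 2.0]), np.array([3.0, 4.0])
print(np.dot(phi(x), phi(xp)))         # 121.0
print(kernel(x, xp))                   # 121.0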

A valid kernel function is equivalent to a dot product in some space.

$$K(x,x') = \varphi(x) \cdot \varphi(x')$$

  • A kernel function is a similarity function that corresponds to a dot product in some expanded feature space.
  • Some very useful kernels in machine learning are polynomial kernel and radial basis function kernel (RBF kernel)
  • Polynomial kernel is defined as: $$K(x,x') = (x \cdot x' + c)^q$$
  • RBF kernel is defined as: $$K(x,x') = \exp \left( - \frac{||x-x'||^2}{2 \sigma^2} \right)$$ (Using Taylor expansion, it can be shown that RBF kernel is equivalent of mapping features into infinite dimensions)

Using the kernel trick, the nonlinear perceptron can be solved using the dual form

$$\hat{y} = sign\left( \sum_{i=1}^m \alpha_i y_i (\varphi(x_i) \cdot \varphi(x)) \right) = sign\left( \sum_{i=1}^m \alpha_i y_i K(x_i, x) \right)$$

Support Vector Machine

  • Support Vector Machines (SVMs) can find the optimal linear classification by fitting the maximum margin hyperplane that has the greatest separation between classes
  • Can avoid overfitting - learn a form of decision boundary called the maximum margin hyperplane
  • Fast for mappings to nonlinear spaces
    • employ a mathematical trick (kernel) to avoid the actual creation of new "pseudo-attributes" in transformed instance spaces
    • i.e. the nonlinear space is created implicitly

Let $x_s$ be the closest point to the separating hyperplane (line in 2D) with the following equation:

$$w \cdot x = t$$

Let's introduce 2 minor technicalities to simplify the maths later:

  1. Pull out $w_0 : w = [w_0, ..., w_n]$ and $w_0 = -t$, therefore we will have: $$w \cdot x - t = 0$$

  2. Normalize $w$:

    We know that $|w \cdot x_s - t| > 0$ and we know that we can scale $w$ and $t$ together without having any effect on the hyperplane, so we choose the scale such that: $$|w \cdot x_s - t| = 1$$

    This fixes the scale so that the functional margin of the closest point is 1.

$w$ is perpendicular to the line (hyperplane).

For every two points $x'$ and $x''$ on the line (hyperplane), we can write:

$$ w \cdot x' - t = 0 \text{ and } w \cdot x'' - t = 0 \\ \Rightarrow \quad w \cdot (x' - x'') = 0 $$

Since their dot product is 0, it means they are perpendicular.

The distance between the point $x_s$ and the hyperplane can be found as

$$distance = \frac{|w \cdot x_s - t|}{||w||}$$

And we have $|w \cdot x_s - t|=1$, so:

$$distance = \frac{1}{||w||}$$

This distance is the margin of our classifier, which we want to maximize:

$$\max \frac{1}{||w||} \text{ subject to } \min_{i=1,...,m} |w \cdot x_i - t| = 1$$

(This is not a friendly optimisation as it has "min" in the constraint).

We can transform the maximization problem into the following minimization problem:

$$ \min_w \frac{1}{2} ||w||^2 \text{ subject to } y_i(w \cdot x_i - t) \geq 1 \\ \text{ for } i = 1,...,m, w \in \mathbb{R}^n, t \in \mathbb{R} $$

This can be solved using Lagrangian multipliers

Lagrangian multipliers

In Lagrangian form, the optimization problem becomes:

$$\max_{\alpha_1,...,\alpha_m} \min_{x_1,...,x_n} \mathcal{L}(x_1,...,x_n,\alpha_1,...,\alpha_m) \text{ s.t. } \alpha_j \geq 0 \forall j$$

  • first minimizing with respect to $x_1,...,x_n$
  • then maximizing with respect to $\alpha_1,...,\alpha_m$

Adding the constraints with multiplier $\alpha_i$ for each training example gives the Lagrange function:

$$ \begin{align*} \mathcal{L}(w,t,\alpha_1,...,\alpha_m) &= \frac{1}{2} ||w||^2 - \sum_{i=1}^m \alpha_i (y_i (w \cdot x_i - t) - 1) \\ &= \frac{1}{2} ||w||^2 - \sum_{i=1}^m \alpha_iy_i(w \cdot x_i) + \sum_{i=1}^m \alpha_i y_i t + \sum_{i=1}^m \alpha_i \\ &= \frac{1}{2} w \cdot w - w \cdot \left( \sum_{i=1}^m \alpha_iy_ix_i \right) + t \left( \sum_{i=1}^m \alpha_i y_i \right) + \sum_{i=1}^m \alpha_i \end{align*} $$

  • First we have to minimize $\mathcal{L}$ with respect to $w$ and $t$
  • By taking the partial derivative of the Lagrange function with respect to $t$ and setting it to 0, we find: $$\sum_{i=1}^m\alpha_iy_i = 0$$
  • Similarly, by taking the partial derivative of the Lagrange function with respect to $w$ and setting to 0 we obtain: $$w = \sum_{i=1}^m \alpha_i y_i x_i$$
    • The same expression as we derived for the perceptron
  • These expressions allows us to eliminate $w$ and $t$ and lead to the dual Lagrangian $$ \begin{align*} \mathcal{L}(\alpha_1, ..., \alpha_n) &= - \frac{1}{2} \left( \sum_{i=1}^m \alpha_i y_i x_i \right) \cdot \left( \sum_{i=1}^m \alpha_i y_i x_i \right) + \sum_{i=1}^m \alpha_i \\ &= - \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j x_i \cdot x_j + \sum_{i=1}^m \alpha_i \end{align*} $$

The dual optimization problem for SVMs is to maximize the dual Lagrangian under positive constraints and one equality constraint:

$$\alpha_1^*, ..., \alpha_m^* = \arg \max_{\alpha_1, ..., \alpha_m} - \frac{1}{2}\sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j x_i \cdot x_j + \sum_{i=1}^m \alpha_i $$

$$\text{subject to } \alpha_i \geq 0, 1 \leq i \leq m, \sum_{i=1}^m \alpha_i y_i = 0$$

Solving for $\alpha_1^*, ..., \alpha_m^*$, they will be mostly zero, except for the points that are closest to the hyperplane. These points are called the support vectors.

$$w = \sum_{x_i \in \text{support vectors}} \alpha_i y_i x_i$$

Solve for $t$ using any of support vectors:

$$y_i (w \cdot x_i - t) = 1$$

Pros:

  • Effective in high dimensional space and when dimension > num samples
  • Memory efficient
  • Works very well if classes are separable even if non linearly (through kernels)

Cons:

  • Doesn't scale well to big data (by increasing the number of samples, the number of support vectors grows and prediction becomes slow)
  • Doesn't directly provide probabilistic explanation
  • Hard to interpret, especially when using kernels

Ensemble Learning

Ensemble learning is a form of multi-level learning: learning a number of base-level models from the data, and learning to combine these models as an ensemble.

It can help to reduce bias (expected error due to mismatch between learner's hypothesis space and space of target concepts) and variance (expected error due to differences in the training sets used).

Three commonly used methods to find a good bias-variance tradeoff are:

  • Regularization
  • Bagging
  • Boosting

Bias-variance with "Big Data"

  • High bias algorithms are often used for efficiency
    • Usually simpler to compute
  • Big data can reduce variance
    • "small" concepts will occur more frequently
    • low bias algorithms can be applied

However, it is difficult to compute big data efficiently.

The following scenarios happen in real-world AI:

  1. Training-set error is observed to be high compared to human-level
    • Bias is too high - solution: move to a more expressive (lower bias) model
  2. Training-set error is observed to be similar to human-level, but validation set error is high compared to human-level
    • Variance is too high - solution: get more data, try regularization, ensembles, move to a different model architecture

Stability

For some given data $\mathcal{D}$, train an algorithm $L$ on training sets $S_1$, $S_2$ sampled from $\mathcal{D}$. If the model from $L$ is very similar on both $S_1$ and $S_2$, $L$ is a stable learning algorithm, otherwise it is unstable.

|               | Typical stable algorithm | Typical unstable algorithm |
|---------------|--------------------------|----------------------------|
| Example model | $k$NN (for some $k$)     | decision-tree learning     |
| Cons          | high bias                | high variance              |

Note parameters have effect on stability, i.e. in $k$NN:

  • 1NN perfectly separates training data, so low bias but high variance
  • By increasing $k$, we increase bias and decrease variance

Supervised Learning

Ensemble methods are meta-algorithms that combine different models into one model and they can:

  • Decrease variance
  • Decrease bias
  • Improve performance

This idea relates to the "wisdom of the crowd" phenomenon.

There are different ways of combining predictors

  1. Simple ensembles like majority vote or unweighted average

  2. Weighted averages / weighted votes: Every model gets a weight (i.e. depending on its performance)

    In practice:

    • Learning algorithms may not be independent

    • Some fit the data better and so make fewer errors

      We can define weights in different ways:

    • Decrease weight of correlated learners

    • Increase weight of good learners

  3. Treat the output of each model as a feature and train a model on that

    • If the task is binary classification and we choose the fusion model to be a linear model, then this will become a weighted vote
    • We train the fusion model on held-out data, otherwise it will be biased towards the models that performed best on the training data
  4. Mixture of experts

    • Weight $\alpha_i(x)$ indicates "expertise"
    • It divides the feature space into homogeneous regions
    • It may use a weighted average or just pick the model with largest expertise
    • It is a kind of local learning
  5. "Bagging" method: ("Bootstrap Aggregation")

    • Train many classifiers, each on only a portion of the data
    • Then aggregate through model averaging / majority voting

Bagging

Bootstrap:

  • Create a random subset of data by sampling with replacement
  • Draw $m'$ samples from the $m$ available samples with replacement $(m' \leq m)$

Bagging:

  • Repeat $k$ times to generate $k$ subsets
    • Some of the samples get repeated and some will be left out
  • Train one classifier on each subset
  • To test, aggregate the outputs of the $k$ classifiers trained in the previous step using a majority vote / unweighted average (see the sketch below)
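
A minimal sketch of this procedure (decision trees as the base learner, integer class labels, and scikit-learn assumed; parameter names are illustrative):

```python
# Sketch of bagging: bootstrap k subsets, train one tree per subset, majority-vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_fit_predict(X, y, X_test, k=25, rng=np.random.default_rng(0)):
    m = len(X)
    votes = []
    for _ in range(k):
        idx = rng.integers(0, m, size=m)         # bootstrap sample (with replacement)
        tree = DecisionTreeClassifier().fit(X[idx], y[idx])
        votes.append(tree.predict(X_test))
    votes = np.array(votes)                      # shape (k, n_test), integer labels assumed
    # aggregate by majority vote over the k classifiers
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```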

Error of any model has two components:

  • Bias: due to model choice
    • Can be reduced by increasing model complexity
  • Variance: due to small sample size or high complexity of the model
    • Can be reduced by increasing the data or reducing the complexity

Bagging is applied on a collection of low-bias high-variance models and by averaging them, the bias is unaffected, but the variance reduces.

To calculate bagging error:

  • If learners are independent
  • If each learner makes an error with probability $p$

The probability that $k'$ out of $k$ learners make an error is:

$$\binom{k}{k'}p^{k'}(1-p)^{k-k'}$$

If we use majority voting to decide the output, then an error occurs when more than $\frac{k}{2}$ of the learners make an error, so the error for majority voting is:

$$\sum_{k' > \frac{k}{2}} \binom{k}{k'}p^{k'}(1-p)^{k-k'}$$
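
As a quick numeric check of this formula (treating the learners as independent, which is an idealisation):

```python
# Sketch: error of a majority vote of k independent learners, each erring with probability p.
from math import comb

def majority_vote_error(k, p):
    return sum(comb(k, j) * p**j * (1 - p)**(k - j) for j in range(k // 2 + 1, k + 1))

print(majority_vote_error(11, 0.3))  # ~0.078, much lower than the individual error of 0.3
```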

Advantage of bagging:

  • Reduces overfitting (harder for aggregated model to memorize full dataset)
  • This improves performance in almost all cases, especially if the learning scheme is unstable
  • Can be applied to numeric prediction and classification
  • Can help a lot if data is noisy

Random Forests

Tree models can be bagged.

Pros

  • Reduces the overfitting
  • Generalizes better

Cons

  • When we bag a model, any simple structure is lost
    • This is because a bagged tree is no longer a tree but a forest, which reduces its claim to interpretability
    • Stable models like nearest neighbour are not much affected by bagging, but unstable models like trees are affected the most
  • With lots of data, we usually learn the same classifier, so averaging them does not help

The solution for ensemble decision trees with lots of data is random forests.

  • Add extra variability to the learners by introducing more randomness into the procedure.

To produce a random forest

  1. Select a random subsample from the dataset with replacement
  2. Select a subset of features randomly
  3. Build a full tree without pruning using the selected features and samples
  4. Repeat previous steps $k$ times
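
A minimal usage sketch (assuming scikit-learn; note that its implementation draws the random feature subset at each split rather than once per tree, and the parameter values here are illustrative):

```python
# Sketch: random forest = bagged trees plus random feature subsets.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,      # k: number of trees
    max_features="sqrt",   # size of the random feature subset considered at each split
    bootstrap=True,        # each tree sees a bootstrap sample of the data
    random_state=0,
).fit(X_tr, y_tr)

print("test accuracy:", rf.score(X_te, y_te))
```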

Boosting

A problem with parallel learners is that they can all be mistaken in the same region. Boosting tries to solve this:

  • Uses "weak" learners which are trained sequentially
  • New learners focus on errors of earlier learners
  • New learners try to get these "hard" examples right by operating on a weighted train set in favor of misclassified instances
    • Start with the same weight for all the instances
    • Misclassified instances gain higher weights, so the next classifier is more likely to classify them correctly
      • Give different weights to the loss function for different instances
      • Create a new collection of data with multiple copies of samples with higher weight
    • Correctly classified instances lose weight
  • Combine all learners in the end
  1. set $w_i = 1/m$ for $i = 1,...,m$
  2. Repeat until a sufficient number of hypotheses has been generated
    • Train model $L_j$ using the dataset with weight $w$
    • Increase $w_i$ for misclassified instances of $L_j$
  3. The ensemble hypothesis is the weighted majority / weighted average of the $k$ learners $L_1, ..., L_k$ with weights $\lambda_1, ..., \lambda_k$, which are proportional to the accuracy of each $L_j$

We always aim to minimize some cost function:

| Unweighted average loss | Weighted average loss |
|---|---|
| $J(\theta) = \frac{1}{N} \sum_i J_i (\theta, x_i)$ | $J(\theta) = \sum_i w_i J_i (\theta, x_i)$ |

Boosting works well as long as we use weak learners (a weak learner is a model that is only slightly better than random), e.g.

  • Perceptron
  • Decision stumps (trees with one node)

AdaBoost (Adaptive Boosting)

  • AdaBoost usually uses stump trees (trees with one node and two leaves) as the base learner
    • Not very accurate at classification on their own
    • AdaBoost combines stumps to boost the performance so it creates a forest of stumps instead of forest of trees
    • In AdaBoost, stumps are created sequentially
    • The error of each stump affects the training data weight in the next stump
    • Depending on the performance, each stump gets different weight ($\lambda_i$) in the final classification decision
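
A minimal usage sketch (scikit-learn's `AdaBoostClassifier`, whose default base learner is a depth-1 decision stump; the values are illustrative):

```python
# Sketch: AdaBoost with decision stumps as the weak learners.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each stump is trained on a reweighted dataset emphasising the previous stumps' mistakes,
# and receives a weight (lambda_i) in the final weighted-majority vote.
ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
print("test accuracy:", ada.score(X_te, y_te))
```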

For boosting:

Pros:

  • No need to use complex models (can boost performance of any weak learner)
  • Very simple to implement
  • Decreases the bias and variance

Cons:

  • Lack of interpretability
  • Slow during training and potentially testing

Gradient Boosting

Gradient boosting applies similar ideas to regression (but can be used for classification as well).

Simple linear regression or simple regression trees can be used as weak learners.

  1. Learn a regression predictor
  2. Compute the error residual
  3. Learn a new model to predict the residual, add it to the ensemble, and repeat
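
A minimal sketch of this residual-fitting loop for squared-error regression (shallow scikit-learn regression trees as weak learners; the shrinkage rate `nu` and other names are illustrative):

```python
# Sketch of gradient boosting for regression: repeatedly fit a weak learner to the
# current residuals (the negative gradient of squared error) and add it to the ensemble.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_rounds=100, nu=0.1, max_depth=2):
    base = y.mean()                              # start from a constant prediction
    pred = np.full(len(y), base)
    learners = []
    for _ in range(n_rounds):
        residual = y - pred
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        pred = pred + nu * tree.predict(X)       # shrink each correction by nu
        learners.append(tree)
    return base, learners

def gb_predict(base, learners, X, nu=0.1):
    return base + nu * sum(t.predict(X) for t in learners)
```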

Comparing bagging and boosting:

|  | Bagging | Boosting |
|---|---|---|
| Purpose | Variance reduction | Bias reduction |
| Models used | High variance models | High bias models |
| Model examples | Random Forests | Linear classifiers / univariate decision trees (decision stumps) |

Neural Learning

Artificial Neural Networks (NN) are inspired by human nervous systems. NNs are composed of a large number of interconnected processing elements known as neurons. They use supervised error correcting rules with back-propagation to learn a specific task.

A single perceptron has multiple inputs $x_1, ..., x_n$ and a binary output, where the output $o$ can be modeled as

$$ o(x_1, ..., x_n) = \begin{cases} +1 & \text{if } w_0 + w_1x_1 + ... + w_nx_n > 0 \\ -1 & \text{otherwise} \end{cases} $$

Or in vector notation:

$$ o(\mathbf{x}) = \begin{cases} +1 & \text{if } \mathbf{w} \cdot \mathbf{x} > 0 \\ -1 & \text{otherwise} \end{cases} $$

As a result, a perceptron is able to represent some useful functions, namely those that are linearly separable.

Perceptron Learning

Perceptron learning is simply an iterative weight-update scheme to find a good set of weights.

$$w_{t+1} \leftarrow w_t + \Delta w_t$$

where the weight update $\Delta w_t$ depends only on misclassified examples and is modulated by a "smoothing" parameter $\eta$, also known as the learning rate.

For example, the update rule can be written as

$$w_{t+1} \leftarrow w_t + \eta y_ix_i$$

where

  • $w_t$ is the weight at iteration $t$
  • $\eta$ is the learning rate
  • $y_i \in \{+1, 0, -1\}$ acts to set the sign of the update (so if the current weights resulted in a misclassification, $y_i$ reflects the correction that needs to be applied to the weight; correctly classified instances contribute no update)
  • $x_i$ is the instance value
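
A minimal sketch of this update scheme (bias folded in as $w_0$ via a constant input of 1; labels assumed to be in $\{+1, -1\}$):

```python
# Sketch: perceptron learning by iterating updates on misclassified examples.
import numpy as np

def perceptron_train(X, y, eta=0.1, epochs=100):
    X = np.hstack([np.ones((len(X), 1)), X])     # prepend 1 so w[0] plays the role of w_0
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:             # misclassified (or on the boundary)
                w += eta * y_i * x_i             # w_{t+1} <- w_t + eta * y_i * x_i
                mistakes += 1
        if mistakes == 0:                        # converged (linearly separable data)
            break
    return w
```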

Perceptron training will converge (under some mild assumptions) for linearly separable classification problems.

However, with a relatively minor modification many perceptrons can be combined together to form one model

  • Multilayer perceptrons, the classic "neural network"

Gradient Descent

Gradient descent / ascent is an optimisation algorithm that seeks to minimize / maximize a function.

For a simple linear unit, where

$$o = w_0 + w_1x_1 + ... + w_nx_n$$

Let's learn $w_i$ that minimizes the squared error

$$E[w] = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$$

where $D$ is the set of training samples.

The gradient is given by

$$\nabla E[w] = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, ..., \frac{\partial E}{\partial w_n} \right]$$

The gradient vector gives the direction of steepest increase in error $E$. Negative of the gradient, i.e. steepest decrease, is what we want.

Training rule: $\Delta \mathbf{w} = -\eta \nabla E[\mathbf{w}]$, i.e. $\Delta w_i = - \eta \frac{\partial E}{\partial w_i}$.

$$ \begin{align*} \frac{\partial E}{\partial w_i} &= \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 \\ &= \frac{1}{2} \sum_d \frac{\partial}{\partial w_i} (t_d - o_d)^2 \\ &= \frac{1}{2} \sum_d 2(t_d - o_d) \frac{\partial}{\partial w_i}(t_d - o_d) \\ &= \sum_d (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - \mathbf{w} \cdot \mathbf{x}_d) \\ &= \sum_d (t_d - o_d)(-x_{i,d}) \end{align*} $$
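
Putting the resulting update $\Delta w_i = \eta \sum_d (t_d - o_d) x_{i,d}$ into code, a minimal batch gradient descent sketch for a linear unit (synthetic data assumed; names are illustrative):

```python
# Sketch: batch gradient descent for a linear unit, minimising squared error.
import numpy as np

def linear_unit_gd(X, t, eta=0.01, epochs=1000):
    X = np.hstack([np.ones((len(X), 1)), X])     # constant input of 1 for the bias w_0
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                                # outputs o_d for all training examples
        grad = -(t - o) @ X                      # dE/dw_i = -sum_d (t_d - o_d) x_{i,d}
        w -= eta * grad                          # step in the direction of steepest decrease
    return w
```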

Perceptron training rule guaranteed to succeed if:

  • Training examples are linearly separable
  • Sufficiently small learning rate $\eta$

Linear unit training rule uses gradient descent

  • Guaranteed to converge to hypothesis with minimum squared error
  • Given sufficiently small learning rate $\eta$
  • Even when training data contains noise and/or is not separable by $H$

Stochastic Gradient Descent

Batch mode Gradient Descent

  • Do until satisfied
    • Compute the gradient $\nabla E_D[\mathbf{w}]$ of the error over the whole training set, $E_D[\mathbf{w}] = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$
    • $\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla E_D[\mathbf{w}]$

Stochastic (incremental) mode Gradient Descent

  • Do until satisfied
    • For each training example $d \in D$
      • Compute the gradient $\nabla E_d[\mathbf{w}]$ of the per-example error $E_d[\mathbf{w}] = \frac{1}{2}(t_d - o_d)^2$
      • $\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla E_d[\mathbf{w}]$

SGD can approximate Batch Gradient Descent arbitrarily closely if $\eta$ is made small enough

  • Very useful for training large networks, or online learning from data streams
  • Stochastic implies examples should be selected at random

Multilayer Networks

Multilayer networks can represent arbitrary functions. A typical network consists of an input layer, one or more hidden layers, and an output layer, each fully connected to the next. The weights determine the function computed. Given an arbitrary number of hidden units, any Boolean function can be computed with a single hidden layer.

Properties of Artificial Neural Networks (ANN's):

  • Many neuron-like threshold switching units
  • Many weighted interconnections among units
  • Highly parallel, distributed processing

Consider when

  • Input is high-dimensional
  • Output can be a vector of values, discrete or real valued
  • Form of target function is unknown
  • Interpretability of result is not important

Examples include speech recognition, image classification, and many others.

Sigmoid Unit

Same as a perceptron except step function is replaced by a nonlinear sigmoid function.

$$\sigma(x) = \frac{1}{1+e^{-x}}$$

The nonlinearity makes it easier for the model to generalise or adapt to a variety of data and to differentiate between outputs.

The sigmoid function is used because its derivative has a nice property

$$\frac{d \sigma(x)}{dx} = \sigma(x)(1 - \sigma(x))$$

We can derive gradient descent rules to train

  • One sigmoid unit
  • Multilayer networks of sigmoid units -> Backpropagation

Note: in practice, particularly for deep networks, sigmoid functions are less common than other non-linear activation functions that are easier to train, however sigmoids are mathematically convenient.

To determine the error gradient of a sigmoid unit

Start by assuming we want to minimize squared error over a set of training examples $D$.

$$ \begin{align*} \frac{\partial E}{\partial w_i} &= \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D}(t_d - o_d)^2 \\ &= \frac{1}{2} \sum_{d \in D} \frac{\partial}{\partial w_i} (t_d - o_d)^2 \\ &= \frac{1}{2} \sum_{d \in D} 2(t_d - o_d)\frac{\partial}{\partial w_i} (t_d - o_d) \\ &= -\sum_{d \in D} (t_d - o_d) \left( \frac{\partial o_d}{\partial w_i} \right) \\ &= -\sum_{d \in D} (t_d - o_d) \frac{\partial o_d}{\partial net_d} \frac{\partial net_d}{\partial w_i}\\ \end{align*} $$

We know:

$$\frac{\partial o_d}{\partial net_d} = \frac{\partial \sigma(net_d)}{\partial net_d} = o_d(1 - o_d)$$

$$\frac{\partial net_d}{\partial w_i} = \frac{\partial(\mathbf{w} \cdot \mathbf{x}_d)}{\partial w_i} = x_{i,d}$$

so:

$$\frac{\partial E}{\partial w_i} = - \sum_{d \in D}(t_d - o_d) o_d (1 - o_d) x_{i,d}$$

Backpropagation Algorithm

  • Initialize all weights to small random numbers
  • Until satisfied, do
    • For each training example, do
      • Input the training example to the network and compute the network outputs
      • For each output unit $k$ $$\delta_k \leftarrow o_k(1-o_k)(t_k-o_k)$$
      • For each hidden unit $h$ $$\delta_h \leftarrow o_h(1-o_h) \sum_{k \in outputs} w_{kh} \delta_k$$
      • Update each network weight $w_{ji}$ $$w_{ji} \leftarrow w_{ji} + \Delta w_{ji}$$ where $$\Delta w_{ji} = \eta \delta_j x_{ji}$$
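
A minimal numpy sketch of one pass of this algorithm for a network with a single hidden layer of sigmoid units (squared-error loss, biases omitted for brevity; shapes and names are illustrative):

```python
# Sketch: backpropagation for one hidden layer of sigmoid units, one example at a time.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_epoch(X, T, W_h, W_o, eta=0.5):
    """W_h has shape (n_hidden, n_in); W_o has shape (n_out, n_hidden)."""
    for x, t in zip(X, T):
        # forward pass
        o_h = sigmoid(W_h @ x)                         # hidden-unit outputs
        o_k = sigmoid(W_o @ o_h)                       # output-unit outputs
        # backward pass
        delta_k = o_k * (1 - o_k) * (t - o_k)          # output units
        delta_h = o_h * (1 - o_h) * (W_o.T @ delta_k)  # hidden units
        # weight updates: Delta w_ji = eta * delta_j * x_ji
        W_o += eta * np.outer(delta_k, o_h)
        W_h += eta * np.outer(delta_h, x)
    return W_h, W_o
```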

A solution for learning highly complex models:

  • gradient descent over entire network weight vector
  • Easily generalised to arbitrary directed graphs
  • Can learn probabilistic models by maximising likelihood

Minimises error over all training examples

  • Training can take thousands of iterations -> slow
  • Using network after training is very fast

Will converge to a local, not necessarily global, error minimum

  • Might exist many such local minima, but in practice often works well
  • Often include a weight momentum $\alpha$ $$\Delta w_{ji}(n) = \eta \delta_j x_{ji} + \alpha \Delta w_{ji} (n-1)$$
  • Stochastic gradient descent using "mini-batches"

Many ways to regularise network, making it less likely to overfit

  • Add term to error that increases with magnitude of weight vector $$E(\mathbf{w}) = \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2 + \gamma \sum_{i,j} w_{ji}^2$$
  • Other ways to penalise large weights, e.g. weight decay
  • Using "tied" or shared set of weights, e.g. by setting all weights to their mean after computing the weight updates

Neural Networks for Classification

Minimizing squared error does not work so well for classification.

If we take the output $o(\mathbf{x})$ as the probability of the class of $\mathbf{x}$ being 1, the preferred loss function is the cross-entropy

$$- \sum_{d \in D} \left[ t_d \log o_d + (1 - t_d) \log (1 - o_d) \right]$$

where:

$t_d \in \{0,1\}$ is the class label for training example $d$, and $o_d$ is the output of the sigmoid unit, interpreted as the probability of the class of training example $d$ being 1.

To train sigmoid units for classification using this setup, one can use gradient descent and the backpropagation algorithm; this yields the maximum likelihood solution.
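
For reference, a short derivation (under the same sigmoid-output assumption as above) showing why this loss combines nicely with the sigmoid: the $o_d(1 - o_d)$ factor from the sigmoid derivative cancels against the gradient of the cross-entropy, giving

$$ \begin{align*} \frac{\partial E}{\partial w_i} &= - \sum_{d \in D} \left( \frac{t_d}{o_d} - \frac{1 - t_d}{1 - o_d} \right) o_d (1 - o_d) x_{i,d} \\ &= - \sum_{d \in D} (t_d - o_d) x_{i,d} \end{align*} $$

so the resulting weight update has the same simple form as the delta rule for a linear unit, with $o_d$ now the sigmoid output.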

Convolutional Neural Networks

CNNs are very similar to regular Neural Networks

  • Made up of neurons with learnable weights

CNN architecture assumes that inputs are images

  • So we have local features
  • Which allows us to
    • encode certain properties in the architecture that make the forward pass more efficient and
    • significantly reduce the number of parameters needed for the network

The problem with regular NNs is that they do not scale well with input dimension. In contrast, CNNs consider 3D volumes of neurons and use a parameter-sharing scheme that minimises the number of parameters required by the network.

CNN neurons are arranged in 3 dimensions: Width, Height, and Depth. Neurons in a layer are only connected to a small region of the layer before it (hence not fully connected).

Main layers:

  • Convolutional
    • Applies convolution operation (converts all pixels in its receptive field into a single value)
  • Pooling
    • The purpose is to reduce the spatial size of the representation, which reduces the number of parameters and computation in the network and helps control overfitting
  • Rectified Linear Unit (ReLU)
    • Really an activation function ($f(x) = \max(0,x)$) favoured over traditional activation functions like Sigmoid or Tanh because it
      • accelerates the convergence of stochastic gradient descent
      • is computationally inexpensive compared to traditional activation functions
  • Fully-connected
    • A layer fully connected to all activations in the previous layer
  • Drop-out
    • A layer where, on each forward pass, some neurons are randomly set to zero
    • This reduces overfitting
  • Output layers
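
To make the convolution operation above concrete, a minimal sketch of a single-channel "valid" convolution (no padding or stride; strictly a cross-correlation, as used by most CNN libraries):

```python
# Sketch: slide a small kernel over an image, turning each receptive field into one value.
import numpy as np

def conv2d_valid(image, kernel):
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out
```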

Unsupervised Learning

Supervised learning

  • Classes are known and need a "definition" in terms of the data
  • Methods are known as:
    • classification
    • discriminant analysis
    • class prediction
    • supervised pattern recognition

Unsupervised learning

  • Classes are initially unknown and need to be discovered with their definitions from the data
  • Methods are known as:
    • cluster analysis
    • class discovery
    • unsupervised pattern recognition

Clustering

Clustering algorithms form two broad categories:

  • Partitioning methods
    • Typically require specification of the number of clusters
  • Hierarchical methods
    • Hierarchical algorithms are either agglomerative i.e., bottom-up or divisive i.e., top-down
    • In practice, hierarchical agglomerative methods are often used
    • More importantly for users, hierarchical methods produce a dendrogram, or tree, which can be visualized

Let $X = \{x_1, ..., x_m \}$ be a set of instances

Let $C = (C_1, ..., C_k)$ be a partition of $m$ elements into $k$ subsets.

Each subset is called a cluster, and $C$ is called a clustering.

K-means Clustering

Set a value for $k$, the number of clusters

Initialize: choose points for centres (means) of $k$ clusters (at random)

Procedure:

  1. Assign each instance $x$ to the closest of the $k$ points to form $k$ clusters
  2. Re-assign the $k$ points to be the means of each of the $k$ clusters
  3. Repeat 1 and 2 until convergence to a reasonably stable clustering
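
A minimal numpy sketch of this procedure (random initialisation from the data points; empty clusters are not handled):

```python
# Sketch: k-means clustering by alternating the assignment and mean-update steps.
import numpy as np

def kmeans(X, k, iters=100, rng=np.random.default_rng(0)):
    centres = X[rng.choice(len(X), size=k, replace=False)]      # random initial means
    for _ in range(iters):
        # step 1: assign each instance to its closest centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 2: re-compute each centre as the mean of its cluster
        new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centres, centres):                   # reasonably stable: stop
            break
        centres = new_centres
    return centres, labels
```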

Despite its weaknesses (sensitivity to outliers, getting trapped in a local minimum), $k$-means is still the most popular clustering algorithm due to its simplicity and efficiency. There is no clear evidence that any other clustering algorithm performs better in general.

Expectation Maximization

For $k = 2$, think of the full description of each instance as $y_i = (x_i, z_{i1}, z_{i2})$, where

  • $z_{ij}$ is 1 if $x_i$ generated by $j^{th}$ Gaussian, otherwise zero
  • $x_i$ is observable, from instance set $x_1, x_2, ..., x_m$
  • $z_{ij}$ is unobservable

Given:

  • Instances from $X$ generated by mixture of $k$ Gaussian distributions
  • Unknown means of the $k$ Gaussians
  • Assuming identity covariance matrix
  • Don't know which instance $x_i$ was generated by which Gaussian

Determine Maximum likelihood estimates of

  • the means $(\mu_1, ..., \mu_k)$

Initialize: Pick random initial $h = (\mu_1, ..., \mu_k)$

Iterate: E step: Calculate the expected value $E[z_{ij}]$ of each hidden variable $z_{ij}$, assuming the current hypothesis $h = (\mu_1, ..., \mu_k)$ holds:

$$ \begin{align*} E[z_{ij}] &= \frac{P(x=x_i|\mu=\mu_j)}{\sum_{n=1}^k p(x=x_i|\mu=\mu_n)} \\ &= \frac{\exp \left( - \frac{1}{2 \sigma^2} (x_i - \mu_j)^2 \right)}{\sum_{n=1}^k \exp \left( - \frac{1}{2 \sigma^2} (x_i - \mu_n)^2 \right)} \end{align*} $$

M step: Calculate the new maximum likelihood hypothesis $h' = (\mu_1', ..., \mu_k')$ assuming the value taken on by each hidden variable $z_{ij}$ is the expected value $E[z_{ij}]$ calculated above. Replace $h = (\mu_1, ..., \mu_k)$ by $h' = (\mu_1', ..., \mu_k')$.

$$\mu_j \leftarrow \frac{\sum_{i=1}^m E[z_{ij}] x_i}{\sum_{i=1}^m E[z_{ij}]}$$
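
A minimal sketch of these E and M steps for a one-dimensional mixture of $k$ Gaussians with a fixed, shared variance $\sigma^2$ (illustrative only):

```python
# Sketch: EM for the means of a 1-D mixture of k Gaussians with known shared variance.
import numpy as np

def em_gaussian_means(x, k, sigma2=1.0, iters=100, rng=np.random.default_rng(0)):
    mu = rng.choice(x, size=k, replace=False)        # random initial hypothesis h
    for _ in range(iters):
        # E step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2))
        resp = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma2))
        resp /= resp.sum(axis=1, keepdims=True)
        # M step: mu_j <- weighted mean of the data with weights E[z_ij]
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return mu
```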

Converges to local maximum likelihood $h$ and provides estimates of hidden variables $z_{ij}$.

In fact, local maximum in $E[\ln P(Y|h)]$

  • $Y$ is complete (observable plus unobservable variables) data
  • Expected value taken over possible values of unobserved variables in $Y$

Hierarchical Clustering

Bottom up: at each step join two closest clusters (starting with single-instance clusters)

  • Design decision: how to measure distance between clusters, e.g., distance between the two closest instances in the clusters vs distance between the cluster means

Top down: find two clusters and then proceed recursively for the two subsets

  • Can be very fast

Determining number of clusters

Elbow method:

  • Measure within-cluster dispersion (sum of squared distances from each point to cluster centroid)
  • Compute this for various $k$ choices
  • Choose the $k$ beyond which increasing $k$ no longer substantially reduces the dispersion (the "elbow")

Gap statistics:

  • Cluster the observed data and compute the corresponding total within-cluster variation $W_k$
  • Generate $B$ reference data sets with a random uniform distribution. Cluster each of these reference data sets and compute the corresponding total within-cluster variation $W_{kb}$, $b = 1, ..., B$
  • Compute the estimated gap statistic as the deviation of the observed $\log(W_k)$ from its expected value under the reference distribution: $Gap(k) = \frac{1}{B} \sum_{b} \log(W_{kb}) - \log(W_k)$. Also compute the standard deviation $s_k$ of the statistic
  • Choose $k$ as the smallest value s.t. the gap statistic is within 1 std dev of the gap at $k + 1$ $$Gap(k) \geq Gap(k + 1) - s_{k+1}$$

Principal Component Analysis (PCA)

Key idea: look for features in a transformed space so that each dimension in the new space captures the most variation in the original data when it is projected onto that dimension.

  1. Take the data as an $m \times n$ matrix $X$
  2. "centre" the data by subtracting the mean of each column
  3. Construct covariance matrix $C$ from centred matrix
  4. Compute eigenvector matrix $V$ (rotation) and eigenvalue matrix $S$ (scaling) such that $V^{-1}CV = S$ and $S$ is a diagonal $n \times n$ matrix
  5. Sort columns of $S$ in decreasing order (decreasing variance) and their corresponding eigenvectors
  6. Remove the columns of $S$ and $V$ whose eigenvalues are below some minimum threshold
  7. Transform the original data using the remaining eigenvectors
  • PCA complexity is cubic in number of original features (not feasible for high-dimensional datasets)
  • Alternatively can use random projections to approximate the sort of projection found by PCA
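
A minimal numpy sketch of these steps (eigendecomposition of the covariance matrix; illustrative only):

```python
# Sketch of PCA: centre the data, eigendecompose the covariance matrix,
# keep the top components, and project the data onto them.
import numpy as np

def pca(X, n_components):
    X_centred = X - X.mean(axis=0)                # step 2: centre each column
    C = np.cov(X_centred, rowvar=False)           # step 3: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)          # step 4: C is symmetric, so use eigh
    order = np.argsort(eigvals)[::-1]             # step 5: sort by decreasing variance
    V = eigvecs[:, order[:n_components]]          # step 6: keep the top eigenvectors
    return X_centred @ V, eigvals[order]          # step 7: transform the original data
```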

Semi-supervised Learning

  1. Learn initial classifier using labelled set
  2. Apply classifier to unlabelled set
  3. The most confident predictions of the classifier on the unlabelled data are retained
  4. Learn new classifier from now-labelled data
  5. Repeat until convergence

Self-training algorithm

Given: labelled data $(x,y)$ and unlabelled data $(x)$

Repeat:

  • Train classifier $h$ from the labelled data using supervised learning
  • Label the unlabelled data using classifier $h$
  • Add the most confidently labelled examples to the labelled data

Assumes: classifications by $h$ will tend to be correct (especially high probability ones)
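
A minimal sketch of self-training with a single probabilistic classifier (logistic regression and the confidence threshold are illustrative choices):

```python
# Sketch: self-training, moving confidently-labelled examples into the labelled set.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, threshold=0.95, max_rounds=10):
    for _ in range(max_rounds):
        h = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
        if len(X_unlab) == 0:
            break
        proba = h.predict_proba(X_unlab)
        confident = proba.max(axis=1) >= threshold       # keep only confident predictions
        if not confident.any():
            break
        y_new = h.classes_[proba[confident].argmax(axis=1)]
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, y_new])
        X_unlab = X_unlab[~confident]
    return h
```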

Co-Training

Key idea: two views of an instance, $f_1$ and $f_2$

  • assume $f_1$ and $f_2$ independent and compatible
  • If we have a good attribute set, leverage the similarity between attribute values in each view, assuming they predict the class, to classify the unlabelled data

Multi-view learning

  • Given two (or more) perspectives on data, e.g., different attribute sets
  • Train separate models for each perspective on small set of labelled data
  • Use models to label a subset of the unlabelled data
  • Repeat until no more unlabelled examples

Learning Theory

Computational Learning Theory

Machine learning: Have a computer solve problems by learning from data rather than being explicitly programmed.

  • Regression
  • Classification
  • Clustering
  • Ranking
  • Reinforcement Learning
  • ...

Inductive learning: learning from examples. All machine learning algorithms are a kind of inductive learning, and there are some questions that we are interested in answering within such a framework:

  • Probability of successful learning
  • Number of training examples
  • Complexity of hypothesis space
  • Time complexity of learning algorithm
  • Accuracy to which target concept is approximated
  • Manner in which training examples are presented
  • ...

Instead of focusing on particular algorithms, learning theory aims to characterize classes of algorithms.

Probably Approximately Correct

Probably Approximately Correct (PAC) is a framework for mathematical analysis of learning which was proposed by Valiant in 1984.

The framework of PAC learning can be used to address questions such as:

  • How many training examples?
  • How much computational effort required?
  • How complex a hypothesis class needed?
  • How to quantify hypothesis complexity?
  • How many mistakes will be made?

Sample Complexity

Given:

  • set of instances $X$
  • set of hypotheses $H$
  • set of possible target concepts $C$
  • training instances generated by a fixed, unknown probability distribution $\mathcal{D}$ over $X$

Learner observes a sequence $D$ of training examples of form $(x,c(x))$, for some target concept $c \in C$

  • Instances $x$ are drawn from distribution $\mathcal{D}$
  • teacher provides target value $c(x)$ for each $x$

Learner must output a hypothesis $h$ estimating $c$

  • $h$ is evaluated by its performance on subsequent instances drawn according to $\mathcal{D}$

Note: randomly drawn instances

True Error of a Hypothesis: The true error (denoted $error_{\mathcal{D}}(h)$) of hypothesis $h$ with respect to target concept $c$ and distribution $\mathcal{D}$ is the probability that $h$ will misclassify an instance drawn at random according to $\mathcal{D}$.

$$error_{\mathcal{D}}(h) \equiv \text{Pr}_{x \in \mathcal{D}} [c(x) \neq h(x)]$$