COMP4121

Advanced Algorithms

26 November 2021

Stats

Expectation and Variance of a Random Variable

The expected value of a random variable is its mean. Assuming that the expected value converges, the expected value can be calculated as shown below.

Discrete random variable: $E(X) = \sum_{i = 1}^{\infty} v_i \cdot p_i$

Continuous random variable: $E(X) = \int_{-\infty}^{\infty} x \cdot f(x)\, dx$

The variance of a random variable is defined as $V(X) = E\left((X - E(X))^2\right)$ assuming that both expectations involved are finite; the standard deviation of a random variable $X$ is given by $\sigma = \sqrt{V(X)}$.

Simple inequalities

$$1 + x \leq e^x \text{ for all } x \in \mathbb{R}$$ $$\left(1 - \frac{1}{n}\right)^n \leq \frac{1}{e} \leq \left(1 - \frac{1}{n} \right)^{n - 1} \text{ for all } n \in \mathbb{N}$$

Probability Inequalities

The Markov Inequality Let $X \geq 0$ be a non-negative random variable. Then for all $t > 0$, $$P(X \geq t) \leq \frac{E(X)}{t}$$
Chebyshev Inequality Let $X$ be a random variable with the expected value $\mu = E(X)$ and standard deviation $\sigma = \sqrt{E((X - \mu)^2)}$. Then for all $\lambda > 0$, $$P(| X - \mu | \geq \lambda \sigma) \leq \frac{1}{\lambda^2}$$
Chernoff Bound Let $X = \sum_{k = 1}^{n} X_k$, where $X_k, 1 \leq k \leq n$, are independent Bernoulli trials with the probability of success $P(X_k = 1) = p_k$, where $0 < p_k < 1$. Thus, $\mu = E(X) = \sum_{k = 1}^{n} p_k$. Let $\delta > 0$ be any positive real. Then, $$P(X > (1 + \delta) \mu) < \left( \frac{e^{\delta}}{(1 + \delta)^{1 + \delta}}\right)^{\mu}$$
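To see how these bounds compare on a concrete example, the following minimal Python sketch (the sample size, success probability and threshold are arbitrary illustrative choices) estimates the tail probability of a sum of independent Bernoulli trials by simulation and evaluates all three bounds at the same threshold.

# Comparing the Markov, Chebyshev and Chernoff bounds on a sum of Bernoulli trials
import math
import random

n, p = 100, 0.1              # 100 independent Bernoulli trials with success probability 0.1
mu = n * p                   # E(X)
var = n * p * (1 - p)        # V(X) for a sum of independent Bernoulli trials
t = 2 * mu                   # threshold: X >= 2*mu, i.e. (1 + delta)*mu with delta = 1

trials = 100_000
exceed = sum(
    sum(random.random() < p for _ in range(n)) >= t
    for _ in range(trials)
)
print("empirical ", exceed / trials)

print("Markov    ", mu / t)                                  # P(X >= t) <= E(X)/t
lam = (t - mu) / math.sqrt(var)                              # t = mu + lam * sigma
print("Chebyshev ", 1 / lam ** 2)                            # P(|X - mu| >= lam*sigma) <= 1/lam^2
delta = t / mu - 1                                           # deviation parameter in the Chernoff bound
print("Chernoff  ", (math.exp(delta) / (1 + delta) ** (1 + delta)) ** mu)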

Order Statistics (QuickSelect)

Can we find the $i^{th}$ largest or smallest item in linear time?

Randomised Algorithm

In the quicksort algorithm, we note that after a partition, the pivot element ends up at its correct index. Therefore, we can combine partitioning with a binary-search-style recursion to find the item at the $i^{th}$ index.

For example, we partition first using a random pivot, so we know the final index of that pivot. If the pivot lands at index $i$, we have found the element. Otherwise, if the pivot is at an index greater than $i$, we recurse on the smaller side; else we recurse on the bigger side, and continue until we have found the $i^{th}$ element.

# Implementation of rand-select

import random

def partition(A: list[int], lo: int, hi: int) -> int:
    # Partition A[lo..hi] around the pivot A[hi] and return the pivot's final index
    i = lo - 1
    pivot = A[hi]
    for j in range(lo, hi):
        if A[j] <= pivot:
            i += 1
            A[i], A[j] = A[j], A[i]
    A[i + 1], A[hi] = A[hi], A[i + 1]
    return i + 1


def rand_select(A: list[int], i: int, lo: int, hi: int) -> int:
    # Return the element at index i of the sorted order of A[lo..hi]
    if lo < hi:
        # Swap a randomly chosen pivot into position hi, then partition around it
        r = random.randint(lo, hi)
        A[r], A[hi] = A[hi], A[r]
        pi = partition(A, lo, hi)
        if i == pi:
            return A[pi]
        if i < pi:
            return rand_select(A, i, lo, pi - 1)
        return rand_select(A, i, pi + 1, hi)
    return A[lo]
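A quick usage example of the routine above, selecting the median (the element at index $\lfloor n/2 \rfloor$ in sorted order) of a small array:

# Example: select the median of a small array
A = [9, 2, 7, 4, 5, 1, 8]
k = len(A) // 2
print(rand_select(A, k, 0, len(A) - 1))   # prints 5, the median of A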

Runtime

The runtime can be represented through the recurrence relation $T(n) = T\left( \frac{n}{2} \right) + O(n)$, which solves to an $O(n)$ runtime; however, this assumes reasonably balanced partitions, so the algorithm is better analysed statistically.

Worst Case

The worst case runtime is $\Theta(n^2)$, which happens when the smallest or largest element is repeatedly picked as the pivot, resulting in maximally unbalanced partitions.

Average Case

However (assuming all elements are distinct), let us call a partition a balanced partition if the ratio between the larger and the smaller side is at most 9 (9 is arbitrary; any fixed constant would do). Then the probability of obtaining a balanced partition is 1 - (the chance of choosing a pivot among the smallest 1/10 or the biggest 1/10 of the elements) = 1 - 2/10 = 8/10.

Then let us find the expected number of partitions between two consecutive balanced partitions. In general, the probability that you need $k$ partitions to end up with another balanced partition is $\left( \frac{2}{10} \right)^{k - 1} \cdot \frac{8}{10}$.

Thus, the expected number of partitions between two balanced partitions is

$$ \begin{align*} E &= 1 \cdot \frac{8}{10} + 2 \cdot \left( \frac{2}{10} \right) \cdot \frac{8}{10} + 3 \cdot \left( \frac{2}{10} \right)^2 \cdot \frac{8}{10} + ... \\ &= \frac{8}{10} \cdot \sum_{k = 0}^{\infty}(k + 1) \left( \frac{2}{10} \right)^k \\ &= \frac{8}{10}S. \end{align*} $$

  • To find $S$, note that

    $$\sum_{k = 0}^{\infty} q^k = \frac{1}{1 - q}.$$

    By differentiating both sides with respect to $q$ we get

    $$\sum_{k = 1}^{\infty} k q^{k - 1} = \frac{1}{(1 - q)^2}.$$

    Substituting $q = \frac{2}{10}$ we get that $S = \left(\frac{10}{8} \right)^2$.

Therefore,

$$E = \frac{8}{10} \cdot \left( \frac{10}{8} \right)^2 = \frac{10}{8} = \frac{5}{4}.$$

And so, on average, there are only 5/4 partitions between two consecutive balanced partitions.

  • Note that after one balanced partition, the size of the part we recurse on is $\leq \frac{9}{10} n$; after the second balanced partition it is $\leq \left(\frac{9}{10}\right)^2 n$, and so on.

Therefore, the total average (expected) run time satisfies

$$ \begin{align*} T(n) &< 5/4n + 5/4 \left( \frac{9}{10} \right)n + 5/4 \left( \frac{9}{10} \right)^2 n + 5/4 \left( \frac{9}{10} \right)^3 n + ... \\ &= \frac{5/4 n}{1 - \frac{9}{10}} \\ &= 12.5n. \end{align*} $$

Overall, the expected runtime of this algorithm is linear.

Database access

Assume that $n$ processes want to access a database, and that the time $t$ is discrete. If two processes simultaneously request access, there is a conflict and all processes are locked out of access.

Assume that processes cannot communicate with each other about when to access. One possible strategy for a process to decide whether it should access the database at time $t$ is to "request access" with probability $p$ and "do not request access" with probability $(1 - p)$.

What should $p$ be to maximise the probability of a successful access to the database for a process at any instant $t$?

Efficient selection of p

The probability of success of process $i$ at any instant $t$ is

$$P(S(i, t)) = p(1 - p)^{n - 1},$$

because a process $i$ requests access with probability $p$ and the probability that no other process has requested access is $(1 - p)^{n - 1}$.

Visualization of Success Probability

[Plot of $P(S(i, t)) = p(1 - p)^{n - 1}$ against $p$ for several values of $n$.]

The graph clearly shows:

  • Higher $n$ → lower optimal $p$: as the number of processes increases, the optimal request probability decreases
  • Peak success probability decreases: more processes lead to lower maximum success rates
  • Sharper curves for large $n$: the optimal region becomes narrower as $n$ increases

The extremum of this function is found by solving

$$\frac{d}{dp}P(S(i, t)) = (1 - p)^{n - 1} - p(n - 1)(1 - p)^{n - 2} = 0$$

which gives $p = 1/n$, and hence a probability of success of

$$P(S(i, t)) = p(1 - p)^{n - 1} = \frac{1}{n} \left( 1 - \frac{1}{n} \right)^{n - 1}.$$

However,

$$\lim_{n \rightarrow \infty} \left(1 - \frac{1}{n} \right)^n = \frac{1}{e}$$

and hence $P(S(i, t)) = \Theta \left( \frac{1}{n} \right)$. Thus, the probability of failure after $t$ instants is

$$ \begin{align*} P(\text{failure after $t$ instants}) &= \left( 1 - \frac{1}{n} \left( 1 - \frac{1}{n} \right)^{n - 1} \right)^t \\ &\approx \left( 1 - \frac{1}{e n}\right)^t \end{align*} $$

We observe that

P(failure after $t = en$ instants): $$\left( 1 - \frac{1}{en}\right)^{en} \approx \frac{1}{e}$$

P(failure after $t = en \cdot 2 \ln n$ instants): $$\left(1 - \frac{1}{en} \right)^{en \cdot 2 \ln n} \approx \frac{1}{n^2}$$

Thus, a small increase in the number of time instants, from $en$ to $en \cdot 2 \ln n$ caused a dramatic reduction in the probability of failure.

After $en \cdot 2 \ln n$ instants, the probability of failure of each process is $\leq 1/n^2$, and since there are $n$ processes, the probability that at least one process failed is $\leq 1/n$. Thus, after $en \cdot 2 \ln n$ instants all processes have succeeded in accessing the database with probability at least $1 - 1/n$.
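A minimal simulation sketch of this protocol (the number of processes and of repetitions below are arbitrary choices): each of the $n$ processes requests access with probability $p = 1/n$ at every instant, and we record how many instants pass until every process has had at least one uncontended access.

# Simulating n processes that each request access with probability p = 1/n per instant
import math
import random

def instants_until_all_succeed(n: int) -> int:
    p = 1 / n
    succeeded = set()
    t = 0
    while len(succeeded) < n:
        t += 1
        requesters = [i for i in range(n) if random.random() < p]
        if len(requesters) == 1:          # exactly one request means an uncontended success
            succeeded.add(requesters[0])
    return t

n = 50
runs = [instants_until_all_succeed(n) for _ in range(100)]
print("average instants until all succeed:", sum(runs) / len(runs))
print("en * 2 ln n =", math.e * n * 2 * math.log(n))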

Comparison with centralised algorithm

If the processes could communicate, then it would take $n$ instants for all of them to access the database.

If they cannot communicate, then the above method allows all of them to access the database with probability $1 - 1/n$ within $2en \ln n$ instants, which is larger than $n$ only by a relatively small factor of $2e \ln n$.

Skip Lists

A skip list is a probabilistic data structure that functions similarly to a binary search tree.

# Example structure of a skip list

[H]--------------------------[35]----------------[T]
[H]----------------[21]------[35]----------------[T]
[H]------[12]------[21]-[24]-[35]------[55]------[T]
[H]-[02]-[12]-[17]-[21]-[24]-[35]-[43]-[55]-[62]-[T]

Operations

  • Search of $k$

    1. Start from the highest level of head H and go as far right without exceeding $k$
    2. Drop one level down and repeat the procedure using the lower-level links until you find $k$ (see the sketch after this list)
  • Insertion of $k$

    1. Search for the correct location
    2. Toss a coin until you get a head, and count the number of tails $t$ you got
    3. Insert $k$ and link it at levels $0 - t$ from the bottom up
  • Deletion

    Deleting an element is just like in a standard doubly linked list
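A minimal sketch of the search procedure described above, assuming each node stores one forward pointer per level (this node layout, and a sentinel head node, are illustrative choices):

# Minimal skip list search: each node keeps one forward pointer per level
class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * (level + 1)   # forward[i] is the next node on level i

def search(head: Node, k, top_level: int):
    node = head
    for level in range(top_level, -1, -1):
        # Go as far right as possible on this level without exceeding k, then drop a level
        while node.forward[level] is not None and node.forward[level].key < k:
            node = node.forward[level]
    candidate = node.forward[0]
    return candidate if candidate is not None and candidate.key == k else None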

Analysis

Expected Levels

The probability of getting $i$ consecutive tails when flipping a coin $i$ times is $1/2^i$. Thus, an $n$ element Skip List has on average $n/2^i$ elements with links on level $i$. Since an element has links only on levels $0 - i$ with probability $1/2^{i + 1}$, the total expected number of link levels per element is

$$\sum_{i = 0}^{\infty} \frac{i + 1}{2^{i + 1}} = \sum_{i = 1}^{\infty} \frac{i}{2^i} = 2.$$

Let $\#(i)$ be the number of elements on level $i$.

Then, $E[\#(i)] = \frac{n}{2^i}$, and by the Markov inequality, the probability of having at least one element at level $i$ satisfies

$$P(\#(i) \geq 1) \leq \frac{E[\#(i)]}{1} = \frac{n}{2^i}.$$

Thus, the probability to have an element on level $2 \log n$ is smaller than $n/2^{2 \log n} = 1/n.$

More generally, the probability of having an element on level $k \log n$ (i.e., of that level being nonempty) is $< n /2^{k \log n} = 1/n^{k - 1}$. The expected value $E(k)$, where $k$ is the lowest integer such that the number of levels is $\leq k \log n$, is

$$E(k) \leq \sum_{k = 1}^{\infty} \frac{k}{n^{k - 1}} = \left( \frac{n}{n - 1} \right)^2.$$

Thus, the expected number of levels is barely larger than $\log n$.

Expected search per level

Thus, the expected number of elements between any two consecutive elements with a link on level $i + 1$ which have links only up to level $i$ is smaller than

$$\frac{0}{2} + \frac{1}{2^2} + \frac{2}{2^3} + \frac{3}{2^4} + ... = 1.$$

So once on level $i$, on average we will have to inspect only two elements on that level before going to a lower level.

Search Time Complexity

On average, there are fewer than $2 \log n$ levels, and 2 elements are visited per level. Therefore, on average, the search runs in time $O(4 \log n) = O(\log n)$.

Space Complexity

For an element with links on levels $0 - i$, we store $O(i + 1)$ pointers, and the expected number of elements whose highest link is on level $i$ is $O(n/2^{i + 1})$. Thus, the total expected space is

$$O \left( \sum_{i = 0}^{\infty}2(i + 1) \frac{n}{2^{i + 1}}\right) = O \left( 2n \sum_{i = 0}^{\infty} \frac{i + 1}{2^{i + 1}} \right) = O(4n) = O(n).$$

Karger's Min Cut Algorithm

Given a graph $G = (V, E)$, Karger's Min Cut algorithm finds a cut $T$ that partitions the vertices $V$ into two non-empty disjoint subsets $X$ and $Y$, minimising the total capacity of edges which have one endpoint in $X$ and the other in $Y$.

A deterministic way to solve this is to compute max flows from one fixed vertex to all other vertices; however, this runs in $O(|V|^4)$.

Procedure

Contraction

The algorithm makes use of contracting edges in a graph. To contract an edge $e(u, v)$, fuse $u$ and $v$ into a single vertex $[uv]$ and replace edges $e(u, x)$ and $e(v, x)$ with a single edge $e([uv], x)$ of weight $w([uv], x) = w(u, x) + w(v, x)$. The obtained graph after this is called $G_{uv}$.

Claims

After collapsing $u$ and $v$ into a single vertex

  • If $u$ and $v$ belong to the same side of a min cut, the capacity of the min cut in $G_{uv}$ is the same as that of $G$.
  • If $u$ and $v$ belong to opposite sides of a min cut, the capacity of the min cut in $G_{uv}$ is larger or equal to the capacity of the min cut in $G$.

Therefore,

  • Let $T_1 = (X_1, Y_1)$ be a min cut in $G_{uv}$

  • Split $[uv]$ back into $u$ and $v$, but keep them on the same side of the cut $T_1$. This produces a cut $T_2$ in $G$ of the same capacity as the min cut $T_1$ in $G_{uv}$.

    Thus, the capacity of the min cut in $G$ can only be smaller than or equal to the capacity of the min cut $T_1$ in $G_{uv}$

Algorithm

  1. While there are more than 2 vertices
    1. Pick an edge with probability proportional to the weight of that edge $$P(e(u, v)) = \frac{w(u, v)}{\sum_{e(p, q) \in E} w(p, q)}$$
    2. Contract the edge, and remove self loops
  2. Take the capacity of that last edge to be the estimate of the capacity of the min cut in $G$ (a sketch of this procedure follows below)
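A minimal sketch of this contraction procedure, assuming the weighted graph is given as a dictionary mapping unordered vertex pairs to edge weights (this representation is an illustrative choice):

# Karger's contraction sketch: edges given as {frozenset({u, v}): weight}
import random

def karger_contract(edges: dict) -> float:
    edges = dict(edges)                                  # work on a copy
    vertices = {v for e in edges for v in e}
    while len(vertices) > 2:
        # Pick an edge with probability proportional to its weight
        pool = list(edges)
        e = random.choices(pool, weights=[edges[x] for x in pool], k=1)[0]
        u, v = tuple(e)
        # Contract: fuse u and v into a single vertex (reusing the name u), dropping self loops
        vertices.remove(v)
        new_edges = {}
        for edge, w in edges.items():
            a, b = tuple(edge)
            a, b = (u if a == v else a), (u if b == v else b)
            if a != b:
                key = frozenset({a, b})
                new_edges[key] = new_edges.get(key, 0) + w
        edges = new_edges
    return sum(edges.values())   # capacity of the cut separating the two remaining vertices

Running karger_contract $\Theta(n^2)$ times and taking the minimum of the returned values gives the estimate of the min cut discussed in the theorems below.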

Theorems

Let $M(G)$ represent the min cut capacity of $G$.

Theorem 1

The probability that the capacity of a min cut in $G_{uv}$ is larger than the capacity of a min cut in $G$ is smaller than $2/n$ where $n = |V|$.

$$P(M(G_{uv}) > M(G)) < \frac{2}{n}.$$

This is because this probability is at most the probability that the contracted edge $e(u, v)$ belongs to the set of edges crossing a min cut. That probability equals the min cut capacity divided by the total capacity of the graph; since the cut isolating any single vertex has capacity at least $M(G)$, the total capacity is at least $\frac{n}{2} M(G)$, giving the final bound of $\frac{2}{n}$.

Theorem 2

If we run edge contraction until only 2 vertices (a single contracted edge) remain, then the probability $\pi$ that the capacity of that edge equals the capacity of the min cut in $G$ is $\Omega \left( \frac{1}{n^2} \right)$.

From the first theorem

$$ \begin{align*} \pi &= P(M(G) = M(G_{n-2})) \\ &= \prod_{i = 1}^{n - 2} P(M(G_i) = M(G_{i - 1})) \\ &\geq \left(1 - \frac{2}{n} \right) \left(1-\frac{2}{n-1}\right) \left(1-\frac{2}{n-2}\right) ... \left(1-\frac{2}{3} \right) \\ &= \frac{n - 2}{n} \times \frac{n - 3}{n - 1} \times \frac{n - 4}{n - 2} \times ... \times \frac{1}{3} \\ &= \frac{2}{n(n-1)}. \end{align*} $$

Since one run of the contraction takes $O(n^2)$ time and succeeds with probability $\Omega \left( \frac{1}{n^2} \right)$, it needs to be run $\Theta(n^2)$ times to succeed with constant probability, resulting in a final runtime of $O(n^4)$ to find the min cut.

Karger's Min Cut Refinement

The algorithm can be improved to run in $O(n^2 \log^3 n)$ with a high probability of being correct, through a divide and conquer approach.

Running the contraction algorithm until $\lfloor \frac{n}{2} \rfloor$ vertices remain takes $O(n^2)$ time, and the chance that the min cut survives is $\approx 1/4$. Hence, by running the contraction down to $\lfloor \frac{n}{2} \rfloor$ vertices 4 times, and then recursing on those smaller graphs, we have a runtime of

$$T(n) = 4T\left(\frac{n}{2}\right) + O\left(n^2\right) = O(n^2 \log n).$$

# Python flavoured pseudo code of the refined algorithm

def karger_refined(G: Graph) -> int:
    V, E = G
    # Base case: only two vertices remain, return the weight of the last edge
    if len(V) == 2:
        return E[V[0], V[1]]

    # Contract down to n/2 vertices 4 independent times and recurse on those smaller graphs
    min_cuts = [karger_refined(contract(G, len(V) // 2)) for _ in range(4)]
    return min(min_cuts)

Let $p(n) = P(\text{success for a graph of size $n$})$, then

$$ \begin{align*} p(n) &= 1 - P(\text{failure on one branch})^4 \\ &= 1 - (1 - P(\text{success on one branch}))^4 \\ &= 1 - \left( 1 - \frac{1}{4}p \left(\frac{n}{2}\right)\right)^4 \\ &> p\left(\frac{n}{2}\right) - \frac{3}{8}p\left(\frac{n}{2}\right)^2 \\ &> \frac{1}{\log n}. \end{align*} $$

If we run our algorithm $(\log n)^2$ times, the probability we are correct $\pi$ is

$$\pi = 1 - \left(1 - \frac{1}{\log n} \right)^{(\log n)^2}$$

Since for all reasonably large $k$ we have $(1 - 1/k)^k \approx e^{-1}$, we get

$$\pi \approx 1 - e^{-\log n} = 1 - 1/n.$$

Hence, if we run karger_refined $(\log n)^2$ times, we have a probability of being correct $1 - 1/n$, with a runtime of

$$O\left(n^2 \log^3 n \right) \ll O\left(n^4\right).$$

Randomised Hashing

If the hash function can be analysed, and a sequence of worst keys is chosen, then a lookup in a hash table can take $O(n)$, though ideally we want to have $O(1)$.

Universal Hashing

Let $H$ be a finite collection of hash functions that map a given universe $U$ of keys into the smaller range $\{0, 1, ..., m - 1\}$. $H$ is said to be universal if for each pair of distinct keys $x, y \in U$, the number of hash functions $h \in H$ for which $h(x) = h(y)$ is $\frac{|H|}{m}$.

Assume that a family of hash functions $H$ is universal, and we are hashing $n$ keys into a hash table of size $m$. Let $C_x$ be the total number of collisions involving key $x$; then the expected value $E[C_x]$ satisfies

$$E[C_x] = \frac{n - 1}{m}.$$

Designing a universal family of hash functions

  • Choose the size of the hash table $m$ to be a prime number
  • Let $r$ be such that the size $|U|$ of the universe of all keys satisfies $m^r \leq |U| \leq m^{r+1}$, i.e. $r = \lfloor \log_m |U|\rfloor$
  • Hence, we can represent each key $x$ in base $m$ where $$\vec{x} = \langle x_0, x_1, ..., x_r\rangle \text{ and } x = \sum_{i=0}^{r}x_i m^i$$
  • Let $\vec{a} = \langle a_0, a_1, ..., a_r \rangle$ be a vector of $r + 1$ randomly chosen elements from the set $\{0, 1, ..., m - 1 \}$
  • Define a corresponding hash function $h_{\vec{a}}(x) = \left( \sum_{i=0}^{r} x_i a_i \right) \mod m$ (a sketch of this construction follows below)
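A minimal sketch of this construction (the prime $m$ and the size of the universe below are arbitrary illustrative choices):

# Universal hashing sketch: h_a(x) = (sum_i x_i * a_i) mod m, with the key x written in base m
import random

m = 101                                    # table size: a prime
U_size = 10 ** 6                           # size of the universe of keys (illustrative)
r = 0
while m ** (r + 1) <= U_size:              # r = floor(log_m |U|)
    r += 1

def digits_base_m(x: int) -> list[int]:
    # The representation <x_0, ..., x_r> of the key x in base m
    out = []
    for _ in range(r + 1):
        out.append(x % m)
        x //= m
    return out

a = [random.randrange(m) for _ in range(r + 1)]   # the random vector a defines one member of H

def h(x: int) -> int:
    return sum(xi * ai for xi, ai in zip(digits_base_m(x), a)) % m

print(h(123456), h(654321))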

Proving universality

Assume $x, y$ are two distinct keys. Then for there to be a hash collision,

$$ \begin{align*} h_{\vec{a}}(x) = h_{\vec{a}}(y) &\Leftrightarrow \sum_{i=0}^{r}x_ia_i = \sum_{i=0}^{r}y_ia_i \mod m \\ &\Leftrightarrow \sum_{i=0}^{r}(x_i - y_i)a_i = 0 \mod m \end{align*} $$

Since $x \neq y$ there exists $0 \leq k \leq r$ such that $x_k \neq y_k$. Assume without loss of generality that $x_0 \neq y_0$; then

$$(x_0 - y_0)a_0 = - \sum_{i=1}^{r}(x_i - y_i)a_i \mod m$$

Since $m$ is a prime, every non-zero element $z \in \{0, 1, ..., m - 1\}$ has a multiplicative inverse $z^{-1}$, such that $z \cdot z^{-1} = 1 \mod m$ and so

$$a_0 = \left( - \sum_{i=1}^{r} (x_i - y_i)a_i \right) (x_0 - y_0)^{-1} \mod m$$

this implies that

  1. for any 2 keys $x, y$ such that $x_0 \neq y_0$ and
  2. for any randomly chosen $r$ numbers $a_1, a_2, ..., a_r$

there exists exactly one $a_0$ such that $h_{\vec{a}}(x) = h_{\vec{a}}(y)$.

Since there are

  • $m^r$ sequences of the form $\langle a_{1}, ..., a_r \rangle$
  • $m^{r+1}$ sequences of the form $\langle a_0, a_1, ..., a_r \rangle$

as a result, the probability to choose a sequence $\vec{a}$ such that $h_{\vec{a}}(x) = h_{\vec{a}}(y)$ is equal to

$$\frac{m^r}{m^{r+1}} = \frac{1}{m}.$$

Thus, the family $H$ is a universal collection of hash functions.

Using universal family of hash functions

  1. Pick $r = \lfloor \log_m |U| \rfloor$, so that $m^r \leq |U| \leq m^{r+1}$
  2. For each run, pick a hash function by randomly picking a vector $\vec{a} = \langle a_0, a_1, ..., a_r \rangle$ such that $0 \leq a_i \leq m - 1$, for all $i$, $0 \leq i \leq r$.
  3. During each run use that function on all keys

Designing a Perfect Hash table

First step

Given $n$ keys, we will construct a hash table of size $m < 2n^2$ using universal hashing. The probability that such a table is collision free will be $> 1/2$.

  1. Pick the smallest prime $m$, such that $n^2 < m < 2n^2$
  2. Pick a random vector $\vec{a}$ and hash all keys using the corresponding hash function $h_{\vec{a}}$

Given $n$ keys, there will be $n \choose 2$ pairs of keys. By universality of the family of hash functions used, for each pair of keys the probability of collision is $1/m$. Since $m \geq n^2$ we have $\frac{1}{m} \leq \frac{1}{n^2}$. Thus, the expected total number of collisions in the table is at most

$${n \choose 2} \frac{1}{m} \leq \frac{n(n - 1)}{2} \frac{1}{n^2} < \frac{1}{2}.$$

Let $X$ be the random variable equal to the number of collisions. Then by the Markov Inequality with $t=1$ we get

$$P\{X \geq 1\} \leq \frac{E[X]}{1} < \frac{1}{2}.$$

Thus, the chance that $k$ independent attempts all produce a collision is $< (1/2)^k$, which rapidly tends to 0. Consequently, after a few random trial-and-error attempts we will obtain a collision-free hash table of size $< 2n^2$.

If $p$ is the probability of a collision, then the expected number of trials $E[N]$ until we hit a collision-free hash table of size $< 2n^2$ is

$$ \begin{align*} E[N] &= 1(1 - p) + 2p(1 - p) + 3p^2(1 - p) + ... \\ &= (1 - p)(1 + 2p + 3p^2 + ...) \\
&= \frac{1}{1 - p} \\ &< 2. \end{align*} $$

Second step

Choose $M$ to be the smallest prime number $> n$.

Thus, $n < M < 2n$. Produce a hash table of size $M$, again by choosing randomly from a universal family of hash functions. Assume that a slot $i$ of this table has $n_i$ many elements. Hash these $n_i$ elements into a secondary hash table of size $m_i < 2n_i^2$. We have to guarantee that the sum total of sizes of all secondary hash tables, i.e., $\sum_{i=1}^{M}m_i$, is linear in $n$. Note that ${n_i \choose 2}$ is the number of collisions in slot $i$, and

$${n_i \choose 2} = \frac{n_i(n_i - 1)}{2} = \frac{n_i^2}{2} - \frac{n_i}{2}.$$

Therefore the total number of collisions in the hash table is $\sum_{i=1}^{M} {n_i \choose 2}$, and since under universal hashing each pair of keys collides with probability $1/M$,

$$ \begin{align*} E\left[ \sum_{i=1}^{M} {n_i \choose 2} \right] &= {n \choose 2}\frac{1}{M} \\ &= \frac{n(n - 1)}{2M} \\ E\left[ \sum_{i=1}^{M} n_i^2 \right] &= 2E\left[ \sum_{i=1}^{M} {n_i \choose 2} \right] + n \end{align*} $$

Thus,

$$E\left[ \sum_{i=1}^{M} n^2_i \right] \leq \frac{n(n-1)}{n} + n = 2n - 1 < 2n.$$

Applying the Markov Inequality to bound the probability that $\sum_{i=1}^{M} n_i^2$ exceeds $4n$, we obtain

$$P \left\{ \sum_{i=1}^{M} n_i^2 > 4n \right\} \leq \frac{E\left[ \sum_{i=1}^{M} n_i^2 \right]}{4n} < \frac{2n}{4n} = \frac{1}{2}$$

Thus, after a few attempts we will produce a hash table of size $M < 2n$ for which $\sum_{i=1}^{M} n_i^2 < 4n$, and if we choose primes $m_i < 2n_i^2$ then $\sum_{i=1}^{M} m_i < 8n$. In this way the size of the primary hash table plus the sizes of all secondary hash tables satisfies

$M + \sum_{i=1}^{M}m_i < 2n + 8n = 10n.$

Gaussian Annulus, Random Projection and Johnson Lindenstrauss Lemmas

Generating random points in d-dimensional spaces

Let us consider a Gaussian random variable $X$ with a zero mean ($E[X] = \mu = 0$) and variance $V[X] = v = 1/2\pi$, then its density is given by


$$f_X(x) = \frac{1}{\sqrt{2\pi v}}e^{-\frac{x^2}{2v}} = e^{-\pi x^2}$$

Assume that we use such $X$ to generate independently the coordinates $\langle x_1, ..., x_d \rangle$ of a random vector $\vec{x}$ from $\mathbb{R}^d$. Then $E[X^2] = E[(X - E[X])^2] = V[X]$, and $E\left[ \frac{x_1^2 + ... + x_d^2}{d} \right] = dV[X]/d = V[X]$.

As a result, given the length $|x| = \sqrt{x_1^2 + ... + x_d^2}$, the expected value of the square of the length of $\vec{x}$ is $E[|x|^2] = d/2\pi$. So, on average, $|x| \approx \frac{\sqrt{d}}{\sqrt{2\pi}} = \Theta(\sqrt{d})$, and if $d$ is large, this holds with high probability.

If we choose 2 points independently, then

$$E[\langle x, y \rangle] = E[x_1 y_1 + ... + x_d y_d] = d E[XY] = d E[X] E[Y] = 0.$$

Hence, the expected value of the scalar product of any 2 vectors with randomly chosen coordinates is zero.

Hence, vectors with randomly chosen coordinates:

  • have approximately the same length $\Theta(\sqrt{d})$
  • any 2 such vectors are likely to be almost orthogonal

Higher Dimensional Balls

Most of the volume of a high dimensional ball is near:

  • Any of its equators (between two close parallel hyper-planes symmetric with respect to the center).
    • (Omitting the detailed computation.) The ratio between the slice $A$ of the sphere that lies above the hyperplane $x_1 = \frac{c}{\sqrt{d - 1}}$ for some constant $c$, and the whole hemisphere $H$, satisfies $$ \frac{A}{H} < \frac{ V(d - 1)\frac{ e^{\frac{-c^2}{2}} }{ c \sqrt{d - 1} } }{ V(d - 1) \frac{ 1 } { 2 \sqrt{d - 1} } } = \frac{2}{c}e^{-\frac{c^2}{2}} $$
  • Its surface (if we have a ball of radius $r$, and another smaller one of radius $r(1 - \epsilon)$, most of the volume of the bigger ball is in the annulus outside the smaller ball).
    • This is because the volume of the smaller ball is a fraction $(1 - \epsilon)^d$ of the volume of the bigger ball, and since $(1 - \epsilon) < 1$, for large values of $d$ this fraction tends to 0 quickly.

Johnson-Lindenstrauss Lemma

For any $\epsilon$, $0 < \epsilon < 1$, and any integer $n$, assume that $k$ satisfies $k > \frac{3}{\gamma \epsilon^2} \ln n$. Then for any set of $n$ points given by the vectors $v_1, ..., v_n$ in $\mathbb{R}^d$, with probability at least $1 - 3/(2n)$, the random projection $f': \mathbb{R}^d \rightarrow \mathbb{R}^k$ has the property that for ALL pairs of points $v_i, v_j$

$$\left|\, | f'(v_i - v_j)| - |v_i - v_j| \,\right| \leq \epsilon | v_i - v_j|.$$

Thus, $f'(v)$ "almost" preserves distances between points given by vectors $v_i$ despite the reduction of dimensionality from $d \gg k$ to only $k$.

Application of Johnson-Lindenstrauss Lemma

  • Choose $k$ random vectors by choosing each coordinate of every vector using a unit variance Gaussian
  • Pre-process data by projecting onto the $k \ll d$ dimensional subspace spanned by the $k$ random vectors
  • For comparing new data with current data, map the new vector $y \in \mathbb{R}^d$ to its projection $f'(y) = f(y) / \sqrt{k}$.
  • Then nearest neighbours of $f'(y)$ can be used in the projected $k$-dimensional space instead of in dimension $d \gg k$ (see the sketch below).
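A minimal numpy sketch of this pre-processing step (the number of points, the dimension, $\epsilon$, and treating the constant $\gamma$ as 1 are all illustrative choices):

# Random projection sketch: project d-dimensional points onto k random Gaussian directions
import numpy as np

rng = np.random.default_rng(0)
n, d, eps = 500, 2000, 0.25
k = int(np.ceil(3 * np.log(n) / eps ** 2))    # k of the order (3 / eps^2) ln n, as in the lemma

X = rng.normal(size=(n, d))                   # some data points in R^d (made up for illustration)
R = rng.normal(size=(d, k))                   # k random vectors with unit-variance Gaussian coordinates

X_proj = (X @ R) / np.sqrt(k)                 # f'(v) = f(v) / sqrt(k)

# Compare the distance between the first two points before and after the projection
print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(X_proj[0] - X_proj[1]))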

Page Rank

The Page Rank algorithm aims to solve the problem of how to order webpages.

Setup

Consider all the webpages $P_i$ on the entire internet as vertices of a directed graph, where a directed edge $P_i \rightarrow P_j$ exists if $P_i$ has a link to $P_j$.

Notation:

  • $\rho(P)$ = the rank of a page (to be assigned)
  • $\#(P)$ = the number of outgoing links on a web page

A web page $P$ should have a high rank only if it is pointed at by many pages $P_i$ which:

  1. themselves have a high rank $\rho(P_i)$
  2. and do not point to an excessive number of other web pages, i.e. $\#(P_i)$ is reasonably small.

So we want the following system of equations to be satisfied:

$$\left\{\rho(P) = \sum_{P_i \rightarrow P} \frac{\rho(P_i)}{\#(P_i)} \right\}_{P \in WWW}$$

We have a large sparse matrix $G_1$ of size $M \times M$, where $M = |WWW|$

$$ G_1 = \begin{pmatrix} g(1, 1) & \ldots & g(1, j) & \ldots & g(1, M) \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ g(i, 1) & \ldots & g(i, j) & \ldots & g(i, M) \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ g(M, 1) & \ldots & g(M, j) & \ldots & g(M, M) \end{pmatrix} \ $$

$$ g(i, j) =
\begin{cases} \frac{1}{\#(P_i)} & \text{ if } P_i \rightarrow P_j \\ 0 & \text{otherwise} \end{cases} $$

However, because $G_1$ is mostly a sparse matrix, it resembles

$$ G_1 = \begin{pmatrix} & & \vdots & & & & \vdots & & \\ & & \vdots & & & & \vdots & & \\ \ldots & 0 & \frac{1}{k} & 0 & \ldots & 0 & \frac{1}{k} & 0 & \ldots \\ & & \vdots & & & & \vdots & & \\ & & \vdots & & & & \vdots & & \\ \end{pmatrix} \ $$

where $k$ is equal to $\#(P_i)$, the number of pages which the page $P_i$ has links to. Hence, the system of linear equations can be represented as

$$\mathbf{r}^\intercal = \mathbf{r}^\intercal G_1$$

where

$$\mathbf{r}^\intercal = (\rho(P_1), \rho(P_2), ..., \rho(P_M)).$$

Note that $G_1$ is a Markov matrix, and $\mathbf{r}^\intercal$ is a left-hand eigenvector of $G_1$ corresponding to the eigenvalue 1. Thus, finding the ranks of web pages is reduced to finding an eigenvector of $G_1$ corresponding to the eigenvalue 1.

If we model a random walk on the graph of this matrix, there are 2 issues:

  1. What should we do when we get to a webpage with no outgoing links?

    Solution: Jump to a randomly chosen webpage when at a node with no outgoing links.

    I.e. the first row in $G_1$ below corresponds to a dangling page, with no outgoing links. $G_2$ fixes this by making it point to all pages with equal probability. Such a matrix is row stochastic, meaning that each row sums up to 1.

    $$ G_1 = \begin{pmatrix} & & \vdots & & & & \vdots & & \\ \ldots & 0 & 0 & 0 & \ldots & 0 & 0 & 0 & \ldots \\ & & \vdots & & & & \vdots & & \\ \ldots & 0 & \frac{1}{\#(P_i)} & 0 & \ldots & 0 & \frac{1}{\#(P_i)} & 0 & \ldots \\ & & \vdots & & & & \vdots & & \\ \end{pmatrix} \\ \\ G_2 = \begin{pmatrix} & & \vdots & & & & \vdots & & \\ \ldots & \frac{1}{M} & \frac{1}{M} & \frac{1}{M} & \ldots & \frac{1}{M} & \frac{1}{M} & \frac{1}{M} & \ldots \\ & & \vdots & & & & \vdots & & \\ \ldots & 0 & \frac{1}{\#(P_i)} & 0 & \ldots & 0 & \frac{1}{\#(P_i)} & 0 & \ldots \\ & & \vdots & & & & \vdots & & \\ \end{pmatrix} \ $$

  2. If $T$ is the walk length, then as $T \rightarrow \infty$, the values $N(P)/T$ should converge. This becomes an issue if the graph is disconnected or if it is bipartite (where the probability of being on one side depends on whether the path length is odd or even).

    Solution: Randomly jump to a new page after some time.

    This transformation does not change the rows corresponding to dangling webpages: $\alpha / M + (1 - \alpha) / M = 1/M$

    $$ G = \begin{pmatrix} & & \vdots & & & & \vdots & & \\ \ldots & \frac{1}{M} & \frac{1}{M} & \frac{1}{M} & \ldots & \frac{1}{M} & \frac{1}{M} & \frac{1}{M} & \ldots \\ & & \vdots & & & & \vdots & & \\ \ldots & \frac{1 - \alpha}{M} & \frac{\alpha}{\#(P_i)} + \frac{1 - \alpha}{M} & \frac{1 - \alpha}{M} & \ldots & \frac{1 - \alpha}{M} & \frac{\alpha}{\#(P_i)} + \frac{1 - \alpha}{M} & \frac{1 - \alpha}{M} & \ldots \\ & & \vdots & & & & \vdots & & \\ \end{pmatrix} \ $$

These two solutions allow the ranks to converge, because the model works like a well behaved Markov Chain.

To implement the first solution:

Let $\mathbf{d}$ be the column vector which is 1 at the positions $i$ corresponding to dangling webpages and 0 elsewhere

$$\mathbf{d}^\intercal = (0 \ ... \ 0 \ 1 \ 0 \ ... \ 0 \ 1 \ 0 \ ... \ 0 \ 1 \ 0 \ ... \ 0)$$

Let $\mathbf{e}$ be 1 everywhere: $\mathbf{e}^\intercal = (1 \ 1 \ ... \ 1)$.

Then $$G_2 = G_1 + \frac{1}{M} \mathbf{d} \mathbf{e}^\intercal$$

To implement the second solution as well, we get $G$ as:

$$G = \alpha G_2 + \frac{1 - \alpha}{M} \mathbf{e} \mathbf{e}^\intercal = \alpha \left( G_1 + \frac{1}{M} \mathbf{d} \mathbf{e}^\intercal \right) + \frac{1 - \alpha}{M} \mathbf{e} \mathbf{e}^\intercal$$

$G$ is no longer sparse, but the vector matrix product $\mathbf{x}^\intercal G$ for vectors $\mathbf{x}$ whose coordinates sum up to 1 can be decomposed as:

$$\mathbf{x}^\intercal G = \alpha \mathbf{x}^\intercal G_1 + \frac{1}{M} \left(1 - \alpha (1 - \mathbf{x}^\intercal \mathbf{d}) \right) \mathbf{e}^\intercal$$
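A minimal sketch of one PageRank power-iteration step based on this decomposition, with $G_1$ stored as a sparse matrix (the tiny 4-page link structure below is made up purely for illustration):

# PageRank iteration using x^T G = alpha x^T G_1 + (1/M)(1 - alpha(1 - x^T d)) e^T
import numpy as np
from scipy.sparse import csr_matrix

# Made-up web graph: page 0 -> {1, 2}, page 1 -> {2}, page 2 dangling, page 3 -> {0}
M = 4
rows, cols, vals = [0, 0, 1, 3], [1, 2, 2, 0], [0.5, 0.5, 1.0, 1.0]
G1 = csr_matrix((vals, (rows, cols)), shape=(M, M))
d = np.array([0.0, 0.0, 1.0, 0.0])            # indicator of dangling pages
alpha = 0.85

def step(x):
    # x is a probability distribution over pages (its coordinates sum to 1)
    return alpha * (G1.T @ x) + (1 - alpha * (1 - x @ d)) / M * np.ones(M)

x = np.full(M, 1 / M)
for _ in range(100):
    x = step(x)
print(x, x.sum())                             # approximate PageRank vector, still summing to 1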

Markov Chains (Discrete Time Markov Processes)

A (finite) Markov Chain is given by a finite set of states $S = \{P_i\}_{i \leq M}$ and by a row stochastic matrix $G$. The process can start at $t = 0$ in any of its states and continue its evolution by going at every discrete moment of time from its present state to another randomly chosen state.

Simple Example: Weather Model

Consider a simple 3-state Markov chain representing weather conditions:

State Transition Diagram:

[Diagram: three weather states with the transition probabilities given by the matrix $G$ below.]

Transition Matrix G:

$$ G = \begin{pmatrix} 0.7 & 0.2 & 0.1 \\ 0.3 & 0.4 & 0.3 \\ 0.2 & 0.3 & 0.5 \end{pmatrix} $$

Each row sums to 1 (row stochastic property), and $G_{ij}$ represents the probability of transitioning from state $i$ to state $j$.
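As a quick numerical check (an illustrative computation), iterating any initial distribution through this $G$ converges to a fixed distribution, foreshadowing the stationary-distribution property discussed below:

# Iterating q^(t) = q^(0) G^t on the 3-state weather chain
import numpy as np

G = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])

q = np.array([1.0, 0.0, 0.0])   # start in state 1 with certainty
for _ in range(50):
    q = q @ G
print(q)                        # the limit is the same for any starting distribution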

The model of the random walk on the graph is an example of a Markov Chain:

  • States are "being at a webpage $P_i$", so we have in total $M$ states, one for each web page $P_i$, $1 \leq i \leq M$.
  • We start from a randomly chosen starting web page.
  • Thus, $q^{(0)}(i) = \frac{1}{M}$ for all $i$, because all pages are equally likely to be the starting page.
  • If a webpage $P_i$ is not a dangling webpage, follow a link on the page $P_i$ which points to another webpage $P_j$ with prob $\frac{\alpha}{\#(P_i)}$, or jump directly to a page $P_j$ with probability $\frac{1 - \alpha}{M}$.
    • Hence if a page $P_i$ does not point to a page $P_j$ then $g_{i, j} = \frac{1 - \alpha}{M}$.
  • If $P_i$ is a dangling page then $g_{i, j} = \frac{1}{M}$ for every page $P_j$ (including the same page $P_i$).

General Markov Chains

  • The Google matrix induces a strongly connected graph (as there is a directed edge between any two vertices), and so is irreducible.
  • The Google matrix is aperiodic
    • A state $P_i$ in a Markov chain is periodic if there exists an integer $K > 1$ s.t. all loops in its underlying graph which contain the vertex $P_i$ have length divisible by $K$.
    • Markov chains which do not have any periodic states are called aperiodic
  • Any finite, irreducible and aperiodic Markov chain has the following properties:
    1. For every initial distribution of states $\mathbf{q}^{(0)}$ the value of $\mathbf{q}^{(t)} = \mathbf{q}^{(0)}G^t$ converges as $t \rightarrow \infty$ to a unique stationary distribution $\mathbf{q}$, i.e., converges to a unique distribution $\mathbf{q}$ which is independent of $\mathbf{q}^{(0)}$ and satisfies $\mathbf{q} = \mathbf{q}G$.
    2. Note: in the above $\mathbf{q}$ is a row vector, to avoid having to always transpose it.
    3. Let $N(P_i, T)$ be the number of times the system has been in state $P_i$ during $T$ many transitions of such a Markov chain; then $$\lim_{T \rightarrow \infty} \frac{N(P_i, T)}{T}= \mathbf{q_i}$$

Application to PageRank

The general theorem on Markov chains implies that:

  • 1 is the eigenvalue of $G$ of the largest absolute value, and the stationary distribution $\mathbf{q}$ is the corresponding left-hand eigenvector, $\mathbf{q}^\intercal = \mathbf{q}^\intercal G$.
  • $\mathbf{q}$ is unique, i.e., if $\mathbf{q}_1^\intercal = \mathbf{q}_1^\intercal G$, then $\mathbf{q}_1 = \mathbf{q}$
  • Distribution $\mathbf{q}$ can be obtained by starting with an arbitrary initial probability distribution $\mathbf{q}_0$ and obtain $\mathbf{q}$ as $\lim_{k \rightarrow \infty} \mathbf{q}_0^\intercal G^k$
  • An approximation $\mathbf{\tilde{q}}$ of $\mathbf{q}$ can be obtained by taking $\mathbf{q}_0^\intercal = (1/M, 1/M, ..., 1/M)$ and a sufficiently large $k$, and computing $\mathbf{\tilde{q}} = \mathbf{q}_0 G^k$ iteratively
  • The $i^{th}$ coordinate of such obtained distribution $\mathbf{q}^\intercal = (q_1, ..., q_i, ..., q_M)$ roughly gives the ratio $N(P_i, T)/T$ where $N(P_i, T)$ is the number of times $P_i$ has been visited during a surfing session of length $T$.

Finding alpha

How close to 1 should $\alpha$ be?

The expected length $lh$ of surfing between two teleportations can be found as:

$$ \begin{align*} E(lh) &= 0(1 - \alpha) + \alpha(1 - \alpha) + ... + k \alpha^k (1 - \alpha) + ... \\ &= \alpha(1 - \alpha)(1 + 2 \alpha + ... + k \alpha^{k - 1} + ...) \\ &= \frac{\alpha}{1 - \alpha}. \end{align*} $$

Google uses $\alpha = 0.85$ (obtained empirically), giving an expected surfing length $\frac{0.85}{1- 0.85}\approx 5.7$, very close to 6 (coming from the idea of six degrees of separation), and it takes approx $50 - 100$ iterations for convergence.

Larger values produce more accurate representation of "importance" of a webpage, but the convergence slows down fast.

The error of the approximation of $\mathbf{q}$ by $\mathbf{q_0}G^k$ decreases approximately as $\alpha^k$. More importantly, increasing $\alpha$ increases the sensitivity of the resulting PageRank. This is undesirable, as internet content and structure change on a daily basis, while PageRank should change slowly.

Hidden Markov Models and the Viterbi Algorithm and its applications

A Hidden Markov Model is a Markov Model with two kinds of components:

  • observations
  • hidden states

Hidden Markov Model Structure

[Diagram: a chain of hidden states $S$ connected by solid transition arrows, each emitting an observable outcome $O$ via a dashed arrow.]

The diagram shows:

  • $\textbf{S}$: Hidden states that follow Markov transitions
  • $\textbf{O}$: Observable outcomes/emissions
  • Solid arrows: State transition probabilities
  • Dashed arrows: Emission probabilities from hidden states to observations

An example of its application: given a sequence of manifestations, how can we determine which sequence of states caused it?

We do it in a way that maximises the likelihood that we are correct which is what the Viterbi algorithm does. It is a dynamic programming algorithm that produces the most likely estimate.

Hidden Markov Models

  • We are given a finite Markov chain, consisting of a set of states $S = \{s_1, s_2, ..., s_K \}$ (together with its transition probabilities).
  • We are also given the initial probabilities $\pi_1, \pi_2, ..., \pi_K$ of states.
  • However, we have no direct access to the states; we only have the set of observables $O = \{o_1, o_2, ..., o_N \}$, which are the possible manifestations of the states $S = \{s_1, s_2, ..., s_K \}$.
  • We are also given the emission matrix $E$ of size $K \times N$ where entry $e(i, k)$ represents the probability that when the chain is in state $s_i$, the observable outcome will be $o_k$.

Maximum Likelihood Estimation

Likelihood is, in a sense, an inverse of probability.

  • Assume there are $n$ balls labeled 1 to $n$, and you can draw a single ball, look at its value, and have to estimate the value of $n$.
  • Assume you drew the ball numbered $k$; any particular ball is drawn with probability $1 / n$. You therefore have the highest chance of having drawn $k$ when $n$ is as small as possible, meaning $n = k$ (as you know there are at least $k$ balls).
  • In this case the MLE estimator is $N(X) = X$, and $E(X)$ is given by $$\mu = \sum_{i=1}^{n}\left( i \times \frac{1}{n} \right) = \frac{n(n + 1)}{2n} = \frac{n + 1}{2}.$$ Thus, this estimator is biased, because its expected value is only about half of $n$.
  • If we use the estimator $Y(X) = 2X - 1$, then the expected value of $Y$ is $$\sum_{i = 1}^{n} \frac{2i - 1}{n} = \frac{2\sum_{i=1}^{n}i}{n} - \frac{\sum_{i=1}^n 1}{n} = \frac{2n(n+1)}{2n} - 1 = n$$ and this is unbiased.
  • As the size of the sample increases, ML estimate approaches the best possible estimate.

Viterbi Algorithm

Assume we are given a sequence of observable outcomes $(y_1, y_2, ..., y_T)$. The goal is to find the sequence $(x_1, x_2, ..., x_T)$ of states of the Markov chain for which the likelihood that such a sequence is the cause of the observed sequence of outcomes is the highest.

  • Given $\vec{y}$ observed, we could pick the sequence of states $\vec{x}$ for which the value of $P(\vec{x}, \vec{y})$ is the largest.
  • This is not feasible, as if the total number of states is $K$ and the observed sequence is of length $T$ then there are $K^T$ sequences to try.
  • Instead the Viterbi algorithm solves this problem in $O(T \times K^2)$ using dynamic programming.

Algorithm

We solve all subproblems $S(i, k)$ for every $1 \leq i \leq T$ and every $1 \leq k \leq K$:

Subproblem $S(i, k)$: Find the sequence of states $(x_1, ..., x_i)$ such that $x_i = s_k$ and such that the probability of observing the outcome $(y_1, ..., y_i)$ is maximal.
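A minimal sketch of the Viterbi dynamic programme over these subproblems, assuming the transition probabilities are given as a $K \times K$ matrix A, the emissions as the $K \times N$ matrix E, and the initial probabilities as a vector pi (these variable names are illustrative):

# Viterbi sketch: most likely sequence of states for an observed sequence of outcomes
import numpy as np

def viterbi(obs, A, E, pi):
    # obs: indices of observed outcomes y_1..y_T; A: K x K transitions; E: K x N emissions
    K, T = A.shape[0], len(obs)
    V = np.zeros((T, K))                 # V[i, k] = largest probability of a state sequence ending in s_k
    back = np.zeros((T, K), dtype=int)   # back[i, k] = best predecessor state
    V[0] = pi * E[:, obs[0]]
    for i in range(1, T):
        for k in range(K):
            scores = V[i - 1] * A[:, k] * E[k, obs[i]]
            back[i, k] = int(np.argmax(scores))
            V[i, k] = scores[back[i, k]]
    # Trace the most likely sequence of states backwards
    path = [int(np.argmax(V[-1]))]
    for i in range(T - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return path[::-1]

The two nested loops over $i$ and $k$, with an inner maximisation over the $K$ possible predecessors, give the $O(T \times K^2)$ running time claimed above.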

A good intro to Hidden Markov Models and the Viterbi Algorithm.

Recommender Systems

The main purpose of recommender systems is to recommend content / products to users that they may enjoy.

Two major kinds of recommender systems:

  • Content based: items are recommended by their intrinsic similarity
    • This suffers from the problem that assessing content usually has to be done by humans, because content similarity is a semantic notion (which computers do not understand well)
  • Collaborative filtering: items are recommended based on some similarity measure between users and between items based on rating of items by the community of users
    • This tends to be superior in performance and does not rely on human advice
    • There are two main approaches:
    • Neighbourhood Method:
      • Based on the similarity of users
      • If $A$ and $B$ gave similar evaluations to movies that they have both seen, and $A$ liked something $B$ has not seen, then $B$ may like it as well
    • Latent Factor Method:
      • Based on the similarity of items
      • Assume two movies $M_1$ and $M_2$ had similar ratings, then if someone liked $M_1$, then it is reasonable to recommend movie $M_2$ to such a user

We can construct a sparsely populated table of ratings $R$; the rows correspond to movies, the columns to users. The entry $r(i, j)$ of the table, if non-empty, represents the rating user $U_i$ gave to movie $M_j$. Each entry may have some rating in the range $0 - 5$ (or a similar relatively small rating range, usually with at most 10 or so levels).

Neighbourhood Method

We replace these rating integers with more informative numbers.

  • First we compute the mean $\bar{r}$ of all ratings for all movies
  • Then we obtain $\bar{R}$ by subtracting $\bar{r}$ from all values
  • Now if a value $r'(i, j) > 0$ then $U_i$ liked $M_j$ above the global average ($r'(i, j)$ can be positive or negative). However, some ratings may be influenced by "hype" or "trendiness", a.k.a. systematic biases.
  • To remove this:
    • For every user $U_i$, introduce $v_i$, the "individual bias" of user $U_i$, reflecting tendency to give overall higher or lower scores.
    • For every movie $M_j$, introduce $\mu_j$, the "hype bias" of movie $M_j$ which is how "fashionable" the movie is (which fades with time).
    • Both systematic biases can be removed by seeking the values of $v_i$ and $\mu_j$ which minimises the expression $$S(\vec{v}, \vec{\mu}) = \sum_{(i, j) \in R}(r'(i, j) - v_i - \mu_j)^2$$
    • Note that $\mu$'s are constant shifts of rows (each row corresponding to a movie) and $v$'s are constant shifts of columns (each corresponding to a user). This is a Least Squares problem, and $S(\vec{v}, \vec{\mu})$ achieves a minimum when all the partial derivatives are equal to 0: $$ \begin{align*} \frac{\partial}{\partial v_i}S(\vec{v}, \vec{\mu}) &= -2 \sum_{j:(i, j) \in R}(r'(i, j) - v_i - \mu_j) = 0 \\
      \frac{\partial}{\partial \mu_j}S(\vec{v}, \vec{\mu}) &= -2 \sum_{i:(i, j) \in R}(r'(i, j) - v_i - \mu_j) = 0
      \end{align*} $$
    • Unfortunately, Least Squares fits usually suffer from overfitting. The solution for this is called regularisation: where we introduce a term which penalises for large values of the variables. Thus instead, we minimise the sum: $$S(\vec{v}, \vec{\mu}, \lambda) = \sum_{(i, j) \in R}(r'(i, j) - v_i - \mu_j)^2 + \lambda \left( \sum_i v_i^2 + \sum_j \mu_j^2 \right)$$ where $\lambda$ is a suitably chosen small positive constant, usually $10^{-10} \leq \lambda \leq 10^{-2}$ (the optimal value of $\lambda$ can also be "learned")
    • We now obtain $\tilde{R}$ from $\bar{R}$ and are ready to estimate similarities of users and movies

Neighbourhood Method - Similarity of users

One of the most frequently used measures of similarity of users is the cosine similarity measure.

If we want to compare 2 users $U_i$ and $U_k$, we find all movies that both users have rated and delete all other entries. We obtain two column vectors $\vec{u}_i$ and $\vec{u}_k$, and the similarity of the two users is measured by the cosine of the angle between these two vectors.

$$ \begin{align*} \text{sim}(U_i, U_k) &= \cos(u_i, u_k) \\ &= \frac{\langle \vec{u}_i, \vec{u}_k \rangle}{|\vec{u}_i| \cdot |\vec{u}_k|} \end{align*} $$
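A minimal sketch of this similarity computation, assuming the adjusted ratings $\tilde{R}$ are stored as a dictionary mapping (user, movie) pairs to ratings (an illustrative representation):

# Cosine similarity of two users over the movies they have both rated
import math

def user_similarity(R: dict, i: int, k: int) -> float:
    # R maps (user, movie) -> adjusted rating; keep only the movies rated by both users
    common = [m for (u, m) in R if u == i and (k, m) in R]
    if not common:
        return 0.0
    ui = [R[(i, m)] for m in common]
    uk = [R[(k, m)] for m in common]
    dot = sum(a * b for a, b in zip(ui, uk))
    norm = math.sqrt(sum(a * a for a in ui)) * math.sqrt(sum(b * b for b in uk))
    return dot / norm if norm > 0 else 0.0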

We can now predict the rating a user $U_i$ would give to a movie $M_j$ they have not seen as follows:

  1. Among all users who have seen $M_j$, pick $L$ many users $U_{k_l}$ with $L$ largest values of $|\text{sim}(U_i, U_{k_l})|$ (i.e. the $L$ top similar and dissimilar values).
  2. We now predict the rating user $U_i$ would give to movie $M_j$ as $$\text{pred}(i, j) = \bar{r} + v_i + \mu_j + \frac{\sum_{1 \leq l \leq L} \text{sim}(U_i, U_{k_l}) \tilde{r}(j, k_l)}{\sum_{1 \leq l \leq L} |\text{sim}(U_i, U_{k_l})|}$$
  3. We then recommend to user $U_i$ movie $M_j$ for which the predicted rating $\text{pred}(i, j)$ is the highest
    • The "hype factor" $\mu_j$ is brought back when deciding what to recommend).
    • $v_i$ is constant across movies, so it is insignificant; adding it is mostly for test purposes because it will realistically predict the possible rating of $U_i$ of $M_j$ allowing easy comparison in tests

Neighbourhood Method - Similarity of movies

We can estimate the similarity of movies in a similar way, working on the rows of $\tilde{R}$ (instead of the columns). We predict the rating user $U_i$ would give to the movie $M_j$ as

$$ \text{pred}(j, i) = \bar{r} + v_i + \mu_j + \frac{\sum_{1 \leq l \leq L} \text{sim}(M_j, M_{n_l}) \tilde{r}(n_l, i)}{\sum_{1 \leq l \leq L} |\text{sim}(M_j, M_{n_l})|} $$

and recommend the movie $M_j$ that has the highest value of $\text{pred}(j, i)$.

Latent Factor Method

This method relies on several heuristics

  • There are features movies possess which appeal to different tastes and which determine how much a user likes a movie, e.g. "action", "comedy", "romance", "famous actors", "special effects"

  • Enumerate these features as $f_1, f_2, ..., f_N$, where $N$ is of the order of a few tens to a few hundreds. A movie can have each of these features, say $f_i$, to an extent $e_i$, where say $0 \leq e_i \leq 10$.

  • Each movie $M_j$ has a vector $\vec{e}^j$ of length $N$, and we can form a matrix $F$ s.t. rows of $F$ correspond to movies and columns correspond to features. I.e. if we have $M$ movies:

    $$ F = \begin{pmatrix} F_{1, 1} & \ldots & F_{1, N} \\ \vdots & \vdots & \vdots \\ \vdots & \vdots & \vdots \\ \vdots & \vdots & \vdots \\ \vdots & \vdots & \vdots \\ \vdots & \vdots & \vdots \\ F_{M, 1} & \ldots & F_{M, N} \end{pmatrix} $$

    resulting in a very tall matrix

  • We can associate each user $U_i$ with a column vector $\vec{l}^i$ s.t. its $m^{th}$ coordinate is a number, $0 \leq \vec{l}^i_m \leq 10$ which represents how much a user likes a feature $f_m$ in a movie.

  • We can now form a matrix $L$ whose rows correspond to features and columns correspond to users, $$ L = \begin{pmatrix} L_{1, 1} & \ldots & \ldots & \ldots & \ldots & \ldots & L_{1, \#U} \\ \vdots & \ldots & \ldots & \ldots & \ldots & \ldots & \vdots \\ L_{N, 1} & \ldots & \ldots & \ldots & \ldots & \ldots & L_{N, \#U} \end{pmatrix} $$ where $\#U$ is the number of users, resulting in a very wide matrix

Assume that we have access to $L$ and $F$. Then to predict how much $U_i$ likes $M_j$, we evaluate

$$E(j, i) = \sum_{1 \leq m \leq N} (\vec{e}^j)_m (\vec{l}^i)_m = \langle \vec{e}^j, \vec{l}^i \rangle.$$

and as a result can generate $E = F \times L$, resulting in a very large matrix. However, the issue is that we cannot determine which few dozen features are relevant out of a few hundred features.

Finding relevant features

Let $N$ be the number of features we want (typically $20 \leq N \leq 200$), $\#M$ be the number of movies, and $\#U$ be the number of users.

Fill matrices $F$ of size $\#M \times N$ and $L$ of size $N \times \#U$ with variables $F(j, m)$ and $L(m, i)$ whose values have yet to be determined.

Then solve the least squares problem in the variables $$\{ F(j, m): 1 \leq j \leq \#M; 1 \leq m \leq N \} \cup \{ L(m, i) : 1 \leq m \leq N; 1 \leq i \leq \#U \}$$

minimise

$$S(\vec{F}, \vec{L}) = \sum_{(j, i):R(j, i)} \left( \sum_{1 \leq m \leq N} F(j, m) \cdot L(m, i) - R(j, i) \right)^2$$

We can attempt to find the minimum by finding where the partial derivatives of $S(\vec{F}, \vec{L})$ are equal to 0. However:

  • this results in a huge system of cubic equations that cannot be solved feasibly
  • an optimisation problem is not convex, so search for the optimal solution can end up in a local minimum

Solution:

  1. Set all variables $F(j, m)$ to the same initial value $F^{(0)}(j, m)$, e.g. a median rating value

  2. Now solve the following Least Squares problem for the variables $$\{ L(m, i) : 1 \leq m \leq N; 1 \leq i \leq \#U \}$$ minimise $$\sum_{(j, i): R(j, i)} \left(\sum_{1 \leq m \leq N} F^{(0)}(j, m) \cdot L(m, i) - R(j, i) \right)^2$$ which is now a linear Least Squares problem: since the $F^{(0)}(j, m)$ are fixed constants, setting the partial derivatives to 0 yields a system of linear equations.

  3. Let $L^{(0)}(m, i)$ be the solutions to such a Least Squares problem.

  4. We now solve the following Least Squares problem in variables $$\{ F(j, m): 1 \leq j \leq \#M; 1 \leq m \leq N \}$$ minimise $$\sum_{(j, i):R(j, i)} \left( \sum_{1 \leq m \leq N} F(j, m) \cdot L^{(0)}(m, i) - R(j, i) \right)^2$$

  5. Now, we keep alternating between taking either $\{ L(m, i) : 1 \leq m \leq N; 1 \leq i \leq \#U \}$ or $\{ F(j, m): 1 \leq j \leq \#M; 1 \leq m \leq N \}$ as free variables, fixing the values of the other set from the previously obtained solution to the corresponding Least Squares problem.

    This method is sometimes called the Method of Alternating Projections (a minimal sketch is given at the end of this section).

  6. We stop such iterations when $$\sum_{(j, m)}(F^{(k)}(j, m) - F^{(k - 1)}(j, m))^2 + \sum_{(i, m)}(L^{(k)}(m, i) - L^{(k - 1)}(m, i))^2$$ becomes smaller than an accuracy threshold $\epsilon > 0$.

  7. After we obtain the values $F^{(k)}(j, m)$ and $L^{(k)}(m, i)$ from the last iteration $k$, we form the corresponding matrices $F$ of size $\#M \times N$ and $L$ of size $N \times \#U$ as $$ \begin{align*} \tilde{F} &= \left( F^{(k)}(j, m) : 1 \leq j \leq \#M; 1 \leq m \leq N \right) \\ \tilde{L} &= \left( L^{(k)}(m, i) : 1 \leq m \leq N; 1 \leq i \leq \#U \right) \end{align*} $$

  8. We finally set $E = \tilde{F} \times \tilde{L}$ as the final matrix of predicted ratings of all movies by all users, where $E(j, i)$ is the prediction of the rating of movie $M_j$ by user $U_i$.

  • Note we do not know what these latent features actually are
  • However, the Latent Factor Method performs remarkably well in many domains
  • Though in the Netflix Challenge competition to design the best recommendation algorithm, the top algorithms were combinations of dozens of components / algorithms with empirically tuned parameters
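A minimal numpy sketch of the alternating least-squares iteration described in steps 1-8 above, using dense arrays and a boolean mask for the observed ratings (illustrative simplifications: the regularisation term is omitted, and a tiny random perturbation is added to the constant initialisation of $F$ so that the features do not collapse onto each other):

# Alternating least squares sketch for the latent factor model E ~ F @ L
import numpy as np

rng = np.random.default_rng(0)

def als(R, observed, N, iters=20):
    # R: (#M x #U) ratings, with arbitrary values wherever observed is False
    nM, nU = R.shape
    F = np.full((nM, N), np.median(R[observed])) + 0.01 * rng.random((nM, N))
    L = rng.random((N, nU))
    for _ in range(iters):
        # Fix F and solve a linear least squares problem for each column of L
        for i in range(nU):
            rows = observed[:, i]
            if rows.any():
                L[:, i], *_ = np.linalg.lstsq(F[rows], R[rows, i], rcond=None)
        # Fix L and solve a linear least squares problem for each row of F
        for j in range(nM):
            cols = observed[j, :]
            if cols.any():
                F[j, :], *_ = np.linalg.lstsq(L[:, cols].T, R[j, cols], rcond=None)
    return F, L

# Tiny made-up example: 6 movies, 4 users, 2 latent features
R = rng.integers(0, 6, size=(6, 4)).astype(float)
observed = rng.random((6, 4)) < 0.7
F, L = als(R, observed, N=2)
E = F @ L          # predicted ratings for every (movie, user) pair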

Clustering Algorithms

Clustering algorithms are a type of unsupervised learning used in data science.

There are two kinds of clusters:

  1. center-based clusters
  2. high density clusters

A good clustering algorithm should be able to handle both kinds.

Data Representation

There are two common representations of points

As vectors in $\mathbb{R}^d$

  • This is suitable when you have several numerical measurements (and each measurement maps to a dimension).
  • Note $d$ can be extremely large, corresponding to thousands of dimensions or more, and it can be complex to store and handle such data. This is where the Johnson-Lindenstrauss Theorem comes into play.

As a weighted graph

  • Data points are represented as vertices of the graph
  • The weights of the edges reflect the degree of similarity (or dissimilarity) of the data points. The distance between two data points $x, y \in \mathbb{R}^d$ can be defined as either $$d(x, y) = \sum_{i=1}^d |x_i - y_i| \quad \text{or} \quad d(x, y) = \sqrt{\sum_{i=1}^d (x_i - y_i)^2}$$
  • If the scales of the $i^{th}$ and $j^{th}$ coordinates $x_i$ and $x_j$ differ significantly, or if they are not of equal importance, we might consider instead $$d(x, y)^2 = \sum_{i=1}^d w_i(x_i - y_i)^2$$ where the weights $w_i$ are chosen to normalise different variances of $x_i$ and $x_j$ or to encode their relative significances.
  • Graph representation of data is often much more compact than vectors, as it does not suffer from problems of high dimensionality.
  • However, the geometry of the data is lost, so the clustering is based only on mutual distances of pairs of points, which is good when clustering is not center based.

Center-based Clustering Algorithms

We assume data points are represented as vectors in $\mathbb{R}^d$.

k-center clustering

Find a partition $C = \{C_1, ..., C_k \}$ of a set of data points $A = \{\mathbf{a_1}, ..., \mathbf{a_n} \}$ into $k$ clusters, with the corresponding centers $\mathbf{c_1}, ..., \mathbf{c_k}$ which minimises

$$\Phi(C) = \max_{j=1}^k \max_{a \in C_j} d(\mathbf{a}, \mathbf{c_j})$$

This minimises the largest cluster radius (the radius of a cluster being the furthest distance from the cluster center to a point in the cluster).

k-median clustering

Find a partition $C = \{C_1, ..., C_k \}$ of a set of data points $A = \{\mathbf{a_1}, ..., \mathbf{a_n} \}$ into $k$ clusters, with the corresponding centers $\mathbf{c_1}, ..., \mathbf{c_k}$ which minimises

$$\Phi(C) = \sum_{j=1}^k \sum_{a \in C_j} d(\mathbf{a}, \mathbf{c_j})$$

$d(\mathbf{a}, \mathbf{c}_j)$ can be any distance metric, such as

  • $l_1$: $d(\mathbf{a}, \mathbf{c}_j) = \sum_{k=1}^d| (\mathbf{a})_k - (\mathbf{c}_j)_k |$
  • $l_2$: $d(\mathbf{a}, \mathbf{c}_j) = \sqrt{\sum_{k=1}^d ((\mathbf{a})_k - (\mathbf{c}_j)_k)^2}$

If the distance is the $l_1$ distance, one can show that the coordinates of the optimal centers are the coordinate-wise medians of the points in each cluster.

k-means clustering

This is the most frequently used center based clustering algorithm. It is similar to the k-median clustering, except we want to minimise

$$\Phi(C) = \sum_{j=1}^k \sum_{a \in C_j} d(\mathbf{a}, \mathbf{c_j})^2$$

  • This penalises more for larger distances than the k-median clustering
  • This has other nice properties; for example, if $d(\mathbf{a}, \mathbf{c_j})^2 = \sum_{i=1}^d (a_i - c_{ji})^2$ then the optimal centers $\mathbf{c_j}$ must be the centroids of the points in their clusters.

Definitions

Finding the centroid

Since

$$\mathbf{a_i} = (a_{i1}, ..., a_{id})$$

let $\mathbf{c} = (c_1, ..., c_d)$ with for all $1 \leq k \leq d$,

$$c_k = (a_{1k} + ... + a_{nk}) / n$$

As a result, $c_k$ is the arithmetic mean of the $k^{th}$ coordinates of all the points $\{ \mathbf{a}_1, ..., \mathbf{a}_n \}$. Hence, $\mathbf{c}$ is called the centroid of the set of points $\{ \mathbf{a}_1, ..., \mathbf{a}_n \}$.

Finding the distance

We denote by $\mathbf{x} \cdot \mathbf{y}$ the scalar product of vectors $\mathbf{x}$ and $\mathbf{y}$, and by $|| \mathbf{x} ||$ the norm of a vector $\mathbf{x}$, i.e.

$$\mathbf{x} \cdot \mathbf{y} = \sum_{i=1}^d x_i y_i \quad \text{and} \quad ||\mathbf{x}|| = \sqrt{\sum_{i=1}^d x_i^2} = \sqrt{\mathbf{x} \cdot \mathbf{x}}$$

Note that $|| \mathbf{x} - \mathbf{y} ||$ is the Euclidean distance of points $\mathbf{x}$ and $\mathbf{y}$:

$$|| \mathbf{x} - \mathbf{y} || = \sqrt{\sum_{i=1}^d(x_i - y_i)^2}$$

Theorem

Let $A = \{\mathbf{a_1}, ..., \mathbf{a_n} \}$ be a set of points and $\mathbf{x}$ be another point, all in $\mathbb{R}^d$. Let also $\mathbf{c}$ be the centroid of $A$. Then

$$\sum_{i=1}^n || \mathbf{a_i} - \mathbf{x} ||^2 = \sum_{i=1}^n || \mathbf{a_i} - \mathbf{c} ||^2 + n||\mathbf{c} - \mathbf{x}||^2$$

Corollary

Let $A = \{\mathbf{a_1}, ..., \mathbf{a_n} \}$ be a set of points and $\mathbf{x}$ be another point, all in $\mathbb{R}^d$. Then

$$D(x) = \sum_{i=1}^n || \mathbf{a_i} - \mathbf{x} ||^2$$

is minimised when $\mathbf{x}$ is the centroid $\mathbf{c} = \frac{1}{n} \sum_{i=1}^n \mathbf{a_i}$.

Implementation

Thus, we want to find a partition of the set of points $A$ into $k$ disjoint components $A = \bigcup_{i=1}^k A_i$ and $k$ points $\mathbf{x_1}, ..., \mathbf{x_k}$ such that the sum

$$\sum_{j=1}^k \sum_{\mathbf{a_i} \in A_j} || \mathbf{a_i} - \mathbf{x_j} ||^2$$

is as small as possible. By the corollary above, whatever such an optimal partition $\{ A_j : 1 \leq j \leq k \}$ might be, the points $\mathbf{x_j}$ must be the centroids $\mathbf{c_j}$ of the sets $A_j$.

Hence, finding the $k$ clusters is equivalent to minimising

$$\sum_{m=1}^k \frac{1}{2|A_m|} \sum_{\mathbf{a_i}, \mathbf{a_j} \in A_m} || \mathbf{a_i} - \mathbf{a_j} ||^2$$

Lloyd's Algorithm

Solving the optimal k-means clustering is NP hard, so we use approximate algorithms.

The best known approximate k-means clustering algorithm is Lloyd's algorithm (a sketch follows the steps below).

  1. Start with an initial set of cluster centers $\{\mathbf{c}^{(0)}_m : 1 \leq m \leq k\}$
  2. Cluster all points $\mathbf{a} \in A$ into clusters $A_m$ by associating each $\mathbf{a} \in A$ with the nearest cluster centre.
  3. Replace cluster centres with the centroids of thus obtained clusters.
  4. Repeat 2 and 3 until cluster centers (and thus also clusters) stop changing.
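A minimal numpy sketch of these steps, with the initial centers chosen as $k$ random data points (an illustrative choice for step 1):

# Lloyd's algorithm sketch: alternate between assigning points to centers and recomputing centroids
import numpy as np

def lloyd(A, k, rng=np.random.default_rng(0), max_iter=100):
    # A: (n x d) array of data points
    centers = A[rng.choice(len(A), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center
        dists = np.linalg.norm(A[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: replace every center with the centroid of its cluster
        new_centers = np.array([
            A[labels == m].mean(axis=0) if np.any(labels == m) else centers[m]
            for m in range(k)
        ])
        # Step 4: stop when the centers (and hence the clusters) no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels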

At every round $p$ of its loop, Lloyd's algorithm reduces the size of

$$\sum_{m=1}^k \sum_{\mathbf{a}_j \in A_m^{(p)}} || \mathbf{a}_j - \mathbf{c}_m^{(p)} ||^2$$

where $A_m^{(p)}$ are the "temporary" clusters and $\mathbf{c}^{(p)}_m$ is the "temporary" centre of cluster $A_m^{(p)}$ at round $p$ of the loop.

However, the algorithm may stop at a local minimum and not the global minimum.

  • We may choose to run this algorithm multiple times and return the best result
  • Another algorithm, called Farthest Traversal k-clustering, picks a random point $a_q$ from $A$ as the first center, then picks the point of $A$ furthest from $a_q$ as the second center, and continues picking as the next center the point with the largest minimum distance from the already picked centers
    • If $A$ admits a k-clustering of radius $r$, then this algorithm produces a k-clustering of radius at most $2r$.
  • We can randomise the selection by picking each new center with probability proportional to its shortest distance from the already picked centers

Ward's Algorithm

Ward's algorithm is a greedy k-means algorithm:

  1. Start with every point $\mathbf{a}_i$ in its own cluster.

  2. While the number of clusters is larger than $k$ repeat:

    Find two clusters $C$ and $C'$ such that $$\text{cost}(C \cup C') - \text{cost}(C) - \text{cost}(C')$$ is as small as possible, where $\text{cost}(C)$ is the sum of squared distances from the points of $C$ to its centroid, and replace them with a single merged cluster $C \cup C'$ with its centroid as its centre (a sketch follows below).
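A direct (and deliberately inefficient) sketch of Ward's greedy merging, assuming the cost of a cluster is the sum of squared distances of its points to its centroid; the function names are illustrative:

# Sketch: Ward's greedy merging

import numpy as np

def cluster_cost(points: np.ndarray) -> float:
    # cost(C): sum of squared distances from the points of C to their centroid
    centroid = points.mean(axis=0)
    return float(((points - centroid) ** 2).sum())

def wards_algorithm(A: np.ndarray, k: int) -> list[np.ndarray]:
    # Start with every point in its own cluster, then greedily merge the pair
    # of clusters whose merge increases the total cost the least.
    clusters = [A[i:i + 1] for i in range(A.shape[0])]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                merged = np.vstack([clusters[i], clusters[j]])
                delta = (cluster_cost(merged)
                         - cluster_cost(clusters[i]) - cluster_cost(clusters[j]))
                if best is None or delta < best[0]:
                    best = (delta, i, j, merged)
        _, i, j, merged = best
        clusters = [c for m, c in enumerate(clusters) if m not in (i, j)]
        clusters.append(merged)
    return clusters

This naive version rescans all pairs of clusters at every merge, so it is far too slow for large inputs; practical implementations cache and update the pairwise merge costs instead of recomputing them.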

Non-Centre-Based Clustering Algorithms

There are several ways to find clusters that are not centre-based.

Similarity Graphs

Rather than representing a set of data points $A$ with their locations, we represent them as an undirected weighted graph $G = (V, E)$.

  • The weight $w_{ij}$ of an edge $e = (v_i, v_j)$ is equal to a similarity measure of the data points.
    • If $w_{ij} = 0$ then the vertices $v_i$ and $v_j$ are completely dissimilar points, and so we do not include this edge
    • Otherwise, the weight can be given by a decreasing function of the Euclidean distance, e.g. $$e^{-\frac{|| \mathbf{a}_i - \mathbf{a}_j ||^2}{2}}$$
  • Since the graph is weighted, the degree $d_i$ of a vertex $v_i$ is defined as $$d_i = \sum_{j=1}^n w_{ij}$$ The degree matrix $D$ is the diagonal matrix with the degree $d_i$ of vertex $v_i$ as its $i^{th}$ diagonal entry and zeroes everywhere else

From here there are several ways to cluster the points

  1. The $\epsilon$-neighbourhood graph
    • We connect all pairs of vertices $v_i$, $v_j$ such that the distance between the corresponding data points is less than $\epsilon$.
    • This distance is usually the Euclidean distance
  2. The $k$-nearest neighbour graphs
    • There are two flavours of this
      1. Unidirectional k-nearest neighbour graph Connect $v_i$ with $v_j$ if either $v_j$ is among the $k$ nearest neighbours of $v_i$ or vice versa
      2. Mutual k-nearest neighbour graph Connect $v_i$ with $v_j$ if both $v_j$ is among $k$ nearest neighbours of $v_i$ and vice versa
    • In both cases, the edge is then weighted with the degree of similarity of the vertices $v_i$ and $v_j$
  3. The fully connected graphs
    • Connect all pairs of vertices $v_i$ and $v_j$ where the corresponding data points have a strictly positive similarity or similarity higher than some threshold $\epsilon$.
    • Often we take weights with the formula $$w_{ij} = e^{-\frac{|| \mathbf{a}_i - \mathbf{a}_j ||^2 }{2 \theta^2}}$$
    • Here $\theta$ is a parameter which determines "the size" of the neighbourhood, namely how fast the similarity decreases as distance increases
  • However, there is no simple rule for choosing a similarity graph; the best approach is to determine one empirically (a sketch of constructing a fully connected similarity graph is given below)
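A sketch of constructing a fully connected similarity graph with Gaussian weights, together with its degree matrix (the function name and the removal of self-loops are my choices):

# Sketch: fully connected similarity graph and degree matrix

import numpy as np

def similarity_graph(A: np.ndarray, theta: float) -> tuple[np.ndarray, np.ndarray]:
    # W: weighted adjacency matrix with Gaussian weights; D: diagonal degree matrix.
    sq_dists = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq_dists / (2 * theta ** 2))
    np.fill_diagonal(W, 0.0)          # no self-loops
    D = np.diag(W.sum(axis=1))        # d_i = sum_j w_ij
    return W, D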

Spectral Graph Theory

Recall that the $n \times n$ diagonal matrix $D$ has the degrees $d_i$ of vertices $v_i$ on its diagonal, where $d_i = \sum_{j=1}^n w_{ij}$.

The (unnormalised) graph Laplacian matrix $L$ is defined as

$$L = D - W$$

where $W = (w_{ij})^n_{i, j = 1}$.

Clearly $L$ is symmetric and does not depend on $w_{ii}$, $1 \leq i \leq n$. Graph Laplacians are crucial for spectral clustering.

A matrix $M$ of size $n \times n$ is positive semi-definite if for all vectors $f \in \mathbb{R}^n$ we have

$$f^\intercal M f \geq 0$$

A real symmetric matrix always has real eigenvalues, and it is positive semi-definite iff all of its eigenvalues are non-negative.

The matrix $L = D - W$ has the following properties:

  1. For every vector $f \in \mathbb{R}^n$, $$ \begin{align*} f^\intercal L f &= f^\intercal D f - f^\intercal W f \\ &= \sum_{i=1}^n d_i f_i^2 - \sum_{i, j=1}^n w_{ij} f_i f_j \\ &= \frac{1}{2} \left( \sum_{i, j=1}^n w_{ij} f_i^2 - 2\sum_{i, j=1}^n w_{ij} f_i f_j + \sum_{i, j=1}^n w_{ij} f_j^2 \right) \quad \text{since } d_i = \textstyle\sum_j w_{ij} \\ &= \frac{1}{2} \sum_{i, j = 1}^{n} w_{ij} (f_i - f_j)^2 \end{align*} $$
  2. $L$ is a symmetric positive semi-definite matrix. As shown in (1), since $w_{ij} \geq 0$, $L$ satisfies $f^\intercal L f \geq 0$ for all vectors $f$ and is thus positive semi-definite.
  3. The smallest eigenvalue of $L$ is 0 and its corresponding eigenvector is $\mathbb{1} = (1, 1, ..., 1)$
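A small numerical sanity check of these three properties (a sketch; the random points and the width of the Gaussian weights are arbitrary):

# Sketch: checking the properties of the graph Laplacian

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 2))
sq = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
W = np.exp(-sq / 2)
np.fill_diagonal(W, 0.0)
D = np.diag(W.sum(axis=1))
L = D - W

f = rng.normal(size=10)
quad = f @ L @ f                                             # f^T L f
pairwise = 0.5 * (W * (f[:, None] - f[None, :]) ** 2).sum()  # 1/2 * sum w_ij (f_i - f_j)^2
assert np.isclose(quad, pairwise)                            # property 1
assert np.all(np.linalg.eigvalsh(L) >= -1e-9)                # property 2: non-negative eigenvalues
assert np.allclose(L @ np.ones(10), 0)                       # property 3: eigenvalue 0 with eigenvector 1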

(Missing some Spectral Graph Theory)

Spectral Clustering Algorithm:

  1. Construct a similarity graph $G$ by one of the ways described above and let $W$ be its weighted adjacency matrix
  2. Compute the Laplacian $L = D - W$.
  3. Compute the $k$ eigenvectors $\mathbf{e}_1, ..., \mathbf{e}_k$ of $L$ which correspond to $k$ smallest eigenvalues.
  4. Let $E$ be the matrix of size $n \times k$ containing the eigenvectors $\mathbf{e}_1, ..., \mathbf{e}_k$ as columns.
  5. For $i = 1, ..., n$, let $\mathbf{y}_i$ be the vector corresponding to the $i^{th}$ row of $E$
  6. Cluster points $\{ \mathbf{y}_1, ..., \mathbf{y}_n \}$ using the k-means algorithm into clusters $C_1, ..., C_k$.
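A sketch of the whole pipeline given a weighted adjacency matrix $W$; here scipy.cluster.vq.kmeans2 stands in for the k-means step in (6), and the function name is my own:

# Sketch: spectral clustering with the unnormalised Laplacian

import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(W: np.ndarray, k: int) -> np.ndarray:
    # W: weighted adjacency matrix of the similarity graph; returns cluster labels.
    D = np.diag(W.sum(axis=1))
    L = D - W                                # unnormalised graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)     # eigh returns eigenvalues in ascending order
    E = eigvecs[:, :k]                       # n x k matrix whose columns are the k "smallest" eigenvectors
    # each row y_i of E is the new representation of vertex v_i; cluster the rows with k-means
    _, labels = kmeans2(E, k, minit='++')
    return labels

The label of row $\mathbf{y}_i$ is then the cluster assigned to the original data point $\mathbf{a}_i$.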

(Missing application of Spectral Clustering as graph partitioning)

DFT, DCT, Convolution

DFT

The FFT computes the DFT of a length-$n$ sequence in $O(n \log n)$ time, and the IFFT computes the inverse DFT (IDFT) within the same time bound. The benefit of moving to the DFT (or back via the IDFT) is that it represents the data in another form, one which can often be manipulated or analysed more easily.

(Missing a ton of theory)

Convolution

Let $A = \langle A_0, A_1, ..., A_{n-1} \rangle$ and $B = \langle B_0, B_1, ..., B_{m-1} \rangle$ be two sequences of real or complex numbers. We can form two associated polynomials $P_A(x)$ and $P_B(x)$ with coefficients given by the sequences $A$ and $B$. Computing the product of these polynomials, $P_C(x) = P_A(x) \cdot P_B(x)$, directly takes $O(m \times n)$ operations, and from the product we obtain the sequence of its coefficients. This sequence of length $m + n - 1$ is the linear convolution of the sequences $A$ and $B$. Using the FFT algorithm, we can instead find the linear convolution of these two sequences in time $O((m + n) \log_2(m + n))$.

An example application is Gaussian smoothing of a noisy signal, as in the sketch below.
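A sketch of FFT-based linear convolution applied to Gaussian smoothing (the test signal, the kernel width and the lengths are arbitrary choices):

# Sketch: linear convolution via the FFT, used for Gaussian smoothing

import numpy as np

def fft_convolve(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # Linear convolution of length len(a) + len(b) - 1, computed via the FFT.
    size = len(a) + len(b) - 1
    return np.fft.irfft(np.fft.rfft(a, size) * np.fft.rfft(b, size), size)

t = np.linspace(0, 1, 500)
signal = np.sin(2 * np.pi * 5 * t) + 0.3 * np.random.default_rng(0).normal(size=t.size)
kernel = np.exp(-np.linspace(-3, 3, 51) ** 2 / 2)
kernel /= kernel.sum()                     # normalise so the smoothing preserves scale
smoothed = fft_convolve(signal, kernel)    # length 500 + 51 - 1 = 550
# matches np.convolve(signal, kernel) up to floating-point error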

SVD

Singular Value Decomposition is a way of representing a very large matrix $A$ as

$$A = \sum_{i=1}^r \sigma_i \mathbf{u}_i \mathbf{v}_i^\intercal = UDV^\intercal$$

(Insert some more theory)

To find $\mathbf{v}_1$ and $\mathbf{u}_1$:

  1. Start with a random vector $\mathbf{x}$ and normalise it so that $||\mathbf{x}|| = 1$.
  2. Compute $\mathbf{z}_1 = A\mathbf{x}$; compute $\mathbf{z}_2 = A^\intercal \mathbf{z}_1$; let $\rho = ||\mathbf{z}_2||/||\mathbf{x}||$ and $\mathbf{x} = \mathbf{z}_2$
  3. Repeat (2) until the difference between the $\rho$'s obtained in two consecutive iterations is smaller than a small threshold $\epsilon$.
  4. Set $\mathbf{v}_1 = \mathbf{z}_2 / ||\mathbf{z}_2||$
  5. Finally, let $\sigma_1 = ||A\mathbf{v}_1||$ and $\mathbf{u}_1 = A \mathbf{v}_1 / \sigma_1$

This method is called the Power Method and it is fast if $A$ and thus also $A^\intercal$ are sparse.
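A sketch of the Power Method following the steps above (renormalising $\mathbf{x}$ in each iteration is my addition for numerical stability; it does not change the direction the iteration converges to):

# Sketch: Power Method for the leading singular triple

import numpy as np

def power_method_svd(A: np.ndarray, eps: float = 1e-10) -> tuple[float, np.ndarray, np.ndarray]:
    # Returns (sigma_1, u_1, v_1), the leading singular value and singular vectors of A.
    rng = np.random.default_rng(0)
    x = rng.normal(size=A.shape[1])
    x /= np.linalg.norm(x)                           # step 1
    rho_prev = 0.0
    while True:
        z1 = A @ x                                   # step 2
        z2 = A.T @ z1
        rho = np.linalg.norm(z2) / np.linalg.norm(x) # converges to sigma_1 ** 2
        x = z2 / np.linalg.norm(z2)                  # renormalised update of x
        if abs(rho - rho_prev) < eps:                # step 3
            break
        rho_prev = rho
    v1 = x                                           # step 4
    sigma1 = np.linalg.norm(A @ v1)                  # step 5
    u1 = A @ v1 / sigma1
    return sigma1, u1, v1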

A good series on SVD.

Power Transmission in Cellular Networks

The signal to interference ratio $SIR_i$ at receiver $R_i$ is given by

$$SIR_i = \frac{G_{ii}p_i}{\sum_{j: j \neq i}G_{ij}p_j + \eta_i}$$

where $\eta_i$ is the noise received by the receiver $i$ coming from the environment. $SIR_i$ determines the capacity of the channel $C_{ii}$. Each pair of a transmitter $T_i$ and a receiver $R_i$ needs at least some channel capacity to carry information. To achieve this, it needs $SIR_i \geq \gamma_i$.

We will see later how $\gamma_i$ is determined in practice. However, we are interested in:

$$\text{minimise } \sum_{j=1}^n p_j$$

$$\text{subject to constraints } \frac{G_{ii} p_i}{\sum_{j: j \neq i}G_{ij} p_j + \eta_i} \geq \gamma_i, \quad 1 \leq i \leq n$$

Matrix Form

Following this, we can rewrite our original optimisation problem as a linear program:

$$\text{minimise } \sum_{j=1}^n p_j$$

$$\text{subject to constraints } p_i - \gamma_i \sum_{j: j \neq i} \frac{G_{ij}}{G_{ii}} p_j \geq \frac{\gamma_i \eta_i}{G_{ii}}, \quad 1 \leq i, j \leq n, \; p_j > 0$$

We can then convert this into matrix form, where $D = \operatorname{diag}(\gamma_1, ..., \gamma_n)$, $F$ is the matrix with zero diagonal and off-diagonal entries $F_{ij} = G_{ij}/G_{ii}$, and $\mathbf{v}$ is the vector with entries $v_i = \gamma_i \eta_i / G_{ii}$:

$$\text{minimise } \mathbf{1}^\intercal \mathbf{p}$$

$$\text{subject to constraints } (I - DF)\mathbf{p} \geq \mathbf{v}, \quad \mathbf{p} > 0$$

This will have a feasible solution as long as we are not demanding excessively large $\gamma_i$'s.

This is the case when the spectral radius (the largest absolute value of the eigenvalues) of the matrix $DF$, denoted $\rho(DF)$, satisfies $\rho(DF) < 1$.

Proof:

  • If $A$ is a square matrix and $\rho(A) < 1$ then $A^k \rightarrow 0$.
  • This is easy to see if $A$ has $n$ linearly independent eigenvectors, because in this case $A$ can be represented as $$A = Q \Lambda Q^{-1}$$
  • Here the $i^{th}$ column of $Q$ is the eigenvector corresponding to eigenvalue $\lambda_i$, and $\Lambda$ is a diagonal matrix with $\lambda_i$ as its $i^{th}$ diagonal entry and zeroes elsewhere. In such a case $$A^k = Q \Lambda Q^{-1} Q \Lambda Q^{-1} \cdots Q \Lambda Q^{-1} Q \Lambda Q^{-1} = Q \Lambda^k Q^{-1}$$
  • Since $\Lambda^k$ has $\lambda_i^k$ on the diagonal, if $\rho(A) < 1$ then clearly $A^k \rightarrow 0$

Moreover it is easy to see that

$$(I - A) \sum_{i=0}^k A^i = \sum_{i=0}^k A^i - \sum_{i=1}^{k+1}A^i = I - A^{k+1}$$

Thus,

$$\lim_{k \rightarrow \infty} \left( (I - A) \sum_{i=0}^k A^i \right) = \lim_{k \rightarrow \infty}\left(I - A^{k + 1}\right) = I,$$

which implies

$$(I - A) \sum_{i=0}^\infty A^i = I.$$

This shows that matrix $I - A$ is invertible and that

$$(I - A)^{-1} = \sum_{i=0}^\infty A^i$$

Applying this to the matrix $A = DF$, let $\mathbf{p}^*$ be given by

$$\mathbf{p^*} = (I - DF)^{-1} \mathbf{v} = \sum_{i=0}^\infty (DF)^i \mathbf{v}$$

Then $(I - DF) \mathbf{p^*} = \mathbf{v}$ and the constraint becomes

$$(I - DF) \mathbf{p} \geq (I - DF) \mathbf{p^*}$$

i.e.

$$(I - DF)(\mathbf{p} - \mathbf{p^*}) \geq 0$$
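A sketch of computing the minimal power vector $\mathbf{p}^* = (I - DF)^{-1}\mathbf{v}$ by iterating $\mathbf{p} \leftarrow DF\mathbf{p} + \mathbf{v}$, whose iterates are the partial sums $\sum_{i=0}^k (DF)^i \mathbf{v}$ and which therefore converges when $\rho(DF) < 1$; the gains, SIR targets and noise levels below are made-up illustrative values:

# Sketch: minimal transmission powers via the Neumann series iteration

import numpy as np

def minimal_powers(G: np.ndarray, gamma: np.ndarray, eta: np.ndarray,
                   iters: int = 1000) -> np.ndarray:
    # G[i, j]: gain from transmitter j to receiver i; gamma: SIR targets; eta: noise.
    F = G / np.diag(G)[:, None]            # F[i, j] = G_ij / G_ii
    np.fill_diagonal(F, 0.0)
    D = np.diag(gamma)
    v = gamma * eta / np.diag(G)
    p = v.copy()
    for _ in range(iters):                 # p converges to (I - DF)^{-1} v when rho(DF) < 1
        p = D @ F @ p + v
    return p

# illustrative example with 3 transmitter-receiver pairs
rng = np.random.default_rng(0)
G = rng.uniform(0.05, 0.2, size=(3, 3)) + np.diag([1.0, 1.0, 1.0])
gamma = np.array([2.0, 2.0, 2.0])
eta = np.array([0.1, 0.1, 0.1])
p_star = minimal_powers(G, gamma, eta)
# p_star agrees with solving (I - D F) p = v directly, e.g. via np.linalg.solve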