Advanced Algorithms
26 November 2021
The expected value of a random variable is its mean. Assuming that the sum or integral below converges, it can be calculated as follows.
Discrete random variable | Continuous random variable |
---|---|
$E(X) = \sum_{i = 1}^{\infty} v_i \cdot p_i$ | $E(X) = \int_{-\infty}^{\infty}x \cdot f(x) dx$ |
The variance of a random variable is defined as $V(X) = E(X - E(X))^2$ assuming that both expectations involved are finite; the standard deviation of a random variable $X$ is given by $\sigma = \sqrt{V(X)}$.
$$1 + x \leq e^x \text{ for all } x \in \mathbb{R}$$ $$\left(1 - \frac{1}{n}\right)^n \leq \frac{1}{e} \leq \left(1 - \frac{1}{n} \right)^{n - 1} \text{ for all } n \in \mathbb{N}$$
The Markov Inequality: let $X \geq 0$ be a non-negative random variable. Then for all $t > 0$, $$P(X \geq t) \leq \frac{E(X)}{t}$$
The Chebyshev Inequality: let $X$ be a random variable with expected value $\mu = E(X)$ and standard deviation $\sigma = \sqrt{E((X - \mu)^2)}$. Then for all $\lambda > 0$, $$P(| X - \mu | \geq \lambda \sigma) \leq \frac{1}{\lambda^2}$$
The Chernoff Bound: let $X = \sum_{k = 1}^{n} X_k$, where the $X_k$, $1 \leq k \leq n$, are independent Bernoulli trials with probability of success $P(X_k = 1) = p_k$, where $0 < p_k < 1$. Thus, $\mu = E(X) = \sum_{k = 1}^{n} p_k$. Let $\sigma > 0$ be any positive real. Then, $$P(X > (1 + \sigma) \mu) < \left( \frac{e^{\sigma}}{(1 + \sigma)^{1 + \sigma}}\right)^{\mu}$$
Can we find the $i^{th}$ largest or smallest item in linear time?
In the quicksort algorithm, we note that after a partition, the pivot element ends up being in its correct index. Therefore, we can perform a combination of partitioning and binary search to find the item at the $i^{th}$ index.
For example, we partition first using a random pivot; as a result, we know the final index of the pivot. If the pivot lands at index $i$, we have found the element. If the pivot lands at an index greater than $i$, we recurse on the left part (the elements smaller than the pivot); otherwise we recurse on the right part, and continue until the element at index $i$ is found.
# Implementation of rand-select
import random

def partition(A: list[int], lo: int, hi: int) -> int:
    # Partition A[lo..hi] around the pivot A[hi] and return the pivot's final index
    i = lo - 1
    pivot = A[hi]
    for j in range(lo, hi):
        if A[j] <= pivot:
            i += 1
            A[i], A[j] = A[j], A[i]
    A[i + 1], A[hi] = A[hi], A[i + 1]
    return i + 1

def rand_select(A: list[int], i: int, lo: int, hi: int) -> int:
    # Return the element that would be at index i if A were sorted
    if lo < hi:
        # Move a randomly chosen pivot into position hi before partitioning
        k = random.randint(lo, hi)
        A[k], A[hi] = A[hi], A[k]
        pi = partition(A, lo, hi)
        if i == pi:
            return A[pi]
        if i < pi:
            return rand_select(A, i, lo, pi - 1)
        return rand_select(A, i, pi + 1, hi)
    return A[lo]
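A minimal usage sketch, assuming the functions above are in scope (`statistics.median` is used purely as an independent check):

```python
import random
import statistics

data = [random.randrange(1000) for _ in range(99)]

# The median of 99 items is the element of rank 49 (0-indexed) in sorted order.
median = rand_select(data[:], 49, 0, len(data) - 1)  # pass a copy, since rand_select reorders
assert median == statistics.median(data)
print("median:", median)
```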
If every partition were roughly balanced, the runtime would satisfy the recurrence $T(n) = T\left( \frac{n}{2} \right) + O(n)$, which results in an $O(n)$ runtime. However, the expected runtime can be analysed more carefully.
The worst case runtime is $\Theta(n^2)$, which happens when the smallest or largest element is repeatedly picked as the pivot, resulting in maximally unbalanced partitions.
However (assuming all elements are distinct), let us call a partition a balanced partition if the ratio between the larger side and the smaller side is less than 9 (9 is arbitrary; any small constant would do). Then the probability of having a balanced partition is 1 - (the chance of choosing a pivot from the smallest 1/10 or the largest 1/10 of the elements) = 1 - 2/10 = 8/10.
Then let us find the expected number of partitions between two consecutive balanced partitions. In general, the probability that you need $k$ partitions to end up with another balanced partition is $\left( \frac{2}{10} \right)^{k - 1} \cdot \frac{8}{10}$.
Thus, the expected number of partitions between two balanced partitions is
$$ \begin{align*} E &= 1 \cdot \frac{8}{10} + 2 \cdot \left( \frac{2}{10} \right) \cdot \frac{8}{10} + 3 \cdot \left( \frac{2}{10} \right)^2 \cdot \frac{8}{10} + ... \\ &= \frac{8}{10} \cdot \sum_{k = 0}^{\infty}(k + 1) \left( \frac{2}{10} \right)^k \\ &= \frac{8}{10}S. \end{align*} $$
To find $S$, note that
$$\sum_{k = 0}^{\infty} q^k = \frac{1}{1 - q}.$$
By differentiating both sides with respect to $q$ we get
$$\sum_{k = 1}^{\infty} k q^{k - 1} = \frac{1}{(1 - q)^2}.$$
Substituting $q = \frac{2}{10}$ we get that $S = \left(\frac{10}{8} \right)^2$.
Therefore,
$$E = \frac{8}{10} \cdot \left( \frac{10}{8} \right)^2 = \frac{10}{8} = \frac{5}{4}.$$
And so, on average, only $5/4$ partitions are performed between two consecutive balanced partitions.
Therefore, the total average (expected) run time satisfies
$$ \begin{align*} T(n) &< \frac{5}{4}n + \frac{5}{4} \left( \frac{9}{10} \right)n + \frac{5}{4} \left( \frac{9}{10} \right)^2 n + \frac{5}{4} \left( \frac{9}{10} \right)^3 n + ... \\ &= \frac{\frac{5}{4} n}{1 - \frac{9}{10}} \\ &= 12.5n. \end{align*} $$
Overall, the expected runtime of this algorithm is linear.
Assume that $n$ processes want to access a database, and that the time $t$ is discrete. If two processes simultaneously request access, there is a conflict and all processes are locked out of access.
Assume that processes cannot communicate with each other about when to access. One possible strategy for a process to decide whether it should access the database at time $t$ is to "request access" with probability $p$ and "do not request access" with probability $(1 - p)$.
What should $p$ be to maximise the probability of a successful access to the database for a process at any instant $t$?
The probability of success of process $i$ at any instant $t$ is
$$P(S(i, t)) = p(1 - p)^{n - 1},$$
because a process $i$ requests access with probability $p$ and the probability that no other process has requested access is $(1 - p)^{n - 1}$.
Plotting $P(S(i, t)) = p(1 - p)^{n - 1}$ as a function of $p$ shows a single interior maximum. This extreme point is found by solving
$$\frac{d}{dp}P(S(i, t)) = (1 - p)^{n - 1} - p(n - 1)(1 - p)^{n - 2} = 0$$
which gives $p = 1/n$, and hence a probability of success of
$$P(S(i, t)) = p(1 - p)^{n - 1} = \frac{1}{n} \left( 1 - \frac{1}{n} \right)^{n - 1}.$$
However,
$$\lim_{n \rightarrow \infty} \left(1 - \frac{1}{n} \right)^{n - 1} = \frac{1}{e}$$
and hence $P(S(i, t)) = \Theta \left( \frac{1}{n} \right)$. Thus, the probability that a process fails in every one of the first $t$ instants is
$$ \begin{align*} P(\text{failure after $t$ instants}) &= \left( 1 - \frac{1}{n} \left( 1 - \frac{1}{n} \right)^{n - 1} \right)^t \\ &\approx \left( 1 - \frac{1}{e n}\right)^t \end{align*} $$
We observe that
P(failure after $t = en$ instances) | P(failure after $t = en \cdot 2 \ln n$ instances) |
---|---|
$$\left( 1 - \frac{1}{en}\right)^{en} \approx \frac{1}{e}$$ | $$\left(1 - \frac{1}{en} \right)^{en \cdot 2 \ln n} \approx \frac{1}{n^2}$$ |
Thus, a small increase in the number of time instants, from $en$ to $en \cdot 2 \ln n$ caused a dramatic reduction in the probability of failure.
After $en \cdot 2 \ln n$ instances, the probability of failure of each process $\leq 1/n^2$ and since there are $n$ processes, then the probability that at least one process failed is $\leq 1/n$. Thus after $en \cdot 2 \ln n$ instances all processes succeeded to access the database with probability at least $1 - 1/n$.
If the processes could communicate, then it would take $n$ instances for all of them to access the database.
If they cannot communicate, then the above method lets all of them access the database within $en \cdot 2 \ln n$ instants with probability at least $1 - 1/n$, which is longer than $n$ only by a relatively small factor of $2e \ln n$.
A skip list is a probabilistic data structure that functions similarly to a binary search tree.
# Example structure of a skip list
[H]--------------------------[35]----------------[T]
[H]----------------[21]------[35]----------------[T]
[H]------[12]------[21]-[24]-[35]------[55]------[T]
[H]-[02]-[12]-[17]-[21]-[24]-[35]-[43]-[55]-[62]-[T]
Search for $k$: start at the head on the highest level; move right while the next key is smaller than $k$, and drop down a level when it is not, until level 0 has been scanned (see the sketch below).
Insertion of $k$: search for the position of $k$ on the bottom level, then flip a fair coin repeatedly; an element that gets $i$ consecutive tails before the first head receives links on levels $0$ through $i$.
Deletion: deleting an element is just like in a standard doubly linked list - unlink it on every level on which it has links.
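A minimal Python sketch of the structure, assuming the coin-flipping insertion just described (the class names and the `max_level` cap are choices of this sketch, not part of the notes):

```python
import random

class Node:
    def __init__(self, key, level):
        self.key = key
        self.next = [None] * (level + 1)   # next[i] = successor on level i

class SkipList:
    def __init__(self, max_level=32):
        self.max_level = max_level
        self.head = Node(None, max_level)  # sentinel head with links on every level
        self.level = 0                     # highest level currently in use

    def _random_level(self):
        # Flip a fair coin: each additional level is kept with probability 1/2
        lvl = 0
        while random.random() < 0.5 and lvl < self.max_level:
            lvl += 1
        return lvl

    def search(self, key):
        node = self.head
        for i in range(self.level, -1, -1):          # start at the top level
            while node.next[i] and node.next[i].key < key:
                node = node.next[i]                  # move right while keys are smaller
        node = node.next[0]                          # drop to level 0
        return node is not None and node.key == key

    def insert(self, key):
        update = [self.head] * (self.max_level + 1)
        node = self.head
        for i in range(self.level, -1, -1):
            while node.next[i] and node.next[i].key < key:
                node = node.next[i]
            update[i] = node                         # last node before the new key on level i
        lvl = self._random_level()
        self.level = max(self.level, lvl)
        new = Node(key, lvl)
        for i in range(lvl + 1):                     # splice the new node in on each level
            new.next[i] = update[i].next[i]
            update[i].next[i] = new
```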
The probability of getting $i$ consecutive tails when flipping a coin $i$ times is $1/2^i$. Thus, an $n$ element Skip List has on average $n/2^i$ elements with links on level $i$. Since an element has links only on levels $0 - i$ with probability $1/2^{i + 1}$, the total expected number of link levels per element is
$$\sum_{i = 0}^{\infty} \frac{i + 1}{2^{i + 1}} = \sum_{i = 1}^{\infty} \frac{i}{2^i} = 2.$$
Let $\#(i)$ be the number of elements on level $i$.
Then, $E[\#(i)] = \frac{n}{2^i}$, and by the Markov inequality, the probability of having at least one element at level $i$ satisfies
$$P(\#(i) \geq 1) \leq \frac{E[\#(i)]}{1} = \frac{n}{2^i}.$$
Thus, the probability to have an element on level $2 \log n$ is smaller than $n/2^{2 \log n} = 1/n.$
More generally, the probability to have an element on level $k \log n$ (i.e. for that level to be non-empty) is $< n /2^{k \log n} = 1/n^{k - 1}$. Let $k$ be the smallest integer such that the number of levels is $\leq k \log n$; its expected value $E(k)$ satisfies
$$E(k) \leq \sum_{k = 1}^{\infty} \frac{k}{n^{k - 1}} = \left( \frac{n}{n - 1} \right)^2.$$
Thus, the expected number of levels is barely larger than $\log n$.
Thus, the expected number of elements between any two consecutive elements with a link on level $i + 1$ which have links only up to level $i$ is smaller than
$$\frac{0}{2} + \frac{1}{2^2} + \frac{2}{2^3} + \frac{3}{2^4} + ... = 1.$$
So once on level $i$, on average we will have to inspect only two elements on that level before going to a lower level.
On average, levels $< 2 \log n$, and 2 elements are visited per level. Therefore, on average, the search will be in time $O(4 \log n) = O(\log n)$.
For an element with links on levels $0$ to $i$, we store $O(i + 1)$ pointers, and the expected number of elements whose highest link is on level $i$ is $O(n/2^{i + 1})$. Thus, the total expected space is
$$O \left( \sum_{i = 0}^{\infty}2(i + 1) \frac{n}{2^{i + 1}}\right) = O \left( 2n \sum_{i = 0}^{\infty} \frac{i + 1}{2^{i + 1}} \right) = O(4n) = O(n).$$
Given a graph $G = (V, E)$, Karger's Min Cut algorithm finds a cut $T$ that partitions the vertices $V$ into two non-empty disjoint subsets $X$ and $Y$ with the lowest total capacity of edges that have one endpoint in $X$ and the other in $Y$.
A deterministic way to solve this is to run max flow from one fixed vertex to every other vertex; however, this runs in $O(|V|^4)$.
The algorithm makes use of contracting edges in a graph. To contract an edge $e(u, v)$, fuse $u$ and $v$ into a single vertex $[uv]$ and replace each pair of edges $e(u, x)$ and $e(v, x)$ with a single edge $e([uv], x)$ of weight $w([uv], x) = w(u, x) + w(v, x)$. The graph obtained in this way is denoted $G_{uv}$.
After collapsing $u$ and $v$ into a single vertex, every cut of $G_{uv}$ corresponds to a cut of $G$ in which $u$ and $v$ lie on the same side. Therefore:
Let $T_1 = (X_1, Y_1)$ be a min cut in $G_{uv}$.
Split $[uv]$ back into $u$ and $v$, but keep them on the same side of the cut $T_1$. This produces a cut $T_2$ in $G$ of the same capacity as the min cut $T_1$ in $G_{uv}$.
Thus, the capacity of the min cut in $G$ can only be smaller than or equal to the capacity of the min cut $T_1$ in $G_{uv}$.
Let $M(G)$ represent the min cut capacity of $G$.
If the edge $e(u, v)$ to be contracted is chosen at random with probability proportional to its weight, then the probability that the capacity of a min cut in $G_{uv}$ is larger than the capacity of a min cut in $G$ is smaller than $2/n$, where $n = |V|$:
$$P(M(G_{uv}) > M(G)) < \frac{2}{n}.$$
This is because this probability is at most the probability that the contracted edge $e(u, v)$ belongs to a min cut. Every vertex has total incident capacity at least $M(G)$ (otherwise that vertex alone would give a smaller cut), so the total capacity of the graph is at least $\frac{n \cdot M(G)}{2}$. Hence the probability that a randomly chosen edge lies in a min cut is at most $\frac{M(G)}{n M(G)/2} = \frac{2}{n}$.
If we run edge contraction until only two vertices (and thus a single super-edge) remain, then the probability $\pi$ that the capacity of that edge equals the capacity of the min cut in $G$ is $\Omega \left( \frac{1}{n^2} \right)$.
From the theorem above (writing $G_0 = G$ and $G_i$ for the graph after the $i$-th contraction),
$$ \begin{align*} \pi &= P(M(G) = M(G_{n-2})) \\ &= \prod_{i = 1}^{n - 2} P(M(G_i) = M(G_{i - 1})) \\ &\geq \left(1 - \frac{2}{n} \right) \left(1-\frac{2}{n-1}\right) \left(1-\frac{2}{n-2}\right) ... \left(1-\frac{2}{3} \right) \\ &= \frac{n - 2}{n} \times \frac{n - 3}{n - 1} \times \frac{n - 4}{n - 2} \times ... \times \frac{1}{3} \\ &= \frac{2}{n(n-1)}. \end{align*} $$
Since one run of the contraction takes $O(n^2)$ time and has an $\Omega \left( \frac{1}{n^2} \right)$ chance of being correct, it needs to be run $\Theta(n^2)$ times to succeed with constant probability, resulting in a final runtime of $O(n^4)$ to find the min cut.
The algorithm can be improved to run in $O(n^2 \log^3 n)$ with the high probability of being correct through a divide and conquer approach.
Running the contraction algorithm until $\lfloor \frac{n}{2} \rfloor$ vertices remain takes $O(n^2)$ time, and the probability that the min cut survives is $\approx 1/4$. Hence, by running the contraction down to $\lfloor \frac{n}{2} \rfloor$ vertices 4 times, and then recursing on those smaller graphs, we have a runtime of
$$T(n) = 4T\left(\frac{n}{2}\right) + O\left(n^2\right) = O(n^2 \log n).$$
# Python flavoured pseudo code of the refined algorithm
def karger_refined(G: Graph) -> int:
    V, E = G
    # Base case: only two (super-)vertices remain; the weight of the edge between them is the cut
    if len(V) == 2:
        return E[V[0], V[1]]
    # Contract down to floor(|V|/2) vertices 4 times and recurse on those 4 smaller graphs
    min_cuts = [karger_refined(contract(G)) for _ in range(4)]
    return min(min_cuts)
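The pseudocode above leaves `contract` unspecified. Below is one possible sketch (the `(V, E)` representation with an edge-weight dictionary keyed by vertex pairs is an assumption of this sketch, and is looser than the pseudocode's indexing): parallel edges add their weights and self-loops are discarded, as in the contraction described earlier.

```python
import random

def contract(G):
    """One round of contraction: merge random edges until floor(|V|/2) vertices remain.

    G = (V, E), where V is a set of vertex labels and E maps frozenset({u, v}) -> weight.
    Returns a new graph in the same format; the input graph is not modified.
    """
    V = set(G[0])
    E = dict(G[1])
    target = max(2, len(V) // 2)
    while len(V) > target and E:
        # Choose an edge with probability proportional to its weight
        keys = list(E.keys())
        weights = [E[k] for k in keys]
        u, v = tuple(random.choices(keys, weights=weights, k=1)[0])
        # Merge v into u: redirect every edge incident to v
        V.remove(v)
        del E[frozenset((u, v))]
        for key in [k for k in E if v in k]:
            (other,) = key - {v}
            w = E.pop(key)
            if other != u:                            # drop the self-loop on [uv]
                merged = frozenset((u, other))
                E[merged] = E.get(merged, 0) + w      # parallel edges add their weights
    return V, E
```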
Let $p(n) = P(\text{success for a graph of size $n$})$, then
$$ \begin{align*} p(n) &= 1 - P(\text{failure on all four branches}) \\ &= 1 - (1 - P(\text{success on one branch}))^4 \\ &\geq 1 - \left( 1 - \frac{1}{4}p \left(\frac{n}{2}\right)\right)^4 \\ &> p\left(\frac{n}{2}\right) - \frac{3}{8}p\left(\frac{n}{2}\right)^2 \\ &> \frac{1}{\log n}. \end{align*} $$
If we run our algorithm $(\log n)^2$ times, the probability we are correct $\pi$ is
$$\pi = 1 - \left(1 - \frac{1}{\log n} \right)^{(\log n)^2}$$
Since for all reasonably large $k$, $(1 - 1/k)^k \approx e^{-1}$, we get
$$\pi \approx 1 - e^{-\log n} = 1 - 1/n.$$
Hence, if we run karger_refined $(\log n)^2$ times, we are correct with probability $1 - 1/n$, with a runtime of
$$O\left(n^2 \log^3 n \right) \ll O\left(n^4\right).$$
If the hash function can be analysed, and a sequence of worst keys is chosen, then a lookup in a hash table can take $O(n)$, though ideally we want to have $O(1)$.
Let $H$ be a finite collection of hash functions that map a given universe $U$ of keys into the range $\{0, 1, ..., m - 1\}$. $H$ is said to be universal if for each pair of distinct keys $x, y \in U$, the number of hash functions $h \in H$ for which $h(x) = h(y)$ is at most $\frac{|H|}{m}$.
Assume that a family of hash functions $H$ is universal, and we are hashing $n$ keys into a hash table of size $m$. Let $C_x$ be the total number of collisions involving key $x$; then the expected value $E[C_x]$ satisfies
$$E[C_x] \leq \frac{n - 1}{m}.$$
One standard such family is obtained as follows: choose a prime $m$, write each key $x$ in base $m$ as a sequence of digits $x = \langle x_0, x_1, ..., x_r \rangle$ with $0 \leq x_i < m$, choose a random sequence $\vec{a} = \langle a_0, a_1, ..., a_r \rangle$ with $0 \leq a_i < m$, and define $h_{\vec{a}}(x) = \sum_{i=0}^{r} a_i x_i \mod m$. Assume $x, y$ are two distinct keys. Then a hash collision occurs exactly when
$$ \begin{align*} h_{\vec{a}}(x) = h_{\vec{a}}(y) &\Leftrightarrow \sum_{i=0}^{r}x_ia_i = \sum_{i=0}^{r}y_ia_i \mod m \\ &\Leftrightarrow \sum_{i=0}^{r}(x_i - y_i)a_i = 0 \mod m \end{align*} $$
Since $x \neq y$ there exists $0 \leq k \leq r$ such that $x_k \neq y_k$. Without loss of generality assume that $x_0 \neq y_0$; then
$$(x_0 - y_0)a_0 = - \sum_{i=1}^{r}(x_i - y_i)a_i \mod m$$
Since $m$ is a prime, every non-zero element $z \in \{0, 1, ..., m - 1\}$ has a multiplicative inverse $z^{-1}$, such that $z \cdot z^{-1} = 1 \mod m$ and so
$$a_0 = \left( - \sum_{i=1}^{r} (x_i - y_i)a_i \right) (x_0 - y_0)^{-1} \mod m$$
This implies that, for every choice of $a_1, ..., a_r$, there exists exactly one value of $a_0$ such that $h_{\vec{a}}(x) = h_{\vec{a}}(y)$. Since there are $m^r$ choices of $(a_1, ..., a_r)$ and $m^{r+1}$ sequences $\vec{a}$ in total, the probability of choosing a sequence $\vec{a}$ such that $h_{\vec{a}}(x) = h_{\vec{a}}(y)$ is equal to
$$\frac{m^r}{m^{r+1}} = \frac{1}{m}.$$
Thus, the family $H$ is a universal collection of hash functions.
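A small sketch of the family $h_{\vec{a}}$ used in this argument (the function names and the base-$m$ digit extraction are choices of this sketch; $m$ must be prime):

```python
import random

def make_universal_hash(m: int, r: int):
    """Return a random member h_a of the universal family for keys with r+1 base-m digits.

    m must be prime; every key must satisfy key < m**(r + 1).
    """
    a = [random.randrange(m) for _ in range(r + 1)]   # the random sequence a_0, ..., a_r

    def h(key: int) -> int:
        digits = []
        x = key
        for _ in range(r + 1):                        # write the key in base m
            digits.append(x % m)
            x //= m
        return sum(ai * xi for ai, xi in zip(a, digits)) % m

    return h

# Usage: hash 32-bit keys into a table of prime size m = 101.
# 101**5 > 2**32, so r = 4 digits suffice.
h = make_universal_hash(101, 4)
print(h(123456789), h(987654321))
```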
Given $n$ keys, we will construct hash tables of size $m$, where $m$ is a prime with $n^2 \leq m < 2n^2$, using universal hashing. The probability that such a table is collision free will be $> 1/2$.
Given $n$ keys, there will be $n \choose 2$ pairs of keys. By universality of the family of hash functions used, for each pair of keys the probability of collision is $1/m$. Since $m \geq n^2$ we have $\frac{1}{m} \leq \frac{1}{n^2}$. Thus, the expected total number of collisions in the table is at most
$${n \choose 2} \frac{1}{m} \leq \frac{n(n - 1)}{2} \frac{1}{n^2} < \frac{1}{2}.$$
Let $X$ be the random variable equal to the number of collisions. Then by the Markov Inequality with $t=1$ we get
$$P\{X \geq 1\} \leq \frac{E[X]}{1} < \frac{1}{2}.$$
Thus, the chance that each of $k$ independent attempts produces a table with a collision is $< (1/2)^k$, which rapidly tends to 0. Consequently, after a few random trial-and-error attempts we will obtain a collision free hash table of size $< 2n^2$.
If $p < 1/2$ is the probability that a trial produces a collision, then the expected number of trials $E[N]$ until we hit a collision free hash table of size $< 2n^2$ is
$$
\begin{align*}
E[N]
&= 1(1 - p) + 2p(1 - p) + 3p^2(1 - p) + ... \\
&= (1 - p)(1 + 2p + 3p^2 + ...) \\
&= \frac{1}{1 - p} \\
&< 2.
\end{align*}
$$
Choose $M$ to be the smallest prime number $> n$.
By Bertrand's postulate, $n < M < 2n$. Produce a primary hash table of size $M$ by choosing randomly from a universal family of hash functions. Assume that slot $i$ of this table receives $n_i$ elements. Hash these $n_i$ elements into a secondary hash table of size $m_i < 2n_i^2$. We have to guarantee that the sum total of the sizes of all secondary hash tables, i.e., $\sum_{i=1}^{M}m_i$, is linear in $n$. Note that ${n_i \choose 2}$ is the number of colliding pairs in slot $i$, and
$${n_i \choose 2} = \frac{n_i(n_i - 1)}{2} = \frac{n_i^2}{2} - \frac{n_i}{2}.$$
Therefore the total number of colliding pairs in the primary hash table is $\sum_{i=1}^{M} {n_i \choose 2}$, and since with universal hashing each fixed pair of keys collides with probability at most $1/M$,
$$ \begin{align*} E\left[ \sum_{i=1}^{M} {n_i \choose 2} \right] &\leq {n \choose 2}\frac{1}{M} \\ &= \frac{n(n - 1)}{2M} \\ E\left[ \sum_{i=1}^{M} n_i^2 \right] &= 2E\left[ \sum_{i=1}^{M} {n_i \choose 2} \right] + n \end{align*} $$
Thus,
$$E\left[ \sum_{i=1}^{M} n^2_i \right] \leq \frac{n(n-1)}{M} + n \leq \frac{n(n-1)}{n} + n = 2n - 1 < 2n.$$
Applying the Markov Inequality to find the probability we have more than $4n$ items in our hash table, we obtain
$$P \left\{ \sum_{i=1}^{M} n_i^2 > 4n \right\} \leq \frac{E\left[ \sum_{i=1}^{M} n_i^2 \right]}{4n} < \frac{2n}{4n} = \frac{1}{2}$$
Thus, after a few attempts we will produce a primary hash table of size $M < 2n$ for which $\sum_{i=1}^{M} n_i^2 < 4n$, and if we choose primes $m_i < 2n_i^2$ then $\sum_{i=1}^{M} m_i < 8n$. In this way the size of the primary hash table plus the sizes of all secondary hash tables satisfies
$M + \sum_{i=1}^{M}m_i < 2n + 8n = 10n.$
Let us consider a Gaussian random variable $X$ with a zero mean ($E[X] = \mu = 0$) and variance $V[X] = v = \frac{1}{2\pi}$; then its density is given by
$$f_X(x) = \frac{1}{\sqrt{2\pi v}}e^{-\frac{x^2}{2v}} = e^{-\pi x^2}$$
Assume that we use such $X$ to generate independently the coordinates $\langle x_1, ..., x_d \rangle$ of a random vector $\vec{x}$ from $\mathbb{R}^d$. Then $E[X^2] = E[(X - E[X])^2] = V[X]$, and $E\left[ \frac{x_1^2 + ... + x_d^2}{d} \right] = dV[X]/d = V[X]$.
As a result, given the length $|x| = \sqrt{x_1^2 + ... + x_d^2}$, the expected value of the square of the length of $\vec{x}$ is $E[|x|^2] = \frac{d}{2\pi}$. So, on average, $|x| \approx \frac{\sqrt{d}}{\sqrt{2\pi}} = \Theta(\sqrt{d})$, and if $d$ is large, this holds with high probability.
If we choose 2 points independently, then
$$E[\langle x, y \rangle] = E[x_1 y_1 + ... + x_d y_d] = d E[XY] = d E[X] E[Y] = 0.$$
Hence, the expected value of the scalar product of any two vectors with randomly chosen coordinates is zero.
Hence, vectors with randomly chosen coordinates are, with high probability, nearly orthogonal to each other.
Most of the volume of a high dimensional ball is near its surface and near any equatorial slice.
For any $\epsilon$, $0 < \epsilon < 1$, and any integer $n$, assume that $k$ satisfies $k > \frac{3}{\gamma \epsilon^2} \ln n$. Then for any set of $n$ points given by the vectors $v_1, ..., v_n$ in $\mathbb{R}^d$, with probability at least $1 - 3/(2n)$, the random projection $f': \mathbb{R}^d \rightarrow \mathbb{R}^k$ has the property that for ALL pairs of points $v_i, v_j$
$$\big|\, | f'(v_i) - f'(v_j)| - |v_i - v_j| \,\big| \leq \epsilon \, | v_i - v_j|.$$
Thus, $f'(v)$ "almost" preserves distances between points given by vectors $v_i$ despite reduction of dimensionality from $d >> k$ to only $k$.
The Page Rank algorithm aims to solve the problem of how to order webpages.
Consider all the webpages $P_i$ on the entire internet as vertices of a directed graph, where a directed edge $P_i \rightarrow P_j$ exists if $P_i$ has a link to $P_j$.
Notation: $\rho(P)$ denotes the rank of a web page $P$, and $\#(P)$ denotes the number of outgoing links of $P$.
A web page $P$ should have a high rank only if it is pointed at by many pages $P_i$ which themselves have a high rank and which do not have too many outgoing links (so that their rank is not spread too thinly).
So we want the following system of equations to be satisfied:
$$\left\{\rho(P) = \sum_{P_i \rightarrow P} \frac{\rho(P_i)}{\#(P_i)} \right\}_{P \in WWW}$$
We have a large sparse matrix $G_1$ of size $M \times M$, where $M = |WWW|$
$$ G_1 = \begin{pmatrix} g(1, 1) & \ldots & g(1, j) & \ldots & g(1, M) \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ g(i, 1) & \ldots & g(i, j) & \ldots & g(i, M) \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ g(M, 1) & \ldots & g(M, j) & \ldots & g(M, M) \end{pmatrix} \ $$
$$
g(i, j) =
\begin{cases}
\frac{1}{\#(P_i)} & \text{ if } P_i \rightarrow P_j \\
0 & \text{otherwise}
\end{cases}
$$
However, because most pages link to only a small number of other pages, $G_1$ is sparse; a typical row looks like
$$ G_1 = \begin{pmatrix} & & \vdots & & & & \vdots & & \\ & & \vdots & & & & \vdots & & \\ \ldots & 0 & \frac{1}{k} & 0 & \ldots & 0 & \frac{1}{k} & 0 & \ldots \\ & & \vdots & & & & \vdots & & \\ & & \vdots & & & & \vdots & & \\ \end{pmatrix} \ $$
where $k$ is equal to $\#(P_i)$, the number of pages which the page $P_i$ has links to. Hence, the system of linear equations can be represented as
$$\mathbf{r}^\intercal = \mathbf{r}^\intercal G_1$$
where
$$\mathbf{r}^\intercal = (\rho(P_1), \rho(P_2), ..., \rho(P_M)).$$
Note that (ignoring dangling pages for the moment) $G_1$ is a Markov matrix, and $\mathbf{r}^\intercal$ is a left eigenvector of $G_1$ corresponding to the eigenvalue 1. Thus, finding the ranks of web pages is reduced to finding such an eigenvector of $G_1$.
If we model a random walk on the graph of this matrix, there are 2 issues:
What should we do when we get to a webpage with no outgoing links?
Solution: Jump to a randomly chosen webpage whenever the walk is at a page with no outgoing links.
In the display below, the all-zero row of $G_1$ corresponds to a dangling page with no outgoing links; $G_2$ fixes this by making that page point to every page with equal probability $\frac{1}{M}$. Such a matrix is row stochastic, meaning that each row sums up to 1.
$$ G_1 = \begin{pmatrix} & & \vdots & & & & \vdots & & \\ \ldots & 0 & 0 & 0 & \ldots & 0 & 0 & 0 & \ldots \\ & & \vdots & & & & \vdots & & \\ \ldots & 0 & \frac{1}{\#(P_i)} & 0 & \ldots & 0 & \frac{1}{\#(P_i)} & 0 & \ldots \\ & & \vdots & & & & \vdots & & \\ \end{pmatrix} \\ \\ G_2 = \begin{pmatrix} & & \vdots & & & & \vdots & & \\ \ldots & \frac{1}{M} & \frac{1}{M} & \frac{1}{M} & \ldots & \frac{1}{M} & \frac{1}{M} & \frac{1}{M} & \ldots \\ & & \vdots & & & & \vdots & & \\ \ldots & 0 & \frac{1}{\#(P_i)} & 0 & \ldots & 0 & \frac{1}{\#(P_i)} & 0 & \ldots \\ & & \vdots & & & & \vdots & & \\ \end{pmatrix} \ $$
If $T$ is the walk length and $N(P)$ is the number of visits to page $P$, then as $T \rightarrow \infty$ the values $N(P)/T$ should converge. This becomes an issue if the graph is disconnected, or if the graph is bipartite (in which case the probability of being on one side depends on whether the number of steps taken is odd or even).
Solution: Randomly jump to a new page after some time.
This transformation does not change the rows corresponding to dangling webpages: $\alpha / M + (1 - \alpha) / M = 1/M$
$$ G = \begin{pmatrix} & & \vdots & & & & \vdots & & \\ \ldots & \frac{1}{M} & \frac{1}{M} & \frac{1}{M} & \ldots & \frac{1}{M} & \frac{1}{M} & \frac{1}{M} & \ldots \\ & & \vdots & & & & \vdots & & \\ \ldots & \frac{1 - \alpha}{M} & \frac{\alpha}{\#(P_i)} + \frac{1 - \alpha}{M} & \frac{1 - \alpha}{M} & \ldots & \frac{1 - \alpha}{M} & \frac{\alpha}{\#(P_i)} + \frac{1 - \alpha}{M} & \frac{1 - \alpha}{M} & \ldots \\ & & \vdots & & & & \vdots & & \\ \end{pmatrix} \ $$
These two solutions allow the ranks to converge, because the model works like a well behaved Markov Chain.
To implement the first solution:
Let $\mathbf{d}$ be 1 at positions $i$ which correspond to dangling webpages and 0 elsewhere:
$$\mathbf{d}^\intercal = (0 \ ... \ 0 \ 1 \ 0 \ ... \ 0 \ 1 \ 0 \ ... \ 0 \ 1 \ 0 \ ... \ 0)$$
Let $\mathbf{e}$ be 1 everywhere: $\mathbf{e}^\intercal = (1 \ 1 \ ... \ 1)$.
Then $$G_2 = G_1 + \frac{1}{M} \mathbf{d} \mathbf{e}^\intercal$$
To implement the second solution as well, we get $G$ as:
$$G = \alpha G_2 + \frac{1 - \alpha}{M} \mathbf{e} \mathbf{e}^\intercal = \alpha \left( G_1 + \frac{1}{M} \mathbf{d} \mathbf{e}^\intercal \right) + \frac{1 - \alpha}{M} \mathbf{e} \mathbf{e}^\intercal$$
$G$ is no longer sparse, but the vector matrix product $\mathbf{x}^\intercal G$ for vectors $\mathbf{x}$ whose coordinates sum up to 1 can be decomposed as:
$$\mathbf{x}^\intercal G = \alpha \mathbf{x}^\intercal G_1 + \frac{1}{M} \left(1 - \alpha (1 - \mathbf{x}^\intercal \mathbf{d}) \right) \mathbf{e}^\intercal$$
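A sketch of the resulting power iteration, using the decomposition above so that only the sparse $G_1$ is ever multiplied (scipy is assumed to be available; the tiny three-page example and the tolerance are arbitrary):

```python
import numpy as np
from scipy.sparse import csr_matrix

def pagerank(G1, alpha=0.85, tol=1e-10):
    """Iterate x <- alpha * x^T G1 + (1/M) * (1 - alpha * (1 - x^T d)) * e^T.

    G1: sparse matrix whose rows hold 1/#(P_i) on outgoing links (all-zero rows = dangling pages).
    """
    M = G1.shape[0]
    dangling = np.asarray(G1.sum(axis=1)).ravel() == 0   # the indicator vector d
    x = np.full(M, 1.0 / M)                              # start from the uniform distribution
    while True:
        x_new = alpha * (G1.T @ x) + (1.0 - alpha * (1.0 - x[dangling].sum())) / M
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new

# Tiny example: page 0 -> 1, page 1 -> 0 and 2, page 2 is dangling.
rows, cols, vals = [0, 1, 1], [1, 0, 2], [1.0, 0.5, 0.5]
G1 = csr_matrix((vals, (rows, cols)), shape=(3, 3))
print(pagerank(G1))
```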
A (finite) Markov Chain is given by a finite set of states $S = \{P_i\}_{i \leq M}$ and by a row stochastic matrix $G$. The process can start at $t = 0$ in any of its states and continue its evolution by going at every discrete moment of time from its present state to another randomly chosen state.
Consider a simple 3-state Markov chain representing weather conditions:
Transition Matrix G:
$$ G = \begin{pmatrix} 0.7 & 0.2 & 0.1 \\ 0.3 & 0.4 & 0.3 \\ 0.2 & 0.3 & 0.5 \end{pmatrix} $$
Each row sums to 1 (row stochastic property), and $G_{ij}$ represents the probability of transitioning from state $i$ to state $j$.
The model of the random walk on the web graph is an example of a Markov Chain: the states are the web pages and the transition matrix is $G$.
The general theorem on (irreducible, aperiodic) Markov chains implies that there is a unique stationary distribution $\mathbf{q}$ satisfying $\mathbf{q}^\intercal = \mathbf{q}^\intercal G$, and that $\mathbf{q}_0^\intercal G^k \rightarrow \mathbf{q}^\intercal$ for any initial distribution $\mathbf{q}_0$; this stationary distribution is taken as the vector of PageRanks.
How close to 1 should $\alpha$ be?
The expected length $lh$ of surfing between two teleportations can be found as:
$$ \begin{align*} E(lh) &= 0(1 - \alpha) + \alpha(1 - \alpha) + ... + k \alpha^k (1 - \alpha) + ... \\ &= \alpha(1 - \alpha)(1 + 2 \alpha + ... + k \alpha^{k - 1} + ...) \\ &= \frac{\alpha}{1 - \alpha}. \end{align*} $$
Google uses $\alpha = 0.85$ (obtained empirically), giving an expected surfing length $\frac{0.85}{1- 0.85}\approx 5.7$, very close to 6 (coming from the idea of six degrees of separation), and it takes approx $50 - 100$ iterations for convergence.
Larger values of $\alpha$ produce a more accurate representation of the "importance" of a webpage, but the convergence slows down quickly.
The error of the approximation of $\mathbf{q}$ by $\mathbf{q_0}G^k$ decreases approximately as $\alpha^k$. More importantly, increasing $\alpha$ increases the sensitivity of the resulting PageRank to small changes in the link structure. This is undesirable, because internet content and structure change daily, while PageRank should change slowly.
A Hidden Markov Model is a Markov chain whose states are not directly observable; instead, each (hidden) state produces an observable outcome, a manifestation, according to some emission probability distribution.
An example of its application: given a sequence of manifestations, how can we determine which sequence of hidden states caused it?
We do this in a way that maximises the likelihood that we are correct, which is what the Viterbi algorithm does. It is a dynamic programming algorithm that produces the most likely estimate.
Likelihood is, in a sense, probability read in reverse: we ask how probable the observed outcomes are under each candidate sequence of hidden states.
Assume we are given a sequence of observable outcomes $(y_1, y_2, ..., y_T)$. The goal is to find the sequence $(x_1, x_2, ..., x_T)$ of states of the Markov chain for which the likelihood that it caused the observed sequence of outcomes is the highest.
We solve all subproblems $S(i, k)$ for every $1 \leq i \leq T$ and every $1 \leq k \leq K$, where $K$ is the number of states:
Subproblem $S(i, k)$: find the sequence of states $(x_1, ..., x_i)$ such that $x_i = s_k$ and such that the probability of observing the outcomes $(y_1, ..., y_i)$ is maximal.
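A compact sketch of the Viterbi dynamic programme for these subproblems (the dictionary-based model format is a choice of this sketch; log-probabilities are used to avoid numerical underflow, and all referenced probabilities are assumed to be nonzero):

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence for the observed outcomes `obs`.

    start_p[s], trans_p[s][s'], emit_p[s][o] are (nonzero) probabilities; obs is non-empty.
    """
    # best[i][s] = log-probability of the best state sequence ending in s after i+1 outcomes
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for i in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[i - 1][r] + math.log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            best[i][s] = score + math.log(emit_p[s][obs[i]])
            back[i][s] = prev                      # remember the best predecessor
    # Trace back from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for i in range(len(obs) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))
```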
A good intro to Hidden Markov Models and the Viterbi Algorithm.
The main purpose of recommender systems is to recommend content / products to users that they may enjoy.
Two major kinds of recommender systems: collaborative filtering systems, which recommend based on the ratings of users with similar tastes (or of similar items), and content based systems, which recommend based on features of the items themselves.
We can construct a sparsely populated table of ratings $R$; the rows correspond to movies, the columns to users. The entry $r(j, i)$ of the table, if non-empty, represents the rating user $U_i$ gave to movie $M_j$. Each entry may have some rating in the range $0 - 5$ (or a similar relatively small rating range, usually with at most 10 or so levels).
We replace these raw rating integers with more informative numbers: letting $\bar{r}$ be the mean of all available ratings, $v_i$ the bias of user $U_i$ (how much their average rating exceeds $\bar{r}$) and $\mu_j$ the bias of movie $M_j$, we work with the centred ratings $\tilde{r}(j, i) = r(j, i) - \bar{r} - v_i - \mu_j$, collected in a table $\tilde{R}$.
One of the most frequently used measures of similarity of users is the cosine similarity measure.
If we want to compare two users $U_i$ and $U_k$, we find all movies that both users have rated and delete all other entries. We obtain two column vectors $\vec{u}_i$ and $\vec{u}_k$, and the similarity of the two users is measured by the cosine of the angle between these two vectors:
$$ \begin{align*} \text{sim}(U_i, U_k) &= \cos(u_i, u_k) \\ &= \frac{\langle \vec{u}_i, \vec{u}_k \rangle}{|\vec{u}_i| \cdot |\vec{u}_k|} \end{align*} $$
We can now predict the rating a user $U_i$ would give to a movie $M_j$ they have not seen as follows: take the $L$ users most similar to $U_i$ who have rated $M_j$ and form a similarity-weighted average of their centred ratings of $M_j$, analogous to the item based formula below.
We can in a similar way estimate the similarity of two movies, now working on the rows of $\tilde{R}$ (instead of its columns). We predict the rating user $U_i$ would give to the movie $M_j$ as
$$ \text{pred} = \bar{r} + v_i + \mu_j + \frac{\sum_{1 \leq l \leq L} \text{sim}(M_j, M_{n_l}) \tilde{r}(n_l, i)}{\sum_{1 \leq l \leq L} |\text{sim}(M_j, M_{n_l})|} $$
and recommend the movie $M_j$ that has the highest value of $\text{pred}(j, i)$.
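A sketch of the item based prediction formula above (the matrix layout, the `rated` mask and the minimum-overlap threshold are assumptions of this sketch; the bias terms $\bar{r} + v_i + \mu_j$ would be added back to the returned value):

```python
import numpy as np

def cosine_sim(x, y, mask):
    """Cosine similarity of two rating vectors restricted to commonly rated entries."""
    x, y = x[mask], y[mask]
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(x @ y) / denom if denom > 0 else 0.0

def predict_item_based(R_tilde, rated, j, i, L=10):
    """Predict the centred rating tilde-r(j, i) of movie j by user i.

    R_tilde: matrix of centred ratings (movies x users); rated: boolean mask of known entries.
    """
    sims = []
    for n in range(R_tilde.shape[0]):
        if n != j and rated[n, i]:                 # candidate movies already rated by user i
            common = rated[j] & rated[n]           # users who rated both movies
            if common.sum() >= 2:                  # require a minimal overlap
                sims.append((cosine_sim(R_tilde[j], R_tilde[n], common), n))
    top = sorted(sims, reverse=True)[:L]           # the L most similar movies
    num = sum(s * R_tilde[n, i] for s, n in top)
    den = sum(abs(s) for s, n in top)
    return num / den if den > 0 else 0.0
```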
This method relies on several heuristics, e.g. the choice of the similarity measure and of the number $L$ of most similar movies (or users) taken into account.
There are features movies possess which appeal to different tastes and which determine how much a user likes a movie, e.g. "action", "comedy", "romance", "famous actors", "special effects".
Enumerate these features as $f_1, f_2, ..., f_N$, where $N$ is of the order of a few tens to a few hundreds. A movie can have each of these features, say $f_i$, to an extent $e_i$, where, say, $0 \leq e_i \leq 10$.
Each movie $M_j$ has a vector $\vec{e}^j$ of length $N$, and we can form a matrix $F$ such that the rows of $F$ correspond to movies and the columns correspond to features. E.g., if we have $\#M$ movies:
$$ F = \begin{pmatrix} F_{1, 1} & \ldots & F_{1, N} \\ \vdots & \ddots & \vdots \\ F_{\#M, 1} & \ldots & F_{\#M, N} \end{pmatrix} $$
resulting in a very tall matrix, since the number of movies is much larger than the number of features.
We can associate each user $U_i$ with a column vector $\vec{l}^i$ s.t. its $m^{th}$ coordinate is a number, $0 \leq \vec{l}^i_m \leq 10$ which represents how much a user likes a feature $f_m$ in a movie.
We can now form a matrix $L$ whose rows correspond to features and whose columns correspond to users: $$ L = \begin{pmatrix} L_{1, 1} & \ldots & L_{1, \#U} \\ \vdots & \ddots & \vdots \\ L_{N, 1} & \ldots & L_{N, \#U} \end{pmatrix} $$ resulting in a very wide matrix, since the number of users is much larger than the number of features.
Assume that we have access to $L$ and $F$. Then to predict how much $U_i$ likes $M_j$, we evaluate
$$E(j, i) = \sum_{1 \leq m \leq N} (\vec{e}^j)_m (\vec{l}^i)_m = \langle \vec{e}^j, \vec{l}^i \rangle.$$
and as a result we could generate the full matrix of predictions $E = F \times L$, a very large matrix. However, the issue is that we cannot directly determine these feature values: we do not know which few dozen features (out of a few hundred candidates) are relevant, nor how much each user values them.
Let $N$ be the number of features we want (typically $20 \leq N \leq 200$), $\#M$ be the number of movies, and $\#U$ be the number of users.
Fill matrices $F$ of size $\#M \times N$ and $L$ of size $N \times \#U$ with variables $F(j, m)$ and $L(m, i)$ whose values have yet to be determined.
Then solve the least squares problem in the variables $$\{ F(j, m): 1 \leq j \leq \#M; 1 \leq m \leq N \} \cup \{ L(m, i) : 1 \leq m \leq N; 1 \leq i \leq \#U \}$$
minimise
$$S(\vec{F}, \vec{L}) = \sum_{(j, i):R(j, i)} \left( \sum_{1 \leq m \leq N} F(j, m) \cdot L(m, i) - R(j, i) \right)^2$$
We could attempt to find the minimum by setting the partial derivatives of $S(\vec{F}, \vec{L})$ to 0. However, the resulting equations are not linear in the unknowns, because each term involves products $F(j, m) \cdot L(m, i)$ of two variables, so they cannot be solved directly.
Solution: set all variables $F(j, m)$ to the same fixed initial value $F^{(0)}(j, m)$, e.g. a median value of the rating scale.
Now solve the following Least Squares problem for the variables $$\{ L(m, i) : 1 \leq m \leq N; 1 \leq i \leq \#U \}$$ minimise $$\sum_{(j, i): R(j, i)} \left(\sum_{1 \leq m \leq N} F^{(0)}(j, m) \cdot L(m, i) - R(j, i) \right)^2$$ which leads to a system of linear equations: since the $F^{(0)}(j, m)$ are fixed constants, setting the partial derivatives with respect to the $L(m, i)$ to 0 gives equations that are linear in the $L(m, i)$.
Let $L^{(0)}(m, i)$ be the solutions to such a Least Squares problem.
We now solve the following Least Squares problem in variables $$\{ F(j, m): 1 \leq j \leq \#M; 1 \leq m \leq N \}$$ minimise $$\sum_{(j, i):R(j, i)} \left( \sum_{1 \leq m \leq N} F(j, m) \cdot L^{(0)}(m, i) - R(j, i) \right)^2$$
Now, we keep alternating between taking either $\{ L(m, i) : 1 \leq m \leq N; 1 \leq i \leq \#U \}$ or $\{ F(j, m): 1 \leq j \leq \#M; 1 \leq m \leq N \}$ as free variables, fixing the values of the other set from the previously obtained solution to the corresponding Least Squares problem.
This method is sometimes called the Method of Alternating Projections; in this setting it is also known as Alternating Least Squares.
We stop such iterations when $$\sum_{(j, m)}(F^{(k)}(j, m) - F^{(k - 1)}(j, m))^2 + \sum_{(m, i)}(L^{(k)}(m, i) - L^{(k - 1)}(m, i))^2$$ becomes smaller than an accuracy threshold $\epsilon > 0$.
After we obtain the values $F^{(k)}(j, m)$ and $L^{(k)}(m, i)$ from the last iteration $k$, we form the corresponding matrices $\tilde{F}$ of size $\#M \times N$ and $\tilde{L}$ of size $N \times \#U$ as $$ \begin{align*} \tilde{F} &= \left( F^{(k)}(j, m) : 1 \leq j \leq \#M; 1 \leq m \leq N \right) \\ \tilde{L} &= \left( L^{(k)}(m, i) : 1 \leq m \leq N; 1 \leq i \leq \#U \right) \\ \end{align*} $$
We finally set $E = \tilde{F} \times \tilde{L}$ as the final matrix of predicted ratings of all movies by all users, where $E(j, i)$ is the prediction of the rating of movie $M_j$ by user $U_i$.
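A sketch of the alternating least squares loop (random initial factors are used here instead of the constant median start described above; `numpy.linalg.lstsq` solves each small least squares subproblem):

```python
import numpy as np

def als(R, mask, N=20, iters=50, eps=1e-4, seed=0):
    """Alternating least squares for R ~= F @ L, fitting only entries where mask is True.

    R: (#M x #U) rating matrix, mask: boolean matrix of known entries.
    Returns F of size (#M x N) and L of size (N x #U).
    """
    rng = np.random.default_rng(seed)
    num_movies, num_users = R.shape
    F = rng.normal(scale=0.1, size=(num_movies, N))
    L = rng.normal(scale=0.1, size=(N, num_users))
    for _ in range(iters):
        F_old, L_old = F.copy(), L.copy()
        # Fix F and solve a small least squares problem for each user's column of L
        for i in range(num_users):
            rows = mask[:, i]
            if rows.any():
                L[:, i] = np.linalg.lstsq(F[rows], R[rows, i], rcond=None)[0]
        # Fix L and solve for each movie's row of F
        for j in range(num_movies):
            cols = mask[j]
            if cols.any():
                F[j] = np.linalg.lstsq(L[:, cols].T, R[j, cols], rcond=None)[0]
        # Stop when the factors barely change between iterations
        change = np.sum((F - F_old) ** 2) + np.sum((L - L_old) ** 2)
        if change < eps:
            break
    return F, L
```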
Clustering algorithms are a type of unsupervised learning used in data science.
There are two kinds of clusters: centre based clusters, which form roughly ball shaped groups around a centre, and clusters of irregular shape, which are better described by the connectivity of nearby points than by a single centre.
A good clustering algorithm should be able to handle both kinds.
There are two common representations of points: as vectors in $\mathbb{R}^d$, or implicitly, via a matrix (equivalently, a weighted graph) of pairwise distances or similarities.
We assume data points are represented as vectors in $\mathbb{R}^d$.
Find a partition $C = \{C_1, ..., C_k \}$ of a set of data points $A = \{\mathbf{a_1}, ..., \mathbf{a_n} \}$ into $k$ clusters, with the corresponding centers $\mathbf{c_1}, ..., \mathbf{c_k}$ which minimises
$$\Phi(C) = \max_{j=1}^k \max_{a \in C_j} d(\mathbf{a}, \mathbf{c_j})$$
This will minimise the radius of the cluster (as the radius of a cluster is the furthest distance from the cluster center).
Find a partition $C = \{C_1, ..., C_k \}$ of a set of data points $A = \{\mathbf{a_1}, ..., \mathbf{a_n} \}$ into $k$ clusters, with the corresponding centers $\mathbf{c_1}, ..., \mathbf{c_k}$ which minimises
$$\Phi(C) = \sum_{j=1}^k \sum_{a \in C_j} d(\mathbf{a}, \mathbf{c_j})$$
$d(\mathbf{a}, \mathbf{c}_j)$ can be any distance metric, such as the Euclidean ($\ell_2$) or the Manhattan ($\ell_1$) distance.
This is the most frequently used center based clustering criterion. It is similar to k-median clustering, except we want to minimise
$$\Phi(C) = \sum_{j=1}^k \sum_{a \in C_j} d(\mathbf{a}, \mathbf{c_j})^2$$
Finding the centroid
Writing each point as
$$\mathbf{a_i} = (a_{i1}, ..., a_{id}),$$
let $\mathbf{c} = (c_1, ..., c_d)$ where, for all $1 \leq k \leq d$,
$$c_k = (a_{1k} + ... + a_{nk}) / n.$$
As a result, $c_k$ is the arithmetic mean of the $k^{th}$ coordinates of all the points $\{ \mathbf{a}_1, ..., \mathbf{a}_n \}$. Hence, $\mathbf{c}$ is called the centroid of the set of points $\{ \mathbf{a}_1, ..., \mathbf{a}_n \}$.
Finding the distance
We denote by $\mathbf{x} \cdot \mathbf{y}$ the scalar product of vectors $\mathbf{x}$ and $\mathbf{y}$, and by $|| \mathbf{x} ||$ the norm of a vector $\mathbf{x}$, i.e.
$$\mathbf{x} \cdot \mathbf{y} = \sum_{i=1}^d x_i y_i \quad \text{and} \quad ||\mathbf{x}|| = \sqrt{\sum_{i=1}^d x_i^2} = \sqrt{\mathbf{x} \cdot \mathbf{x}}$$
Note that $|| \mathbf{x} - \mathbf{y} ||$ is the Euclidean distance of points $\mathbf{x}$ and $\mathbf{y}$:
$$|| \mathbf{x} - \mathbf{y} || = \sqrt{\sum_{i=1}^d(x_i - y_i)^2}$$
Theorem
Let $A = \{\mathbf{a_1}, ..., \mathbf{a_n} \}$ be a set of points and $\mathbf{x}$ be another point, all in $\mathbb{R}^d$. Let also $\mathbf{c}$ be the centroid of $A$. Then
$$\sum_{i=1}^n || \mathbf{a_i} - \mathbf{x} ||^2 = \sum_{i=1}^n || \mathbf{a_i} - \mathbf{c} ||^2 + n||\mathbf{c} - \mathbf{x}||^2$$
Corollary
Let $A = \{\mathbf{a_1}, ..., \mathbf{a_n} \}$ be a set of points and $\mathbf{x}$ be another point, all in $\mathbb{R}^d$. Then
$$D(x) = \sum_{i=1}^n || \mathbf{a_i} - \mathbf{x} ||^2$$
is minimised when $\mathbf{x}$ is the centroid $\mathbf{c} = \frac{1}{n} \sum_{i=1}^n \mathbf{a_i}$.
Thus, suppose we want to find a partition of a set of points $A$ into $k$ disjoint components $A = \bigcup_{j=1}^k A_j$ and $k$ points $\mathbf{x_1}, ..., \mathbf{x_k}$ such that the sum
$$\sum_{j=1}^k \sum_{\mathbf{a_i} \in A_j} || \mathbf{a_i} - \mathbf{x_j} ||^2$$
is as small as possible. Then, whatever such an optimal partition $\{ A_j : 1 \leq j \leq k \}$ might be, the points $\mathbf{x_j}$ must be the centroids $\mathbf{c_j}$ of the sets $A_j$.
Hence, finding the $k$ clusters is equivalent to minimising
$$\sum_{m=1}^k \frac{1}{2|A_m|} \sum_{\mathbf{a_i}, \mathbf{a_j} \in A_m} || \mathbf{a_i} - \mathbf{a_j} ||^2$$
Solving the optimal k-means clustering is NP hard, so we use approximate algorithms.
The best known approximate k-means clustering algorithm is Lloyd's algorithm.
At every round $p$ of its loop, Lloyd's algorithm reduces the size of
$$\sum_{m=1}^k \sum_{\mathbf{a}_j \in A_m^{(p)}} || \mathbf{a}_j - \mathbf{c}_m^{(p)} ||^2$$
where $A_m^{(p)}$ are the "temporary" clusters and $\mathbf{c}^{(p)}_m$ is the "temporary" centre of cluster $A_m^{(p)}$ at round $p$ of the loop.
However, the algorithm may stop at a local minimum and not the global minimum.
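A minimal sketch of Lloyd's loop (uniformly random initial centres are an assumption of this sketch; smarter seeding, such as k-means++, is a common refinement):

```python
import numpy as np

def lloyd(points, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate nearest-centre assignment and centroid recomputation."""
    rng = np.random.default_rng(seed)
    centres = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign every point to its nearest centre
        dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centre as the centroid of its (temporary) cluster
        new_centres = np.array([
            points[labels == m].mean(axis=0) if np.any(labels == m) else centres[m]
            for m in range(k)
        ])
        if np.allclose(new_centres, centres):   # a (possibly local) minimum has been reached
            break
        centres = new_centres
    return labels, centres
```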
Ward's algorithm is a greedy agglomerative algorithm for the k-means objective:
Start with every point $\mathbf{a}_i$ in its own cluster.
While the number of clusters is larger than $k$ repeat:
Find two clusters $C$ and $C'$ such that $$\text{cost}(C \cup C') - \text{cost}(C) - \text{cost}(C')$$ is as small as possible, where $\text{cost}(C)$ is the sum of squared distances from the points of $C$ to its centroid, and replace them with the single merged cluster $C \cup C'$ with its centroid as its centre.
There are several ways to find non centre based clusters.
Rather than representing a set of data points $A$ with their locations, we represent them as an undirected weighted graph $G = (V, E)$.
From here there are several ways to cluster the points
Recall that the $n \times n$ diagonal matrix $D$ has the degrees $d_i$ of vertices $v_i$ on its diagonal, where $d_i = \sum_{j=1}^n w_{ij}$.
The (unnormalised) graph Laplacian matrix $L$ is defined as
$$L = D - W$$
where $W = (w_{ij})^n_{i, j = 1}$.
Clearly $L$ is symmetric and does not depend on $w_{ii}$, $1 \leq i \leq n$. Graph Laplacians are crucial for spectral clustering.
A matrix $M$ of size $n \times n$ is positive semi-definite if for all vectors $f \in \mathbb{R}^n$ we have
$$f^\intercal M f \geq 0$$
A symmetric matrix is positive semi-definite iff all of its eigenvalues are real and non-negative.
The matrix $L = D - W$ has the following properties: for every $f \in \mathbb{R}^n$, $f^\intercal L f = \frac{1}{2}\sum_{i, j = 1}^{n} w_{ij}(f_i - f_j)^2 \geq 0$, so $L$ is symmetric and positive semi-definite; its smallest eigenvalue is 0, with the constant vector $\mathbf{1}$ as an eigenvector; and the multiplicity of the eigenvalue 0 equals the number of connected components of the graph.
(Missing some Spectral Graph Theory)
Spectral Clustering Algorithm:
(Missing application of Spectral Clustering as graph partitioning)
The FFT computes the Discrete Fourier Transform (DFT) of a length-$n$ sequence in $O(n \log n)$ time, and the inverse FFT (IFFT) computes the inverse DFT (IDFT) in the same time. The benefit of finding the DFT or IDFT of some data is that it represents the data in another form, one which can often be manipulated or used more easily.
(Missing a ton of theory)
Let $A = \langle A_0, A_1, ..., A_{n-1} \rangle$ and $B = \langle B_0, B_1, ..., B_{m-1} \rangle$ be two sequences of real or complex numbers. We can now form two associated polynomials $P_A(x)$ and $P_B(x)$ with coefficients given by the sequences $A$ and $B$. Finding the product of these polynomials, $P_C(x) = P_A(x) \cdot P_B(x)$, can be done directly in $O(m \times n)$ time. From the product we can read off the sequence of its coefficients; this sequence, of length $m + n - 1$, is the linear convolution of the sequences $A$ and $B$. Using the FFT algorithm, however, we can find the linear convolution of these two sequences in time $O((m + n) \log_2(m + n))$.
An example application is applying Gaussian smoothing to a noisy signal, which amounts to convolving the signal with a Gaussian kernel.
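A sketch of linear convolution via the FFT as described above (numpy's FFT routines are assumed; zero-padding both sequences to length $m + n - 1$ avoids circular wrap-around):

```python
import numpy as np

def fft_convolve(a, b):
    """Linear convolution of two real sequences in O((m + n) log(m + n)) time."""
    size = len(a) + len(b) - 1
    # Multiply the DFTs of the zero-padded sequences, then invert
    spectrum = np.fft.rfft(a, size) * np.fft.rfft(b, size)
    return np.fft.irfft(spectrum, size)

# Coefficients of (1 + 2x + 3x^2)(4 + 5x): expect [4, 13, 22, 15]
print(np.round(fft_convolve([1, 2, 3], [4, 5]), 6))
```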
Singular Value Decomposition is a way of representing a very large matrix $A$ as
$$A = \sum_{i=1}^r \sigma_i u_i v_i^\intercal = UDV^\intercal$$
(Insert some more theory)
To find $\mathbf{v}_1$ and $\mathbf{u}_1$: start with a random unit vector $\mathbf{x}$ and repeatedly replace $\mathbf{x}$ by $A^\intercal (A \mathbf{x})$, normalising after each step; $\mathbf{x}$ converges to $\mathbf{v}_1$ (provided $\sigma_1 > \sigma_2$), after which $\sigma_1 = |A \mathbf{v}_1|$ and $\mathbf{u}_1 = A\mathbf{v}_1 / \sigma_1$.
This method is called the Power Method and it is fast if $A$ and thus also $A^\intercal$ are sparse.
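A sketch of the power method just described (the random start vector, iteration cap and tolerance are arbitrary choices of this sketch; convergence requires $\sigma_1 > \sigma_2$):

```python
import numpy as np

def top_singular_triple(A, iters=1000, tol=1e-12, seed=0):
    """Power method for the top singular triple (sigma_1, u_1, v_1) of A."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = A.T @ (A @ v)            # one application of B = A^T A, cheap if A is sparse
        w /= np.linalg.norm(w)
        if np.linalg.norm(w - v) < tol:
            v = w
            break
        v = w
    sigma = np.linalg.norm(A @ v)
    u = (A @ v) / sigma
    return sigma, u, v

A = np.array([[3.0, 1.0], [1.0, 3.0], [0.0, 2.0]])
sigma, u, v = top_singular_triple(A)
print(sigma, np.linalg.svd(A, compute_uv=False)[0])   # the two values should agree
```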
A good series on SVD.
The signal to interference ratio $SIR_i$ at receiver $R_i$ is given by
$$SIR_i = \frac{G_{ii}p_i}{\sum_{j: j \neq i}G_{ij}p_j + \eta_i}$$
where $G_{ij}$ is the gain from transmitter $T_j$ to receiver $R_i$, $p_j$ is the power of transmitter $T_j$, and $\eta_i$ is the noise received by receiver $i$ from the environment. $SIR_i$ determines the capacity of the channel $C_{ii}$. Each pair of a transmitter $T_i$ and a receiver $R_i$ needs at least some channel capacity to carry information. To achieve this, it needs $SIR_i \geq \gamma_i$.
We will see later how $\gamma_i$ is determined in practice. However, we are interested in:
$$\text{minimise } \sum_{j=1}^n p_j$$
$$\text{subject to constraints } \frac{G_{ii} p_i}{\sum_{j: j \neq i}G_{ij} p_j + \eta_i} \geq \gamma_i, \quad 1 \leq i \leq n$$
By rearranging each constraint, we can turn the original problem into a linear program:
$$\text{minimise } \sum_{j=1}^n p_j$$
$$\text{subject to constraints } p_i - \gamma_i \sum_{j:j \neq i} \frac{G_{ij}}{G_{ii}} p_j \geq \frac{\gamma_i \eta_i}{G_{ii}}, \quad 1 \leq i \leq n, \quad p_j > 0$$
We can then convert this into matrix form: let $D = \text{diag}(\gamma_1, ..., \gamma_n)$, let $F$ be the matrix with entries $F_{ij} = G_{ij}/G_{ii}$ for $j \neq i$ and $F_{ii} = 0$, and let $\mathbf{v}$ be the vector with entries $v_i = \gamma_i \eta_i / G_{ii}$:
$$\text{minimise } \mathbf{1}^\intercal \mathbf{p}$$
$$\text{subject to constraints } (I - DF)\mathbf{p} \geq \mathbf{v}, \quad \mathbf{p} > 0$$
This will have a feasible solution if we are not demanding excessively large $\gamma_i's$.
This is the case when the spectral radius (largest absolute value of the eigenvalues) of matrix $DF$ $\rho(DF)$, satisfies $\rho(DF) < 1$.
Proof: since $\rho(A) < 1$, all eigenvalues of $A$ have absolute value less than 1, and therefore $A^k \rightarrow 0$ as $k \rightarrow \infty$.
Moreover, it is easy to see that
$$(I - A) \sum_{i=0}^k A^i = \sum_{i=0}^k A^i - \sum_{i=1}^{k+1}A^i = I - A^{k+1}$$
Thus,
$$\lim_{k \rightarrow \infty} \left( (I - A) \sum_{i=0}^k A^i \right) = \lim_{k \rightarrow \infty}\left(I - A^{k + 1}\right),$$
which implies
$$(I - A) \sum_{i=0}^\infty A^i = I.$$
This shows that matrix $I - A$ is invertible and that
$$(I - A)^{-1} = \sum_{i=0}^\infty A^i$$
Applying this to the matrix $A = DF$, let $\mathbf{p}^*$ be given by
$$\mathbf{p^*} = (I - DF)^{-1} \mathbf{v} = \sum_{i=0}^\infty (DF)^i \mathbf{v}$$
Then $(I - DF) \mathbf{p^*} = \mathbf{v}$ and the constraint becomes
$$(I - DF) \mathbf{p} \geq (I - DF) \mathbf{p^*}$$
i.e.
$$(I - DF)(\mathbf{p} - \mathbf{p^*}) \geq 0$$
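A small numerical sketch of the above (the gain matrix, noise levels and target SIRs are made-up values; the fixed point iteration $\mathbf{p} \leftarrow DF\mathbf{p} + \mathbf{v}$ computes the partial sums of the Neumann series and converges to $\mathbf{p}^*$ exactly when $\rho(DF) < 1$):

```python
import numpy as np

# Made-up example: 3 transmitter/receiver pairs
G = np.array([[1.0, 0.1, 0.2],
              [0.1, 1.0, 0.1],
              [0.2, 0.2, 1.0]])          # G[i, j]: gain from transmitter j to receiver i
eta = np.array([0.1, 0.1, 0.1])          # receiver noise
gamma = np.array([1.5, 1.5, 1.5])        # target SIRs

D = np.diag(gamma)
F = G / np.diag(G)[:, None]              # F[i, j] = G[i, j] / G[i, i] for j != i
np.fill_diagonal(F, 0.0)
v = gamma * eta / np.diag(G)

assert max(abs(np.linalg.eigvals(D @ F))) < 1   # spectral radius condition holds

# Fixed point iteration p <- DF p + v, i.e. the partial sums of the Neumann series
p = np.zeros(3)
for _ in range(200):
    p = D @ F @ p + v
print("iterative p*:", p)
print("direct    p*:", np.linalg.solve(np.eye(3) - D @ F, v))
```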