SDSC6007 Course 3-Tutorial and HMM
#sdsc6007
Tutorial 1
Problem Setting
- Dynamic System:
- Initial State:
- Cost Function:
- State Space:
- Control Constraints:
- Stochastic Disturbance:
  - If :
  - If or : (probability 1)
Dynamic Programming Recursion Formula
Define the value function $J_k(x)$ as the minimum expected cost starting from state $x$ at time $k$.
Bellman Equation:
$$J_k(x) = \min_{u \in U(x)} \mathbb{E}_{w_k}\big[\, g_k(x, u, w_k) + J_{k+1}\big(f_k(x, u, w_k)\big) \,\big]$$
Boundary Condition: $J_N(x) = g_N(x)$ (end of problem)
Step 1: Time k=3 (Final Step)
Compute for each state:
| $x_3$ | $U(x_3)$ | Optimal Cost | Optimal $u_3$ |
|---|---|---|---|
| 0 | {0,1,2,3,4,5} | 0 | |
| 1 | {-1,0,1,2,3,4} | 0 | |
| 2 | {-2,-1,0,1,2,3} | 0 | |
| 3 | {-3,-2,-1,0,1,2} | 0 | |
| 4 | {-4,-3,-2,-1,0,1} | 0 | |
| 5 | {-5,-4,-3,-2,-1,0} | 0 |
Result: $J_3(x) = 0$ for all $x$
Step 2: Time k=2
Expected value calculation:
- If :
- If or :
For $x_2 = 0$:
| $u$ | $x_2 + u$ | Expected Future Cost | Total Cost |
|---|---|---|---|
| 0 | 0 | ||
| 1 | 1 | ||
| 2 | 2 | ||
| 3 | 3 | ||
| 4 | 4 | ||
| 5 | 5 |
,
For $x_2 = 1$:
| $u$ | $x_2 + u$ | Expected Future Cost | Total Cost |
|---|---|---|---|
| -1 | 0 | ||
| 0 | 1 | ||
| 1 | 2 | ||
| 2 | 3 | ||
| 3 | 4 | ||
| 4 | 5 |
,
Continue computing other states:
: ,
: , (or )
: ,
: , (or , both yield same cost)
Step 3: Time k=1
Use the computed $J_2$ values to calculate $J_1$.
For $x_1 = 0$:
| $u$ | $x_1 + u$ | Expected Future Cost | Total Cost |
|---|---|---|---|
| 0 | 0 | ||
| 1 | 1 |
Result: $J_1(0) = 0$, $u_1^*(0) = 0$
For $x_1 = 1$:
| $u$ | $x_1 + u$ | Expected Future Cost | Total Cost |
|---|---|---|---|
| -1 | 0 | ||
| 0 | 1 |
Result: $J_1(1) = 2$, $u_1^*(1) = -1$
For $x_1 = 2$:
| $u$ | $x_1 + u$ | Expected Future Cost | Total Cost |
|---|---|---|---|
| -2 | 0 | ||
| -1 | 1 |
Result: $J_1(2) = 8$, $u_1^*(2) = -2$
For $x_1 = 3$:
| $u$ | $x_1 + u$ | Expected Future Cost | Total Cost |
|---|---|---|---|
| -3 | 0 | ||
| -2 | 1 | ||
| -1 | 2 |
Result: $J_1(3) = 16.5$, $u_1^*(3) = -2$
For $x_1 = 4$:
After computation: $J_1(4) = 28.5$, $u_1^*(4) = -3$
For $x_1 = 5$:
After computation: $J_1(5) = 42.5$, $u_1^*(5) = -3$
Complete Value Function for Time 1:
| $x_1$ | $J_1(x_1)$ | $u_1^*(x_1)$ |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 2 | -1 |
| 2 | 8 | -2 |
| 3 | 16.5 | -2 |
| 4 | 28.5 | -3 |
| 5 | 42.5 | -3 |
Step 4: Time k=0
For the initial state $x_0 = 5$, compute $J_0(5)$:
| $u$ | $x_0 + u$ | Expected Future Cost | Total Cost |
|---|---|---|---|
| -5 | 0 | ||
| -4 | 1 | ||
| -3 | 2 | ||
| -2 | 3 | ||
| -1 | 4 | ||
| 0 | 5 |
Result: optimal control $u_0^*(5) = -3$ (guides the state toward 2)
Optimal Policy Summary
Optimal policy starting from initial state $x_0 = 5$:
| Time | State | Optimal Control | Explanation |
|---|---|---|---|
| $k=0$ | $x_0 = 5$ | $u_0^* = -3$ | Guides state toward 2 |
| $k=1$ | Depends on $x_1$ | Look up $u_1^*(x_1)$ | From value function table |
| $k=2$ | Depends on $x_2$ | Look up $u_2^*(x_2)$ | From value function table |
| $k=3$ | Path-dependent | $u_3^* = 0$ | Always choose 0 regardless of state |
Minimum Expected Total Cost:
Policy Characteristics:
- Tends to guide the state toward smaller values (since the state cost is minimized at 0)
- Requires balancing control cost and future expected cost
- Stochastic disturbances necessitate consideration of all possible future states
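The backward recursion used in Steps 1–4 can be sketched in code. Since the tutorial's exact dynamics, stage cost, and disturbance distribution were not preserved in these notes, `stage_cost`, `disturbances`, and `next_state` below are illustrative placeholders; only the overall structure (backward induction over a finite horizon, with an expectation over the disturbance) mirrors the tutorial.

```python
# Hedged sketch of finite-horizon stochastic DP; problem data are placeholders.
N = 4                      # horizon: decisions at k = 0, 1, 2, 3
STATES = range(6)          # state space {0, ..., 5}

def controls(x):
    # keep x + u inside the state space, matching the tutorial's control sets
    return [u for u in range(-5, 6) if 0 <= x + u <= 5]

def stage_cost(x, u):
    # placeholder stage cost (assumption, not the tutorial's actual cost)
    return abs(x) + u * u

def disturbances(x, u):
    # placeholder disturbance distribution: (w, probability) pairs
    return [(0, 0.5), (1, 0.25), (-1, 0.25)]

def next_state(x, u, w):
    return min(5, max(0, x + u + w))   # clip back into the state space

# J[k][x] = minimum expected cost-to-go; boundary condition J_N = 0
J = [{x: 0.0 for x in STATES} for _ in range(N + 1)]
policy = [{} for _ in range(N)]

for k in reversed(range(N)):
    for x in STATES:
        best = min(
            (stage_cost(x, u)
             + sum(p_w * J[k + 1][next_state(x, u, w)]
                   for w, p_w in disturbances(x, u)), u)
            for u in controls(x)
        )
        J[k][x], policy[k][x] = best

print(J[0])
print(policy[0])
```

Swapping in the course problem's actual `stage_cost`, disturbance distribution, and dynamics reproduces the tables above.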
Hidden Markov Model (HMM)
Model Description
- A stationary finite-state Markov chain with transition probabilities $p_{ij}$ between states.
- States are hidden (not directly observable). The initial state is $i$ with probability $\pi_i$.
- After each state transition, a value is observed. Given a transition from state $i$ to state $j$, the probability of observing $z$ is $r(z; i, j)$.
- Observations are independent across transitions.
- Transition probabilities and observation probabilities are assumed to be mutually independent.
Symbol Explanation and Model Understanding
- $p_{ij}$: Probability of transitioning from state $i$ to state $j$ (e.g., the probability of the weather changing from “sunny” to “rainy”)
- $\pi_i$: Probability that the system starts in state $i$ (e.g., the probability that the first day is “sunny”)
- $r(z; i, j)$: Probability of observing $z$ during the transition $i \to j$ (e.g., the probability of observing “high humidity” during a “sunny to rainy” transition)
- Independence Assumption: Observation $z_k$ depends only on the current state transition $(x_{k-1}, x_k)$, independent of historical states and observations.
Objective
Given an observation sequence $Z_N = \{z_1, \ldots, z_N\}$, estimate the most probable state sequence $\hat{X}_N = \{\hat{x}_0, \hat{x}_1, \ldots, \hat{x}_N\}$.
Method
To estimate $X_N$, maximize the conditional probability $p(X_N \mid Z_N)$. Since
$$p(X_N \mid Z_N) = \frac{p(X_N, Z_N)}{p(Z_N)},$$
maximizing $p(X_N \mid Z_N)$ is equivalent to maximizing the joint probability $p(X_N, Z_N)$, because $p(Z_N)$ is constant for a given observation sequence.
Joint Probability Derivation
The joint probability of the state sequence $X_N = \{x_0, x_1, \ldots, x_N\}$ and observation sequence $Z_N = \{z_1, \ldots, z_N\}$ is derived as follows.
Using the chain rule and the independence assumptions, then applying the Markov property and observation independence, it recursively simplifies to:
$$p(X_N, Z_N) = \pi_{x_0} \prod_{k=1}^{N} p_{x_{k-1} x_k}\, r(z_k; x_{k-1}, x_k)$$
Detailed Derivation Process
Chain Rule Decomposition:
$$p(X_N, Z_N) = p(x_0) \prod_{k=1}^{N} p(x_k, z_k \mid x_0, \ldots, x_{k-1}, z_1, \ldots, z_{k-1})$$
Applying the Markov Property:
Each future state depends only on the current state:
$$p(x_k \mid x_0, \ldots, x_{k-1}, z_1, \ldots, z_{k-1}) = p_{x_{k-1} x_k}$$
Observation Independence:
Each observation is determined only by the current transition:
$$p(z_k \mid x_0, \ldots, x_k, z_1, \ldots, z_{k-1}) = r(z_k; x_{k-1}, x_k)$$
Joint Probability of One Step:
$$p(x_k, z_k \mid x_0, \ldots, x_{k-1}, z_1, \ldots, z_{k-1}) = p_{x_{k-1} x_k}\, r(z_k; x_{k-1}, x_k)$$
Final Form:
Recursive substitution yields
$$p(X_N, Z_N) = \pi_{x_0} \prod_{k=1}^{N} p_{x_{k-1} x_k}\, r(z_k; x_{k-1}, x_k)$$
Explanation:
- $\pi_{x_0}$: Probability of the starting state $x_0$.
- $p_{x_{k-1} x_k}$: Probability of transitioning from state $x_{k-1}$ to $x_k$ at step $k$.
- $r(z_k; x_{k-1}, x_k)$: Probability of observing $z_k$ given the transition from $x_{k-1}$ to $x_k$.
- The product captures the sequence of state transitions and observations, leveraging independence to decompose the joint probability into a product.
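The factorization can be checked numerically. The two-state chain below (`pi`, `p`, `r`) is a made-up illustration, not course data:

```python
# Hedged toy example of p(X, Z) = pi[x0] * prod_k p_{x_{k-1} x_k} * r(z_k; ...)
pi = {0: 0.6, 1: 0.4}                      # initial state probabilities (assumed)
p = {0: {0: 0.7, 1: 0.3},                  # transition probabilities p_ij (assumed)
     1: {0: 0.4, 1: 0.6}}
# r[(i, j)][z]: probability of observing z on the transition i -> j (assumed)
r = {(i, j): ({"a": 0.9, "b": 0.1} if j == 0 else {"a": 0.2, "b": 0.8})
     for i in (0, 1) for j in (0, 1)}

def joint_probability(states, obs):
    """p(X, Z) for a state path x_0..x_N and observations z_1..z_N."""
    prob = pi[states[0]]
    for k, z in enumerate(obs, start=1):
        i, j = states[k - 1], states[k]
        prob *= p[i][j] * r[(i, j)][z]
    return prob

print(joint_probability([0, 0, 1], ["a", "b"]))
# = 0.6 · (0.7 · 0.9) · (0.3 · 0.8) ≈ 0.0907
```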
Equivalent Optimization Problem
Maximizing $p(X_N, Z_N)$ is equivalent to minimizing the negative log-likelihood:
$$-\ln \pi_{x_0} - \sum_{k=1}^{N} \ln\big(p_{x_{k-1} x_k}\, r(z_k; x_{k-1}, x_k)\big)$$
Principle of Negative Logarithm Conversion
- Monotonicity Principle:
  - The $\ln$ function is monotonically increasing, so maximizing $f$ is equivalent to maximizing $\ln f$
  - Minimizing $-\ln f$ is equivalent to maximizing $f$
- Product to Sum Conversion:
  - $\ln \prod_k a_k = \sum_k \ln a_k$
- Numerical Stability:
  - Probability products can cause floating-point underflow
  - Logarithmic summation avoids multiplying many small values
- Optimization Advantage:
  - The sum form is suitable for dynamic programming
  - Each term can be viewed as a cumulative “path cost”
Note: This conversion transforms the probability product into a sum of logarithmic terms, facilitating optimization (e.g., using dynamic programming algorithms like the Viterbi algorithm). Minimizing this sum finds the state sequence that best explains the observations.
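The numerical-stability point can be demonstrated directly; the probabilities below are random placeholders:

```python
import math
import random

# Multiplying many small probabilities underflows to 0.0 in floating point,
# while summing their logs stays finite and comparable.
random.seed(0)
probs = [random.uniform(0.01, 0.1) for _ in range(500)]

product = 1.0
for q in probs:
    product *= q          # underflows to exactly 0.0 well before 500 factors

log_sum = sum(math.log(q) for q in probs)   # large negative, but finite

print(product)   # 0.0
print(log_sum)
```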
The problem of finding the most probable state sequence (e.g., in hidden Markov models) can be transformed into a shortest path problem with positive weights using the following graph structure:
- Add Virtual Nodes:
  - Add a source node `s` and a sink node `t`.
- Initial State Arcs:
  - Create arcs from `s` to each possible initial state $x_0$.
  - Assign length: $-\ln \pi_{x_0}$, where $\pi_{x_0}$ is the initial probability of state $x_0$. The negative logarithm converts maximizing the initial probability into minimizing a positive cost.
- State Transition Arcs:
  - For each time step $k = 1, \ldots, N$ and each valid state transition $x_{k-1} \to x_k$:
    - Create an arc from state $x_{k-1}$ at stage $k-1$ to state $x_k$ at stage $k$.
    - Assign length: $-\ln\big(p_{x_{k-1} x_k}\, r(z_k; x_{k-1}, x_k)\big)$, where $p_{x_{k-1} x_k}$ is the transition probability from $x_{k-1}$ to $x_k$, and $r(z_k; x_{k-1}, x_k)$ is the probability (likelihood) of observing $z_k$ given the transition. The negative logarithm converts maximizing the product of transition and emission probabilities into minimizing a positive cost.
    - Note: If the transition probability $p_{x_{k-1} x_k} = 0$, do not create the corresponding arc.
- Termination State Arcs:
  - Create arcs from each possible terminal state $x_N$ to the sink node `t`.
  - Assign length: $0$. This makes the cost of ending the path at any valid terminal state zero.
Solution: In this constructed graph, the shortest path from source node s to sink node t corresponds to the most probable state sequence. All arc lengths are non-negative (termination arcs are zero, others are positive).
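As a sketch of this construction (with a made-up two-state model and observation sequence; the numbers are illustrative assumptions), the code below builds the layered graph with $-\ln$ arc lengths and runs Dijkstra from `s` to `t`:

```python
import heapq
import math

# Toy model (assumed numbers, not course data)
pi = {0: 0.5, 1: 0.5}
p = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}
r = {(i, j): ({"a": 0.9, "b": 0.1} if j == 0 else {"a": 0.2, "b": 0.8})
     for i in (0, 1) for j in (0, 1)}
obs = ["a", "b", "b"]
N = len(obs)

# Build the trellis: s -> (0, i) -> (1, i) -> ... -> (N, i) -> t
graph = {"s": [((0, i), -math.log(pi[i])) for i in pi], "t": []}
for k in range(N):
    for i in (0, 1):
        graph[(k, i)] = [((k + 1, j), -math.log(p[i][j] * r[(i, j)][obs[k]]))
                         for j in (0, 1) if p[i][j] > 0]   # skip zero-prob arcs
for i in (0, 1):
    graph[(N, i)] = [("t", 0.0)]        # zero-length termination arcs

def dijkstra(source, target):
    dist, prev, seen, tie = {source: 0.0}, {}, set(), 0
    heap = [(0.0, 0, source)]           # (distance, tiebreak counter, node)
    while heap:
        d, _, u = heapq.heappop(heap)
        if u in seen:
            continue
        seen.add(u)
        for v, w in graph[u]:
            if d + w < dist.get(v, math.inf):
                dist[v], prev[v] = d + w, u
                tie += 1
                heapq.heappush(heap, (d + w, tie, v))
    path, node = [], target
    while node != source:
        path.append(node)
        node = prev[node]
    return list(reversed(path)), dist[target]

path, cost = dijkstra("s", "t")
states = [node[1] for node in path if isinstance(node, tuple)]
print(states, cost)
```

The recovered `states` list is the most probable state sequence, and `exp(-cost)` is its joint probability with the observations.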
Proof of Graph Structure Equivalence
- Path Cost Calculation:
  - Total path cost $= -\ln \pi_{x_0} - \sum_{k=1}^{N} \ln\big(p_{x_{k-1} x_k}\, r(z_k; x_{k-1}, x_k)\big) = -\ln p(X_N, Z_N)$
- Equivalence Relation:
  - Minimizing the total path cost is equivalent to maximizing the joint probability $p(X_N, Z_N)$
- Non-negative Weights Guarantee:
  - Since $0 < p \le 1$ for every probability on an arc, we have $-\ln p \ge 0$
  - Satisfies the requirements for Dijkstra and other shortest path algorithms
- Dynamic Programming Advantage:
  - Viterbi algorithm complexity is $O(N m^2)$ for $m$ states and $N$ time steps
  - Avoids exhaustive search over the $m^{N+1}$ possible paths
Forward DP Algorithm (Viterbi Algorithm)
Core Formulas
1. Initial State Probability
$$D_0(i) = -\ln \pi_i$$
- Meaning: Negative log of the initial probability of state $i$ at time step $0$.
- $\pi_i$: Probability of initial state $i$.
- Negative Log Conversion: Converts probability maximization into cost minimization.
2. Recurrence Relation
$$D_k(j) = \min_{i} \Big[ D_{k-1}(i) - \ln\big(p_{ij}\, r(z_k; i, j)\big) \Big]$$
- Meaning: Computes the minimum cumulative cost $D_k(j)$ to reach state $j$ at time step $k$.
- $D_{k-1}(i)$: Minimum cumulative cost to the previous state $i$.
- $p_{ij}$: State transition probability (from $i$ to $j$).
- $r(z_k; i, j)$: Observation probability (observing $z_k$ on the transition $i \to j$).
- Minimization Operation: Evaluates all possible previous states $i$ to select the optimal path.
3. Backtracking Optimal Path
$$\hat{x}_k = \arg\min_{i} \Big[ D_k(i) - \ln\big(p_{i \hat{x}_{k+1}}\, r(z_{k+1}; i, \hat{x}_{k+1})\big) \Big]$$
- Meaning: Backtracks from the final state to determine the globally optimal state sequence $\{\hat{x}_0, \ldots, \hat{x}_N\}$.
- $\hat{x}_{k+1}$: The already determined next optimal state.
Additional Notes
- Viterbi Algorithm Essence: Dynamic programming solution for finding the most probable state sequence (maximum a posteriori path) in hidden Markov models (HMMs).
- Purpose of Negative Log Conversion: Converts the maximization of probability products into minimization of log sums, avoiding floating-point underflow.
- Complexity: $O(N m^2)$, where $N$ is the number of time steps and $m$ is the number of states.
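The three formulas above combine into a compact implementation. The two-state model below (`pi`, `p`, `r`) is a made-up illustration, not course data; the structure follows $D_0(i) = -\ln \pi_i$ and the recurrence directly.

```python
import math

pi = {0: 0.5, 1: 0.5}                            # initial probabilities (assumed)
p = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}   # transitions p_ij (assumed)
r = {(i, j): ({"a": 0.9, "b": 0.1} if j == 0 else {"a": 0.2, "b": 0.8})
     for i in (0, 1) for j in (0, 1)}            # emissions r(z; i, j) (assumed)

def viterbi(obs):
    states = list(pi)
    D = {i: -math.log(pi[i]) for i in states}    # D_0(i) = -ln(pi_i)
    back = []                                    # back[k][j]: best predecessor of j
    for z in obs:
        new_D, ptr = {}, {}
        for j in states:
            # D_k(j) = min_i [ D_{k-1}(i) - ln(p_ij * r(z_k; i, j)) ]
            costs = {i: D[i] - math.log(p[i][j] * r[(i, j)][z]) for i in states}
            ptr[j] = min(costs, key=costs.get)
            new_D[j] = costs[ptr[j]]
        D = new_D
        back.append(ptr)
    # backtrack from the cheapest final state
    x_final = min(D, key=D.get)
    path, x = [x_final], x_final
    for ptr in reversed(back):
        x = ptr[x]
        path.append(x)
    return list(reversed(path)), D[x_final]

print(viterbi(["a", "b", "b"]))
```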
Hidden Markov Model Example: Convolutional Encoding and Decoding
Background Story
Dr. Yang stored his grandmother’s recipe as a binary sequence and used convolutional encoding to protect the data from theft (e.g., by 6007🧑‍🏫). The encoded data is transmitted through a noisy channel to Dr. Zeng, where the received sequence $z_1, \ldots, z_N$ may differ from the transmitted sequence $y_1, \ldots, y_N$. The goal is to decode $z_1, \ldots, z_N$ to recover the original sequence $w_1, \ldots, w_N$.
Convolutional Encoding Process
- Input: Source binary sequence $w_1, w_2, \ldots, w_N$, where $w_k \in \{0, 1\}$.
- Output: Encoded sequence $y_1, y_2, \ldots, y_N$; each $y_k$ is an $n$-dimensional binary vector (codeword).
- State Equations (using modulo 2 arithmetic):
$$x_k = A x_{k-1} + b\, w_k \pmod 2, \qquad y_k = C x_{k-1} + d\, w_k \pmod 2$$
Where:
- $x_k$ is the hidden state vector (dimension $m$), with the initial state $x_0$ given (e.g., $x_0 = 0$).
- $A$ is the state transition matrix ($m \times m$), $C$ is the output matrix ($n \times m$), and $b$, $d$ are weight vectors (dimensions $m$ and $n$ respectively).
- Parameter Settings:
Explanation: $y_k$ is a 3-dimensional vector ($n = 3$) and $x_k$ is a 2-dimensional vector ($m = 2$). Modulo 2 arithmetic ensures all values are binary (0 or 1). The dimensions of $b$ and $d$ must match the state and output dimensions; here $d$ is 3×1 and $b$ is 2×1.
Example Calculation
- Input Sequence: .
- State and Output Sequence Calculation:
  - $k=1$: , .
  - $k=2$: , .
  - $k=3$: , .
  - $k=4$: (mod 2), (mod 2).
- Results:
  - State Sequence: (or ).
  - Output Sequence: (or ).
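The state equations can be sketched directly in code. Since the notes’ actual parameter values were not preserved, the matrix `A`, `C` and vector `b`, `d` entries below are placeholder values with the stated dimensions (2-dimensional state, 3-bit codewords); only the mod-2 recursion itself follows the equations above.

```python
# Hedged sketch of x_k = A x_{k-1} + b w_k, y_k = C x_{k-1} + d w_k (all mod 2).
A = [[1, 1], [0, 1]]           # 2x2 state transition matrix (assumed values)
b = [0, 1]                     # length-2 weight vector (assumed values)
C = [[0, 1], [1, 0], [1, 1]]   # 3x2 output matrix (assumed values)
d = [1, 1, 1]                  # length-3 weight vector (assumed values)

def mod2_affine(M, x, v, w):
    """Compute (M @ x + v * w) mod 2 without external libraries."""
    return [(sum(m * s for m, s in zip(row, x)) + vi * w) % 2
            for row, vi in zip(M, v)]

def encode(bits, x0=(0, 0)):
    x, outputs = list(x0), []
    for w in bits:
        outputs.append(mod2_affine(C, x, d, w))   # codeword y_k from x_{k-1}
        x = mod2_affine(A, x, b, w)               # next hidden state x_k
    return outputs

print(encode([1, 0, 1]))
```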
Noisy Channel and Probability Model
- Transmission Process: The encoded sequence $y_1, \ldots, y_N$ is transmitted through a noisy channel; the received sequence is $z_1, \ldots, z_N$ (each $z_k$ is also a binary vector).
- Error Model: Given the conditional probability $p(z_k \mid y_k)$, representing the probability of receiving $z_k$ given transmission of $y_k$.
- Independence Assumption: Errors are independent; the joint probability is:
$$p(z_1, \ldots, z_N \mid y_1, \ldots, y_N) = \prod_{k=1}^{N} p(z_k \mid y_k)$$
Explanation: This model assumes errors occur independently for each codeword.
Symbol Explanation, Principle Derivation, and Practical Significance of Error Model
- Symbol Explanation:
  - $z_k$: Received codeword (same dimension as $y_k$), possibly erroneous due to noise (e.g., a bit of $y_k$ flipped in transit).
  - $p(z_k \mid y_k)$: Conditional probability, representing the probability of receiving $z_k$ given transmission of $y_k$ (e.g., for a channel with error rate $\varepsilon$, it might be defined as $1 - \varepsilon$ when $z_k = y_k$ and $\varepsilon$ when $z_k \neq y_k$). Practical significance: quantifies the likelihood of transmission errors (e.g., bit-flip probability).
- Principle Derivation of Joint Probability Independence:
  - Derivation: Assumes channel errors occur independently at each time step (i.e., noise does not affect adjacent transmissions). Thus the joint probability is the product of the marginal probabilities: $p(z_1, \ldots, z_N \mid y_1, \ldots, y_N) = \prod_{k=1}^{N} p(z_k \mid y_k)$. This follows from the definition of independence in probability theory (if events are independent, the joint probability is the product).
  - Practical significance: Simplifies model computation (e.g., decoding only needs to handle errors per codeword), applicable to many real channels (e.g., an AWGN channel under low-correlation noise).
  - Example: With $N = 2$, if $z_1 = y_1$ (correct reception) with probability $1 - \varepsilon$ and $z_2 \neq y_2$ (error) with probability $\varepsilon$, then the joint probability is $(1 - \varepsilon)\varepsilon$. Analogy: like multiple independent coin tosses, where each outcome does not affect the others.
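A minimal sketch of the independent-error model, assuming a per-codeword error rate `eps` (the rate and the sequences are illustrative, not course data):

```python
import math

eps = 0.1   # assumed per-codeword error probability

def p_received(z, y):
    """p(z_k | y_k) under an assumed per-codeword error model."""
    return 1 - eps if z == y else eps

def joint_likelihood(zs, ys):
    # independence: the joint probability is the product over time steps
    return math.prod(p_received(z, y) for z, y in zip(zs, ys))

ys = [[1, 1, 1], [0, 1, 1], [1, 0, 0]]    # transmitted codewords (example)
zs = [[1, 1, 1], [0, 1, 1], [1, 1, 0]]    # received; last codeword corrupted
print(joint_likelihood(zs, ys))           # 0.9 * 0.9 * 0.1
```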
Decoding Problem Formulation
- Objective: Minimize the negative log-likelihood to recover the most probable sequence:
$$\min \sum_{k=1}^{N} -\ln p(z_k \mid y_k)$$
where the optimization variables are all possible input sequences $w_1, \ldots, w_N$ (equivalently, all feasible codeword sequences $y_1, \ldots, y_N$).
- Dependence on Hidden State: Since $y_k$ is determined by the hidden state $x_{k-1}$ and input $w_k$, decoding requires finding the state sequence $x_0, x_1, \ldots, x_N$.
Explanation: This is equivalent to the decoding problem in hidden Markov models (HMMs), with states $x_k$ and observations $z_k$. Dynamic programming methods (e.g., the Viterbi algorithm) can efficiently solve it by optimizing over state transitions.
Principle Derivation of Decoding Objective and Equivalence to HMM
- Principle Derivation of Minimizing Negative Log-Likelihood:
  - Derivation: The goal is to maximize the likelihood of the observation sequence. Since the likelihood is a product ($\prod_k p(z_k \mid y_k)$), maximizing it is equivalent to maximizing the log-likelihood (by monotonicity of $\ln$), $\max \sum_k \ln p(z_k \mid y_k)$, which is in turn equivalent to minimizing the negative log-likelihood $\min \sum_k -\ln p(z_k \mid y_k)$. Practical significance: converts probability maximization into a numerically stable optimization problem (negative log-likelihood is commonly used as a loss function).
  - Symbol Explanation: $-\ln p(z_k \mid y_k)$ can be viewed as an “error cost”: lower values indicate a better match between $z_k$ and $y_k$ (if $p(z_k \mid y_k) = 1$ the cost is $0$; if it is small the cost is large). Minimizing the sum finds the best-matching sequence.
- Derivation of Equivalence to the Hidden Markov Model (HMM):
  - Derivation: In HMMs, the decoding problem is to find the most probable state sequence given the observations. Here:
    - Hidden states: $x_k$ (defining the state transitions $x_{k-1} \to x_k$).
    - Observations: $z_k$ (each depends on $y_k$, which in turn depends on $x_{k-1}$ and $w_k$).
  - Since $y_k$ is a function of $x_{k-1}$ and $w_k$, and $w_k$ is implicit in the state transition, the observation probability $p(z_k \mid y_k)$ maps to the emission probability $r(z_k; x_{k-1}, x_k)$ in the HMM. The optimization objective then corresponds to the path cost in the HMM’s Viterbi algorithm. Practical significance: enables efficient algorithms (e.g., Viterbi) for long sequences.
  - Analogy: like finding the shortest path in a maze (the state sequence), where each step’s “cost” is determined by the observation match.
Convolutional Code State Transition Diagram (Mermaid)
stateDiagram-v2
direction LR
[*] --> 00
00 --> 01 : w=1/y=111
00 --> 00 : w=0/y=000
01 --> 11 : w=0/y=011
01 --> 10 : w=1/y=100
11 --> 01 : w=0/y=111
11 --> 00 : w=1/y=100
10 --> 10 : w=0/y=011
10 --> 11 : w=1/y=111
Caption:
- Nodes represent states (e.g., “00” represents $x = (0, 0)$)
- Arrow label format: input/output (e.g., “w=1/y=111”)
- State transition relationships are based on the example parameters
- The dashed box indicates the possible initial state
Viterbi Algorithm Decoding Process (Mermaid)
stateDiagram-v2
direction TB
state "k=0\nState:00\nCost:0" as s00
state "k=1\nState:00\nCost:δ(z₁,000)" as s00_1
state "k=1\nState:01\nCost:δ(z₁,111)" as s01_1
state "k=2\nState:00\nCost:δ(z₁,000)+δ(z₂,000)" as s00_2
state "k=2\nState:01\nCost:δ(z₁,000)+δ(z₂,111)" as s01_2
state "k=2\nState:11\nCost:δ(z₁,111)+δ(z₂,011)" as s11_2
state "k=2\nState:10\nCost:δ(z₁,111)+δ(z₂,100)" as s10_2
state "k=3\nState:00\nCost:..." as s00_3
state "k=3\nState:01\nCost:..." as s01_3
state "k=3\nState:11\nCost:..." as s11_3
state "k=3\nState:10\nCost:..." as s10_3
[*] --> s00
s00 --> s00_1 : w=0/y=000
s00 --> s01_1 : w=1/y=111
s00_1 --> s00_2 : w=0/y=000
s00_1 --> s01_2 : w=1/y=111
s01_1 --> s11_2 : w=0/y=011
s01_1 --> s10_2 : w=1/y=100
s00_2 --> s00_3 : w=0/y=000
s00_2 --> s01_3 : w=1/y=111
s01_2 --> s11_3 : w=0/y=011
s01_2 --> s10_3 : w=1/y=100
s11_2 --> s01_3 : w=0/y=111
s11_2 --> s00_3 : w=1/y=100
s10_2 --> s10_3 : w=0/y=011
s10_2 --> s11_3 : w=1/y=111
note right of s00_1
Cost calculation:
δ(z_k,y_k) = -log p(z_k|y_k)
represents the match degree when receiving z_k and sending y_k
end note
note left of s11_2
Optimal path selection:
retain the minimum cost path to each state
prune high-cost branches
end note
Caption:
- Each node represents the state and cumulative cost at time step k
- δ(z_k, y_k) = -log p(z_k|y_k) represents the step cost
- The thick-bordered box indicates the current optimal path (to be determined from the actual z_k)
- Algorithm principle: retain the minimum-cost path to each state and expand step by step
Shortest Path Algorithms
Label Correcting Methods
Motivation
- Dynamic Programming (DP) algorithms compute the cost-to-go for each state at each time step, which is computationally expensive when every node and arc must be processed.
- Core Idea: Optimize computation by excluding irrelevant nodes → shortest path algorithms.
Key Point: When the state space is large, exhaustive DP becomes infeasible; more efficient path search methods are needed.
Setup
- Origin: Node $s$
- Destination: Node $t$
- Child Node: Node $j$ is a child of node $i$ if the arc $(i, j)$ exists
- Assumption: All arc lengths satisfy $a_{ij} \ge 0$ (non-negative weights)
Core Concept
- Maintain a label $d_i$ for each node $i$, recording the length of the current shortest known path from the origin to $i$.
- During execution, $d_i$ only decreases.
- When $d_i$ is updated, check whether the labels of its child nodes need correction.
- The destination label (denoted `UPPER`) equals the true shortest path length upon convergence.
Intuition: `UPPER` is the current upper bound on the shortest $s \to t$ path length; the algorithm tightens this bound by iteratively correcting labels.
Algorithm Steps
1. Initialization: Set $d_s = 0$, $d_j = \infty$ for all $j \neq s$, and `UPPER` $= \infty$; the `OPEN` set initially contains only $s$.
2. Iterative Update:
   - Remove a node $i$ from `OPEN`
   - For each child node $j$ of $i$:
     - If $d_i + a_{ij} < \min(d_j, \text{UPPER})$:
       - Set $d_j = d_i + a_{ij}$ and set $i$ as the parent of $j$
       - If $j \neq t$ and $j$ is not already in `OPEN`, add $j$ to `OPEN`
       - If $j = t$, update `UPPER` $= d_t$
3. Termination Condition: Stop when `OPEN` is empty; otherwise repeat step 2.
Key Notes:
- `OPEN` Set: Stores nodes still to be examined, ensuring only affected nodes are updated.
- Non-negative Weights: Guarantee algorithm correctness (no negative-weight cycles).
- Efficiency Improvement: `UPPER` enables pruning to avoid unnecessary computations.
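The steps above can be sketched with a FIFO `OPEN` list (one of several queue disciplines). The small graph is an illustrative example, not course data:

```python
from collections import deque
from math import inf

graph = {                       # graph[i] = list of (child, arc length)
    "s": [("a", 1.0), ("b", 4.0)],
    "a": [("b", 2.0), ("t", 6.0)],
    "b": [("t", 1.0)],
    "t": [],
}

def label_correcting(origin, dest):
    d = {node: inf for node in graph}   # labels d_i
    d[origin] = 0.0
    parent = {}
    upper = inf                         # UPPER: best known origin -> dest length
    open_set = deque([origin])          # OPEN, FIFO discipline
    while open_set:
        i = open_set.popleft()          # remove a node from OPEN
        for j, a_ij in graph[i]:
            if d[i] + a_ij < min(d[j], upper):   # label correction test
                d[j] = d[i] + a_ij
                parent[j] = i
                if j == dest:
                    upper = d[j]        # tighten the upper bound
                elif j not in open_set:
                    open_set.append(j)
    # reconstruct the path by following parents back from the destination
    path, node = [dest], dest
    while node != origin:
        node = parent[node]
        path.append(node)
    return list(reversed(path)), upper

print(label_correcting("s", "t"))
```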
Label Correcting Methods: Correctness Proof
Proposition
If a path exists from the origin $s$ to the destination $t$, then upon termination `UPPER` equals the shortest path length; otherwise `UPPER` $= \infty$ upon termination.
Proof
1. Algorithm Must Terminate
- Key Observation:
  - Each time a node $j$ is added to `OPEN`, its label $d_j$ strictly decreases and always equals the length of some path from the origin to $j$.
  - Let $d_j^0$ be the (finite) value of $d_j$ when $j$ is first added to `OPEN`.
- Non-negative Weights Guarantee: Every label value is the length of some path, and since cycles only add non-negative length, only finitely many distinct (cycle-free) path lengths lie below $d_j^0$; hence each label can decrease only finitely many times.
- Conclusion: The `OPEN` set eventually becomes empty; the algorithm terminates.
Core: Non-negative arc lengths ensure path lengths are bounded below; label values cannot decrease indefinitely.
2. No Path Case: `UPPER` $= \infty$
- Proof by Contradiction:
  - Assume some arc $(i, t)$ exists (i.e., $t$ has a predecessor $i$).
  - If $i$ ever enters `OPEN`, then a path $s \to i$ exists, so a path $s \to t$ exists (contradiction).
- Corollary:
  - No predecessor of $t$ can ever enter `OPEN`, so $d_t$ is never updated.
  - From the initialization `UPPER` $= \infty$, it therefore remains $\infty$.
3. Path Exists Case: `UPPER` Equals the Shortest Path Length
Let $(s, j_1, j_2, \ldots, j_k, t)$ be a shortest path with length $d^*$.
- Subpath Optimality: Each subpath $(s, j_1, \ldots, j_m)$ is itself a shortest path from $s$ to $j_m$.
- Proof by Contradiction Assumption:
If upon termination `UPPER` $> d^*$, then throughout execution `UPPER` $> d^*$.
- Recursive Contradiction:
  - Node $j_k$ cannot enter `OPEN` with its optimal label (otherwise it would update $d_t$, and hence `UPPER`, to $d^*$)
  - Similarly, $j_{k-1}$ cannot enter `OPEN` with its optimal label
  - Recurse back to $j_1$:
  - But during initialization, $s$ enters `OPEN` with $d_s = 0$ (its optimal value)
  - Contradiction!
Conclusion: `UPPER` must converge to $d^*$.
Implementation Strategies for Label Correcting Methods
Common Strategy Comparison
1. Breadth-First Search (BFS) / Bellman-Ford Method
- Queue Strategy: First-In-First-Out (FIFO)
- Operation Rules:
  - Remove nodes from the top of `OPEN`
  - Add new nodes to the bottom of `OPEN`
- Characteristics:
  - Handles graphs with negative-weight edges (though the current setting requires non-negative weights)
  - Time complexity: $O(V \cdot E)$
2. Depth-First Search (DFS)
- Queue Strategy: Last-In-First-Out (LIFO)
- Operation Rules:
  - Remove nodes from the top of `OPEN`
  - Add new nodes to the top of `OPEN`
- Characteristics:
  - Lower memory usage (does not store all nodes of a level)
  - May explore long paths first; efficiency is unstable
3. Best-First Search (Dijkstra’s Algorithm)
- Selection Strategy: Choose the node in `OPEN` with the smallest label $d_i$
- Characteristics:
  - Key Property: Each node enters `OPEN` at most once
  - Overhead: Requires maintaining a min-heap; each operation costs $O(\log V)$
  - Time Complexity: $O(E + V \log V)$ (optimal for non-negative weights)
Advantage: Most efficient on non-negative weight graphs
4. Hybrid Optimization Strategies
(1) Small Label First (SLF)
- Operation Rules:
  - Remove nodes from the top of `OPEN`
  - Add a new node $j$ based on its label:
    - If $d_j \le d_{\text{top}}$ (the label of the top node), add $j$ to the top
    - Otherwise add $j$ to the bottom
- Goal: Approximate Dijkstra’s behavior without the sorting overhead
(2) Large Label Last (LLL)
- Enhancement Strategy (often combined with SLF):
  - Compute the average label of the nodes in `OPEN`: $\bar{d} = \frac{1}{|\text{OPEN}|} \sum_{i \in \text{OPEN}} d_i$
  - If the top node’s label $d_{\text{top}} > \bar{d}$, move it to the bottom
  - Repeat until the top node’s label satisfies $d_{\text{top}} \le \bar{d}$
- Goal: Delay processing high-label nodes to accelerate convergence
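The SLF insertion rule can be grafted onto the label correcting loop as follows; the toy graph and its weights are illustrative assumptions:

```python
from collections import deque
from math import inf

graph = {                       # graph[i] = list of (child, arc length)
    "s": [("a", 2.0), ("b", 1.0)],
    "a": [("t", 1.0)],
    "b": [("a", 0.5), ("t", 3.0)],
    "t": [],
}

def slf_shortest(origin, dest):
    d = {n: inf for n in graph}
    d[origin] = 0.0
    upper = inf
    open_q = deque([origin])
    while open_q:
        i = open_q.popleft()
        for j, a_ij in graph[i]:
            if d[i] + a_ij < min(d[j], upper):
                d[j] = d[i] + a_ij
                if j == dest:
                    upper = d[j]
                elif j not in open_q:
                    # Small Label First: low-label nodes go to the top of OPEN
                    if open_q and d[j] <= d[open_q[0]]:
                        open_q.appendleft(j)
                    else:
                        open_q.append(j)
    return upper

print(slf_shortest("s", "t"))
```

The only change from the plain FIFO variant is the insertion rule, which tends to expand low-label nodes first, approximating Dijkstra’s order without a heap.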
Strategy Selection Summary
| Strategy | Time Complexity | Advantages | Applicable Scenarios |
|---|---|---|---|
| BFS (Bellman-Ford) | $O(V \cdot E)$ | Simple; handles negative weights | General graphs |
| DFS | Unstable | Low memory consumption | Depth-first path exploration |
| Dijkstra | $O(E + V \log V)$ | Each node expanded at most once | Non-negative weight graphs |
| SLF+LLL | Near Dijkstra | Low-overhead near-optimal solution | Efficient heuristic for large-scale graphs |
