# sdsc6007


Tutorial 1

Problem Setting

  • Dynamic System: $x_{k+1} = x_k + u_k + w_k$, $k = 0, 1, 2, 3$

  • Initial State: $x_0 = 5$

  • Cost Function: $\sum_{k=0}^{3}(x_k^2 + u_k^2)$

  • State Space: $S_k = \{0, 1, 2, 3, 4, 5\}$

  • Control Constraints: $U_k(x_k) = \{u \mid 0 \leq x_k + u \leq 5,\ u \in \mathbb{Z}\}$

  • Stochastic Disturbance:

    • If $0 < x_k + u_k < 5$: $w_k = \begin{cases} -1 & \text{with probability } \frac{1}{2} \\ +1 & \text{with probability } \frac{1}{2} \end{cases}$
    • If $x_k + u_k = 0$ or $x_k + u_k = 5$: $w_k = 0$ (with probability 1)

Dynamic Programming Recursion Formula

Define the value function $V_k(x_k)$ as the minimum expected cost starting from state $x_k$ at time $k$.

Bellman Equation:

$$V_k(x_k) = \min_{u_k \in U_k(x_k)} \left\{ x_k^2 + u_k^2 + E[V_{k+1}(x_k + u_k + w_k)] \right\}$$

Boundary Condition: $V_4(x_4) = 0$ (end of horizon, no terminal cost)
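The backward recursion can also be run in a few lines of code as a check on the tables computed below. This is a sketch (the function name `solve_dp` and its layout are ours, not part of the tutorial); exact rational arithmetic via `Fraction` keeps the half-probability expectations precise:

```python
from fractions import Fraction

def solve_dp(N=4, states=range(6)):
    """Backward DP for x_{k+1} = x_k + u_k + w_k with stage cost x_k^2 + u_k^2."""
    half = Fraction(1, 2)
    V = {N: {x: Fraction(0) for x in states}}      # boundary condition V_4 = 0
    policy = {}
    for k in range(N - 1, -1, -1):
        V[k], policy[k] = {}, {}
        for x in states:
            best = None
            for u in range(-x, 6 - x):             # U_k(x) = {u : 0 <= x+u <= 5}
                y = x + u                          # state after control, before noise
                if 0 < y < 5:                      # w = -1 or +1, each w.p. 1/2
                    ev = half * V[k + 1][y - 1] + half * V[k + 1][y + 1]
                else:                              # y in {0, 5}: w = 0
                    ev = V[k + 1][y]
                cost = x * x + u * u + ev
                if best is None or cost < best[0]:
                    best = (cost, u)
            V[k][x], policy[k][x] = best
    return V, policy

V, policy = solve_dp()
```

Running it reproduces the hand-computed tables, e.g. `float(V[0][5])` gives 43.25 with `policy[0][5] == -3`.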

Step 1: Time k=3 (Final Step)

$$V_3(x_3) = x_3^2 + \min_{u_3 \in U_3(x_3)} u_3^2$$

Compute for each state:

| $x_3$ | $U_3(x_3)$ | Optimal $u_3^*$ | $V_3(x_3)$ |
|---|---|---|---|
| 0 | $\{0,1,2,3,4,5\}$ | 0 | $0^2 + 0^2 = 0$ |
| 1 | $\{-1,0,1,2,3,4\}$ | 0 | $1^2 + 0^2 = 1$ |
| 2 | $\{-2,-1,0,1,2,3\}$ | 0 | $2^2 + 0^2 = 4$ |
| 3 | $\{-3,-2,-1,0,1,2\}$ | 0 | $3^2 + 0^2 = 9$ |
| 4 | $\{-4,-3,-2,-1,0,1\}$ | 0 | $4^2 + 0^2 = 16$ |
| 5 | $\{-5,-4,-3,-2,-1,0\}$ | 0 | $5^2 + 0^2 = 25$ |

Result: $u_3^*(x_3) = 0$ for all $x_3$

Step 2: Time k=2

$$V_2(x_2) = \min_{u_2 \in U_2(x_2)} \left\{ x_2^2 + u_2^2 + E[V_3(x_2 + u_2 + w_2)] \right\}$$

Expected value calculation:

  • If $0 < x_2 + u_2 < 5$: $E[V_3] = \frac{1}{2}V_3(x_2 + u_2 - 1) + \frac{1}{2}V_3(x_2 + u_2 + 1)$

  • If $x_2 + u_2 = 0$ or $x_2 + u_2 = 5$: $E[V_3] = V_3(x_2 + u_2)$

$x_2 = 0$:

| $u_2$ | $x_2+u_2$ | $E[V_3]$ | Total Cost $0^2 + u_2^2 + E[V_3]$ |
|---|---|---|---|
| 0 | 0 | $V_3(0) = 0$ | $0 + 0 + 0 = 0$ |
| 1 | 1 | $\frac{1}{2}(0) + \frac{1}{2}(4) = 2$ | $0 + 1 + 2 = 3$ |
| 2 | 2 | $\frac{1}{2}(1) + \frac{1}{2}(9) = 5$ | $0 + 4 + 5 = 9$ |
| 3 | 3 | $\frac{1}{2}(4) + \frac{1}{2}(16) = 10$ | $0 + 9 + 10 = 19$ |
| 4 | 4 | $\frac{1}{2}(9) + \frac{1}{2}(25) = 17$ | $0 + 16 + 17 = 33$ |
| 5 | 5 | $V_3(5) = 25$ | $0 + 25 + 25 = 50$ |

$V_2(0) = 0$, $u_2^*(0) = 0$

$x_2 = 1$:

| $u_2$ | $x_2+u_2$ | $E[V_3]$ | Total Cost $1^2 + u_2^2 + E[V_3]$ |
|---|---|---|---|
| -1 | 0 | $V_3(0) = 0$ | $1 + 1 + 0 = 2$ |
| 0 | 1 | $\frac{1}{2}(0) + \frac{1}{2}(4) = 2$ | $1 + 0 + 2 = 3$ |
| 1 | 2 | $\frac{1}{2}(1) + \frac{1}{2}(9) = 5$ | $1 + 1 + 5 = 7$ |
| 2 | 3 | $\frac{1}{2}(4) + \frac{1}{2}(16) = 10$ | $1 + 4 + 10 = 15$ |
| 3 | 4 | $\frac{1}{2}(9) + \frac{1}{2}(25) = 17$ | $1 + 9 + 17 = 27$ |
| 4 | 5 | $V_3(5) = 25$ | $1 + 16 + 25 = 42$ |

$V_2(1) = 2$, $u_2^*(1) = -1$

Continue computing the other states:

$x_2 = 2$: $V_2(2) = 7$, $u_2^*(2) = -1$

$x_2 = 3$: $V_2(3) = 15$, $u_2^*(3) = -2$ (or $-1$, both yield the same cost)

$x_2 = 4$: $V_2(4) = 25$, $u_2^*(4) = -2$

$x_2 = 5$: $V_2(5) = 39$, $u_2^*(5) = -3$ (or $-2$, both yield the same cost)

Step 3: Time k=1

Use the computed $V_2$ to calculate $V_1(x_1)$.

$x_1 = 0$:

| $u_1$ | $x_1+u_1$ | $E[V_2]$ | Total Cost $0^2 + u_1^2 + E[V_2]$ |
|---|---|---|---|
| 0 | 0 | $V_2(0) = 0$ | $0 + 0 + 0 = 0$ |
| 1 | 1 | $\frac{1}{2}V_2(0) + \frac{1}{2}V_2(2) = \frac{1}{2}(0) + \frac{1}{2}(7) = 3.5$ | $0 + 1 + 3.5 = 4.5$ |

(Controls $u_1 \geq 2$ incur both larger control costs and larger expected future costs, so they are omitted.)

$V_1(0) = 0$, $u_1^*(0) = 0$

$x_1 = 1$:

| $u_1$ | $x_1+u_1$ | $E[V_2]$ | Total Cost |
|---|---|---|---|
| -1 | 0 | $V_2(0) = 0$ | $1 + 1 + 0 = 2$ |
| 0 | 1 | $\frac{1}{2}(0) + \frac{1}{2}(7) = 3.5$ | $1 + 0 + 3.5 = 4.5$ |

$V_1(1) = 2$, $u_1^*(1) = -1$

$x_1 = 2$:

| $u_1$ | $x_1+u_1$ | $E[V_2]$ | Total Cost |
|---|---|---|---|
| -2 | 0 | $V_2(0) = 0$ | $4 + 4 + 0 = 8$ |
| -1 | 1 | $\frac{1}{2}(0) + \frac{1}{2}(7) = 3.5$ | $4 + 1 + 3.5 = 8.5$ |

$V_1(2) = 8$, $u_1^*(2) = -2$

$x_1 = 3$:

| $u_1$ | $x_1+u_1$ | $E[V_2]$ | Total Cost |
|---|---|---|---|
| -3 | 0 | $V_2(0) = 0$ | $9 + 9 + 0 = 18$ |
| -2 | 1 | $\frac{1}{2}(0) + \frac{1}{2}(7) = 3.5$ | $9 + 4 + 3.5 = 16.5$ |
| -1 | 2 | $\frac{1}{2}(2) + \frac{1}{2}(15) = 8.5$ | $9 + 1 + 8.5 = 18.5$ |

$V_1(3) = 16.5$, $u_1^*(3) = -2$

$x_1 = 4$:

After computation: $V_1(4) = 28.5$, $u_1^*(4) = -3$ (or $-2$, both yield 28.5)

$x_1 = 5$:

After computation: $V_1(5) = 42.5$, $u_1^*(5) = -3$

Complete Value Function for Time 1:

| $x_1$ | $V_1(x_1)$ | $u_1^*(x_1)$ |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 2 | -1 |
| 2 | 8 | -2 |
| 3 | 16.5 | -2 |
| 4 | 28.5 | -3 |
| 5 | 42.5 | -3 |

Step 4: Time k=0

For initial state $x_0 = 5$, compute $V_0(5)$:

$$V_0(5) = \min_{u_0 \in U_0(5)} \left\{ 25 + u_0^2 + E[V_1(5 + u_0 + w_0)] \right\}$$

$U_0(5) = \{-5,-4,-3,-2,-1,0\}$

| $u_0$ | $x_0+u_0$ | $E[V_1]$ | Total Cost |
|---|---|---|---|
| -5 | 0 | $V_1(0) = 0$ | $25 + 25 + 0 = 50$ |
| -4 | 1 | $\frac{1}{2}V_1(0) + \frac{1}{2}V_1(2) = \frac{1}{2}(0) + \frac{1}{2}(8) = 4$ | $25 + 16 + 4 = 45$ |
| -3 | 2 | $\frac{1}{2}V_1(1) + \frac{1}{2}V_1(3) = \frac{1}{2}(2) + \frac{1}{2}(16.5) = 9.25$ | $25 + 9 + 9.25 = 43.25$ |
| -2 | 3 | $\frac{1}{2}V_1(2) + \frac{1}{2}V_1(4) = \frac{1}{2}(8) + \frac{1}{2}(28.5) = 18.25$ | $25 + 4 + 18.25 = 47.25$ |
| -1 | 4 | $\frac{1}{2}V_1(3) + \frac{1}{2}V_1(5) = \frac{1}{2}(16.5) + \frac{1}{2}(42.5) = 29.5$ | $25 + 1 + 29.5 = 55.5$ |
| 0 | 5 | $V_1(5) = 42.5$ | $25 + 0 + 42.5 = 67.5$ |

Result: $V_0(5) = 43.25$, $u_0^*(5) = -3$

Optimal Policy Summary

Optimal policy starting from initial state $x_0 = 5$:

| Time | State | Optimal Control | Explanation |
|---|---|---|---|
| $k=0$ | $x_0 = 5$ | $u_0^* = -3$ | Guides the state toward 2 |
| $k=1$ | Depends on $w_0$ | Look up $u_1^*$ | From the value function table |
| $k=2$ | Depends on $w_0, w_1$ | Look up $u_2^*$ | From the value function table |
| $k=3$ | Path-dependent | $u_3^* = 0$ | Always choose 0 regardless of state |

Minimum Expected Total Cost: $43.25$

Policy Characteristics:

  • Tends to guide the state toward smaller values (since the state cost $x_k^2$ is minimized at 0)

  • Requires balancing control cost and future expected cost

  • Stochastic disturbances necessitate consideration of all possible future states


Hidden Markov Model (HMM)

Model Description

  • A stationary finite-state Markov chain with transition probabilities $p_{ij}$ between states.

  • States are hidden (not directly observable). The initial state $x_0$ equals $i$ with probability $\pi_i$.

  • After each state transition, a value $z$ is observed. Given a transition from state $i$ to state $j$, the probability of observing $z$ is $r(z; i, j)$.

  • Observations are conditionally independent given the transitions that generate them.

  • The transition probabilities $p_{ij}$ and observation probabilities $r(z; i, j)$ are given and do not depend on $k$.

Symbol Explanation and Model Understanding
  • $p_{ij}$: Probability of transitioning from state $i$ to state $j$ (e.g., the probability of the weather changing from “sunny” to “rainy”)

  • $\pi_i$: Probability that the system starts in state $i$ (e.g., the probability that the first day is “sunny”)

  • $r(z; i, j)$: Probability of observing $z$ during the transition $i \to j$ (e.g., the probability of observing “high humidity” during a “sunny to rainy” transition)

  • Independence Assumption: Observation $z_k$ depends only on the current state transition $(x_{k-1}, x_k)$, independent of past states and observations.

Objective

Given an observation sequence $Z_N = \{z_1, \dots, z_N\}$, estimate the most probable state sequence $\hat{X}_N = \{\hat{x}_0, \dots, \hat{x}_N\}$.

Method

To estimate $\hat{X}_N$, maximize the conditional probability $P(X_N \mid Z_N)$. Since

$$P(X_N \mid Z_N) = \frac{P(X_N, Z_N)}{P(Z_N)},$$

maximizing $P(X_N \mid Z_N)$ is equivalent to maximizing the joint probability $P(X_N, Z_N)$, because $P(Z_N)$ is constant for a given observation sequence.

Joint Probability Derivation

The joint probability $P(X_N, Z_N)$ for the state sequence $X_N = \{x_0, x_1, \dots, x_N\}$ and observation sequence $Z_N = \{z_1, \dots, z_N\}$ is derived as follows:

$$P(X_N, Z_N) = P(x_0, x_1, \dots, x_N, z_1, \dots, z_N)$$

Using the chain rule and the independence assumptions:

$$\begin{aligned} &= \pi_{x_0} P(x_1, \dots, x_N, z_1, \dots, z_N \mid x_0) \\ &= \pi_{x_0} p_{x_0 x_1} r(z_1; x_0, x_1) P(x_2, \dots, x_N, z_2, \dots, z_N \mid x_0, x_1, z_1) \end{aligned}$$

By the Markov property and observation independence, this recursively simplifies to:

$$P(X_N, Z_N) = \pi_{x_0} \prod_{k=1}^N p_{x_{k-1} x_k} r(z_k; x_{k-1}, x_k)$$
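The product form translates directly into code. In the sketch below, the two-state model (`pi`, `p`, `r`) is a made-up illustration, not a model from the tutorial:

```python
# Hypothetical two-state HMM (illustrative numbers only).
pi = [0.6, 0.4]                        # initial distribution pi_i
p = [[0.7, 0.3],                       # transition probabilities p_ij
     [0.4, 0.6]]
# r[z][i][j]: probability of observing z on the transition i -> j
r = {0: [[0.9, 0.2], [0.5, 0.3]],
     1: [[0.1, 0.8], [0.5, 0.7]]}

def joint_prob(xs, zs):
    """P(X_N, Z_N) = pi_{x0} * prod_k p_{x_{k-1} x_k} * r(z_k; x_{k-1}, x_k)."""
    prob = pi[xs[0]]
    for k in range(1, len(xs)):
        prob *= p[xs[k - 1]][xs[k]] * r[zs[k - 1]][xs[k - 1]][xs[k]]
    return prob

# Path x = (0, 1, 1) with observations z = (1, 0):
p_path = joint_prob([0, 1, 1], [1, 0])   # = pi_0 * (p_01 r(1;0,1)) * (p_11 r(0;1,1))
```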

Detailed Derivation Process
  1. Chain Rule Decomposition:

    $$P(X_N, Z_N) = P(x_0) \prod_{k=1}^N P(x_k, z_k \mid x_{0:k-1}, z_{1:k-1})$$

  2. Applying the Markov Property:
    The next state depends only on the current state:

    $$P(x_k \mid x_{0:k-1}) = P(x_k \mid x_{k-1})$$

  3. Observation Independence:
    Observation $z_k$ is determined only by the current transition:

    $$P(z_k \mid x_{0:k}, z_{1:k-1}) = r(z_k; x_{k-1}, x_k)$$

  4. Joint Probability:

    $$P(x_k, z_k \mid x_{0:k-1}, z_{1:k-1}) = p_{x_{k-1}x_k} \cdot r(z_k; x_{k-1}, x_k)$$

  5. Final Form:
    Recursive substitution yields $\pi_{x_0} \prod_{k=1}^N p_{x_{k-1}x_k} r(z_k; x_{k-1}, x_k)$

Explanation:

  • $\pi_{x_0}$: Probability of the starting state $x_0$.
  • $p_{x_{k-1} x_k}$: Probability of transitioning from state $x_{k-1}$ to $x_k$ at step $k$.
  • $r(z_k; x_{k-1}, x_k)$: Probability of observing $z_k$ given the transition from $x_{k-1}$ to $x_k$.
  • The product $\prod_{k=1}^N$ captures the sequence of state transitions and observations, leveraging independence to decompose the joint probability into a product.

Equivalent Optimization Problem

Maximizing $P(X_N, Z_N)$ is equivalent to minimizing the negative log-likelihood:

$$\min\left( -\log(\pi_{x_0}) - \sum_{k=1}^N \log\left( p_{x_{k-1} x_k} r(z_k; x_{k-1}, x_k) \right) \right)$$

Principle of Negative Logarithm Conversion
  1. Monotonicity Principle:

    • The $\log$ function is monotonically increasing, so maximizing $f(x)$ is equivalent to maximizing $\log f(x)$
    • Minimizing $-\log f(x)$ is equivalent to maximizing $f(x)$
  2. Product to Sum Conversion:

    $$\arg\max \prod p_i = \arg\min \left( -\sum \log p_i \right)$$

  3. Numerical Stability:

    • Probability products can cause floating-point underflow
    • Logarithmic summation avoids multiplication of small values
  4. Optimization Advantage:

    • Sum form is suitable for dynamic programming
    • Can be viewed as cumulative “path cost”

Note: This conversion transforms the probability product into a sum of logarithmic terms, facilitating optimization (e.g., using dynamic programming algorithms like the Viterbi algorithm). Minimizing this sum finds the state sequence that best explains the observations.
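The equivalence is easy to verify numerically: maximizing a product of probabilities and minimizing the corresponding sum of negative logs select the same path. The toy `step_probs` below are illustrative numbers only:

```python
import math
from itertools import product as cartesian

# Toy model (illustrative): three steps, two choices per step,
# with independent per-step probabilities.
step_probs = [
    {0: 0.9, 1: 0.1},
    {0: 0.2, 1: 0.8},
    {0: 0.6, 1: 0.4},
]

paths = list(cartesian([0, 1], repeat=3))

# Maximize the product of probabilities ...
best_by_product = max(paths, key=lambda path: math.prod(
    step_probs[k][c] for k, c in enumerate(path)))

# ... or equivalently minimize the sum of negative logs.
best_by_neglog = min(paths, key=lambda path: sum(
    -math.log(step_probs[k][c]) for k, c in enumerate(path)))

assert best_by_product == best_by_neglog == (0, 1, 0)
```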

The problem of finding the most probable state sequence (e.g., in hidden Markov models) can be transformed into a shortest path problem with positive weights using the following graph structure:

  1. Add Virtual Nodes:

    • Add source node s and sink node t.
  2. Initial State Arcs:

    • Create arcs from $s$ to each possible initial state $x_0$.

    • Assign length:

      $$\ell(s, x_0) = -\log(\pi_{x_0})$$

      where $\pi_{x_0}$ is the initial probability of state $x_0$. The negative logarithm converts maximizing the initial probability into minimizing a positive cost.

  3. State Transition Arcs:

    • For each time step $k$ $(k = 1, \dots, N)$ and each valid state transition $(x_{k-1}, x_k)$:
      • Create an arc from state $x_{k-1}$ to state $x_k$.
      • Assign length:

        $$\ell(x_{k-1}, x_k) = -\log \left( p_{x_{k-1} x_k} \cdot r(z_k; x_{k-1}, x_k) \right)$$

        where $p_{x_{k-1} x_k}$ is the transition probability from state $x_{k-1}$ to $x_k$, and $r(z_k; x_{k-1}, x_k)$ is the probability (likelihood) of observing $z_k$ given the transition from $x_{k-1}$ to $x_k$. The negative logarithm converts maximizing the product of transition and emission probabilities into minimizing a positive cost.

    • Note: If the transition probability $p_{x_{k-1} x_k} = 0$, do not create the corresponding arc.
  4. Termination State Arcs:

    • Create arcs from each possible terminal state $x_N$ to the sink node $t$.
    • Assign length:

      $$\ell(x_N, t) = 0$$

      This makes the cost of ending the path at any valid terminal state $x_N$ zero.

Solution: In this constructed graph, the shortest path from source node s to sink node t corresponds to the most probable state sequence. All arc lengths are non-negative (termination arcs are zero, others are positive).

Proof of Graph Structure Equivalence
  1. Path Cost Calculation:

    • Total path cost $= \ell(s,x_0) + \sum_{k=1}^N \ell(x_{k-1},x_k) + \ell(x_N,t)$

    $$= -\log\pi_{x_0} - \sum_{k=1}^N \log(p_{x_{k-1}x_k}r_k) + 0$$

    $$= -\log\left( \pi_{x_0} \prod_{k=1}^N p_{x_{k-1}x_k}r_k \right)$$

  2. Equivalence Relation:

    $$\min\left(\text{Path Cost}\right) \iff \min\left(-\log P(X_N,Z_N)\right) \iff \max\left(P(X_N,Z_N)\right)$$

  3. Non-negative Weights Guarantee:

    • Since $0 < p_{ij}, r \leq 1$, we have $-\log(\cdot) \geq 0$
    • Satisfies the requirements of Dijkstra and other shortest path algorithms
  4. Dynamic Programming Advantage:

    • Viterbi algorithm complexity: $O(N|S|^2)$
    • Avoids exhaustive search over the $|S|^N$ possible paths

Forward DP Algorithm (Viterbi Algorithm)

Core Formulas

1. Initial State Probability

$$D_0(x_0) = -\log(\pi_{x_0})$$

  • Meaning: Negative log initial probability of state $x_0$ at time step $k=0$.

  • $\pi_{x_0}$: Probability of the initial state $x_0$.

  • Negative Log Conversion: Converts probability maximization into cost minimization.

2. Recurrence Relation

$$D_{k+1}(x_{k+1}) = \min_{x_k} \left[ D_k(x_k) - \log\left( p_{x_k x_{k+1}} \, r(z_{k+1}; x_k, x_{k+1}) \right) \right]$$

  • Meaning: Computes the minimum cumulative cost of reaching state $x_{k+1}$ at time step $k+1$.

  • $D_k(x_k)$: Minimum cumulative cost of reaching the previous state $x_k$.

  • $p_{x_k x_{k+1}}$: State transition probability (from $x_k$ to $x_{k+1}$).

  • $r(z_{k+1}; x_k, x_{k+1})$: Observation probability (observing $z_{k+1}$ on the transition from $x_k$ to $x_{k+1}$), consistent with the model defined above.

  • Minimization Operation: Evaluates all possible previous states $x_k$ to select the optimal path.

3. Backtracking Optimal Path

$$x_k^* = \arg\min_{x_k} \left[ D_k(x_k) - \log\left( p_{x_k x_{k+1}^*} \, r(z_{k+1}; x_k, x_{k+1}^*) \right) \right]$$

  • Meaning: Backtracks from the final state to determine the globally optimal state sequence $\{x_0^*, x_1^*, \dots, x_N^*\}$.

  • $x_{k+1}^*$: The already-determined next optimal state.

Additional Notes

  • Viterbi Algorithm Essence: A dynamic programming solution for finding the most probable state sequence (maximum a posteriori path) in hidden Markov models (HMMs).
  • Purpose of Negative Log Conversion: Converts the maximization of probability products into the minimization of log sums, avoiding floating-point underflow.
  • Complexity: $O(N \cdot |S|^2)$ ($N$ is the number of time steps, $|S|$ the number of states).
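A minimal Viterbi sketch in the transition-dependent emission convention $r(z; i, j)$ used here, checked against brute-force enumeration. The model numbers (`pi`, `p`, `r`) are invented for illustration:

```python
import math
from itertools import product as cartesian

def viterbi(pi, p, r, zs):
    """Forward DP with D_0(x) = -log pi_x and transition-dependent emissions r(z; i, j)."""
    n = len(pi)
    D = [-math.log(pi[i]) for i in range(n)]
    back = []
    for z in zs:
        nxt, ptr = [], []
        for j in range(n):
            costs = [D[i] - math.log(p[i][j] * r[z][i][j]) for i in range(n)]
            i_best = min(range(n), key=costs.__getitem__)
            nxt.append(costs[i_best]); ptr.append(i_best)
        D, back = nxt, back + [ptr]
    # Backtrack from the cheapest final state.
    x = min(range(n), key=D.__getitem__)
    path = [x]
    for ptr in reversed(back):
        x = ptr[x]; path.append(x)
    return path[::-1]

# Illustrative two-state model (assumed numbers, not from the tutorial).
pi = [0.6, 0.4]
p = [[0.7, 0.3], [0.4, 0.6]]
r = {0: [[0.9, 0.2], [0.5, 0.3]], 1: [[0.1, 0.8], [0.5, 0.7]]}
zs = [1, 0, 0]

def joint(xs):
    prob = pi[xs[0]]
    for k in range(1, len(xs)):
        prob *= p[xs[k - 1]][xs[k]] * r[zs[k - 1]][xs[k - 1]][xs[k]]
    return prob

brute = max(cartesian([0, 1], repeat=len(zs) + 1), key=joint)
assert viterbi(pi, p, r, zs) == list(brute)
```

The DP touches each (previous state, next state) pair once per observation, which is the $O(N \cdot |S|^2)$ cost noted above, versus $|S|^{N+1}$ paths for the brute-force check.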

Hidden Markov Model Example: Convolutional Encoding and Decoding

Background Story

Dr. Yang stored his grandmother’s recipe as a binary sequence (e.g., $\{0, 1, 0, 0, 1, \ldots\}$) and used convolutional encoding to protect the data from theft (e.g., by 6007🧑‍🏫). The data is transmitted through a noisy channel to Dr. Zeng, so the received sequence $\{z_k\}$ may differ from the transmitted sequence $\{y_k\}$. The goal is to decode $\{z_k\}$ to recover the original sequence $\{w_k\}$.

Convolutional Encoding Process

  • Input: Source binary sequence $\{w_1, w_2, \ldots\}$, where $w_k \in \{0, 1\}$.

  • Output: Encoded sequence $\{y_1, y_2, \ldots\}$, where each $y_k$ is an $n$-dimensional binary vector (codeword).

  • State Equations (using modulo-2 arithmetic):

    $$y_k = C x_{k-1} + d w_k, \quad k = 1, 2, \ldots$$

    $$x_k = A x_{k-1} + b w_k, \quad k = 1, 2, \ldots$$

    where:

    • $x_k$ is the hidden state vector (dimension $m \times 1$), with the initial state $x_0$ given (e.g., $x_0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$).
    • $A$ is the state transition matrix ($m \times m$), $C$ is the output matrix ($n \times m$), and $b$ and $d$ are weight vectors (dimensions $m \times 1$ and $n \times 1$ respectively).
    • Parameter Settings:

      $$C = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix}, \quad A = \begin{bmatrix} 0 & 1 \\ 1 & 1 \end{bmatrix}, \quad d = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}, \quad b = \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \quad x_0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$

      Explanation: $y_k$ is a 3-dimensional vector ($n = 3$) and $x_k$ a 2-dimensional vector ($m = 2$). Modulo-2 arithmetic keeps all values binary (0 or 1). The dimensions of $d$ and $b$ must match the output and state dimensions; here $d$ is $3 \times 1$ and $b$ is $2 \times 1$.

Example Calculation

  • Input Sequence: $\{w_1, w_2, w_3, w_4\} = \{1, 0, 0, 1\}$.

  • State and Output Sequence Calculation:

    • $k=1$: $x_1 = A x_0 + b w_1 = \begin{bmatrix} 0 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} 0 \\ 0 \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \end{bmatrix} \cdot 1 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$, $\quad y_1 = C x_0 + d w_1 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 0 \\ 0 \end{bmatrix} + \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} \cdot 1 = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}$.
    • $k=2$: $x_2 = A x_1 + b w_2 = \begin{bmatrix} 0 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} 0 \\ 1 \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \end{bmatrix} \cdot 0 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$, $\quad y_2 = C x_1 + d w_2 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 0 \\ 1 \end{bmatrix} + \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} \cdot 0 = \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix}$.
    • $k=3$: $x_3 = A x_2 + b w_3 = \begin{bmatrix} 0 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \end{bmatrix} \cdot 0 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$ (mod 2), $\quad y_3 = C x_2 + d w_3 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} + \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} \cdot 0 = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}$.
    • $k=4$: $x_4 = A x_3 + b w_4 = \begin{bmatrix} 0 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \end{bmatrix} \cdot 1 = \begin{bmatrix} 0 \\ 1 \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$ (mod 2), $\quad y_4 = C x_3 + d w_4 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \end{bmatrix} + \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} \cdot 1 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} + \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix}$ (mod 2).
  • Results:

    • State Sequence: $\{x_0, x_1, x_2, x_3, x_4\} = \left\{ \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 0 \end{bmatrix} \right\}$ (or $\{00, 01, 11, 10, 00\}$).
    • Output Sequence: $\{y_1, y_2, y_3, y_4\} = \left\{ \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix}, \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix} \right\}$ (or $\{111, 011, 111, 011\}$).
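The hand calculation above can be reproduced mechanically. The sketch below (`conv_encode` is our name for it) implements the stated state equations over GF(2):

```python
def conv_encode(w_seq, x0=(0, 0)):
    """Encode bits via y_k = C x_{k-1} + d w_k, x_k = A x_{k-1} + b w_k (mod 2)."""
    A = [[0, 1], [1, 1]]
    C = [[1, 0], [0, 1], [0, 1]]
    b = [0, 1]
    d = [1, 1, 1]
    x = list(x0)
    states, outputs = [tuple(x)], []
    for w in w_seq:
        # Output uses the *previous* state x_{k-1}; then the state advances.
        y = [(sum(C[i][j] * x[j] for j in range(2)) + d[i] * w) % 2 for i in range(3)]
        x = [(sum(A[i][j] * x[j] for j in range(2)) + b[i] * w) % 2 for i in range(2)]
        outputs.append(tuple(y))
        states.append(tuple(x))
    return states, outputs

states, outputs = conv_encode([1, 0, 0, 1])
assert states == [(0, 0), (0, 1), (1, 1), (1, 0), (0, 0)]
assert outputs == [(1, 1, 1), (0, 1, 1), (1, 1, 1), (0, 1, 1)]
```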

Noisy Channel and Probability Model

  • Transmission Process: The encoded sequence $\{y_k\}$ is transmitted through a noisy channel; the received sequence is $\{z_k\}$ ($z_k$ is also a binary vector).

  • Error Model: A conditional probability $p(z_k \mid y_k)$ is given, representing the probability of receiving $z_k$ when $y_k$ was transmitted.

  • Independence Assumption: Errors are independent, so the joint probability is:

    $$P(Z_N \mid Y_N) = \prod_{k=1}^N p(z_k \mid y_k)$$

    Explanation: $Z_N = \{z_1, \ldots, z_N\}$, $Y_N = \{y_1, \ldots, y_N\}$. This model assumes errors occur independently for each codeword.

Symbol Explanation, Principle Derivation, and Practical Significance of Error Model
  • Symbol Explanation:

    • $z_k$: Received codeword (same dimension as $y_k$), possibly corrupted by noise (e.g., $y_k = 111$ but $z_k = 101$).
    • $p(z_k \mid y_k)$: Conditional probability of receiving $z_k$ given that $y_k$ was transmitted (e.g., for a channel with error rate $\epsilon$, $p(z_k \mid y_k)$ might be defined as $\epsilon$ when $z_k \neq y_k$ and $1-\epsilon$ when $z_k = y_k$). Practical significance: quantifies the likelihood of transmission errors (e.g., the bit-flip probability).
  • Principle Derivation of Joint Probability Independence:

    • Derivation: Assume channel errors occur independently at each time step $k$ (i.e., noise does not affect adjacent transmissions). The joint probability $P(Z_N \mid Y_N)$ is then the product of the marginal probabilities: $P(Z_N \mid Y_N) = P(z_1 \mid y_1) P(z_2 \mid y_2) \cdots P(z_N \mid y_N) = \prod_{k=1}^N p(z_k \mid y_k)$. This follows from the definition of independence in probability theory.
    • Practical significance: Simplifies computation (decoding only needs to handle errors per codeword) and is applicable to many real channels (e.g., an AWGN channel under low-correlation noise).
    • Example: If $N = 2$, $p(z_1 \mid y_1) = 0.9$ (correct reception) and $p(z_2 \mid y_2) = 0.1$ (error), then $P(Z_2 \mid Y_2) = 0.9 \times 0.1 = 0.09$. Analogy: like multiple independent coin tosses, where each outcome does not affect the others.

Decoding Problem Formulation

  • Objective: Minimize the negative log-likelihood to recover the most probable sequence $\{y_k\}$:

    $$\min \sum_{k=1}^N -\log p(z_k \mid y_k)$$

    where the optimization is over all possible $\{y_1, \ldots, y_N\}$.

  • Dependence on the Hidden State: Since $y_k = C x_{k-1} + d w_k$, decoding requires finding the state sequence $\{x_0, \ldots, x_N\}$.

    Explanation: This is equivalent to the decoding problem in hidden Markov models (HMMs), with states $x_k$ and observations $z_k$. Dynamic programming methods (e.g., the Viterbi algorithm) solve it efficiently by optimizing over state transitions.

Principle Derivation of Decoding Objective and Equivalence to HMM
  • Principle Derivation of Minimizing Negative Log-Likelihood:

    • Derivation: The goal is to maximize the likelihood $P(Z_N \mid Y_N)$ of the observation sequence $Z_N$. Since the likelihood is a product ($P(Z_N \mid Y_N) = \prod_{k=1}^N p(z_k \mid y_k)$), maximizing it is equivalent to maximizing the log-likelihood (by monotonicity of $\log$):

      $$\max \log P(Z_N \mid Y_N) = \max \sum_{k=1}^N \log p(z_k \mid y_k)$$

      which is equivalent to minimizing the negative log-likelihood: $\min \sum_{k=1}^N -\log p(z_k \mid y_k)$. Practical significance: converts probability maximization into a numerically stable optimization problem (negative log-likelihood is a common loss function).
    • Symbol Explanation: $-\log p(z_k \mid y_k)$ can be viewed as an “error cost”; lower values indicate a better match between $y_k$ and $z_k$ (e.g., if $p(z_k \mid y_k) = 0.9$, $-\log p \approx 0.105$; if $p = 0.1$, $-\log p \approx 2.3$). Minimizing the sum finds the best-matching sequence.
  • Derivation of Equivalence to a Hidden Markov Model (HMM):

    • Derivation: In HMMs, decoding means finding the most probable state sequence given the observations. Here:
      • Hidden states: $x_k$ (with state transitions $x_k = A x_{k-1} + b w_k$).
      • Observations: $z_k$ (where $y_k$ depends on $x_{k-1}$ and $w_k$).
        Since $y_k$ is a function of $x_{k-1}$ and $w_k$, and $w_k$ is implicit in the state transition, the observation probability $p(z_k \mid y_k)$ maps to the emission probability of the transition $(x_{k-1}, x_k)$ in the HMM (via the $y_k$ linkage). The objective $\min \sum -\log p(z_k \mid y_k)$ corresponds to the path cost in the Viterbi algorithm. Practical significance: enables efficient algorithms (e.g., Viterbi) for long sequences.
    • Analogy: Like finding the shortest path through a maze (the state sequence), where each step’s “cost” is determined by the observation match.
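Putting the pieces together, a trellis-based Viterbi decoder for this code can use the Hamming distance between $z_k$ and each candidate $y_k$ as the step cost: under a memoryless bit-flip channel with error rate below 1/2, $-\log p(z_k \mid y_k)$ is an increasing affine function of that distance, so the minimizers coincide. A sketch (`viterbi_decode` is our name), with the noiseless case as a sanity check:

```python
A = [[0, 1], [1, 1]]
C = [[1, 0], [0, 1], [0, 1]]
b = [0, 1]
d = [1, 1, 1]

def step(x, w):
    """One encoder step (mod 2): returns (next_state, output codeword)."""
    y = tuple((sum(C[i][j] * x[j] for j in range(2)) + d[i] * w) % 2 for i in range(3))
    nx = tuple((sum(A[i][j] * x[j] for j in range(2)) + b[i] * w) % 2 for i in range(2))
    return nx, y

def viterbi_decode(zs, x0=(0, 0)):
    """Minimize total Hamming distance over the trellis (ML under a BSC, eps < 1/2)."""
    D = {x0: (0, [])}                      # state -> (cost, decoded bits so far)
    for z in zs:
        nxt = {}
        for x, (cost, bits) in D.items():
            for w in (0, 1):
                nx, y = step(x, w)
                c = cost + sum(zi != yi for zi, yi in zip(z, y))
                if nx not in nxt or c < nxt[nx][0]:
                    nxt[nx] = (c, bits + [w])   # keep the cheapest survivor per state
        D = nxt
    return min(D.values())[1]              # bits of the cheapest terminal state

# Noiseless reception of the encoded example sequence recovers w = {1, 0, 0, 1}:
zs = [(1, 1, 1), (0, 1, 1), (1, 1, 1), (0, 1, 1)]
assert viterbi_decode(zs) == [1, 0, 0, 1]
```

Storing the survivor bit list per state keeps the sketch short; a production decoder would keep backpointers instead.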

Convolutional Code State Transition Diagram (Mermaid)

Caption:

  • Nodes represent states (e.g., “00” represents $x_k = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$)

  • Arrow labels format: input/output (e.g., “w=1/y=111”)

  • State transition relationships based on example parameters

  • Dashed box indicates possible initial state

Viterbi Algorithm Decoding Process (Mermaid)

Caption:

  • Each node represents the state and cumulative cost at time step k

  • δ(z_k, y_k) = -log p(z_k|y_k) represents the step cost

  • Thick border box indicates the current optimal path (needs to be calculated based on actual z_k)

  • Algorithm principle: retain the minimum cost path to each state and expand step by step


Shortest Path Algorithms

Label Correcting Methods

Motivation

  • Dynamic Programming (DP) algorithms require computing the cost function $J_k$ for every state $x_k$ at every time $k$, which is computationally expensive over all nodes and arcs.

  • Core Idea: Optimize computation by excluding irrelevant nodes → Shortest Path Algorithms.

Key Point: When state space is large, DP becomes infeasible; more efficient path search methods are needed.


Setup

  • Origin: Node $s$

  • Destination: Node $t$

  • Child Node: Node $j$ is a child of node $i$ if the arc $(i,j)$ exists

  • Assumption: All arc lengths $a_{ij} \geq 0$ (non-negative weights)


Core Concept

  • Maintain a label $d_i$ for each node $i$, recording the length of the current shortest known path from the origin $s$ to $i$.

  • During execution, $d_i$ only decreases.

  • When $d_i$ is updated, check whether its child nodes $j$ need correction.

  • The destination label $d_t$ (denoted UPPER) equals the true shortest path length upon convergence.

Intuition: UPPER is the current upper bound on the shortest $s \to t$ path; the algorithm tightens this bound by iteratively correcting labels.


Algorithm Steps

  1. Initialization:

    • $d_s \leftarrow 0$
    • $d_i \leftarrow \infty$ for all $i \neq s$
    • The OPEN set initially contains only $s$
    • $\text{UPPER} \leftarrow \infty$
  2. Iterative Update:

    • Remove a node $i$ from OPEN
    • For each child node $j$ of $i$:
      • If $d_i + a_{ij} < \min\{d_j, \text{UPPER}\}$:
        • $d_j \leftarrow d_i + a_{ij}$
        • Set $i$ as the parent of $j$
        • If $j \neq t$ and $j \notin \text{OPEN}$, add $j$ to OPEN
        • If $j = t$, update $\text{UPPER} \leftarrow d_t$
  3. Termination Condition: Stop when OPEN is empty; otherwise repeat step 2.
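These steps transcribe directly into code. The sketch below uses a FIFO OPEN list (the Bellman-Ford-style strategy discussed later); the graph literal is a made-up example:

```python
from collections import deque

def label_correcting(graph, s, t):
    """Label correcting shortest path with a FIFO OPEN list and UPPER pruning.

    graph: dict mapping node -> list of (child, arc_length), arc lengths >= 0.
    Returns (shortest s->t length, path), or (inf, None) if t is unreachable.
    """
    INF = float("inf")
    d = {s: 0}
    parent = {s: None}
    upper = INF
    OPEN = deque([s])
    while OPEN:
        i = OPEN.popleft()
        for j, a in graph.get(i, ()):
            if d[i] + a < min(d.get(j, INF), upper):   # the test from step 2
                d[j] = d[i] + a
                parent[j] = i
                if j == t:
                    upper = d[t]                       # tighten the upper bound
                elif j not in OPEN:
                    OPEN.append(j)
    if upper == INF:
        return INF, None
    path, node = [], t
    while node is not None:
        path.append(node); node = parent[node]
    return upper, path[::-1]

graph = {
    's': [('a', 1), ('b', 4)],
    'a': [('b', 2), ('t', 6)],
    'b': [('t', 1)],
}
assert label_correcting(graph, 's', 't') == (4, ['s', 'a', 'b', 't'])
```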


Key Notes:

  • OPEN Set: Stores nodes to be checked, ensuring only affected nodes are updated.

  • Non-negative Weights: Guarantees algorithm correctness (no negative-weight cycles).

  • Efficiency Improvement: UPPER enables pruning to avoid unnecessary computations.


Label Correcting Methods: Correctness Proof

Proposition

If a path exists from origin ss to destination tt, upon termination UPPER equals the shortest path length; otherwise UPPER = ∞ upon termination.


Proof

1. Algorithm Must Terminate

  • Key Observation:

    • Each time node $j$ is added to OPEN, $d_j$ strictly decreases and always equals the length of some $s \to j$ path.
    • Let $d_j^0$ be the value of $d_j$ when $j$ is first added to OPEN (finite).
  • Non-negative Weights Guarantee:

    $$\text{Finite number of paths} + \text{no negative cycles} \implies d_j \text{ decreases finitely many times}$$

  • Conclusion: OPEN set will eventually be empty; algorithm terminates.

Core: Non-negative arc lengths ensure path lengths are bounded below; the label values $d_j$ cannot decrease indefinitely.


2. No Path Case: UPPER = ∞

  • Proof by Contradiction:

    • Assume the arc $(i,t)$ exists (i.e., $t$ has a predecessor $i$).
    • If $i$ ever enters OPEN → a path $s \to i$ exists → a path $s \to t$ exists (contradiction).
  • Corollary:

    • $i$ never enters OPEN → $d_t$ is never updated
    • From initialization $d_t = \infty$, so $\text{UPPER} = \infty$

3. Path Exists Case: UPPER Equals Shortest Path Length

Let the shortest path be $P^* = (s, j_1, \dots, j_k, t)$ with length $d^*$.

  • Subpath Optimality:

    Any subpath $(s, j_1, \dots, j_m)$ of $P^*$ is also a shortest path from $s$ to $j_m$.

  • Proof by Contradiction Assumption:
    If upon termination $\text{UPPER} > d^*$, then throughout execution:

    $$\text{UPPER} > \text{length of subpath } (s, j_1, \dots, j_m), \quad \forall m = 1, \dots, k$$

  • Recursive Contradiction:

    1. Node $j_k$ cannot enter OPEN with the optimal value of $d_{j_k}$ (otherwise $d_{j_k} + a_{j_k t} < \text{UPPER}$ would update $\text{UPPER} = d^*$)
    2. Similarly, $j_{k-1}$ cannot enter OPEN with the optimal value of $d_{j_{k-1}}$
    3. Recurse back to $j_1$:
      • But during initialization, $j_1$ enters OPEN with $d_{j_1} = a_{s j_1}$ (the optimal $s \to j_1$ value)
      • Contradiction!

Conclusion: $\text{UPPER}$ must converge to $d^*$.

Implementation Strategies for Label Correcting Methods

Common Strategy Comparison

1. Breadth-First Search (BFS) / Bellman-Ford Method

  • Queue Strategy: First-In-First-Out (FIFO)

  • Operation Rules:

    • Remove node from top of OPEN
    • Add new nodes to bottom of OPEN
  • Characteristics:

    • Handles graphs with negative-weight edges (though the current setting assumes non-negative weights)
    • Time complexity: $O(|V| \cdot |E|)$

2. Depth-First Search (DFS)

  • Queue Strategy: Last-In-First-Out (LIFO)

  • Operation Rules:

    • Remove node from top of OPEN
    • Add new nodes to top of OPEN
  • Characteristics:

    • Lower memory usage (does not store all level nodes)
    • May prioritize exploring long paths first; efficiency unstable

3. Best-First Search (Dijkstra’s Algorithm)

  • Selection Strategy: Choose node in OPEN with smallest label

    $$\text{Remove node } i \quad \text{where} \quad d_i = \min_{j \in \text{OPEN}} d_j$$

  • Characteristics:

    • Key Property: Each node enters OPEN at most once
    • Overhead: Requires maintaining a min-heap; each operation is $O(\log n)$
    • Time Complexity: $O(|E| + |V| \log |V|)$ (optimal for non-negative weights)

Advantage: Most efficient in non-negative weight graphs
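A standard heap-based sketch of this best-first strategy (the example graph is invented):

```python
import heapq

def dijkstra(graph, s, t):
    """Best-first label setting: always expand the OPEN node with the smallest label."""
    INF = float("inf")
    d = {s: 0}
    done = set()
    heap = [(0, s)]                        # min-heap of (label, node)
    while heap:
        di, i = heapq.heappop(heap)
        if i in done:
            continue                       # stale entry from an earlier update
        if i == t:
            return di                      # label is final when popped
        done.add(i)
        for j, a in graph.get(i, ()):
            nd = di + a
            if nd < d.get(j, INF):
                d[j] = nd
                heapq.heappush(heap, (nd, j))
    return INF

graph = {
    's': [('a', 1), ('b', 4)],
    'a': [('b', 2), ('t', 6)],
    'b': [('t', 1)],
}
assert dijkstra(graph, 's', 't') == 4
```

Because arc lengths are non-negative, a node's label is already optimal the first time it is popped, which is exactly the "each node enters OPEN at most once" property noted above.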


4. Hybrid Optimization Strategies

(1) Small Label First (SLF)
  • Operation Rules:

    • Remove node from top of OPEN
    • Add a new node $i$ based on:
      • If $d_i \leq d_{\text{top}}$ (the label of the top node), add it to the top
      • Otherwise add it to the bottom
  • Goal: Approximate Dijkstra’s effect without sorting overhead

(2) Large Label Last (LLL)
  • Enhancement Strategy (often combined with SLF):

    1. Compute the average label in OPEN: $\bar{d} = \frac{1}{|\text{OPEN}|} \sum_{j \in \text{OPEN}} d_j$
    2. If the top node’s label $d_{\text{top}} > \bar{d}$, move it to the bottom
    3. Repeat until the top node’s label is $\leq \bar{d}$
  • Goal: Delay processing high-label nodes to accelerate convergence
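A sketch combining both heuristics on a deque-based OPEN list. For brevity it omits the UPPER pruning of the general method and only computes distances; the example graph is invented:

```python
from collections import deque

def slf_lll_shortest_path(graph, s, t):
    """Label correcting with Small-Label-First insertion and Large-Label-Last removal."""
    INF = float("inf")
    d = {s: 0}
    OPEN = deque([s])
    while OPEN:
        # LLL: rotate until the top label is no larger than the OPEN average.
        avg = sum(d[n] for n in OPEN) / len(OPEN)
        while d[OPEN[0]] > avg:
            OPEN.rotate(-1)                # move the top node to the bottom
        i = OPEN.popleft()
        for j, a in graph.get(i, ()):
            if d[i] + a < d.get(j, INF):
                d[j] = d[i] + a
                if j not in OPEN:
                    # SLF: small labels go to the top, large ones to the bottom.
                    if OPEN and d[j] <= d[OPEN[0]]:
                        OPEN.appendleft(j)
                    else:
                        OPEN.append(j)
    return d.get(t, INF)

graph = {
    's': [('a', 1), ('b', 4)],
    'a': [('b', 2), ('t', 6)],
    'b': [('t', 1)],
}
assert slf_lll_shortest_path(graph, 's', 't') == 4
```

The rotation loop always terminates: at least one node in OPEN has a label no greater than the average.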


Strategy Selection Summary

| Strategy | Time Complexity | Advantages | Applicable Scenarios |
|---|---|---|---|
| BFS (Bellman-Ford) | $O(\|V\| \cdot \|E\|)$ | Handles negative-weight edges | General graphs |
| DFS | Unstable | Low memory consumption | Depth-first path exploration |
| Dijkstra | $O(\|E\| + \|V\| \log \|V\|)$ | Optimal for non-negative weights | Non-negative weight graphs |
| SLF+LLL | Near Dijkstra | Low-overhead near-optimal solution | Efficient heuristic for large-scale graphs |