SDSC6007 Course 3-Tutorial and HMM
#sdsc6007
Tutorial 1
Problem Setting
- Dynamic System:
- Initial State:
- Cost Function:
- State Space:
- Control Constraints:
- Stochastic Disturbance:
  - If :
  - If or : (probability 1)
Dynamic Programming Recursion Formula
Define the value function $J_k(x)$ as the minimum expected cost starting from state $x$ at time $k$.
Bellman Equation:
$$J_k(x) = \min_{u \in U(x)} \mathbb{E}_{w_k}\big[\, g_k(x, u, w_k) + J_{k+1}\big(f_k(x, u, w_k)\big) \,\big]$$
Boundary Condition: $J_N(x) = g_N(x)$ (end of problem)
Step 1: Time k=3 (Final Step)
Compute for each state:
| $x_3$ | $U(x_3)$ | Optimal Cost | Optimal $u_3$ |
|---|---|---|---|
| 0 | {0,1,2,3,4,5} | 0 | |
| 1 | {-1,0,1,2,3,4} | 0 | |
| 2 | {-2,-1,0,1,2,3} | 0 | |
| 3 | {-3,-2,-1,0,1,2} | 0 | |
| 4 | {-4,-3,-2,-1,0,1} | 0 | |
| 5 | {-5,-4,-3,-2,-1,0} | 0 |
Result: $J_3(x) = 0$ for all $x$
Step 2: Time k=2
Expected value calculation:
- If :
- If or :
For $x_2 = 0$:
| $u$ | $x_2 + u$ | Expected Future Cost | Total Cost |
|---|---|---|---|
| 0 | 0 | ||
| 1 | 1 | ||
| 2 | 2 | ||
| 3 | 3 | ||
| 4 | 4 | ||
| 5 | 5 |
,
For $x_2 = 1$:
| $u$ | $x_2 + u$ | Expected Future Cost | Total Cost |
|---|---|---|---|
| -1 | 0 | ||
| 0 | 1 | ||
| 1 | 2 | ||
| 2 | 3 | ||
| 3 | 4 | ||
| 4 | 5 |
,
Continue computing other states:
: ,
: , (or )
: ,
: , (or , both yield same cost)
Step 3: Time k=1
Use the computed $J_2$ values to calculate $J_1$.
For $x_1 = 0$:
| $u$ | $x_1 + u$ | Expected Future Cost | Total Cost |
|---|---|---|---|
| 0 | 0 | ||
| 1 | 1 |
Result: $J_1(0) = 0$, $u_1^*(0) = 0$
For $x_1 = 1$:
| $u$ | $x_1 + u$ | Expected Future Cost | Total Cost |
|---|---|---|---|
| -1 | 0 | ||
| 0 | 1 |
Result: $J_1(1) = 2$, $u_1^*(1) = -1$
For $x_1 = 2$:
| $u$ | $x_1 + u$ | Expected Future Cost | Total Cost |
|---|---|---|---|
| -2 | 0 | ||
| -1 | 1 |
Result: $J_1(2) = 8$, $u_1^*(2) = -2$
For $x_1 = 3$:
| $u$ | $x_1 + u$ | Expected Future Cost | Total Cost |
|---|---|---|---|
| -3 | 0 | ||
| -2 | 1 | ||
| -1 | 2 |
Result: $J_1(3) = 16.5$, $u_1^*(3) = -2$
For $x_1 = 4$:
After computation: $J_1(4) = 28.5$, $u_1^*(4) = -3$
For $x_1 = 5$:
After computation: $J_1(5) = 42.5$, $u_1^*(5) = -3$
Complete Value Function for Time 1:
| $x_1$ | $J_1(x_1)$ | $u_1^*(x_1)$ |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 2 | -1 |
| 2 | 8 | -2 |
| 3 | 16.5 | -2 |
| 4 | 28.5 | -3 |
| 5 | 42.5 | -3 |
Step 4: Time k=0
For the initial state $x_0 = 5$, compute $J_0(5)$:
| $u$ | $x_0 + u$ | Expected Future Cost | Total Cost |
|---|---|---|---|
| -5 | 0 | ||
| -4 | 1 | ||
| -3 | 2 | ||
| -2 | 3 | ||
| -1 | 4 | ||
| 0 | 5 |
Result: optimal control $u_0^*(5) = -3$ (guides the state toward 2)
Optimal Policy Summary
Optimal policy starting from initial state $x_0 = 5$:
| Time | State | Optimal Control | Explanation |
|---|---|---|---|
| $k=0$ | $x_0 = 5$ | $u_0^* = -3$ | Guides state toward 2 |
| $k=1$ | Depends on $x_1$ | Look up $u_1^*(x_1)$ | From value function table |
| $k=2$ | Depends on $x_2$ | Look up $u_2^*(x_2)$ | From value function table |
| $k=3$ | Path-dependent | $u_3^* = 0$ | Always choose 0 regardless of state |
Minimum Expected Total Cost:
Policy Characteristics:
- Tends to guide the state toward smaller values (since the state cost is minimized at 0)
- Requires balancing control cost and future expected cost
- Stochastic disturbances necessitate consideration of all possible future states
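The backward recursion used in Steps 1–4 can be sketched in code. Since the tutorial's exact dynamics, stage cost, and disturbance distribution were not preserved in these notes, `stage_cost`, `disturbances`, and `next_state` below are illustrative placeholders; only the overall structure (backward induction over a finite horizon, with an expectation over the disturbance) mirrors the tutorial.

```python
# Hedged sketch of finite-horizon stochastic DP; problem data are placeholders.
N = 4                      # horizon: decisions at k = 0, 1, 2, 3
STATES = range(6)          # state space {0, ..., 5}

def controls(x):
    # keep x + u inside the state space, matching the tutorial's control sets
    return [u for u in range(-5, 6) if 0 <= x + u <= 5]

def stage_cost(x, u):
    # placeholder stage cost (assumption, not the tutorial's actual cost)
    return abs(x) + u * u

def disturbances(x, u):
    # placeholder disturbance distribution: (w, probability) pairs
    return [(0, 0.5), (1, 0.25), (-1, 0.25)]

def next_state(x, u, w):
    return min(5, max(0, x + u + w))   # clip back into the state space

# J[k][x] = minimum expected cost-to-go; boundary condition J_N = 0
J = [{x: 0.0 for x in STATES} for _ in range(N + 1)]
policy = [{} for _ in range(N)]

for k in reversed(range(N)):
    for x in STATES:
        best = min(
            (stage_cost(x, u)
             + sum(p_w * J[k + 1][next_state(x, u, w)]
                   for w, p_w in disturbances(x, u)), u)
            for u in controls(x)
        )
        J[k][x], policy[k][x] = best

print(J[0])
print(policy[0])
```

Swapping in the course problem's actual `stage_cost`, disturbance distribution, and dynamics reproduces the tables above.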
Hidden Markov Model (HMM)
Model Description
- A stationary finite-state Markov chain with transition probabilities $p_{ij}$ between states.
- States are hidden (not directly observable). The initial state is $i$ with probability $\pi_i$.
- After each state transition, a value is observed. Given a transition from state $i$ to state $j$, the probability of observing $z$ is $r(z; i, j)$.
- Observations are independent across transitions.
- Transition probabilities and observation probabilities are assumed to be mutually independent.
Symbol Explanation and Model Understanding
- $p_{ij}$: Probability of transitioning from state $i$ to state $j$ (e.g., the probability of the weather changing from “sunny” to “rainy”)
- $\pi_i$: Probability that the system starts in state $i$ (e.g., the probability that the first day is “sunny”)
- $r(z; i, j)$: Probability of observing $z$ during the transition $i \to j$ (e.g., the probability of observing “high humidity” during a “sunny to rainy” transition)
- Independence Assumption: Observation $z_k$ depends only on the current state transition $(x_{k-1}, x_k)$, independent of historical states and observations.
Objective
Given an observation sequence $Z_N = \{z_1, \ldots, z_N\}$, estimate the most probable state sequence $\hat{X}_N = \{\hat{x}_0, \hat{x}_1, \ldots, \hat{x}_N\}$.
Method
To estimate $X_N$, maximize the conditional probability $p(X_N \mid Z_N)$. Since
$$p(X_N \mid Z_N) = \frac{p(X_N, Z_N)}{p(Z_N)},$$
maximizing $p(X_N \mid Z_N)$ is equivalent to maximizing the joint probability $p(X_N, Z_N)$, because $p(Z_N)$ is constant for a given observation sequence.
Joint Probability Derivation
The joint probability of the state sequence $X_N = \{x_0, x_1, \ldots, x_N\}$ and observation sequence $Z_N = \{z_1, \ldots, z_N\}$ is derived as follows.
Using the chain rule and the independence assumptions, then applying the Markov property and observation independence, it recursively simplifies to:
$$p(X_N, Z_N) = \pi_{x_0} \prod_{k=1}^{N} p_{x_{k-1} x_k}\, r(z_k; x_{k-1}, x_k)$$
Detailed Derivation Process
Chain Rule Decomposition:
$$p(X_N, Z_N) = p(x_0) \prod_{k=1}^{N} p(x_k, z_k \mid x_0, \ldots, x_{k-1}, z_1, \ldots, z_{k-1})$$
Applying the Markov Property:
Each future state depends only on the current state:
$$p(x_k \mid x_0, \ldots, x_{k-1}, z_1, \ldots, z_{k-1}) = p_{x_{k-1} x_k}$$
Observation Independence:
Each observation is determined only by the current transition:
$$p(z_k \mid x_0, \ldots, x_k, z_1, \ldots, z_{k-1}) = r(z_k; x_{k-1}, x_k)$$
Joint Probability of One Step:
$$p(x_k, z_k \mid x_0, \ldots, x_{k-1}, z_1, \ldots, z_{k-1}) = p_{x_{k-1} x_k}\, r(z_k; x_{k-1}, x_k)$$
Final Form:
Recursive substitution yields
$$p(X_N, Z_N) = \pi_{x_0} \prod_{k=1}^{N} p_{x_{k-1} x_k}\, r(z_k; x_{k-1}, x_k)$$
Explanation:
- $\pi_{x_0}$: Probability of the starting state $x_0$.
- $p_{x_{k-1} x_k}$: Probability of transitioning from state $x_{k-1}$ to $x_k$ at step $k$.
- $r(z_k; x_{k-1}, x_k)$: Probability of observing $z_k$ given the transition from $x_{k-1}$ to $x_k$.
- The product captures the sequence of state transitions and observations, leveraging independence to decompose the joint probability into a product.
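The factorization can be checked numerically. The two-state chain below (`pi`, `p`, `r`) is a made-up illustration, not course data:

```python
# Hedged toy example of p(X, Z) = pi[x0] * prod_k p_{x_{k-1} x_k} * r(z_k; ...)
pi = {0: 0.6, 1: 0.4}                      # initial state probabilities (assumed)
p = {0: {0: 0.7, 1: 0.3},                  # transition probabilities p_ij (assumed)
     1: {0: 0.4, 1: 0.6}}
# r[(i, j)][z]: probability of observing z on the transition i -> j (assumed)
r = {(i, j): ({"a": 0.9, "b": 0.1} if j == 0 else {"a": 0.2, "b": 0.8})
     for i in (0, 1) for j in (0, 1)}

def joint_probability(states, obs):
    """p(X, Z) for a state path x_0..x_N and observations z_1..z_N."""
    prob = pi[states[0]]
    for k, z in enumerate(obs, start=1):
        i, j = states[k - 1], states[k]
        prob *= p[i][j] * r[(i, j)][z]
    return prob

print(joint_probability([0, 0, 1], ["a", "b"]))
# = 0.6 · (0.7 · 0.9) · (0.3 · 0.8) ≈ 0.0907
```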
Equivalent Optimization Problem
Maximizing $p(X_N, Z_N)$ is equivalent to minimizing the negative log-likelihood:
$$-\ln \pi_{x_0} - \sum_{k=1}^{N} \ln\big(p_{x_{k-1} x_k}\, r(z_k; x_{k-1}, x_k)\big)$$
Principle of Negative Logarithm Conversion
- Monotonicity Principle:
  - The $\ln$ function is monotonically increasing, so maximizing $f$ is equivalent to maximizing $\ln f$
  - Minimizing $-\ln f$ is equivalent to maximizing $f$
- Product to Sum Conversion:
  - $\ln \prod_k a_k = \sum_k \ln a_k$
- Numerical Stability:
  - Probability products can cause floating-point underflow
  - Logarithmic summation avoids multiplying many small values
- Optimization Advantage:
  - The sum form is suitable for dynamic programming
  - Each term can be viewed as a cumulative “path cost”
Note: This conversion transforms the probability product into a sum of logarithmic terms, facilitating optimization (e.g., using dynamic programming algorithms like the Viterbi algorithm). Minimizing this sum finds the state sequence that best explains the observations.
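The numerical-stability point can be demonstrated directly; the probabilities below are random placeholders:

```python
import math
import random

# Multiplying many small probabilities underflows to 0.0 in floating point,
# while summing their logs stays finite and comparable.
random.seed(0)
probs = [random.uniform(0.01, 0.1) for _ in range(500)]

product = 1.0
for q in probs:
    product *= q          # underflows to exactly 0.0 well before 500 factors

log_sum = sum(math.log(q) for q in probs)   # large negative, but finite

print(product)   # 0.0
print(log_sum)
```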
The problem of finding the most probable state sequence (e.g., in hidden Markov models) can be transformed into a shortest path problem with positive weights using the following graph structure:
- Add Virtual Nodes:
  - Add a source node `s` and a sink node `t`.
- Initial State Arcs:
  - Create arcs from `s` to each possible initial state $x_0$.
  - Assign length: $-\ln \pi_{x_0}$, where $\pi_{x_0}$ is the initial probability of state $x_0$. The negative logarithm converts maximizing the initial probability into minimizing a positive cost.
- State Transition Arcs:
  - For each time step $k = 1, \ldots, N$ and each valid state transition $x_{k-1} \to x_k$:
    - Create an arc from state $x_{k-1}$ at stage $k-1$ to state $x_k$ at stage $k$.
    - Assign length: $-\ln\big(p_{x_{k-1} x_k}\, r(z_k; x_{k-1}, x_k)\big)$, where $p_{x_{k-1} x_k}$ is the transition probability from $x_{k-1}$ to $x_k$, and $r(z_k; x_{k-1}, x_k)$ is the probability (likelihood) of observing $z_k$ given the transition. The negative logarithm converts maximizing the product of transition and emission probabilities into minimizing a positive cost.
    - Note: If the transition probability $p_{x_{k-1} x_k} = 0$, do not create the corresponding arc.
- Termination State Arcs:
  - Create arcs from each possible terminal state $x_N$ to the sink node `t`.
  - Assign length: $0$. This makes the cost of ending the path at any valid terminal state zero.
Solution: In this constructed graph, the shortest path from source node s to sink node t corresponds to the most probable state sequence. All arc lengths are non-negative (termination arcs are zero, others are positive).
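As a sketch of this construction (with a made-up two-state model and observation sequence; the numbers are illustrative assumptions), the code below builds the layered graph with $-\ln$ arc lengths and runs Dijkstra from `s` to `t`:

```python
import heapq
import math

# Toy model (assumed numbers, not course data)
pi = {0: 0.5, 1: 0.5}
p = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}
r = {(i, j): ({"a": 0.9, "b": 0.1} if j == 0 else {"a": 0.2, "b": 0.8})
     for i in (0, 1) for j in (0, 1)}
obs = ["a", "b", "b"]
N = len(obs)

# Build the trellis: s -> (0, i) -> (1, i) -> ... -> (N, i) -> t
graph = {"s": [((0, i), -math.log(pi[i])) for i in pi], "t": []}
for k in range(N):
    for i in (0, 1):
        graph[(k, i)] = [((k + 1, j), -math.log(p[i][j] * r[(i, j)][obs[k]]))
                         for j in (0, 1) if p[i][j] > 0]   # skip zero-prob arcs
for i in (0, 1):
    graph[(N, i)] = [("t", 0.0)]        # zero-length termination arcs

def dijkstra(source, target):
    dist, prev, seen, tie = {source: 0.0}, {}, set(), 0
    heap = [(0.0, 0, source)]           # (distance, tiebreak counter, node)
    while heap:
        d, _, u = heapq.heappop(heap)
        if u in seen:
            continue
        seen.add(u)
        for v, w in graph[u]:
            if d + w < dist.get(v, math.inf):
                dist[v], prev[v] = d + w, u
                tie += 1
                heapq.heappush(heap, (d + w, tie, v))
    path, node = [], target
    while node != source:
        path.append(node)
        node = prev[node]
    return list(reversed(path)), dist[target]

path, cost = dijkstra("s", "t")
states = [node[1] for node in path if isinstance(node, tuple)]
print(states, cost)
```

The recovered `states` list is the most probable state sequence, and `exp(-cost)` is its joint probability with the observations.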
Proof of Graph Structure Equivalence
- Path Cost Calculation:
  - Total path cost $= -\ln \pi_{x_0} - \sum_{k=1}^{N} \ln\big(p_{x_{k-1} x_k}\, r(z_k; x_{k-1}, x_k)\big) = -\ln p(X_N, Z_N)$
- Equivalence Relation:
  - Minimizing the total path cost is equivalent to maximizing the joint probability $p(X_N, Z_N)$
- Non-negative Weights Guarantee:
  - Since $0 < p \le 1$ for every probability on an arc, we have $-\ln p \ge 0$
  - Satisfies the requirements for Dijkstra and other shortest path algorithms
- Dynamic Programming Advantage:
  - Viterbi algorithm complexity is $O(N m^2)$ for $m$ states and $N$ time steps
  - Avoids exhaustive search over the $m^{N+1}$ possible paths
Forward DP Algorithm (Viterbi Algorithm)
Core Formulas
1. Initial State Probability
$$D_0(i) = -\ln \pi_i$$
- Meaning: Negative log of the initial probability of state $i$ at time step $0$.
- $\pi_i$: Probability of initial state $i$.
- Negative Log Conversion: Converts probability maximization into cost minimization.
2. Recurrence Relation
$$D_k(j) = \min_{i} \Big[ D_{k-1}(i) - \ln\big(p_{ij}\, r(z_k; i, j)\big) \Big]$$
- Meaning: Computes the minimum cumulative cost $D_k(j)$ to reach state $j$ at time step $k$.
- $D_{k-1}(i)$: Minimum cumulative cost to the previous state $i$.
- $p_{ij}$: State transition probability (from $i$ to $j$).
- $r(z_k; i, j)$: Observation probability (observing $z_k$ on the transition $i \to j$).
- Minimization Operation: Evaluates all possible previous states $i$ to select the optimal path.
3. Backtracking Optimal Path
$$\hat{x}_k = \arg\min_{i} \Big[ D_k(i) - \ln\big(p_{i \hat{x}_{k+1}}\, r(z_{k+1}; i, \hat{x}_{k+1})\big) \Big]$$
- Meaning: Backtracks from the final state to determine the globally optimal state sequence $\{\hat{x}_0, \ldots, \hat{x}_N\}$.
- $\hat{x}_{k+1}$: The already determined next optimal state.
Additional Notes
- Viterbi Algorithm Essence: Dynamic programming solution for finding the most probable state sequence (maximum a posteriori path) in hidden Markov models (HMMs).
- Purpose of Negative Log Conversion: Converts the maximization of probability products into minimization of log sums, avoiding floating-point underflow.
- Complexity: $O(N m^2)$, where $N$ is the number of time steps and $m$ is the number of states.
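The three formulas above combine into a compact implementation. The two-state model below (`pi`, `p`, `r`) is a made-up illustration, not course data; the structure follows $D_0(i) = -\ln \pi_i$ and the recurrence directly.

```python
import math

pi = {0: 0.5, 1: 0.5}                            # initial probabilities (assumed)
p = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}   # transitions p_ij (assumed)
r = {(i, j): ({"a": 0.9, "b": 0.1} if j == 0 else {"a": 0.2, "b": 0.8})
     for i in (0, 1) for j in (0, 1)}            # emissions r(z; i, j) (assumed)

def viterbi(obs):
    states = list(pi)
    D = {i: -math.log(pi[i]) for i in states}    # D_0(i) = -ln(pi_i)
    back = []                                    # back[k][j]: best predecessor of j
    for z in obs:
        new_D, ptr = {}, {}
        for j in states:
            # D_k(j) = min_i [ D_{k-1}(i) - ln(p_ij * r(z_k; i, j)) ]
            costs = {i: D[i] - math.log(p[i][j] * r[(i, j)][z]) for i in states}
            ptr[j] = min(costs, key=costs.get)
            new_D[j] = costs[ptr[j]]
        D = new_D
        back.append(ptr)
    # backtrack from the cheapest final state
    x_final = min(D, key=D.get)
    path, x = [x_final], x_final
    for ptr in reversed(back):
        x = ptr[x]
        path.append(x)
    return list(reversed(path)), D[x_final]

print(viterbi(["a", "b", "b"]))
```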
Hidden Markov Model Example: Convolutional Encoding and Decoding
Background Story
Dr. Yang stored his grandmother’s recipe as a binary sequence and used convolutional encoding to protect the data from theft (e.g., by 6007🧑‍🏫). The encoded data is transmitted through a noisy channel to Dr. Zeng, where the received sequence $z_1, \ldots, z_N$ may differ from the transmitted sequence $y_1, \ldots, y_N$. The goal is to decode $z_1, \ldots, z_N$ to recover the original sequence $w_1, \ldots, w_N$.
Convolutional Encoding Process
- Input: Source binary sequence $w_1, w_2, \ldots, w_N$, where $w_k \in \{0, 1\}$.
- Output: Encoded sequence $y_1, y_2, \ldots, y_N$; each $y_k$ is an $n$-dimensional binary vector (codeword).
- State Equations (using modulo 2 arithmetic):
$$x_k = A x_{k-1} + b\, w_k \pmod 2, \qquad y_k = C x_{k-1} + d\, w_k \pmod 2$$
Where:
- $x_k$ is the hidden state vector (dimension $m$), with the initial state $x_0$ given (e.g., $x_0 = 0$).
- $A$ is the state transition matrix ($m \times m$), $C$ is the output matrix ($n \times m$), and $b$, $d$ are weight vectors (dimensions $m$ and $n$ respectively).
- Parameter Settings:
Explanation: $y_k$ is a 3-dimensional vector ($n = 3$) and $x_k$ is a 2-dimensional vector ($m = 2$). Modulo 2 arithmetic ensures all values are binary (0 or 1). The dimensions of $b$ and $d$ must match the state and output dimensions; here $d$ is 3×1 and $b$ is 2×1.
Example Calculation
- Input Sequence: .
- State and Output Sequence Calculation:
  - $k=1$: , .
  - $k=2$: , .
  - $k=3$: , .
  - $k=4$: (mod 2), (mod 2).
- Results:
  - State Sequence: (or ).
  - Output Sequence: (or ).
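The state equations can be sketched directly in code. Since the notes’ actual parameter values were not preserved, the matrix `A`, `C` and vector `b`, `d` entries below are placeholder values with the stated dimensions (2-dimensional state, 3-bit codewords); only the mod-2 recursion itself follows the equations above.

```python
# Hedged sketch of x_k = A x_{k-1} + b w_k, y_k = C x_{k-1} + d w_k (all mod 2).
A = [[1, 1], [0, 1]]           # 2x2 state transition matrix (assumed values)
b = [0, 1]                     # length-2 weight vector (assumed values)
C = [[0, 1], [1, 0], [1, 1]]   # 3x2 output matrix (assumed values)
d = [1, 1, 1]                  # length-3 weight vector (assumed values)

def mod2_affine(M, x, v, w):
    """Compute (M @ x + v * w) mod 2 without external libraries."""
    return [(sum(m * s for m, s in zip(row, x)) + vi * w) % 2
            for row, vi in zip(M, v)]

def encode(bits, x0=(0, 0)):
    x, outputs = list(x0), []
    for w in bits:
        outputs.append(mod2_affine(C, x, d, w))   # codeword y_k from x_{k-1}
        x = mod2_affine(A, x, b, w)               # next hidden state x_k
    return outputs

print(encode([1, 0, 1]))
```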
Noisy Channel and Probability Model
- Transmission Process: The encoded sequence $y_1, \ldots, y_N$ is transmitted through a noisy channel; the received sequence is $z_1, \ldots, z_N$ (each $z_k$ is also a binary vector).
- Error Model: Given the conditional probability $p(z_k \mid y_k)$, representing the probability of receiving $z_k$ given transmission of $y_k$.
- Independence Assumption: Errors are independent; the joint probability is:
$$p(z_1, \ldots, z_N \mid y_1, \ldots, y_N) = \prod_{k=1}^{N} p(z_k \mid y_k)$$
Explanation: This model assumes errors occur independently for each codeword.
Symbol Explanation, Principle Derivation, and Practical Significance of Error Model
- Symbol Explanation:
  - $z_k$: Received codeword (same dimension as $y_k$), possibly erroneous due to noise (e.g., a bit of $y_k$ flipped in transit).
  - $p(z_k \mid y_k)$: Conditional probability, representing the probability of receiving $z_k$ given transmission of $y_k$ (e.g., for a channel with error rate $\varepsilon$, it might be defined as $1 - \varepsilon$ when $z_k = y_k$ and $\varepsilon$ when $z_k \neq y_k$). Practical significance: quantifies the likelihood of transmission errors (e.g., bit-flip probability).
- Principle Derivation of Joint Probability Independence:
  - Derivation: Assumes channel errors occur independently at each time step (i.e., noise does not affect adjacent transmissions). Thus the joint probability is the product of the marginal probabilities: $p(z_1, \ldots, z_N \mid y_1, \ldots, y_N) = \prod_{k=1}^{N} p(z_k \mid y_k)$. This follows from the definition of independence in probability theory (if events are independent, the joint probability is the product).
  - Practical significance: Simplifies model computation (e.g., decoding only needs to handle errors per codeword), applicable to many real channels (e.g., an AWGN channel under low-correlation noise).
  - Example: With $N = 2$, if $z_1 = y_1$ (correct reception) with probability $1 - \varepsilon$ and $z_2 \neq y_2$ (error) with probability $\varepsilon$, then the joint probability is $(1 - \varepsilon)\varepsilon$. Analogy: like multiple independent coin tosses, where each outcome does not affect the others.
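A minimal sketch of the independent-error model, assuming a per-codeword error rate `eps` (the rate and the sequences are illustrative, not course data):

```python
import math

eps = 0.1   # assumed per-codeword error probability

def p_received(z, y):
    """p(z_k | y_k) under an assumed per-codeword error model."""
    return 1 - eps if z == y else eps

def joint_likelihood(zs, ys):
    # independence: the joint probability is the product over time steps
    return math.prod(p_received(z, y) for z, y in zip(zs, ys))

ys = [[1, 1, 1], [0, 1, 1], [1, 0, 0]]    # transmitted codewords (example)
zs = [[1, 1, 1], [0, 1, 1], [1, 1, 0]]    # received; last codeword corrupted
print(joint_likelihood(zs, ys))           # 0.9 * 0.9 * 0.1
```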
Decoding Problem Formulation
- Objective: Minimize the negative log-likelihood to recover the most probable sequence:
$$\min \sum_{k=1}^{N} -\ln p(z_k \mid y_k)$$
where the optimization variables are all possible input sequences $w_1, \ldots, w_N$ (equivalently, all feasible codeword sequences $y_1, \ldots, y_N$).
- Dependence on Hidden State: Since $y_k$ is determined by the hidden state $x_{k-1}$ and input $w_k$, decoding requires finding the state sequence $x_0, x_1, \ldots, x_N$.
Explanation: This is equivalent to the decoding problem in hidden Markov models (HMMs), with states $x_k$ and observations $z_k$. Dynamic programming methods (e.g., the Viterbi algorithm) can efficiently solve it by optimizing over state transitions.
Principle Derivation of Decoding Objective and Equivalence to HMM
- Principle Derivation of Minimizing Negative Log-Likelihood:
  - Derivation: The goal is to maximize the likelihood of the observation sequence. Since the likelihood is a product ($\prod_k p(z_k \mid y_k)$), maximizing it is equivalent to maximizing the log-likelihood (by monotonicity of $\ln$), $\max \sum_k \ln p(z_k \mid y_k)$, which is in turn equivalent to minimizing the negative log-likelihood $\min \sum_k -\ln p(z_k \mid y_k)$. Practical significance: converts probability maximization into a numerically stable optimization problem (negative log-likelihood is commonly used as a loss function).
  - Symbol Explanation: $-\ln p(z_k \mid y_k)$ can be viewed as an “error cost”: lower values indicate a better match between $z_k$ and $y_k$ (if $p(z_k \mid y_k) = 1$ the cost is $0$; if it is small the cost is large). Minimizing the sum finds the best-matching sequence.
- Derivation of Equivalence to the Hidden Markov Model (HMM):
  - Derivation: In HMMs, the decoding problem is to find the most probable state sequence given the observations. Here:
    - Hidden states: $x_k$ (defining the state transitions $x_{k-1} \to x_k$).
    - Observations: $z_k$ (each depends on $y_k$, which in turn depends on $x_{k-1}$ and $w_k$).
  - Since $y_k$ is a function of $x_{k-1}$ and $w_k$, and $w_k$ is implicit in the state transition, the observation probability $p(z_k \mid y_k)$ maps to the emission probability $r(z_k; x_{k-1}, x_k)$ in the HMM. The optimization objective then corresponds to the path cost in the HMM’s Viterbi algorithm. Practical significance: enables efficient algorithms (e.g., Viterbi) for long sequences.
  - Analogy: like finding the shortest path in a maze (the state sequence), where each step’s “cost” is determined by the observation match.
Convolutional Code State Transition Diagram (Mermaid)
stateDiagram-v2
direction LR
[*] --> 00
00 --> 01 : w=1/y=111
00 --> 00 : w=0/y=000
01 --> 11 : w=0/y=011
01 --> 10 : w=1/y=100
11 --> 01 : w=0/y=111
11 --> 00 : w=1/y=100
10 --> 10 : w=0/y=011
10 --> 11 : w=1/y=111
Caption:
- Nodes represent states (e.g., “00” represents $x = (0, 0)$)
- Arrow label format: input/output (e.g., “w=1/y=111”)
- State transition relationships are based on the example parameters
- The dashed box indicates the possible initial state
Viterbi Algorithm Decoding Process (Mermaid)
stateDiagram-v2
direction TB
state "k=0\nState:00\nCost:0" as s00
state "k=1\nState:00\nCost:δ(z₁,000)" as s00_1
state "k=1\nState:01\nCost:δ(z₁,111)" as s01_1
state "k=2\nState:00\nCost:δ(z₁,000)+δ(z₂,000)" as s00_2
state "k=2\nState:01\nCost:δ(z₁,000)+δ(z₂,111)" as s01_2
state "k=2\nState:11\nCost:δ(z₁,111)+δ(z₂,011)" as s11_2
state "k=2\nState:10\nCost:δ(z₁,111)+δ(z₂,100)" as s10_2
state "k=3\nState:00\nCost:..." as s00_3
state "k=3\nState:01\nCost:..." as s01_3
state "k=3\nState:11\nCost:..." as s11_3
state "k=3\nState:10\nCost:..." as s10_3
[*] --> s00
s00 --> s00_1 : w=0/y=000
s00 --> s01_1 : w=1/y=111
s00_1 --> s00_2 : w=0/y=000
s00_1 --> s01_2 : w=1/y=111
s01_1 --> s11_2 : w=0/y=011
s01_1 --> s10_2 : w=1/y=100
s00_2 --> s00_3 : w=0/y=000
s00_2 --> s01_3 : w=1/y=111
s01_2 --> s11_3 : w=0/y=011
s01_2 --> s10_3 : w=1/y=100
s11_2 --> s01_3 : w=0/y=111
s11_2 --> s00_3 : w=1/y=100
s10_2 --> s10_3 : w=0/y=011
s10_2 --> s11_3 : w=1/y=111
note right of s00_1
Cost calculation:
δ(z_k,y_k) = -log p(z_k|y_k)
represents the match degree when receiving z_k and sending y_k
end note
note left of s11_2
Optimal path selection:
retain the minimum cost path to each state
prune high-cost branches
end note
Caption:
- Each node represents the state and cumulative cost at time step k
- δ(z_k, y_k) = -log p(z_k|y_k) represents the step cost
- The thick-bordered box indicates the current optimal path (to be determined from the actual z_k)
- Algorithm principle: retain the minimum-cost path to each state and expand step by step
Shortest Path Algorithms
Label Correcting Methods
Motivation
- Dynamic Programming (DP) algorithms compute the cost-to-go for each state at each time step, which is computationally expensive when every node and arc must be processed.
- Core Idea: Optimize computation by excluding irrelevant nodes → shortest path algorithms.
Key Point: When the state space is large, exhaustive DP becomes infeasible; more efficient path search methods are needed.
Setup
- Origin: Node $s$
- Destination: Node $t$
- Child Node: Node $j$ is a child of node $i$ if the arc $(i, j)$ exists
- Assumption: All arc lengths satisfy $a_{ij} \ge 0$ (non-negative weights)
Core Concept
- Maintain a label $d_i$ for each node $i$, recording the length of the current shortest known path from the origin to $i$.
- During execution, $d_i$ only decreases.
- When $d_i$ is updated, check whether the labels of its child nodes need correction.
- The destination label (denoted `UPPER`) equals the true shortest path length upon convergence.
Intuition: `UPPER` is the current upper bound on the shortest $s \to t$ path length; the algorithm tightens this bound by iteratively correcting labels.
Algorithm Steps
1. Initialization: Set $d_s = 0$, $d_j = \infty$ for all $j \neq s$, and `UPPER` $= \infty$; the `OPEN` set initially contains only $s$.
2. Iterative Update:
   - Remove a node $i$ from `OPEN`
   - For each child node $j$ of $i$:
     - If $d_i + a_{ij} < \min(d_j, \text{UPPER})$:
       - Set $d_j = d_i + a_{ij}$ and set $i$ as the parent of $j$
       - If $j \neq t$ and $j$ is not already in `OPEN`, add $j$ to `OPEN`
       - If $j = t$, update `UPPER` $= d_t$
3. Termination Condition: Stop when `OPEN` is empty; otherwise repeat step 2.
Key Notes:
- `OPEN` Set: Stores nodes still to be examined, ensuring only affected nodes are updated.
- Non-negative Weights: Guarantee algorithm correctness (no negative-weight cycles).
- Efficiency Improvement: `UPPER` enables pruning to avoid unnecessary computations.
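The steps above can be sketched with a FIFO `OPEN` list (one of several queue disciplines). The small graph is an illustrative example, not course data:

```python
from collections import deque
from math import inf

graph = {                       # graph[i] = list of (child, arc length)
    "s": [("a", 1.0), ("b", 4.0)],
    "a": [("b", 2.0), ("t", 6.0)],
    "b": [("t", 1.0)],
    "t": [],
}

def label_correcting(origin, dest):
    d = {node: inf for node in graph}   # labels d_i
    d[origin] = 0.0
    parent = {}
    upper = inf                         # UPPER: best known origin -> dest length
    open_set = deque([origin])          # OPEN, FIFO discipline
    while open_set:
        i = open_set.popleft()          # remove a node from OPEN
        for j, a_ij in graph[i]:
            if d[i] + a_ij < min(d[j], upper):   # label correction test
                d[j] = d[i] + a_ij
                parent[j] = i
                if j == dest:
                    upper = d[j]        # tighten the upper bound
                elif j not in open_set:
                    open_set.append(j)
    # reconstruct the path by following parents back from the destination
    path, node = [dest], dest
    while node != origin:
        node = parent[node]
        path.append(node)
    return list(reversed(path)), upper

print(label_correcting("s", "t"))
```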
Label Correcting Methods: Correctness Proof
Proposition
If a path exists from the origin $s$ to the destination $t$, then upon termination `UPPER` equals the shortest path length; otherwise `UPPER` $= \infty$ upon termination.
Proof
1. Algorithm Must Terminate
- Key Observation:
  - Each time a node $j$ is added to `OPEN`, its label $d_j$ strictly decreases and always equals the length of some path from the origin to $j$.
  - Let $d_j^0$ be the (finite) value of $d_j$ when $j$ is first added to `OPEN`.
- Non-negative Weights Guarantee: Every label value is the length of some path, and since cycles only add non-negative length, only finitely many distinct (cycle-free) path lengths lie below $d_j^0$; hence each label can decrease only finitely many times.
- Conclusion: The `OPEN` set eventually becomes empty; the algorithm terminates.
Core: Non-negative arc lengths ensure path lengths are bounded below; label values cannot decrease indefinitely.
2. No Path Case: `UPPER` $= \infty$
- Proof by Contradiction:
  - Assume some arc $(i, t)$ exists (i.e., $t$ has a predecessor $i$).
  - If $i$ ever enters `OPEN`, then a path $s \to i$ exists, so a path $s \to t$ exists (contradiction).
- Corollary:
  - No predecessor of $t$ can ever enter `OPEN`, so $d_t$ is never updated.
  - From the initialization `UPPER` $= \infty$, it therefore remains $\infty$.
3. Path Exists Case: `UPPER` Equals the Shortest Path Length
Let $(s, j_1, j_2, \ldots, j_k, t)$ be a shortest path with length $d^*$.
- Subpath Optimality: Each subpath $(s, j_1, \ldots, j_m)$ is itself a shortest path from $s$ to $j_m$.
- Proof by Contradiction Assumption:
If upon termination `UPPER` $> d^*$, then throughout execution `UPPER` $> d^*$.
- Recursive Contradiction:
  - Node $j_k$ cannot enter `OPEN` with its optimal label (otherwise it would update $d_t$, and hence `UPPER`, to $d^*$)
  - Similarly, $j_{k-1}$ cannot enter `OPEN` with its optimal label
  - Recurse back to $j_1$:
  - But during initialization, $s$ enters `OPEN` with $d_s = 0$ (its optimal value)
  - Contradiction!
Conclusion: `UPPER` must converge to $d^*$.
Implementation Strategies for Label Correcting Methods
Common Strategy Comparison
1. Breadth-First Search (BFS) / Bellman-Ford Method
- Queue Strategy: First-In-First-Out (FIFO)
- Operation Rules:
  - Remove nodes from the top of `OPEN`
  - Add new nodes to the bottom of `OPEN`
- Characteristics:
  - Handles graphs with negative-weight edges (though the current setting requires non-negative weights)
  - Time complexity: $O(V \cdot E)$
2. Depth-First Search (DFS)
- Queue Strategy: Last-In-First-Out (LIFO)
- Operation Rules:
  - Remove nodes from the top of `OPEN`
  - Add new nodes to the top of `OPEN`
- Characteristics:
  - Lower memory usage (does not store all nodes of a level)
  - May explore long paths first; efficiency is unstable
3. Best-First Search (Dijkstra’s Algorithm)
- Selection Strategy: Choose the node in `OPEN` with the smallest label $d_i$
- Characteristics:
  - Key Property: Each node enters `OPEN` at most once
  - Overhead: Requires maintaining a min-heap; each operation costs $O(\log V)$
  - Time Complexity: $O(E + V \log V)$ (optimal for non-negative weights)
Advantage: Most efficient on non-negative weight graphs
4. Hybrid Optimization Strategies
(1) Small Label First (SLF)
- Operation Rules:
  - Remove nodes from the top of `OPEN`
  - Add a new node $j$ based on its label:
    - If $d_j \le d_{\text{top}}$ (the label of the top node), add $j$ to the top
    - Otherwise add $j$ to the bottom
- Goal: Approximate Dijkstra’s behavior without the sorting overhead
(2) Large Label Last (LLL)
- Enhancement Strategy (often combined with SLF):
  - Compute the average label of the nodes in `OPEN`: $\bar{d} = \frac{1}{|\text{OPEN}|} \sum_{i \in \text{OPEN}} d_i$
  - If the top node’s label $d_{\text{top}} > \bar{d}$, move it to the bottom
  - Repeat until the top node’s label satisfies $d_{\text{top}} \le \bar{d}$
- Goal: Delay processing high-label nodes to accelerate convergence
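The SLF insertion rule can be grafted onto the label correcting loop as follows; the toy graph and its weights are illustrative assumptions:

```python
from collections import deque
from math import inf

graph = {                       # graph[i] = list of (child, arc length)
    "s": [("a", 2.0), ("b", 1.0)],
    "a": [("t", 1.0)],
    "b": [("a", 0.5), ("t", 3.0)],
    "t": [],
}

def slf_shortest(origin, dest):
    d = {n: inf for n in graph}
    d[origin] = 0.0
    upper = inf
    open_q = deque([origin])
    while open_q:
        i = open_q.popleft()
        for j, a_ij in graph[i]:
            if d[i] + a_ij < min(d[j], upper):
                d[j] = d[i] + a_ij
                if j == dest:
                    upper = d[j]
                elif j not in open_q:
                    # Small Label First: low-label nodes go to the top of OPEN
                    if open_q and d[j] <= d[open_q[0]]:
                        open_q.appendleft(j)
                    else:
                        open_q.append(j)
    return upper

print(slf_shortest("s", "t"))
```

The only change from the plain FIFO variant is the insertion rule, which tends to expand low-label nodes first, approximating Dijkstra’s order without a heap.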
Strategy Selection Summary
| Strategy | Time Complexity | Advantages | Applicable Scenarios |
|---|---|---|---|
| BFS (Bellman-Ford) | $O(V \cdot E)$ | Simple; handles negative weights | General graphs |
| DFS | Unstable | Low memory consumption | Depth-first path exploration |
| Dijkstra | $O(E + V \log V)$ | Each node expanded at most once | Non-negative weight graphs |
| SLF+LLL | Near Dijkstra | Low-overhead near-optimal solution | Efficient heuristic for large-scale graphs |
