Introduction to Dynamic Programming

#sdsc6007


Introduction

The Discrete-Time Dynamic System

The system has the form

$$x_{k+1} = f_{k}(x_k, u_k, w_k), \quad k = 0, 1, \ldots, N-1,$$

where

  • $k$: index of discrete time

  • $N$: the horizon, i.e., the number of times control is applied

  • $x_k$: the state of the system, from the set of states $S_k$

  • $u_k$: the control/decision variable/action to be selected from the set $U_k(x_k)$ at time $k$

  • $w_k$: a random parameter (also called disturbance)

  • $f_k$: a function that describes how the state is updated

Assumption
The $w_k$'s are independent. Their probability distributions may depend on $x_k$ and $u_k$.
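To make the notation concrete, one trajectory of the system can be simulated by drawing each $w_k$ independently and applying $f_k$; a minimal rollout sketch, where the particular dynamics, feedback law, and noise distribution are illustrative assumptions (only the interface mirrors the model above):

```python
import random

# Roll out x_{k+1} = f_k(x_k, u_k, w_k) for k = 0, ..., N-1.
# f, mu, and the disturbance distribution below are invented for illustration.

def f(k, x, u, w):
    return x + u + w                # hypothetical linear dynamics

def mu(k, x):
    return -0.5 * x                 # hypothetical feedback law

N = 5
random.seed(0)
traj = [10.0]                       # initial state x_0
for k in range(N):
    w = random.gauss(0.0, 1.0)      # independent disturbance w_k
    traj.append(f(k, traj[-1], mu(k, traj[-1]), w))

print(traj)                         # states x_0, ..., x_N
```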

The Cost Function

The (expected) cost has the form

$$\mathbb{E} \left[ g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k, w_k) \right]$$

where

  • $g_k(x_k, u_k, w_k)$: the cost incurred at time $k$.

  • $g_N(x_N)$: the terminal cost incurred at the end of the process.

Note
Because of $w_k$, the cost is a random variable, so we optimize its expectation.

A Deterministic Scheduling Problem

Example
Prof wants to produce luxury headphones that perform better than the bear-pods 3 he is currently using. To do so, four operations, denoted A, B, C, D, must be performed on a certain machine. Assume that

  1. operation B can only be performed after operation A

  2. operation D can only be performed after operation C

Denote

  • setup cost $C_{mn}$ for passing from any operation $m$ to operation $n$

  • initial startup cost $S_A$ or $S_C$ (can only start with operation A or C)

Solution

  • We need to make only three decisions (the fourth operation is then determined by the first three)

  • This problem is deterministic (no $w_k$)

  • This problem has a finite number of states

  • Deterministic problems with a finite number of states can be represented by a transition graph
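Before introducing DP machinery, the optimum here can be found by brute force over the six feasible orderings; a sketch in which the startup costs `S` and setup costs `C` are made-up numbers, not values from the example:

```python
from itertools import permutations

# Enumerate all orderings of A, B, C, D respecting "A before B" and
# "C before D", and score each with hypothetical startup/setup costs.
S = {"A": 5, "C": 3}                        # startup costs (assumed values)
C = {("A", "B"): 2, ("A", "C"): 3, ("A", "D"): 4,
     ("B", "C"): 1, ("B", "D"): 3,
     ("C", "A"): 2, ("C", "B"): 4, ("C", "D"): 6,
     ("D", "A"): 3, ("D", "B"): 1}          # setup costs (assumed values)

def cost(order):
    total = S[order[0]]                     # must start with A or C
    for m, n in zip(order, order[1:]):
        total += C[(m, n)]
    return total

valid = [p for p in permutations("ABCD")
         if p.index("A") < p.index("B") and p.index("C") < p.index("D")]

best = min(valid, key=cost)
print(len(valid), best, cost(best))
```

With only 6 feasible schedules brute force is trivial; the DP view becomes valuable when the number of stages grows.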

A Deterministic Scheduling Problem - Transition Graph

Discrete-State and Finite-State Problems

To capture the transition between states, it is often convenient to define the transition probabilities

$$p_{ij}(u,k) = \mathbb{P}(x_{k+1} = j \mid x_k = i, u_k = u)$$

In the dynamic system, this means that $x_{k+1} = w_k$, where $w_k$ follows a probability distribution with probabilities $p_{ij}(u,k)$.

Example

Consider a problem of $N$ time periods in which a machine can be in any one of $n$ states. We denote by $g(i)$ the operating cost per period when the machine is in state $i$. Assume that

$$g(1) \leq g(2) \leq \cdots \leq g(n)$$

That is, a machine in state $i$ works more efficiently than a machine in state $i+1$. During a period of operation, the state of the machine can become worse or stay the same, with probabilities

$$p_{ij} = \mathbb{P}(\text{next state will be } j \mid \text{current state is } i), \quad \text{and } p_{ij} = 0 \text{ if } j < i.$$

At the start of each period, we can choose

  • let the machine operate one more period

  • repair the machine and bring it to state 1 (where it will stay for one period) at a cost $R$
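Under assumed numbers (3 states, horizon $N=4$, the costs `g`, repair cost `R`, and the upper-triangular transition matrix `p` below are all invented for illustration), the backward recursion for this repair problem can be sketched as:

```python
# Operate: pay g(i), transition by row i of p.
# Repair: pay R + g(1), then transition by row 1 of p (0-indexed as row 0).
n, N, R = 3, 4, 4.0
g = [1.0, 2.0, 5.0]                    # g(1) <= g(2) <= g(3), 0-indexed
p = [[0.6, 0.3, 0.1],
     [0.0, 0.7, 0.3],
     [0.0, 0.0, 1.0]]                  # p[i][j] = 0 for j < i

J = [0.0] * n                          # J_N = 0 (no terminal cost assumed)
policy = []
for k in range(N - 1, -1, -1):
    ev = [sum(p[i][j] * J[j] for j in range(n)) for i in range(n)]
    repair = R + g[0] + ev[0]          # repair to state 1, then operate
    operate = [g[i] + ev[i] for i in range(n)]
    policy.insert(0, ["repair" if repair < operate[i] else "operate"
                      for i in range(n)])
    J = [min(operate[i], repair) for i in range(n)]

print(J)       # J_0(i): optimal expected N-period cost from each state
```

For these particular numbers the recursion repairs only in the worst state; with a larger `R` it would never repair.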


Inventory Control Problem

Ordering a quantity of a certain item at each stage to meet a stochastic demand

  • $x_k$: stock available at the beginning of the $k$th period

  • $u_k$: stock ordered and delivered at the beginning of the $k$th period

  • $w_k$: demand during the $k$th period, with a given probability distribution

  • $r(\cdot)$: penalty/cost for either positive or negative stock

  • $c$: cost per unit ordered

Example: Inventory Control


Suppose $(u_0^{\star}, u_1^\star, \ldots, u_{N-1}^\star)$ is the optimal solution of

$$\min \mathbb{E} \left[ R(x_N) + \sum_{k=0}^{N-1} \left( r(x_k + u_k - w_k) + c \cdot u_k \right) \right]$$

What if $w_1 = w_2 = \cdots = w_{N-1} = 0$? (Recall that the $w_i$'s are the demands.)

We can do better if we can adjust our decisions to different situations!

Open-Loop and Closed-Loop Control

Open-loop Control

At the initial time $k=0$, given the initial state $x_0$, find the optimal control sequence $(u_0^\star, u_1^\star, \ldots, u_{N-1}^\star)$ minimizing the expected total cost:

  • Key feature: Subsequent state information is NOT used to adjust control decisions

Closed-loop Control

At each time $k$, make decisions based on the current state information $x_k$ (e.g., the ordering decision at time $k$):

  • Core objective: Find a state feedback strategy $\mu_k(\cdot)$ mapping state $x_k$ to control $u_k$

  • Decision characteristics:

    • Re-optimization at each decision point $k$

    • Control rule designed for every possible state value $x_k$

  • Computational properties:

    • Higher computational cost (requires real-time state mapping)

    • Same performance as open-loop control when no uncertainty exists

Closed-loop Control

Core Concepts

  • Control Law Definition
    Let $\mu_k(\cdot)$ be a function mapping state $x_k$ to control $u_k$:

$$u_k = \mu_k(x_k)$$

  • Control Policy
    Define a policy $\pi$ as a sequence of control laws:

$$\pi = \{\mu_0, \mu_1, \ldots, \mu_{N-1}\}$$

  • Policy Cost Function
    Given initial state $x_0$, the expected cost of policy $\pi$ is:

$$J_{\pi}(x_0) = \mathbb{E} \left[ g_N(x_N) + \sum_{k=0}^{N-1} g_k \big( x_k, \mu_k(x_k), w_k \big) \right]$$

Admissible Policy

A policy $\pi = \{\mu_0, \ldots, \mu_{N-1}\}$ is called admissible if and only if:

$$\mu_k(x_k) \in U_k(x_k), \quad \forall x_k \in S_k, \ \forall k = 0,1,\ldots,N-1$$

That is, at each time $k$ and for every possible state $x_k$, the control $u_k$ must belong to the allowable control set $U_k(x_k)$.

Summary

Definition

Consider the function $J^*$ defined by:

$$J^*(x_0) = \min_{\pi \in \Pi} J_{\pi}(x_0), \quad \forall x_0 \in S_0$$

where:

  • $\Pi$: the set of all admissible policies

  • $J_{\pi}(x_0)$: the expected cost of policy $\pi$ starting from initial state $x_0$

We call $J^*$ the optimal value function.

Key Properties

  1. Global Optimality
    $J^*$ gives the minimum possible expected cost from any initial state $x_0$

  2. Policy Independence
    $J^*$ represents a theoretical performance limit, independent of any specific policy

  3. Benchmarking Role
    Any admissible policy $\pi$ satisfies:

$$J_{\pi}(x_0) \geq J^*(x_0), \quad \forall x_0 \in S_0$$

Computational Significance

  • Core Objective of DP: Compute $J^*$ exactly via backward induction

  • Control Engineering Application: Measure the gap between actual policies and the theoretical optimum

The Dynamic Programming Algorithm

Principle of Optimality

Theorem Statement

Let $\pi^* = \{\mu_0^*, \mu_1^*, \ldots, \mu_{N-1}^*\}$ be an optimal policy for the basic problem. Suppose that when using $\pi^*$, the system reaches state $x_i$ at time $i$ with positive probability. Consider the subproblem starting from $(x_i, i)$:

$$\min \mathbb{E} \left[ g_N(x_N) + \sum_{k=i}^{N-1} g_k \big( x_k, \mu_k(x_k), w_k \big) \right]$$

Then the truncated policy $\{\mu_i^*, \mu_{i+1}^*, \ldots, \mu_{N-1}^*\}$ is optimal for this subproblem.

Core Implications

  • Piecemeal Construction: The optimal policy can be constructed in a piecemeal fashion

  • Tail First: First solve the "tail subproblem" (starting from the final stage)

  • Sequential Extension: Extend sequentially back to the original problem

  1. Heritability of Optimality
    Any tail portion of a globally optimal policy remains optimal for its starting state

  2. Time Consistency
    Optimal decisions account for both immediate cost and optimal future state evolution

  3. Foundation for Backward Induction
    This principle validates the dynamic programming backward solution approach.

DP Algorithm - Inventory Control Example

Example
(Figure: inventory control example)

Tail Subproblem Solving Process

Core idea: Backward solution from final stage to initial stage

  1. Length-1 Tail Subproblem (at $k = N-1$)

    • Decision Objective: Minimize current cost + expected terminal cost

      $$\text{minimize} \quad c \cdot u_{N-1} + \mathbb{E}_{w_{N-1}} \left[ R(x_N) \right]$$

      • $c \cdot u_{N-1}$: Current ordering cost
      • $\mathbb{E}_{w_{N-1}} \left[ R(x_N) \right]$: Expected terminal inventory cost
    • State Transition: $x_N = x_{N-1} + u_{N-1} - w_{N-1}$

      $$\Rightarrow \text{minimize} \quad c \cdot u_{N-1} + \mathbb{E}_{w_{N-1}} \left[ R(x_{N-1} + u_{N-1} - w_{N-1}) \right]$$

    • Value Function Calculation:

      $$J_{N-1}(x_{N-1}) = r(x_{N-1}) + \min_{u_{N-1} \geq 0} \left\{ c \cdot u_{N-1} + \mathbb{E}_{w_{N-1}} \left[ R(x_{N-1} + u_{N-1} - w_{N-1}) \right] \right\}$$

      • $r(x_{N-1})$: Current holding/shortage cost
      • Compute for all $x_{N-1} \in S_{N-1}$
  2. Length-2 Tail Subproblem (at $k = N-2$)

    • Decision Objective: Minimize current cost + expected future cost

      $$\text{minimize} \quad r(x_{N-2}) + c \cdot u_{N-2} + \mathbb{E}_{w_{N-2}} \left[ J_{N-1}(x_{N-1}) \right]$$

    • State Transition: $x_{N-1} = x_{N-2} + u_{N-2} - w_{N-2}$

      $$\Rightarrow \text{minimize} \quad r(x_{N-2}) + c \cdot u_{N-2} + \mathbb{E}_{w_{N-2}} \left[ J_{N-1}(x_{N-2} + u_{N-2} - w_{N-2}) \right]$$

    • Value Function Calculation:

      $$J_{N-2}(x_{N-2}) = r(x_{N-2}) + \min_{u_{N-2} \geq 0} \left\{ c \cdot u_{N-2} + \mathbb{E}_{w_{N-2}} \left[ J_{N-1}(x_{N-2} + u_{N-2} - w_{N-2}) \right] \right\}$$
  3. General Recursion (Length-$(N-k)$ Tail Subproblem)

    $$J_k(x_k) = r(x_k) + \min_{u_k \geq 0} \left\{ c \cdot u_k + \mathbb{E}_{w_k} \left[ J_{k+1}(x_k + u_k - w_k) \right] \right\}$$

    • $r(x_k)$: Inventory cost at time $k$
    • $c \cdot u_k$: Ordering cost at time $k$
    • $\mathbb{E}_{w_k} \left[ J_{k+1}(\cdot) \right]$: Expected optimal future cost

Final Goal: Solve the original problem via the length-$N$ tail subproblem at $k=0$

The Dynamic Programming Algorithm

For every initial state $x_0$, the optimal cost $J^*(x_0)$ equals $J_0(x_0)$, computed by the following backward recursion:

Boundary Condition:

$$J_N(x_N) = g_N(x_N)$$

Recursive Relation:

$$J_k(x_k) = \min_{u_k \in U_k(x_k)} \mathbb{E}_{w_k} \left[ g_k(x_k,u_k,w_k) + J_{k+1}\big(f_k(x_k,u_k,w_k)\big) \right] \quad (\star)$$

The expectation $\mathbb{E}_{w_k}$ is over the probability distribution of $w_k$ (which may depend on $x_k$ and $u_k$). If $u_k^\star = \mu_k^\star(x_k)$ minimizes the right-hand side of $(\star)$ for every $x_k$ and $k$, then $\pi^\star = \{\mu_0^\star, \ldots, \mu_{N-1}^\star\}$ is optimal.
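For finite state, control, and disturbance sets, the boundary condition and the recursion $(\star)$ translate directly into code; a generic sketch (the function names and interface are my own, and for simplicity the disturbance distribution is assumed independent of $(x_k, u_k)$):

```python
def dp(N, states, controls, f, g, gN, Pw):
    """Backward induction: returns cost-to-go tables J[k][x] and policy mu[k][x].

    controls(x) lists the allowed controls U(x) (assumed time-invariant here);
    Pw is a list of (w, probability) pairs for the disturbance.
    """
    J = [dict() for _ in range(N + 1)]
    mu = [dict() for _ in range(N)]
    for x in states:
        J[N][x] = gN(x)                      # boundary condition J_N = g_N
    for k in range(N - 1, -1, -1):
        for x in states:
            best_u, best_val = None, float("inf")
            for u in controls(x):
                # E_w[ g_k(x, u, w) + J_{k+1}(f_k(x, u, w)) ]
                val = sum(p * (g(k, x, u, w) + J[k + 1][f(k, x, u, w)])
                          for w, p in Pw)
                if val < best_val:
                    best_u, best_val = u, val
            J[k][x], mu[k][x] = best_val, best_u
    return J, mu
```

In the general model the distribution of $w_k$ may depend on $x_k$ and $u_k$; that case would pass `Pw` as a function of `(k, x, u)` instead of a fixed list.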

Key Definitions

  • $J_k(x_k)$: Cost-to-go at state $x_k$ and time $k$

  • $J_k$: Cost-to-go function at time $k$

Algorithm Proof (Informal)

Setup
  1. Define the subpolicy $\pi_k = \{\mu_k, \mu_{k+1}, \ldots, \mu_{N-1}\}$

  2. Let $J_k^\star(x_k)$ be the optimal cost of the $(N-k)$-stage problem starting at $(x_k, k)$:

$$J_k^\star(x_k) = \min_{\pi_k} \mathbb{E}_{w_k,\ldots,w_{N-1}} \left[ g_N(x_N) + \sum_{i=k}^{N-1} g_i\big(x_i, \mu_i(x_i), w_i\big) \right]$$

  3. Boundary conditions:

    • At $k=0$: $J_0^\star$ solves the original problem
    • At $k=N$: $J_N^\star(x_N) = g_N(x_N)$

Inductive Proof

Goal: Prove that $J_k^\star = J_k$ (the DP algorithm value) for all $k$

  1. Base Case: Holds at $k=N$ ($J_N^\star = g_N(x_N) = J_N$)

  2. Induction Hypothesis: Assume $J_{k+1}^\star(x_{k+1}) = J_{k+1}(x_{k+1})$ for all $x_{k+1} \in S_{k+1}$

  3. Derivation:

$$
\begin{align*}
J_k^\star(x_k) &= \min_{(\mu_k, \pi_{k+1})} \mathbb{E}_{w_k,\ldots,w_{N-1}} \left[ g_N(x_N) + \sum_{i=k}^{N-1} g_i\big(x_i, \mu_i(x_i), w_i\big) \right] \\
&= \min_{(\mu_k, \pi_{k+1})} \mathbb{E} \left[ g_k(x_k, \mu_k(x_k), w_k) + g_N(x_N) + \sum_{i=k+1}^{N-1} g_i(\cdots) \right] \\
&= \min_{\mu_k} \mathbb{E}_{w_k} \left[ g_k(x_k, \mu_k(x_k), w_k) + \min_{\pi_{k+1}} \mathbb{E}_{w_{k+1},\ldots,w_{N-1}} \left[ g_N(x_N) + \sum_{i=k+1}^{N-1} g_i(\cdots) \right] \right] \\
&\quad \text{(Principle of Optimality)} \\
&= \min_{\mu_k} \mathbb{E}_{w_k} \left[ g_k(x_k, \mu_k(x_k), w_k) + J_{k+1}^\star\big(f_k(x_k, \mu_k(x_k), w_k)\big) \right] \\
&= \min_{\mu_k} \mathbb{E}_{w_k} \left[ g_k(x_k, \mu_k(x_k), w_k) + J_{k+1}\big(f_k(x_k, \mu_k(x_k), w_k)\big) \right] \\
&\quad \text{(by Induction Hypothesis)} \\
&= \min_{u_k} \mathbb{E}_{w_k} \left[ g_k(x_k, u_k, w_k) + J_{k+1}\big(f_k(x_k, u_k, w_k)\big) \right] \\
&= J_k(x_k)
\end{align*}
$$

DP Algorithm - Baking Example

Problem Description

Dr. Yang bakes a cake using two ovens with dynamics and cost:

  • State equation: $x_{k+1} = (1-a) \cdot x_k + a \cdot u_k, \quad k=0,1$

  • Cost function: $r \cdot (x_2 - T)^2 + u_0^2 + u_1^2$

  • Parameters:

    • $x_k$: Cake temperature at the exit of oven $k$
    • $u_k$: Set temperature of oven $k$
    • $a \in (0,1)$: Heat transfer coefficient
    • $T$: Target temperature
    • $r > 0$: Penalty for terminal temperature deviation

DP Solution Process

  1. Terminal Time ($k=2$)

$$J_2(x_2) = r \cdot (x_2 - T)^2$$

  2. Time $k=1$

$$\begin{align*} J_1(x_1) &= \min_{u_1} \left[ u_1^2 + J_2(x_2) \right] \\ &= \min_{u_1} \left[ u_1^2 + r \cdot \big((1-a)x_1 + a u_1 - T\big)^2 \right] \end{align*}$$

Optimization:

  • Set the derivative w.r.t. $u_1$ to zero:

    $$0 = 2u_1 + 2ra \big((1-a)x_1 + a u_1 - T\big)$$

  • Optimal control:

    $$u_1^* = \frac{ra \big(T - (1-a)x_1\big)}{1 + ra^2}$$

  • Value function:

    $$J_1(x_1) = \frac{r \big((1-a)x_1 - T\big)^2}{1 + ra^2}$$

  3. Initial Time ($k=0$)

$$\begin{align*} J_0(x_0) &= \min_{u_0} \left[ u_0^2 + J_1(x_1) \right] \\ &= \min_{u_0} \left[ u_0^2 + \frac{r \big((1-a)^2 x_0 + (1-a)a u_0 - T\big)^2}{1 + ra^2} \right] \end{align*}$$

Optimization:

  • Optimal control:

    $$u_0^* = \frac{r(1-a)a \big(T - (1-a)^2 x_0\big)}{1 + ra^2 \big(1 + (1-a)^2\big)}$$

  • Value function:

    $$J_0(x_0) = \frac{r \big((1-a)^2 x_0 - T\big)^2}{1 + ra^2 \big(1 + (1-a)^2\big)}$$

Key Insight:
This LQ problem admits an analytical solution, but most DP problems require numerical methods.
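The closed forms above can be sanity-checked numerically; in this sketch the parameter values $a = 0.5$, $r = 10$, $T = 100$ and the sample state $x_1 = 40$ are assumed purely for illustration:

```python
# Check that the closed-form u1* minimizes the stage-1 cost and that
# plugging it in reproduces the closed-form J1(x1).
a, r, T, x1 = 0.5, 10.0, 100.0, 40.0

def cost(u):
    # u^2 + r((1-a)x1 + a*u - T)^2, the quantity minimized at k = 1
    return u ** 2 + r * ((1 - a) * x1 + a * u - T) ** 2

u1_star = r * a * (T - (1 - a) * x1) / (1 + r * a * a)
J1_closed = r * ((1 - a) * x1 - T) ** 2 / (1 + r * a * a)

print(u1_star, J1_closed)
print(abs(cost(u1_star) - J1_closed) < 1e-6)                 # closed forms agree
print(all(cost(u1_star) <= cost(u1_star + d) for d in (-1, -0.1, 0.1, 1)))
```

Since the cost is a strictly convex quadratic in $u_1$, the perturbation check in the last line is a genuine (if crude) optimality test.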

DP Algorithm - Inventory Control Example

Problem Setup

  • Demand Handling: Unmet demand is lost

  • State Transition: $x_{k+1} = \max\{0, x_k + u_k - w_k\}$

  • Control Constraint: $x_k + u_k \leq 2$

  • Cost Structure:

    • Ordering cost: $u_k$ (unit cost = 1)
    • Holding cost: $(x_k + u_k - w_k)^2$
    • Terminal cost: $g_N(x_N) = 0$
  • Demand Distribution:
    $P(w_k=0)=0.1,\ P(w_k=1)=0.7,\ P(w_k=2)=0.2$

  • Initial State: $x_0=0,\ N=3$

Backward Solution Process

Note that because the demand $w_k$ is random, each $J_k$ computed below is an expected cost.

  1. Terminal Time ($k=3$)

$$J_3(x_3) = 0 \quad (\forall x_3)$$

  2. Time $k=2$ (Penultimate Stage)

State Space: $x_2 \in \{0,1,2\}$

  • $x_2=0$:

    $$\begin{align*} J_2(0) &= \min_{u_2 \in \{0,1,2\}} \mathbb{E}_{w_2} \left[ u_2 + (0 + u_2 - w_2)^2 \right] \\ &= \min \begin{cases} u_2=0: & 0 + 0.1(0)^2 + 0.7(-1)^2 + 0.2(-2)^2 = 1.5 \\ u_2=1: & 1 + 0.1(1)^2 + 0.7(0)^2 + 0.2(-1)^2 = 1.3 \\ u_2=2: & 2 + 0.1(4) + 0.7(1) + 0.2(0) = 3.1 \end{cases} \\ &= 1.3 \quad (u_2^*=1) \end{align*}$$

  • $x_2=1$:

    $$\begin{align*} J_2(1) &= \min_{u_2 \in \{0,1\}} \mathbb{E}_{w_2} \left[ u_2 + (1 + u_2 - w_2)^2 \right] \\ &= \min \begin{cases} u_2=0: & 0 + 0.1(1)^2 + 0.7(0)^2 + 0.2(-1)^2 = 0.3 \\ u_2=1: & 1 + 0.1(4) + 0.7(1) + 0.2(0) = 2.1 \end{cases} \\ &= 0.3 \quad (u_2^*=0) \end{align*}$$

  • $x_2=2$:

    $$\begin{align*} J_2(2) &= \min_{u_2 \in \{0\}} \mathbb{E}_{w_2} \left[ u_2 + (2 + u_2 - w_2)^2 \right] \\ &= 0 + 0.1(4) + 0.7(1) + 0.2(0) = 1.1 \quad (u_2^*=0) \end{align*}$$

Value Function Summary:

| $x_2$ | $J_2(x_2)$ | $\mu_2^*(x_2)$ |
|-------|------------|----------------|
| 0     | 1.3        | 1              |
| 1     | 0.3        | 0              |
| 2     | 1.1        | 0              |

  3. Continue to $k=1, 0$

Similar computations give $J_1(x_1)$ and $J_0(x_0)$, using the state transition:

$$x_{k+1} = \max\{0, x_k + u_k - w_k\}$$
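The remaining two stages follow the same pattern, so the whole backward recursion for this instance fits in a few lines; a sketch (everything below is taken from the problem setup above, with the $k=2$ line reproducing the table):

```python
# Backward recursion: states {0,1,2}, constraint x + u <= 2,
# demand P(w=0)=0.1, P(w=1)=0.7, P(w=2)=0.2,
# stage cost u + (x+u-w)^2, zero terminal cost.
P = {0: 0.1, 1: 0.7, 2: 0.2}
N, states = 3, [0, 1, 2]

J = {x: 0.0 for x in states}              # J_3 = 0
for k in range(N - 1, -1, -1):
    Jk, muk = {}, {}
    for x in states:
        best = None
        for u in range(0, 3 - x):         # enforce x + u <= 2
            val = sum(p * (u + (x + u - w) ** 2 + J[max(0, x + u - w)])
                      for w, p in P.items())
            if best is None or val < best[0]:
                best = (val, u)
        Jk[x], muk[x] = best
    J, mu = Jk, muk
    print("k =", k, {x: round(J[x], 3) for x in states}, mu)
```

Carrying out the recursion gives $J_1(0)=2.5$, $J_1(1)=1.5$, $J_1(2)=1.68$, and finally $J_0(0)=3.7$ with $\mu_0^*(0)=1$: starting empty, it is optimal to order one unit.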

Computational Complexity

  • State Space: $|S_k| = 3$ (this example)

  • Control Space: $|U_k| \leq 3$ (this example)

  • Time Complexity: $O(N \cdot |S| \cdot |U| \cdot |W|)$

  • Practical Challenge: State space grows exponentially with dimensions (curse of dimensionality)

DP Algorithm - Finite-State Systems

System Description

Consider finite-state systems with:

  • Transition probability: $p_{ij}(u) = \mathbb{P}(x_{k+1} = j \mid x_k = i, u_k = u)$

  • Dynamics: $x_{k+1} = w_k$, where $w_k$ follows the distribution defined by $p_{ij}(u)$

Key Assumptions

  1. Stationarity:

    • Transition probabilities $p_{ij}(u)$ are time-invariant
    • Control constraint sets $U_k(i) = U(i)$ are constant
  2. Cost Structure:

    • Stage cost $g(i,u)$ is independent of the disturbance $w_k$

DP Algorithm

Recursive Formula:

$$J_k(i) = \min_{u \in U(i)} \left[ g(i,u) + \mathbb{E} \left[ J_{k+1}(w_k) \right] \right]$$

Explicit Form:

$$J_k(i) = \min_{u \in U(i)} \left[ g(i,u) + \sum_{j} p_{ij}(u) J_{k+1}(j) \right]$$
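In code, the explicit form is a minimum over actions of `G[u] + P[u] @ J`; a sketch with an assumed two-state, two-action instance (the transition matrices and stage costs below are invented numbers):

```python
import numpy as np

# P[u]: n x n transition matrix under control u; G[u]: stage costs g(., u).
P = {0: np.array([[0.9, 0.1], [0.4, 0.6]]),   # action 0 (assumed numbers)
     1: np.array([[0.8, 0.2], [0.9, 0.1]])}   # action 1 (assumed numbers)
G = {0: np.array([0.0, 4.0]),
     1: np.array([2.0, 3.0])}

N, n = 5, 2
J = np.zeros(n)                                # J_N = 0
for k in range(N - 1, -1, -1):
    Q = np.stack([G[u] + P[u] @ J for u in (0, 1)])   # Q[u, i]
    J, mu = Q.min(axis=0), Q.argmin(axis=0)

print(J, mu)   # J_0 and the stage-0 optimal action per state
```

Each stage is a handful of matrix-vector products, which is where the $O(N \cdot n^2 \cdot |U|)$ complexity comes from.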

Computational Properties

| Property     | Description                            |
|--------------|----------------------------------------|
| State Space  | Finite ($i \in \{1,2,\dots,n\}$)       |
| Action Space | Finite ($u \in U(i)$)                  |
| Complexity   | $O(N \cdot n^2 \cdot \lvert U \rvert)$ |
| Advantage    | Avoids the curse of dimensionality     |

Applications: Markov Decision Processes (MDP), queueing systems, network routing

State Augmentation and System Reformulation

1. Handling Time Lags

Problem: The state transition depends on past states and controls (e.g., $x_{k+1} = f_k(x_k, x_{k-1}, u_k, u_{k-1}, w_k)$)

Solution: State Augmentation
Define new state vector:

$$\tilde{x}_k = \begin{bmatrix} x_k \\ y_k \\ s_k \end{bmatrix} = \begin{bmatrix} x_k \\ x_{k-1} \\ u_{k-1} \end{bmatrix}$$

New state transition:

$$\tilde{x}_{k+1} = \begin{bmatrix} x_{k+1} \\ y_{k+1} \\ s_{k+1} \end{bmatrix} = \begin{bmatrix} f_k(x_k, y_k, u_k, s_k, w_k) \\ x_k \\ u_k \end{bmatrix} = \tilde{f}_k(\tilde{x}_k, u_k, w_k)$$

Key: Embed the historical state $x_{k-1}$ and control $u_{k-1}$ into the current state
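A quick way to convince yourself that the augmentation loses nothing is to simulate both forms and compare; in this sketch the dynamics `f` and the control and disturbance sequences are all invented for illustration:

```python
# Check that the augmented system reproduces the time-lag dynamics
# x_{k+1} = f_k(x_k, x_{k-1}, u_k, u_{k-1}, w_k).

def f(k, x, y, u, s, w):
    # hypothetical dynamics with a one-step lag in state and control
    return 0.5 * x + 0.2 * y + u - 0.3 * s + w

us = [1.0, -0.5, 0.25, 0.0]        # controls u_0..u_3 (assumed)
ws = [0.1, -0.2, 0.0, 0.3]         # disturbances w_0..w_3 (assumed)

# Direct simulation, carrying the history by hand.
x, x_prev, u_prev = 2.0, 0.0, 0.0
direct = []
for k in range(4):
    x_next = f(k, x, x_prev, us[k], u_prev, ws[k])
    x_prev, u_prev, x = x, us[k], x_next
    direct.append(x)

# Augmented simulation: state (x_k, y_k, s_k) = (x_k, x_{k-1}, u_{k-1}).
state = (2.0, 0.0, 0.0)
aug = []
for k in range(4):
    xk, yk, sk = state
    state = (f(k, xk, yk, us[k], sk, ws[k]), xk, us[k])
    aug.append(state[0])

print(direct == aug)
```

The price of augmentation is a larger state space, which matters for the DP tables even though the simulation is unchanged.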


2. Handling Correlated Disturbances

Problem: Correlated disturbances (e.g., $w_k = \lambda w_{k-1} + \xi_k$)

Solution: State Augmentation
Define new state vector:

$$\tilde{x}_k = \begin{bmatrix} x_k \\ y_k \end{bmatrix} \quad \text{where} \quad y_k = w_{k-1}$$

New state transition:

$$\tilde{x}_{k+1} = \begin{bmatrix} x_{k+1} \\ y_{k+1} \end{bmatrix} = \begin{bmatrix} f_k(x_k, u_k, \lambda y_k + \xi_k) \\ \lambda y_k + \xi_k \end{bmatrix} = \tilde{f}_k(\tilde{x}_k, u_k, \xi_k)$$

Generalized: For $w_k = C_k y_{k+1}$, $y_{k+1} = A_k y_k + \xi_k$:

$$\tilde{x}_{k+1} = \begin{bmatrix} f_k(x_k, u_k, C_k(A_k y_k + \xi_k)) \\ A_k y_k + \xi_k \end{bmatrix}$$


3. Incorporating Forecast Information

Problem: The disturbance distribution becomes known before the decision (e.g., weather affecting demand)

Modeling:

  • $w_k \sim Q_i$ for some $i \in \{1,\dots,m\}$

  • $i$ is known before deciding $u_k$

  • Forecast process: $y_{k+1} = \xi_k$ ($\xi_k$ independent, $P(\xi_k=i)=p_i$)

Augmented state:

$$\tilde{x}_k = \begin{bmatrix} x_k \\ y_k \end{bmatrix}$$

where $y_k$ encodes the next-period disturbance distribution

State transition:

$$\tilde{x}_{k+1} = \begin{bmatrix} x_{k+1} \\ y_{k+1} \end{bmatrix} = \begin{bmatrix} f_k(x_k, u_k, w_k) \\ \xi_k \end{bmatrix}$$

Decision advantage: $u_k$ can leverage $y_k$ (the future disturbance distribution)