#sdsc6007

Elements of Reinforcement Learning

Reinforcement learning includes the following five core elements:

  • Agent and Environment: The agent performs actions, and the environment returns observations and rewards.

  • Reward Signal: A scalar feedback signal indicating the agent’s performance at time t.

  • Policy: Describes the agent’s behavior, mapping from states to actions.

  • Value Function: Predicts the expected future reward (under a specific policy).

  • Model: Predicts the behavior/rewards of the environment.

Interaction between Agent and Environment

At each time step t:

  • The agent performs action $A_t$, receives observation $O_t$ and scalar reward $R_t$.

  • The environment receives action $A_t$, emits observation $O_{t+1}$ and scalar reward $R_{t+1}$.

History is the sequence of observations, actions, and rewards:

$$H_t = O_1, R_1, A_1, \ldots, A_{t-1}, O_t, R_t$$

Definition of State

State $S_t$ is Markov if and only if it contains all useful information from the history:

$$P(S_{t+1} \mid S_t) = P(S_{t+1} \mid S_1, \ldots, S_t)$$

Once the state is known, the history can be discarded.

Key Insight: The Markov property means “the future depends only on the present, not on the past,” which greatly simplifies problem modeling.

Exploration vs. Exploitation Trade-off

  • Exploration: Seeking more information about the environment.

  • Exploitation: Maximizing total reward based on current information.

  • There is a trade-off relationship (need to balance both).

Practical Examples:

  • Choosing a restaurant: Exploitation (go to favorite restaurant) vs. Exploration (try a new restaurant).
  • Online advertising: Exploitation (show relevant products) vs. Exploration (show different ads).
  • Course selection: Exploitation (choose the best teacher from before) vs. Exploration (try a new teacher).

Introduction to Markov Decision Processes

Basics and Markov Chains

Markov Chain/Markov Process describes a memoryless stochastic process (no rewards or actions), i.e., a sequence $S_1, S_2, \ldots$ with the Markov property.

Definition: A finite-state Markov chain is a tuple $\langle \mathcal{S}, P \rangle$

  • $\mathcal{S}$ is a finite set of states.

  • $P$ is the state transition probability matrix, where $p_{ss'} = P(S_{t+1} = s' \mid S_t = s)$.

$P \in \mathbb{R}^{S \times S}$ can be represented as:

$$P = \begin{bmatrix} p_{11} & p_{12} & \cdots & p_{1S} \\ p_{21} & p_{22} & \cdots & p_{2S} \\ \vdots & \vdots & \ddots & \vdots \\ p_{S1} & p_{S2} & \cdots & p_{SS} \end{bmatrix}$$

where each row of the matrix sums to 1.
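
To make the row-sum condition concrete, here is a minimal NumPy sketch (with a made-up 3-state transition matrix, not the lecture example) that checks each row sums to 1 and samples a trajectory from the chain:

```python
import numpy as np

# Hypothetical 3-state Markov chain (illustrative numbers only)
P = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.4, 0.3],
    [0.0, 0.5, 0.5],
])

# Each row is a probability distribution over next states, so it must sum to 1.
assert np.allclose(P.sum(axis=1), 1.0)

# Sample S_1, S_2, ... by repeatedly drawing the next state from row P[s].
rng = np.random.default_rng(0)
s = 0
trajectory = [s]
for _ in range(10):
    s = rng.choice(len(P), p=P[s])
    trajectory.append(s)
print(trajectory)
```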

Example

The transition matrix is as follows:

From \ To Lecture 1 Lecture 2 Lecture 3 Homework 1 Instagram Facebook Sleep
Lecture 1 0.7 0.3
Lecture 2 0.6 0.4
Lecture 3 0.5 0.25 0.25
Homework 1 1
Instagram 0.5 0.2 0.3
Facebook 0.5 0.5
Sleep 1

Markov Reward Process

Markov Reward Process = Markov Chain + Rewards (still no actions, or the actions follow a fixed policy)

Definition: An MRP is a tuple $\langle \mathcal{S}, P, r, \gamma \rangle$

  • $\mathcal{S}$ is a finite set of states.

  • $P$ is the state transition probability matrix.

  • $r$ is the reward function, $r_s = \mathbb{E}[R_{t+1} \mid S_t = s]$.

  • $\gamma$ is the discount factor, $\gamma \in [0, 1]$.


Return and Value Function

Return $G_t$ is the total discounted reward starting from time step $t$:

$$G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

State Value Function $v(s)$ is the expected return starting from state $s$:

$$v(s) = \mathbb{E}[G_t \mid S_t = s]$$
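
As a quick illustration of these two definitions, here is a minimal sketch (hypothetical reward sequences, $\gamma = 0.9$ chosen arbitrarily) that computes a discounted return and a Monte Carlo estimate of $v(s)$ by averaging returns of episodes starting in $s$:

```python
import numpy as np

def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ... (finite sequence)."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# 1 + 0.9*2 + 0.81*(-1) = 1.99
print(discounted_return([1.0, 2.0, -1.0], gamma=0.9))

# v(s) is an expectation, so it can be estimated by averaging sampled returns
# over episodes that start in state s (hypothetical reward sequences below).
episodes_from_s = [[1.0, 2.0, -1.0], [0.0, 2.0, 2.0]]
print(np.mean([discounted_return(ep, 0.9) for ep in episodes_from_s]))
```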

Detailed Explanation of the Bellman Equation for MRP

The Bellman equation for Markov Reward Processes (MRP) is a core equation in reinforcement learning, establishing the relationship between state values and subsequent state values.

1. Equation Derivation

Starting from the definition of the value function:

$$v(s) = \mathbb{E}[G_t \mid S_t = s] = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s]$$

Decomposing it into immediate reward and discounted future rewards:

$$v(s) = \mathbb{E}[R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \cdots) \mid S_t = s]$$

$$v(s) = \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_t = s]$$

Using the Markov property and linearity of expectation:

$$v(s) = \mathbb{E}[R_{t+1} \mid S_t = s] + \gamma \mathbb{E}[G_{t+1} \mid S_t = s]$$

$$v(s) = r_s + \gamma \sum_{s' \in \mathcal{S}} p_{ss'} \mathbb{E}[G_{t+1} \mid S_{t+1} = s']$$

$$v(s) = r_s + \gamma \sum_{s' \in \mathcal{S}} p_{ss'} v(s')$$

2. Parameter Explanation

  • $v(s)$: The value of state $s$, representing the expected cumulative reward starting from state $s$.

  • $r_s$: The expected immediate reward, $r_s = \mathbb{E}[R_{t+1} \mid S_t = s]$.

  • $\gamma$: Discount factor ($0 \leq \gamma \leq 1$)

    • γ = 0: Only consider immediate rewards.
    • γ → 1: Increasingly value future rewards.
    • Typically set between 0.9 and 0.99.
  • $p_{ss'}$: Probability of transitioning from state $s$ to state $s'$.

  • $\mathcal{S}$: The set of all possible states.

3. Matrix Form

Organizing the state values into a vector $v = [v(1), v(2), \ldots, v(S)]^T$ and the rewards into a vector $r = [r_1, r_2, \ldots, r_S]^T$, we get the matrix form:

$$v = r + \gamma P v$$

where P is the state transition probability matrix.

4. Analytical Solution

Through algebraic transformation:

$$v - \gamma P v = r$$

$$(I - \gamma P)v = r$$

$$v = (I - \gamma P)^{-1} r$$

Note: This analytical solution is only practical when the state space is small, as matrix inversion has complexity O(S³).
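
When $S$ is large, the fixed point can instead be approached by repeatedly applying the update $v \leftarrow r + \gamma P v$, which only requires matrix-vector products. A minimal sketch with a hypothetical 2-state MRP (illustrative numbers, not the lecture example):

```python
import numpy as np

# Hypothetical 2-state MRP (illustrative numbers)
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
r = np.array([0.0, 1.0])
gamma = 0.95

v = np.zeros(2)              # any initial guess works
for _ in range(1000):        # iterate v <- r + gamma * P v
    v = r + gamma * P @ v

# Compare against the analytical solution v = (I - gamma*P)^{-1} r
v_exact = np.linalg.solve(np.eye(2) - gamma * P, r)
print(v, v_exact)            # the two agree up to numerical tolerance
```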

Practical Example

Assume three states $\{A, B, C\}$, $\gamma = 0.9$, with reward function and transition probabilities as follows:

| State | Reward | To A | To B | To C |
| --- | --- | --- | --- | --- |
| A | 1 | 0.2 | 0.5 | 0.3 |
| B | 2 | 0.1 | 0.6 | 0.3 |
| C | -1 | 0.4 | 0.4 | 0.2 |

Set up the Bellman equation:

$$v(A) = 1 + 0.9(0.2v(A) + 0.5v(B) + 0.3v(C))$$

$$v(B) = 2 + 0.9(0.1v(A) + 0.6v(B) + 0.3v(C))$$

$$v(C) = -1 + 0.9(0.4v(A) + 0.4v(B) + 0.2v(C))$$

Convert to matrix form:

$$\begin{bmatrix} v(A) \\ v(B) \\ v(C) \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ -1 \end{bmatrix} + 0.9 \begin{bmatrix} 0.2 & 0.5 & 0.3 \\ 0.1 & 0.6 & 0.3 \\ 0.4 & 0.4 & 0.2 \end{bmatrix} \begin{bmatrix} v(A) \\ v(B) \\ v(C) \end{bmatrix}$$

Solve:

$$v = (I - 0.9P)^{-1} r$$

Solving this linear system gives the specific value of each state; a numerical check is sketched below.
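
A minimal numerical check of the example above (assuming NumPy; the printed values are approximate):

```python
import numpy as np

# Transition matrix and rewards from the table above (states ordered A, B, C)
P = np.array([[0.2, 0.5, 0.3],
              [0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2]])
r = np.array([1.0, 2.0, -1.0])
gamma = 0.9

# Solve (I - gamma*P) v = r
v = np.linalg.solve(np.eye(3) - gamma * P, r)
print(dict(zip("ABC", np.round(v, 2))))  # roughly v(A) ≈ 9.75, v(B) ≈ 10.84, v(C) ≈ 7.82
```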

Markov Decision Process

Markov Decision Process = Markov Reward Process + Actions

Definition: An MDP is a tuple $\langle \mathcal{S}, \mathcal{A}, P, r, \gamma \rangle$

  • $\mathcal{S}$ is a finite set of states.

  • $\mathcal{A}$ is a finite set of actions.

  • $P$ is the state transition kernel, $p_{sas'} = P(S_{t+1} = s' \mid S_t = s, A_t = a)$.

  • $r$ is the reward function, $r_{sa} = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$.

  • $\gamma$ is the discount factor, $\gamma \in [0, 1]$.

Policy

Policy $\pi$ is a distribution over actions given a state:

$$\pi(a \mid s) = P(A_t = a \mid S_t = s)$$

This is called a stochastic policy.
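
As an illustration, a stochastic policy for a small finite MDP can be stored as an $|\mathcal{S}| \times |\mathcal{A}|$ matrix of probabilities; a minimal sketch with hypothetical numbers:

```python
import numpy as np

# Hypothetical policy for 3 states and 2 actions: pi[s, a] = P(A_t = a | S_t = s)
pi = np.array([[0.8, 0.2],
               [0.5, 0.5],
               [0.0, 1.0]])
assert np.allclose(pi.sum(axis=1), 1.0)  # each row is a distribution over actions

# Acting with the policy: sample an action in state s
rng = np.random.default_rng(0)
s = 1
a = rng.choice(pi.shape[1], p=pi[s])
print(a)
```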

Optimality Conditions in MDP

Finite Horizon Case and Bellman Operator

Goal: Maximize $\mathbb{E}\left[\sum_{t=0}^{N-1} \gamma^t R_{t+1}\right]$

Using DP algorithms, we have:

$$J_{N-t}(s) = \max_{a \in \mathcal{A}} \left( \gamma^{N-t} r_{sa} + \sum_{s' \in \mathcal{S}} p_{sas'} \cdot J_{N-t+1}(s') \right)$$

$$J_N(s) = \gamma^N \times 0 = 0$$

Normalization: $v_t(s) = \frac{J_{N-t}(s)}{\gamma^{N-t}}$

$$v_{t+1}(s) = \max_{a \in \mathcal{A}} \left( r_{sa} + \gamma \sum_{s' \in \mathcal{S}} p_{sas'} \cdot v_t(s') \right)$$

$$v_0(s) = 0$$
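
A minimal sketch of this normalized recursion (finite-horizon value iteration), assuming a hypothetical 2-state, 2-action MDP stored as arrays P[s, a, s'] = p_{sas'} and r[s, a] = r_{sa}:

```python
import numpy as np

# Hypothetical MDP (illustrative numbers): P[s, a, s'], r[s, a]
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma, N = 0.9, 50

v = np.zeros(2)                              # v_0(s) = 0
for _ in range(N):
    # v_{t+1}(s) = max_a ( r_sa + gamma * sum_s' p_sas' * v_t(s') )
    v = np.max(r + gamma * P @ v, axis=1)
print(v)                                     # v_N(s), the normalized N-step value
```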

Bellman Operator

Define the Bellman operator $\mathcal{T}$:

$$(\mathcal{T}v)(s) = \max_{a \in \mathcal{A}} \left( r_{sa} + \gamma \sum_{s' \in \mathcal{S}} p_{sas'} \cdot v(s') \right)$$

Define $\mathcal{T}^k v(s) = \mathcal{T}(\mathcal{T}^{k-1} v)(s)$ (where $\mathcal{T}^0 v(s) = v(s)$).

We can solve the finite horizon problem with $\mathcal{T}^N v_0 = v_N = J_0$.

For a static deterministic policy $\pi$, define the Bellman operator under the policy, $\mathcal{T}_\pi$:

$$(\mathcal{T}_\pi v)(s) = r_{s\pi(s)} + \gamma \sum_{s' \in \mathcal{S}} p_{s\pi(s)s'} \cdot v(s')$$
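
Both operators can be written as short functions; a minimal sketch, reusing the hypothetical P[s, a, s'] / r[s, a] array layout assumed above (a deterministic policy is stored as an array policy[s] = π(s)):

```python
import numpy as np

def bellman_operator(v, P, r, gamma):
    """(T v)(s) = max_a ( r_sa + gamma * sum_s' p_sas' * v(s') )."""
    return np.max(r + gamma * P @ v, axis=1)

def bellman_policy_operator(v, P, r, gamma, policy):
    """(T_pi v)(s) = r_{s,pi(s)} + gamma * sum_s' p_{s,pi(s),s'} * v(s')."""
    s = np.arange(len(v))
    return r[s, policy] + gamma * P[s, policy] @ v

# Hypothetical 2-state, 2-action MDP (same illustrative numbers as the sketch above)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])
v0 = np.zeros(2)
print(bellman_operator(v0, P, r, 0.9))
print(bellman_policy_operator(v0, P, r, 0.9, policy=np.array([0, 1])))
```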

Properties of the Bellman Operator

Monotonicity Lemma

Consider $v, v' \in \mathbb{R}^S$, where $v(s) \leq v'(s)$, $\forall s \in \mathcal{S}$:

$$(\mathcal{T}^k v)(s) \leq (\mathcal{T}^k v')(s), \quad s \in \mathcal{S}, k = 1, 2, \ldots$$

Similarly for any static policy $\pi$.

If $v(s) \leq (\mathcal{T}v)(s)$, $\forall s \in \mathcal{S}$, then:

$$(\mathcal{T}^k v)(s) \leq (\mathcal{T}^{k+1} v)(s), \quad s \in \mathcal{S}, k = 1, 2, \ldots$$

Constant Shift Lemma

For any static policy $\pi$, scalar $b$, $v \in \mathbb{R}^S$:

$$(\mathcal{T}^k (v + b \cdot 1_S))(s) = (\mathcal{T}^k v)(s) + \gamma^k b, \quad s \in \mathcal{S}, k = 1, 2, \ldots$$

$$(\mathcal{T}_\pi^k (v + b \cdot 1_S))(s) = (\mathcal{T}_\pi^k v)(s) + \gamma^k b, \quad s \in \mathcal{S}, k = 1, 2, \ldots$$

where $1_S \in \mathbb{R}^S$ is the all-ones vector.

Proposition: Convergence of DP Algorithm

Assumptions

  1. Immediate rewards $R_t$ are bounded, i.e., $|R_t| \leq M$ for some constant $M$.

  2. $0 < \gamma < 1$

Proposition

For any initial guess $v_0 \in \mathbb{R}^S$, for any $s \in \mathcal{S}$:

$$v^\star(s) = \lim_{N \to \infty} (\mathcal{T}^N v_0)(s)$$

where $v^\star$ is the optimal value function, i.e.:

$$v^\star(s_0) = \max_\pi \lim_{K \to \infty} \mathbb{E} \left[ \sum_{t=0}^K \gamma^t R_{t+1} \mid S_0 = s_0 \right]$$

Similarly, for a fixed policy $\pi$:

$$v_\pi(s) = \lim_{N \to \infty} (\mathcal{T}_\pi^N v_0)(s)$$

In-depth Understanding: This proposition shows that no matter what initial value estimate we start with, by repeatedly applying the Bellman operator, we will eventually converge to the true optimal value function.

Proposition: Bellman Equation

Proposition

The optimal value function $v^\star$ satisfies, for all $s \in \mathcal{S}$:

$$v^\star(s) = \max_{a \in \mathcal{A}} \left( r_{sa} + \gamma \sum_{s' \in \mathcal{S}} p_{sas'} \cdot v^\star(s') \right)$$

or equivalently:

$$v^\star = \mathcal{T} v^\star$$

Moreover, $v^\star$ is the unique solution to this equation (thus, the solution to this equation must be the optimal value function).

Similarly, for any $v \in \mathbb{R}^S$ where $v \geq \mathcal{T}v$ (or $v \leq \mathcal{T}v$), we have $v \geq v^\star$ (or $v \leq v^\star$, respectively).

For a fixed policy $\pi$, the associated value function $v_\pi$ satisfies:

$$v_\pi(s) = r_{s\pi(s)} + \gamma \sum_{s' \in \mathcal{S}} p_{s\pi(s)s'} \cdot v_\pi(s')$$

or equivalently:

$$v_\pi = \mathcal{T}_\pi v_\pi$$

Proposition: Necessary and Sufficient Conditions for Optimality

Proposition

A static policy $\pi$ is optimal if and only if $\pi(s)$ achieves the maximum in the Bellman equation for each $s \in \mathcal{S}$, i.e.:

$$\mathcal{T} v^\star = \mathcal{T}_\pi v^\star$$

Implications

  1. First compute $v^\star$.

  2. For each $s \in \mathcal{S}$, choose:

$$a_s \in \arg\max_{a \in \mathcal{A}} \left( r_{sa} + \gamma \sum_{s' \in \mathcal{S}} p_{sas'} \cdot v^\star(s') \right)$$

  3. Construct a policy $\pi$ such that $\pi(s) = a_s$.

  4. By construction, $\mathcal{T} v^\star = \mathcal{T}_\pi v^\star$.

  5. $\pi$ is an optimal static deterministic policy.
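
A minimal sketch of steps 1 to 5 (hypothetical P[s, a, s'] and r[s, a]; $v^\star$ is approximated by applying $\mathcal{T}$ many times, as in the convergence proposition above):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers): P[s, a, s'], r[s, a]
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

# Step 1: approximate v* by iterating the Bellman operator T (value iteration)
v = np.zeros(2)
for _ in range(1000):
    v = np.max(r + gamma * P @ v, axis=1)

# Steps 2-3: for each s, pick a_s in argmax_a ( r_sa + gamma * sum_s' p_sas' * v*(s') )
q = r + gamma * P @ v        # q[s, a]
pi = np.argmax(q, axis=1)    # greedy deterministic policy

# Steps 4-5: by construction T v* = T_pi v*, so pi is an optimal static policy
print(v, pi)
```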

Uniqueness and Stochastic Policies

  • The optimal policy is not unique.

  • For any $s \in \mathcal{S}$, there may be multiple $a_s$ satisfying the above condition.

  • Randomly choosing any action from $\{a_{s1}, a_{s2}, \ldots, a_{sK}\}$ in state $s$ is optimal.

  • Stochastic policy: $\pi(a \mid s) = P(A_t = a \mid S_t = s)$.

  • Optimal stochastic policy: any $\pi$ that places positive probability only on actions in $\{a_{s1}, a_{s2}, \ldots, a_{sK}\}$ (i.e., $\pi(a \mid s) = 0$ for all other actions) is optimal.

Key Conclusion: In the MDP setting, the best deterministic policy performs as well as the best stochastic policy; stochastic policies do not provide additional advantage.