#sdsc6007

Elements of Reinforcement Learning

Reinforcement learning includes the following five core elements:

  • Agent and Environment: The agent performs actions, and the environment returns observations and rewards.

  • Reward Signal: A scalar feedback signal indicating the agent’s performance at time t.

  • Policy: Describes the agent’s behavior, mapping from states to actions.

  • Value Function: Predicts the expected future reward (under a specific policy).

  • Model: Predicts the behavior/rewards of the environment.

Interaction between Agent and Environment

At each time step t:

  • The agent performs action $A_t$, receives observation $O_t$ and scalar reward $R_t$.

  • The environment receives action $A_t$, emits observation $O_{t+1}$ and scalar reward $R_{t+1}$.

History is the sequence of observations, actions, and rewards:

$$H_t = O_1, R_1, A_1, \ldots, A_{t-1}, O_t, R_t$$

Definition of State

State $S_t$ is Markov if and only if it contains all useful information from the history:

$$P(S_{t+1} \mid S_t) = P(S_{t+1} \mid S_1, \ldots, S_t)$$

Once the state is known, the history can be discarded.

Key Insight: The Markov property means “the future depends only on the present, not on the past,” which greatly simplifies problem modeling.

Exploration vs. Exploitation Trade-off

  • Exploration: Seeking more information about the environment.

  • Exploitation: Maximizing total reward based on current information.

  • There is a trade-off relationship (need to balance both).

Practical Examples:

  • Choosing a restaurant: Exploitation (go to favorite restaurant) vs. Exploration (try a new restaurant).
  • Online advertising: Exploitation (show relevant products) vs. Exploration (show different ads).
  • Course selection: Exploitation (choose the best teacher from before) vs. Exploration (try a new teacher).

Introduction to Markov Decision Processes

Basics and Markov Chains

Markov Chain/Markov Process describes a memoryless stochastic process (no rewards or actions), i.e., a sequence $S_1, S_2, \ldots$ with the Markov property.

Definition: A finite-state Markov chain is a tuple $\langle \mathcal{S}, P \rangle$

  • $\mathcal{S}$ is a finite set of states.

  • $P$ is the state transition probability matrix, where $p_{ss'} = P(S_{t+1} = s' \mid S_t = s)$.

$P \in \mathbb{R}^{S \times S}$ can be represented as:

$$P = \begin{bmatrix} p_{11} & p_{12} & \cdots & p_{1S} \\ p_{21} & p_{22} & \cdots & p_{2S} \\ \vdots & \vdots & \ddots & \vdots \\ p_{S1} & p_{S2} & \cdots & p_{SS} \end{bmatrix}$$

where each row of the matrix sums to 1.
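
To make the row-sum condition concrete, here is a minimal NumPy sketch (with a made-up 3-state transition matrix, not the lecture example) that checks each row sums to 1 and samples a trajectory from the chain:

```python
import numpy as np

# Hypothetical 3-state Markov chain (illustrative numbers only)
P = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.4, 0.3],
    [0.0, 0.5, 0.5],
])

# Each row is a probability distribution over next states, so it must sum to 1.
assert np.allclose(P.sum(axis=1), 1.0)

# Sample S_1, S_2, ... by repeatedly drawing the next state from row P[s].
rng = np.random.default_rng(0)
s = 0
trajectory = [s]
for _ in range(10):
    s = rng.choice(len(P), p=P[s])
    trajectory.append(s)
print(trajectory)
```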

Example

The transition matrix is as follows:

From \ To Lecture 1 Lecture 2 Lecture 3 Homework 1 Instagram Facebook Sleep
Lecture 1 0.7 0.3
Lecture 2 0.6 0.4
Lecture 3 0.5 0.25 0.25
Homework 1 1
Instagram 0.5 0.2 0.3
Facebook 0.5 0.5
Sleep 1

Markov Reward Process

Markov Reward Process = Markov Chain + Rewards (still no actions, or the actions follow a fixed policy)

Definition: An MRP is a tuple $\langle \mathcal{S}, P, r, \gamma \rangle$

  • $\mathcal{S}$ is a finite set of states.

  • $P$ is the state transition probability matrix.

  • $r$ is the reward function, $r_s = \mathbb{E}[R_{t+1} \mid S_t = s]$.

  • $\gamma$ is the discount factor, $\gamma \in [0, 1]$.


Return and Value Function

Return $G_t$ is the total discounted reward starting from time step $t$:

$$G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

State Value Function $v(s)$ is the expected return starting from state $s$:

$$v(s) = \mathbb{E}[G_t \mid S_t = s]$$
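
As a quick illustration of these two definitions, here is a minimal sketch (hypothetical reward sequences, $\gamma = 0.9$ chosen arbitrarily) that computes a discounted return and a Monte Carlo estimate of $v(s)$ by averaging returns of episodes starting in $s$:

```python
import numpy as np

def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ... (finite sequence)."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# 1 + 0.9*2 + 0.81*(-1) = 1.99
print(discounted_return([1.0, 2.0, -1.0], gamma=0.9))

# v(s) is an expectation, so it can be estimated by averaging sampled returns
# over episodes that start in state s (hypothetical reward sequences below).
episodes_from_s = [[1.0, 2.0, -1.0], [0.0, 2.0, 2.0]]
print(np.mean([discounted_return(ep, 0.9) for ep in episodes_from_s]))
```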

Detailed Explanation of the Bellman Equation for MRP

The Bellman equation for Markov Reward Processes (MRP) is a core equation in reinforcement learning, establishing the relationship between state values and subsequent state values.

1. Equation Derivation

Starting from the definition of the value function:

$$v(s) = \mathbb{E}[G_t \mid S_t = s] = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s]$$

Decomposing it into immediate reward and discounted future rewards:

$$v(s) = \mathbb{E}[R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \cdots) \mid S_t = s]$$

$$v(s) = \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_t = s]$$

Using the Markov property and linearity of expectation:

$$v(s) = \mathbb{E}[R_{t+1} \mid S_t = s] + \gamma \mathbb{E}[G_{t+1} \mid S_t = s]$$

$$v(s) = r_s + \gamma \sum_{s' \in \mathcal{S}} p_{ss'} \mathbb{E}[G_{t+1} \mid S_{t+1} = s']$$

$$v(s) = r_s + \gamma \sum_{s' \in \mathcal{S}} p_{ss'} v(s')$$

2. Parameter Explanation

  • $v(s)$: The value of state $s$, representing the expected cumulative reward starting from state $s$.

  • $r_s$: The expected immediate reward, $r_s = \mathbb{E}[R_{t+1} \mid S_t = s]$.

  • $\gamma$: Discount factor ($0 \leq \gamma \leq 1$)

    • γ = 0: Only consider immediate rewards.
    • γ → 1: Increasingly value future rewards.
    • Typically set between 0.9 and 0.99.
  • $p_{ss'}$: Probability of transitioning from state $s$ to state $s'$.

  • $\mathcal{S}$: The set of all possible states.

3. Matrix Form

Organizing the state values into a vector $v = [v(1), v(2), \ldots, v(S)]^T$ and the rewards into a vector $r = [r_1, r_2, \ldots, r_S]^T$, we get the matrix form:

$$v = r + \gamma P v$$

where P is the state transition probability matrix.

4. Analytical Solution

Through algebraic transformation:

$$v - \gamma P v = r$$

$$(I - \gamma P)v = r$$

$$v = (I - \gamma P)^{-1} r$$

Note: This analytical solution is only practical when the state space is small, as matrix inversion has complexity O(S³).
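
When $S$ is large, the fixed point can instead be approached by repeatedly applying the update $v \leftarrow r + \gamma P v$, which only requires matrix-vector products. A minimal sketch with a hypothetical 2-state MRP (illustrative numbers, not the lecture example):

```python
import numpy as np

# Hypothetical 2-state MRP (illustrative numbers)
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
r = np.array([0.0, 1.0])
gamma = 0.95

v = np.zeros(2)              # any initial guess works
for _ in range(1000):        # iterate v <- r + gamma * P v
    v = r + gamma * P @ v

# Compare against the analytical solution v = (I - gamma*P)^{-1} r
v_exact = np.linalg.solve(np.eye(2) - gamma * P, r)
print(v, v_exact)            # the two agree up to numerical tolerance
```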

Practical Example

Assume three states $\{A, B, C\}$, $\gamma = 0.9$, with reward function and transition probabilities as follows:

| State | Reward | To A | To B | To C |
| --- | --- | --- | --- | --- |
| A | 1 | 0.2 | 0.5 | 0.3 |
| B | 2 | 0.1 | 0.6 | 0.3 |
| C | -1 | 0.4 | 0.4 | 0.2 |

Set up the Bellman equation:

$$v(A) = 1 + 0.9(0.2v(A) + 0.5v(B) + 0.3v(C))$$

$$v(B) = 2 + 0.9(0.1v(A) + 0.6v(B) + 0.3v(C))$$

$$v(C) = -1 + 0.9(0.4v(A) + 0.4v(B) + 0.2v(C))$$

Convert to matrix form:

$$\begin{bmatrix} v(A) \\ v(B) \\ v(C) \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ -1 \end{bmatrix} + 0.9 \begin{bmatrix} 0.2 & 0.5 & 0.3 \\ 0.1 & 0.6 & 0.3 \\ 0.4 & 0.4 & 0.2 \end{bmatrix} \begin{bmatrix} v(A) \\ v(B) \\ v(C) \end{bmatrix}$$

Solve:

$$v = (I - 0.9P)^{-1} r$$

Solving this linear system gives the specific value of each state; a numerical check is sketched below.
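
A minimal numerical check of the example above (assuming NumPy; the printed values are approximate):

```python
import numpy as np

# Transition matrix and rewards from the table above (states ordered A, B, C)
P = np.array([[0.2, 0.5, 0.3],
              [0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2]])
r = np.array([1.0, 2.0, -1.0])
gamma = 0.9

# Solve (I - gamma*P) v = r
v = np.linalg.solve(np.eye(3) - gamma * P, r)
print(dict(zip("ABC", np.round(v, 2))))  # roughly v(A) ≈ 9.75, v(B) ≈ 10.84, v(C) ≈ 7.82
```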

Markov Decision Process

Markov Decision Process = Markov Reward Process + Actions

Definition: An MDP is a tuple $\langle \mathcal{S}, \mathcal{A}, P, r, \gamma \rangle$

  • $\mathcal{S}$ is a finite set of states.

  • $\mathcal{A}$ is a finite set of actions.

  • $P$ is the state transition kernel, $p_{sas'} = P(S_{t+1} = s' \mid S_t = s, A_t = a)$.

  • $r$ is the reward function, $r_{sa} = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$.

  • $\gamma$ is the discount factor, $\gamma \in [0, 1]$.

Policy

Policy $\pi$ is a distribution over actions given a state:

$$\pi(a \mid s) = P(A_t = a \mid S_t = s)$$

This is called a stochastic policy.
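
As an illustration, a stochastic policy for a small finite MDP can be stored as an $|\mathcal{S}| \times |\mathcal{A}|$ matrix of probabilities; a minimal sketch with hypothetical numbers:

```python
import numpy as np

# Hypothetical policy for 3 states and 2 actions: pi[s, a] = P(A_t = a | S_t = s)
pi = np.array([[0.8, 0.2],
               [0.5, 0.5],
               [0.0, 1.0]])
assert np.allclose(pi.sum(axis=1), 1.0)  # each row is a distribution over actions

# Acting with the policy: sample an action in state s
rng = np.random.default_rng(0)
s = 1
a = rng.choice(pi.shape[1], p=pi[s])
print(a)
```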

Optimality Conditions in MDP

Finite Horizon Case and Bellman Operator

Goal: Maximize $\mathbb{E}\left[\sum_{t=0}^{N-1} \gamma^t R_{t+1}\right]$

Using DP algorithms, we have:

$$J_{N-t}(s) = \max_{a \in \mathcal{A}} \left( \gamma^{N-t} r_{sa} + \sum_{s' \in \mathcal{S}} p_{sas'} \cdot J_{N-t+1}(s') \right)$$

$$J_N(s) = \gamma^N \times 0 = 0$$

Normalization: $v_t(s) = \frac{J_{N-t}(s)}{\gamma^{N-t}}$

$$v_{t+1}(s) = \max_{a \in \mathcal{A}} \left( r_{sa} + \gamma \sum_{s' \in \mathcal{S}} p_{sas'} \cdot v_t(s') \right)$$

$$v_0(s) = 0$$
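
A minimal sketch of this normalized recursion (finite-horizon value iteration), assuming a hypothetical 2-state, 2-action MDP stored as arrays P[s, a, s'] = p_{sas'} and r[s, a] = r_{sa}:

```python
import numpy as np

# Hypothetical MDP (illustrative numbers): P[s, a, s'], r[s, a]
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma, N = 0.9, 50

v = np.zeros(2)                              # v_0(s) = 0
for _ in range(N):
    # v_{t+1}(s) = max_a ( r_sa + gamma * sum_s' p_sas' * v_t(s') )
    v = np.max(r + gamma * P @ v, axis=1)
print(v)                                     # v_N(s), the normalized N-step value
```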

Bellman Operator

Define the Bellman operator $\mathcal{T}$:

$$(\mathcal{T}v)(s) = \max_{a \in \mathcal{A}} \left( r_{sa} + \gamma \sum_{s' \in \mathcal{S}} p_{sas'} \cdot v(s') \right)$$

Define $\mathcal{T}^k v(s) = \mathcal{T}(\mathcal{T}^{k-1} v)(s)$ (where $\mathcal{T}^0 v(s) = v(s)$).

We can solve the finite horizon problem with $\mathcal{T}^N v_0 = v_N = J_0$.

For a static deterministic policy $\pi$, define the Bellman operator under the policy, $\mathcal{T}_\pi$:

$$(\mathcal{T}_\pi v)(s) = r_{s\pi(s)} + \gamma \sum_{s' \in \mathcal{S}} p_{s\pi(s)s'} \cdot v(s')$$
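
Both operators can be written as short functions; a minimal sketch, reusing the hypothetical P[s, a, s'] / r[s, a] array layout assumed above (a deterministic policy is stored as an array policy[s] = π(s)):

```python
import numpy as np

def bellman_operator(v, P, r, gamma):
    """(T v)(s) = max_a ( r_sa + gamma * sum_s' p_sas' * v(s') )."""
    return np.max(r + gamma * P @ v, axis=1)

def bellman_policy_operator(v, P, r, gamma, policy):
    """(T_pi v)(s) = r_{s,pi(s)} + gamma * sum_s' p_{s,pi(s),s'} * v(s')."""
    s = np.arange(len(v))
    return r[s, policy] + gamma * P[s, policy] @ v

# Hypothetical 2-state, 2-action MDP (same illustrative numbers as the sketch above)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])
v0 = np.zeros(2)
print(bellman_operator(v0, P, r, 0.9))
print(bellman_policy_operator(v0, P, r, 0.9, policy=np.array([0, 1])))
```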

Properties of the Bellman Operator

Monotonicity Lemma

Consider $v, v' \in \mathbb{R}^S$, where $v(s) \leq v'(s)$, $\forall s \in \mathcal{S}$:

$$(\mathcal{T}^k v)(s) \leq (\mathcal{T}^k v')(s), \quad s \in \mathcal{S}, k = 1, 2, \ldots$$

Similarly for any static policy $\pi$.

If $v(s) \leq (\mathcal{T}v)(s)$, $\forall s \in \mathcal{S}$, then:

$$(\mathcal{T}^k v)(s) \leq (\mathcal{T}^{k+1} v)(s), \quad s \in \mathcal{S}, k = 1, 2, \ldots$$

Constant Shift Lemma

For any static policy $\pi$, scalar $b$, $v \in \mathbb{R}^S$:

$$(\mathcal{T}^k (v + b \cdot 1_S))(s) = (\mathcal{T}^k v)(s) + \gamma^k b, \quad s \in \mathcal{S}, k = 1, 2, \ldots$$

$$(\mathcal{T}_\pi^k (v + b \cdot 1_S))(s) = (\mathcal{T}_\pi^k v)(s) + \gamma^k b, \quad s \in \mathcal{S}, k = 1, 2, \ldots$$

where $1_S \in \mathbb{R}^S$ is the all-ones vector.

Proposition: Convergence of DP Algorithm

Assumptions

  1. Immediate rewards $R_t$ are bounded, i.e., $|R_t| \leq M$ for some constant $M$.

  2. $0 < \gamma < 1$

Proposition

For any initial guess $v_0 \in \mathbb{R}^S$, for any $s \in \mathcal{S}$:

$$v^\star(s) = \lim_{N \to \infty} (\mathcal{T}^N v_0)(s)$$

where $v^\star$ is the optimal value function, i.e.:

$$v^\star(s_0) = \max_\pi \lim_{K \to \infty} \mathbb{E} \left[ \sum_{t=0}^K \gamma^t R_{t+1} \mid S_0 = s_0 \right]$$

Similarly, for a fixed policy $\pi$:

$$v_\pi(s) = \lim_{N \to \infty} (\mathcal{T}_\pi^N v_0)(s)$$

In-depth Understanding: This proposition shows that no matter what initial value estimate we start with, by repeatedly applying the Bellman operator, we will eventually converge to the true optimal value function.

Proposition: Bellman Equation

Proposition

The optimal value function $v^\star$ satisfies, for all $s \in \mathcal{S}$:

$$v^\star(s) = \max_{a \in \mathcal{A}} \left( r_{sa} + \gamma \sum_{s' \in \mathcal{S}} p_{sas'} \cdot v^\star(s') \right)$$

or equivalently:

$$v^\star = \mathcal{T} v^\star$$

Moreover, $v^\star$ is the unique solution to this equation (thus, the solution to this equation must be the optimal value function).

Similarly, for any $v \in \mathbb{R}^S$ where $v \geq \mathcal{T}v$ (or $v \leq \mathcal{T}v$), we have $v \geq v^\star$ (or $v \leq v^\star$, respectively).

For a fixed policy $\pi$, the associated value function $v_\pi$ satisfies:

$$v_\pi(s) = r_{s\pi(s)} + \gamma \sum_{s' \in \mathcal{S}} p_{s\pi(s)s'} \cdot v_\pi(s')$$

or equivalently:

$$v_\pi = \mathcal{T}_\pi v_\pi$$

Proposition: Necessary and Sufficient Conditions for Optimality

Proposition

A static policy $\pi$ is optimal if and only if $\pi(s)$ achieves the maximum in the Bellman equation for each $s \in \mathcal{S}$, i.e.:

$$\mathcal{T} v^\star = \mathcal{T}_\pi v^\star$$

Implications

  1. First compute $v^\star$.

  2. For each $s \in \mathcal{S}$, choose:

$$a_s \in \arg\max_{a \in \mathcal{A}} \left( r_{sa} + \gamma \sum_{s' \in \mathcal{S}} p_{sas'} \cdot v^\star(s') \right)$$

  3. Construct a policy $\pi$ such that $\pi(s) = a_s$.

  4. By construction, $\mathcal{T} v^\star = \mathcal{T}_\pi v^\star$.

  5. $\pi$ is an optimal static deterministic policy.
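
A minimal sketch of steps 1 to 5 (hypothetical P[s, a, s'] and r[s, a]; $v^\star$ is approximated by applying $\mathcal{T}$ many times, as in the convergence proposition above):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers): P[s, a, s'], r[s, a]
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

# Step 1: approximate v* by iterating the Bellman operator T (value iteration)
v = np.zeros(2)
for _ in range(1000):
    v = np.max(r + gamma * P @ v, axis=1)

# Steps 2-3: for each s, pick a_s in argmax_a ( r_sa + gamma * sum_s' p_sas' * v*(s') )
q = r + gamma * P @ v        # q[s, a]
pi = np.argmax(q, axis=1)    # greedy deterministic policy

# Steps 4-5: by construction T v* = T_pi v*, so pi is an optimal static policy
print(v, pi)
```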

Uniqueness and Stochastic Policies

  • The optimal policy is not unique.

  • For any $s \in \mathcal{S}$, there may be multiple $a_s$ satisfying the above condition.

  • Randomly choosing any action from $\{a_{s1}, a_{s2}, \ldots, a_{sK}\}$ in state $s$ is optimal.

  • Stochastic policy: $\pi(a \mid s) = P(A_t = a \mid S_t = s)$.

  • Optimal stochastic policy: any $\pi$ that places positive probability only on actions in $\{a_{s1}, a_{s2}, \ldots, a_{sK}\}$ (i.e., $\pi(a \mid s) = 0$ for all other actions) is optimal.

Key Conclusion: In the MDP setting, the best deterministic policy performs as well as the best stochastic policy; stochastic policies do not provide additional advantage.