Return $G_t$ is the total discounted reward starting from time step $t$:
$$G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
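As a quick illustration with made-up numbers: if $\gamma = 0.9$ and the rewards are $R_{t+1} = 1$, $R_{t+2} = 2$, $R_{t+3} = 3$ with zero reward afterwards, then $G_t = 1 + 0.9 \cdot 2 + 0.9^2 \cdot 3 = 5.23$.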
State Value Function $v(s)$ is the expected return starting from state $s$:
$$v(s) = \mathbb{E}[G_t \mid S_t = s]$$
Detailed Explanation of the Bellman Equation for MRP
The Bellman equation for Markov Reward Processes (MRP) is a core equation in reinforcement learning, establishing the relationship between state values and subsequent state values.
1. Equation Derivation
Starting from the definition of the value function:
$$v(s) = \mathbb{E}[G_t \mid S_t = s] = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s]$$
Decomposing it into immediate reward and discounted future rewards:
$$v(s) = \mathbb{E}[R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \cdots) \mid S_t = s]$$
$$v(s) = \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_t = s]$$
Using the Markov property and linearity of expectation:
$$v(s) = \mathbb{E}[R_{t+1} \mid S_t = s] + \gamma \, \mathbb{E}[G_{t+1} \mid S_t = s]$$
$$v(s) = r_s + \gamma \sum_{s' \in S} p_{ss'} \, \mathbb{E}[G_{t+1} \mid S_{t+1} = s']$$
$$v(s) = r_s + \gamma \sum_{s' \in S} p_{ss'} \, v(s')$$
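One way to see this equation in action is to solve it by fixed-point iteration, repeatedly applying the update $v \leftarrow r + \gamma P v$. The two-state MRP below is a made-up example; its rewards and transition probabilities are not from the text.

```python
import numpy as np

# Hypothetical 2-state MRP: expected rewards r_s and transition probabilities p_{ss'}.
r = np.array([1.0, 0.0])
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])              # P[s, s'] = p_{ss'}
gamma = 0.9

v = np.zeros(2)                         # arbitrary initial guess
for _ in range(1000):
    v_new = r + gamma * P @ v           # Bellman backup: v <- r + gamma * P v
    delta = np.max(np.abs(v_new - v))
    v = v_new
    if delta < 1e-10:                   # stop once the fixed point is reached
        break

print(v)                                # approximate solution of v = r + gamma * P v
```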
2. Parameter Explanation
$v(s)$: The value of state $s$, representing the expected cumulative reward starting from state $s$.
$r_s$: The expected immediate reward, $r_s = \mathbb{E}[R_{t+1} \mid S_t = s]$.
$\gamma$: Discount factor ($0 \le \gamma \le 1$).
$\gamma = 0$: Only immediate rewards are considered.
$\gamma \to 1$: Future rewards are weighted increasingly heavily.
Typically set between 0.9 and 0.99.
$p_{ss'}$: Probability of transitioning from state $s$ to state $s'$.
$S$: The set of all possible states.
3. Matrix Form
Organizing the values of all states into a vector $v = [v(1), v(2), \ldots, v(|S|)]^{\mathsf{T}}$,
and the rewards into a vector $r = [r_1, r_2, \ldots, r_{|S|}]^{\mathsf{T}}$,
we get the matrix form:
$$v = r + \gamma P v$$
where $P$ is the state transition probability matrix.
4. Analytical Solution
Through algebraic transformation:
$$v - \gamma P v = r$$
$$(I - \gamma P) v = r$$
$$v = (I - \gamma P)^{-1} r$$
Note: This analytical solution is only practical when the state space is small, as matrix inversion has complexity $O(|S|^3)$.
Practical Example
Assume three states {A, B, C}, $\gamma = 0.9$, and a given reward function and transition probability matrix.
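Since the original reward and transition table is not reproduced here, the numbers below are invented purely to illustrate the analytical solution $v = (I - \gamma P)^{-1} r$ for a three-state MRP:

```python
import numpy as np

# Hypothetical data for the three states {A, B, C}; these rewards and transition
# probabilities are assumptions made for illustration only.
gamma = 0.9
r = np.array([5.0, -1.0, 0.0])          # r = [r_A, r_B, r_C]
P = np.array([[0.5, 0.5, 0.0],          # P[s, s'] = p_{ss'}
              [0.2, 0.3, 0.5],
              [0.0, 0.1, 0.9]])

# Analytical solution of v = r + gamma * P v.
# Solving the linear system is cheaper and more stable than forming the inverse.
v = np.linalg.solve(np.eye(3) - gamma * P, r)
print(dict(zip("ABC", v.round(3))))
```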
The finite-horizon problem can be solved by applying the Bellman operator (defined below) $N$ times: $T^N v_0 = v_N = J_0$.
For a static deterministic policy $\pi$, define the Bellman operator under policy $\pi$, $T_\pi$:
$$(T_\pi v)(s) = r_s^{\pi(s)} + \gamma \sum_{s' \in S} p_{s s'}^{\pi(s)} \, v(s')$$
The Bellman (optimality) operator $T$, used throughout what follows, takes the best action in each state:
$$(T v)(s) = \max_{a \in A} \left( r_s^a + \gamma \sum_{s' \in S} p_{s s'}^{a} \, v(s') \right)$$
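For concreteness, here is one way these operators might be written in NumPy. This is a sketch using my own conventions, not from the text: a reward table r[s, a], a transition tensor P[s, a, s'], and policy[s] giving the action $\pi(s)$.

```python
import numpy as np

def bellman_policy_operator(v, policy, r, P, gamma):
    """(T_pi v)(s) = r_s^{pi(s)} + gamma * sum_{s'} p_{ss'}^{pi(s)} * v(s')."""
    idx = np.arange(len(v))
    return r[idx, policy] + gamma * P[idx, policy] @ v

def bellman_optimality_operator(v, r, P, gamma):
    """(T v)(s) = max_a ( r_s^a + gamma * sum_{s'} p_{ss'}^a * v(s') )."""
    q = r + gamma * P @ v        # q[s, a]: one-step lookahead value of each action
    return q.max(axis=1)
```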
Properties of the Bellman Operator
Monotonicity Lemma
Consider $v, v' \in \mathbb{R}^{|S|}$ with $v(s) \le v'(s)$ for all $s \in S$. Then:
$$(T^k v)(s) \le (T^k v')(s), \quad s \in S, \; k = 1, 2, \ldots$$
Similarly for any static policy π.
If $v(s) \le (T v)(s)$ for all $s \in S$, then:
$$(T^k v)(s) \le (T^{k+1} v)(s), \quad s \in S, \; k = 1, 2, \ldots$$
Constant Shift Lemma
For any static policy $\pi$, scalar $b$, and $v \in \mathbb{R}^{|S|}$:
$$(T^k (v + b \cdot \mathbf{1}_S))(s) = (T^k v)(s) + \gamma^k b, \quad s \in S, \; k = 1, 2, \ldots$$
$$(T_\pi^k (v + b \cdot \mathbf{1}_S))(s) = (T_\pi^k v)(s) + \gamma^k b, \quad s \in S, \; k = 1, 2, \ldots$$
where $\mathbf{1}_S \in \mathbb{R}^{|S|}$ is the all-ones vector.
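The constant-shift property is easy to check numerically. The snippet below is entirely illustrative: it builds a randomly generated MDP, applies the optimality operator $T$ defined above to both $v$ and $v + b \cdot \mathbf{1}_S$, and verifies that the results differ by exactly $\gamma^k b$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random finite MDP (made up for this check): 4 states, 3 actions.
n_states, n_actions, gamma, b = 4, 3, 0.9, 2.5
r = rng.normal(size=(n_states, n_actions))
P = rng.random(size=(n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)        # normalize each row into a distribution

def T(v):
    """Bellman optimality operator: (T v)(s) = max_a (r_s^a + gamma * sum_s' p * v(s'))."""
    return (r + gamma * P @ v).max(axis=1)

v = rng.normal(size=n_states)
x, y = v.copy(), v + b                   # y = v + b * 1_S
for k in range(1, 6):
    x, y = T(x), T(y)
    # Constant shift lemma: (T^k (v + b*1_S))(s) = (T^k v)(s) + gamma^k * b
    assert np.allclose(y - x, gamma**k * b)
print("constant-shift lemma verified for k = 1..5")
```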
Proposition: Convergence of DP Algorithm
Assumptions
Immediate rewards $R_t$ are bounded, i.e., $|R_t| \le M$ for some constant $M$.
$0 < \gamma < 1$.
Proposition
For any initial guess $v_0 \in \mathbb{R}^{|S|}$ and any $s \in S$:
$$v^\star(s) = \lim_{N \to \infty} (T^N v_0)(s)$$
where $v^\star$ is the optimal value function, i.e.:
$$v^\star(s_0) = \max_\pi \lim_{K \to \infty} \mathbb{E}\left[ \sum_{t=0}^{K} \gamma^t R_{t+1} \,\middle|\, S_0 = s_0 \right]$$
Similarly, for a fixed policy $\pi$:
$$v_\pi(s) = \lim_{N \to \infty} (T_\pi^N v_0)(s)$$
In-depth Understanding: This proposition shows that no matter what initial value estimate we start with, by repeatedly applying the Bellman operator, we will eventually converge to the true optimal value function.
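This proposition underlies the value iteration algorithm: start from any $v_0$ and apply $T$ until successive iterates stop changing. A minimal sketch, assuming the same r[s, a] / P[s, a, s'] layout as the earlier snippets (these conventions are mine, not from the text):

```python
import numpy as np

def value_iteration(r, P, gamma, tol=1e-8, max_iter=10_000):
    """Repeatedly apply the Bellman optimality operator T until convergence.

    r[s, a] are expected rewards, P[s, a, s'] transition probabilities.
    Returns an approximation of the optimal value function v*.
    """
    v = np.zeros(r.shape[0])             # any initial guess works, per the proposition
    for _ in range(max_iter):
        v_new = (r + gamma * P @ v).max(axis=1)   # v_{k+1} = T v_k
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v
```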
Proposition: Bellman Equation
Proposition
The optimal value function $v^\star$ satisfies, for all $s \in S$:
$$v^\star(s) = \max_{a \in A} \left( r_s^a + \gamma \sum_{s' \in S} p_{s s'}^{a} \, v^\star(s') \right)$$
or equivalently:
$$v^\star = T v^\star$$
Moreover, v⋆ is the unique solution to this equation (thus, the solution to this equation must be the optimal value function).
Similarly, for any $v \in \mathbb{R}^{|S|}$ with $v \ge T v$ (or $v \le T v$), we have $v \ge v^\star$ (or $v \le v^\star$, respectively).
For a fixed policy $\pi$, the associated value function $v_\pi$ satisfies:
$$v_\pi(s) = r_s^{\pi(s)} + \gamma \sum_{s' \in S} p_{s s'}^{\pi(s)} \, v_\pi(s')$$
or equivalently:
$$v_\pi = T_\pi v_\pi$$
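Because $T_\pi$ is affine in $v$, this fixed-point equation is a linear system, so $v_\pi$ can be computed directly, just as in the MRP case. A sketch under the same assumed array conventions as above (the helper name policy_value is mine):

```python
import numpy as np

def policy_value(policy, r, P, gamma):
    """Solve v_pi = T_pi v_pi exactly, i.e. (I - gamma * P_pi) v_pi = r_pi.

    policy[s] is the action taken in state s; r[s, a] and P[s, a, s'] are the
    reward table and transition tensor used in the earlier sketches.
    """
    n_states = r.shape[0]
    idx = np.arange(n_states)
    r_pi = r[idx, policy]                # r_pi[s] = r_s^{pi(s)}
    P_pi = P[idx, policy]                # P_pi[s, s'] = p_{ss'}^{pi(s)}
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
```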
Proposition: Necessary and Sufficient Conditions for Optimality
Proposition
A static policy $\pi$ is optimal if and only if $\pi(s)$ achieves the maximum in the Bellman equation for each $s \in S$, i.e.:
$$T v^\star = T_\pi v^\star$$
Implications
First compute $v^\star$.
For each $s \in S$, choose:
$$a_s \in \arg\max_{a \in A} \left( r_s^a + \gamma \sum_{s' \in S} p_{s s'}^{a} \, v^\star(s') \right)$$
Construct a policy $\pi$ such that $\pi(s) = a_s$.
By construction, $T v^\star = T_\pi v^\star$.
Hence $\pi$ is an optimal static deterministic policy; a code sketch of this construction follows below.
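A sketch of the greedy construction, again under the assumed r[s, a] / P[s, a, s'] conventions (the helper name greedy_policy is mine):

```python
import numpy as np

def greedy_policy(v_star, r, P, gamma):
    """Pick a_s in argmax_a ( r_s^a + gamma * sum_s' p_{ss'}^a * v*(s') ) per state.

    Ties are broken arbitrarily (argmax returns the first maximizer), which is
    fine: any maximizer yields an optimal deterministic policy.
    """
    q_star = r + gamma * P @ v_star      # q*(s, a): one-step lookahead using v*
    return q_star.argmax(axis=1)         # pi(s) = a_s

# Combined with the value_iteration sketch above:
# pi = greedy_policy(value_iteration(r, P, gamma), r, P, gamma)
```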
Uniqueness and Stochastic Policies
The optimal policy need not be unique.
For a given $s \in S$, there may be multiple actions $a_s$ satisfying the above condition.
Choosing any action from $\{a_s^1, a_s^2, \ldots, a_s^K\}$ in state $s$, even at random, is optimal.
Stochastic policy: $\pi(a \mid s) = P(A_t = a \mid S_t = s)$.
Optimal stochastic policies: $\pi(a \mid s) > 0$ only if $a \in \{a_s^1, a_s^2, \ldots, a_s^K\}$; otherwise $\pi(a \mid s) = 0$ (see the sketch below).
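For instance, one such optimal stochastic policy spreads probability uniformly over the maximizing actions; the sketch below (tolerance and helper name are my own choices) builds it from $v^\star$ under the same array conventions as before.

```python
import numpy as np

def optimal_stochastic_policy(v_star, r, P, gamma, tol=1e-9):
    """Uniform distribution over all maximizing actions in each state.

    Returns pi[s, a] = P(A_t = a | S_t = s); probability mass is spread evenly
    over the argmax set {a_s^1, ..., a_s^K} and is zero elsewhere.
    """
    q_star = r + gamma * P @ v_star                            # q*(s, a)
    best = q_star >= q_star.max(axis=1, keepdims=True) - tol   # maximizing actions
    return best / best.sum(axis=1, keepdims=True)              # normalize each row
```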
Key Conclusion: In the MDP setting, the best deterministic policy performs as well as the best stochastic policy; stochastic policies do not provide additional advantage.