强化学习理论入门(Trust Region Policy Optimization介绍)
介绍本文主要介绍Trust Region Policy Optimization这篇文章,这篇文章主要回答了如下2个问题:两个不同策略的value function,他们的差异是多少?有什么办法可以保证,一个策略相比于另外一个策略一定能够提升呢?针对这两个问题,我们先定义一些基本的概念,基本定义下图是一个较为一般的强化学习MDP框架下的概率图模型注意,这个图并不一定通用,特别是reward(比如s
介绍
本文主要介绍Trust Region Policy Optimization这篇文章,这篇文章主要回答了如下2个问题:
- 两个不同策略的value function,他们的差异是多少?
- 有什么办法可以保证,一个策略相比于另外一个策略一定能够提升呢?
针对这两个问题,我们先定义一些基本的概念,
基本定义
下图是一个较为一般的强化学习MDP框架下的概率图模型
注意,这个图并不一定通用,特别是reward(比如 s t + 1 s_{t+1} st+1可以不指向 r t + 1 r_{t+1} rt+1),可能是需要考虑具体场景,但这是一个较为一般化结构。基于此,我们有:
马尔科夫性质
P ( S t + 1 = s ′ , R t + 1 = r ′ ∣ s t , a t , r t , … , r 1 , s 0 , a 0 ) = P ( S t + 1 = s ′ , R t + 1 = r ′ ∣ s t , a t ) P\left( S_{t+1} =s^{\prime } ,R_{t+1} =r^{\prime } \mid s_{t} ,a_{t} ,r_{t} ,\dotsc ,r_{1} ,s_{0} ,a_{0}\right) =P\left( S_{t+1} =s^{\prime } ,R_{t+1} =r^{\prime } \mid s_{t} ,a_{t}\right) P(St+1=s′,Rt+1=r′∣st,at,rt,…,r1,s0,a0)=P(St+1=s′,Rt+1=r′∣st,at)
状态转移矩阵:
P s s ′ a = ∑ r ′ P ( S t + 1 = s ′ , R t + 1 = r ′ ∣ S t = s , A t = a ) \mathcal{P}_{ss'}^{a} =\sum _{r'} P\left( S_{t+1} =s^{\prime } ,R_{t+1} =r^{\prime } \mid S_{t} =s,A_{t} =a\right) Pss′a=r′∑P(St+1=s′,Rt+1=r′∣St=s,At=a)
下一时刻的reward:
R s s ′ a = E [ R t + 1 ∣ S t = s , A t = a , S t + 1 = s ′ ] = 1 P s s ′ a ∑ r ′ r ′ P ( S t + 1 = s ′ , R t + 1 = r ′ ∣ S t = s , A t = a ) \mathcal{R}_{ss'}^{a} =E[ R_{t+1} \mid S_{t} =s,A_{t} =a,S_{t+1} =s'] =\frac{1}{\mathcal{P}_{ss'}^{a}}\sum _{r'} r'P\left( S_{t+1} =s^{\prime } ,R_{t+1} =r^{\prime } \mid S_{t} =s,A_{t} =a\right) Rss′a=E[Rt+1∣St=s,At=a,St+1=s′]=Pss′a1r′∑r′P(St+1=s′,Rt+1=r′∣St=s,At=a)
轨迹序列联合概率:
p ( τ ∣ π ) = p ( s 0 ) ∏ t = 0 T − 1 p ( s t + 1 , r t + 1 ∣ s t , a t ) π ( a t ∣ s t ) p(\tau \mid \pi )=p( s_{0})\prod _{t=0}^{T-1} p( s_{t+1} ,r_{t+1} \mid s_{t} ,a_{t}) \ \pi ( a_{t} \mid s_{t}) p(τ∣π)=p(s0)t=0∏T−1p(st+1,rt+1∣st,at) π(at∣st)
State Value Function:
V π ( s ) = E τ ∼ π [ ∑ l = 0 ∞ γ l r t + l + 1 ∣ S t = s ] = ∑ a t , r t + 1 , s t + 1 . . . . p ( s t ) ∏ t ∞ p ( s t + 1 ∣ s t , a t ) p ( r t + 1 ∣ s t , a t , s t + 1 ) π ( a t ∣ s t ) ∑ l = 0 ∞ γ l r t + l + 1 = ∑ a t , r t + 1 , s t + 1 . . . . p ( s t ) ∏ t ∞ p ( s t + 1 ∣ s t , a t ) p ( r t + 1 ∣ s t , a t , s t + 1 ) π ( a t ∣ s t ) ( r t + 1 + ∑ l = 1 ∞ γ l r t + l + 1 ) = ∑ a t , r t + 1 , s t + 1 p ( s t ) p ( s t + 1 ∣ s t , a t ) p ( r t + 1 ∣ s t , a t , s t + 1 ) π ( a t ∣ s t ) r t + 1 + ∑ a t , r t + 1 , s t + 1 p ( s t ) p ( s t + 1 ∣ s t , a t ) p ( r t + 1 ∣ s t , a t , s t + 1 ) π ( a t ∣ s t ) ∑ l = 1 ∞ γ l r t + l + 1 = ∑ a t , r t + 1 , s t + 1 p ( s t ) p ( s t + 1 ∣ s t , a t ) p ( r t + 1 ∣ s t , a t , s t + 1 ) π ( a t ∣ s t ) r t + 1 + ∑ a t , r t + 1 , s t + 1 p ( s t ) p ( s t + 1 ∣ s t , a t ) p ( r t + 1 ∣ s t , a t , s t + 1 ) π ( a t ∣ s t ) ∑ a t + 1 , r t + 2 , s t + 2 . . . . p ( s t + 1 ) ∏ t ∞ p ( s t + 2 ∣ s t + 1 , a t + 1 ) p ( r t + 2 ∣ s t + 1 , a t + 1 , s t + 2 ) π ( a t + 1 ∣ s t + 1 ) ∑ l = 1 ∞ γ l r t + l + 1 = ∑ a t , r t + 1 , s t + 1 p ( s t ) p ( s t + 1 ∣ s t , a t ) p ( r t + 1 ∣ s t , a t , s t + 1 ) π ( a t ∣ s t ) r t + 1 + ∑ a t , r t + 1 , s t + 1 p ( s t ) p ( s t + 1 ∣ s t , a t ) p ( r t + 1 ∣ s t , a t , s t + 1 ) π ( a t ∣ s t ) γ V ( s t + 1 ) = ∑ a t , r t + 1 , s t + 1 p ( s t ) p ( s t + 1 ∣ s t , a t ) p ( r t + 1 ∣ s t , a t , s t + 1 ) π ( a t ∣ s t ) ( r t + 1 + γ V ( s t + 1 ) ) = E r t + 1 [ ∑ a t , s t + 1 p ( s t ) p ( s t + 1 ∣ s t , a t ) π ( a t ∣ s t ) ( r t + 1 + γ V ( s t + 1 ) ) ] = E τ ∼ π [ r t + 1 + γ V ( s t + 1 ) ∣ S t = s ] \begin{aligned} & V_{\pi }( s)\\ = & \ \mathbb{E}_{\tau \sim \pi }\left[\sum _{l=0}^{\infty } \gamma ^{l} r_{t+l+1} |S_{t} =s\right]\\ = & \sum _{a_{t} ,r_{t+1} ,s_{t+1} ....} p( s_{t})\prod _{t}^{\infty } p( s_{t+1} |s_{t} ,a_{t}) p( r_{t+1} \mid s_{t} ,a_{t} ,s_{t+1}) \pi ( a_{t} \mid s_{t})\sum _{l=0}^{\infty } \gamma ^{l} r_{t+l+1}\\ = & \sum _{a_{t} ,r_{t+1} ,s_{t+1} ....} p( s_{t})\prod _{t}^{\infty } p( s_{t+1} |s_{t} ,a_{t}) p( r_{t+1} \mid s_{t} ,a_{t} ,s_{t+1}) \pi ( a_{t} \mid s_{t})\left( r_{t+1} +\sum _{l=1}^{\infty } \gamma ^{l} r_{t+l+1}\right)\\ = & \sum _{a_{t} ,r_{t+1} ,s_{t+1}} p( s_{t}) p( s_{t+1} |s_{t} ,a_{t}) p( r_{t+1} \mid s_{t} ,a_{t} ,s_{t+1}) \pi ( a_{t} \mid s_{t}) r_{t+1} +\\ & \sum _{a_{t} ,r_{t+1} ,s_{t+1}} p( s_{t}) p( s_{t+1} |s_{t} ,a_{t}) p( r_{t+1} \mid s_{t} ,a_{t} ,s_{t+1}) \pi ( a_{t} \mid s_{t})\sum _{l=1}^{\infty } \gamma ^{l} r_{t+l+1}\\ = & \sum _{a_{t} ,r_{t+1} ,s_{t+1}} p( s_{t}) p( s_{t+1} |s_{t} ,a_{t}) p( r_{t+1} \mid s_{t} ,a_{t} ,s_{t+1}) \pi ( a_{t} \mid s_{t}) r_{t+1} +\\ & \sum _{a_{t} ,r_{t+1} ,s_{t+1}} p( s_{t}) p( s_{t+1} |s_{t} ,a_{t}) p( r_{t+1} \mid s_{t} ,a_{t} ,s_{t+1}) \pi ( a_{t} \mid s_{t})\sum _{a_{t+1} ,r_{t+2} ,s_{t+2} ....} p( s_{t+1})\prod _{t}^{\infty } p( s_{t+2} \mid s_{t+1} ,a_{t+1}) p( r_{t+2} \mid s_{t+1} ,a_{t+1} ,s_{t+2}) \ \pi ( a_{t+1} \mid s_{t+1})\sum _{l=1}^{\infty } \gamma ^{l} r_{t+l+1}\\ = & \sum _{a_{t} ,r_{t+1} ,s_{t+1}} p( s_{t}) p( s_{t+1} |s_{t} ,a_{t}) p( r_{t+1} \mid s_{t} ,a_{t} ,s_{t+1}) \pi ( a_{t} \mid s_{t}) r_{t+1} +\\ & \sum _{a_{t} ,r_{t+1} ,s_{t+1}} p( s_{t}) p( s_{t+1} |s_{t} ,a_{t}) p( r_{t+1} \mid s_{t} ,a_{t} ,s_{t+1}) \pi ( a_{t} \mid s_{t}) \gamma V( s_{t+1})\\ = & \sum _{a_{t} ,r_{t+1} ,s_{t+1}} p( s_{t}) p( s_{t+1} |s_{t} ,a_{t}) p( r_{t+1} \mid s_{t} ,a_{t} ,s_{t+1}) \pi ( a_{t} \mid s_{t})( r_{t+1} +\gamma V( s_{t+1}))\\ = & E_{r_{t+1}}\left[\sum _{a_{t} ,s_{t+1}} p( s_{t}) p( s_{t+1} |s_{t} ,a_{t}) \pi ( a_{t} \mid s_{t})( r_{t+1} +\gamma V( s_{t+1}))\right]\\ = & \ \mathbb{E}_{\tau \sim \pi }[ r_{t+1} +\gamma V( s_{t+1}) |S_{t} =s] \end{aligned} =========Vπ(s) Eτ∼π[l=0∑∞γlrt+l+1∣St=s]at,rt+1,st+1....∑p(st)t∏∞p(st+1∣st,at)p(rt+1∣st,at,st+1)π(at∣st)l=0∑∞γlrt+l+1at,rt+1,st+1....∑p(st)t∏∞p(st+1∣st,at)p(rt+1∣st,at,st+1)π(at∣st)(rt+1+l=1∑∞γlrt+l+1)at,rt+1,st+1∑p(st)p(st+1∣st,at)p(rt+1∣st,at,st+1)π(at∣st)rt+1+at,rt+1,st+1∑p(st)p(st+1∣st,at)p(rt+1∣st,at,st+1)π(at∣st)l=1∑∞γlrt+l+1at,rt+1,st+1∑p(st)p(st+1∣st,at)p(rt+1∣st,at,st+1)π(at∣st)rt+1+at,rt+1,st+1∑p(st)p(st+1∣st,at)p(rt+1∣st,at,st+1)π(at∣st)at+1,rt+2,st+2....∑p(st+1)t∏∞p(st+2∣st+1,at+1)p(rt+2∣st+1,at+1,st+2) π(at+1∣st+1)l=1∑∞γlrt+l+1at,rt+1,st+1∑p(st)p(st+1∣st,at)p(rt+1∣st,at,st+1)π(at∣st)rt+1+at,rt+1,st+1∑p(st)p(st+1∣st,at)p(rt+1∣st,at,st+1)π(at∣st)γV(st+1)at,rt+1,st+1∑p(st)p(st+1∣st,at)p(rt+1∣st,at,st+1)π(at∣st)(rt+1+γV(st+1))Ert+1[at,st+1∑p(st)p(st+1∣st,at)π(at∣st)(rt+1+γV(st+1))] Eτ∼π[rt+1+γV(st+1)∣St=s]
其中 s t = s \displaystyle s_{t} =s st=s,一个细节, s t \displaystyle s_{t} st是没有被求和的,因为他是给定的。另外一个细节是,这里没有考虑 R t \displaystyle R_{t} Rt时刻的reward的,reward是从 t + 1 t+1 t+1开始算起的,原因是reward往往需要上一时刻的 s t s_t st来计算,而初始时,并没有上一时刻的值所以不会计算。此外,reward的取值有时候是确定的,并没有随机性,那么可以简单用一个函数代替,如 r t + 1 = r ( s t ) r_{t+1}=r(s_t) rt+1=r(st),在这种时候,对于 V π ( s t ) \displaystyle V_{\pi }( s_{t}) Vπ(st),可以简写为
V π ( s t ) = E π [ ∑ l = 0 ∞ γ l r ( s t + l ) ] V_{\pi }( s_{t}) =\mathbb{E}_{\pi }\left[\sum _{l=0}^{\infty } \gamma ^{l} r( s_{t+l})\right] Vπ(st)=Eπ[l=0∑∞γlr(st+l)]
Trust Region Policy Optimization
接下来,我们介绍TRPO这篇文章。这篇论文主要研究了一种如何让你的policy在累计reward上一定能够得到提升的方法,要研究这个东西我们首先需要搞清楚一个问题,那就是,两个不同的polocy他们的效果的区别在哪里?只有找到两者的联系,我们才能设计一种更好的更新策略。
首先,这里我们研究的是expected discounted reward:
η ( π ) = E s 0 , a 0 , … [ ∑ t = 0 ∞ γ t r ( s t ) ] , where s 0 ∼ ρ 0 ( s 0 ) , a t ∼ π ( a t ∣ s t ) , s t + 1 ∼ P ( s t + 1 ∣ s t , a t ) . \begin{aligned} & \eta (\pi )=\mathbb{E}_{s_0 ,a_{0} ,\dotsc }\left[\sum _{t=0}^{\infty } \gamma ^{t} r(s_{t} )\right] ,\text{ where}\\ & s_{0} \sim \rho _{0} (s_{0} ),\ a_{t} \sim \pi (a_{t} |s_{t} ),\ s_{t+1} \sim P(s_{t+1} |s_{t} ,a_{t} ). \end{aligned} η(π)=Es0,a0,…[t=0∑∞γtr(st)], wheres0∼ρ0(s0), at∼π(at∣st), st+1∼P(st+1∣st,at).
注意,这里 η ( π ) \displaystyle \eta ( \pi ) η(π)与value function V π ( s ) \displaystyle V^{\pi }( s) Vπ(s)有点不同,其不同之处在于, η \displaystyle \eta η需要进一步对初始状态 s \displaystyle s s求期望,即需要考虑初始状态下每个状态的概率,我们用 ρ 0 ( s ) : = P ( s 0 = s ) \displaystyle \rho _{0} (s):=P( s_{0} =s) ρ0(s):=P(s0=s)表示初始状态为s的概率,于是, η \displaystyle \eta η与 V \displaystyle V V的联系为:
E s 0 [ V π ( s 0 ) ] = ∑ s 0 ∈ S ρ 0 ( s 0 ) V π ( s 0 ) = η ( π ) E_{s_{0}}\left[ V^{\pi }( s_{0})\right] =\sum _{s_{0} \in S} \rho _{0} (s_{0} )V^{\pi }( s_{0}) =\eta ( \pi ) Es0[Vπ(s0)]=s0∈S∑ρ0(s0)Vπ(s0)=η(π)
那么,这篇文章第一个结论就是,两个不同polocy,expected discounted reward有如下关系:
η ( π ~ ) = η ( π ) + E τ ∼ π ~ [ ∑ t = 0 ∞ γ t A π ( s t , a t ) ] \eta (\tilde{\pi } )=\eta (\pi )+\mathbb{E}_{\tau \sim \tilde{\pi }}\left[\sum _{t=0}^{\infty } \gamma ^{t} A_{\pi }( s_{t} ,a_{t})\right] η(π~)=η(π)+Eτ∼π~[t=0∑∞γtAπ(st,at)]
其中
Q π ( s t , a t ) = E s t + 1 , a t + 1 , … [ ∑ l = 0 ∞ γ l r ( s t + l ) ] V π ( s t ) = E a t , s t + 1 , … [ ∑ l = 0 ∞ γ l r ( s t + l ) ] A π ( s , a ) = Q π ( s , a ) − V π ( s ) , where a t ∼ π ( a t ∣ s t ) , s t + 1 ∼ P ( s t + 1 ∣ s t , a t ) for t ≥ 0 \begin{aligned} & Q_{\pi }( s_{t} ,a_{t}) =\mathbb{E}_{s_{t+1} ,a_{t+1} ,\dotsc }\left[\sum _{l=0}^{\infty } \gamma ^{l} r( s_{t+l})\right]\\ & V_{\pi }( s_{t}) =\mathbb{E}_{a_{t} ,s_{t+1} ,\dotsc }\left[\sum _{l=0}^{\infty } \gamma ^{l} r( s_{t+l})\right]\\ & A_{\pi } (s,a)=Q_{\pi } (s,a)-V_{\pi } (s),\text{ where }\\ & a_{t} \sim \pi ( a_{t} \mid s_{t}) ,s_{t+1} \sim P( s_{t+1} \mid s_{t} ,a_{t})\text{ for } t\geq 0 \end{aligned} Qπ(st,at)=Est+1,at+1,…[l=0∑∞γlr(st+l)]Vπ(st)=Eat,st+1,…[l=0∑∞γlr(st+l)]Aπ(s,a)=Qπ(s,a)−Vπ(s), where at∼π(at∣st),st+1∼P(st+1∣st,at) for t≥0
注意一些细节,在Q的定义中,期望是从 a t + 1 \displaystyle a_{t+1} at+1开始取的,而V中的期望是从 a t \displaystyle a_{t} at就开始取的,所以
Q π ( s t , a t ) − V π ( s t ) = r ( s t ) + ∑ s t + 1 p ( s t + 1 ∣ s t , a t ) V ( s t + 1 ) − V π ( s t ) = r ( s t ) + E s t + 1 ∼ p ( s t + 1 ∣ s t , a t ) [ V ( s t + 1 ) ] − V π ( s t ) Q_{\pi } (s_{t} ,a_{t} )-V_{\pi } (s_{t} )=r(s_t)+\sum _{s_{t+1}} p( s_{t+1} |s_{t} ,a_{t}) V( s_{t+1}) -V_{\pi }( s_{t}) =r(s_t)+E_{s_{t+1} \sim p( s_{t+1} |s_{t} ,a_{t})}[ V( s_{t+1})] -V_{\pi }( s_{t}) Qπ(st,at)−Vπ(st)=r(st)+st+1∑p(st+1∣st,at)V(st+1)−Vπ(st)=r(st)+Est+1∼p(st+1∣st,at)[V(st+1)]−Vπ(st)
直觉上, Q π ( s , a ) − V π ( s ) \displaystyle Q_{\pi } (s,a)-V_{\pi } (s) Qπ(s,a)−Vπ(s)衡量了只在当前step选择a所得到的额外好处,接下来利用这个A我们就可以推导出,两种不同polocy之间的差异,正是由A决定的:
E τ ∣ π ~ [ ∑ t = 0 ∞ γ t A π ( s t , a t ) ] = E τ ∣ π ~ [ ∑ t = 0 ∞ γ t E s t + 1 ∼ p ( s t + 1 ∣ s t , a t ) [ r ( s t ) + γ V π ( s t + 1 ) − V π ( s t ) ] ] = E τ ∣ π ~ [ ∑ t = 0 ∞ γ t r ( s t ) + E s 1 , s 2 , . . . , [ γ V π ( s 1 ) − V π ( s 0 ) + γ 2 V π ( s 2 ) − γ V π ( s 1 ) + γ 3 V π ( s 3 ) − γ 2 V π ( s 2 ) + . . . ] ] = E τ ∣ π ~ [ ∑ t = 0 ∞ γ t r ( s t ) − V π ( s 0 ) ] = − E s 0 [ V π ( s 0 ) ] + E τ ∣ π ~ [ ∑ t = 0 ∞ γ t r ( s t ) ] = − η ( π ) + η ( π ~ ) \begin{aligned} & \mathbb{E}_{\tau \mid \tilde{\pi }}\left[\sum _{t=0}^{\infty } \gamma ^{t} A_{\pi }( s_{t} ,a_{t})\right]\\ & =\mathbb{E}_{\tau \mid \tilde{\pi }}\left[\sum _{t=0}^{\infty } \gamma ^{t} E_{s_{t+1} \sim p( s_{t+1} |s_{t} ,a_{t})}[ r( s_{t}) +\gamma V_{\pi }( s_{t+1}) -V_{\pi }( s_{t})]\right]\\ & =\mathbb{E}_{\tau \mid \tilde{\pi }}\left[\sum _{t=0}^{\infty } \gamma ^{t} r( s_{t}) +E_{s_{1} ,s_{2} ,...,}\left[ \gamma V_{\pi }( s_{1}) -V_{\pi }( s_{0}) +\gamma ^{2} V_{\pi }( s_{2}) -\gamma V_{\pi }( s_{1}) +\gamma ^{3} V_{\pi }( s_{3}) -\gamma ^{2} V_{\pi }( s_{2}) +...\right]\right]\\ & =\mathbb{E}_{\tau \mid \tilde{\pi }}\left[\sum _{t=0}^{\infty } \gamma ^{t} r( s_{t}) -V_{\pi }( s_{0})\right]\\ & =-\mathbb{E}_{s_{0}}[ V_{\pi }( s_{0})] +\mathbb{E}_{\tau \mid \tilde{\pi }}\left[\sum _{t=0}^{\infty } \gamma ^{t} r( s_{t})\right]\\ & =-\eta (\pi )+\eta (\tilde{\pi } ) \end{aligned} Eτ∣π~[t=0∑∞γtAπ(st,at)]=Eτ∣π~[t=0∑∞γtEst+1∼p(st+1∣st,at)[r(st)+γVπ(st+1)−Vπ(st)]]=Eτ∣π~[t=0∑∞γtr(st)+Es1,s2,...,[γVπ(s1)−Vπ(s0)+γ2Vπ(s2)−γVπ(s1)+γ3Vπ(s3)−γ2Vπ(s2)+...]]=Eτ∣π~[t=0∑∞γtr(st)−Vπ(s0)]=−Es0[Vπ(s0)]+Eτ∣π~[t=0∑∞γtr(st)]=−η(π)+η(π~)
此外,在 η ( π ) \eta(\pi) η(π)的原始定义,我们需要对每个序列的时间步求积分,但如果S里面的取值是有限的,那么 s \displaystyle s s将会相互重复,注意到,这里的r是由 s t + l \displaystyle s_{t+l} st+l决定的,因此,对于相同状态的reward是可以被合并的。举个例子:
不妨设 t = 0 , 1 , 2 \displaystyle t=0,1,2 t=0,1,2,于是
η ( π ) = E s 0 , s 1 , s 2 [ ∑ t = 0 2 γ t r ( s t ) ] = ∑ s 0 p ( s 0 ) γ 0 r ( s 0 ) + ∑ a 0 , s 0 , s 1 p ( s 0 ) π ( a 0 ∣ s 0 ) p ( s 1 ∣ s 0 , a 0 ) γ r ( s 1 ) + ∑ a 0 , s 0 , a 1 s 1 , s 2 p ( s 0 ) π ( a 0 ∣ s 0 ) p ( s 1 ∣ s 0 , a 0 ) p ( s 2 ∣ s 1 , a 1 ) π ( a 1 ∣ s 1 ) γ 2 r ( s 2 ) = P ( s 0 = 1 ) γ 0 r ( 1 ) + ∑ s 0 , a 0 p ( s 1 = 1 ∣ s 0 , a 0 ) p ( s 0 ) π ( a 0 ∣ s 0 ) γ 1 r ( 1 ) + ∑ s 0 , a 0 , s 1 , a 1 p ( s 2 = 1 ∣ s 1 , a 1 ) p ( s 1 ∣ s 0 , a 0 ) p ( s 0 ) π ( a 1 ∣ s 1 ) γ 2 r ( 1 ) + P ( s 0 = 2 ) γ 0 r ( 2 ) + ∑ s 0 , a 0 p ( s 1 = 2 ∣ s 0 , a 0 ) p ( s 0 ) π ( a 0 ∣ s 0 ) γ 1 r ( 2 ) + ∑ s 0 , a 0 , s 1 , a 1 p ( s 2 = 2 ∣ s 1 , a 1 ) p ( s 1 ∣ s 0 , a 0 ) p ( s 0 ) π ( a 1 ∣ s 1 ) γ 2 r ( 2 ) + P ( s 0 = 3 ) γ 0 r ( 3 ) + ∑ s 0 , a 0 p ( s 1 = 3 ∣ s 0 , a 0 ) p ( s 0 ) π ( a 0 ∣ s 0 ) γ 1 r ( 3 ) + ∑ s 0 , a 0 , s 1 , a 1 p ( s 2 = 3 ∣ s 1 , a 1 ) p ( s 1 ∣ s 0 , a 0 ) p ( s 0 ) π ( a 1 ∣ s 1 ) γ 2 r ( 3 ) = ∑ s ρ ( s ) r ( s ) \begin{aligned} \eta( \pi ) & =\mathbb{E}_{s_{0} ,s_{1} ,s_{2}}\left[\sum ^{2}_{t=0} \gamma ^{t} r( s_{t})\right]\\ & =\sum _{s_{0}} p( s_{0}) \gamma ^{0} r( s_{0})\\ & +\sum _{a_{0} ,s_{0} ,s_{1}} p( s_{0}) \pi ( a_{0} \mid s_{0}) p( s_{1} |s_{0} ,a_{0}) \gamma r( s_{1})\\ & +\sum _{a_{0} ,s_{0} ,a_{1} s_{1} ,s_{2}} p( s_{0}) \pi ( a_{0} \mid s_{0}) p( s_{1} |s_{0} ,a_{0}) p( s_{2} |s_{1} ,a_{1}) \pi ( a_{1} \mid s_{1}) \gamma ^{2} r( s_{2})\\ & =P( s_{0} =1) \gamma ^{0} r( 1) +\sum _{s_{0} ,a_{0}} p( s_{1} =1|s_{0} ,a_{0}) p( s_{0}) \pi ( a_{0} |s_{0}) \gamma ^{1} r( 1) +\sum _{s_{0} ,a_{0} ,s_{1} ,a_{1}} p( s_{2} =1|s_{1} ,a_{1}) p( s_{1} |s_{0} ,a_{0}) p( s_{0}) \pi ( a_{1} |s_{1}) \gamma ^{2} r( 1)\\ & +P( s_{0} =2) \gamma ^{0} r( 2) +\sum _{s_{0} ,a_{0}} p( s_{1} =2|s_{0} ,a_{0}) p( s_{0}) \pi ( a_{0} |s_{0}) \gamma ^{1} r( 2) +\sum _{s_{0} ,a_{0} ,s_{1} ,a_{1}} p( s_{2} =2|s_{1} ,a_{1}) p( s_{1} |s_{0} ,a_{0}) p( s_{0}) \pi ( a_{1} |s_{1}) \gamma ^{2} r( 2)\\ & +P( s_{0} =3) \gamma ^{0} r( 3) +\sum _{s_{0} ,a_{0}} p( s_{1} =3|s_{0} ,a_{0}) p( s_{0}) \pi ( a_{0} |s_{0}) \gamma ^{1} r( 3) +\sum _{s_{0} ,a_{0} ,s_{1} ,a_{1}} p( s_{2} =3|s_{1} ,a_{1}) p( s_{1} |s_{0} ,a_{0}) p( s_{0}) \pi ( a_{1} |s_{1}) \gamma ^{2} r( 3)\\ & =\sum _{s} \rho ( s) r( s) \end{aligned} η(π)=Es0,s1,s2[t=0∑2γtr(st)]=s0∑p(s0)γ0r(s0)+a0,s0,s1∑p(s0)π(a0∣s0)p(s1∣s0,a0)γr(s1)+a0,s0,a1s1,s2∑p(s0)π(a0∣s0)p(s1∣s0,a0)p(s2∣s1,a1)π(a1∣s1)γ2r(s2)=P(s0=1)γ0r(1)+s0,a0∑p(s1=1∣s0,a0)p(s0)π(a0∣s0)γ1r(1)+s0,a0,s1,a1∑p(s2=1∣s1,a1)p(s1∣s0,a0)p(s0)π(a1∣s1)γ2r(1)+P(s0=2)γ0r(2)+s0,a0∑p(s1=2∣s0,a0)p(s0)π(a0∣s0)γ1r(2)+s0,a0,s1,a1∑p(s2=2∣s1,a1)p(s1∣s0,a0)p(s0)π(a1∣s1)γ2r(2)+P(s0=3)γ0r(3)+s0,a0∑p(s1=3∣s0,a0)p(s0)π(a0∣s0)γ1r(3)+s0,a0,s1,a1∑p(s2=3∣s1,a1)p(s1∣s0,a0)p(s0)π(a1∣s1)γ2r(3)=s∑ρ(s)r(s)
类似的,我们利用有限的state的性质来重写value function:
η ( π ~ ) = η ( π ) + E τ ∣ π ~ [ ∑ t = 0 ∞ γ t A π ( s t , a t ) ] = η ( π ) + ∑ t = 0 ∞ γ t ∑ s P ( s t = s ∣ π ~ ) ∑ a π ~ ( a ∣ s ) A π ( s , a ) = η ( π ) + ∑ t = 0 ∞ γ t ∑ s P ( s t = s ∣ π ~ ) ∑ a π ~ ( a ∣ s ) A π ( s , a ) = η ( π ) + ∑ s ∑ t = 0 ∞ γ t P ( s t = s ∣ π ~ ) ∑ a π ~ ( a ∣ s ) A π ( s , a ) = η ( π ) + ∑ s ρ π ~ ( s ) ∑ a π ~ ( a ∣ s ) A π ( s , a ) \begin{aligned} \eta (\tilde{\pi } ) & =\mathbb{\eta (\pi )+E}_{\tau \mid \tilde{\pi }}\left[\sum _{t=0}^{\infty } \gamma ^{t} A_{\pi }( s_{t} ,a_{t})\right]\\ & =\mathbb{\eta (\pi )+}\sum _{t=0}^{\infty } \gamma ^{t}\sum _{s} P\left( s_{t} =s\mid \tilde{\pi }\right)\sum _{a}\tilde{\pi } (a\mid s)A_{\pi } (s,a)\\ & =\eta (\pi )+\sum _{t=0}^{\infty } \gamma ^{t}\sum _{s} P\left( s_{t} =s\mid \tilde{\pi }\right)\sum _{a}\tilde{\pi } (a\mid s)A_{\pi } (s,a)\\ & =\eta (\pi )+\sum _{s}\sum _{t=0}^{\infty } \gamma ^{t} P\left( s_{t} =s\mid \tilde{\pi }\right)\sum _{a}\tilde{\pi } (a\mid s)A_{\pi } (s,a)\\ & =\eta (\pi )+\sum _{s} \rho _{\tilde{\pi }} (s)\sum _{a}\tilde{\pi } (a\mid s)A_{\pi } (s,a) \end{aligned} η(π~)=η(π)+Eτ∣π~[t=0∑∞γtAπ(st,at)]=η(π)+t=0∑∞γts∑P(st=s∣π~)a∑π~(a∣s)Aπ(s,a)=η(π)+t=0∑∞γts∑P(st=s∣π~)a∑π~(a∣s)Aπ(s,a)=η(π)+s∑t=0∑∞γtP(st=s∣π~)a∑π~(a∣s)Aπ(s,a)=η(π)+s∑ρπ~(s)a∑π~(a∣s)Aπ(s,a)
其中
ρ π ( s ) = P ( s 0 = s ) + γ P ( s 1 = s ) + γ 2 P ( s 2 = s ) + … \rho _{\pi } (s)=P( s_{0} =s) +\gamma P( s_{1} =s) +\gamma ^{2} P( s_{2} =s) +\dotsc ρπ(s)=P(s0=s)+γP(s1=s)+γ2P(s2=s)+…
根据这个等式,我们发现,只要能够保证 ∑ a π ~ ( a ∣ s ) A π ( s , a ) ⩾ 0 \displaystyle \sum _{a}\tilde{\pi } (a\mid s)A_{\pi } (s,a)\geqslant 0 a∑π~(a∣s)Aπ(s,a)⩾0,那么 π → π ~ \displaystyle \pi \rightarrow \tilde{\pi } π→π~就一定有提升,这意味着一种传统的贪婪方法:
π ~ = a r g max a A π ( s , a ) \tilde{\pi } =arg\max_{a} A_{\pi }( s,a) π~=argamaxAπ(s,a)
只要A中有有某个状态与动作是正数,那么 π ~ \displaystyle \tilde{\pi } π~就一定能提升(因为取最大值总能找到这个数)。但问题在于,我们并不知道更新了policy后, ρ π ~ ( s ) \displaystyle \rho _{\tilde{\pi }} (s) ρπ~(s)会变成什么样,这个东西只能采样估计,那么有没有一些近似的方法呢从直觉上讲,只要两个分布 π , π ~ \displaystyle \pi ,\tilde{\pi } π,π~的距离不太远, ρ π ~ ( s ) \displaystyle \rho _{\tilde{\pi }} (s) ρπ~(s)和 ρ π ( s ) \displaystyle \rho _{\pi } (s) ρπ(s)应该相差不远才对,也就是说,如果我们能够限制两个 π \displaystyle \pi π的距离,那么 ρ π ( s ) \displaystyle \rho _{\pi } (s) ρπ(s)之间的距离也应该会被限制住,所以,我们是不是可以用一种近似的方法来学习出 η ( π ~ ) \displaystyle \eta (\tilde{\pi } ) η(π~):
L π ( π ~ ) = η ( π ) + ∑ s ρ π ( s ) ∑ a π ~ ( a ∣ s ) A π ( s , a ) \begin{aligned} L_{\pi } (\tilde{\pi } ) & =\eta (\pi )+\sum _{s} \rho _{\pi } (s)\sum _{a}\tilde{\pi } (a\mid s)A_{\pi } (s,a) \end{aligned} Lπ(π~)=η(π)+s∑ρπ(s)a∑π~(a∣s)Aπ(s,a)
这里,我们将 ρ π ~ ( s ) \displaystyle \rho _{\tilde{\pi }} (s) ρπ~(s)换成了 ρ π ( s ) \displaystyle \rho _{\pi } (s) ρπ(s)。
但问题在于,它与真实的 η ( π ~ ) \eta(\tilde{\pi }) η(π~)的差距有多大呢?我们将证明当 π \displaystyle \pi π与 π ~ \displaystyle \tilde{\pi } π~的距离有限时, L ( π ~ ) \displaystyle L(\tilde{\pi } ) L(π~)与真实 η ( π ~ ) \displaystyle \eta (\tilde{\pi } ) η(π~)的差距将会被bound住:
定理1 令 α = max s D T V ( π ( ⋅ ∣ s ) ∥ π ~ ( ⋅ ∣ s ) ) \displaystyle \alpha =\max_{s} D_{TV}\left( \pi ( \cdot |s) \| \tilde{\pi }( \cdot |s)\right) α=smaxDTV(π(⋅∣s)∥π~(⋅∣s)),则下面的不等式成立
η ( π new ) ≥ L π old ( π new ) − 4 ϵ γ ( 1 − γ ) 2 α 2 where ϵ = max s , a ∣ A π ( s , a ) ∣ , \left. \begin{array}{ c } \eta (\pi _{\text{ new }} )\geq L_{\pi _{\text{ old }}} (\pi _{\text{ new }} )-\frac{4\epsilon \gamma }{(1-\gamma )^{2}} \alpha ^{2}\\ \text{ where } \epsilon =\operatorname{max}_{s,a} |A_{\pi } (s,a)|, \end{array}\right. η(π new )≥Lπ old (π new )−(1−γ)24ϵγα2 where ϵ=maxs,a∣Aπ(s,a)∣,
这里 D T V \displaystyle D_{TV} DTV是Total variation distance of probability measures。为了证明这个东西,我们先看下Total variation distance的相关知识。
Total variation distance of probability measures
Total variation distance定义如下:
D T C ( μ , ν ) = sup A ∈ F ∣ μ ( A ) − ν ( A ) ∣ . D_{TC} (\mu ,\nu )=\sup _{A\in \mathcal{F}}| \mu (A)-\nu(A)| . DTC(μ,ν)=A∈Fsup∣μ(A)−ν(A)∣.
其中 F \mathcal{ F} F是sigma-algebra。他衡量了两个概率分布的最大差异,即存在某个事件A使得他们概率相差最大。如果sigma-algebra是可数的(如离散,自然数等),那么TC可以写成l1范数
∥ μ − ν ∥ T V = 1 2 ∑ x ∈ Ω ∣ μ ( x ) − ν ( x ) ∣ \| \mu -\nu \| _{TV} =\frac{1}{2}\sum _{x\in \Omega } |\mu ( x) -\nu ( x) | ∥μ−ν∥TV=21x∈Ω∑∣μ(x)−ν(x)∣
直觉上,两个分布的最大差异一定是发生在区域 B = { x ∣ μ ( x ) > ν ( x ) } \displaystyle B=\{x|\mu ( x) >\nu ( x)\} B={x∣μ(x)>ν(x)}中,或在区域中 B ‾ = { x ∣ μ ( x ) < ν ( x ) } \displaystyle \overline{B} =\{x|\mu ( x) < \nu ( x)\} B={x∣μ(x)<ν(x)}。换句话说,当发生的事件恰好为 B B B时,这就是差异最大的时候,于是,直觉上
∥ μ − ν ∥ T V = μ ( B ) − ν ( B ) 或 ν ( B ‾ ) − μ ( B ‾ ) \| \mu -\nu \| _{TV} =\mu ( B) -\nu ( B) 或\nu (\overline{B}) -\mu (\overline{B}) ∥μ−ν∥TV=μ(B)−ν(B)或ν(B)−μ(B)

实际上,你会发现,
μ ( B ) − ν ( B ) = ν ( B ‾ ) − μ ( B ‾ ) \mu ( B) -\nu ( B) =\nu (\overline{B}) -\mu (\overline{B}) μ(B)−ν(B)=ν(B)−μ(B)
这是因为,
μ ( B ) + μ ( B ‾ ) = ν ( B ‾ ) + ν ( B ) = 1 \mu ( B) +\mu (\overline{B}) =\nu (\overline{B}) +\nu ( B) =1 μ(B)+μ(B)=ν(B)+ν(B)=1
所以
∥ μ − ν ∥ T V = 1 2 ( μ ( B ) − ν ( B ) + ν ( B ‾ ) − μ ( B ‾ ) ) = 1 2 ∑ x ∈ Ω ∣ μ ( x ) − ν ( x ) ∣ = 1 2 ∥ μ − ν ∥ 1 \begin{aligned} \| \mu -\nu \| _{TV} & =\frac{1}{2}( \mu ( B) -\nu ( B) +\nu (\overline{B}) -\mu (\overline{B}))\\ & =\frac{1}{2}\sum _{x\in \Omega } |\mu ( x) -\nu ( x) |\\ & =\frac{1}{2} \| \mu -\nu \| _{1} \end{aligned} ∥μ−ν∥TV=21(μ(B)−ν(B)+ν(B)−μ(B))=21x∈Ω∑∣μ(x)−ν(x)∣=21∥μ−ν∥1
以上就证明了,在离散或者可数集下,TV距离可以使用l1范数代替。
Pertubation Theory Proof of Policy Improvement Bound
接下来我们开始证明,论文里面提供两种证明方法,我们这里主要介绍第二种基于Pertubation Theory的证明方法。
定理1 令 α = max s D T V ( π ( ⋅ ∣ s ) ∥ π ~ ( ⋅ ∣ s ) ) \displaystyle \alpha =\max_{s} D_{TV}\left( \pi ( \cdot |s) \| \tilde{\pi }( \cdot |s)\right) α=smaxDTV(π(⋅∣s)∥π~(⋅∣s)),则下面的不等式成立
η ( π new ) ≥ L π old ( π new ) − 4 ϵ γ ( 1 − γ ) 2 α 2 where ϵ = max s , a ∣ A π ( s , a ) ∣ , \left. \begin{array}{ c } \eta (\pi _{\text{ new }} )\geq L_{\pi _{\text{ old }}} (\pi _{\text{ new }} )-\frac{4\epsilon \gamma }{(1-\gamma )^{2}} \alpha ^{2}\\ \text{ where } \epsilon =\operatorname{max}_{s,a} |A_{\pi } (s,a)|, \end{array}\right. η(π new )≥Lπ old (π new )−(1−γ)24ϵγα2 where ϵ=maxs,a∣Aπ(s,a)∣,
证明:令 P π ( s ′ ∣ s ) = ∑ a ∈ A M ( s ′ ∣ s , a ) π ( a ∣ s ) P_{\pi }( s'|s) =\sum _{a\in A} M( s'|s,a) \pi ( a|s) Pπ(s′∣s)=∑a∈AM(s′∣s,a)π(a∣s),这里我们用 M \displaystyle M M表示的这个MDP model的概率转移分布, P π ( s ′ ∣ s ) \displaystyle P_{\pi }( s'|s) Pπ(s′∣s)表示在给定某个策略下而形成的state之间的转移概率。且 P π \displaystyle P_{\pi } Pπ是一个 ∣ S ∣ × ∣ S ∣ \displaystyle |S|\times |S| ∣S∣×∣S∣的矩阵。首先回忆一下
η ( π ) = E s 0 , s 1 , s 2 . . . [ ∑ t = 0 ∞ γ t r ( s t ) ] = ∑ s 0 P ( s 0 ) γ 0 r ( s 0 ) + ∑ s 0 p ( s 0 ) ∑ s 1 ∑ a 0 P ( s 1 ∣ s 0 , a 0 ) π ( a 0 ∣ s 0 ) γ 1 r ( s 1 ) + ∑ s 0 p ( s 0 ) ∑ s 1 ∑ a 0 P ( s 1 ∣ s 0 , a 0 ) π ( a 0 ∣ s 0 ) ∑ s 2 ∑ a 1 p ( s 2 ∣ s 1 , a 1 ) π ( a 1 ∣ s 1 ) γ 2 r ( s 2 ) + . . . = ∑ s ∈ S P ( s 0 = s ) γ 0 r ( s ) + ∑ s ∈ S p ( s 0 = s ) ∑ s ′ ∈ S ∑ a ∈ A P ( s ′ ∣ s , a ) π ( a ∣ s ) ⏟ P π ( s ′ ∣ s ) γ 1 r ( s ′ ) + ∑ s ∈ S p ( s 0 = s ) ∑ s ′ ∈ S ∑ a ∈ A P ( s ′ ∣ s , a ) π ( a ∣ s ) ⏟ P π ( s ′ ∣ s ) ∑ s ′ ′ ∈ S ∑ a ′ ∈ A p ( s ′ ′ ∣ s ′ , a ′ ) π ( a ′ ∣ s ′ ) ⏟ P π ( s ′ ′ ∣ s ′ ) γ 2 r ( s ′ ′ ) + . . . = ( r ( 1 ) r ( 2 ) ⋯ r ( n ) ) ( I + γ ( P π ( 1 ∣ 1 ) P π ( 1 ∣ 2 ) ⋯ P π ( 1 ∣ n ) P π ( 2 ∣ 1 ) P π ( 2 ∣ 2 ) ⋯ P π ( 2 ∣ n ) ⋮ ⋮ ⋮ P π ( n ∣ 1 ) P π ( n ∣ 2 ) ⋯ P π ( n ∣ n ) ) + γ 2 ( P π ( 1 ∣ 1 ) P π ( 1 ∣ 2 ) ⋯ P π ( 1 ∣ n ) P π ( 2 ∣ 1 ) P π ( 2 ∣ 2 ) ⋯ P π ( 2 ∣ n ) ⋮ ⋮ ⋮ P π ( n ∣ 1 ) P π ( n ∣ 2 ) ⋯ P π ( n ∣ n ) ) 2 + . . . ) ( P ( s 0 = 1 ) P ( s 0 = 2 ) ⋮ P ( s 0 = n ) ) = r ( I − γ P π ) − 1 ρ 0 \begin{aligned} \eta ( \pi ) & =\mathbb{E}_{s_{0} ,s_{1} ,s_{2} ...}\left[\sum _{t=0}^{\infty } \gamma ^{t} r( s_{t})\right]\\ & =\sum _{s_{0}} P( s_{0}) \gamma ^{0} r( s_{0}) +\sum _{s_{0}} p( s_{0})\sum _{s_{1}}\sum _{a_{0}} P( s_{1} |s_{0} ,a_{0}) \pi ( a_{0} |s_{0}) \gamma ^{1} r( s_{1}) +\sum _{s_{0}} p( s_{0})\sum _{s_{1}}\sum _{a_{0}} P( s_{1} |s_{0} ,a_{0}) \pi ( a_{0} |s_{0})\sum _{s_{2}}\sum _{a_{1}} p( s_{2} |s_{1} ,a_{1}) \pi ( a_{1} |s_{1}) \gamma ^{2} r( s_{2}) +...\\ & =\sum _{s\in S} P( s_{0} =s) \gamma ^{0} r( s) +\sum _{s\in S} p( s_{0} =s)\sum _{s'\in S}\sum _{a\in A}\underbrace{P( s'|s,a) \pi ( a|s)}_{P_{\pi }( s'|s)} \gamma ^{1} r( s') +\sum _{s\in S} p( s_{0} =s)\sum _{s'\in S}\sum _{a\in A}\underbrace{P( s'|s,a) \pi ( a|s)}_{P_{\pi }( s'|s)}\sum _{s''\in S}\sum _{a'\in A}\underbrace{p( s''|s',a') \pi ( a'|s')}_{P_{\pi }( s''|s')} \gamma ^{2} r( s'') +...\\ & =\begin{pmatrix} r( 1) & r( 2) & \cdots & r( n) \end{pmatrix}\left( I+\gamma \begin{pmatrix} P_{\pi }( 1|1) & P_{\pi }( 1|2) & \cdots & P_{\pi }( 1|n)\\ P_{\pi }( 2|1) & P_{\pi }( 2|2) & \cdots & P_{\pi }( 2|n)\\ \vdots & \vdots & & \vdots \\ P_{\pi }( n|1) & P_{\pi }( n|2) & \cdots & P_{\pi }( n|n) \end{pmatrix} +\gamma ^{2}\begin{pmatrix} P_{\pi }( 1|1) & P_{\pi }( 1|2) & \cdots & P_{\pi }( 1|n)\\ P_{\pi }( 2|1) & P_{\pi }( 2|2) & \cdots & P_{\pi }( 2|n)\\ \vdots & \vdots & & \vdots \\ P_{\pi }( n|1) & P_{\pi }( n|2) & \cdots & P_{\pi }( n|n) \end{pmatrix}^{2} +...\right)\begin{pmatrix} P( s_{0} =1)\\ P( s_{0} =2)\\ \vdots \\ P( s_{0} =n) \end{pmatrix}\\ & =r( I-\gamma P_{\pi })^{-1} \rho _{0} \end{aligned} η(π)=Es0,s1,s2...[t=0∑∞γtr(st)]=s0∑P(s0)γ0r(s0)+s0∑p(s0)s1∑a0∑P(s1∣s0,a0)π(a0∣s0)γ1r(s1)+s0∑p(s0)s1∑a0∑P(s1∣s0,a0)π(a0∣s0)s2∑a1∑p(s2∣s1,a1)π(a1∣s1)γ2r(s2)+...=s∈S∑P(s0=s)γ0r(s)+s∈S∑p(s0=s)s′∈S∑a∈A∑Pπ(s′∣s) P(s′∣s,a)π(a∣s)γ1r(s′)+s∈S∑p(s0=s)s′∈S∑a∈A∑Pπ(s′∣s) P(s′∣s,a)π(a∣s)s′′∈S∑a′∈A∑Pπ(s′′∣s′) p(s′′∣s′,a′)π(a′∣s′)γ2r(s′′)+...=(r(1)r(2)⋯r(n))⎝⎜⎜⎜⎛I+γ⎝⎜⎜⎜⎛Pπ(1∣1)Pπ(2∣1)⋮Pπ(n∣1)Pπ(1∣2)Pπ(2∣2)⋮Pπ(n∣2)⋯⋯⋯Pπ(1∣n)Pπ(2∣n)⋮Pπ(n∣n)⎠⎟⎟⎟⎞+γ2⎝⎜⎜⎜⎛Pπ(1∣1)Pπ(2∣1)⋮Pπ(n∣1)Pπ(1∣2)Pπ(2∣2)⋮Pπ(n∣2)⋯⋯⋯Pπ(1∣n)Pπ(2∣n)⋮Pπ(n∣n)⎠⎟⎟⎟⎞2+...⎠⎟⎟⎟⎞⎝⎜⎜⎜⎛P(s0=1)P(s0=2)⋮P(s0=n)⎠⎟⎟⎟⎞=r(I−γPπ)−1ρ0
从求和到矩阵的转换可能有点烧脑,我们讲解一下。首先第一个等号是正常的展开,我们对每个时间求期望。第二个等式则是利用了状态数量有限(可数)的性质,对于相同的state,虽然发生的时间不同,但因为分布P和 π \displaystyle \pi π其实跟时间无关,所以我们写成编列所有state。假设state一种有n种,于是我们就可以得到了矩阵的表达式,我们拿一项出来验证一下对不对:
( r ( 1 ) r ( 2 ) ⋯ r ( n ) ) γ ( P π ( 1 ∣ 1 ) P π ( 1 ∣ 2 ) ⋯ P π ( 1 ∣ n ) P π ( 2 ∣ 1 ) P π ( 2 ∣ 2 ) ⋯ P π ( 2 ∣ n ) ⋮ ⋮ ⋮ P π ( n ∣ 1 ) P π ( n ∣ 2 ) ⋯ P π ( n ∣ n ) ) ( P ( s 0 = 1 ) P ( s 0 = 2 ) ⋮ P ( s 0 = n ) ) = γ ( ∑ s ′ P π ( s ′ ∣ 1 ) r ( s ′ ) ∑ s ′ P π ( s ′ ∣ 2 ) r ( s ′ ) ⋯ ∑ s ′ P π ( s ′ ∣ n ) r ( s ′ ) ) ( P ( s 0 = 1 ) P ( s 0 = 2 ) ⋮ P ( s 0 = n ) ) = ∑ s P ( s 0 = s ) ∑ s ′ P π ( s ′ ∣ s ) γ r ( s ′ ) (1) \begin{aligned} & \begin{pmatrix} r( 1) & r( 2) & \cdots & r( n) \end{pmatrix} \gamma \begin{pmatrix} P_{\pi }( 1|1) & P_{\pi }( 1|2) & \cdots & P_{\pi }( 1|n)\\ P_{\pi }( 2|1) & P_{\pi }( 2|2) & \cdots & P_{\pi }( 2|n)\\ \vdots & \vdots & & \vdots \\ P_{\pi }( n|1) & P_{\pi }( n|2) & \cdots & P_{\pi }( n|n) \end{pmatrix}\begin{pmatrix} P( s_{0} =1)\\ P( s_{0} =2)\\ \vdots \\ P( s_{0} =n) \end{pmatrix}\\ = & \gamma \begin{pmatrix} \sum _{s'} P_{\pi }( s'|1) r( s') & \sum _{s'} P_{\pi }( s'|2) r( s') & \cdots & \sum _{s'} P_{\pi }( s'|n) r( s') \end{pmatrix}\begin{pmatrix} P( s_{0} =1)\\ P( s_{0} =2)\\ \vdots \\ P( s_{0} =n) \end{pmatrix}\\ = & \sum _{s} P( s_{0} =s)\sum _{s'} P_{\pi }( s'|s) \gamma r( s') \end{aligned} \tag{1} ==(r(1)r(2)⋯r(n))γ⎝⎜⎜⎜⎛Pπ(1∣1)Pπ(2∣1)⋮Pπ(n∣1)Pπ(1∣2)Pπ(2∣2)⋮Pπ(n∣2)⋯⋯⋯Pπ(1∣n)Pπ(2∣n)⋮Pπ(n∣n)⎠⎟⎟⎟⎞⎝⎜⎜⎜⎛P(s0=1)P(s0=2)⋮P(s0=n)⎠⎟⎟⎟⎞γ(∑s′Pπ(s′∣1)r(s′)∑s′Pπ(s′∣2)r(s′)⋯∑s′Pπ(s′∣n)r(s′))⎝⎜⎜⎜⎛P(s0=1)P(s0=2)⋮P(s0=n)⎠⎟⎟⎟⎞s∑P(s0=s)s′∑Pπ(s′∣s)γr(s′)(1)
你会发现这个东西跟公式的第二项对应上了。我们看看第三项,
( r ( 1 ) r ( 2 ) ⋯ r ( n ) ) γ 2 ( P π ( 1 ∣ 1 ) P π ( 1 ∣ 2 ) ⋯ P π ( 1 ∣ n ) P π ( 2 ∣ 1 ) P π ( 2 ∣ 2 ) ⋯ P π ( 2 ∣ n ) ⋮ ⋮ ⋮ P π ( n ∣ 1 ) P π ( n ∣ 2 ) ⋯ P π ( n ∣ n ) ) 2 ( P ( s 0 = 1 ) P ( s 0 = 2 ) ⋮ P ( s 0 = n ) ) = γ 2 ( ∑ s ′ ′ P π ( s ′ ′ ∣ 1 ) r ( s ′ ′ ) ∑ s ′ ′ P π ( s ′ ′ ∣ 2 ) r ( s ′ ′ ) ⋯ ∑ s ′ ′ P π ( s ′ ′ ∣ n ) r ( s ′ ′ ) ) ( P π ( 1 ∣ 1 ) P π ( 1 ∣ 2 ) ⋯ P π ( 1 ∣ n ) P π ( 2 ∣ 1 ) P π ( 2 ∣ 2 ) ⋯ P π ( 2 ∣ n ) ⋮ ⋮ ⋮ P π ( n ∣ 1 ) P π ( n ∣ 2 ) ⋯ P π ( n ∣ n ) ) ( P ( s 0 = 1 ) P ( s 0 = 2 ) ⋮ P ( s 0 = n ) ) = γ 2 ( ∑ s ′ P π ( s ′ ∣ 1 ) ∑ s ′ ′ P π ( s ′ ′ ∣ s ′ ) r ( s ′ ′ ) ∑ s ′ P π ( s ′ ∣ 2 ) ∑ s ′ ′ P π ( s ′ ′ ∣ s ′ ) r ( s ′ ′ ) ⋯ ∑ s ′ P π ( s ′ ∣ n ) ∑ s ′ ′ P π ( s ′ ′ ∣ s ′ ) r ( s ′ ′ ) ) ( P ( s 0 = 1 ) P ( s 0 = 2 ) ⋮ P ( s 0 = n ) ) = ∑ s P ( s 0 = s ) ∑ s ′ P π ( s ′ ∣ s ) ∑ s ′ ′ P π ( s ′ ′ ∣ s ′ ) γ 2 r ( s ′ ′ ) (2) \begin{aligned} & \begin{pmatrix} r( 1) & r( 2) & \cdots & r( n) \end{pmatrix} \gamma ^{2}\begin{pmatrix} P_{\pi }( 1|1) & P_{\pi }( 1|2) & \cdots & P_{\pi }( 1|n)\\ P_{\pi }( 2|1) & P_{\pi }( 2|2) & \cdots & P_{\pi }( 2|n)\\ \vdots & \vdots & & \vdots \\ P_{\pi }( n|1) & P_{\pi }( n|2) & \cdots & P_{\pi }( n|n) \end{pmatrix}^{2}\begin{pmatrix} P( s_{0} =1)\\ P( s_{0} =2)\\ \vdots \\ P( s_{0} =n) \end{pmatrix}\\ = & \gamma ^{2}\begin{pmatrix} \sum _{s''} P_{\pi }( s''|1) r( s'') & \sum _{s''} P_{\pi }( s''|2) r( s'') & \cdots & \sum _{s''} P_{\pi }( s''|n) r( s'') \end{pmatrix}\begin{pmatrix} P_{\pi }( 1|1) & P_{\pi }( 1|2) & \cdots & P_{\pi }( 1|n)\\ P_{\pi }( 2|1) & P_{\pi }( 2|2) & \cdots & P_{\pi }( 2|n)\\ \vdots & \vdots & & \vdots \\ P_{\pi }( n|1) & P_{\pi }( n|2) & \cdots & P_{\pi }( n|n) \end{pmatrix}\begin{pmatrix} P( s_{0} =1)\\ P( s_{0} =2)\\ \vdots \\ P( s_{0} =n) \end{pmatrix}\\ = & \gamma ^{2}\begin{pmatrix} \sum _{s'} P_{\pi }( s'|1)\sum _{s''} P_{\pi }( s''|s') r( s'') & \sum _{s'} P_{\pi }( s'|2)\sum _{s''} P_{\pi }( s''|s') r( s'') & \cdots & \sum _{s'} P_{\pi }( s'|n)\sum _{s''} P_{\pi }( s''|s') r( s'') \end{pmatrix}\begin{pmatrix} P( s_{0} =1)\\ P( s_{0} =2)\\ \vdots \\ P( s_{0} =n) \end{pmatrix}\\ = & \sum _{s} P( s_{0} =s)\sum _{s'} P_{\pi }( s'|s)\sum _{s''} P_{\pi }( s''|s') \gamma ^{2} r( s'') \end{aligned} \tag{2} ===(r(1)r(2)⋯r(n))γ2⎝⎜⎜⎜⎛Pπ(1∣1)Pπ(2∣1)⋮Pπ(n∣1)Pπ(1∣2)Pπ(2∣2)⋮Pπ(n∣2)⋯⋯⋯Pπ(1∣n)Pπ(2∣n)⋮Pπ(n∣n)⎠⎟⎟⎟⎞2⎝⎜⎜⎜⎛P(s0=1)P(s0=2)⋮P(s0=n)⎠⎟⎟⎟⎞γ2(∑s′′Pπ(s′′∣1)r(s′′)∑s′′Pπ(s′′∣2)r(s′′)⋯∑s′′Pπ(s′′∣n)r(s′′))⎝⎜⎜⎜⎛Pπ(1∣1)Pπ(2∣1)⋮Pπ(n∣1)Pπ(1∣2)Pπ(2∣2)⋮Pπ(n∣2)⋯⋯⋯Pπ(1∣n)Pπ(2∣n)⋮Pπ(n∣n)⎠⎟⎟⎟⎞⎝⎜⎜⎜⎛P(s0=1)P(s0=2)⋮P(s0=n)⎠⎟⎟⎟⎞γ2(∑s′Pπ(s′∣1)∑s′′Pπ(s′′∣s′)r(s′′)∑s′Pπ(s′∣2)∑s′′Pπ(s′′∣s′)r(s′′)⋯∑s′Pπ(s′∣n)∑s′′Pπ(s′′∣s′)r(s′′))⎝⎜⎜⎜⎛P(s0=1)P(s0=2)⋮P(s0=n)⎠⎟⎟⎟⎞s∑P(s0=s)s′∑Pπ(s′∣s)s′′∑Pπ(s′′∣s′)γ2r(s′′)(2)
令 G = ( 1 + γ P π + ( γ P π ) 2 + … ) = ( 1 − γ P π ) − 1 G=\left( 1+\gamma P_{\pi } +( \gamma P_{\pi })^{2} +\dotsc \right) =( 1-\gamma P_{\pi })^{-1} G=(1+γPπ+(γPπ)2+…)=(1−γPπ)−1, G ~ = ( 1 + γ P π ~ + ( γ P π ~ ) 2 + … ) = ( 1 − γ P π ~ ) − 1 \tilde{G} =\left( 1+\gamma P_{\tilde{\pi }} +( \gamma P_{\tilde{\pi }})^{2} +\dotsc \right) =( 1-\gamma P_{\tilde{\pi }})^{-1} G~=(1+γPπ~+(γPπ~)2+…)=(1−γPπ~)−1,根据上面的分析,我们有, η ( π ) = r ( I − γ P π ) − 1 ρ 0 \displaystyle \eta (\pi )=r( I-\gamma P_{\pi })^{-1} \rho _{0} η(π)=r(I−γPπ)−1ρ0, η ( π ~ ) = r ( I − γ P π ~ ) − 1 ρ 0 \displaystyle \eta (\tilde{\pi } )=r( I-\gamma P_{\tilde{\pi }})^{-1} \rho _{0} η(π~)=r(I−γPπ~)−1ρ0,
为了方便我们设
G − 1 − G ~ − 1 = ( 1 − γ P π ) − ( 1 − γ P π ~ ) = γ ( P π ~ − P π ) = γ Δ G^{-1} -\tilde{G}^{-1} =( 1-\gamma P_{\pi }) -( 1-\gamma P_{\tilde{\pi }}) =\gamma ( P_{\tilde{\pi }} -P_{\pi }) =\gamma \Delta G−1−G~−1=(1−γPπ)−(1−γPπ~)=γ(Pπ~−Pπ)=γΔ
于是对其左乘 G \displaystyle G G,又乘 G ~ \displaystyle \tilde{G} G~:
G ~ − G = γ G Δ G ~ G ~ = G + γ G Δ G ~ \tilde{G} -G=\gamma G\Delta \tilde{G}\\ \tilde{G} =G+\gamma G\Delta \tilde{G} G~−G=γGΔG~G~=G+γGΔG~
将 G ~ \displaystyle \tilde{G} G~代进去
G ~ − G = γ G Δ ( G + γ G Δ G ~ ) G ~ = G + γ G Δ G + γ 2 G Δ G Δ G ~ \tilde{G} -G=\gamma G\Delta \left( G+\gamma G\Delta \tilde{G}\right)\\ \tilde{G} =G+\gamma G\Delta G+\gamma ^{2} G\Delta G\Delta \tilde{G} G~−G=γGΔ(G+γGΔG~)G~=G+γGΔG+γ2GΔGΔG~
于是
η ( π ~ ) − η ( π ) = r ( G ~ − G ) ρ 0 = γ r G Δ G ρ 0 + γ 2 r G Δ G Δ G ~ ρ 0 \eta (\tilde{\pi } )-\eta (\pi )=r\left(\tilde{G} -G\right) \rho _{0} =\gamma rG\Delta G\rho _{0} +\gamma ^{2} rG\Delta G\Delta \tilde{G} \rho _{0} η(π~)−η(π)=r(G~−G)ρ0=γrGΔGρ0+γ2rGΔGΔG~ρ0
对于第一项, r G = v \displaystyle rG=v rG=v是state value vector,见公式(1)(2),显然其每个元素对应着不同state的value function,另外 G ρ 0 = ρ π \displaystyle G\rho _{0} =\rho _{\pi } Gρ0=ρπ. 所以 γ r G Δ G ρ 0 = γ v Δ ρ π \displaystyle \gamma rG\Delta G\rho _{0} =\gamma v\Delta \rho _{\pi } γrGΔGρ0=γvΔρπ,接下来我们将证明 L π ( π ~ ) − L π ( π ) \displaystyle L_{\pi } (\tilde{\pi } )-L_{\pi } (\pi ) Lπ(π~)−Lπ(π)两者的差距正是 γ v Δ ρ π \displaystyle \gamma v\Delta \rho _{\pi } γvΔρπ:
L π ( π ~ ) − L π ( π ) = ∑ s ρ π ( s ) ∑ a ( π ~ ( a ∣ s ) − π ( a ∣ s ) ) A π ( s , a ) = ∑ s ρ π ( s ) ∑ a ( π ~ ( a ∣ s ) − π ( a ∣ s ) ) ( Q π ( s , a ) − V π ( s ) ) = ∑ s ρ π ( s ) ∑ a ( π ~ ( a ∣ s ) − π ( a ∣ s ) ) ( E s ′ ∼ p ( s ′ ∣ s , a ) [ r ( s ) + γ v π ( s ′ ) − v π ( s ) ] ) = ∑ s ρ π ( s ) ∑ a ( π ~ ( a ∣ s ) − π ( a ∣ s ) ) [ r ( s ) + ∑ s ′ p ( s ′ ∣ s , a ) γ v π ( s ′ ) − v π ( s ) ] = ∑ s ρ π ( s ) ∑ a ( π ~ ( a ∣ s ) − π ( a ∣ s ) ) ⏟ = ∑ a π ~ ( a ∣ s ) − ∑ a π ( a ∣ s ) = 1 − 1 = 0 r ( s ) + ∑ s ρ π ( s ) ∑ s ′ ∑ a ( π ~ ( a ∣ s ) − π ( a ∣ s ) ) p ( s ′ ∣ s , a ) γ v π ( s ′ ) − ∑ s ρ π ( s ) ∑ a ( π ~ ( a ∣ s ) − π ( a ∣ s ) ) ⏟ = ∑ a π ~ ( a ∣ s ) − ∑ a π ( a ∣ s ) = 1 − 1 = 0 v π ( s ′ ) = ∑ s ρ π ( s ) ∑ s ′ ∑ a ( π ( a ∣ s ) − π ~ ( a ∣ s ) ) p ( s ′ ∣ s , a ) γ v π ( s ′ ) = ∑ s ρ π ( s ) ∑ s ′ ( p π ( s ′ ∣ s ) − p π ~ ( s ′ ∣ s ) ) γ v π ( s ′ ) = γ v Δ ρ π \begin{aligned} L_{\pi } (\tilde{\pi } )-L_{\pi } (\pi ) & =\sum _{s} \rho _{\pi } (s)\sum _{a} (\tilde{\pi } (a\mid s)-\pi (a\mid s))A_{\pi } (s,a)\\ & =\sum _{s} \rho _{\pi } (s)\sum _{a} (\tilde{\pi } (a\mid s)-\pi (a\mid s))\left( Q^{\pi }( s,a) -V^{\pi }( s)\right)\\ & =\sum _{s} \rho _{\pi } (s)\sum _{a} (\tilde{\pi } (a\mid s)-\pi (a\mid s))( E_{s'\sim p( s'|s,a)}[ r( s) +\gamma v_{\pi }( s') -v_{\pi }( s)])\\ & =\sum _{s} \rho _{\pi } (s)\sum _{a}\left(\tilde{\pi } (a\mid s)-\pi (a\mid s)\right)\left[ r(s)+\sum _{s^{\prime }} p\left( s^{\prime } \mid s,a\right) \gamma v_{\pi }\left( s^{\prime }\right) -v_{\pi } (s)\right]\\ & =\sum _{s} \rho _{\pi } (s)\underbrace{\sum _{a}\left(\tilde{\pi } (a\mid s)-\pi (a\mid s)\right)}_{=\sum _{a}\tilde{\pi } (a\mid s)-\sum _{a} \pi (a\mid s)=1-1=0} r(s)\\ & +\sum _{s} \rho _{\pi } (s)\sum _{s^{\prime }}\sum _{a}\left(\tilde{\pi } (a\mid s)-\pi (a\mid s)\right) p\left( s^{\prime } \mid s,a\right) \gamma v_{\pi }\left( s^{\prime }\right)\\ & -\sum _{s} \rho _{\pi } (s)\underbrace{\sum _{a}\left(\tilde{\pi } (a\mid s)-\pi (a\mid s)\right)}_{=\sum _{a}\tilde{\pi } (a\mid s)-\sum _{a} \pi (a\mid s)=1-1=0} v_{\pi }\left( s^{\prime }\right)\\ & =\sum _{s} \rho _{\pi } (s)\sum _{s^{\prime }}\sum _{a} (\pi (a\mid s)-\tilde{\pi } (a\mid s))p\left( s^{\prime } \mid s,a\right) \gamma v_{\pi }\left( s^{\prime }\right)\\ & =\sum _{s} \rho _{\pi } (s)\sum _{s^{\prime }}\left( p_{\pi }\left( s^{\prime } \mid s\right) -p_{\tilde{\pi }}\left( s^{\prime } \mid s\right)\right) \gamma v_{\pi }\left( s^{\prime }\right)\\ & =\gamma v\Delta \rho _{\pi } \end{aligned} Lπ(π~)−Lπ(π)=s∑ρπ(s)a∑(π~(a∣s)−π(a∣s))Aπ(s,a)=s∑ρπ(s)a∑(π~(a∣s)−π(a∣s))(Qπ(s,a)−Vπ(s))=s∑ρπ(s)a∑(π~(a∣s)−π(a∣s))(Es′∼p(s′∣s,a)[r(s)+γvπ(s′)−vπ(s)])=s∑ρπ(s)a∑(π~(a∣s)−π(a∣s))[r(s)+s′∑p(s′∣s,a)γvπ(s′)−vπ(s)]=s∑ρπ(s)=∑aπ~(a∣s)−∑aπ(a∣s)=1−1=0 a∑(π~(a∣s)−π(a∣s))r(s)+s∑ρπ(s)s′∑a∑(π~(a∣s)−π(a∣s))p(s′∣s,a)γvπ(s′)−s∑ρπ(s)=∑aπ~(a∣s)−∑aπ(a∣s)=1−1=0 a∑(π~(a∣s)−π(a∣s))vπ(s′)=s∑ρπ(s)s′∑a∑(π(a∣s)−π~(a∣s))p(s′∣s,a)γvπ(s′)=s∑ρπ(s)s′∑(pπ(s′∣s)−pπ~(s′∣s))γvπ(s′)=γvΔρπ
于是我们就有了表达式:
η ( π ~ ) − η ( π ) = L π ( π ~ ) − L π ( π ) + γ 2 r G Δ G Δ G ~ ρ 0 \eta (\tilde{\pi } )-\eta (\pi )=L_{\pi } (\tilde{\pi } )-L_{\pi } (\pi )+\gamma ^{2} rG\Delta G\Delta \tilde{G} \rho _{0} η(π~)−η(π)=Lπ(π~)−Lπ(π)+γ2rGΔGΔG~ρ0
又因为 L π ( π ) = η ( π ) \displaystyle L_{\pi } (\pi )=\eta (\pi ) Lπ(π)=η(π),于是我们有
η ( π ~ ) = L π ( π ~ ) + γ 2 r G Δ G Δ G ~ ρ 0 \begin{aligned} \eta (\tilde{\pi } ) & =L_{\pi } (\tilde{\pi } )+\gamma ^{2} rG\Delta G\Delta \tilde{G} \rho _{0} \end{aligned} η(π~)=Lπ(π~)+γ2rGΔGΔG~ρ0
因此,我们只要能够确定最后一项的范围,我们就能找到 η ( π ~ ) \displaystyle \eta (\tilde{\pi } ) η(π~)和 L π ( π ~ ) \displaystyle L_{\pi } (\tilde{\pi } ) Lπ(π~)之间的关系,即
∣ η ( π ~ ) − L π ( π ~ ) ∣ = ∣ γ 2 r G Δ G Δ G ~ ρ 0 ∣ ⩽ ? \begin{aligned} |\eta (\tilde{\pi } )-L_{\pi } (\tilde{\pi } )| & =|\gamma ^{2} rG\Delta G\Delta \tilde{G} \rho _{0} |\\ & \leqslant ? \end{aligned} ∣η(π~)−Lπ(π~)∣=∣γ2rGΔGΔG~ρ0∣⩽?
接下来我们研究下最后一项范围:
γ 2 ∣ r G Δ G Δ G ~ ρ ∣ ≤ γ ∥ γ r G Δ ∥ ∞ ∥ G Δ G ~ ρ 0 ∥ 1 ≤ γ ∥ v Δ ∥ ∞ ∥ G Δ G ~ ρ 0 ∥ 1 \begin{aligned} \gamma ^{2} |rG\Delta G\Delta \tilde{G} \rho | & \leq \gamma \| \gamma rG\Delta \| _{\infty } \| G\Delta \tilde{G} \rho _{0} \| _{1}\\ & \leq \gamma \| v\Delta \| _{\infty } \| G\Delta \tilde{G} \rho _{0} \| _{1} \end{aligned} γ2∣rGΔGΔG~ρ∣≤γ∥γrGΔ∥∞∥GΔG~ρ0∥1≤γ∥vΔ∥∞∥GΔG~ρ0∥1
对于第一项 ∥ v Δ ∥ ∞ \displaystyle \| v\Delta \| _{\infty } ∥vΔ∥∞,我们只需要知道其中的最大值即可,那么其中的某个s元素为:
∣ ( γ v Δ ) s ∣ = ∣ ∑ s ′ ( p π ( s ′ ∣ s ) − p π ~ ( s ′ ∣ s ) ) γ v π ( s ′ ) ∣ = ∣ ∑ s ′ ( ∑ a π ( s , a ) p ( s ′ ∣ s , a ) − ∑ a π ~ ( s , a ) p ( s ′ ∣ s , a ) ) γ v π ( s ′ ) ∣ = ∣ ∑ a π ( s , a ) ∑ s ′ p ( s ′ ∣ s , a ) γ v π ( s ′ ) − ∑ a π ~ ( s , a ) ∑ s ′ p ( s ′ ∣ s , a ) γ v π ( s ′ ) ∣ = ∣ ∑ a π ( s , a ) ( Q ( s , a ) − r ( s ) ) − ∑ a π ~ ( s , a ) ( Q ( s , a ) − r ( s ) ) ∣ = ∣ ∑ a ( π ( s , a ) − π ~ ( s , a ) ) Q ( s , a ) − ∑ a ( π ( s , a ) − π ~ ( s , a ) ) ⏟ = ∑ a π ( s , a ) − ∑ a π ~ ( s , a ) = 0 r ( s ) ∣ = ∣ ∑ a ( π ~ ( s , a ) − π ( s , a ) ) Q π ( s , a ) ∣ = ∣ ∑ a ( π ~ ( s , a ) − π ( s , a ) ) Q π ( s , a ) − ∑ a ( π ~ ( s , a ) − π ( s , a ) ) V π ( s ) ∣ = ∣ ∑ a ( π ~ ( s , a ) − π ( s , a ) ) A π ( s , a ) ∣ ≤ ∑ a ∣ π ~ ( s , a ) − π ( s , a ) ∣ ⋅ max a ∣ A π ( s , a ) ∣ ≤ 2 α ϵ \begin{aligned} | (\gamma v\Delta )_{s}| & =\left| \sum _{s^{\prime }}\left( p_{\pi }\left( s^{\prime } \mid s\right) -p_{\tilde{\pi }}\left( s^{\prime } \mid s\right)\right) \gamma v_{\pi }\left( s^{\prime }\right)\right| \\ & =\left| \sum _{s^{\prime }}\left(\sum _{a} \pi (s,a)p\left( s^{\prime } \mid s,a\right) -\sum _{a}\tilde{\pi } (s,a)p\left( s^{\prime } \mid s,a\right)\right) \gamma v_{\pi }\left( s^{\prime }\right)\right| \\ & =\left| \sum _{a} \pi (s,a)\sum _{s^{\prime }} p\left( s^{\prime } \mid s,a\right) \gamma v_{\pi }\left( s^{\prime }\right) -\sum _{a}\tilde{\pi } (s,a)\sum _{s^{\prime }} p\left( s^{\prime } \mid s,a\right) \gamma v_{\pi }\left( s^{\prime }\right)\right| \\ & =\left| \sum _{a} \pi (s,a)( Q( s,a) -r( s)) -\sum _{a}\tilde{\pi } (s,a)( Q( s,a) -r( s))\right| \\ & =\left| \sum _{a}\left( \pi (s,a)-\tilde{\pi } (s,a)\right) Q( s,a) -\underbrace{\sum _{a}\left( \pi (s,a)-\tilde{\pi } (s,a)\right)}_{=\sum _{a} \pi (s,a)-\sum _{a}\tilde{\pi } (s,a)=0} r( s)\right| \\ & =\left| \sum _{a} (\tilde{\pi } (s,a)-\pi (s,a))Q_{\pi } (s,a)\right| \\ & =\left| \sum _{a} (\tilde{\pi } (s,a)-\pi (s,a))Q_{\pi } (s,a)-\sum _{a} (\tilde{\pi } (s,a)-\pi (s,a))V^{\pi }( s)\right| \\ & =\left| \sum _{a} (\tilde{\pi } (s,a)-\pi (s,a))A_{\pi } (s,a)\right| \\ & \leq \sum _{a} |\tilde{\pi } (s,a)-\pi (s,a)|\cdot \max_{a}| A_{\pi } (s,a)| \\ & \leq 2\alpha \epsilon \end{aligned} ∣(γvΔ)s∣=∣∣∣∣∣s′∑(pπ(s′∣s)−pπ~(s′∣s))γvπ(s′)∣∣∣∣∣=∣∣∣∣∣s′∑(a∑π(s,a)p(s′∣s,a)−a∑π~(s,a)p(s′∣s,a))γvπ(s′)∣∣∣∣∣=∣∣∣∣∣a∑π(s,a)s′∑p(s′∣s,a)γvπ(s′)−a∑π~(s,a)s′∑p(s′∣s,a)γvπ(s′)∣∣∣∣∣=∣∣∣∣∣a∑π(s,a)(Q(s,a)−r(s))−a∑π~(s,a)(Q(s,a)−r(s))∣∣∣∣∣=∣∣∣∣∣∣∣∣∣∣a∑(π(s,a)−π~(s,a))Q(s,a)−=∑aπ(s,a)−∑aπ~(s,a)=0 a∑(π(s,a)−π~(s,a))r(s)∣∣∣∣∣∣∣∣∣∣=∣∣∣∣∣a∑(π~(s,a)−π(s,a))Qπ(s,a)∣∣∣∣∣=∣∣∣∣∣a∑(π~(s,a)−π(s,a))Qπ(s,a)−a∑(π~(s,a)−π(s,a))Vπ(s)∣∣∣∣∣=∣∣∣∣∣a∑(π~(s,a)−π(s,a))Aπ(s,a)∣∣∣∣∣≤a∑∣π~(s,a)−π(s,a)∣⋅amax∣Aπ(s,a)∣≤2αϵ
这里 α = max s D T V ( π ( ⋅ ∣ s ) ∥ π ~ ( ⋅ ∣ s ) ) , ϵ = max a , s ∣ A π ( s , a ) ∣ \displaystyle \alpha =\max_{s} D_{TV}\left( \pi ( \cdot |s) \| \tilde{\pi }( \cdot |s)\right) ,\epsilon =\max_{a,s}| A_{\pi } (s,a)| α=smaxDTV(π(⋅∣s)∥π~(⋅∣s)),ϵ=a,smax∣Aπ(s,a)∣. 显然 2 α ϵ \displaystyle 2\alpha \epsilon 2αϵ就是所有元素中的最大值。接下来分析第二项
∥ G Δ G ~ ρ ∥ 1 ≤ ∥ G ∥ 1 ∥ Δ ∥ 1 ∥ G ~ ∥ 1 ∥ ρ ∥ 1 \begin{aligned} \| G\Delta \tilde{G} \rho \| _{1} & \leq \| G\| _{1} \| \Delta \| _{1} \| \tilde{G} \| _{1} \| \rho \| _{1} \end{aligned} ∥GΔG~ρ∥1≤∥G∥1∥Δ∥1∥G~∥1∥ρ∥1
首先,矩阵的l1 norm的定义为
∥ A ∥ 1 = sup ρ { ∥ A ρ ∥ 1 ∥ ρ ∥ 1 } \| A\| _{1} =\sup _{\rho }\left\{\frac{\| A\rho \| _{1}}{\| \rho \| _{1}}\right\} ∥A∥1=ρsup{∥ρ∥1∥Aρ∥1}
直觉上,就是找一个最优向量使得 ∥ A ρ ∥ 1 \displaystyle \| A\rho \| _{1} ∥Aρ∥1最大,又因为l1往往会得到稀疏解,所有其解出来的 ρ \displaystyle \rho ρ往往是一种one-hot向量,所以,其有着性质,就是选择绝对值求和最大的一列,即
∥ A ∥ 1 = max 1 ≤ j ≤ n ∑ i = 1 m ∣ a i j ∣ , {\displaystyle \| A\| _{1} =\max_{1\leq j\leq n}\sum _{i=1}^{m} |a_{ij} |,} ∥A∥1=1≤j≤nmaxi=1∑m∣aij∣,
接下来,我们看看 ∥ G ∥ 1 \displaystyle \| G\| _{1} ∥G∥1的取值是多少
G = ( 1 + γ P π + ( γ P π ) 2 + … ) = ( 1 + γ P 11 + γ 2 ∑ j P 1 j P j 1 + γ 3 ∑ j 2 P j 2 1 ∑ j 1 P 1 j 1 P j 1 j 2 + . . . . . . 1 + γ P 1 n + γ 2 ∑ j P 1 j P j n + γ 3 ∑ j 2 P j 2 n ∑ j 1 P 1 j 1 P j 1 j 2 + . . . γ P 21 + γ 2 ∑ j P 2 j P j 1 + γ 3 ∑ j 2 P j 2 1 ∑ j 1 P 2 j 1 P j 1 j 2 + . . . . . . 1 + γ P 2 n + γ 2 ∑ j P 2 j P j n + γ 3 ∑ j 2 P j 2 n ∑ j 1 P 2 j 1 P j 1 j 2 + . . . ⋮ ⋮ γ P n 1 + γ 2 ∑ j P n j P j 1 + γ 3 ∑ j 2 P j 2 1 ∑ j 1 P n j 1 P j 1 j 2 + . . . . . . 1 + γ P n n + γ 2 ∑ j P n j P j n + γ 3 ∑ j 2 P j 2 n ∑ j 1 P n j 1 P j 1 j 2 + . . . ) G=\left( 1+\gamma P_{\pi } +( \gamma P_{\pi })^{2} +\dotsc \right)\\ =\begin{pmatrix} 1+\gamma P_{11} +\gamma ^{2}\sum _{j} P_{1j} P_{j1} +\gamma ^{3}\sum _{j_{2}} P_{j_{2} 1}\sum _{j_{1}} P_{1j_{1}} P_{j_{1} j_{2}} +... & ... & 1+\gamma P_{1n} +\gamma ^{2}\sum _{j} P_{1j} P_{jn} +\gamma ^{3}\sum _{j_{2}} P_{j_{2} n}\sum _{j_{1}} P_{1j_{1}} P_{j_{1} j_{2}} +...\\ \gamma P_{21} +\gamma ^{2}\sum _{j} P_{2j} P_{j1} +\gamma ^{3}\sum _{j_{2}} P_{j_{2} 1}\sum _{j_{1}} P_{2j_{1}} P_{j_{1} j_{2}} +... & ... & 1+\gamma P_{2n} +\gamma ^{2}\sum _{j} P_{2j} P_{jn} +\gamma ^{3}\sum _{j_{2}} P_{j_{2} n}\sum _{j_{1}} P_{2j_{1}} P_{j_{1} j_{2}} +...\\ \vdots & & \vdots \\ \gamma P_{n1} +\gamma ^{2}\sum _{j} P_{nj} P_{j1} +\gamma ^{3}\sum _{j_{2}} P_{j_{2} 1}\sum _{j_{1}} P_{nj_{1}} P_{j_{1} j_{2}} +... & ... & 1+\gamma P_{nn} +\gamma ^{2}\sum _{j} P_{nj} P_{jn} +\gamma ^{3}\sum _{j_{2}} P_{j_{2} n}\sum _{j_{1}} P_{nj_{1}} P_{j_{1} j_{2}} +... \end{pmatrix} G=(1+γPπ+(γPπ)2+…)=⎝⎜⎜⎜⎛1+γP11+γ2∑jP1jPj1+γ3∑j2Pj21∑j1P1j1Pj1j2+...γP21+γ2∑jP2jPj1+γ3∑j2Pj21∑j1P2j1Pj1j2+...⋮γPn1+γ2∑jPnjPj1+γ3∑j2Pj21∑j1Pnj1Pj1j2+............1+γP1n+γ2∑jP1jPjn+γ3∑j2Pj2n∑j1P1j1Pj1j2+...1+γP2n+γ2∑jP2jPjn+γ3∑j2Pj2n∑j1P2j1Pj1j2+...⋮1+γPnn+γ2∑jPnjPjn+γ3∑j2Pj2n∑j1Pnj1Pj1j2+...⎠⎟⎟⎟⎞
对于第一列的求和:
= 1 + γ ∑ i = 1 n P i 1 + γ 2 ∑ j ∑ i = 1 n P i j P j 1 + γ 3 ∑ j 2 P j 2 1 ∑ j 1 ∑ i = 1 n P i j 1 P j 1 j 2 + . . . = 1 + γ + γ 2 ∑ j P j 1 + γ 3 ∑ j 2 P j 2 1 ∑ j 1 P j 1 j 2 + . . . = 1 + γ + γ 2 + γ 3 ∑ j 2 P j 2 1 + . . . = 1 + γ + γ 2 + γ 3 + . . . = 1 1 − γ \begin{aligned} & =1+\gamma \sum _{i=1}^{n} P_{i1} +\gamma ^{2}\sum _{j}\sum _{i=1}^{n} P_{ij} P_{j1} +\gamma ^{3}\sum _{j_{2}} P_{j_{2} 1}\sum _{j_{1}}\sum _{i=1}^{n} P_{ij_{1}} P_{j_{1} j_{2}} +...\\ & =1+\gamma +\gamma ^{2}\sum _{j} P_{j1} +\gamma ^{3}\sum _{j_{2}} P_{j_{2} 1}\sum _{j_{1}} P_{j_{1} j_{2}} +...\\ & =1+\gamma +\gamma ^{2} +\gamma ^{3}\sum _{j_{2}} P_{j_{2} 1} +...\\ & =1+\gamma +\gamma ^{2} +\gamma ^{3} +...\\ & =\frac{1}{1-\gamma } \end{aligned} =1+γi=1∑nPi1+γ2j∑i=1∑nPijPj1+γ3j2∑Pj21j1∑i=1∑nPij1Pj1j2+...=1+γ+γ2j∑Pj1+γ3j2∑Pj21j1∑Pj1j2+...=1+γ+γ2+γ3j2∑Pj21+...=1+γ+γ2+γ3+...=1−γ1
显然每一列的求和均为 1 1 − γ \displaystyle \frac{1}{1-\gamma } 1−γ1,于是 ∥ G ∥ 1 = ∥ G ~ ∥ 1 = 1 1 − γ \displaystyle \| G\| _{1} =\| \tilde{G} \| _{1} =\frac{1}{1-\gamma } ∥G∥1=∥G~∥1=1−γ1,对于
∥ Δ ∥ 1 = ∥ P π ~ − P π ∥ 1 P π ~ − P π = ( P π ( 1 ∣ 1 ) P π ( 1 ∣ 2 ) ⋯ P π ( 1 ∣ n ) P π ( 2 ∣ 1 ) P π ( 2 ∣ 2 ) ⋯ P π ( 2 ∣ n ) ⋮ ⋮ ⋮ P π ( n ∣ 1 ) P π ( n ∣ 2 ) ⋯ P π ( n ∣ n ) ) − ( P π ~ ( 1 ∣ 1 ) P π ~ ( 1 ∣ 2 ) ⋯ P π ~ ( 1 ∣ n ) P π ~ ( 2 ∣ 1 ) P π ~ ( 2 ∣ 2 ) ⋯ P π ~ ( 2 ∣ n ) ⋮ ⋮ ⋮ P π ~ ( n ∣ 1 ) P π ~ ( n ∣ 2 ) ⋯ P π ~ ( n ∣ n ) ) = ( P π ( 1 ∣ 1 ) − P π ~ ( 1 ∣ 1 ) ⋯ P π ( 1 ∣ n ) − P π ~ ( 1 ∣ n ) P π ( 2 ∣ 1 ) − P π ~ ( 2 ∣ 1 ) ⋯ P π ( 2 ∣ n ) − P π ~ ( 2 ∣ n ) ⋮ ⋮ P π ( n ∣ 1 ) − P π ~ ( n ∣ 1 ) ⋯ P π ( n ∣ n ) − P π ~ ( n ∣ n ) ) \begin{aligned} \| \Delta \| _{1} & =\| P_{\tilde{\pi }} -P_{\pi } \| _{1}\\ & \\ P_{\tilde{\pi }} -P_{\pi } & =\begin{pmatrix} P_{\pi }( 1|1) & P_{\pi }( 1|2) & \cdots & P_{\pi }( 1|n)\\ P_{\pi }( 2|1) & P_{\pi }( 2|2) & \cdots & P_{\pi }( 2|n)\\ \vdots & \vdots & & \vdots \\ P_{\pi }( n|1) & P_{\pi }( n|2) & \cdots & P_{\pi }( n|n) \end{pmatrix} -\begin{pmatrix} P_{\tilde{\pi }}( 1|1) & P_{\tilde{\pi }}( 1|2) & \cdots & P_{\tilde{\pi }}( 1|n)\\ P_{\tilde{\pi }}( 2|1) & P_{\tilde{\pi }}( 2|2) & \cdots & P_{\tilde{\pi }}( 2|n)\\ \vdots & \vdots & & \vdots \\ P_{\tilde{\pi }}( n|1) & P_{\tilde{\pi }}( n|2) & \cdots & P_{\tilde{\pi }}( n|n) \end{pmatrix}\\ & =\begin{pmatrix} P_{\pi }( 1|1) -P_{\tilde{\pi }}( 1|1) & \cdots & P_{\pi }( 1|n) -P_{\tilde{\pi }}( 1|n)\\ P_{\pi }( 2|1) -P_{\tilde{\pi }}( 2|1) & \cdots & P_{\pi }( 2|n) -P_{\tilde{\pi }}( 2|n)\\ \vdots & & \vdots \\ P_{\pi }( n|1) -P_{\tilde{\pi }}( n|1) & \cdots & P_{\pi }( n|n) -P_{\tilde{\pi }}( n|n) \end{pmatrix} \end{aligned} ∥Δ∥1Pπ~−Pπ=∥Pπ~−Pπ∥1=⎝⎜⎜⎜⎛Pπ(1∣1)Pπ(2∣1)⋮Pπ(n∣1)Pπ(1∣2)Pπ(2∣2)⋮Pπ(n∣2)⋯⋯⋯Pπ(1∣n)Pπ(2∣n)⋮Pπ(n∣n)⎠⎟⎟⎟⎞−⎝⎜⎜⎜⎛Pπ~(1∣1)Pπ~(2∣1)⋮Pπ~(n∣1)Pπ~(1∣2)Pπ~(2∣2)⋮Pπ~(n∣2)⋯⋯⋯Pπ~(1∣n)Pπ~(2∣n)⋮Pπ~(n∣n)⎠⎟⎟⎟⎞=⎝⎜⎜⎜⎛Pπ(1∣1)−Pπ~(1∣1)Pπ(2∣1)−Pπ~(2∣1)⋮Pπ(n∣1)−Pπ~(n∣1)⋯⋯⋯Pπ(1∣n)−Pπ~(1∣n)Pπ(2∣n)−Pπ~(2∣n)⋮Pπ(n∣n)−Pπ~(n∣n)⎠⎟⎟⎟⎞
对于 P π ~ − P π P_{\tilde{\pi }} -P_{\pi } Pπ~−Pπ的某一列进行求和,我们有
∑ s ′ ( P π ( s ′ ∣ s ) − P π ~ ( s ′ ∣ s ) ) = ∑ s ′ ( P ( s ′ ∣ s , a ) π ( a ∣ s ) − P ( s ′ ∣ s , a ) π ~ ( a ∣ s ) ) = π ( a ∣ s ) − π ~ ( a ∣ s ) \begin{aligned} & \sum _{s'}( P_{\pi }( s'|s) -P_{\tilde{\pi }}( s'|s))\\ = & \sum _{s'}\left( P( s'|s,a) \pi ( a|s) -P( s'|s,a)\tilde{\pi }( a|s)\right)\\ = & \pi ( a|s) -\tilde{\pi }( a|s) \end{aligned} ==s′∑(Pπ(s′∣s)−Pπ~(s′∣s))s′∑(P(s′∣s,a)π(a∣s)−P(s′∣s,a)π~(a∣s))π(a∣s)−π~(a∣s)
所以 ∥ P π ~ − P π ∥ 1 \displaystyle \| P_{\tilde{\pi }} -P_{\pi } \| _{1} ∥Pπ~−Pπ∥1的取值应该是最大的那一列,使得 max j ∣ π ( a ∣ j ) − π ~ ( a ∣ j ) ∣ = max j 2 ∗ D T V ( π ( ⋅ ∣ j ) ∥ π ~ ( ⋅ ∣ j ) ) = 2 α \displaystyle \max_{j} |\pi ( a|j) -\tilde{\pi }( a|j) |=\max_{j} 2*D_{TV}\left( \pi ( \cdot |j) \| \tilde{\pi }( \cdot |j)\right) =2\alpha jmax∣π(a∣j)−π~(a∣j)∣=jmax2∗DTV(π(⋅∣j)∥π~(⋅∣j))=2α。最后,最后一项 ∥ ρ 0 ∥ 1 = ∑ s ∣ P ( s 0 = s ) ∣ = 1 \displaystyle \| \rho _{0} \| _{1} =\sum _{s} |P( s_{0} =s) |=1 ∥ρ0∥1=s∑∣P(s0=s)∣=1,因此,
∥ G Δ G ~ ρ ∥ 1 ≤ ∥ G ∥ 1 ∥ Δ ∥ 1 ∥ G ~ ∥ 1 ∥ ρ ∥ 1 = 1 1 − γ ⋅ 2 α ⋅ 1 1 − γ ⋅ 1 = 2 α ( 1 − γ ) 2 \begin{aligned} \| G\Delta \tilde{G} \rho \| _{1} & \leq \| G\| _{1} \| \Delta \| _{1} \| \tilde{G} \| _{1} \| \rho \| _{1}\\ & =\frac{1}{1-\gamma } \cdot 2\alpha \cdot \frac{1}{1-\gamma } \cdot 1\\ & =\frac{2\alpha }{(1-\gamma )^{2}} \end{aligned} ∥GΔG~ρ∥1≤∥G∥1∥Δ∥1∥G~∥1∥ρ∥1=1−γ1⋅2α⋅1−γ1⋅1=(1−γ)22α
最后,我们有
γ 2 ∣ r G Δ G Δ G ~ ρ 0 ∣ ≤ γ ∥ γ r G Δ ∥ ∞ ∥ G Δ G ~ ρ 0 ∥ 1 ≤ γ ∥ v Δ ∥ ∞ ∥ G Δ G ~ ρ 0 ∥ 1 ≤ γ ∥ v Δ ∥ ∞ ≤ γ ⋅ 2 α ϵ ⋅ 2 α ( 1 − γ ) 2 = 4 γ ϵ ( 1 − γ ) 2 α 2 \begin{aligned} \gamma ^{2} |rG\Delta G\Delta \tilde{G} \rho _{0} | & \leq \gamma \| \gamma rG\Delta \| _{\infty } \| G\Delta \tilde{G} \rho _{0} \| _{1}\\ & \leq \gamma \| v\Delta \| _{\infty } \| G\Delta \tilde{G} \rho _{0} \| _{1}\\ & \leq \gamma \| v\Delta \| _{\infty }\\ & \leq \gamma \cdot 2\alpha \epsilon \cdot \frac{2\alpha }{(1-\gamma )^{2}}\\ & =\frac{4\gamma \epsilon }{(1-\gamma )^{2}} \alpha ^{2} \end{aligned} γ2∣rGΔGΔG~ρ0∣≤γ∥γrGΔ∥∞∥GΔG~ρ0∥1≤γ∥vΔ∥∞∥GΔG~ρ0∥1≤γ∥vΔ∥∞≤γ⋅2αϵ⋅(1−γ)22α=(1−γ)24γϵα2
这就是最终结果了.
参考资料
Trust Region Policy Optimization
Theoretical Foundations of Reinforcement Learning
Approximately Optimal Approximate Reinforcement Learning
开放原子旋武开源社区由开放原子开源基金会孵化及运营,业务方向涉及操作系统、终端设备、安全技术、基础软件等关键领域,致力于推动Rust编程语言在中国的开源生态建设与产业落地,面向开发者全面宣传和推广Rust。
更多推荐


所有评论(0)