When the model is too large for us to obtain the optimal policy through iteration, we need to use model-free RL.

Model-free prediction

Monte Carlo policy evaluation

  • Return: $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$ under policy $\pi$ (a small sketch follows this list)
  • $v_\pi(s) = \mathbb{E}_{\tau \sim \pi}[G_t \mid s_t = s]$, i.e. the expectation over trajectories $\tau$ generated by following $\pi$
  • MC simulation: we can simply sample a lot of trajectories, compute the actual returns for all the trajectories, then average them
  • MC policy evaluation uses the empirical mean return instead of the expected return
  • MC does not require MDP dynamics/rewards, uses no bootstrapping, and does not assume the state is Markov.
  • Only applied to episodic MDPs (each episode terminates)
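
To make the return concrete, here is a minimal Python sketch (the helper name `compute_returns` is my own, not from the notes) that computes $G_t$ for every time step of one sampled reward sequence:

```python
def compute_returns(rewards, gamma=0.99):
    """rewards[t] is the reward received after step t; the result[t] is G_t."""
    returns = [0.0] * len(rewards)
    g = 0.0
    # Work backwards: G_t = R_{t+1} + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# Example: a 3-step episode with rewards 1, 0, 2 and gamma = 0.5
print(compute_returns([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```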

Algorithm outline

To evaluate the state value v(s):

  • Every time-step t that state s is visited in an episode,
  • Increment counter $N(s) \leftarrow N(s) + 1$
  • Increment total return $S(s) \leftarrow S(s) + G_t$
  • Value is estimated by the mean return $v(s) = S(s)/N(s)$. By the law of large numbers, $v(s) \to v_\pi(s)$ as $N(s) \to \infty$ (see the code sketch below)
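
A rough sketch of this counting scheme (every-visit MC). The `sample_episode(policy)` helper that rolls out one episode and returns (state, reward) pairs, and the dict-based state representation, are assumptions for illustration, not part of the notes:

```python
from collections import defaultdict

def mc_policy_evaluation(sample_episode, policy, num_episodes=1000, gamma=0.99):
    N = defaultdict(int)    # visit counter N(s)
    S = defaultdict(float)  # accumulated return S(s)
    for _ in range(num_episodes):
        episode = sample_episode(policy)         # [(s_0, r_1), (s_1, r_2), ...]
        g = 0.0
        for state, reward in reversed(episode):  # accumulate G_t backwards
            g = reward + gamma * g
            N[state] += 1                        # N(s) <- N(s) + 1
            S[state] += g                        # S(s) <- S(s) + G_t
    return {s: S[s] / N[s] for s in N}           # v(s) = S(s) / N(s)
```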

The core idea of Monte Carlo is to sample many episodes and take the mean of the returns observed for each state as its value.

  • Collect one episode $(S_1, A_1, R_1, \dots, S_T)$
  • For each state $S_t$ with computed return $G_t$: $N(S_t) \leftarrow N(S_t) + 1$, $v(S_t) \leftarrow v(S_t) + \frac{1}{N(S_t)}(G_t - v(S_t))$
  • Or use a running mean (old episodes are forgotten), which is good for non-stationary problems: $v(S_t) \leftarrow v(S_t) + \alpha(G_t - v(S_t))$ (see the sketch after this list)
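
A minimal sketch of both incremental updates, under the same assumptions as above (`episode` is a list of (state, reward) pairs; the function name is illustrative):

```python
def mc_incremental_update(v, N, episode, gamma=0.99, alpha=None):
    """v: state -> value estimate, N: state -> visit count; updated in place."""
    g = 0.0
    for state, reward in reversed(episode):
        g = reward + gamma * g                 # return G_t observed from this state
        old = v.get(state, 0.0)
        if alpha is None:                      # exact incremental mean, 1/N(s) step size
            N[state] = N.get(state, 0) + 1
            v[state] = old + (g - old) / N[state]
        else:                                  # running mean with constant step size alpha
            v[state] = old + alpha * (g - old)
    return v
```

With a constant $\alpha$ the estimate weights recent episodes more heavily, which is why it suits non-stationary problems.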

Here $\alpha$ can be viewed as a learning rate.

Temporal Difference learning

  • Objective: learn vπ online from experience under policy π
  • Simplest TD algorithm: TD(0)
    • Update $v(S_t)$ toward the estimated return $R_{t+1} + \gamma v(S_{t+1})$ (a code sketch follows this list):
$v(S_t) \leftarrow v(S_t) + \alpha(R_{t+1} + \gamma v(S_{t+1}) - v(S_t))$
  • $R_{t+1} + \gamma v(S_{t+1})$ is called the TD target
  • $\delta_t = R_{t+1} + \gamma v(S_{t+1}) - v(S_t)$ is called the TD error
  • Comparison: Incremental Monte-Carlo
    • Update $v(S_t)$ toward the actual return $G_t$ given episode $i$: $v(S_t) \leftarrow v(S_t) + \alpha(G_t - v(S_t))$. We can also use n-step TD.
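
A minimal TD(0) prediction sketch; the Gym-style `env.reset()`/`env.step()` interface and the `policy(state)` callable are assumptions for illustration, not part of the notes:

```python
from collections import defaultdict

def td0_evaluation(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    v = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            td_target = reward + gamma * v[next_state] * (not done)  # R_{t+1} + gamma * v(S_{t+1})
            v[state] += alpha * (td_target - v[state])               # step along the TD error
            state = next_state
    return v
```

Note that the value estimate is updated after every single transition, which is exactly the online property contrasted with MC below.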

Difference Between TD and MC

TD can learn online after every step, while MC can only learn after the episode has run to completion.

  • DP: uses the known model to do full-width backups and bootstraps
  • MC: samples complete episodes, no bootstrapping
  • TD: samples a single step and bootstraps from the current estimate

Model-free control

Monte Carlo with $\epsilon$-greedy Exploration

  • Trade-off between exploration and exploitation (we will talk about this in a later lecture)
  • $\epsilon$-Greedy Exploration: ensuring continual exploration (see the sketch below)
    • All actions are tried with non-zero probability
    • With probability $1 - \epsilon$ choose the greedy action
    • With probability $\epsilon$ choose an action at random
$$\pi(a|s) = \begin{cases} \epsilon/|\mathcal{A}| + 1 - \epsilon & \text{if } a = \arg\max_{a' \in \mathcal{A}} Q(s, a') \\ \epsilon/|\mathcal{A}| & \text{otherwise} \end{cases}$$
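
A small sketch of $\epsilon$-greedy action selection matching this distribution (keying `Q` by (state, action) pairs is an assumed representation, not from the notes):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)  # explore: uniform over all actions
    # exploit: greedy action w.r.t. the current Q estimate
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```

Because the random branch also picks the greedy action with probability $\epsilon/|\mathcal{A}|$, the greedy action ends up with total probability $\epsilon/|\mathcal{A}| + 1 - \epsilon$, exactly as in the formula above.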

Sarsa: On-Policy TD Control

  • An episode consists of an alternating sequence of states and state-action pairs:
  • Follow the $\epsilon$-greedy policy for one step, then bootstrap the action-value function:
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]$
  • The update is done after every transition from a nonterminal state St
  • TD target $\phi_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})$ (see the sketch below)
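
A rough Sarsa sketch that reuses the `epsilon_greedy` helper above; as before, the Gym-style environment interface is an assumption for illustration:

```python
from collections import defaultdict

def sarsa(env, actions, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)                       # Q[(state, action)]
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, actions, epsilon)
        done = False
        while not done:
            next_state, reward, done, _ = env.step(action)
            next_action = epsilon_greedy(Q, next_state, actions, epsilon)
            # on-policy TD target: uses the action the policy will actually take next
            td_target = reward + gamma * Q[(next_state, next_action)] * (not done)
            Q[(state, action)] += alpha * (td_target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```

The name Sarsa comes from the quintuple $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$ used in each update.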

On-policy Learning vs. Off-policy Learning

  • On-policy learning: learn about policy π from the experience collected by following π
    • Behave non-optimally in order to explore all actions, then reduce the exploration (e.g. decay $\epsilon$ in $\epsilon$-greedy)
  • Another important approach is off-policy learning, which essentially uses two different policies:
    • the one which is being learned about and becomes the optimal policy
    • the other one which is more exploratory and is used to generate trajectories
  • Off-policy learning: learn about policy π from the experience sampled from another policy μ
    • π: target policy
    • μ: behavior policy

In off-policy learning the behavior policy can act more aggressively (explore more), and the feedback it generates is used to optimize the target policy.

Off-policy control with Q-Learning