When the model is too large (or unknown) to obtain the optimal policy by iterating over it directly, we need to use model-free RL.

Model-free prediction

Monte Carlo policy evaluation

  • Return: $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$ under policy $\pi$
  • $v^\pi(s) = E_{\tau\sim\pi}[G_t \mid s_t = s]$, i.e. the expectation over trajectories $\tau$ generated by following $\pi$
  • MC simulation: we can simply sample a lot of trajectories, compute the actual returns for all the trajectories, then average them
  • MC policy evaluation uses the empirical mean return instead of the expected return
  • MC does not require MDP dynamics/rewards, uses no bootstrapping, and does not assume states are Markov.
  • It can only be applied to episodic MDPs (each episode terminates)

Algorithm outline

To evaluate the state value $v(s)$:

  • Every time-step t that state s is visited in an episode,
  • Increment counter $N(s)\to N(s)+1$
  • Increment total return $S(s)\to S(s)+G_t$
  • Value is estimated by the mean return $v(s) = S(s)/N(s)$
  • By the law of large numbers, $v(s) \to v^\pi(s)$ as $N(s) \to \infty$

The core idea of Monte Carlo is to sample many episodes and average the returns observed for each state (see the sketch below).
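A minimal sketch of this count-based (every-visit) MC evaluation in a tabular setting; `sample_episode` is a hypothetical helper, not part of the original notes, that rolls out one terminated episode under the policy and returns a list of (state, reward) pairs.

```python
from collections import defaultdict

def mc_policy_evaluation(sample_episode, policy, num_episodes, gamma=0.99):
    """Every-visit Monte Carlo policy evaluation (tabular sketch)."""
    N = defaultdict(int)    # visit counter N(s)
    S = defaultdict(float)  # accumulated return S(s)
    V = defaultdict(float)  # value estimate v(s)
    for _ in range(num_episodes):
        # `sample_episode` is assumed to return [(s_0, r_1), (s_1, r_2), ...]
        episode = sample_episode(policy)
        G = 0.0
        # Walk backwards so G_t = r_{t+1} + gamma * G_{t+1} accumulates in one pass.
        for s, r in reversed(episode):
            G = r + gamma * G
            N[s] += 1
            S[s] += G
            V[s] = S[s] / N[s]  # empirical mean return
    return V
```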

  • Collect one episode $(S_1, A_1, R_1, \dots, S_t)$
  • For each state $S_t$ with computed return $G_t$: $N(S_t) \leftarrow N(S_t) + 1$, $\;v(S_t) \leftarrow v(S_t) + \frac{1}{N(S_t)}\big(G_t - v(S_t)\big)$
  • Or use a running mean $v(S_t) \leftarrow v(S_t) + \alpha\big(G_t - v(S_t)\big)$, in which old episodes are gradually forgotten. Good for non-stationary problems.

Here $\alpha$ can be viewed as a learning rate; a code sketch of this incremental update follows.
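A minimal sketch of the incremental update with a constant step size $\alpha$ (the running-mean variant), assuming the same (state, reward) episode format as in the sketch above:

```python
def incremental_mc_update(V, episode, alpha=0.1, gamma=0.99):
    """One incremental MC pass: v(S_t) <- v(S_t) + alpha * (G_t - v(S_t))."""
    G = 0.0
    for s, r in reversed(episode):  # episode = [(s_0, r_1), (s_1, r_2), ...]
        G = r + gamma * G           # return G_t
        v = V.get(s, 0.0)
        V[s] = v + alpha * (G - v)  # constant alpha: old episodes fade away
    return V
```

With $\alpha = 1/N(S_t)$ this reduces to the exact running mean; a fixed $\alpha$ weights recent episodes more heavily, which is why it suits non-stationary problems.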

Temporal Difference learning

  • Objective: learn $v_\pi$ online from experience under policy $\pi$
  • Simplest TD algorithm: TD(0)
    • Update $v(S_t)$ toward the estimated return $R_{t+1} + \gamma v(S_{t+1})$ (see the sketch after this list)
  • $R_{t+1} + \gamma v(S_{t+1})$ is called the TD target
  • $\delta_t = R_{t+1} + \gamma v(S_{t+1}) - v(S_t)$ is called the TD error
  • Comparison: Incremental Monte-Carlo
    • Update $v(S_t)$ toward the actual return $G_t$, which is only available once an episode finishes
  • We can also use an n-step TD target instead of the one-step target
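A minimal tabular TD(0) sketch; the Gymnasium-style environment interface (`env.reset()`, `env.step(a)`) is an assumption for illustration, not part of the original notes.

```python
def td0_evaluation(env, policy, num_episodes, alpha=0.1, gamma=0.99):
    """Tabular TD(0): update v(S_t) toward R_{t+1} + gamma * v(S_{t+1}) after every step."""
    V = {}
    for _ in range(num_episodes):
        s, _ = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # TD target; beyond a terminal state the value is 0.
            target = r if terminated else r + gamma * V.get(s_next, 0.0)
            td_error = target - V.get(s, 0.0)           # delta_t
            V[s] = V.get(s, 0.0) + alpha * td_error     # online update, no full episode needed
            s = s_next
    return V
```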

Difference Between TD and MC

TD can learn online after every step, whereas MC can only learn after the whole episode (the full decision trajectory) has finished.

  • DP: full-width backup over all possible successor states, bootstraps from current estimates, requires the model
  • MC: samples one complete episode, no bootstrapping
  • TD: samples a single transition and bootstraps from the current value estimate

Model-free control

Monte Carlo with $\epsilon$-greedy Exploration

  • Trade-off between exploration and exploitation (we will talk about this in a later lecture)
  • $\epsilon$-greedy exploration: ensures continual exploration (a code sketch follows this list)
    • All actions are tried with non-zero probability
    • With probability 1 - $\epsilon$ choose the greedy action
    • With probability $\epsilon$ choose an action at random
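A minimal sketch of $\epsilon$-greedy action selection over a tabular action-value function; the dict-of-(state, action) representation of `Q` is an assumption for illustration.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Pick a random action with probability epsilon, otherwise the greedy action under Q."""
    if random.random() < epsilon:
        return random.choice(actions)                               # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))       # exploit
```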

Sarsa: On-Policy TD Control

  • An episode consists of an alternating sequence of states and state–action pairs: $S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}, \dots$
  • Follow the $\epsilon$-greedy policy for one step, then bootstrap on the action-value function:
    • $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]$
  • The update is done after every transition from a nonterminal state $S_t$
  • TD target $\phi_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})$ (a code sketch follows this list)
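A one-episode Sarsa loop as a sketch, reusing the hypothetical `epsilon_greedy` helper and Gymnasium-style environment interface from the earlier sketches:

```python
def sarsa_episode(env, Q, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
    """On-policy TD control: the action used in the target is the one actually taken next."""
    s, _ = env.reset()
    a = epsilon_greedy(Q, s, actions, epsilon)
    done = False
    while not done:
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        a_next = epsilon_greedy(Q, s_next, actions, epsilon)
        # TD target R_{t+1} + gamma * Q(S_{t+1}, A_{t+1}); 0 beyond a terminal state.
        target = r if terminated else r + gamma * Q.get((s_next, a_next), 0.0)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
        s, a = s_next, a_next
    return Q
```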

On-policy Learning vs. Off-policy Learning

  • On-policy learning: Learn about policy $\pi$ from the experience collected from $\pi$
    • Behave non-optimally in order to explore all actions, then reduce the
      exploration (e.g., decay $\epsilon$ in $\epsilon$-greedy)
  • Another important approach is off-policy learning, which essentially uses two different policies:
    • the one which is being learned about and becomes the optimal policy
    • the other one which is more exploratory and is used to generate
      trajectories
  • Off-policy learning: Learn about policy $\pi$ from the experience sampled from another policy $\mu$
    • $\pi$: target policy
    • $\mu$: behavior policy

Off-policy learning can behave with a more aggressive (exploratory) behavior policy and use the feedback from that behavior to optimize the target policy.

Off-policy Control with Q-Learning
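As a sketch of the idea named in this heading: Q-learning selects actions with the exploratory $\epsilon$-greedy behavior policy $\mu$, but its TD target bootstraps from the greedy target policy via $\max_a Q(S_{t+1}, a)$. The `epsilon_greedy` helper and environment interface are the same assumptions as in the earlier sketches.

```python
def q_learning_episode(env, Q, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Off-policy TD control: behave epsilon-greedily, learn about the greedy policy."""
    s, _ = env.reset()
    done = False
    while not done:
        a = epsilon_greedy(Q, s, actions, epsilon)        # behavior policy mu
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        # Target bootstraps from the greedy (target) policy: max_a' Q(S_{t+1}, a').
        best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
        target = r if terminated else r + gamma * best_next
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
        s = s_next
    return Q
```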