
DQN

 
Essentially, DQN uses conv + FC layers to predict Q-values; its main purpose is to replace the Q-value lookup table when the state space gets too large.
The input here is a stack of frames representing the current state (a stack of frames gives us some temporal info).
notion image
 
The loss function compares our Q-value prediction with the Q-target, and gradient descent updates the weights of our Deep Q-Network so that it approximates the Q-values better.
notion image
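Written out (the image presumably shows the standard TD target and loss):

$$
y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-), \qquad
\mathcal{L}(\theta) = \big(y_t - Q(s_t, a_t; \theta)\big)^2
$$

where $\theta^-$ denotes the weights used for the target (in vanilla DQN the same network; the fixed-target variant is discussed below).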
 
The Deep Q-Learning training algorithm has two phases:
  • Sampling: we perform actions and store the observed experience tuples in a replay memory.
  • Training: Select a small batch of tuples randomly and learn from this batch using a gradient descent update step (a minimal sketch of this loop follows the diagram below).
 
notion image
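A minimal sketch of this two-phase loop, assuming a Gym-style `env`, a `q_net` and `optimizer` defined elsewhere, and a placeholder `select_action` (e.g. epsilon-greedy); none of these names come from the original post. Note that it still uses `q_net` itself to compute the target, which is exactly the instability the tricks below address:

```python
import random
from collections import deque

import torch
import torch.nn.functional as F

memory = deque(maxlen=100_000)      # replay memory of (s, a, r, s', done) tuples
gamma, batch_size = 0.99, 32
state = env.reset()

for step in range(num_steps):
    # --- Sampling phase: act and store the observed experience tuple ---
    action = select_action(q_net, state, epsilon)          # placeholder policy
    next_state, reward, done, _ = env.step(action)
    memory.append((state, action, reward, next_state, done))
    state = env.reset() if done else next_state

    # --- Training phase: learn from a random minibatch with a gradient step ---
    if len(memory) >= batch_size:
        s, a, r, s2, d = map(torch.tensor, zip(*random.sample(memory, batch_size)))
        q_pred = q_net(s.float()).gather(1, a.long().unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            q_next = q_net(s2.float()).max(1).values
            q_target = r.float() + gamma * q_next * (1 - d.float())
        loss = F.mse_loss(q_pred, q_target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```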
 
Compared with tabular Q-learning, DQN is less stable, since it uses a non-linear neural network for both prediction and updates.
 
  1. Experience Replay, to make more efficient use of experiences (see the sketch after this list).
    • What is stored: tuples (state_t, action_t, reward_t, state_t+1), from which we sample batches.
      This makes more efficient use of past experience and avoids the forgetting caused by sequential sampling (only remembering the most recent updates).
      Randomly sampling from the buffer also reduces the correlation between experiences, so we don't introduce the gradient interference that highly correlated sequential data (the kind an RNN consumes) would bring.
      This helps keep the action values from oscillating or diverging catastrophically.
       
  2. Fixed Q-Target, to stabilize the training (also shown in the sketch after this list).
    • D ← replay memory holding the last N samples.
      We cannot use the same weights to estimate both the TD target and the Q-value: the two are too correlated, their updates move in lockstep, and the gap between them is hard to close.
      This is because while we reduce the TD error, we keep computing the target with the constantly updated Q-function (the target and the error are both moving).
      So we estimate the target with a frozen copy of the weights, keeping the target fixed, while the online network's weights keep being updated.
      Every C steps we sync the target weights with the online network; the point is to keep a stable, temporarily fixed learning target.
       
  3. Double Deep Q-Learning, to handle the problem of the overestimation of Q-values (see the Double DQN sketch further below).
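A minimal sketch of the first two tricks (experience replay + fixed Q-target), under the same placeholder names as above (`q_net`, `optimizer`); `capacity` and the call site of `sync_target` are also assumptions:

```python
import copy
import random
from collections import deque

import torch
import torch.nn.functional as F

buffer = deque(maxlen=capacity)        # experience replay: (s, a, r, s', done) tuples
target_net = copy.deepcopy(q_net)      # frozen copy used only to compute the TD target
target_net.eval()

def train_step(batch_size=32, gamma=0.99):
    # random minibatch -> decorrelated experiences, reused many times
    s, a, r, s2, d = map(torch.tensor, zip(*random.sample(buffer, batch_size)))
    q_pred = q_net(s.float()).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():              # the target comes from the frozen network
        td_target = r.float() + gamma * target_net(s2.float()).max(1).values * (1 - d.float())
    loss = F.mse_loss(q_pred, td_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():
    # called every C steps from the training loop: gives a temporarily fixed target
    target_net.load_state_dict(q_net.state_dict())
```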
 
How are we sure that the best action for the next state is the action with the highest Q-value?
→ The accuracy of the Q-values depends on which actions we have tried and which neighboring states we have explored.
 
Why overestimation? Because we conflate "optimal action" with "highest Q-value": early in training the Q-values are noisy and carry little signal, so taking their max can steer learning in the wrong direction.
 
We decouple the action selection from the target Q-value generation:
 
DQN (online network) → serves as the action policy: it selects the best action for the next state
Target network → computes the Q-value of taking that action at the next state
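A minimal sketch of the Double DQN target, as a drop-in replacement for the `td_target` line in the fixed-target sketch above (same placeholder variables):

```python
with torch.no_grad():
    # the online network picks the action for the next state...
    best_action = q_net(s2.float()).argmax(dim=1, keepdim=True)
    # ...and the frozen target network evaluates that action's Q-value
    next_q = target_net(s2.float()).gather(1, best_action).squeeze(1)
    td_target = r.float() + gamma * next_q * (1 - d.float())
```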
 
 

Policy Gradient

 
We move from value-based to policy-based methods, focusing on parameterizing the policy directly.
notion image
 
Policy gradient is one kind of policy-based method.
Policy-based methods are basically on-policy optimization: each trajectory can be used only once, with the most recent version of the model weights.
 
The difference between the two is that with policy gradient we apply the gradient of the objective function directly to the policy weights,
whereas policy-based methods in general may also maximize the objective by indirect means.
 
We don't need to store extra value estimates, and in addition:
  1. We don't need to implement an exploration/exploitation trade-off by hand. Since we output a probability distribution over actions, the agent explores the state space without always taking the same trajectory.
  2. We also get rid of the problem of perceptual aliasing. Perceptual aliasing is when two states seem (or are) the same but need different actions.
 
Overall this amounts to injecting a bit more randomness into the policy.
Policy-gradient methods are also more effective in high-dimensional and continuous action spaces.
 
💡
The problem with Deep Q-learning is that its predictions assign a score (maximum expected future reward) to each possible action, at each time step, given the current state.
But what if we have an infinite possibility of actions?
For instance, with a self-driving car, at each state you can have a (near) infinite choice of actions (turning the wheel at 15°, 17.2°, 19.4°, honking, etc.). We’ll need to output a Q-value for each possible action! And taking the max action of a continuous output is an optimization problem in itself!
 
In such cases, directly outputting a probability distribution over actions is much easier.
It also keeps inaccurate Q-values from having an outsized effect on the result, an effect that in value-based methods comes mainly from the discontinuity of the (arg)max action selection.
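For a continuous action (like a steering angle), the policy can output the parameters of a distribution and sample from it. A minimal sketch, assuming PyTorch; `GaussianPolicy`, the layer sizes, and the dimensions are all illustrative, not from the original post:

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Outputs a distribution over a continuous action instead of per-action Q-values."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.mean = nn.Linear(hidden, act_dim)              # mean of the action distribution
        self.log_std = nn.Parameter(torch.zeros(act_dim))   # learned (state-independent) std

    def forward(self, obs):
        h = self.backbone(obs)
        return torch.distributions.Normal(self.mean(h), self.log_std.exp())

policy = GaussianPolicy(obs_dim=8, act_dim=1)
dist = policy(torch.randn(1, 8))
action = dist.sample()            # e.g. a steering angle, sampled rather than arg-maxed
log_prob = dist.log_prob(action)  # used later by the policy-gradient update
```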
💡
In value-based methods, we use an aggressive operator to change the value function: we take the maximum over Q-estimates. Consequently, the action probabilities may change dramatically for an arbitrarily small change in the estimated action values if that change results in a different action having the maximal value.
For instance, if during the training the best action was left (with a Q-value of 0.22) and the training step after it’s right (since the right Q-value becomes 0.23), we dramatically changed the policy, since now it will take right most of the time instead of left.
On the other hand, in policy-gradient methods, stochastic policy action preferences (probability of taking action) change smoothly over time.
 
Training objective:
Our goal with policy-gradient is to control the probability distribution of actions by tuning the policy such that good actions (that maximize the return) are sampled more frequently in the future
 
Run a complete episode with the current policy and compute the total return.
Based on that total return, decide whether to increase or decrease the output probability of every (state, action) pair along the trajectory (see the REINFORCE sketch at the end of this section).
 
We use J as the objective function to measure the expected total return.
notion image
Probability of each possible trajectory under the model × reward of the entire trajectory,
i.e. a weighted average over all possible trajectories, weighted by how likely the current policy is to produce each of them.
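Written out (the image presumably shows this standard form):

$$
J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\big[R(\tau)\big] \;=\; \sum_{\tau} P(\tau;\theta)\, R(\tau)
$$

where $\tau$ is a trajectory, $P(\tau;\theta)$ its probability under the current policy, and $R(\tau)$ the return of the whole trajectory.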
 
Maximizing this objective gives us the best probability distribution over actions:
notion image
 
notion image
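The images above presumably show the gradient-ascent update and the policy-gradient theorem; in the usual Monte-Carlo (REINFORCE) form:

$$
\nabla_\theta J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\Big[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\Big],
\qquad
\theta \;\leftarrow\; \theta + \alpha \,\nabla_\theta J(\theta)
$$

A minimal REINFORCE-style sketch matching the recipe above (run one episode, compute the total return, then scale every log-probability by it). It reuses the `policy` from the Gaussian-policy sketch earlier; `log_probs` and `rewards` are collected during the episode, and all names are placeholders:

```python
import torch

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(log_probs, rewards, gamma=0.99):
    """One Monte-Carlo policy-gradient step from a single finished episode."""
    # total discounted return of the whole episode
    episode_return = sum(gamma**t * r for t, r in enumerate(rewards))
    # increase the log-probability of every action taken, scaled by the return
    # (a negative return pushes those probabilities down instead)
    loss = -torch.stack(log_probs).sum() * episode_return
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```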
 
 
 
 
 
 
 
 
 
 
 
 
 
 