
DQN

 
Essentially, DQN uses conv + FC layers to predict Q-values; its main purpose is to replace the Q-value lookup table when the state space gets too large.
The input here is a stack of frames representing the current state (a stack of frames gives us some temporal info).
notion image
 
The loss function compares our Q-value prediction with the Q-target, and gradient descent updates the weights of our Deep Q-Network so that it approximates the Q-values better.
notion image
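Written out (the image presumably shows the standard TD target and loss):

$$
y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-), \qquad
\mathcal{L}(\theta) = \big(y_t - Q(s_t, a_t; \theta)\big)^2
$$

where $\theta^-$ denotes the weights used for the target (in vanilla DQN the same network; the fixed-target variant is discussed below).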
 
The Deep Q-Learning training algorithm has two phases:
  • Sampling: we perform actions and store the observed experience tuples in a replay memory.
  • Training: Select a small batch of tuples randomly and learn from this batch using a gradient descent update step (a minimal sketch of this loop follows the diagram below).
 
notion image
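A minimal sketch of this two-phase loop, assuming a Gym-style `env`, a `q_net` and `optimizer` defined elsewhere, and a placeholder `select_action` (e.g. epsilon-greedy); none of these names come from the original post. Note that it still uses `q_net` itself to compute the target, which is exactly the instability the tricks below address:

```python
import random
from collections import deque

import torch
import torch.nn.functional as F

memory = deque(maxlen=100_000)      # replay memory of (s, a, r, s', done) tuples
gamma, batch_size = 0.99, 32
state = env.reset()

for step in range(num_steps):
    # --- Sampling phase: act and store the observed experience tuple ---
    action = select_action(q_net, state, epsilon)          # placeholder policy
    next_state, reward, done, _ = env.step(action)
    memory.append((state, action, reward, next_state, done))
    state = env.reset() if done else next_state

    # --- Training phase: learn from a random minibatch with a gradient step ---
    if len(memory) >= batch_size:
        s, a, r, s2, d = map(torch.tensor, zip(*random.sample(memory, batch_size)))
        q_pred = q_net(s.float()).gather(1, a.long().unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            q_next = q_net(s2.float()).max(1).values
            q_target = r.float() + gamma * q_next * (1 - d.float())
        loss = F.mse_loss(q_pred, q_target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```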
 
Compared with tabular Q-learning, DQN is less stable, since it uses a non-linear neural network for both prediction and updates.
 
  1. Experience Replay, to make more efficient use of experiences (see the sketch after this list).
    • What is stored: tuples (state_t, action_t, reward_t, state_t+1), from which we sample batches.
      This makes more efficient use of past experience and avoids the forgetting caused by sequential sampling (only remembering the most recent updates).
      Randomly sampling from the buffer also reduces the correlation between experiences, so we don't introduce the gradient interference that highly correlated sequential data (the kind an RNN consumes) would bring.
      This helps keep the action values from oscillating or diverging catastrophically.
       
  2. Fixed Q-Target, to stabilize the training (also shown in the sketch after this list).
    • D ← replay memory holding the last N samples.
      We cannot use the same weights to estimate both the TD target and the Q-value: the two are too correlated, their updates move in lockstep, and the gap between them is hard to close.
      This is because while we reduce the TD error, we keep computing the target with the constantly updated Q-function (the target and the error are both moving).
      So we estimate the target with a frozen copy of the weights, keeping the target fixed, while the online network's weights keep being updated.
      Every C steps we sync the target weights with the online network; the point is to keep a stable, temporarily fixed learning target.
       
  3. Double Deep Q-Learning, to handle the problem of the overestimation of Q-values (see the Double DQN sketch further below).
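A minimal sketch of the first two tricks (experience replay + fixed Q-target), under the same placeholder names as above (`q_net`, `optimizer`); `capacity` and the call site of `sync_target` are also assumptions:

```python
import copy
import random
from collections import deque

import torch
import torch.nn.functional as F

buffer = deque(maxlen=capacity)        # experience replay: (s, a, r, s', done) tuples
target_net = copy.deepcopy(q_net)      # frozen copy used only to compute the TD target
target_net.eval()

def train_step(batch_size=32, gamma=0.99):
    # random minibatch -> decorrelated experiences, reused many times
    s, a, r, s2, d = map(torch.tensor, zip(*random.sample(buffer, batch_size)))
    q_pred = q_net(s.float()).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():              # the target comes from the frozen network
        td_target = r.float() + gamma * target_net(s2.float()).max(1).values * (1 - d.float())
    loss = F.mse_loss(q_pred, td_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():
    # called every C steps from the training loop: gives a temporarily fixed target
    target_net.load_state_dict(q_net.state_dict())
```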
 
How are we sure that the best action for the next state is the action with the highest Q-value?
→ The accuracy of the Q-values depends on which actions we have tried and which neighboring states we have explored.
 
Why overestimation? Because we conflate "optimal action" with "highest Q-value": early in training the Q-values are noisy and carry little signal, so taking their max can steer learning in the wrong direction.
 
We decouple the action selection from the target Q-value generation:
 
DQN (online network) → serves as the action policy: it selects the best action for the next state
Target network → computes the Q-value of taking that action at the next state
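A minimal sketch of the Double DQN target, as a drop-in replacement for the `td_target` line in the fixed-target sketch above (same placeholder variables):

```python
with torch.no_grad():
    # the online network picks the action for the next state...
    best_action = q_net(s2.float()).argmax(dim=1, keepdim=True)
    # ...and the frozen target network evaluates that action's Q-value
    next_q = target_net(s2.float()).gather(1, best_action).squeeze(1)
    td_target = r.float() + gamma * next_q * (1 - d.float())
```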
 
 

Policy Gradient

 
We move from value-based to policy-based methods, focusing on parameterizing the policy directly.
notion image
 
Policy gradient is one kind of policy-based method.
Policy-based methods are basically on-policy optimization: each trajectory can be used only once, with the most recent version of the model weights.
 
The difference between the two is that with policy gradient we apply the gradient of the objective function directly to the policy weights,
whereas policy-based methods in general may also maximize the objective by indirect means.
 
We don't need to store extra value estimates, and in addition:
  1. We don't need to implement an exploration/exploitation trade-off by hand. Since we output a probability distribution over actions, the agent explores the state space without always taking the same trajectory.
  2. We also get rid of the problem of perceptual aliasing. Perceptual aliasing is when two states seem (or are) the same but need different actions.
 
Overall this amounts to injecting a bit more randomness into the policy.
Policy-gradient methods are also more effective in high-dimensional and continuous action spaces.
 
💡
The problem with Deep Q-learning is that its predictions assign a score (maximum expected future reward) to each possible action, at each time step, given the current state.
But what if we have an infinite possibility of actions?
For instance, with a self-driving car, at each state you can have a (near) infinite choice of actions (turning the wheel at 15°, 17.2°, 19.4°, honking, etc.). We’ll need to output a Q-value for each possible action! And taking the max action of a continuous output is an optimization problem in itself!
 
In such cases, directly outputting a probability distribution over actions is much easier.
It also keeps inaccurate Q-values from having an outsized effect on the result, an effect that in value-based methods comes mainly from the discontinuity of the (arg)max action selection.
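For a continuous action (like a steering angle), the policy can output the parameters of a distribution and sample from it. A minimal sketch, assuming PyTorch; `GaussianPolicy`, the layer sizes, and the dimensions are all illustrative, not from the original post:

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Outputs a distribution over a continuous action instead of per-action Q-values."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.mean = nn.Linear(hidden, act_dim)              # mean of the action distribution
        self.log_std = nn.Parameter(torch.zeros(act_dim))   # learned (state-independent) std

    def forward(self, obs):
        h = self.backbone(obs)
        return torch.distributions.Normal(self.mean(h), self.log_std.exp())

policy = GaussianPolicy(obs_dim=8, act_dim=1)
dist = policy(torch.randn(1, 8))
action = dist.sample()            # e.g. a steering angle, sampled rather than arg-maxed
log_prob = dist.log_prob(action)  # used later by the policy-gradient update
```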
💡
In value-based methods, we use an aggressive operator to change the value function: we take the maximum over Q-estimates. Consequently, the action probabilities may change dramatically for an arbitrarily small change in the estimated action values if that change results in a different action having the maximal value.
For instance, if during the training the best action was left (with a Q-value of 0.22) and the training step after it’s right (since the right Q-value becomes 0.23), we dramatically changed the policy, since now it will take right most of the time instead of left.
On the other hand, in policy-gradient methods, stochastic policy action preferences (probability of taking action) change smoothly over time.
 
Training objective:
Our goal with policy-gradient is to control the probability distribution of actions by tuning the policy such that good actions (that maximize the return) are sampled more frequently in the future
 
Run a complete episode with the current policy and compute the total return.
Based on that total return, decide whether to increase or decrease the output probability of every (state, action) pair along the trajectory (see the REINFORCE sketch at the end of this section).
 
We use J as the objective function to measure the expected total return.
notion image
Probability of each possible trajectory under the model × reward of the entire trajectory,
i.e. a weighted average over all possible trajectories, weighted by how likely the current policy is to produce each of them.
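Written out (the image presumably shows this standard form):

$$
J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\big[R(\tau)\big] \;=\; \sum_{\tau} P(\tau;\theta)\, R(\tau)
$$

where $\tau$ is a trajectory, $P(\tau;\theta)$ its probability under the current policy, and $R(\tau)$ the return of the whole trajectory.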
 
Maximizing this objective gives us the best probability distribution over actions:
notion image
 
notion image
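The images above presumably show the gradient-ascent update and the policy-gradient theorem; in the usual Monte-Carlo (REINFORCE) form:

$$
\nabla_\theta J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\Big[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\Big],
\qquad
\theta \;\leftarrow\; \theta + \alpha \,\nabla_\theta J(\theta)
$$

A minimal REINFORCE-style sketch matching the recipe above (run one episode, compute the total return, then scale every log-probability by it). It reuses the `policy` from the Gaussian-policy sketch earlier; `log_probs` and `rewards` are collected during the episode, and all names are placeholders:

```python
import torch

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(log_probs, rewards, gamma=0.99):
    """One Monte-Carlo policy-gradient step from a single finished episode."""
    # total discounted return of the whole episode
    episode_return = sum(gamma**t * r for t, r in enumerate(rewards))
    # increase the log-probability of every action taken, scaled by the return
    # (a negative return pushes those probabilities down instead)
    loss = -torch.stack(log_probs).sum() * episode_return
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```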
 
 
 
 
 
 
 
 
 
 
 
 
 
 