c++ - Q-learning in a neural network not 'learning'

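I have made a simple Tron game in C++ and an MLP with one hidden layer. I have implemented Q-learning in this neural network, but it does not cause the agent to win more games over time (even after 1 million games). I will try to explain in words what I did, in the hope that someone can spot a mistake that might be causing this problem.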

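In every state there are four possible moves (north, east, south, west), and the reward is given at the end of the game (-1 for a loss, 0 for a draw, 1 for a win).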

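I initialize 4 MLPs, one for each possible action, each with 100 input nodes (the entire 10x10 game grid), where a cell is 1 if the player itself is there, 0 if the cell is empty, and -1 if the opponent has visited it. Then there are 50 hidden nodes and 1 output node (I have also tried a single network with 4 output nodes, but that did not help either). The weights are chosen randomly between -0.5 and 0.5.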
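As a minimal sketch of the input encoding and weight initialization described above (not the asker's actual code): the `Cell` enum, `encodeState`, and `randomWeights` are hypothetical names introduced only for illustration, and the grid is assumed to be stored as a flat array of 100 cells.

```cpp
#include <array>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical cell labels; the question only describes the numeric encoding.
enum class Cell { Empty, Self, Opponent };

constexpr std::size_t kGridSide = 10;
constexpr std::size_t kNumInputs = kGridSide * kGridSide;  // 100 input nodes

// Map the 10x10 grid to the network input:
// 1 for the player's own cells, 0 for empty cells, -1 for opponent cells.
std::array<double, kNumInputs> encodeState(const std::array<Cell, kNumInputs>& grid) {
    std::array<double, kNumInputs> input{};
    for (std::size_t i = 0; i < kNumInputs; ++i) {
        switch (grid[i]) {
            case Cell::Self:     input[i] = 1.0;  break;
            case Cell::Opponent: input[i] = -1.0; break;
            case Cell::Empty:    input[i] = 0.0;  break;
        }
    }
    return input;
}

// Draw `count` weights uniformly from [-0.5, 0.5), matching the range in the question.
std::vector<double> randomWeights(std::size_t count, std::mt19937& rng) {
    std::uniform_real_distribution<double> dist(-0.5, 0.5);
    std::vector<double> weights(count);
    for (double& w : weights) w = dist(rng);
    return weights;
}
```

With 100 inputs and 50 hidden nodes, each of the 4 networks would need 100 * 50 input-to-hidden weights plus 50 hidden-to-output weights (bias terms, if any, are not mentioned in the question).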

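In each epoch I initialize the game environment with 2 agents placed at random positions in the grid. I then run the game in a while loop until it is over, and then reset the game environment. Inside this while loop I do the following.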

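I feed the current state to the MLPs, determine the highest Q-value, and take that move with 90% probability (10% random move). The Q-values are computed with either a sigmoid or a ReLU activation function (I have tried both).
Then I compute the 4 Q-values in the new state and use them to train the network for my first move, with the following target: target = reward + gamma * (maxQnextState). The error is then target - qValue computed in the previous state.
I use backpropagation with the derivative of the sigmoid function, a high learning rate, and a momentum term to propagate the error backwards.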
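The action selection and target computation in the steps above can be sketched as follows. This is only an illustration under stated assumptions, not the asker's implementation: the function names are hypothetical, and the `terminal` flag is an assumption, since the question does not say how the target is formed on the final move, where there is no next state to bootstrap from.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <random>

constexpr std::size_t kNumActions = 4;  // north, east, south, west
using QValues = std::array<double, kNumActions>;

// Index of the largest Q-value (the greedy action).
std::size_t greedyAction(const QValues& q) {
    return static_cast<std::size_t>(std::max_element(q.begin(), q.end()) - q.begin());
}

// Epsilon-greedy: a uniformly random action with probability epsilon,
// otherwise the greedy action. epsilon = 0.1 gives the 90%/10% split.
std::size_t selectAction(const QValues& q, double epsilon, std::mt19937& rng) {
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    if (coin(rng) < epsilon) {
        std::uniform_int_distribution<std::size_t> pick(0, kNumActions - 1);
        return pick(rng);
    }
    return greedyAction(q);
}

// Q-learning target: reward + gamma * max over next-state Q-values.
// Assumption: on a terminal transition the target is the reward alone.
double tdTarget(double reward, double gamma, const QValues& nextQ, bool terminal) {
    if (terminal) return reward;
    return reward + gamma * *std::max_element(nextQ.begin(), nextQ.end());
}

// Error as defined in the question: target minus the Q-value that was
// predicted in the previous state for the action actually taken.
double tdError(double target, double predictedQ) {
    return target - predictedQ;
}
```

Note that with rewards of -1, 0, and 1 the targets range over [-1, 1], whereas a sigmoid output unit can only produce values in (0, 1).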

It looks like my Q-values are either very low (around 0.0001) or very close to 1 (0.999). And if I look at the error term every 10,000 games, it does not seem to be decreasing.

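I started from an MLP that could learn the XOR function, and now use it for Q-learning. Maybe some of the underlying assumptions in the XOR case are different and cause the problem with Q-learning?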

Or maybe it is the sparse input (just 100 values that are each 0, 1, or -1) that makes learning impossible?

Suggestions are much appreciated!

Best answer

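Several factors make it difficult to combine an MLP with Q-learning, especially for a newcomer to the field. Rich Sutton (one of the pioneers of reinforcement learning) has a question in the FAQ on his web site that relates to yours, so I recommend reading that document.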

It is well known that Q-learning with a feedforward neural network as the Q-function approximator can fail even on simple problems [Boyan & Moore, 1995].

A possible explanation is the phenomenon known as interference, described in [Barreto & Anderson, 2008]:

Interference happens when the update of one state–action pair changes the Q-values of other pairs, possibly in the wrong direction.

Interference is naturally associated with generalization, and also happens in conventional supervised learning. Nevertheless, in the reinforcement learning paradigm its effects tend to be much more harmful. The reason for this is twofold. First, the combination of interference and bootstrapping can easily become unstable, since the updates are no longer strictly local. The convergence proofs for the algorithms derived from (4) and (5) are based on the fact that these operators are contraction mappings, that is, their successive application results in a sequence converging to a fixed point which is the solution for the Bellman equation [14,36]. When using approximators, however, this asymptotic convergence is lost, [...]

Another source of instability is a consequence of the fact that in on-line reinforcement learning the distribution of the incoming data depends on the current policy. Depending on the dynamics of the system, the agent can remain for some time in a region of the state space which is not representative of the entire domain. In this situation, the learning algorithm may allocate excessive resources of the function approximator to represent that region, possibly “forgetting” the previous stored information.

In conclusion, starting out with an MLP to approximate the Q-function is not a good idea.

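References

Boyan, J. A. & Moore, A. W. (1995) Generalization in reinforcement learning: Safely approximating the value function. NIPS-7. San Mateo, CA: Morgan Kaufmann.

André da Motta Salles Barreto & Charles W. Anderson (2008) Restricted gradient-descent algorithm for value-function approximation in reinforcement learning, Artificial Intelligence 172 (2008) 454–482.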

Regarding c++ - Q-learning in a neural network not 'learning', we found a similar question on Stack Overflow: https://stackoverflow.com/questions/40129386/
