artificial-intelligence - What is the importance of the reward policy in reinforcement learning?

Tags: artificial-intelligence reinforcement-learning q-learning

We assign a reward of +1 for reaching the goal and a reward of -1 for reaching an undesired state.

Is it also necessary to give a +0.01 reward for actions that move the agent closer to the goal and a -0.01 reward for actions that do not?

What significant differences would the reward policy described above make?
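For concreteness, here is a minimal sketch of the two reward schemes being compared, assuming a simple grid-world task; the goal/trap cells and the Manhattan-distance helper are illustrative assumptions, not part of the original question:

def sparse_reward(state, goal, trap):
    """Scheme 1: reward only the outcome (+1 for the goal, -1 for the undesired state)."""
    if state == goal:
        return 1.0
    if state == trap:
        return -1.0
    return 0.0

def manhattan(a, b):
    """Illustrative distance measure between two grid cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def shaped_reward(prev_state, state, goal, trap):
    """Scheme 2: additionally give +0.01 / -0.01 depending on whether the
    action moved the agent closer to the goal."""
    base = sparse_reward(state, goal, trap)
    if manhattan(state, goal) < manhattan(prev_state, goal):
        return base + 0.01  # the action brought the agent closer to the goal
    return base - 0.01      # the action did not bring the agent closer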

Accepted Answer

From Sutton and Barto's book, Section 3.2, Goals and Rewards:

It is thus critical that the rewards we set up truly indicate what we want accomplished. In particular, the reward signal is not the place to impart to the agent prior knowledge about how to achieve what we want it to do. For example, a chess-playing agent should be rewarded only for actually winning, not for achieving subgoals such as taking its opponent's pieces or gaining control of the center of the board. If achieving these sorts of subgoals were rewarded, then the agent might find a way to achieve them without achieving the real goal. For example, it might find a way to take the opponent's pieces even at the cost of losing the game. The reward signal is your way of communicating to the robot what you want it to achieve, not how you want it achieved.

So, in general, it is a good idea to avoid injecting prior knowledge through the reward function, because it can lead to undesired results.

However, it is also well known that guiding the agent's learning process through the reward function can improve reinforcement learning performance. In fact, in some complex tasks it is necessary to first guide the agent toward a secondary (simpler) goal and then change the reward so that it learns the main goal. This technique is known as reward shaping. An old but interesting example can be found in the paper by Randløv and Alstrøm: Learning to Drive a Bicycle using Reinforcement Learning and Shaping.
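As a rough illustration of the shaping idea, below is a minimal sketch of potential-based reward shaping (the formalization by Ng et al., 1999, under which the added term F(s, s') = gamma * phi(s') - phi(s) provably leaves the optimal policy unchanged). The grid-world potential function is an illustrative assumption, not the scheme used in the bicycle paper:

GAMMA = 0.99  # discount factor of the underlying MDP

def potential(state, goal):
    """phi(s): an illustrative potential, higher for states closer to the goal
    (negative Manhattan distance in a grid world)."""
    return -(abs(state[0] - goal[0]) + abs(state[1] - goal[1]))

def shaped_step_reward(env_reward, prev_state, state, goal, gamma=GAMMA):
    """Return r + F(s, s'), where F(s, s') = gamma * phi(s') - phi(s)."""
    shaping = gamma * potential(state, goal) - potential(prev_state, goal)
    return env_reward + shaping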

Related question on Stack Overflow: https://stackoverflow.com/questions/47133913/
