We assign a reward of +1 for reaching the goal and -1 for ending up in an undesired state.
Is it also worth giving a +0.01 reward for actions that move the agent closer to the goal and a -0.01 reward for actions that do not?
What significant changes would such a reward policy introduce?
Best answer
From Sutton and Barto's book, Section 3.2 Goals and Rewards:
It is thus critical that the rewards we set up truly indicate what we want accomplished. In particular, the reward signal is not the place to impart to the agent prior knowledge about how to achieve what we want it to do. For example, a chess-playing agent should be rewarded only for actually winning, not for achieving subgoals such as taking its opponent's pieces or gaining control of the center of the board. If achieving these sorts of subgoals were rewarded, then the agent might find a way to achieve them without achieving the real goal. For example, it might find a way to take the opponent's pieces even at the cost of losing the game. The reward signal is your way of communicating to the robot what you want it to achieve, not how you want it achieved.
So, in general, it is a good idea to avoid injecting prior knowledge through the reward function, because it can produce undesired behavior.
However, it is well known that guiding the agent's learning process through the reward function can improve reinforcement learning performance. Indeed, in some complex tasks it is necessary to first guide the agent toward secondary (simpler) goals and then change the reward so that it learns the main goal. This technique is known as reward shaping.
An old but interesting example can be found in the paper by Randløv and Alstrøm: Learning to Drive a Bicycle using Reinforcement Learning and Shaping.
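Rather than hand-tuning ad-hoc bonuses like ±0.01, a principled variant is potential-based reward shaping (Ng, Harada and Russell, 1999), which adds F(s, s') = γΦ(s') - Φ(s) to the sparse reward and provably preserves the optimal policy. Below is a minimal sketch on a 1-D gridworld; the names (GOAL, potential, shaped_reward) and the distance-based potential are illustrative assumptions, not code from either paper:

```python
# Potential-based reward shaping sketch on a 1-D gridworld with states 0..10.
# GOAL and the choice of potential function are illustrative assumptions.

GAMMA = 0.99
GOAL = 10  # hypothetical goal state on the line

def sparse_reward(next_state: int) -> float:
    """+1 only at the goal, 0 elsewhere: says *what* to achieve, not *how*."""
    return 1.0 if next_state == GOAL else 0.0

def potential(state: int) -> float:
    """Phi(s): negative distance to the goal (higher = closer)."""
    return -abs(GOAL - state)

def shaped_reward(state: int, next_state: int) -> float:
    """Sparse reward plus the shaping term F(s, s') = gamma * Phi(s') - Phi(s).
    Shaping of this form preserves the optimal policy of the original task."""
    return sparse_reward(next_state) + GAMMA * potential(next_state) - potential(state)

# A step toward the goal earns a small positive bonus,
# a step away earns a negative one, so exploration is guided
# without changing which policy is optimal.
print(shaped_reward(3, 4))  # positive
print(shaped_reward(4, 3))  # negative
```

The key design point is that the shaping term telescopes along any trajectory, so it cannot create the reward-hacking loops the quote above warns about (e.g. collecting pieces while losing the game).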
A similar question on Stack Overflow ("What is the importance of the reward policy in reinforcement learning?"): https://stackoverflow.com/questions/47133913/