r - 如何在R中实现q-learning？

我正在学习 q-learning 并发现了一个维基百科帖子和这个 website .

根据教程和伪代码我用R写了这么多

#q-learning example
#http://mnemstudio.org/path-finding-q-learning-tutorial.htm
#https://en.wikipedia.org/wiki/Q-learning

set.seed(2016)
iter=100
dimension=5;
alpha=0.1 #learning rate
gamma=0.8 #exploration/ discount factor

# n x n matrix
Q=matrix( rep( 0, len=dimension*dimension), nrow = dimension)
Q

# R -1 is fire pit,0 safe path and 100 Goal state########
R=matrix( sample( -1:0, dimension*dimension,replace=T,prob=c(1,2)), nrow = dimension)
R[dimension,dimension]=100
R #reward matrix
################

for(i in 1:iter){
  row=sample(1:dimension,1)
  col=sample(1:dimension,1)
  I=Q[row,col] #randomly choosing initial state

  Q[row,col]=Q[row,col]+alpha*(R[row,col]+gamma*max(Qdash-Q[row,col])
  #equation from wikipedia

}

但是我在 max(Qdash-Q[row,col] 中遇到问题，根据网站，它是 Max[Q(next state, all actions)] 如何我以编程方式搜索下一个状态的所有操作？

第二个问题是这个伪代码

Do While the goal state hasn't been reached.

Select one among all possible actions for the current state.
Using this possible action, consider going to the next state.
Get maximum Q value for this next state based on all possible actions.
Compute: Q(state, action) = R(state, action) + Gamma * Max[Q(next state, all actions)]
Set the next state as the current state.
End Do

是这个吗

while(Q<100){
Q[row,col]=Q[row,col]+alpha*(R[row,col]+gamma*max(Qdash-Q[row,col])
}

最佳答案

这篇文章绝不是 R 中 Q-learning 的完整实现。它试图回答有关文章中链接的网站和维基百科中的算法描述的 OP。

这里的假设是奖励矩阵R正如网站中所描述的那样。也就是说，它将可能的操作的奖励值编码为非负数，矩阵中的 -1 代表空值(即，没有可能的操作可以转换到该状态)。

通过此设置，Q 的 R 实现更新为:

Q[cs,ns] <- Q[cs,ns] + alpha*(R[cs,ns] + gamma*max(Q[ns, which(R[ns,] > -1)]) - Q[cs,ns])

哪里

cs是路径中当前点的当前状态。
ns是基于当前状态(随机)选择的 Action 的新状态。该操作是从当前状态下可能操作的集合中选择的(即， R[cs,] > -1 )。由于这里状态转换本身是确定性的，因此操作就是向新状态的转换。
此操作导致 ns ，我们希望将其最大( future )值添加到 ns 可以采取的所有可能操作中。。这就是所谓的Max[Q(next state, all actions)]链接网站中的术语和维基百科中的“最佳 future 值(value)估计”。为了计算这个，我们想要最大化 ns第 Q 行但仅考虑 Q 的列R 的哪些列对应的ns -th 行是有效操作(即 R[ns,] > -1 )。因此，这是:
```
max(Q[ns, which(R[ns,] > -1)])
```
对该值的解释是一步前瞻值或动态规划中的成本估算。
链接网站中的方程是特殊情况，其中 alpha ，学习率是1 。我们可以将维基百科中的方程视为:
```
Q[cs,ns] <- (1-alpha)*Q[cs,ns] + alpha*(R[cs,ns] + gamma*max(Q[ns, which(R[ns,] > -1)]))
```
哪里alpha在旧值 Q[cs,ns] 之间“插值”和学习值R[cs,ns] + gamma*max(Q[ns, which(R[ns,] > -1)]) 。正如维基百科所述，

In fully deterministic environments, a learning rate of alpha=1 is optimal

将它们全部放在一个函数中:

q.learn <- function(R, N, alpha, gamma, tgt.state) {
  ## initialize Q to be zero matrix, same size as R
  Q <- matrix(rep(0,length(R)), nrow=nrow(R))
  ## loop over episodes
  for (i in 1:N) {
    ## for each episode, choose an initial state at random
    cs <- sample(1:nrow(R), 1)
    ## iterate until we get to the tgt.state
    while (1) {
      ## choose next state from possible actions at current state
      ## Note: if only one possible action, then choose it;
      ## otherwise, choose one at random
      next.states <- which(R[cs,] > -1)
      if (length(next.states)==1)
        ns <- next.states
      else
        ns <- sample(next.states,1)
      ## this is the update
      Q[cs,ns] <- Q[cs,ns] + alpha*(R[cs,ns] + gamma*max(Q[ns, which(R[ns,] > -1)]) - Q[cs,ns])
      ## break out of while loop if target state is reached
      ## otherwise, set next.state as current.state and repeat      
      if (ns == tgt.state) break
      cs <- ns
    }
  }
  ## return resulting Q normalized by max value
  return(100*Q/max(Q))
}

其中输入参数是:

R是博客中定义的奖励矩阵
N是要迭代的剧集数
alpha是学习率
gamma是折扣因子
tgt.state是问题的目标状态。

使用链接网站中的示例作为测试:

N <- 1000
alpha <- 1
gamma <- 0.8
tgt.state <- 6
R <- matrix(c(-1,-1,-1,-1,0,-1,-1,-1,-1,0,-1,0,-1,-1,-1,0,-1,-1,-1,0,0,-1,0,-1,0,-1,-1,0,-1,0,-1,100,-1,-1,100,100),nrow=6)
print(R)
##     [,1] [,2] [,3] [,4] [,5] [,6]
##[1,]   -1   -1   -1   -1    0   -1
##[2,]   -1   -1   -1    0   -1  100
##[3,]   -1   -1   -1    0   -1   -1
##[4,]   -1    0    0   -1    0   -1
##[5,]    0   -1   -1    0   -1  100
##[6,]   -1    0   -1   -1    0  100

Q <- q.learn(R,iter,alpha,gamma,tgt.state)
print(Q)
##     [,1] [,2] [,3] [,4] [,5]      [,6]
##[1,]    0    0  0.0    0   80   0.00000
##[2,]    0    0  0.0   64    0 100.00000
##[3,]    0    0  0.0   64    0   0.00000
##[4,]    0   80 51.2    0   80   0.00000
##[5,]   64    0  0.0   64    0 100.00000
##[6,]    0   80  0.0    0   80  99.99994

关于r - 如何在R中实现q-learning？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/39353580/

r - 如何在R中实现q-learning？

上一篇：java - 如何将给定时间(字符串)转换为本地时间？

下一篇：delphi - 同步通用 TList 和 TListBox 时出现问题