python - Monte Carlo tree search agent in an Isolation game - debugging suggestions

标签 python python-3.x artificial-intelligence monte-carlo-tree-search

TLDR

My MCTS agent implementation runs without errors locally, achieving win rates of >40% against a heuristic-driven minimax agent, but it fails the autograder, which must pass before the project can be submitted. The autograder throws IndexError: Cannot choose from an empty sequence. I'm looking for suggestions on which part of the code is most likely to throw this exception.

Hello, I'm currently stuck on this project, which I need to clear in two weeks' time before my enrollment ends. The task is to implement an agent that plays a game of Isolation between two chess knights against a heuristic-driven minimax agent. Full implementation details of the game can be found here. For my project, the game is played on a 9 x 11 board using bitboard encoding. My MCTS implementation is quite simple and closely follows the pseudocode provided in the paper (page 6).
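
For anyone unfamiliar with the idea, below is a minimal sketch of what a bitboard encoding looks like in general; it is purely illustrative and not necessarily the exact cell-to-bit mapping used by the project's Isolation library.

    # Generic bitboard sketch (illustrative only; the project's Isolation
    # library may use a different cell-to-bit mapping).
    WIDTH, HEIGHT = 11, 9

    def cell_to_bit(row, col):
        # Map a (row, col) cell to a single bit inside one integer.
        return 1 << (row * WIDTH + col)

    # Start with every cell open (all bits set).
    open_cells = (1 << (WIDTH * HEIGHT)) - 1

    # "Using up" a cell clears its bit; testing a cell is a bitwise AND.
    open_cells &= ~cell_to_bit(4, 5)                 # a knight lands on (4, 5)
    is_open = bool(open_cells & cell_to_bit(4, 5))   # now False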

Essentially, the general MCTS approach consists of these 4 parts, each implemented by the following nested functions in the CustomPlayer class:

  1. Selection - tree_policy
  2. Expansion - best_child, expand
  3. Simulation - default_policy
  4. Backpropagation - backup_negamax, update_scores

    import math
    import random
    import time
    import logging
    
    from copy import deepcopy
    from collections import namedtuple
    
    from sample_players import DataPlayer
    
    
    class CustomPlayer(DataPlayer):
        """ Implement your own agent to play knight's Isolation
        The get_action() method is the only required method for this project.
        You can modify the interface for get_action by adding named parameters
        with default values, but the function MUST remain compatible with the
        default interface.
        **********************************************************************
        NOTES:
        - The test cases will NOT be run on a machine with GPU access, nor be
          suitable for using any other machine learning techniques.
        - You can pass state forward to your agent on the next turn by assigning
          any pickleable object to the self.context attribute.
        **********************************************************************
        """
        def get_action(self, state):
            """ Employ an adversarial search technique to choose an action
            available in the current state calls self.queue.put(ACTION) at least
            This method must call self.queue.put(ACTION) at least once, and may
            call it as many times as you want; the caller will be responsible
            for cutting off the function after the search time limit has expired.
            See RandomPlayer and GreedyPlayer in sample_players for more examples.
            **********************************************************************
            NOTE: 
            - The caller is responsible for cutting off search, so calling
              get_action() from your own code will create an infinite loop!
              Refer to (and use!) the Isolation.play() function to run games.
            **********************************************************************
            """
            logging.info("Move %s" % state.ply_count)
            self.queue.put(random.choice(state.actions()))
            i = 1
            statlist = []
    
            # Keep running MCTS iterations until just before the time limit expires
            while (self.queue._TimedQueue__stop_time - 0.05) > time.perf_counter():
                next_action = self.uct_search(state, statlist, i)
                self.queue.put(next_action)
                i += 1
    
    
        def uct_search(self, state, statlist, i):
            plyturn = state.ply_count % 2
            Stat = namedtuple('Stat', 'state action utility visit nround')
    
            def tree_policy(state):
                statecopy = deepcopy(state)
    
                while not statecopy.terminal_test():
                    # All taken actions at this depth
                    tried = [s.action for s in statlist if s.state == statecopy]
                    # See if there's any untried actions left
                    untried = [a for a in statecopy.actions() if a not in tried]
    
                    topop = []
                    toappend = []
    
                    if len(untried) > 0:
                        next_action = random.choice(untried)
                        statecopy = expand(statecopy, next_action)
                        break
                    else:
                        next_action = best_child(statecopy, 1)
    
                        for k, s in enumerate(statlist):
                            if s.state == statecopy and s.action == next_action:
                                visit1 = statlist[k].visit + 1
                                news = statlist[k]._replace(visit=visit1)
                                news = news._replace(nround=i)
    
                                topop.append(k)
                                toappend.append(news)
                                break
    
                        update_scores(topop, toappend)
                        statecopy = statecopy.result(next_action)
                return statecopy
    
    
            def expand(state, action):
                """
                Returns a state resulting from taking an action from the list of untried nodes
                """
                statlist.append(Stat(state, action, 0, 1, i))
                return state.result(action)
    
    
            def best_child(state, c):
                """
                Returns the state resulting from taking the best action. c value between 0 (max score) and 1 (prioritize exploration)
                """
                # All taken actions at this depth
                tried = [s for s in statlist if s.state == state]
    
                maxscore = -999
                maxaction = []
                # Compute the score
                for t in tried:
                    score = (t.utility/t.visit) + c * math.sqrt(2 * math.log(i)/t.visit)
                    if score > maxscore:
                        maxscore = score
                        del maxaction[:]
                        maxaction.append(t.action)
                    elif score == maxscore:
                        maxaction.append(t.action)
    
                if len(maxaction) < 1:
                    logging.error("IndexError: maxaction is empty!")
    
                return random.choice(maxaction)
    
    
            def default_policy(state):
                """
                The simulation to run when visiting unexplored nodes. Defaults to uniform random moves
                """
                while not state.terminal_test():
                    state = state.result(random.choice(state.actions()))
    
                delta = state.utility(self.player_id)
                if abs(delta) == float('inf') and delta < 0:
                    delta = -1
                elif abs(delta) == float('inf') and delta > 0:
                    delta = 1
                return delta
    
    
            def backup_negamax(delta):
                """
                Propagates the terminal utility up the search tree
                """
                topop = []
                toappend = []
                for k, s in enumerate(statlist):
                    if s.nround == i:
                        if s.state.ply_count % 2 == plyturn:
                            utility1 = s.utility + delta
                            news = statlist[k]._replace(utility=utility1)
                        elif s.state.ply_count % 2 != plyturn:
                            utility1 = s.utility - delta
                            news = statlist[k]._replace(utility=utility1)
    
                        topop.append(k)
                        toappend.append(news)
    
                update_scores(topop, toappend)
                return
    
    
            def update_scores(topop, toappend):
                # Remove outdated tuples. Order needs to be in reverse or pop will fail!
                for p in sorted(topop, reverse=True):
                    statlist.pop(p)
                # Add the updated ones
                for a in toappend:
                    statlist.append(a)
                return
    
    
            next_state = tree_policy(state)
            if not next_state.terminal_test():
                delta = default_policy(next_state)
                backup_negamax(delta)
    
            return best_child(state, 0)
    

The lack of syntax highlighting does make the code hard to read, so please feel free to look at my github. I can run games locally without any issues, and my MCTS agent wins more than 40% of its games against the minimax player (within the 150 ms/move limit). However, when I submit the code to the autograder, it is rejected with an IndexError: Cannot choose from an empty sequence exception.

Based on my discussion with the course representative, we believe the error is probably caused by the use of random.choice(). There are 4 places in my implementation where it is used:

  1. In get_action, before the MCTS loop, to feed a random move into the queue
  2. In tree_policy, to randomly pick an action that has not been tried yet
  3. In best_child, to randomly pick one action when several best actions are tied on score
  4. In default_policy, to play out the game randomly until a terminal state is reached from the chosen action

I'm assuming the game implementation is correct, and that calling state.actions() always returns a list of possible moves as long as the state is not terminal. Therefore, the only instance that could trigger this exception is item 3. Items 1 and 4 simply pick at random from the actions available in a non-terminal state, and an explicit check (item 2) makes sure random.choice() is not fed an empty list. So I added logging to item 3 (even though the exception was never raised when running locally), and sure enough, no exception was caught over 50 games.
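
For completeness, a slightly more defensive variant of the check already in best_child could also fall back to a legal move instead of crashing; here is a rough sketch (it reuses the maxaction, tried, and state names from the code above, and falling back to a random legal action is only an illustration, not something the project asks for):

    # Sketch of a defensive fallback at the end of best_child (illustrative only).
    # If statlist holds no entries for this state, maxaction stays empty and
    # random.choice(maxaction) raises "Cannot choose from an empty sequence".
    if not maxaction:
        logging.error("best_child: no recorded stats for state at ply %s (c=%s, tried=%s)",
                      state.ply_count, c, len(tried))
        # Fall back to any legal action instead of crashing (assumes a non-terminal state).
        return random.choice(state.actions())
    return random.choice(maxaction)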

Apologies for the lengthy post, but I really do hope someone out there can catch something I've missed in my implementation.

Best answer

I would suggest writing unit tests for each function you have. Verify your assumptions about how the code behaves, and try to test it thoroughly, including corner cases. Just having to feed in abstract test data often reveals a lot about the architecture and details of a solution.
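
For example, the UCB1 score computed inside best_child could be pulled out into a standalone helper and exercised with hand-picked numbers. A minimal sketch with unittest follows; the ucb1 helper is a hypothetical refactoring of your scoring line, not part of your original code.

    import math
    import unittest

    def ucb1(utility, visits, total_visits, c):
        # Hypothetical standalone helper mirroring the score used in best_child.
        return utility / visits + c * math.sqrt(2 * math.log(total_visits) / visits)

    class UCB1Test(unittest.TestCase):
        def test_exploitation_only(self):
            # With c=0 the best child is simply the one with the highest average utility.
            self.assertGreater(ucb1(3, 3, 10, 0), ucb1(1, 3, 10, 0))

        def test_exploration_bonus_favours_rarely_visited_nodes(self):
            # With c=1 a node visited only once can outrank a heavily visited one.
            self.assertGreater(ucb1(0, 1, 100, 1), ucb1(2, 50, 100, 1))

    if __name__ == "__main__":
        unittest.main()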

In addition, you could try having your agent play any other suitable game. These steps will give you a chance to find any bugs in your code and make it better suited for reuse in the future.

The original question, python - Monte Carlo tree search agent in an Isolation game - debugging suggestions, can be found on Stack Overflow: https://stackoverflow.com/questions/55117954/
