python - Building a DataFrame from a .pgn file (lines of strings) using re

Tags: python regex string pandas

I have a single .pgn (Portable Game Notation) file containing a large number of chess games. The games are stored in the file like this:

    [Event "FIDE World Cup 2017"]
    [Site "Tbilisi GEO"]
    [Date "2017.09.05"]
    [Round "1.1"]
    [White "Carlsen, Magnus"]
    [Black "Balogun, Oluwafemi"]
    [Result "1-0"]
    [WhiteTitle "GM"]
    [BlackTitle "FM"]
    [WhiteElo "2822"]
    [BlackElo "2255"]
    [ECO "B00"]
    [Opening "King's pawn opening"]
    [WhiteFideId "1503014"]
    [BlackFideId "8501246"]
    [EventDate "2017.09.03"]

    1. e4 d6 2. d4 g6 3. Bc4 Nf6 4. Qe2 Nc6 5. Nf3 Bg7 6. O-O Bg4 7. c3 O-O         
    8. h3 Bxf3 9. Qxf3 e5 10. Rd1 Qe8 11. d5 Ne7 12. Qe2 Nh5 13. Bb5 Qc8 
    14. Na3 a6 15. Ba4 f5 16. Bc2 f4 17. Qg4 Qxg4 18. hxg4 Nf6 19. g5 Nd7 
    20. Nc4 b6 21. b4 h6 22. gxh6 Bxh6 23. g4 Nf6 24. f3 Bg5 25. Kg2 Kg7 
    26. a4 Bh4 27. Bd2 g5 28. Rh1 Ng6 29. Kf1 Rh8 30. Ke2 Bg3 31. a5 b5 32. 
    Na3 Ne7 33. c4 c6 34. dxc6 Nxc6 35. Bc3 Rxh1 36. Rxh1 bxc4 37. Nxc4 Rb8 
    38. Nxd6 Kg6 39. Nf5 1-0

    [Event "FIDE World Cup 2018"]    
    etc...

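I want to build a DataFrame from this data, where the column headers are the words repeated inside the brackets, such as "Event" or "Site", and the data are the values inside the quotes. I would also like to include the moves, of course, which are not inside quotes or brackets...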

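I thought I could use the re module to first build a dictionary and then convert the dictionary to a DataFrame, but I have not managed to do it. Could you help me?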

Nested_dict = {
  "Game1": {"Event" : "FIDE World Cup 2017" , "Site" : "Tbilisi GEO" , "moves": "1. e4 d6"},
  "Game2": {"Event" : "FIDE World Cup 2018" , "Site" : "Astana GEO" , "moves": "1. e4 e5"}
}

I have managed to get all the "values" for the dictionary, but I cannot get the "keys":

import re

with open('lichess_game.pgn', 'r') as in_file:
    stripped = (line.strip() for line in in_file)
    lines = ((line.split(",") for line in stripped if line))
    for line in lines:
        stripped=str(line)
        stripped = (stripped.replace("]",""))
        stripped=stripped.replace("[","")
        values=str(re.findall(r'["](.*?)["]',stripped)) #Is there something like re.findall(not(condition))? I mean, find everything that is not inside this condition
        values = (values.replace("]",'"')) #Could I replace two different characters at the same time?
        values=values.replace("[",'"')
        key=stripped.replace((values),"") #this is me trying to get the key from the string minus the values....
        print(values) #Right!
        print(key)    #Nope

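The final DataFrame should have one row per game and one column per piece of game information. For example, column 1 is "Event" and its first row is "FIDE World Cup 2017":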

Event               Site    Date       Round White           Black  
FIDE World Cup 2017 Tbilisi 2017.09.05 1.1   Carlsen, Magnus Balogun, Oluwafemi

Thank you very much for your help!!

Best answer

I use the empty lines and the variable moves to recognize whether a line begins the moves or begins a new game.

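For the tag lines, I use [1:-1] to remove the [ ] and, later, the "". I use split(' ', 1) to split the text inside [ ] into two elements, which I use as the key and the value in the dict.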
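Since the question asks specifically about re, the same key/value split can also be sketched with one regular expression. This is an alternative to the slicing described above, not what the answer's code does; the names TAG_RE and parse_tag are made up for this illustration, and the pattern assumes one tag per line with the value in double quotes.

```python
import re

# Pattern for one PGN tag line such as [White "Carlsen, Magnus"]
# group 1 = tag name, group 2 = text between the outer double quotes
# \s* (rather than a literal space) also tolerates a missing space after the name
TAG_RE = re.compile(r'^\[(\w+)\s*"(.*)"\]$')

def parse_tag(line):
    """Return (key, value) for a tag line, or None if the line is not a tag."""
    match = TAG_RE.match(line.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)

print(parse_tag('[White "Carlsen, Magnus"]'))
print(parse_tag('1. e4 d6 2. d4 g6'))
```

The first call yields the pair to store in the dict; the second returns None, so the caller can treat the line as part of the moves.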

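For the moves, I keep the lines in a list, and when a new game starts I join them and put the result in the dictionary. When a new game starts, I also create a fresh dict for the new row and a fresh list for the new moves.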
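The joining step can be looked at in isolation. The sketch below uses shortened fragments of the move lines from the sample game; join_moves is a hypothetical helper name, not part of the answer's code.

```python
# Shortened move lines as they appear in the file: indented, padded with
# trailing spaces, and wrapped so that "32." is separated from its move
raw_lines = [
    "    1. e4 d6 2. d4 g6 3. Bc4 Nf6         ",
    "    31. a5 b5 32. ",
    "    Na3 Ne7 33. c4 c6",
]

def join_moves(lines):
    """Strip each line and join the non-empty ones with a single space."""
    return ' '.join(line.strip() for line in lines if line.strip())

moves = join_moves(raw_lines)
print(moves)
```

Because every line is stripped before joining, a move number that was wrapped onto the previous line ends up next to its move again, separated by a single space.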

I use the json module only to display the dictionary in a more readable way. It is not needed for processing the data.

text = '''[Event "FIDE World Cup 2017"]
    [Site "Tbilisi GEO"]
    [Date "2017.09.05"]
    [Round "1.1"]
    [White "Carlsen, Magnus"]
    [Black "Balogun, Oluwafemi"]
    [Result "1-0"]
    [WhiteTitle "GM"]
    [BlackTitle "FM"]
    [WhiteElo "2822"]
    [BlackElo "2255"]
    [ECO "B00"]
    [Opening "King's pawn opening"]
    [WhiteFideId "1503014"]
    [BlackFideId "8501246"]
    [EventDate "2017.09.03"]

    1. e4 d6 2. d4 g6 3. Bc4 Nf6 4. Qe2 Nc6 5. Nf3 Bg7 6. O-O Bg4 7. c3 O-O         
    8. h3 Bxf3 9. Qxf3 e5 10. Rd1 Qe8 11. d5 Ne7 12. Qe2 Nh5 13. Bb5 Qc8 
    14. Na3 a6 15. Ba4 f5 16. Bc2 f4 17. Qg4 Qxg4 18. hxg4 Nf6 19. g5 Nd7 
    20. Nc4 b6 21. b4 h6 22. gxh6 Bxh6 23. g4 Nf6 24. f3 Bg5 25. Kg2 Kg7 
    26. a4 Bh4 27. Bd2 g5 28. Rh1 Ng6 29. Kf1 Rh8 30. Ke2 Bg3 31. a5 b5 32. 
    Na3 Ne7 33. c4 c6 34. dxc6 Nxc6 35. Bc3 Rxh1 36. Rxh1 bxc4 37. Nxc4 Rb8 
    38. Nxd6 Kg6 39. Nf5 1-0

[Event "FIDE World Cup 2017"]
    [Site "Tbilisi GEO"]
    [Date "2017.09.05"]
    [Round "1.1"]
    [White "Carlsen, Magnus"]
    [Black "Balogun, Oluwafemi"]
    [Result "1-0"]
    [WhiteTitle "GM"]
    [BlackTitle "FM"]
    [WhiteElo "2822"]
    [BlackElo "2255"]
    [ECO "B00"]
    [Opening "King's pawn opening"]
    [WhiteFideId "1503014"]
    [BlackFideId "8501246"]
    [EventDate "2017.09.03"]

    1. e4 d6 2. d4 g6 3. Bc4 Nf6 4. Qe2 Nc6 5. Nf3 Bg7 6. O-O Bg4 7. c3 O-O         
    8. h3 Bxf3 9. Qxf3 e5 10. Rd1 Qe8 11. d5 Ne7 12. Qe2 Nh5 13. Bb5 Qc8 
    14. Na3 a6 15. Ba4 f5 16. Bc2 f4 17. Qg4 Qxg4 18. hxg4 Nf6 19. g5 Nd7 
    20. Nc4 b6 21. b4 h6 22. gxh6 Bxh6 23. g4 Nf6 24. f3 Bg5 25. Kg2 Kg7 
    26. a4 Bh4 27. Bd2 g5 28. Rh1 Ng6 29. Kf1 Rh8 30. Ke2 Bg3 31. a5 b5 32. 
    Na3 Ne7 33. c4 c6 34. dxc6 Nxc6 35. Bc3 Rxh1 36. Rxh1 bxc4 37. Nxc4 Rb8 
    38. Nxd6 Kg6 39. Nf5 1-0    
    '''

all_rows = dict()

# create dict/list for first game
row = dict()
moves = False
moves_lines = list()
game = 1

for line in text.split('\n'):
    line = line.strip()
    if line:
        if moves:
            moves_lines.append(line)
        elif line.startswith('['):
            line = line[1:-1]
            parts = line.split(' ', 1)
            parts[1] = parts[1][1:-1]
            row[parts[0]] = parts[1]
    else:
        # if empty line then it is beginning of moves or beginning of new game
        if not moves:
            # start lines with moves
            moves = True
        else:
            # end of lines with moves - so it is time to add it to dict
            row['moves'] = ' '.join(moves_lines)
            all_rows['Game{}'.format(game)] = row

            # create dict/list for next game
            row = dict()
            moves = False
            moves_lines = list()
            game += 1

# after the loop there may be a last row that has not been added yet
if row:
    row['moves'] = ' '.join(moves_lines)
    all_rows['Game{}'.format(game)] = row

import json
print(json.dumps(all_rows, indent=2))

Result (formatted as JSON)

{
  "Game1": {
    "Event": "FIDE World Cup 2017",
    "Site": "Tbilisi GEO",
    "Date": "2017.09.05",
    "Round": "1.1",
    "White": "Carlsen, Magnus",
    "Black": "Balogun, Oluwafemi",
    "Result": "1-0",
    "WhiteTitle": "GM",
    "BlackTitle": "FM",
    "WhiteElo": "2822",
    "BlackElo": "2255",
    "ECO": "B00",
    "Opening": "King's pawn opening",
    "WhiteFideId": "1503014",
    "BlackFideId": "8501246",
    "EventDate": "2017.09.03",
    "moves": "1. e4 d6 2. d4 g6 3. Bc4 Nf6 4. Qe2 Nc6 5. Nf3 Bg7 6. O-O Bg4 7. c3 O-O 8. h3 Bxf3 9. Qxf3 e5 10. Rd1 Qe8 11. d5 Ne7 12. Qe2 Nh5 13. Bb5 Qc8 14. Na3 a6 15. Ba4 f5 16. Bc2 f4 17. Qg4 Qxg4 18. hxg4 Nf6 19. g5 Nd7 20. Nc4 b6 21. b4 h6 22. gxh6 Bxh6 23. g4 Nf6 24. f3 Bg5 25. Kg2 Kg7 26. a4 Bh4 27. Bd2 g5 28. Rh1 Ng6 29. Kf1 Rh8 30. Ke2 Bg3 31. a5 b5 32. Na3 Ne7 33. c4 c6 34. dxc6 Nxc6 35. Bc3 Rxh1 36. Rxh1 bxc4 37. Nxc4 Rb8 38. Nxd6 Kg6 39. Nf5 1-0"
  },
  "Game2": {
    "Event": "FIDE World Cup 2017",
    "Site": "Tbilisi GEO",
    "Date": "2017.09.05",
    "Round": "1.1",
    "White": "Carlsen, Magnus",
    "Black": "Balogun, Oluwafemi",
    "Result": "1-0",
    "WhiteTitle": "GM",
    "BlackTitle": "FM",
    "WhiteElo": "2822",
    "BlackElo": "2255",
    "ECO": "B00",
    "Opening": "King's pawn opening",
    "WhiteFideId": "1503014",
    "BlackFideId": "8501246",
    "EventDate": "2017.09.03",
    "moves": "1. e4 d6 2. d4 g6 3. Bc4 Nf6 4. Qe2 Nc6 5. Nf3 Bg7 6. O-O Bg4 7. c3 O-O 8. h3 Bxf3 9. Qxf3 e5 10. Rd1 Qe8 11. d5 Ne7 12. Qe2 Nh5 13. Bb5 Qc8 14. Na3 a6 15. Ba4 f5 16. Bc2 f4 17. Qg4 Qxg4 18. hxg4 Nf6 19. g5 Nd7 20. Nc4 b6 21. b4 h6 22. gxh6 Bxh6 23. g4 Nf6 24. f3 Bg5 25. Kg2 Kg7 26. a4 Bh4 27. Bd2 g5 28. Rh1 Ng6 29. Kf1 Rh8 30. Ke2 Bg3 31. a5 b5 32. Na3 Ne7 33. c4 c6 34. dxc6 Nxc6 35. Bc3 Rxh1 36. Rxh1 bxc4 37. Nxc4 Rb8 38. Nxd6 Kg6 39. Nf5 1-0"
  }
}
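The question asks for a DataFrame, and a nested dictionary of this shape maps onto one directly. The following is a minimal sketch, assuming pandas is installed; the all_rows literal is a shortened stand-in for the full dictionary shown above, and the numeric conversion of WhiteElo is an optional extra step that the answer does not perform.

```python
import pandas as pd

# Shortened stand-in for the all_rows dictionary built by the loop
all_rows = {
    "Game1": {"Event": "FIDE World Cup 2017", "Site": "Tbilisi GEO",
              "WhiteElo": "2822", "moves": "1. e4 d6 2. d4 g6"},
    "Game2": {"Event": "FIDE World Cup 2017", "Site": "Tbilisi GEO",
              "WhiteElo": "2822", "moves": "1. e4 d6 2. d4 g6"},
}

# orient='index' turns each outer key (Game1, Game2, ...) into a row
# and each inner key (Event, Site, ...) into a column
df = pd.DataFrame.from_dict(all_rows, orient='index')

# Tag values are parsed as strings; convert numeric columns explicitly
df['WhiteElo'] = pd.to_numeric(df['WhiteElo'])

print(df[['Event', 'Site', 'WhiteElo']])
```

Games that lack a given tag would simply get a missing value in that column, so the games do not all need the same set of tags.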

Regarding python - Building a DataFrame from a .pgn file (lines of strings) using re, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58598314/
