python - How to efficiently parse an HTML list into a dictionary?

Tags: python beautifulsoup

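I'd like to know how to simplify this confusing code and get the output into a nice dictionary instead of a list of tuples. Can I use BeautifulSoup in a better way, and if so, how?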

from bs4 import BeautifulSoup as soup
import requests


data = []
sample = []

player_page = requests.get('https://www.premierleague.com/players/10483/Rolando-Aarons/stats')
cont = soup(player_page.content)
for strong_tag in cont.find_all('span', 'stat'):
    sample.append(strong_tag.text)
    tempStats = [x.replace("\r\n",",") for x in sample]
    tempStats = [x.replace("\n","") for x in tempStats]
    tempStats = [x.replace(" ","") for x in tempStats]
    tempStats = [i.split(',', 1) for i in tempStats] 
    tempStats = list(map(lambda sublist: tuple(map(str, sublist)), tempStats))
    tempStats = [tuple(int(item) if item.strip().isnumeric() else item for item in group) for group in tempStats]   
data.append(tempStats)
print(data) 

The output I want looks like this:

PlayerName {stat1: 1, stat2: 2, stat3: 3, etc.}

The reason for this structure is that I can then pull a specific key from several players and compare the values.

Best answer

This script creates a dictionary of all the stats on the page:

from pprint import pprint

import requests
from bs4 import BeautifulSoup as soup

player_page = requests.get('https://www.premierleague.com/players/10483/Rolando-Aarons/stats')
cont = soup(player_page.content, 'lxml')

# Each span.stat holds the stat name as its first child and the value in a nested span.
labels = cont.select('.topStat span.stat, .normalStat span.stat')
values = cont.select('.topStat span.stat > span, .normalStat span.stat > span')

data = {k.contents[0].strip(): v.get_text(strip=True) for k, v in zip(labels, values)}

pprint(data)

Prints:

{'Accurate long balls': '8',
 'Aerial battles lost': '12',
 'Aerial battles won': '7',
 'Appearances': '18',
 'Assists': '1',
 'Big chances created': '1',
 'Big chances missed': '0',
 'Blocked shots': '2',
 'Clearances': '11',
 'Cross accuracy %': '21%',
 'Crosses': '19',
 'Duels lost': '67',
 'Duels won': '54',
 'Errors leading to goal': '1',
 'Fouls': '11',
 'Freekicks scored': '0',
 'Goals': '2',
 'Goals per match': '0.11',
 'Goals with left foot': '1',
 'Goals with right foot': '0',
 'Headed Clearance': '6',
 'Headed goals': '1',
 'Hit woodwork': '1',
 'Interceptions': '8',
 'Losses': '12',
 'Offsides': '1',
 'Passes': '197',
 'Passes per match': '10.94',
 'Penalties scored': '0',
 'Recoveries': '43',
 'Red cards': '0',
 'Shooting accuracy %': '27%',
 'Shots': '11',
 'Shots on target': '3',
 'Successful 50/50s': '14',
 'Tackle success %': '70%',
 'Tackles': '20',
 'Through balls': '0',
 'Wins': '3',
 'Yellow cards': '2'}
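
Note that every value in this dictionary is a string, while the output requested in the question uses numbers. A minimal sketch of a conversion step, assuming the values always look like the ones printed above (integers, decimals, or percentages). `to_number` is a hypothetical helper, not part of BeautifulSoup, and stripping thousands separators is an assumption about values not shown here:

```python
def to_number(value):
    """Convert a scraped stat string to int or float; return it unchanged if it is not numeric."""
    # Drop a trailing percent sign and (assumed) thousands separators.
    text = value.strip().rstrip('%').replace(',', '')
    try:
        return int(text)
    except ValueError:
        pass
    try:
        return float(text)
    except ValueError:
        return value


# A few entries copied from the output above.
sample = {'Appearances': '18', 'Goals per match': '0.11', 'Cross accuracy %': '21%'}
numeric = {name: to_number(value) for name, value in sample.items()}
print(numeric)
```

Applied to the full dictionary, the same comprehension would give values that can be compared with `<` and `>` directly.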

Edit: To create a dictionary containing the player's name and his data, you can do this (data comes from the script above):

players = {cont.select_one('.playerDetails .name').get_text(strip=True): data}

from pprint import pprint
pprint(players)

Prints:

{'Rolando Aarons': {'Accurate long balls': '8',
                    'Aerial battles lost': '12',
                    'Aerial battles won': '7',
                    'Assists': '1',
                    'Big chances created': '1',
                    'Big chances missed': '0',
                    'Blocked shots': '2',
...and so on.
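
With one such entry per player, a single stat can be pulled out and compared across players, which is the use case described in the question. A sketch that needs no network access: `stat_for_all` is a hypothetical helper, the second player and all of that player's numbers are invented for illustration, and the values are written here as integers rather than the strings the scraper returns:

```python
def stat_for_all(players, stat):
    """Return {player name: value} for one stat, skipping players that lack it."""
    return {name: stats[stat] for name, stats in players.items() if stat in stats}


# The 'Rolando Aarons' values are taken from the output above;
# 'Example Player' is made up purely for illustration.
players = {
    'Rolando Aarons': {'Goals': 2, 'Assists': 1, 'Tackles': 20},
    'Example Player': {'Goals': 5, 'Assists': 3},
}

goals = stat_for_all(players, 'Goals')
top_scorer = max(goals, key=goals.get)
print(goals)
print(top_scorer)
```

Skipping players that lack a stat avoids a KeyError when the set of stats differs between pages.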

Regarding "python - How to efficiently parse an HTML list into a dictionary?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/57514897/
