python - Python 中的 HTML 文件解析

标签 python html beautifulsoup nltk

我有一个很长的 html 文件,看起来完全像这样 - html file 。我希望能够解析该文件,以便获得 tuple 表单中的信息。

示例:

<tr>
      <td>Cech</td>
      <td>Chelsea</td>
      <td>30</td>
      <td>£6.4</td>
</tr>

以上信息将类似于 ("Cech", "Chelsea", 30, 6.4) 。但是,如果您仔细观察 link我发布了,我发布的 html 示例位于 <h2>Goalkeepers</h2> 下标签。我也需要这个标签。所以基本上结果元组看起来像 ("Cech", "Chelsea", 30, 6.4, Goalkeepers) 。在文件的更下方,有一群玩家属于 <h2>中场、后卫和前锋的标签。

我尝试使用 beautifulsoup 和 ntlk 库,但迷失了。所以现在我有以下代码:

import nltk
from urllib import urlopen

url = "http://fantasy.premierleague.com/player-list/"
html = urlopen(url).read()
raw = nltk.clean_html(html)
print raw

它只是删除 html 文件中的所有标签并给出如下内容:

          Cech
          Chelsea
          30
          £6.4

尽管我可以编写一段糟糕的代码来读取每一行并将其分配给一个元组。我无法想出任何可以包含玩家位置的解决方案( <h2> 标签中存在的字符串)。任何解决方案/建议将不胜感激。

我倾向于使用元组的原因是我可以使用解包并计划使用解包值填充 MySQl 表。

最佳答案

from bs4 import BeautifulSoup
from pprint import pprint

soup = BeautifulSoup(html)
h2s = soup.select("h2") #get all h2 elements
tables = soup.select("table") #get all tables

first = True
title =""
players = []
for i,table in enumerate(tables):
    if first:
         #every h2 element has 2 tables. table size = 8, h2 size = 4
         #so for every 2 tables 1 h2
         title =  h2s[int(i/2)].text
    for tr in table.select("tr"):
        player = (title,) #create a player
        for td in tr.select("td"):
            player = player + (td.text,) #add td info in the player
        if len(player) > 1: 
            #If the tr contains a player and its not only ("Goalkeaper") add it
            players.append(player)
    first = not first
pprint(players)

输出:

[('Goalkeepers', 'Cech', 'Chelsea', '30', '£6.4'),
 ('Goalkeepers', 'Hart', 'Man City', '28', '£6.4'),
 ('Goalkeepers', 'Krul', 'Newcastle', '21', '£5.0'),
 ('Goalkeepers', 'Ruddy', 'Norwich', '25', '£5.0'),
 ('Goalkeepers', 'Vorm', 'Swansea', '19', '£5.0'),
 ('Goalkeepers', 'Stekelenburg', 'Fulham', '6', '£4.9'),
 ('Goalkeepers', 'Pantilimon', 'Man City', '0', '£4.9'),
 ('Goalkeepers', 'Lindegaard', 'Man Utd', '0', '£4.9'),
 ('Goalkeepers', 'Butland', 'Stoke City', '0', '£4.9'),
 ('Goalkeepers', 'Foster', 'West Brom', '13', '£4.9'),
 ('Goalkeepers', 'Viviano', 'Arsenal', '0', '£4.8'),
 ('Goalkeepers', 'Schwarzer', 'Chelsea', '0', '£4.7'),
 ('Goalkeepers', 'Boruc', 'Southampton', '42', '£4.7'),
 ('Goalkeepers', 'Myhill', 'West Brom', '15', '£4.5'),
 ('Goalkeepers', 'Fabianski', 'Arsenal', '0', '£4.4'),
 ('Goalkeepers', 'Gomes', 'Tottenham', '0', '£4.4'),
 ('Goalkeepers', 'Friedel', 'Tottenham', '0', '£4.4'),
 ('Goalkeepers', 'Henderson', 'West Ham', '0', '£4.0'),
 ('Defenders', 'Baines', 'Everton', '43', '£7.7'),
 ('Defenders', 'Vertonghen', 'Tottenham', '34', '£7.0'),
 ('Defenders', 'Taylor', 'Cardiff City', '14', '£4.5'),
 ('Defenders', 'Zverotic', 'Fulham', '0', '£4.5'),
 ('Defenders', 'Davies', 'Hull City', '28', '£4.5'),
 ('Defenders', 'Flanagan', 'Liverpool', '0', '£4.5'),
 ('Defenders', 'Dawson', 'West Brom', '0', '£3.9'),
 ('Defenders', 'Potts', 'West Ham', '0', '£3.9'),
 ('Defenders', 'Spence', 'West Ham', '0', '£3.9'),
 ('Midfielders', 'Özil', 'Arsenal', '24', '£10.6'),
 ('Midfielders', 'Redmond', 'Norwich', '20', '£5.0'),
 ('Midfielders', 'Mavrias', 'Sunderland', '5', '£5.0'),
 ('Midfielders', 'Gera', 'West Brom', '0', '£5.0'),
 ('Midfielders', 'Essien', 'Chelsea', '0', '£4.9'),
 ('Midfielders', 'Brown', 'West Brom', '0', '£4.3'),
 ('Forwards', 'van Persie', 'Man Utd', '24', '£13.9'),
 ('Forwards', 'Cornelius', 'Cardiff City', '1', '£5.4'),
 ('Forwards', 'Elmander', 'Norwich', '7', '£5.4'),
 ('Forwards', 'Murray', 'Crystal Palace', '0', '£5.3'),
 ('Forwards', 'Vydra', 'West Brom', '2', '£5.3'),
 ('Forwards', 'Proschwitz', 'Hull City', '0', '£4.3')]

关于python - Python 中的 HTML 文件解析,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/19460403/

相关文章:

html - 为什么没有边距 :auto tag work on the "title" div?

html - 水平对齐 <label> 标签下的元素

python - 在 Django 应用程序的 Celery 任务中使用事务会导致问题吗?

python - Python结构模式匹配中如何区分元组和列表?

python - Django 在模板内进行查询

python - 使用 Pandas Python 更改数据框中数据透视数据的数据格式

javascript - 当脚本是从加载的脚本动态创建的 DOM 节点时,脚本 onload 和 window.onload 的顺序是否定义明确?

python - 使用选择器获取不同 "h"标签的内容时遇到问题

python - BeautifulSoup 检索图像 src 属性并进行比较时出现问题

python - 刮取巫妖的招式