python - Extracting HTML with BeautifulSoup

Tags: python html web-scraping beautifulsoup

I am trying to extract data from the HTML of the following site:

http://www.irishrugby.ie/guinnesspro12/results_and_fixtures_pro_12_section.php

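I want to extract the team names and scores. For example, the first match is Connacht vs Newport Gwent Dragons.

I would also like my Python program to print the result, i.e. Connacht Rugby 29 - 23 Newport Gwent Dragons

This is the HTML I want to extract from: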

<!-- 207974 sfms -->
<tr class="odd match-result group_celtic_league" id="fixturerow0" onclick="if( clickpriority == 0 ) { redirect('/guinnesspro12/35435.php') }" onmouseout="className='odd match-result group_celtic_league';" onmouseover="clickpriority=0; className='odd match-result group_celtic_league rollover';" style="">
 <td class="field_DateShort" style="">
  Fri 4 Sep
 </td>
 <td class="field_TimeLong" style="">
  19:30
 </td>
 <td class="field_CompStageAbbrev" style="">
  PRO12
 </td>
 <td class="field_LogoTeamA" style="">
  <img alt="Connacht Rugby" height="50" src="http://cdn.soticservers.net/tools/images/teams/logos/50x50/16.png" width="50"/>
 </td>
 <td class="field_HomeDisplay" style="">
  Connacht Rugby
 </td>
 <td class="field_Score" style="">
  29 - 23
 </td>
 <td class="field_AwayDisplay" style="">
  Newport Gwent Dragons
 </td>
 <td class="field_LogoTeamB" style="">
  <img alt="Newport Gwent Dragons" height="50" src="http://cdn.soticservers.net/tools/images/teams/logos/50x50/19.png" width="50"/>
 </td>
 <td class="field_HA" style="">
  H
 </td>
 <td class="field_OppositionDisplay" style="">
  <br/>
 </td>
 <td class="field_ResScore" style="">
  W 29-23
 </td>
 <td class="field_VenName" style="">
  Sportsground
 </td>
 <td class="field_BroadcastAttend" style="">
  3,624
 </td>
 <td class="field_Links" style="">
  <a href="/guinnesspro12/35435.php" onclick="clickpriority=1">
   Report
  </a>
 </td>
</tr>

This is my program so far:

from httplib2 import Http
from bs4 import BeautifulSoup
# create a "web object"
h = Http()

# Request the specified web page
response, content = h.request('http://www.irishrugby.ie/guinnesspro12/results_and_fixtures_pro_12_section.php')

# display the response status
print(response.status)

# display the text of the web page
print(content.decode())

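# name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(content, 'html.parser')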

# check the response
if response.status == 200:
    #print(soup.get_text())

    rows = soup.find_all('tr')[1:-2]

    for row in rows:
        data = row.find_all('td')
        #print(data)

else:
    print('Unable to connect:', response.status)
    print(soup.get_text()) 

Best answer

Instead of finding all <td> tags, you should be more specific. I would change this:

for row in rows:
    data = row.find_all('td')

to this:

for row in rows:
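    # look up each cell by its class; every attrs dict needs its closing brace
    home = row.find("td", attrs={"class": "field_HomeDisplay"})
    score = row.find("td", attrs={"class": "field_Score"})
    away = row.find("td", attrs={"class": "field_AwayDisplay"})
    # strip=True drops the newlines and indentation around each cell's text
    print(home.get_text(strip=True) + " " + score.get_text(strip=True) + " " + away.get_text(strip=True))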
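For reference, here is a minimal self-contained sketch of the same class-based lookup, run against an inline sample instead of the live site. The SAMPLE_HTML string and the extract_results helper are hypothetical names made up for this illustration. The sample is a trimmed-down copy of the row from the question, plus an assumed header row to show that find() returns None when a cell is missing, so such rows need to be skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down sample modelled on the row shown in the question.
# The header row is an assumption, added to exercise the None check below.
SAMPLE_HTML = """
<table>
 <tr><th>Date</th><th>Home</th><th>Score</th><th>Away</th></tr>
 <tr class="odd match-result group_celtic_league" id="fixturerow0">
  <td class="field_DateShort">Fri 4 Sep</td>
  <td class="field_HomeDisplay">
   Connacht Rugby
  </td>
  <td class="field_Score">
   29 - 23
  </td>
  <td class="field_AwayDisplay">
   Newport Gwent Dragons
  </td>
 </tr>
</table>
"""


def extract_results(html):
    """Return one 'home score away' string per row that has all three cells."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for row in soup.find_all("tr"):
        home = row.find("td", attrs={"class": "field_HomeDisplay"})
        score = row.find("td", attrs={"class": "field_Score"})
        away = row.find("td", attrs={"class": "field_AwayDisplay"})
        # find() returns None when a cell is absent (e.g. header rows), so skip those rows.
        if home is None or score is None or away is None:
            continue
        # strip=True removes the surrounding newlines and indentation.
        results.append(" ".join(cell.get_text(strip=True) for cell in (home, score, away)))
    return results


for line in extract_results(SAMPLE_HTML):
    print(line)
```

Checking for None also makes the [1:-2] slice from the question less critical, since rows without the three cells are ignored rather than raising an AttributeError.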

Regarding python - Extracting HTML with BeautifulSoup, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/33501554/
