python - How to parse text from an HTML table element

Tags: python html xpath python-requests lxml

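I am currently writing a small test web scraper with the Python requests and lxml libraries. I am trying to extract text from the rows of a table on this site, using XPath to uniquely identify the table. Since the table itself can only be identified by its class name, and that class name is not unique, I have to use the parent div element to specify the table. The table in question lists the seasons, filming dates, and air dates of Game of Thrones, and I tried to select it with the following path: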

tree.xpath('//div[@id = "mw-content-text"]//table[@class = "wikitable"]//text()')

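For some reason, when I print the result of this path in the shell, it returns an empty list. I expected it to show all the text in the table, which is what I am attempting in order to confirm that I can actually reach the content; what I ultimately need, however, is to print each row of the table.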

Is there something wrong with this XPath? If so, what is the correct way to print the table's contents?
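One way to narrow down an empty result like this is to evaluate the path one step at a time and see where the matches disappear. The sketch below runs offline against made-up markup (`SAMPLE_HTML` is an assumption, not the real page), in which the table carries an extra class. That is only one possible cause: in XPath, `@class = "wikitable"` is an exact string comparison, so it does not match `class="wikitable plainrowheaders"`.

```python
from lxml.html import fromstring

# Hypothetical markup, invented for illustration; not copied from the real page.
SAMPLE_HTML = """
<html><body>
<div id="mw-content-text">
  <table class="wikitable plainrowheaders">
    <tr><th>Season</th></tr>
    <tr><td>Season 1</td></tr>
  </table>
</div>
</body></html>
"""

root = fromstring(SAMPLE_HTML)

# Evaluate the path one step at a time to see where the matches disappear.
stages = [
    '//div[@id = "mw-content-text"]',
    '//div[@id = "mw-content-text"]//table',
    '//div[@id = "mw-content-text"]//table[@class = "wikitable"]',
]
counts = [len(root.xpath(path)) for path in stages]
for path, count in zip(stages, counts):
    print(count, path)

# Token-based class test: matches even when other classes are present.
token_path = (
    '//div[@id = "mw-content-text"]'
    '//table[contains(concat(" ", normalize-space(@class), " "), " wikitable ")]'
)
token_matches = root.xpath(token_path)
print(len(token_matches))
```

If the first stage already returns nothing on a live page, the problem is more likely in the downloaded HTML than in the table predicate.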

Best answer

The wikitable class is too broad to tell the tables on a wiki page apart.
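A small offline sketch of what "too broad" means in practice, using invented markup (`SAMPLE_HTML` and its three tables are assumptions for illustration only): the class-based path selects every table at once, so something other than the class is needed to single one out.

```python
from lxml.html import fromstring

# Hypothetical markup, invented for illustration; the real page differs.
SAMPLE_HTML = """
<html><body>
<div id="mw-content-text">
  <table class="wikitable"><tr><td>Cast</td></tr></table>
  <table class="wikitable"><tr><td>Schedule</td></tr></table>
  <table class="wikitable"><tr><td>Ratings</td></tr></table>
</div>
</body></html>
"""

root = fromstring(SAMPLE_HTML)

# The class predicate matches all three tables, not just the one we want.
tables = root.xpath('//div[@id = "mw-content-text"]//table[@class = "wikitable"]')
labels = [table.text_content().strip() for table in tables]

print(len(tables), labels)
```

Picking a table by position, such as `tables[1]`, would work here but breaks as soon as a table is added or removed earlier on the page.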

I would instead rely on the "Adaptation schedule" heading that precedes the table:

import requests
from lxml.html import fromstring

url = "https://en.wikipedia.org/wiki/Game_of_Thrones"
response = requests.get(url)
root = fromstring(response.content)

# Anchor on the "Adaptation schedule" heading and take the first table that follows it.
table = root.xpath(".//h3[span = 'Adaptation schedule']/following-sibling::table")[0]
# Skip the header row, then print the cell text of each remaining row.
for row in table.xpath(".//tr")[1:]:
    print([cell.text_content() for cell in row.xpath(".//td")])

Prints:

['Season 1', 'March 2, 2010[52]', 'Second half of 2010', 'April 17, 2011', 'June 19, 2011', 'A Game of Thrones']
['Season 2', 'April 19, 2011[53]', 'Second half of 2011', 'April 1, 2012', 'June 3, 2012', 'A Clash of Kings and some early chapters from A Storm of Swords[54]']
['Season 3', 'April 10, 2012[55]', 'Second half of 2012', 'March 31, 2013', 'June 9, 2013', 'About the first two-thirds of A Storm of Swords[56][57]']
['Season 4', 'April 2, 2013[58]', 'Second half of 2013', 'April 6, 2014', 'June 15, 2014', 'The remaining one-third of A Storm of Swords and some elements from A Feast for Crows and A Dance with Dragons[59]']
['Season 5', 'April 8, 2014[60]', 'Second half of 2014', 'April 12, 2015', 'June 14, 2015', 'A Feast for Crows, A Dance with Dragons and original content,[61] with some late chapters from A Storm of Swords[62] and elements from The Winds of Winter[63][64]']
['Season 6', 'April 8, 2014[60]', 'Second half of 2015', 'April 24, 2016', 'June 26, 2016', 'Original content and outlined from The Winds of Winter,[65][66] with some late elements from A Feast for Crows and A Dance with Dragons[67]']
['Season 7', 'April 21, 2016[50]', 'Second half of 2016[49]', 'Mid-2017[5]', 'Mid-2017[5]', 'Original content and outlined from The Winds of Winter and A Dream of Spring[66]']
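Because the live page may change over time, the heading-anchored selection can also be tried offline. The following is a minimal sketch against made-up markup (`SAMPLE_HTML`, its headings, and its rows are assumptions, not the real page). It uses the same `h3[span = ...]/following-sibling::table` idea as the answer, with `[1]` inside the path to take the nearest following table.

```python
from lxml.html import fromstring

# Hypothetical markup, invented for illustration; the real page differs.
SAMPLE_HTML = """
<html><body>
<div id="mw-content-text">
  <h3><span>Cast</span></h3>
  <table class="wikitable">
    <tr><th>Actor</th></tr>
    <tr><td>Someone</td></tr>
  </table>
  <h3><span>Adaptation schedule</span></h3>
  <table class="wikitable">
    <tr><th>Season</th><th>Premiered</th></tr>
    <tr><td>Season 1</td><td>April 17, 2011<sup>[52]</sup></td></tr>
    <tr><td>Season 2</td><td>April 1, 2012</td></tr>
  </table>
</div>
</body></html>
"""

root = fromstring(SAMPLE_HTML)

# Select the nearest table that follows the heading whose span text matches.
matches = root.xpath(".//h3[span = 'Adaptation schedule']/following-sibling::table[1]")
table = matches[0]

# Skip the header row and collect the text of every data cell.
rows = [
    [cell.text_content().strip() for cell in row.xpath(".//td")]
    for row in table.xpath(".//tr")[1:]
]

for row in rows:
    print(row)
```

Both tables share the wikitable class here, yet only the one after the matching heading is selected. Note that `text_content()` keeps footnote markers such as `[52]`, as in the output above.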

For this question (python - How to parse text from an HTML table element), a similar question was found on Stack Overflow: https://stackoverflow.com/questions/38688137/
