python - lxml - difficulty parsing stackexchange rss feed

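Hi,

I am having trouble parsing an RSS feed from stackexchange in Python. When I try to get the summary nodes, an empty list is returned.

I have been trying to solve this, but I can't figure it out.

Can anyone help? Thanks.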

In [30]: import lxml.etree, urllib2

In [31]: url_cooking = 'http://cooking.stackexchange.com/feeds'

In [32]: cooking_content = urllib2.urlopen(url_cooking)

In [33]: cooking_parsed = lxml.etree.parse(cooking_content)

In [34]: cooking_texts = cooking_parsed.xpath('.//feed/entry/summary')

In [35]: cooking_texts
Out[35]: []

Best answer

Take a look at these two versions:

import lxml.html, lxml.etree

url_cooking = 'http://cooking.stackexchange.com/feeds'

#lxml.etree version
data = lxml.etree.parse(url_cooking)
summary_nodes = data.xpath('.//feed/entry/summary')
print('Found ' + str(len(summary_nodes)) + ' summary nodes')

#lxml.html version
data = lxml.html.parse(url_cooking)
summary_nodes = data.xpath('.//feed/entry/summary')
print('Found ' + str(len(summary_nodes)) + ' summary nodes')

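As you can see, the first version (lxml.etree) returns no nodes, while the lxml.html version works fine. The etree version fails because it expects the elements to be addressed with their namespace, and the html version works because it ignores namespaces. Partway down http://lxml.de/lxmlhtml.html, the documentation says "The HTML parser notably ignores namespaces and some other XMLisms".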
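The namespace effect can be reproduced offline. The following is a minimal sketch, not the real feed: `SAMPLE_FEED` is a made-up Atom snippet, and the lookup uses `findall`, which `lxml.etree` shares with the standard library's `xml.etree.ElementTree`. The import falls back to the standard library so the sketch still runs where lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree

# Hypothetical stand-in for the real feed (assumption, not actual feed data)
SAMPLE_FEED = b"""<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><summary>First question</summary></entry>
  <entry><summary>Second question</summary></entry>
</feed>"""

root = etree.fromstring(SAMPLE_FEED)

# Unqualified names only match elements in no namespace, so nothing is found.
plain = root.findall('entry/summary')

# Binding a prefix to the Atom namespace lets the same path match.
ns_map = {'atom': 'http://www.w3.org/2005/Atom'}
qualified = root.findall('atom:entry/atom:summary', namespaces=ns_map)

print(len(plain), len(qualified))
```

The prefix `atom` is arbitrary; only the namespace URI it maps to has to match the one declared in the feed.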

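Note that when you print the root node of the etree version ( print(data.getroot()) ), you get something like <Element {http://www.w3.org/2005/Atom}feed at 0x22d1620>. That means it is a feed element in the namespace http://www.w3.org/2005/Atom. Here is a corrected version of the etree code.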

import lxml.html, lxml.etree

url_cooking = 'http://cooking.stackexchange.com/feeds'

ns = 'http://www.w3.org/2005/Atom'
ns_map = {'ns': ns}

data = lxml.etree.parse(url_cooking)
summary_nodes = data.xpath('//ns:feed/ns:entry/ns:summary', namespaces=ns_map)
print('Found ' + str(len(summary_nodes)) + ' summary nodes')
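As an alternative to a prefix map, the `{namespace}tag` form that appears in the root element's repr can be written directly into a `findall` path. This is a small sketch under the same assumptions as any offline test: `SAMPLE_FEED` and `ATOM` are illustrative names, the sample is made up, and the import falls back to the standard library's ElementTree if lxml is unavailable.

```python
try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree

ATOM = 'http://www.w3.org/2005/Atom'

# Hypothetical stand-in for the real feed (assumption, not actual feed data)
SAMPLE_FEED = b"""<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><summary>First question</summary></entry>
  <entry><summary>Second question</summary></entry>
</feed>"""

root = etree.fromstring(SAMPLE_FEED)

# The tag of a namespaced element carries the namespace URI in braces.
print(root.tag)

# The same brace notation can be used in the path, with no prefix map.
summary_path = '{%s}entry/{%s}summary' % (ATOM, ATOM)
summaries = [node.text for node in root.findall(summary_path)]
print(summaries)
```

The brace form is more verbose than a prefix, but it makes explicit that the match depends on the namespace URI and not on any particular prefix.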

Regarding python - lxml - difficulty parsing stackexchange rss feed, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/9409095/
