I am using lxml to scrape some HTML that looks like this:
<div align=center><a style="font-size: 1.1em">Football</a></div>
<a href="">Team A</a>
<a href="">Team B</a>
<div align=center><a style="font-size: 1.1em">Baseball</a></div>
<a href="">Team C</a>
<a href="">Team D</a>
How can I get the data in this form:
[ {'category': 'Football', 'title': 'Team A'},
{'category': 'Football', 'title': 'Team B'},
{'category': 'Baseball', 'title': 'Team C'},
{'category': 'Baseball', 'title': 'Team D'}]
So far I have:
results = []
for a in content[0].xpath('./a'):
    # Build a fresh dict per link; the category is still missing here.
    data = {'text': a.text}
    results.append(data)
But I don't know how to get the category name by splitting on the font-size
style while keeping the sibling tags. Any suggestions?
Thanks!
Best answer
I had success with the following code:
#!/usr/bin/env python
snippet = """
<html><head></head><body>
<div align=center><a style="font-size: 1.1em">Football</a></div>
<a href="">Team A</a>
<a href="">Team B</a>
<div align=center><a style="font-size: 1.1em">Baseball</a></div>
<a href="">Team C</a>
<a href="">Team D</a>
</body></html>
"""
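import lxml.html
html = lxml.html.fromstring(snippet)
body = html[1]  # html[0] is <head>, html[1] is <body>
results = []
current_category = None
# Walk the direct children of <body> in document order.
for element in body.xpath('./*'):
    if element.tag == 'div':
        # A <div> starts a new category, named by the text of its inner <a>.
        current_category = element.xpath('./a')[0].text
    elif element.tag == 'a':
        # A top-level <a> belongs to the most recently seen category.
        results.append({'category': current_category,
                        'title': element.text})
print(results)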
It prints:
[{'category': 'Football', 'title': 'Team A'},
{'category': 'Football', 'title': 'Team B'},
{'category': 'Baseball', 'title': 'Team C'},
{'category': 'Baseball', 'title': 'Team D'}]
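Scraping is fragile. Here, for example, we depend explicitly on the order and nesting of the elements. Still, a hard-wired approach like this is sometimes good enough.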
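To illustrate the fragility point, here is a minimal sketch of the same single-pass idea with a few guards added. The function name extract_teams and the awkward markup in SNIPPET (a link before any heading, a div with no link inside) are made up for illustration and are not part of the original question. How such cases should be treated is an assumption: here a link without a heading gets category None, and a div without a link leaves the current category unchanged.

```python
import lxml.html

# Hypothetical markup with two irregularities the original page did not show.
SNIPPET = """
<html><head></head><body>
<a href="">Orphan</a>
<div align=center><a style="font-size: 1.1em">Football</a></div>
<a href="">Team A</a>
<div align=center>no link here</div>
<a href="">Team B</a>
</body></html>
"""


def extract_teams(markup):
    """Pair each top-level <a> with the most recent category <div> before it."""
    body = lxml.html.fromstring(markup).find('body')
    if body is None:
        # Not a full document with a <body>; nothing to walk.
        return []
    results = []
    current_category = None
    for element in body.xpath('./*'):
        if element.tag == 'div':
            headings = element.xpath('./a')
            if headings:
                # Only a <div> that wraps a link is treated as a heading.
                current_category = headings[0].text_content().strip()
        elif element.tag == 'a':
            results.append({'category': current_category,
                            'title': element.text_content().strip()})
    return results


teams = extract_teams(SNIPPET)
print(teams)
```

Looking the body up by tag name instead of by index, and checking that the div actually contains a link, removes two of the places where the hard-wired version would raise an exception on slightly different markup.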
Here is another, more XPath-oriented approach, using the preceding-sibling
axis:
#!/usr/bin/env python
snippet = """
<html><head></head><body>
<div align=center><a style="font-size: 1.1em">Football</a></div>
<a href="">Team A</a>
<a href="">Team B</a>
<div align=center><a style="font-size: 1.1em">Baseball</a></div>
<a href="">Team C</a>
<a href="">Team D</a>
</body></html>
"""
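import lxml.html
html = lxml.html.fromstring(snippet)
body = html[1]  # html[0] is <head>, html[1] is <body>
results = []
for e in body.xpath('./a'):
    # The XPath result is in document order, so [-1] is the heading link
    # of the nearest <div> that precedes this <a>.
    results.append(dict(
        category=e.xpath('preceding-sibling::div/a')[-1].text,
        title=e.text))
print(results)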
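A variation on the preceding-sibling idea, offered as a sketch rather than as part of the original answer: in XPath 1.0, a positional predicate on a reverse axis counts backwards from the context node, so preceding-sibling::div[1] is the nearest preceding div. Wrapping the expression in string() makes the query return text directly, and an empty string when there is no preceding div, instead of raising IndexError.

```python
import lxml.html

snippet = """
<html><head></head><body>
<div align=center><a style="font-size: 1.1em">Football</a></div>
<a href="">Team A</a>
<a href="">Team B</a>
<div align=center><a style="font-size: 1.1em">Baseball</a></div>
<a href="">Team C</a>
<a href="">Team D</a>
</body></html>
"""

body = lxml.html.fromstring(snippet).find('body')
results = [
    {
        # div[1] on the reverse axis is the closest <div> before this <a>;
        # string() yields the text of its inner <a>, or '' if none exists.
        'category': str(a.xpath('string(preceding-sibling::div[1]/a)')),
        'title': a.text,
    }
    for a in body.xpath('./a')
]
print(results)
```

This does one XPath evaluation per link, so for very long sibling lists the single-pass loop is likely the cheaper option.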
Source: "python - lxml: split attribute?" on Stack Overflow: https://stackoverflow.com/questions/6330457/