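I'm trying to write a Python script to scrape specific information from a web page, but my limited knowledge isn't enough. I need to extract 7-8 pieces of information. The tags look like this: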
1
<a class="ui-magnifier-glass" href="here goes the link that i want to extract" data-spm-anchor-id="0.0.0.0" style="width: 258px; height: 258px; position: absolute; left: -1px; top: -1px; display: none;"></a>
2
<a href="link to extract" title="title to extract" rel="category tag" data-spm-anchor-id="0.0.0.0">or maybe this word instead of title</a>
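If I knew how to extract the information from href tags like these, I could do the rest myself.
I would also appreciate help with code that writes this information to a CSV file.
I have started with this code: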
import requests
from lxml import html

url = input('url : ')
page = requests.get(url)
tree = html.fromstring(page.text)
productname = tree.xpath('//h1[@class="product-name"]/text()')
price = tree.xpath('//span[@id="sku-discount-price"]/text()')
print('\n' + productname[0])
print('\n' + price[0])
Best answer
You can do what you want with the lxml and csv modules. lxml supports XPath expressions for selecting the elements you want.
from csv import DictWriter
from io import StringIO

from lxml import etree

f = StringIO('''
<html><body>
<a class="ui-magnifier-glass"
   href="here goes the link that i want to extract"
   data-spm-anchor-id="0.0.0.0"
   style="width: 258px; height: 258px; position: absolute; left: -1px; top: -1px; display: none;"
></a>
<a href="link to extract"
   title="title to extract"
   rel="category tag"
   data-spm-anchor-id="0.0.0.0"
>or maybe this word instead of title</a>
</body></html>
''')
# etree.parse() uses the XML parser by default, so the markup must be well-formed
doc = etree.parse(f)
data = []
# Get all links with data-spm-anchor-id="0.0.0.0"
r = doc.xpath('//a[@data-spm-anchor-id="0.0.0.0"]')
# Iterate through each matching <a> element
for elem in r:
    # Attributes are read with get(); it returns None if the attribute is missing
    link = elem.get('href')
    title = elem.get('title')
    # The text inside the tag is available as .text
    text = elem.text
    data.append({
        'link': link,
        'title': title,
        'text': text
    })
# newline='' stops the csv module from writing blank lines between rows on Windows
with open('file.csv', 'w', newline='') as csvfile:
    fieldnames = ['link', 'title', 'text']
    writer = DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for row in data:
        writer.writerow(row)
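Real pages are rarely well-formed XML, so etree.parse() as used above may fail on them. The following is a minimal sketch, not part of the original answer, that uses lxml.html.fromstring (the same lenient parser as in the question's code) instead. The sample markup, the example.com URLs, and the helper names extract_links and rows_to_csv are hypothetical and only there for illustration.

```python
import csv
import io

from lxml import html

# Hypothetical sample page. The unclosed <br> would be rejected by an XML
# parser, but the HTML parser accepts it.
SAMPLE_PAGE = """
<html><body>
<a class="ui-magnifier-glass" href="https://example.com/item/1"></a><br>
<a href="https://example.com/category/7" title="Phones" rel="category tag">Mobile Phones</a>
</body></html>
"""


def extract_links(page_source):
    """Return one dict per <a> element that has an href attribute."""
    tree = html.fromstring(page_source)
    rows = []
    for anchor in tree.xpath('//a[@href]'):
        rows.append({
            'link': anchor.get('href'),
            # get() returns None for a missing attribute; store '' instead.
            'title': anchor.get('title') or '',
            # text_content() also collects text from nested tags.
            'text': anchor.text_content().strip(),
        })
    return rows


def rows_to_csv(rows, stream):
    """Write the rows to any text stream, header first."""
    writer = csv.DictWriter(stream, fieldnames=['link', 'title', 'text'])
    writer.writeheader()
    writer.writerows(rows)


rows = extract_links(SAMPLE_PAGE)
buffer = io.StringIO()
rows_to_csv(rows, buffer)
print(buffer.getvalue())
```

To run this against a live page, pass requests.get(url).text to extract_links and hand rows_to_csv a file opened with open('file.csv', 'w', newline='').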
Regarding "python - Extracting information from an HTML web page using Python lxml", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/31463569/