python - Scraping an element inside a tag attribute - Scrapy

I am using Scrapy to scrape a video site, and I am having trouble extracting some of the data.

For example:

<embed width="588" height="476" flashvars="id_video=7845976&amp;theskin=default&amp;url_bigthumb=http://sample.com/image.jpg&amp;key=4219e347d8fdc0be3103eb3cbb458258-1416371743&amp;categories=cat1" allowscriptaccess="always" allowfullscreen="true" quality="high" src="http://static.sample.com/swf/xv-player.swf" wmode="transparent" id="flash-player-embed" type="application/x-shockwave-flash">

I can currently scrape an attribute of the HTML tag with the following statement:

item['thumb'] = hxs.select("//embed[@id='flash-player-embed']/@flashvars").extract()[0]

The statement above gives the following result:

id_video=7845976&theskin=default&url_bigthumb=http://sample.com/image.jpg&key=4219e347d8fdc0be3103eb3cbb458258-1416371743&categories=cat1" allowscriptaccess="always" allowfullscreen="true" quality="high" src="http://static.sample.com/swf/xv-player.swf

I would like an hxs.select statement that extracts only the image URL from the embed code above, like this:

http://sample.com/image.jpg

I have already tried:

item['thumb'] = hxs.select("//embed[@id='flash-player-embed']/@flashvars/@url_bigthumb").extract()[0]

But it did not work.

Any help from the Scrapy or Python community would be much appreciated, as it would save me precious megabits.

Thanks in advance.

Best answer

The urlparse module (urllib.parse in Python 3) also offers a neat way to pull out that value:

>>> # Python 3; on Python 2 use: from urlparse import parse_qs, urlparse
>>> from urllib.parse import parse_qs, urlparse
>>> # Prepend '?' so that urlparse treats the flashvars string as a query string
>>> url = '?' + 'id_video=7845976&theskin=default&url_bigthumb=http://sample.com/image.jpg&key=4219e347d8fdc0be3103eb3cbb458258-1416371743&categories=cat1" allowscriptaccess="always" allowfullscreen="true" quality="high" src="http://static.sample.com/swf/xv-player.swf'
>>> print(parse_qs(urlparse(url).query)['url_bigthumb'])
['http://sample.com/image.jpg']
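
The attempted XPath `/@flashvars/@url_bigthumb` selects nothing because an attribute value is a plain string to XPath; it has no attributes of its own to descend into. The value therefore has to be split in Python after extraction. Below is a minimal sketch of that step, assuming Python 3 (where the module is `urllib.parse`). The helper name `get_flashvar` is hypothetical and not part of Scrapy, and the sample string is the flashvars value from the question, trimmed to its key/value pairs.

```python
from urllib.parse import parse_qs


def get_flashvar(flashvars, name):
    """Return the first value of `name` in a flashvars string, or None.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    # parse_qs maps each key to a list of values,
    # e.g. {'id_video': ['7845976'], 'theskin': ['default'], ...}
    values = parse_qs(flashvars).get(name)
    return values[0] if values else None


# The flashvars attribute value from the question (key/value pairs only).
flashvars = (
    "id_video=7845976&theskin=default"
    "&url_bigthumb=http://sample.com/image.jpg"
    "&key=4219e347d8fdc0be3103eb3cbb458258-1416371743&categories=cat1"
)

thumb = get_flashvar(flashvars, "url_bigthumb")
print(thumb)  # http://sample.com/image.jpg
```

In a spider, the string passed as `flashvars` would presumably be the result of the question's own `hxs.select("//embed[@id='flash-player-embed']/@flashvars").extract()[0]`, and the return value could be assigned to `item['thumb']`. Returning `None` for a missing key is a design choice made here so that a page without a thumbnail does not raise a `KeyError`.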

Regarding "python - Scraping an element inside a tag attribute - Scrapy", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/27009285/
