Python Scrapy: extracting data from a website

Tags: python web-scraping scrapy

I want to scrape data from this page. Here is my current code:

import cStringIO

import pycurl
from scrapy.http import HtmlResponse
from scrapy.selector import Selector

# Download the page body with pycurl (this is Python 2 code).
buf = cStringIO.StringIO()
c = pycurl.Curl()
c.setopt(c.URL, "http://www.guardalo.org/99407/")
c.setopt(c.VERBOSE, 0)
c.setopt(c.WRITEFUNCTION, buf.write)
c.setopt(c.CONNECTTIMEOUT, 15)
c.setopt(c.TIMEOUT, 15)
c.setopt(c.SSL_VERIFYPEER, 0)
c.setopt(c.SSL_VERIFYHOST, 0)
c.setopt(c.USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:8.0) Gecko/20100101 Firefox/8.0')
c.perform()
body = buf.getvalue()
c.close()

# Wrap the raw HTML in a Scrapy response so it can be queried with XPath.
response = HtmlResponse(url='http://www.guardalo.org/99407/', body=body)
print Selector(response=response).xpath('//edindex/text()').extract()

It works, but I need the title, the video link, and the description as separate variables. How can I achieve this?

Best answer

The title can be extracted with //title/text(), and the video source link with //video/source/@src:

import re

selector = Selector(response=response)

# extract() returns a list of matches; [0] takes the first one.
title = selector.xpath('//title/text()').extract()[0]
description = selector.xpath('//edindex/text()').extract()
video_sources = selector.xpath('//video/source/@src').extract()[0]

# The code is the part of the thumbnail file name before "-play-small.jpg".
code_url = selector.xpath('//meta[@name="EdImage"]/@content').extract()[0]
code = re.search(r'(\w+)-play-small\.jpg$', code_url).group(1)

print title
print description
print video_sources
print code

This prints:

Best Babies Laughing Video Compilation 2012 [HD] - Guardalo
[u'Best Babies Laughing Video Compilation 2012 [HD]', u"Ciao a tutti amici di guardalo,quello che propongo oggi \xe8 un video sui neonati buffi con risate travolgenti, facce molto buffe,iniziamo con una coppia di gemellini che se la ridono fra loro,per passare subito con una biondina che si squaqqera dalle risate al suono dello strappo della carta ed \xe8 solo l'inizio.", u'\r\nBuone risate a tutti', u'Elia ride', u'Funny Triplet Babies Laughing Compilation 2014 [NEW HD]', u'Real Talent Little girl Singing Listen by Beyonce .', u'Bimbo Napoletano alle Prese con il Distributore di Benzina', u'Telecamera nascosta al figlio guardate che fa,video bambini divertenti,video bambini divertentissimi']
http://static.guardalo.org/video_image/pre-roll-guardalo.mp4
L49VXZwfup8
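Both `.extract()[0]` and `re.search(...).group(1)` raise an exception when nothing matches (`IndexError` and `AttributeError` respectively). The snippet below is a minimal sketch of a more defensive variant, not part of the original answer. The helper names `first_or_default` and `extract_code` and the example URL are hypothetical, made up for illustration. It uses only the standard library and should run on both Python 2 and Python 3. Note that `\w` does not match a hyphen, so a code containing `-` would be captured only partially.

```python
import re

# The dot is escaped so that it matches only a literal "." before "jpg".
CODE_PATTERN = re.compile(r'(\w+)-play-small\.jpg$')


def first_or_default(values, default=None):
    """Return the first item of a list (e.g. the result of .extract()), or a default."""
    return values[0] if values else default


def extract_code(image_url):
    """Return the code embedded in a thumbnail URL, or None if it does not match."""
    if not image_url:
        return None
    match = CODE_PATTERN.search(image_url)
    return match.group(1) if match else None


# Hypothetical thumbnail URL, invented for this example only.
example_url = 'http://static.example.org/video_image/L49VXZwfup8-play-small.jpg'
code = extract_code(example_url)
```

With a Scrapy selector, this could be used as `extract_code(first_or_default(selector.xpath('//meta[@name="EdImage"]/@content').extract()))`. Scrapy 1.0 and later also provide `extract_first()`, which returns `None` instead of raising when there is no match.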

This question about extracting data from a website with Python Scrapy comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/29055073/
