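Question
I am trying to scrape a site similar to YouTube, which has a list with a large number of videos and links to the individual videos. What I want to do is grab each video's thumbnail before entering the specific video page with parse_item().
The problem is that I don't know how to carry the Response object of the "list view" into parse_item(). I know you can use process_request to intercept the request and insert meta into the Request object, but I don't know how to get hold of the list view's response.
Is there a different approach to this problem?
My code: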
import re
import datetime

# Import paths as used by older Scrapy releases (scrapy.contrib).
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from ..items import ExampleItem


class ExampleSpider(CrawlSpider):
    """
    Crawler for: www.example.com
    """
    name = "example"
    allowed_domains = ['www.example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Follow pagination links.
        Rule(SgmlLinkExtractor(
            restrict_xpaths=["//div[@class='pagination']"]
        )),
        # Follow links to individual videos, skipping user pages.
        Rule(SgmlLinkExtractor(
            restrict_xpaths=["//ul[@class='list']"],
            deny=['/user/'],
        ), callback='parse_item', process_request='parent_url')
    )

    def parent_url(self, request):
        request.meta['parent_page'] = ''  # Get the parent response somehow?
        return request

    def parse_item(self, response):
        sel = Selector(response)
        item = ExampleItem()
        duration = sel.css('.video span::text')[0].extract()
        item['title'] = sel.css('.title::text')[0].extract()
        item['description'] = sel.xpath('//div[@class="description"]/text()').extract()
        item['duration'] = self._parse_duration(duration)
        item['link'] = response.url
        return item

    def _parse_duration(self, string):
        """
        Parse the duration field for times
        return a datetime.time object
        """
        if len(string) > 20:
            return datetime.datetime.strptime(string, '%H hours %M min %S sec').time()
        if '60 min' in string:
            # str.replace returns a new string, so the result must be assigned.
            string = string.replace('60 min', '01 hours 00 min')
            return datetime.datetime.strptime(string, '%H hours %M min %S sec').time()
        return datetime.datetime.strptime(string, '%M min %S sec').time()
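Best Answer
I assume you want to know the URL of the page from which the links (requests) were extracted.
You can override the _requests_to_follow method to pass along the source page of each request: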
def _requests_to_follow(self, response):
    # Tag every request extracted from this page with the URL it came from.
    for req in super(ExampleSpider, self)._requests_to_follow(response):
        req.meta['parent_page'] = response.url
        yield req
Regarding "python - Carrying the Response object from the referrer into the parse_item callback", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/21439062/