I'm using Scrapy in Python, trying to fetch a value from a website and then use it to iterate. The problem I'm running into is that requests can apparently only be sent via yield, which makes it impossible to get a value back.
class Spider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['domain.com']
    start_urls = ['url1', 'url2', ...]
    headers = ['id', 'name', 'description']
    pageNumber = 0  # tried to use a global variable, but that doesn't really work because of the subprocess

    def start_requests(self):
        for su in self.start_urls:
            yield Request('http://url.com%s' % su,
                          self.parse_pageNumber)
            for i in range(pageNumber):
                page = su + str(i)
                yield Request('http://url.com%s' % page,
                              self.parse_matches)

    def parse_pageNumber(self, response):
        # finds page number
        ...

    def parse_matches(self, response):
        # does everything else and returns items
        ...
Any idea how to get the page number without too much extra work?
Best Answer
The proper way to do this is to use the meta dictionary. First, you create the initial request to get the page number, but keep the URL of interest in the meta dictionary. Then, inside parse_pageNumber, you create a new request, this time keeping the page number in the meta data. You can then retrieve the page number from the response in parse_matches. This works because meta is shallow-copied from the Request to the Response for exactly this purpose. Your code could look something like this:
import scrapy
from scrapy import Request

class Spider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['domain.com']
    start_urls = ['url1', 'url2', ...]
    headers = ['id', 'name', 'description']

    def start_requests(self):
        for su in self.start_urls:
            yield Request('http://url.com%s' % su,
                          self.parse_pageNumber,
                          meta={'su': su})

    def parse_pageNumber(self, response):
        # extract the page count, then carry each page number forward via meta
        pageNumber = int(response.xpath('get_page_number_expression').get())
        su = response.meta['su']
        for i in range(pageNumber):
            page = su + str(i)
            yield Request('http://url.com%s' % page,
                          self.parse_matches,
                          meta={'page_number': str(i)})

    def parse_matches(self, response):
        pageNumber = response.meta['page_number']
        # do everything else
From the official documentation, for a better understanding of meta:
meta
A dict that contains arbitrary metadata for this request. This dict is empty for new Requests, and is usually populated by different Scrapy components (extensions, middlewares, etc). So the data contained in this dict depends on the extensions you have enabled. See Request.meta special keys for a list of special meta keys recognized by Scrapy. This dict is shallow copied when the request is cloned using the copy() or replace() methods, and can also be accessed, in your spider, from the response.meta attribute.
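The "shallow copied" part in that quote matters if you put mutable objects into meta. A minimal, Scrapy-free sketch of what shallow copying means (the dict contents here are made up for illustration):

```python
import copy

meta = {'page_number': '3', 'tags': ['id', 'name']}
cloned = copy.copy(meta)  # shallow copy, as Request.copy()/replace() do for meta

cloned['page_number'] = '4'           # rebinding a key does not touch the original
cloned['tags'].append('description')  # but mutating a shared value does

print(meta['page_number'])  # '3' -- unchanged
print(meta['tags'])         # ['id', 'name', 'description'] -- shared list mutated
```

So plain strings and numbers in meta are safe to pass along, but a list or dict stored there is shared between the cloned requests.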
Note:
Although the meta approach is the recommended one, your case looks a bit simpler, since you build the request URL directly from the page number. In that case you could simply use the urlparse module to extract this information from response.url inside your parse_matches() method. But meta remains the more powerful approach.
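As a sketch of that alternative (the URL shape is an assumption here; adjust the parsing to your actual URLs): if the page number is the last path segment, the standard-library urllib.parse module can pull it straight out of response.url:

```python
from urllib.parse import urlparse

def page_number_from_url(url):
    """Extract a trailing page number from a URL like http://url.com/matches/7."""
    path = urlparse(url).path                        # e.g. '/matches/7'
    return int(path.rstrip('/').rsplit('/', 1)[-1])  # last segment as int

print(page_number_from_url('http://url.com/matches/7'))  # 7
```

Inside parse_matches() you would call this on response.url instead of reading response.meta.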
Regarding python - returning a value from a website with scrapy, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/27355569/