callback - Scrapy rules do not work when both process_request and callback are set

I have this Scrapy CrawlSpider rule:

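from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

rules = [
    Rule(LinkExtractor(
            # raw string so that \d reaches the regex engine unchanged
            allow=r'/topic/\d+/organize$',
            restrict_xpaths='//div[@id="zh-topic-organize-child-editor"]'
            ),
         process_request='request_tagPage', callback='parse_tagPage', follow=True)
]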

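request_tagPage() is a function that attaches a cookie to the request, and parse_tagPage() is a function that parses the target page. According to the documentation, CrawlSpider should use request_tagPage to make the request and, once the response comes back, call parse_tagPage() to parse it. However, I noticed that when request_tagPage() is used, the spider does not call parse_tagPage() at all. So in the actual code I manually set parse_tagPage() as the callback inside request_tagPage, like this: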

from scrapy import Request

def request_tagPage(self, request):
    # attach the cookie to the request, otherwise I can't log in
    return Request(request.url, meta={"cookiejar": 1},
                   headers=self.headers,
                   callback=self.parse_tagPage)  # manually add a callback function

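It works, but now the spider no longer uses the rules to expand its crawl. It closes after crawling the links from start_urls. Before I manually set parse_tagPage() as the callback in request_tagPage(), however, the rules worked. So I am wondering whether this is a bug. Is there a way to use all three together: request_tagPage(), which I need in order to attach the cookie to the request; parse_tagPage(), which parses the page; and rules, which tell the spider where to crawl?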

Best answer

Requests generated by CrawlSpider rules use internal callbacks and use meta to do their "magic".

I suggest that you do not recreate requests from scratch in your rules' process_request hooks (or you will probably end up reimplementing what CrawlSpider already does for you).

Instead, if you only want to add cookies and special headers, you can use the .replace() method on the request that is passed to request_tagPage, so that the "magic" of CrawlSpider is preserved.

Something like this should be enough:

def request_tagPage(self, request):
    tagged = request.replace(headers=self.headers)
    tagged.meta.update(cookiejar=1)
    return tagged

The original question, "Scrapy rules do not work when both process_request and callback are set", is on Stack Overflow: https://stackoverflow.com/questions/38280133/
