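python - Scrapy Splash always returns the same page

Tags: python web-scraping scrapy scrapy-splash splash-js-render

For each of several Disqus users whose profile URLs are known in advance, I want to scrape their name and the usernames of their followers. I am using scrapy-splash to do this. However, when I parse the responses, it always seems to be scraping the first user's page. I tried setting wait to 10 and dont_filter to True, but that did not help. What should I try next?

Here is my spider: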

import scrapy
from disqus.items import DisqusItem

class DisqusSpider(scrapy.Spider):
    name = "disqusSpider"
    start_urls = ["https://disqus.com/by/disqus_sAggacVY39/", "https://disqus.com/by/VladimirUlayanov/", "https://disqus.com/by/Beasleyhillman/", "https://disqus.com/by/Slick312/"]
    splash_def = {"endpoint" : "render.html", "args" : {"wait" : 10}}

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url = url, callback = self.parse_basic, dont_filter = True, meta = {
                "splash" : self.splash_def,
                "base_profile_url" : url
            })

    def parse_basic(self, response):
        name = response.css("h1.cover-profile-name.text-largest.truncate-line::text").extract_first()
        disqusItem = DisqusItem(name = name)
        request = scrapy.Request(url = response.meta["base_profile_url"] + "followers/", callback = self.parse_followers, dont_filter = True, meta = {
            "item" : disqusItem,
            "base_profile_url" : response.meta["base_profile_url"],
            "splash": self.splash_def
        })
        print "parse_basic", response.url, request.url
        yield request

    def parse_followers(self, response):
        print "parse_followers", response.meta["base_profile_url"], response.meta["item"]
        followers = response.css("div.user-info a::attr(href)").extract()

DisqusItem is defined as follows:

class DisqusItem(scrapy.Item):
    name = scrapy.Field()
    followers = scrapy.Field()

The output is as follows:

2017-08-07 23:09:12 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://localhost:8050/render.html> (referer: None)
parse_followers https://disqus.com/by/disqus_sAggacVY39/ {'name': u'Trailer Trash'}
2017-08-07 23:09:14 [scrapy.extensions.logstats] INFO: Crawled 5 pages (at 5 pages/min), scraped 0 items (at 0 items/min)
2017-08-07 23:09:18 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://localhost:8050/render.html> (referer: None)
parse_followers https://disqus.com/by/VladimirUlayanov/ {'name': u'Trailer Trash'}
2017-08-07 23:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://localhost:8050/render.html> (referer: None)
parse_followers https://disqus.com/by/Beasleyhillman/ {'name': u'Trailer Trash'}
2017-08-07 23:09:40 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://localhost:8050/render.html> (referer: None)
parse_followers https://disqus.com/by/Slick312/ {'name': u'Trailer Trash'}

Here is the settings.py file:

# -*- coding: utf-8 -*-

# Scrapy settings for disqus project
#

BOT_NAME = 'disqus'

SPIDER_MODULES = ['disqus.spiders']
NEWSPIDER_MODULE = 'disqus.spiders'

ROBOTSTXT_OBEY = False

SPLASH_URL = 'http://localhost:8050' 

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'
DUPEFILTER_DEBUG = True

DOWNLOAD_DELAY = 10

Best answer

I was able to get it working by using SplashRequest instead of scrapy.Request.
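The answer does not say why this helps. One plausible explanation is that the question's spider passes the same `splash_def` dict to every request, and a step that fills in a missing `url` argument would only take effect once on a shared dict. The sketch below illustrates that pitfall with the standard library only. `fake_splash_middleware` is a hypothetical stand-in written for this illustration, not scrapy-splash's actual code.

```python
import copy


def fake_splash_middleware(meta, request_url):
    """Hypothetical stand-in: fill in 'url' only when it is not already set."""
    args = meta["splash"].setdefault("args", {})
    args.setdefault("url", request_url)
    return args["url"]


urls = [
    "https://disqus.com/by/VladimirUlayanov/",
    "https://disqus.com/by/Slick312/",
]

# One dict shared by every request, as in the question's spider.
splash_def = {"endpoint": "render.html", "args": {"wait": 10}}
shared = [fake_splash_middleware({"splash": splash_def}, u) for u in urls]

# A fresh deep copy per request keeps the arguments independent.
template = {"endpoint": "render.html", "args": {"wait": 10}}
copied = [
    fake_splash_middleware({"splash": copy.deepcopy(template)}, u) for u in urls
]

print(shared)  # the first URL appears twice
print(copied)  # one entry per URL
```

If scrapy-splash behaves like this stand-in, giving each request its own arguments avoids the problem, and SplashRequest takes its `args` per call rather than from a shared class attribute.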

For example:

import scrapy
from disqus.items import DisqusItem
from scrapy_splash import SplashRequest


class DisqusSpider(scrapy.Spider):
    name = "disqusSpider"
    start_urls = ["https://disqus.com/by/disqus_sAggacVY39/", "https://disqus.com/by/VladimirUlayanov/", "https://disqus.com/by/Beasleyhillman/", "https://disqus.com/by/Slick312/"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse_basic,
                dont_filter=True,
                endpoint='render.json',
                args={'wait': 2, 'html': 1},
            )

For more on "python - Scrapy Splash always returns the same page", see the similar question on Stack Overflow: https://stackoverflow.com/questions/45555878/
