python - Scrapy and Cloudflare

Tags: python, web-scraping, scrapy, scrapy-spider

I am trying to scrape a URL protected by Cloudflare with Scrapy, but I can't get any results:

2018-07-09 22:14:00 [scrapy.core.engine] INFO: Spider opened
2018-07-09 22:14:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-07-09 22:14:00 [scrapy.extensions.httpcache] DEBUG: Using filesystem cache storage in C:\Users\Luis\Mister\.scrapy\httpcache
2018-07-09 22:14:00 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-07-09 22:14:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.mister-auto.es/robots.txt> (referer: None) ['cached']
2018-07-09 22:14:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.mister-auto.es/global_search2.html?idx=prod_monoindex_ESes&q=FEBI+BILSTEIN> (referer: None) ['cached']
2018-07-09 22:14:00 [scrapy.core.engine] INFO: Closing spider (finished)
2018-07-09 22:14:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 633,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 20858,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 7, 9, 20, 14, 0, 833000),
 'httpcache/hit': 2,
 'log_count/DEBUG': 4,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 7, 9, 20, 14, 0, 594000)}
2018-07-09 22:14:00 [scrapy.core.engine] INFO: Spider closed (finished)
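One detail worth noting in the log above: both responses are served from the local HTTP cache ('httpcache/hit': 2 and the ['cached'] flag), so the spider may be re-parsing a stale copy of a Cloudflare challenge page rather than a fresh response. As a debugging sketch (not part of the original setup), the standard Scrapy setting below turns that cache off while you investigate:

# settings.py - debugging sketch, assumes you want fresh responses while testing
HTTPCACHE_ENABLED = False  # standard Scrapy setting; disables the filesystem cache shown in the log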

Since the site is protected by Cloudflare, I installed this: https://github.com/clemfromspace/scrapy-cloudflare-middleware

When I modify my settings.py, I get the following error:

Traceback (most recent call last):
  File "C:\Users\Luis\Anaconda2\lib\site-packages\twisted\internet\defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "C:\Users\Luis\Anaconda2\lib\site-packages\scrapy\crawler.py", line 98, in crawl
    six.reraise(*exc_info)
  File "C:\Users\Luis\Anaconda2\lib\site-packages\scrapy\crawler.py", line 80, in crawl
    self.engine = self._create_engine()
  File "C:\Users\Luis\Anaconda2\lib\site-packages\scrapy\crawler.py", line 105, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "C:\Users\Luis\Anaconda2\lib\site-packages\scrapy\core\engine.py", line 69, in __init__
    self.downloader = downloader_cls(crawler)
  File "C:\Users\Luis\Anaconda2\lib\site-packages\scrapy\core\downloader\__init__.py", line 88, in __init__
    self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
  File "C:\Users\Luis\Anaconda2\lib\site-packages\scrapy\middleware.py", line 58, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "C:\Users\Luis\Anaconda2\lib\site-packages\scrapy\middleware.py", line 34, in from_settings
    mwcls = load_object(clspath)
  File "C:\Users\Luis\Anaconda2\lib\site-packages\scrapy\utils\misc.py", line 44, in load_object
    mod = import_module(module)
  File "C:\Users\Luis\Anaconda2\lib\importlib\__init__.py", line 37, in import_module
    __import__(name)
ImportError: No module named scraping_hub.middlewares

At this point I am stuck. I don't know whether I have to change settings.py or middlewares.py.
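For what it's worth, the ImportError in the traceback points at settings.py rather than middlewares.py: Scrapy is being told to load a middleware from a module named scraping_hub.middlewares, which is not importable. For comparison, the linked scrapy-cloudflare-middleware project registers its middleware roughly as shown below; the module path and priority are taken from that repository's README and should be treated as a sketch to check your own DOWNLOADER_MIDDLEWARES entry against:

# settings.py - sketch based on the scrapy-cloudflare-middleware README
DOWNLOADER_MIDDLEWARES = {
    # module path assumed from the project's docs; adjust if your package layout differs
    'scrapy_cloudflare_middleware.middlewares.CloudFlareMiddleware': 560,
}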

Could you help me? I want to improve my skills. ;)

P.S. I have added my middlewares.py:

from scrapy import signals


class MercadoSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class MercadoDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

Best Answer

You can get around it by using scrapy-rotating-proxies:

pip install scrapy-rotating-proxies
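The answer stops at the install command, so here is a rough idea of what enabling the library usually involves; the setting names and middleware priorities come from the scrapy-rotating-proxies README, and the proxy addresses are placeholders to replace with your own list:

# settings.py - sketch of enabling scrapy-rotating-proxies (names per its README; proxies are placeholders)
ROTATING_PROXY_LIST = [
    'proxy1.example.com:8000',
    'proxy2.example.com:8031',
]

DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}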

A similar question about python - Scrapy and Cloudflare can be found on Stack Overflow: https://stackoverflow.com/questions/51253538/
