python - Scrapy spider ignores `DOWNLOADER_MIDDLEWARES` when run as a script

Tags: python scrapy

I want to use Scrapy to fetch data from several different sites and run some analysis on that data. Since both the crawlers and the analysis code belong to the same project, I'd like to store everything in one Git repository. I created a minimal reproducible example on GitHub.

The structure of the project looks like this:

./crawlers
./crawlers/__init__.py
./crawlers/myproject
./crawlers/myproject/__init__.py
./crawlers/myproject/myproject
./crawlers/myproject/myproject/__init__.py
./crawlers/myproject/myproject/items.py
./crawlers/myproject/myproject/pipelines.py
./crawlers/myproject/myproject/settings.py
./crawlers/myproject/myproject/spiders
./crawlers/myproject/myproject/spiders/__init__.py
./crawlers/myproject/myproject/spiders/example.py
./crawlers/myproject/scrapy.cfg
./scrapyScript.py

From the ./crawlers/myproject folder, I can run the spider by typing:

scrapy crawl example

The spider uses some downloader middleware, specifically alecxe's excellent scrapy-fake-useragent. From settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}

When run with scrapy crawl ..., the user agent looks like a real browser. Here is a sample record from the web server:

24.8.42.44 - - [16/Jun/2015:05:07:59 +0000] "GET / HTTP/1.1" 200 27161 "-" "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36"

According to the documentation, the equivalent of scrapy crawl ... can be executed from a script. My scrapyScript.py, based on the documentation, looks like this:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals

from scrapy.utils.project import get_project_settings
from crawlers.myproject.myproject.spiders.example import ExampleSpider

spider = ExampleSpider()
settings = get_project_settings()

crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)

crawler.start()
log.start()
reactor.run()

When I run the script, I can see the spider making page requests. Unfortunately, it ignores DOWNLOADER_MIDDLEWARES. For example, the user agent is no longer spoofed:

24.8.42.44 - - [16/Jun/2015:05:32:04 +0000] "GET / HTTP/1.1" 200 27161 "-" "Scrapy/0.24.6 (+http://scrapy.org)"

Somehow, when the spider is run from the script, it seems to ignore the settings in settings.py.

Can you see what I'm doing wrong?

Best Answer

For get_project_settings() to find the settings.py you want, set the SCRAPY_SETTINGS_MODULE environment variable:

import os
import sys

# ...

sys.path.append(os.path.join(os.path.curdir, "crawlers/myproject"))
os.environ['SCRAPY_SETTINGS_MODULE'] = 'myproject.settings'

settings = get_project_settings()

Note that, because of where the script is run from, you need to add the directory containing myproject to sys.path. Alternatively, move scrapyScript.py under the myproject directory.
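Why both steps matter can be sketched with plain stdlib code, without Scrapy at all: get_project_settings() imports the module named by SCRAPY_SETTINGS_MODULE, and that import only succeeds if the module's parent directory is on sys.path. The throwaway myproject/settings.py written below is a made-up stand-in for illustration, not the real project files:

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package mimicking the project layout:
#   <tmp>/myproject/__init__.py
#   <tmp>/myproject/settings.py
tmp = tempfile.mkdtemp()
pkg = os.path.join(tmp, "myproject")
os.makedirs(pkg)
open(os.path.join(pkg, "__init__.py"), "w").close()
with open(os.path.join(pkg, "settings.py"), "w") as f:
    f.write("DOWNLOADER_MIDDLEWARES = {'example.Middleware': 400}\n")

# Without the parent directory on sys.path, this import would fail,
# and Scrapy would silently fall back to its default settings --
# which is why the script ignored DOWNLOADER_MIDDLEWARES.
sys.path.append(tmp)
os.environ["SCRAPY_SETTINGS_MODULE"] = "myproject.settings"

# get_project_settings() performs essentially this lookup internally:
module = importlib.import_module(os.environ["SCRAPY_SETTINGS_MODULE"])
print(module.DOWNLOADER_MIDDLEWARES)
```

The key point: nothing errors out when the settings module cannot be found; the project settings are simply never loaded, so the failure shows up only as default behavior (e.g. the stock Scrapy user agent).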

Regarding "python - Scrapy spider ignores `DOWNLOADER_MIDDLEWARES` when run as a script", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/30836209/
