python - Scrapy and Twisted errors

Tags: python scrapy twisted

I inherited a project, and while trying to fix an issue I had to upgrade all of its packages. In doing so I ran into even more problems, and I'm at my wits' end.

This is a web-scraping project that uses several packages. I updated Scrapy and Twisted to the latest versions, and now I get the following error when I run the scraper from the command line. I've tried downgrading Twisted and uninstalling/reinstalling, but I still hit the same error.

I'm running Windows 8.1.

The error is as follows:

    c:\RND\scraper\crawlers>scrapy crawl reuters
    2015-08-24 12:40:34 [scrapy] INFO: Scrapy 1.0.3 started (bot: crawlers)
    2015-08-24 12:40:34 [scrapy] INFO: Optional features available: ssl, http11
    2015-08-24 12:40:34 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'crawlers.spiders', 'DUPEFILTER_CLASS': 'crawlers.utils.DuplicateArticleFilter', 'SPIDER_MODULES': ['crawlers.spiders.reuters', 'crawlers.spiders.bbc', 'crawlers.spiders.canwildlife', 'crawlers.spiders.usgs'], 'BOT_NAME': 'crawlers', 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36', 'DOWNLOAD_DELAY': 1}
    2015-08-24 12:40:35 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
    c:\Python27\lib\site-packages\twisted\internet\endpoints.py:29: DeprecationWarning: twisted.internet.interfaces.IStreamClientEndpointStringParser was deprecated in Twisted 14.0.0: This interface has been superseded by IStreamClientEndpointStringParserWithReactor.
      from twisted.internet.interfaces import (

    2015-08-24 12:40:35 [py.warnings] WARNING: c:\Python27\lib\site-packages\twisted\internet\endpoints.py:29: DeprecationWarning: twisted.internet.interfaces.IStreamClientEndpointStringParser was deprecated in Twisted 14.0.0: This interface has been superseded by IStreamClientEndpointStringParserWithReactor.
      from twisted.internet.interfaces import (

    2015-08-24 12:40:36 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2015-08-24 12:40:36 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2015-08-24 12:40:36 [scrapy] INFO: Enabled item pipelines: MongodbExportPipeline
    2015-08-24 12:40:36 [scrapy] INFO: Spider opened
    Unhandled error in Deferred:
    2015-08-24 12:40:36 [twisted] CRITICAL: Unhandled error in Deferred:

    Traceback (most recent call last):
      File "c:\Python27\lib\site-packages\twisted\internet\defer.py", line 1274, in unwindGenerator
        return _inlineCallbacks(None, gen, Deferred())
      File "c:\Python27\lib\site-packages\twisted\internet\defer.py", line 1128, in _inlineCallbacks
        result = g.send(result)
      File "c:\Python27\lib\site-packages\scrapy\crawler.py", line 73, in crawl
        yield self.engine.open_spider(self.spider, start_requests)
      File "c:\Python27\lib\site-packages\twisted\internet\defer.py", line 1274, in unwindGenerator
        return _inlineCallbacks(None, gen, Deferred())
    --- <exception caught here> ---
      File "c:\Python27\lib\site-packages\twisted\internet\defer.py", line 1128, in _inlineCallbacks
        result = g.send(result)
      File "c:\Python27\lib\site-packages\scrapy\core\engine.py", line 232, in open_spider
        scheduler = self.scheduler_cls.from_crawler(self.crawler)
      File "c:\Python27\lib\site-packages\scrapy\core\scheduler.py", line 28, in from_crawler
        dupefilter = dupefilter_cls.from_settings(settings)
      File "c:\Python27\lib\site-packages\scrapy\dupefilters.py", line 44, in from_settings
        return cls(job_dir(settings), debug)
    exceptions.TypeError: __init__() takes at most 2 arguments (3 given)
    2015-08-24 12:40:36 [twisted] CRITICAL:

Here is my pip list:

    amqp (1.4.6)
    anyjson (0.3.3)
    billiard (3.3.0.16)
    celery (3.1.9)
    cffi (1.1.2)
    characteristic (14.3.0)
    cryptography (0.9.3)
    cssselect (0.9.1)
    cython (0.20.1)
    django (1.6.1)
    django-extensions (1.3.
    django-guardian (1.1.1)
    django-userena (1.2.4)
    dstk (0.50)
    easy-thumbnails (1.4)
    egenix (0.13.0-1.0.0j-1
    enum34 (1.0.4)
    geomet (0.1.0)
    html2text (3.200.3)
    idna (2.0)
    ipaddress (1.0.12)
    ipython (1.1.0)
    kombu (3.0.24)
    lxml (3.4.4)
    mongoengine (0.8.7)
    ndg-httpsclient (0.4.0)
    pillow (2.3.0)
    pip (7.1.0)
    psycopg2 (2.5.2)
    pyasn1 (0.1.8)
    pyasn1-modules (0.0.5)
    pycparser (2.14)
    pymongo (2.6.3)
    pyOpenSSL (0.15.1)
    pyreadline (2.0)
    python-dateutil (2.2)
    pytz (2014.1)
    queuelib (1.2.2)
    requests (2.7.0)
    Scrapy (1.0.3)
    service-identity (14.0.
    setuptools (18.2)
    simplejson (3.3.3)
    six (1.9.0)
    south (0.8.4)
    Twisted (15.3.0)
    version (0.1.1)
    w3lib (1.11.0)
    zope.interface (4.1.2)

Best Answer

Your spider uses a custom dupefilter, configured in your settings.py file ('DUPEFILTER_CLASS': 'crawlers.utils.DuplicateArticleFilter').

Scrapy throws the exception while trying to instantiate that dupefilter. Try running the spider without the custom dupefilter and see whether it loads.
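
For a quick test, you could temporarily point DUPEFILTER_CLASS back at Scrapy's built-in filter in settings.py. This is a minimal sketch; scrapy.dupefilters.RFPDupeFilter is the default class in Scrapy 1.0:

    # settings.py
    # Comment out the custom filter for now:
    # DUPEFILTER_CLASS = 'crawlers.utils.DuplicateArticleFilter'

    # ...or fall back explicitly to the Scrapy 1.0 default:
    DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'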

Note: until you update the dupefilter for the new Scrapy/Twisted versions, your spider will not filter duplicate URLs correctly. The traceback points at the likely cause: in Scrapy 1.0, RFPDupeFilter.from_settings() instantiates the filter as cls(job_dir(settings), debug), passing a debug argument that your filter's __init__ apparently does not accept. However, without knowing which Scrapy/Twisted versions you upgraded from, and without seeing the settings/dupefilter code, we can't say for certain why the exception is thrown.
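
If DuplicateArticleFilter subclasses RFPDupeFilter, the fix might look like the sketch below. Only the class name is known from your settings; the body of your filter isn't shown, so everything beyond the signature change is illustrative:

    # crawlers/utils.py (sketch)
    from scrapy.dupefilters import RFPDupeFilter

    class DuplicateArticleFilter(RFPDupeFilter):
        # Scrapy 1.0's RFPDupeFilter.from_settings() instantiates the
        # filter as cls(job_dir(settings), debug), so __init__ must now
        # accept the extra `debug` argument and forward it to the base class.
        def __init__(self, path=None, debug=False):
            super(DuplicateArticleFilter, self).__init__(path, debug)
            # ...any article-specific dedup state would go here...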

Regarding python - Scrapy and Twisted errors, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/32187766/
