python - Why can't Scrapy fetch my URL after I configure the NTLM middleware?

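Before I enabled the NTLM downloader middleware, I tried scraping yahoo.com and it worked fine. Now that the downloader middleware is enabled in my settings, I get the error message "ERROR: Error downloading".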

settings.py

    BOT_NAME = 'demo'

    SPIDER_MODULES = ['demo.spiders']
    NEWSPIDER_MODULE = 'demo.spiders'

    DOWNLOADER_MIDDLEWARES = { 'demo.ntlmauth.NtlmAuthMiddleware': 800, }

    ITEM_PIPELINES = [
                  'scrapysolr.SolrPipeline',
    ]

    SOLR_URL = 'solr_url'
    SOLR_MAPPING = {
       'id': 'url',
       'text': ['title', 'breadcrumbs', 'description'],
       'description': 'description',
       'keywords': 'breadcrumbs',
       'price': 'price',
       'title': 'title'
    }

ntlmauth.py. This code can also be found here.

    import os
    import urllib2
    from ntlm import HTTPNtlmAuthHandler
    from scrapy.http import HtmlResponse

    class NtlmAuthMiddleware(object):

        def process_request(self, request, spider):
            usr = '%s\%s' % (os.environ["USERDOMAIN"], getattr(spider,'http_user', ''))
            pwd = getattr(spider, 'http_pass', '')
            url = request.url

            passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
            passman.add_password(None, url, usr, pwd)

            # Create the NTLM authentication handler.
            auth_NTLM = HTTPNtlmAuthHandler.HTTPNtlmAuthHandler(passman)

            # Create and install the opener.
            opener = urllib2.build_opener(auth_NTLM)
            urllib2.install_opener(opener)

            # Retrieve the result.
            resp = urllib2.urlopen(url)
            msg = resp.info()

            return HtmlResponse(url=url, status=resp.getcode(), headers=msg.items(), body=resp.read())

demo_Spider.py

    import scrapy

    class DemoSpider(scrapy.Spider):
        http_user = 'DOMAIN\\USER'
        http_pass = 'PASSWORD'
        name = "demo"
        allowed_domains = ["yahoo.com"]
        start_urls = ["https://www.yahoo.com/"]

        def parse(self, response):
            filename = response.url.split("/")[-2] + '.html'
            with open(filename, 'wb') as f:
                f.write(response.body)

And here is the error I received!

Best answer

Take a look at line 9 of the NTLM middleware:

    usr = '%s\%s' % (os.environ["USERDOMAIN"], getattr(spider,'http_user', ''))

The error is raised because the environment variable USERDOMAIN is not set.
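To see why an unset variable stops the download, here is a minimal sketch of the failure. `os.environ` behaves like a dictionary, so indexing a missing key raises `KeyError`. The helper names `read_user_domain` and `domain_is_missing` are hypothetical and exist only for this illustration; a plain dict stands in for the real environment.

```python
def read_user_domain(environ):
    """Mimic the middleware's os.environ["USERDOMAIN"] lookup."""
    # Indexing a mapping with a missing key raises KeyError.
    return environ["USERDOMAIN"]


def domain_is_missing(environ):
    """Return True if the lookup used by the middleware would fail."""
    try:
        read_user_domain(environ)
    except KeyError:
        return True
    return False
```

Calling `domain_is_missing({})` returns `True`, which corresponds to the situation in the question. USERDOMAIN is typically provided by Windows, so it is likely to be absent when the spider runs on another platform.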

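With your current code, the value of usr would be "OsUserDomain\DOMAIN\USER", which is unlikely to be what you want (it makes no sense). I suggest modifying your spider or your middleware so that it uses the correct "DOMAIN\USER" format.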
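One possible way to build the user name in the middleware is sketched below. It is an assumption about what a fix could look like, not the original author's code: the function name `build_ntlm_user` and its `environ` parameter are hypothetical. It leaves a value that already contains a domain untouched, and it reads USERDOMAIN with `.get()` so that a missing variable does not raise an exception.

```python
import os


def build_ntlm_user(http_user, environ=None):
    """Return a 'DOMAIN\\user' string suitable for NTLM authentication."""
    environ = os.environ if environ is None else environ

    # The spider already supplied 'DOMAIN\\user': use it unchanged
    # instead of prepending a second domain.
    if '\\' in http_user:
        return http_user

    # Otherwise prepend USERDOMAIN when it is available; .get() avoids
    # the KeyError raised by environ["USERDOMAIN"] when it is unset.
    domain = environ.get('USERDOMAIN', '')
    if domain:
        return '%s\\%s' % (domain, http_user)
    return http_user
```

In the middleware, line 9 could then read `usr = build_ntlm_user(getattr(spider, 'http_user', ''))`. The alternative is to keep the middleware as it is, set `http_user = 'USER'` in the spider, and make sure USERDOMAIN is defined before the crawl starts.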

A similar question can be found on Stack Overflow: https://stackoverflow.com/questions/31706327/
