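Before enabling the NTLM downloader middleware, I tried scraping yahoo.com and it worked fine. Now that the downloader middleware is enabled in my settings, however, I get the error message "ERROR: Error downloading".
settings.py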
BOT_NAME = 'demo'
SPIDER_MODULES = ['demo.spiders']
NEWSPIDER_MODULE = 'demo.spiders'
DOWNLOADER_MIDDLEWARES = { 'demo.ntlmauth.NtlmAuthMiddleware': 800, }
ITEM_PIPELINES = [
    'scrapysolr.SolrPipeline',
]
SOLR_URL = 'solr_url'
SOLR_MAPPING = {
    'id': 'url',
    'text': ['title', 'breadcrumbs', 'description'],
    'description': 'description',
    'keywords': 'breadcrumbs',
    'price': 'price',
    'title': 'title'
}
ntlmauth.py. This code can also be found here.
import os
import urllib2
from ntlm import HTTPNtlmAuthHandler
from scrapy.http import HtmlResponse


class NtlmAuthMiddleware(object):
    def process_request(self, request, spider):
        usr = '%s\%s' % (os.environ["USERDOMAIN"], getattr(spider,'http_user', ''))
        pwd = getattr(spider, 'http_pass', '')
        url = request.url
        passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
        passman.add_password(None, url, usr, pwd)
        # Create the NTLM authentication handler.
        auth_NTLM = HTTPNtlmAuthHandler.HTTPNtlmAuthHandler(passman)
        # Create and install the opener.
        opener = urllib2.build_opener(auth_NTLM)
        urllib2.install_opener(opener)
        # Retrieve the result.
        resp = urllib2.urlopen(url)
        msg = resp.info()
        return HtmlResponse(url=url, status=resp.getcode(), headers=msg.items(), body=resp.read())
demo_Spider.py
import scrapy


class DemoSpider(scrapy.Spider):
    http_user = 'DOMAIN\\USER'
    http_pass = 'PASSWORD'
    name = "demo"
    allowed_domains = ["yahoo.com"]
    start_urls = [
        "https://www.yahoo.com/",
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
And here is the error I get!
Best answer
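Take a look at line 9 of the ntlm middleware:
usr = '%s\%s' % (os.environ["USERDOMAIN"], getattr(spider,'http_user', ''))
The error is raised because the environment variable USERDOMAIN is not set.
Even if it were set, with your current code the value of usr would be "OsUserDomain\DOMAIN\USER", which is unlikely to be what you want (it makes no sense). I suggest you modify either your spider or the middleware so that the result uses the correct "DOMAIN\USER" format.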
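One way to follow that suggestion is sketched below. The helper name build_ntlm_user and its environ parameter are hypothetical and not part of Scrapy or python-ntlm. The sketch assumes the spider may supply either a bare user name or a full "DOMAIN\USER" string, and that USERDOMAIN should only be used as a fallback.

```python
import os


def build_ntlm_user(http_user, environ=None):
    """Return an NTLM user name in DOMAIN-backslash-USER form.

    Hypothetical helper for illustration only.
    """
    if environ is None:
        environ = os.environ
    # If the spider already supplies a domain-qualified name, use it
    # unchanged instead of prefixing a second domain.
    if '\\' in http_user:
        return http_user
    # Otherwise fall back to USERDOMAIN. Using .get() avoids the KeyError
    # that os.environ["USERDOMAIN"] raises when the variable is not set.
    domain = environ.get('USERDOMAIN', '')
    if not domain:
        raise ValueError(
            'http_user has no domain and USERDOMAIN is not set')
    return '%s\\%s' % (domain, http_user)
```

In the middleware, the line that builds usr could then become usr = build_ntlm_user(getattr(spider, 'http_user', '')), so the spider's existing 'DOMAIN\\USER' value passes through untouched.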
Regarding "python - Why can't Scrapy fetch my URL after configuring the NTLM middleware?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/31706327/