python - Scrapy, login with captcha fails

I am using the following spider to scrape tinyz.us, a site that requires authentication.

from scrapy.spiders import BaseSpider
from scrapy.http import FormRequest
import urllib2


class Start(BaseSpider):
    name = 'test'
    start_urls = ["http://tinyz.us"]

    def parse(self, response):

        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        imgRequest = urllib2.Request("http://tinyz.us/securimage/securimage_show.php", headers=headers)
        imgData = urllib2.urlopen(imgRequest).read()

        with open('captcha.png', 'wb') as f:
            f.write(imgData)

        captcha = raw_input("-----> Enter the captcha in manually :")

        return FormRequest.from_response(
            response=response,
            formdata={"login_user": "myusername",
                      "login_password": "mypass",
                      "captcha_code": captcha},
            formxpath="//*[@id='login-form']",
            callback=self.after_login,
            headers=headers)

    def after_login(self, response):
        print("AFTER LOGIN")
        with open('response.html', 'w') as f:
            f.write(response.body)

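The site serves the captcha from a constant URL and appears to generate a new captcha every time that URL is requested. I am not familiar with the techniques involved, but my inclination is to save the captcha image and enter its text manually.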

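The problem is that the login always returns a failure response. I am not sure whether this is caused by the way Scrapy passes data to the form or by the captcha, and I can't find a way to debug the spider properly.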

Best answer

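Well, the problem here is that the captcha image request needs to carry the cookies received with the actual login page response. You are making the captcha request with urllib2, which runs outside Scrapy, so Scrapy's cookie handling never sees that request and the captcha you read does not belong to the session that submits the form.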
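To illustrate why the cookies matter, here is a minimal, self-contained sketch that uses only the Python 3 standard library. The toy server, its `/captcha` and `/login` paths, and the `sid` cookie are hypothetical stand-ins. How tinyz.us actually ties a captcha to a session is an assumption here, not something the question confirms.

```python
import http.cookiejar
import threading
import urllib.request
import uuid
from http.cookies import SimpleCookie
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in for the site: every session gets its own captcha.
CAPTCHAS = {}  # session id -> expected captcha text


class Handler(BaseHTTPRequestHandler):
    def _session(self):
        # Reuse the session named in the Cookie header, or start a new one.
        cookies = SimpleCookie(self.headers.get("Cookie", ""))
        if "sid" in cookies:
            return cookies["sid"].value, False
        return uuid.uuid4().hex, True

    def do_GET(self):
        sid, is_new = self._session()
        if self.path == "/captcha":
            CAPTCHAS[sid] = uuid.uuid4().hex[:5]  # new captcha for this session
            body = CAPTCHAS[sid]
        elif self.path.startswith("/login?code="):
            code = self.path.split("=", 1)[1]
            body = "ok" if CAPTCHAS.get(sid) == code else "fail"
        else:
            body = "login page"
        self.send_response(200)
        if is_new:
            self.send_header("Set-Cookie", "sid=" + sid)
        self.end_headers()
        self.wfile.write(body.encode())

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch(opener, url):
    with opener.open(url) as resp:
        return resp.read().decode()


def demo():
    server = HTTPServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_port
    no_proxy = urllib.request.ProxyHandler({})  # ignore any proxy settings

    # Plays the role of Scrapy: loads the login page and keeps its cookies.
    jar = http.cookiejar.CookieJar()
    session = urllib.request.build_opener(
        no_proxy, urllib.request.HTTPCookieProcessor(jar))
    fetch(session, base + "/")

    # Plays the role of the urllib2 call: no cookies, so a separate session.
    outsider = urllib.request.build_opener(no_proxy)
    foreign_code = fetch(outsider, base + "/captcha")
    result_foreign = fetch(session, base + "/login?code=" + foreign_code)

    # Captcha fetched with the session's own cookies.
    own_code = fetch(session, base + "/captcha")
    result_own = fetch(session, base + "/login?code=" + own_code)

    server.shutdown()
    server.server_close()
    return result_foreign, result_own


if __name__ == "__main__":
    print(demo())
```

In this sketch the captcha fetched without cookies belongs to a different session, so the login that uses it fails, while the captcha fetched through the cookie-aware opener is accepted.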

Use a Scrapy request to fetch the captcha instead, for example:

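# At the top of the spider module:
from scrapy.http import FormRequest, Request

def parse(self, response):
    # Fetch the captcha through Scrapy so the request carries the session
    # cookies set by the login page; keep that page for the form step.
    yield Request(url="http://tinyz.us/securimage/securimage_show.php",
                  callback=self.parse_captcha,
                  meta={'previous_response': response})

def parse_captcha(self, response):
    # Save the image so it can be opened and read by a human.
    with open('captcha.png', 'wb') as f:
        f.write(response.body)

    # raw_input is Python 2, as in the question; use input() on Python 3.
    captcha = raw_input("-----> Enter the captcha in manually :")

    # Fill in and submit the form found on the original login page.
    return FormRequest.from_response(
        response=response.meta['previous_response'],
        formdata={"login_user": "myusername",
                  "login_password": "mypass",
                  "captcha_code": captcha},
        formxpath="//*[@id='login-form']",
        callback=self.after_login)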
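To check whether the captcha request and the form submission really share one session, Scrapy's cookie logging may help. The fragment below is a sketch that assumes a standard Scrapy project with a `settings.py` file; the same keys could also go into a `custom_settings` dict on the spider.

```python
# settings.py (fragment)
COOKIES_ENABLED = True  # Scrapy's default; the cookie middleware must stay on
COOKIES_DEBUG = True    # log the Cookie / Set-Cookie headers of each request and response
LOG_LEVEL = "DEBUG"     # the cookie log lines are emitted at DEBUG level
```

With these settings the log should show the session cookie set by the login page being sent again with the captcha request and with the form submission.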

For "python - Scrapy, login with captcha fails", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/43068403/
