python - Various urllib2 errors when running Selenium webdriver on a VPS

Tags: python selenium-webdriver web-scraping vps headless-browser

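I'm using Selenium with the Python bindings to scrape AJAX content from web pages with headless Firefox. It works perfectly when run on my local machine. When I run the exact same script on my VPS, errors are thrown on seemingly random (yet consistent) lines. My local and remote systems have the exact same OS and architecture, so I'm guessing the difference is VPS-related.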

For each of these tracebacks, the line in question runs 4 times before the error is thrown.

I most often get this URLError when executing JavaScript to scroll an element into view.

  File "google_scrape.py", line 18, in _get_data
    driver.execute_script("arguments[0].scrollIntoView(true);", e)
  File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 396, in execute_script
    {'script': script, 'args':converted_args})['value']
  File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 162, in execute
    response = self.command_executor.execute(driver_command, params)
  File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 355, in execute
    return self._request(url, method=command_info[0], data=data)
  File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 402, in _request
    response = opener.open(request)
  File "/usr/lib64/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/usr/lib64/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/usr/lib64/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 1214, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib64/python2.7/urllib2.py", line 1184, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno 111] Connection refused>

Occasionally I get this BadStatusLine when reading text from an element.

  File "google_scrape.py", line 19, in _get_data
    if e.text.strip():
  File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 55, in text
    return self._execute(Command.GET_ELEMENT_TEXT)['value']
  File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 233, in _execute
    return self._parent.execute(command, params)
  File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 162, in execute
    response = self.command_executor.execute(driver_command, params)
  File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 355, in execute
    return self._request(url, method=command_info[0], data=data)
  File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 402, in _request
    response = opener.open(request)
  File "/usr/lib64/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/usr/lib64/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/usr/lib64/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 1214, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib64/python2.7/urllib2.py", line 1187, in do_open
    r = h.getresponse(buffering=True)
  File "/usr/lib64/python2.7/httplib.py", line 1045, in getresponse
    response.begin()
  File "/usr/lib64/python2.7/httplib.py", line 409, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python2.7/httplib.py", line 373, in _read_status
    raise BadStatusLine(line)
httplib.BadStatusLine: ''

A few times I've gotten a socket error:

  File "google_scrape.py", line 19, in _get_data
    if e.text.strip():
  File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 55, in text
    return self._execute(Command.GET_ELEMENT_TEXT)['value']
  File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 233, in _execute
    return self._parent.execute(command, params)
  File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 162, in execute
    response = self.command_executor.execute(driver_command, params)
  File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 355, in execute
    return self._request(url, method=command_info[0], data=data)
  File "/home/ryne/.virtualenvs/DEV/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 402, in _request
    response = opener.open(request)
  File "/usr/lib64/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/usr/lib64/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/usr/lib64/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.7/urllib2.py", line 1214, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib64/python2.7/urllib2.py", line 1187, in do_open
    r = h.getresponse(buffering=True)
  File "/usr/lib64/python2.7/httplib.py", line 1045, in getresponse
    response.begin()
  File "/usr/lib64/python2.7/httplib.py", line 409, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python2.7/httplib.py", line 365, in _read_status
    line = self.fp.readline(_MAXLINE + 1)
  File "/usr/lib64/python2.7/socket.py", line 476, in readline
    data = self._sock.recv(self._rbufsize)
socket.error: [Errno 104] Connection reset by peer

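I'm scraping from Google without a proxy, so my first thought was that my IP address is recognized as a VPS and put under a limit of 5 page manipulations or something similar. But my initial research indicates that these errors would not arise from being blocked.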

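Any insight into what these errors mean collectively, or into the necessary considerations when making HTTP requests from a VPS, would be much appreciated.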
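As a rough illustration of treating the three failures above as one family of dropped connections, the sketch below retries a WebDriver call after a pause. It is written for Python 3, where `urllib2.URLError`, `httplib.BadStatusLine`, and the `socket.error` cases for errno 111 and 104 are spelled `urllib.error.URLError`, `http.client.BadStatusLine`, and `ConnectionError` subclasses. The helper name `call_with_retry` and its parameters are hypothetical, and which exception types a current Selenium release actually surfaces is not something the question establishes, so the tuple may need adjusting. Retrying only helps if the failure is transient.

```python
import http.client
import time
import urllib.error

# Python 3 names for the exceptions seen in the tracebacks above.
TRANSIENT_ERRORS = (
    urllib.error.URLError,
    http.client.BadStatusLine,
    ConnectionError,  # covers ConnectionRefusedError and ConnectionResetError
)


def call_with_retry(action, attempts=3, delay=10.0, sleep=time.sleep):
    """Call action(); on a transient connection error, wait and try again.

    `attempts`, `delay`, and the injectable `sleep` are illustrative
    choices, not values taken from the question.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except TRANSIENT_ERRORS:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            sleep(delay)
```

A call from the scraper might then look like `call_with_retry(lambda: driver.execute_script("arguments[0].scrollIntoView(true);", e))`.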

Update

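After some thought and some reading on what a webdriver really is (automated browser input), I should have been confused about why remote_connection.py is making urllib2 requests at all. It would seem that the text method of the WebElement class is an "extra" feature of the Python bindings that isn't part of the Selenium core. That doesn't explain the above errors, but it may indicate that the text method shouldn't be used for scraping.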

Update 2

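I realized that, for my purposes, Selenium's only function is getting the AJAX content to load. So after the page has loaded, I now parse the source with lxml rather than getting elements with Selenium, i.e.: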

html = lxml.html.fromstring(driver.page_source)

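However, page_source is yet another method that results in a call to urllib2, and I consistently get the BadStatusLine error the second time I use it. Minimizing urllib2 requests is definitely a step in the right direction.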

Update 3

Eliminating urllib2 requests by grabbing the source with JavaScript is better yet:

html = lxml.html.fromstring(driver.execute_script("return window.document.documentElement.outerHTML"))
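The approach in the updates can be sketched as two small steps: fetch the rendered DOM once, then do all element lookups offline with lxml. Note that `execute_script` is itself a WebDriver command (the first traceback above passes through it), so this amounts to one round-trip per page rather than one per element. The function names and the `xpath` parameter are hypothetical, and `driver` is assumed to be any object exposing Selenium's `execute_script`.

```python
def rendered_html(driver):
    """Fetch the rendered DOM with a single execute_script call."""
    return driver.execute_script(
        "return window.document.documentElement.outerHTML"
    )


def non_empty_texts(html, xpath):
    """Parse HTML offline and return the stripped, non-empty text of matches.

    `xpath` is a caller-supplied expression that selects elements; nothing
    here is specific to the page scraped in the question.
    """
    import lxml.html  # third-party; imported lazily so the sketch loads without it

    tree = lxml.html.fromstring(html)
    texts = (el.text_content().strip() for el in tree.xpath(xpath))
    return [t for t in texts if t]
```

With this split, the per-element `e.text.strip()` check from the tracebacks becomes a filter inside `non_empty_texts`, and no further WebDriver commands are issued while parsing.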

Conclusion

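These errors can be avoided by doing a time.sleep(10) after every few requests. The best explanation I've come up with is that Google's firewall recognizes my IP as a VPS and therefore puts it under a stricter set of blocking rules.

This was my initial thought, but I still find it hard to believe, because my web searches return no indication that the above errors could be caused by a firewall.

If this is the case, though, I would think the stricter rules could be circumvented with a proxy, although that proxy might have to be a local system or Tor to avoid the same restrictions.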
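The pause described in this conclusion could look roughly like the sketch below. The class name `BatchPause` is hypothetical, and the batch size of 5 is an assumption, since the conclusion only says "every few requests"; the 10-second pause mirrors `time.sleep(10)`.

```python
import time


class BatchPause:
    """Sleep for `pause` seconds after every `every` calls to tick()."""

    def __init__(self, every=5, pause=10.0, sleep=time.sleep):
        # every=5 is an assumed batch size; pause=10.0 mirrors time.sleep(10).
        self.every = every
        self.pause = pause
        self._sleep = sleep
        self._count = 0

    def tick(self):
        """Record one request; pause when the current batch is full."""
        self._count += 1
        if self._count % self.every == 0:
            self._sleep(self.pause)
```

Calling `tick()` right after each WebDriver command keeps the counting in one place instead of scattering `time.sleep` calls through the scraper.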

Best answer

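As per our conversation, you discovered that even for a modest amount of daily scraping, Google has anti-scraping blocks in place. The solution is to delay a few seconds between each fetch.

In the general case, since you are technically transferring non-recoverable costs to a third party, it is good practice to reduce the extra resource load you place on the remote server. Without pauses between HTTP fetches, a fast server and connection can cause a remote denial of service, especially for scraping targets that lack Google's server resources.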
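One way to guarantee a gap between fetches, rather than pausing after a fixed number of them, is a minimum-interval throttle. This is only a sketch: the class name `Throttle` is hypothetical, and the 3-second default is an assumed stand-in for "a few seconds".

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # assumed value for "a few seconds"
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous fetch, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `wait()` immediately before each fetch means time already spent parsing the previous page counts toward the interval, so the scraper does not sleep longer than it needs to.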

Source: the Stack Overflow question "python - Various urllib2 errors when running Selenium webdriver on a VPS": https://stackoverflow.com/questions/20411814/