python - Webpage is detecting Selenium WebDriver with ChromeDriver as a bot

Tags: python selenium selenium-webdriver webdriver bots

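I am trying to scrape https://www.controller.com/ using Python. Because the page detected a bot when I used pandas.get_html, and also when I made requests with user agents and rotating proxies, I resorted to Selenium WebDriver. However, this is also detected as a bot, with the following message. Can anybody explain how I can get past this?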

Pardon Our Interruption... As you were browsing www.controller.com something about your browser made us think you were a bot. There are a few reasons this might happen: You're a power user moving through this website with super-human speed. You've disabled JavaScript in your web browser. A third-party browser plugin, such as Ghostery or NoScript, is preventing JavaScript from running. Additional information is available in this support article. To request an unblock, please fill out the form below and we will review it as soon as possible.

Here is my code:

from selenium import webdriver
import requests
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
# options.add_argument('headless')
# Pass the options via `options=`; `chrome_options=` was deprecated and later removed in Selenium 4
driver = webdriver.Chrome(options=options)
driver.get('https://www.controller.com/')
driver.implicitly_wait(30)

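Best answer

You have mentioned pandas.get_html only in your question and options.add_argument('headless') only in your code, so I am not sure whether you are actually using them. However, taking the minimal code from your code attempt gives the following:

  • Code block: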

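    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    options = webdriver.ChromeOptions()
    options.add_argument("start-maximized")
    options.add_argument("disable-infobars")
    options.add_argument("--disable-extensions")
    # Selenium 4 style: the ChromeDriver path is passed through a Service object
    service = Service(executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
    driver = webdriver.Chrome(service=service, options=options)
    driver.get('https://www.controller.com/')
    # Print the page title to see which page was actually served
    print(driver.title)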
    

I ran into the same issue.

  • Browser snapshot:

[Screenshot: controller_com]

When I inspected the HTML DOM, I observed that the website references distil_referrer on window.onbeforeunload as follows:

<script type="text/javascript" id="">
    window.onbeforeunload=function(a){"undefined"!==typeof sessionStorage&&sessionStorage.removeItem("distil_referrer")};
</script>

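Snapshot:

[Screenshot: onbeforeunload]

This clearly indicates that the website is protected by the Bot Management service provider Distil Networks, and that navigation by ChromeDriver gets detected and subsequently blocked.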
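
The two observations above (the "Pardon Our Interruption" block page and the distil_referrer reference in the DOM) can be turned into a quick check on the page source. This is only a heuristic sketch: the function names and marker constants are hypothetical, the marker strings are simply the ones quoted in this post, and a site can change either of them at any time.

```python
# Marker strings copied from the block page and the inline script quoted in this post.
BLOCK_PAGE_MARKER = "Pardon Our Interruption"
DISTIL_MARKER = "distil_referrer"


def is_block_page(html: str) -> bool:
    """Return True if the HTML contains the block-page heading (hypothetical helper)."""
    return BLOCK_PAGE_MARKER.lower() in html.lower()


def mentions_distil(html: str) -> bool:
    """Return True if the HTML references the distil_referrer key (hypothetical helper)."""
    return DISTIL_MARKER in html
```

With a live session, the input would be driver.page_source taken right after driver.get(...), for example is_block_page(driver.page_source).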


Distil

According to the article There Really Is Something About Distil.it...:

Distil protects sites against automatic content scraping bots by observing site behavior and identifying patterns peculiar to scrapers. When Distil identifies a malicious bot on one site, it creates a blacklisted behavioral profile that is deployed to all its customers. Something like a bot firewall, Distil detects patterns and reacts.

Further,

"One pattern with Selenium was automating the theft of Web content", Distil CEO Rami Essaid said in an interview last week. "Even though they can create new bots, we figured out a way to identify Selenium the a tool they're using, so we're blocking Selenium no matter how many times they iterate on that bot. We're doing that now with Python and a lot of different technologies. Once we see a pattern emerge from one type of bot, then we work to reverse engineer the technology they use and identify it as malicious".
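
The interview does not say which signal Distil relies on. One signal that any page script can read, though, is the standard navigator.webdriver property, which the W3C WebDriver specification defines as true when the browser is under automation. The sketch below reads that flag through execute_script. It is an illustration of why a Selenium session can be recognizable, not a description of Distil's actual method. The function name and the FakeDriver stand-in are hypothetical, and FakeDriver exists only so the example runs without a browser.

```python
def reports_webdriver_flag(driver) -> bool:
    """Ask the page's JavaScript context whether navigator.webdriver is truthy."""
    return bool(driver.execute_script("return navigator.webdriver"))


class FakeDriver:
    """Hypothetical stand-in for a Selenium driver, used only for a browser-free demo."""

    def __init__(self, flag):
        self._flag = flag

    def execute_script(self, script):
        # A real driver would run the script in the page; here we return a canned value.
        return self._flag


if __name__ == "__main__":
    print(reports_webdriver_flag(FakeDriver(True)))
    print(reports_webdriver_flag(FakeDriver(None)))
```

With a real session, the same function can be called with the driver object created earlier, after a page has loaded.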


Reference

You can find a couple of detailed discussions in:

A similar question about "python - Webpage is detecting Selenium WebDriver with ChromeDriver as a bot" can be found on Stack Overflow: https://stackoverflow.com/questions/54984185/
