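I am trying to scrape https://www.controller.com/ using Python. The page detected a bot when I used pandas.get_html, and again when I made requests with a user agent and rotating proxies, so I resorted to Selenium WebDriver. However, this is also detected as a bot, with the following message. Can anyone explain how I can get past this?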
"Pardon Our Interruption... As you were browsing www.controller.com something about your browser made us think you were a bot. There are a few reasons this might happen: You're a power user moving through this website with super-human speed. You've disabled JavaScript in your web browser. A third-party browser plugin, such as Ghostery or NoScript, is preventing JavaScript from running. Additional information is available in this support article. To request an unblock, please fill out the form below and we will review it as soon as possible"
Here is my code:
from selenium import webdriver
import requests
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
#options.add_argument('headless')
driver = webdriver.Chrome(chrome_options=options)
driver.get('https://www.controller.com/')
driver.implicitly_wait(30)
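Best answer
You mentioned pandas.get_html only in your question and options.add_argument('headless') only in your code, so it is not clear whether you are actually using either of them. However, taking the minimal code from your attempt:
Code block: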
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
# Selenium 3 call style, kept from the original answer; newer Selenium releases take options= and a Service object instead.
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get('https://www.controller.com/')
print(driver.title)
I ran into the same problem.
When I inspected the HTML DOM, I observed that the website references distil_referrer in window.onbeforeunload as follows:
<script type="text/javascript" id="">
window.onbeforeunload=function(a){"undefined"!==typeof sessionStorage&&sessionStorage.removeItem("distil_referrer")};
</script>
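This clearly indicates that the website is protected by the bot management service provider Distil Networks, and that navigation by ChromeDriver is detected and subsequently blocked.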
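The check described above can be sketched in Python as a simple string search over a page's HTML. This is only a rough heuristic: the marker list is an assumption drawn from this thread (the distil_referrer key and the block page's heading), not an official Distil signature set, and find_distil_markers is a hypothetical helper name.

```python
from typing import List

# Markers taken from this thread. Assumption: this is not an official
# or complete list of Distil signatures.
DISTIL_MARKERS = (
    "distil_referrer",          # sessionStorage key seen in the onbeforeunload script
    "Pardon Our Interruption",  # heading of the block page quoted in the question
)


def find_distil_markers(page_source: str) -> List[str]:
    """Return the known markers that occur in the given HTML, ignoring case."""
    lowered = page_source.lower()
    return [marker for marker in DISTIL_MARKERS if marker.lower() in lowered]


if __name__ == "__main__":
    # The script tag observed in the page's DOM, used here as sample input.
    sample = (
        '<script type="text/javascript" id="">'
        'window.onbeforeunload=function(a){"undefined"!==typeof sessionStorage'
        '&&sessionStorage.removeItem("distil_referrer")};</script>'
    )
    print(find_distil_markers(sample))
```

With Selenium, the same helper could be called as find_distil_markers(driver.page_source) after driver.get(...). An empty result does not prove the site is unprotected; it only means none of these particular strings were found.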
Distil
According to the article There Really Is Something About Distil.it...:
Distil protects sites against automatic content scraping bots by observing site behavior and identifying patterns peculiar to scrapers. When Distil identifies a malicious bot on one site, it creates a blacklisted behavioral profile that is deployed to all its customers. Something like a bot firewall, Distil detects patterns and reacts.
Further,
"One pattern with Selenium was automating the theft of Web content", Distil CEO Rami Essaid said in an interview last week. "Even though they can create new bots, we figured out a way to identify Selenium the a tool they're using, so we're blocking Selenium no matter how many times they iterate on that bot. We're doing that now with Python and a lot of different technologies. Once we see a pattern emerge from one type of bot, then we work to reverse engineer the technology they use and identify it as malicious".
References
You can find a couple of detailed discussions in:
About python - Webpage is detecting Selenium WebDriver with ChromeDriver as a bot: we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54984185/