python - How does the server tell a robot from a human when I crawl a web page with Selenium WebDriver?

Tags: python selenium firefox web-crawler

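Our lab is working with a network company on a technology that protects web pages from being scraped by crawlers. The test site is http://119.254.209.77/. I cannot obtain the URLs of the pages listed on the left (for example "Checking"); the URL is only created when the link is clicked. Using Python + Selenium + Firefox I simulated the click, but I got a blank page instead of the real data. If I click the link myself, the real data comes back. So I would like to know how the server recognizes me as a web crawler when I drive Firefox through Selenium WebDriver, and how I can avoid being treated as a crawler by the site.

Here is my code: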


    import time

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    driver.get('http://119.254.209.77/')
    time.sleep(5)  # give the landing page time to load
    print(driver.page_source)

    # the target link
    checking = driver.find_element(By.ID, '_ctl0__ctl0_Content_MenuHyperLink2')

    # this click seems to have no effect: the page that comes back is blank
    checking.click()
    time.sleep(2)
    print(driver.page_source)
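A fixed `time.sleep(2)` makes it hard to tell "the click did nothing" apart from "the page had not finished updating yet". As a rough sketch, the helper below polls `driver.page_source` until it differs from the snapshot taken before the click. The name `wait_for_source_change` and its parameters are made up for illustration, and it assumes that a change in the page source is a good enough signal for this site. Selenium's own `WebDriverWait` offers the same kind of polling against a condition.

```python
import time


def wait_for_source_change(driver, previous_source, timeout=10.0, poll_interval=0.5):
    """Poll driver.page_source until it differs from previous_source.

    `driver` only needs a `page_source` attribute, so any Selenium
    WebDriver instance fits. Returns the new source, or None if the
    source is still unchanged once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        current = driver.page_source
        if current != previous_source:
            return current
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval)
```

In the script above this would be used by saving `before = driver.page_source` ahead of `checking.click()` and calling `wait_for_source_change(driver, before)` afterwards; a `None` result suggests the click really did not change the page.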

Best answer

The site appears to check where your mouse is before taking you to the next page. Moving to the element before clicking it worked for me:

    import time

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get('http://119.254.209.77/')
    time.sleep(5)  # give the landing page time to load
    print(driver.page_source)

    # the target link
    checking = driver.find_element(By.ID, '_ctl0__ctl0_Content_MenuHyperLink2')

    # move the pointer onto the link first, then click it
    action_chain = webdriver.ActionChains(driver)
    action_chain.move_to_element(checking)
    action_chain.click(checking)
    action_chain.perform()

    time.sleep(2)
    print(driver.page_source)
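What the site really checks cannot be seen from the client side, so the following is only an illustration of the kind of test the answer is guessing at, not the site's actual logic. It assumes a page script reports pointer events back to the server, and that a click on the link is accepted only if the most recent pointer move landed inside the link. Every name here (`PointerEvent`, `Box`, `click_looks_human`) is hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class PointerEvent:
    """One pointer event as a hypothetical page script might report it."""
    kind: str  # "move" or "click"
    x: int
    y: int


@dataclass(frozen=True)
class Box:
    """Bounding box of the link in page coordinates (hypothetical)."""
    left: int
    top: int
    width: int
    height: int

    def contains(self, x: int, y: int) -> bool:
        return (self.left <= x < self.left + self.width
                and self.top <= y < self.top + self.height)


def click_looks_human(events: Iterable[PointerEvent], link: Box) -> bool:
    """Accept the first click on `link` only if the pointer was last seen inside it."""
    last_move: Optional[Tuple[int, int]] = None
    for event in events:
        if event.kind == "move":
            last_move = (event.x, event.y)
        elif event.kind == "click" and link.contains(event.x, event.y):
            # A click with no earlier move, or with the pointer last seen
            # somewhere else, is treated as scripted.
            return last_move is not None and link.contains(*last_move)
    return False
```

Under this assumed rule, a click that arrives without a preceding move onto the link would be rejected, while the `move_to_element` followed by `click` sequence from the answer would pass.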

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/38320811/
