python - How to scrape a website with Selenium WebDriver without getting blocked

Tags: python selenium web-scraping proxy ip

I am scraping this page: https://www.elcorteingles.es/supermercado/alimentacion-general/ but every time, the browser either fails to load the page or cannot reach the site at all. How can I fix this?

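import time
from shutil import which

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service


class SuperSpider(scrapy.Spider):
    name = 'super'
    # allowed_domains takes bare domain names, not URL paths
    allowed_domains = ['www.elcorteingles.es']
    start_urls = ['https://www.elcorteingles.es/supermercado/alimentacion-general/']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        chrome_options = Options()
        chrome_options.add_argument("--headless")
        chrome_path = which("chromedriver")
        # Selenium 4 style: the driver path goes through Service, and the
        # options must be passed in or --headless is never applied
        driver = webdriver.Chrome(service=Service(chrome_path), options=chrome_options)
        driver.get("https://www.elcorteingles.es/supermercado/alimentacion-general/")
        driver.maximize_window()
        time.sleep(25)  # crude fixed wait for the page to finish loading
        self.html = driver.page_source
        driver.quit()  # end the whole browser session, not just the window

    def parse(self, response):
        pass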

Best answer

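import time

from fake_useragent import UserAgent
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()

# Send a randomly chosen user agent on each run
ua = UserAgent()
user_agent = ua.random
print(user_agent)
options.add_argument(f'user-agent={user_agent}')

# Disable the Blink feature that marks the browser as automation-controlled
options.add_argument('--disable-blink-features=AutomationControlled')

options.add_argument('--headless')
options.add_argument("--window-size=1920,1080")

# your code: create the driver with these options and load the page
driver = webdriver.Chrome(options=options)
driver.get("https://www.elcorteingles.es/supermercado/alimentacion-general/")
time.sleep(30)
print(driver.page_source)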

This should get past the bot detection, but be aware that driver.page_source is very large.
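
As a rough sketch of how these options fit together, the helper below collects the same Chrome arguments into a list so they can be inspected before being handed to Selenium. The names `USER_AGENTS`, `build_chrome_args` and `make_options` are hypothetical and not part of the original answer, and the hardcoded user-agent pool is only a stand-in for fake_useragent. It also writes the user-agent switch in its explicit `--user-agent=` form.

```python
import random

# Hypothetical stand-in pool; the answer draws user agents from fake_useragent.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]


def build_chrome_args(user_agent, headless=True, window_size=(1920, 1080)):
    """Return the Chrome command-line arguments used in the answer as a list."""
    width, height = window_size
    args = [
        f"--user-agent={user_agent}",
        # Same switch as in the answer: hides the automation marker from pages
        "--disable-blink-features=AutomationControlled",
        f"--window-size={width},{height}",
    ]
    if headless:
        args.append("--headless")
    return args


def make_options(args):
    """Apply the arguments to a Selenium Options object (needs selenium installed)."""
    # Imported here so the list-building part works without selenium
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in args:
        options.add_argument(arg)
    return options


if __name__ == "__main__":
    chosen = random.choice(USER_AGENTS)
    for arg in build_chrome_args(chosen):
        print(arg)
```

With selenium installed, `webdriver.Chrome(options=make_options(build_chrome_args(user_agent)))` would start the browser with these settings. Whether that is enough to avoid being blocked depends on the site.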

This question, "python - How to scrape a website with Selenium WebDriver without getting blocked", corresponds to a similar question found on Stack Overflow: https://stackoverflow.com/questions/66454219/
