javascript - Selenium : scraping a page till all the products loaded

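I'm new to Selenium and working on a project that needs to scrape URLs from a page.

The source is: https://www.autofurnish.com/audi-car-accessories

I want to scrape the URLs of these products. I've got that working, but I'm stuck on the scrolling part. I need the URLs of all the products on this page, and it's a huge page with a lot of results.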

What I tried:

driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

I tried this code, but it just scrolls straight to the bottom and not all of the products get loaded.

data = driver.find_elements(By.XPATH,"//h2[@class='product-title']//a")
for i in data:
    driver.execute_script("arguments[0].scrollIntoView();", i)

items = []
last_height = driver.execute_script("return document.body.scrollHeight")
item_targetcount = 1000
while item_targetcount > len(items):
    driver.execute_script("window.scrollTo(0,document.body.scrollHeight);")
    time.sleep(2)  # give the site time to load
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

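I looked for help in these questions: "How to scroll down in Python Selenium step by step" and "Scrolling to element using webdriver?". I also watched some YouTube videos but still couldn't solve this.

My main code for scraping the other details is: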

prod_details = []
for i in models:
    driver.find_element(By.XPATH,"//span[@aria-labelledby='select2-brand-container']").click()
    time.sleep(2)
    driver.find_element(By.XPATH,"//input[@class='select2-search__field']").send_keys(i)
    driver.find_element(By.XPATH,"//input[@class='select2-search__field']").send_keys(Keys.ENTER)
    driver.find_element(By.XPATH,"//div[@class='btnred sbv-link sbv-inactive']").click()
    time.sleep(3)
    prod = driver.find_elements(By.XPATH,"//h2[@class='product-title']//a")
    for i in prod:
        prod_details.append(i.get_attribute("href"))
    driver.get('https://www.autofurnish.com/')
    time.sleep(2)

I still can't get the page to load fully and return all the results.

Best answer

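This one was quite tricky... I ran into several unexpected problems while trying to get it to work.

The main problem was waiting for the loading spinner and keeping it on screen. I initially tried scrolling to the bottom of the page like you did, but that sent the page into an infinite loop of loading new product sections, because the footer is so large that the loading spinner ended up above the visible part of the page (at least for me). I got around this by scrolling to the last visible product, which is far enough down to trigger the next section to load but not so far that the page goes into infinite-loading mode.

In most cases where a loading spinner is involved, you want to wait for it to become visible and then invisible. This avoids bad timing situations and is the most reliable way to wait for new products to load.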
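To illustrate why both phases matter, here is a minimal sketch of the "visible, then invisible" wait written as plain polling, with no browser involved. The names `wait_for_cycle` and `is_visible`, and the default timeouts, are hypothetical and only for illustration; in Selenium the two `wait.until(...)` calls on the spinner in the answer's code do this job. If you only waited for "invisible", the check could pass before the spinner had even appeared.

```python
import time


def wait_for_cycle(is_visible, timeout=10.0, poll=0.05):
    """Block until is_visible() has been True and then False again.

    is_visible is any zero-argument callable (hypothetical stand-in for
    "is the loading spinner currently displayed?").
    Returns True if the full cycle was observed, False on timeout.
    """
    deadline = time.monotonic() + timeout

    # Phase 1: wait for the spinner to show up.
    while not is_visible():
        if time.monotonic() > deadline:
            return False
        time.sleep(poll)

    # Phase 2: wait for the spinner to go away again.
    while is_visible():
        if time.monotonic() > deadline:
            return False
        time.sleep(poll)

    return True
```

One difference from Selenium worth noting: this sketch returns `False` on timeout, whereas `WebDriverWait.until` raises `TimeoutException`.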

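The basic flow is

Load the page
Start a loop
1. Grab all product A tags
2. Using JS, scroll the page down to the last A tag
3. Wait for the loading spinner to become visible, then invisible
4. If no more products loaded or some maximum number of products has been reached, exit the loop
Write out the total number of products
Write out the product URLs

Code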

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

...

# may need to adjust the timeout based on your experience... the site is really slow for me
wait = WebDriverWait(driver, 60)
new_count = 0
old_count = 0
while True:
    old_count = new_count
    products = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "h2.product-title > a")))
    new_count = len(products)

    # scroll down to last product to trigger the loading spinner
    driver.execute_script("arguments[0].scrollIntoView();", products[-1])

    # wait for loading spinner to appear and then disappear
    wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.infinite-scroll-loader")))
    wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR, "div.infinite-scroll-loader")))

    # if the count didn't change, we've loaded all products on the page
    # I put a max of 50 products to load as a demo. You can adjust higher as needed but you should put something reasonably sized here to prevent the script from running for an hour
    if new_count == old_count or new_count > 50:
        break

# print results
print(len(products))
for product in products:
    print(product.get_attribute("href"))
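
As a rough sketch, the stopping rule in the loop above can also be pulled out into a small function that knows nothing about Selenium, which makes it easy to check without a browser. The names `load_until_stable`, `get_items`, and `load_more` are hypothetical; this is one possible arrangement, not the answer's original code. Unlike the loop above, it checks the count before scrolling, so it skips the final scroll once the limit is passed.

```python
def load_until_stable(get_items, load_more, max_items=50):
    """Keep loading until the item count stops growing or exceeds max_items.

    get_items: zero-argument callable returning the current list of items.
    load_more: callable taking the last item; expected to trigger the next
               batch and return once loading has finished.
    """
    old_count = -1
    while True:
        items = get_items()
        # Nothing on the page at all: nothing to scroll to.
        if not items:
            return items
        # Count unchanged since the last pass, or demo limit exceeded.
        if len(items) == old_count or len(items) > max_items:
            return items
        old_count = len(items)
        load_more(items[-1])
```

With Selenium, `get_items` could be a lambda around `driver.find_elements(By.CSS_SELECTOR, "h2.product-title > a")`, and `load_more` could scroll the given element into view and then wait for the spinner to appear and disappear, as in the code above.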

Regarding "javascript - Selenium : scraping a page till all the products loaded", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/73430414/
