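My goal is to get a list of the names of all new items posted on https://www.prusaprinters.org/prints within the 24 hours of a given day.
From some reading, I learned that I should use Selenium, because the site I am scraping is dynamic (it loads more objects as the user scrolls).
The problem is that I can't seem to get anything but an empty list from webdriver.find_elements_by_*, using any of the suffixes listed at https://selenium-python.readthedocs.io/locating-elements.html.
On the site, when I inspect the element whose title I want, I see "class = name" and "class = clamp-two-lines" (see screenshot), but I can't seem to return a list of all elements on the page that have either the name class or the clamp-two-lines class.
Here is my code so far (the commented-out lines are failed attempts):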
from timeit import default_timer as timer
start_time = timer()
print("Script Started")
import bs4, selenium, smtplib, time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Chrome(r'D:\PortableApps\Python Peripherals\chromedriver.exe')
url = 'https://www.prusaprinters.org/prints'
driver.get(url)
# foo = driver.find_elements_by_name('name')
# foo = driver.find_elements_by_xpath('name')
# foo = driver.find_elements_by_class_name('name')
# foo = driver.find_elements_by_tag_name('name')
# foo = [i.get_attribute('href') for i in driver.find_elements_by_css_selector('[id*=name]')]
# foo = [i.get_attribute('href') for i in driver.find_elements_by_css_selector('[class*=name]')]
# foo = [i.get_attribute('href') for i in driver.find_elements_by_css_selector('[id*=clamp-two-lines]')]
# foo = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, '//*[@id="printListOuter"]//ul[@class="clamp-two-lines"]/li')))
print(foo)
driver.quit()
print("Time to run: " + str(round(timer() - start_time,4)) + "s")
My research:
- Selenium only returns an empty list
- Selenium find_elements_by_css_selector returns an empty list
- Web Scraping Python (BeautifulSoup,Requests)
- Get HTML Source of WebElement in Selenium WebDriver using Python
- How to get Inspect Element code in Selenium WebDriver
- https://chrisalbon.com/python/web_scraping/monitor_a_website/
- https://www.codementor.io/@gergelykovcs/how-and-why-i-built-a-simple-web-scrapig-script-to-notify-us-about-our-favourite-food-fcrhuhn45
- https://www.tutorialspoint.com/python_web_scraping/python_web_scraping_dynamic_websites.htm
Best answer
To get the text, wait for the elements to become visible. The CSS selector for the titles is #printListOuter h3:
titles = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, '#printListOuter h3')))
for title in titles:
    print(title.text)
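The explicit wait matters because driver.get() returns before the page's JavaScript has rendered the list, which is presumably why the immediate find_elements_by_* calls in the question come back empty. The sketch below wraps the same wait in a function. fetch_titles and visible_texts are hypothetical helper names, not part of Selenium, and the selector is the one given above. Note also that Selenium 4 removed the find_elements_by_* helpers in favour of find_elements(By.…, …), so the By-based form is the one that works on current versions.

```python
def visible_texts(elements):
    """Return the stripped, non-empty text of each element-like object."""
    texts = (element.text.strip() for element in elements)
    return [text for text in texts if text]


def fetch_titles(driver, url, selector="#printListOuter h3", timeout=10):
    """Open url, wait until the elements matching selector are visible, return their text.

    fetch_titles is a hypothetical helper; the default selector is the one from the answer.
    """
    # Imported here so visible_texts() stays usable without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get(url)
    # Polls until every matching element is visible, or raises TimeoutException.
    elements = WebDriverWait(driver, timeout).until(
        EC.visibility_of_all_elements_located((By.CSS_SELECTOR, selector))
    )
    return visible_texts(elements)
```

With the driver from the question, this could be called as titles = fetch_titles(driver, 'https://www.prusaprinters.org/prints'). If nothing becomes visible within the timeout, the wait raises selenium.common.exceptions.TimeoutException instead of returning an empty list.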
A shorter version:
wait = WebDriverWait(driver, 10)
titles = [title.text for title in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, '#printListOuter h3')))]
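Because the page loads more items as the user scrolls (as noted in the question), a single wait only sees the first batch. One possible way to collect more is to scroll to the bottom repeatedly until the number of matching elements stops growing. This is a sketch under assumptions: scroll_until_stable is a hypothetical helper, the fixed pause is a crude stand-in for a proper wait, and whether this site keeps loading on window scroll has not been verified here.

```python
import time


def scroll_until_stable(driver, locator, pause=1.0, max_rounds=20):
    """Scroll to the page bottom until the count of elements matching locator stops growing.

    locator is a (by, value) tuple such as (By.CSS_SELECTOR, '#printListOuter h3').
    Returns the final list of matching elements.
    """
    seen = len(driver.find_elements(*locator))
    for _ in range(max_rounds):
        # Jump to the bottom so the page can request its next batch of items.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # crude pause while the new items load
        count = len(driver.find_elements(*locator))
        if count == seen:
            break  # nothing new appeared, assume the list is complete
        seen = count
    return driver.find_elements(*locator)
```

After the first wait has succeeded, this could be used as elements = scroll_until_stable(driver, (By.CSS_SELECTOR, '#printListOuter h3')) followed by [e.text for e in elements]. The max_rounds cap keeps the loop from running indefinitely on a list that never ends.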
Regarding "python - Selenium webdriver returns empty list from find_elements_by_X", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/59868524/