python-3.x - How to extract the comment count of a YouTube video using Google Chrome Headless and Selenium?

Tags: python-3.x selenium xpath youtube css-selectors

Every YouTube watch page has an element that displays the video's comment count.
Its HTML structure looks like this:

<yt-formatted-string class="count-text style-scope ytd-comments-header-renderer">xx Comments</yt-formatted-string>
I want to get the text xx Comments with Selenium.
code1 - with a headed browser
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time
options = webdriver.ChromeOptions()
proxy = '127.0.0.1:1080'   
options.add_argument('--proxy-server=socks5://' + proxy)
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options)
wait = WebDriverWait(driver,30)
url='https://www.youtube.com/watch?v=N0lxfilGfak'

driver.get(url)
driver.execute_script("return scrollBy(0, 1000);")
comment = WebDriverWait(driver, 60).until(EC.visibility_of_element_located((By.XPATH, "//yt-formatted-string[contains(., 'Comments')]")))
driver.execute_script("arguments[0].scrollIntoView(true);",comment)
print(driver.find_element_by_xpath("//h2[@id='count']").text)
With the Python code above, I get 717 Comments for https://www.youtube.com/watch?v=N0lxfilGfak.
Now I want to get the same number using a headless browser in Selenium.
code2 - with a headless browser
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time
options = webdriver.ChromeOptions()
proxy = '127.0.0.1:1080'   
options.add_argument('--proxy-server=socks5://' + proxy)
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument("--headless")
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options)
wait = WebDriverWait(driver,30)
url='https://www.youtube.com/watch?v=N0lxfilGfak'

driver.get(url)
driver.execute_script("return scrollBy(0, 1000);")
comment = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//yt-formatted-string[contains(., 'Comments')]")))
driver.execute_script("arguments[0].scrollIntoView(true);",comment)
print(driver.find_element_by_xpath("//h2[@id='count']").text)
Note: code2 has three more lines than code1:
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument("--headless")
The other lines in code2 and code1 are the same.
When I run code2, it gets stuck at the comment statement:
>>> comment = WebDriverWait(driver, 60).until(EC.visibility_of_element_located((By.XPATH, "//yt-formatted-string[contains(., 'Comments')]")))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/selenium/webdriver/support/wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message: 
Why can't the element be found when Selenium drives a headless browser?

Best answer

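You were close. To print the text xx Comments using Selenium driving a ChromeDriver-initiated browsing context, you have to induce WebDriverWait for visibility_of_element_located(). You can use either of the following Locator Strategies: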

  • Using XPATH and the text attribute:
    driver.get("https://www.youtube.com/watch?v=N0lxfilGfak")
    driver.execute_script("return scrollBy(0, 1000);")
    subscribe = WebDriverWait(driver, 60).until(EC.visibility_of_element_located((By.XPATH, "//yt-formatted-string[text()='Subscribe']")))
    driver.execute_script("arguments[0].scrollIntoView(true);",subscribe)
    print(WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH,"//h2[@id='count']/yt-formatted-string"))).text)
    
  • Using CSS_SELECTOR and get_attribute():
    driver.get("https://www.youtube.com/watch?v=N0lxfilGfak")
    driver.execute_script("return scrollBy(0, 1000);")
    subscribe = WebDriverWait(driver, 60).until(EC.visibility_of_element_located((By.XPATH, "//yt-formatted-string[text()='Subscribe']")))
    driver.execute_script("arguments[0].scrollIntoView(true);",subscribe)
    print(WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR,"h2#count>yt-formatted-string"))).get_attribute("innerHTML"))
    
  • Console output:
    717 Comments
    
  • Note: you have to add the following imports:
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
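
The XPath used above can be sanity-checked offline, and the printed label can be turned into an integer. The following is a minimal sketch using only the standard library, not part of the original answer. It assumes the markup from the question sits inside the h2 with id "count" that the locators target; the outer div, the MARKUP constant, and the helper names comment_label and parse_comment_count are hypothetical, and the comma handling assumes an English label such as "1,234 Comments".

```python
import re
import xml.etree.ElementTree as ET

# Static copy of the element from the question, placed under the
# <h2 id="count"> parent that the answer's locators target.
# The outer <div> is an assumption added only to form a complete tree.
MARKUP = """
<div>
  <h2 id="count">
    <yt-formatted-string class="count-text style-scope ytd-comments-header-renderer">717 Comments</yt-formatted-string>
  </h2>
</div>
"""


def comment_label(markup):
    """Return the text of the comment-count element, or None if absent."""
    root = ET.fromstring(markup)
    # ElementTree implements only a subset of XPath; this path is within it.
    node = root.find(".//h2[@id='count']/yt-formatted-string")
    if node is None or not node.text:
        return None
    return node.text.strip()


def parse_comment_count(label):
    """Convert a label such as '717 Comments' to an int."""
    match = re.search(r"\d[\d,]*", label)
    if match is None:
        raise ValueError("no number found in {!r}".format(label))
    # Drop thousands separators before converting.
    return int(match.group(0).replace(",", ""))


label = comment_label(MARKUP)
print(label)
print(parse_comment_count(label))
```

With Selenium, the string passed to parse_comment_count would be the .text value returned by the wait shown above rather than the static markup.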
    

  • Using headless Chrome
    To use headless Chrome, you can use the following solution:
  • Code block:
    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    
    options = webdriver.ChromeOptions() 
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    options.add_argument('--headless')
    options.add_argument('--window-size=1920,1080')
    driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
    driver.get("https://www.youtube.com/watch?v=N0lxfilGfak")
    driver.execute_script("return scrollBy(0, 1000);")
    subscribe = WebDriverWait(driver, 60).until(EC.visibility_of_element_located((By.XPATH, "//yt-formatted-string[text()='Subscribe']")))
    driver.execute_script("arguments[0].scrollIntoView(true);",subscribe)
    # using xpath and text attribute
    print(WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH,"//h2[@id='count']/yt-formatted-string"))).text)
    # using cssSelector and get_attribute()
    print(WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR,"h2#count>yt-formatted-string"))).get_attribute("innerHTML"))
    print("Exiting")
    driver.quit()
    
  • Console output:
    717 Comments
    717 Comments
    Exiting
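
One notable difference between the failing code2 and the working block above is the explicit --window-size=1920,1080 argument in place of start-maximized. The answer does not state why this matters. A plausible explanation, offered here as an assumption, is that headless Chrome starts with a small default viewport and does not honor start-maximized, so the page lays out differently and the awaited element never becomes visible. The sketch below collects the two headless switches from the answer in a hypothetical helper, headless_chrome_args, so the viewport size is always set together with --headless.

```python
def headless_chrome_args(width=1920, height=1080):
    """Build Chrome switches for a headless run with an explicit viewport."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return [
        "--headless",
        "--window-size={},{}".format(width, height),
    ]


args = headless_chrome_args()
print(args)

# Hypothetical usage with Selenium (not executed here):
#   options = webdriver.ChromeOptions()
#   for arg in headless_chrome_args():
#       options.add_argument(arg)
```

If the timeout persists, saving a screenshot with driver.save_screenshot() just before the wait may show what the headless page actually rendered.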
    
  • Source: this question and answer were found on Stack Overflow: https://stackoverflow.com/questions/62857213/
