python - Looping through links with Selenium + Python and scraping data from the resulting pages

I am new to Selenium and need to scrape a website that contains a list of links structured exactly like this:

<a class="unique" href="...">
    <i class="something"></i>
    "Text - "
    <span class="something">Text</span>
</a>
<a class="unique" href="...">
    <i class="something"></i>
    "Text - "
    <span class="something">Text</span>
</a>
...
...

I need to click through this list of links in a loop and scrape data from each resulting page. This is what I have so far:

lists = browser.find_elements_by_xpath("//a[@class='unique']")
for lis in lists:
    print(lis.text)
    lis.click()
    time.sleep(4)
    # Scrape data from this page (works fine).
    browser.back()
    time.sleep(4)

It works fine on the first iteration, but on the second iteration the line

print(lis.text)

raises this error:

StaleElementReferenceException: Message: stale element reference: element is not attached to the page document

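I tried print(lists) and it shows the full list of link elements, so that part works. The problem appears once the browser returns to the previous page. I tried a longer sleep and using browser.get(...) instead of browser.back(), but the error persists. I don't understand why lis.text fails when lists still contains all the elements. Any help would be appreciated.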

Best answer

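The elements in lists belong to the page as it was when they were located. Once lis.click() loads a new page and browser.back() returns, the browser has rebuilt the document, so the old references no longer point to attached elements. Any access to them, including lis.text, then raises StaleElementReferenceException, even though the Python list still holds the same objects.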

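Clicking each link, scraping the data, and navigating back is also inefficient. Instead, store every link's href in a list first, then open each one with browser.get(link) and scrape it. To avoid the exception, try the following modified code: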

import time

from selenium.webdriver.common.by import By

# Locate the anchor nodes first and load all the elements into a list
lists = browser.find_elements(By.XPATH, "//a[@class='unique']")
# Collect the URLs while the element references are still valid
links = []
for lis in lists:
    href = lis.get_attribute('href')
    print(href)
    links.append(href)

# Loop through all the links and open them one by one
for link in links:
    browser.get(link)
    # Scrape here
    time.sleep(3)
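
The same idea can be wrapped in two small helpers. This is only a sketch: collect_hrefs and scrape_links are hypothetical names, scrape stands for whatever page-specific extraction you already have, and driver is assumed to be any object that behaves like a Selenium WebDriver (find_elements, get). Duplicate and empty href values are skipped, which the loop above does not do.

```python
from typing import Callable, List


def collect_hrefs(driver, xpath: str) -> List[str]:
    """Return the unique, non-empty href values of all elements matching xpath."""
    hrefs: List[str] = []
    # "xpath" is the string value of selenium's By.XPATH locator strategy.
    for element in driver.find_elements("xpath", xpath):
        href = element.get_attribute("href")
        if href and href not in hrefs:
            hrefs.append(href)
    return hrefs


def scrape_links(driver, hrefs: List[str], scrape: Callable) -> list:
    """Load each URL with driver.get() and collect whatever scrape(driver) returns."""
    results = []
    for href in hrefs:
        # Navigating by URL never touches the old element references.
        driver.get(href)
        results.append(scrape(driver))
    return results
```

With a live session this might be called as links = collect_hrefs(browser, "//a[@class='unique']") followed by scrape_links(browser, links, my_scraper), where my_scraper is a placeholder for your own function.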

Alternatively, if you want to keep the same click-and-go-back logic, you can use a fluent wait to avoid exceptions such as StaleElementReferenceException, as shown below:

from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Poll once per second for up to 10 seconds, ignoring stale-element errors
wait = WebDriverWait(browser, 10, poll_frequency=1, ignored_exceptions=[StaleElementReferenceException])
element = wait.until(EC.element_to_be_clickable((By.XPATH, "xPath that you want to click")))
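
If you would rather keep clicking and going back, one common pattern is to look the links up again on every pass instead of reusing the first list. The sketch below assumes the list page shows the same links in the same order each time it is reloaded. click_each_and_go_back and scrape are hypothetical names, and driver is any WebDriver-like object. It leaves out waiting: in a real browser you would normally wait (for example with a WebDriverWait like the one above) after click() and after back() before touching the page.

```python
from typing import Callable


def click_each_and_go_back(driver, xpath: str, scrape: Callable) -> list:
    """Click every matching link in turn, scrape the result page, then go back.

    Elements are located again on every pass, so no reference is reused
    after a navigation that would have made it stale.
    """
    results = []
    # "xpath" is the string value of selenium's By.XPATH locator strategy.
    total = len(driver.find_elements("xpath", xpath))
    for index in range(total):
        # Fresh lookup: references from the previous pass are stale after back().
        links = driver.find_elements("xpath", xpath)
        if index >= len(links):
            break  # the list page now has fewer links than on the first pass
        links[index].click()
        results.append(scrape(driver))
        driver.back()
    return results
```

This costs one extra lookup per link, so collecting the href values up front is usually simpler when the links are plain URLs.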

Hope this helps.

Source: a similar question on Stack Overflow, https://stackoverflow.com/questions/54590058/