java - Selenium throws StaleElementReferenceException

Tags: java selenium selenium-webdriver web-crawler staleelementreferenceexception

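I am trying to build a web crawler with Selenium. My program throws a StaleElementReferenceException. I think this happens because I crawl the pages recursively: when a page has no more links, the function navigates to the next page without first navigating back to the parent page.

So I introduced a tree data structure and navigate back to the parent whenever the current URL differs from the parent URL. This did not solve my problem.

Can anyone help me?

Code: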

public class crawler {
    private static FirefoxDriver driver;
    private static String main_url = "https://robhammond.co/tools/seo-crawler";
    private static List<String> uniqueLinks = new ArrayList<String>();

    public static void main(String[] args) {
        driver = new FirefoxDriver();

        Node<String> root = new Node<>(main_url);

        scrape(root, main_url);
    }

    public static void scrape(Node<String> node, String url) {
        if(node.getParent() != null && (!driver.getCurrentUrl().equals(node.getParent().getData()))) {
            driver.navigate().to(node.getParent().getData());
        }

        driver.navigate().to(url);

        List<WebElement> allLinks = driver.findElements(By.tagName("a"));

        for(WebElement link : allLinks) {
            if(link.getAttribute("href").contains(main_url) && !uniqueLinks.contains(link.getAttribute("href")) && link.isDisplayed()) {
                uniqueLinks.add(link.getAttribute("href"));

                System.out.println(link.getAttribute("href"));

                scrape(new Node<>(link.getAttribute("href")), link.getAttribute("href"));
            }
        }
    }
}

This is the console output:

D:\Programme\openjdk-12.0.1_windows-x64_bin\jdk-12.0.1\bin\java.exe "-javaagent:D:\Programme\JetBrains\IntelliJ IDEA 2019.1.2\lib\idea_rt.jar=60461:D:\Programme\JetBrains\IntelliJ IDEA 2019.1.2\bin" -Dfile.encoding=UTF-8 -classpath C:\Users\admin\Desktop\SeleniumWebScraper\out\production\SeleniumWebScraper;D:\Downloads\selenium-server-standalone-3.141.59.jar de.company.crawler.crawler
1557924446770   mozrunner::runner   INFO    Running command: "C:\\Program Files\\Mozilla Firefox\\firefox.exe" "-marionette" "-foreground" "-no-remote" "-profile" "C:\\Users\\admin\\AppData\\Local\\Temp\\rust_mozprofile.YqmEqE8y1pjv"
1557924447037   addons.webextension.screenshots@mozilla.org WARN    Loading extension 'screenshots@mozilla.org': Reading manifest: Invalid extension permission: mozillaAddons
1557924447037   addons.webextension.screenshots@mozilla.org WARN    Loading extension 'screenshots@mozilla.org': Reading manifest: Invalid extension permission: resource://pdf.js/
1557924447037   addons.webextension.screenshots@mozilla.org WARN    Loading extension 'screenshots@mozilla.org': Reading manifest: Invalid extension permission: about:reader*
1557924448047   Marionette  INFO    Listening on port 60468
1557924448383   Marionette  WARN    TLS certificate errors will be ignored for this session
Mai 15, 2019 2:47:28 NACHM. org.openqa.selenium.remote.ProtocolHandshake createSession
INFO: Detected dialect: W3C
JavaScript warning: https://robhammond.co/js/jquery.min.js, line 4: Using //@ to indicate sourceMappingURL pragmas is deprecated. Use //# instead
https://robhammond.co/tools/seo-crawler#content
https://twitter.com/intent/tweet?text=SEO%20Crawler&url=https://robhammond.co/tools/seo-crawler&via=robhammond
Exception in thread "main" org.openqa.selenium.StaleElementReferenceException: The element reference of <a href="/tools/"> is stale; either the element is no longer attached to the DOM, it is not in the current frame context, or the document has been refreshed
For documentation on this error, please visit: https://www.seleniumhq.org/exceptions/stale_element_reference.html
Build info: version: '3.141.59', revision: 'e82be7d358', time: '2018-11-14T08:25:53'
System info: host: 'DESKTOP-admin', ip: '192.168.233.1', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '12.0.1'
Driver info: org.openqa.selenium.firefox.FirefoxDriver
Capabilities {acceptInsecureCerts: true, browserName: firefox, browserVersion: 66.0.5, javascriptEnabled: true, moz:accessibilityChecks: false, moz:geckodriverVersion: 0.24.0, moz:headless: false, moz:processID: 19124, moz:profile: C:\Users\admin\AppData\Loca..., moz:shutdownTimeout: 60000, moz:useNonSpecCompliantPointerOrigin: false, moz:webdriverClick: true, pageLoadStrategy: normal, platform: WINDOWS, platformName: WINDOWS, platformVersion: 10.0, rotatable: false, setWindowRect: true, strictFileInteractability: false, timeouts: {implicit: 0, pageLoad: 300000, script: 30000}, unhandledPromptBehavior: dismiss and notify}
Session ID: b3b87675-57c8-4b48-9a20-8df5e4d37503
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:500)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:481)
    at org.openqa.selenium.remote.http.W3CHttpResponseCodec.createException(W3CHttpResponseCodec.java:187)
    at org.openqa.selenium.remote.http.W3CHttpResponseCodec.decode(W3CHttpResponseCodec.java:122)
    at org.openqa.selenium.remote.http.W3CHttpResponseCodec.decode(W3CHttpResponseCodec.java:49)
    at org.openqa.selenium.remote.HttpCommandExecutor.execute(HttpCommandExecutor.java:158)
    at org.openqa.selenium.remote.service.DriverCommandExecutor.execute(DriverCommandExecutor.java:83)
    at org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:552)
    at org.openqa.selenium.remote.RemoteWebElement.execute(RemoteWebElement.java:285)
    at org.openqa.selenium.remote.RemoteWebElement.getAttribute(RemoteWebElement.java:134)
    at de.company.crawler.crawler.scrape(crawler.java:33)
    at de.company.crawler.crawler.scrape(crawler.java:38)
    at de.company.crawler.crawler.main(crawler.java:20)

Process finished with exit code 1

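Best answer

  1. As soon as you navigate away from the first page, all the WebElements in the allLinks list become stale: they refer to a DOM that no longer exists, so the next call to link.getAttribute("href") throws.

    I suggest converting the list of WebElements into a list of plain Strings, like this: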

    List<String> allLinksHrefs = allLinks.stream().map(link -> link.getAttribute("href")).collect(Collectors.toList());
    

    and then iterating over this new allLinksHrefs list.

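  2. You can use a hash-based collection such as HashSet to hold uniqueLinks; duplicates are then eliminated automatically.
  3. The current approach may take days to complete; consider using Selenium Grid and running your scraper in parallel.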
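
Points 1 and 2 can be combined into one loop. The following is a minimal sketch, not the asker's code: the class HrefCrawler, the method crawl, and the fetchHrefs hook are hypothetical names introduced for illustration. The browser is hidden behind fetchHrefs so that the crawl logic runs without Selenium; the comment inside shows how a Selenium-backed version might look, assuming a WebDriver named driver. Unlike the original, the result set also contains the start URL, and null hrefs are skipped.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class HrefCrawler {

    /**
     * Crawls from startUrl and follows only hrefs that contain startUrl.
     *
     * fetchHrefs (hypothetical hook) loads a page and returns its hrefs as
     * plain Strings. With Selenium it could look like this, assuming a
     * WebDriver named driver:
     *
     *   url -> {
     *       driver.navigate().to(url);
     *       return driver.findElements(By.tagName("a")).stream()
     *               .map(link -> link.getAttribute("href"))
     *               .collect(Collectors.toList());
     *   }
     *
     * The hrefs are read before the next navigation, so no WebElement is
     * touched after its page has been left.
     */
    public static Set<String> crawl(String startUrl,
                                    Function<String, List<String>> fetchHrefs) {
        Set<String> visited = new LinkedHashSet<>(); // hash-based, so duplicates are dropped
        Deque<String> pending = new ArrayDeque<>();  // explicit stack instead of recursion
        pending.push(startUrl);

        while (!pending.isEmpty()) {
            String url = pending.pop();
            if (!visited.add(url)) {
                continue; // add() returned false: this URL was already crawled
            }
            for (String href : fetchHrefs.apply(url)) {
                // getAttribute("href") may return null for anchors without an href
                if (href != null && href.contains(startUrl) && !visited.contains(href)) {
                    pending.push(href);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Stand-in for a real site: a fixed map from URL to the hrefs on that page.
        Map<String, List<String>> fakeSite = Map.of(
                "https://example.test/", List.of("https://example.test/a", "https://elsewhere.test/"),
                "https://example.test/a", List.of("https://example.test/"));

        Set<String> found = crawl("https://example.test/",
                url -> fakeSite.getOrDefault(url, List.of()));
        found.forEach(System.out::println);
    }
}
```

Because only Strings survive a navigation, the tree of parent nodes and the extra "navigate back to the parent" step are not needed in this sketch.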

Regarding "java - Selenium throws StaleElementReferenceException", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/56150033/
