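java - HtmlUnit WebClient timeout

Tags: java multithreading timeout web-scraping htmlunit

In my previous questions about HtmlUnit, "Skip particular Javascript execution in HTML unit" and "Fetch Page source using HtmlUnit : URL got stuck",

I mentioned that a URL gets stuck. I also found that it hangs because one of the methods in the HtmlUnit library (parse) never finishes executing.

I did some further work on this. I wrote code that exits the method if it takes longer than the specified timeout in seconds.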

import java.io.IOException;
import java.net.MalformedURLException;
import java.util.Date;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HandleHtmlUnitTimeout {

public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException, InterruptedException, TimeoutException 
    {   
        Date start = new Date();
        String url = "http://ericaweiner.com/collections/";
        doWorkWithTimeout(url, 60);
    }

public static void doWorkWithTimeout(final String url, long timeoutSecs) throws InterruptedException, TimeoutException {
    //maintains a thread for executing the doWork method
    ExecutorService executor = Executors.newFixedThreadPool(1);
    //logger.info("Starting method with "+timeoutSecs+" seconds as timeout");
    //set the executor thread working

    final Future<?> future = executor.submit(new Runnable() {
        public void run() 
            {
            try 
                {
                getPageSource(url);
                }
            catch (Exception e) 
                {
                throw new RuntimeException(e);
                }
        }
    });

    //check the outcome of the executor thread and limit the time allowed for it to complete
    try {
        future.get(timeoutSecs, TimeUnit.SECONDS);
    } catch (Exception e) {
        //ExecutionException: deliverer threw exception
        //TimeoutException: didn't complete within downloadTimeoutSecs
        //InterruptedException: the executor thread was interrupted

        //interrupts the worker thread if necessary
        future.cancel(true);

        //logger.warn("encountered problem while doing some work", e);
        throw new TimeoutException();
    }finally{ 
    executor.shutdownNow();
    }
}

public static void getPageSource(String productPageUrl)
    {
    try {
    if(productPageUrl == null)
        {
        productPageUrl = "http://ericaweiner.com/collections/";
        }   

        WebClient wb = new WebClient(BrowserVersion.FIREFOX_3_6);
        wb.getOptions().setTimeout(120000);
        wb.getOptions().setJavaScriptEnabled(true);
        wb.getOptions().setThrowExceptionOnScriptError(true);
        wb.getOptions().setThrowExceptionOnFailingStatusCode(false);
        HtmlPage page = wb.getPage(productPageUrl);
        wb.waitForBackgroundJavaScript(4000);
        wb.closeAllWindows();
} 
catch (FailingHttpStatusCodeException e) 
    {
    e.printStackTrace();
    } 
catch (MalformedURLException e) 
    {
    e.printStackTrace();
    } 
catch (IOException e) 
    {
    e.printStackTrace();
    }
    }

}

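This code does return from the doWorkWithTimeout(url, 60); call, but the program does not terminate.

I also tried calling a similar implementation with the following code: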

import java.util.Date;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

import org.apache.log4j.Logger;


public class HandleScraperTimeOut {

private static Logger logger = Logger.getLogger(HandleScraperTimeOut .class);


public void doWork() throws InterruptedException {
    logger.info(new Date()+ "Starting worker method ");
    Thread.sleep(20000);
    logger.info(new Date()+ "Ending worker method ");
    //perform some long running task here...
}

public void doWorkWithTimeout(int timeoutSecs) {
    //maintains a thread for executing the doWork method
    ExecutorService executor = Executors.newFixedThreadPool(1);
    logger.info("Starting method with "+timeoutSecs+" seconds as timeout");
    //set the executor thread working

    final Future<?> future = executor.submit(new Runnable() {
        public void run() 
            {
            try 
                {
                doWork();
                }
            catch (Exception e) 
                {
                throw new RuntimeException(e);
                }
        }
    });

    //check the outcome of the executor thread and limit the time allowed for it to complete
    try {
        future.get(timeoutSecs, TimeUnit.SECONDS);
    } catch (Exception e) {
        //ExecutionException: deliverer threw exception
        //TimeoutException: didn't complete within downloadTimeoutSecs
        //InterruptedException: the executor thread was interrupted

        //interrupts the worker thread if necessary
        future.cancel(true);

        logger.warn("encountered problem while doing some work", e);
    }
    executor.shutdown();
}

public static void main(String a[])
    {
        HandleScraperTimeOut hcto = new HandleScraperTimeOut ();
        hcto.doWorkWithTimeout(30);

    }

}

It would be very helpful if someone could take a look and tell me what the problem is.

For more details about the problem, see "Skip particular Javascript execution in HTML unit" and "Fetch Page source using HtmlUnit : URL got stuck".

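Update 1: Strangely, future.cancel(true); returns TRUE in both cases. What I expected was:

  • With HtmlUnit, it should return FALSE, because the process is still hanging.
  • With the plain Thread.sleep(); it should return TRUE, because the process was cancelled successfully.

Update 2: It hangs only with the http://ericaweiner.com/collections/ URL. If I supply any other URL, e.g. http://www.google.com or http://www.yahoo.com, it does not hang. In those cases it throws an InterruptedException and exits the process.

It seems that something in the page source of http://ericaweiner.com/collections/ causes the problem.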

Best answer

Future.cancel(boolean) returns:

  • false if the task could not be cancelled, typically because it has already completed normally
  • true otherwise
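The two return values above can be checked with a small experiment. This is only a minimal sketch: the class name CancelReturnValueDemo and its helper methods are made up for illustration and have nothing to do with HtmlUnit.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class CancelReturnValueDemo {

    // Reports what cancel(true) returns for a task that has already finished.
    static boolean cancelFinishedTask() throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<?> future = executor.submit(() -> { /* finishes immediately */ });
            future.get(5, TimeUnit.SECONDS); // wait until the task has completed
            return future.cancel(true);      // too late, nothing left to cancel
        } finally {
            executor.shutdownNow();
        }
    }

    // Reports what cancel(true) returns for a task that has not completed yet.
    static boolean cancelPendingTask() throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<?> future = executor.submit(() -> {
                try {
                    Thread.sleep(60_000); // long-running but interruptible work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // restore the flag and exit
                }
            });
            return future.cancel(true);
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("finished task: " + cancelFinishedTask()); // false
        System.out.println("pending task:  " + cancelPendingTask());  // true
    }
}
```

Note that the second result says nothing about whether the worker thread has actually stopped, only that the task was marked as cancelled.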

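Cancelled means that the thread had not finished before the cancel, that the cancelled flag is set to true and that, if requested, the thread is interrupted.

Interrupting a thread means that Thread.interrupt is called on it, and nothing more. Future.cancel(boolean) does not check whether the thread has actually stopped.

So it is correct that cancel returns true in this case.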
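To make this concrete, here is a rough sketch of a task that never looks at its interrupt flag, used as a hypothetical stand-in for a library call that hangs. The class name, the stop flag and the 200 ms wait are assumptions made for the demo; it does not reproduce what HtmlUnit does internally.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class CancelIgnoredDemo {

    /** Returns { result of cancel(true), whether the task was still running afterwards }. */
    static boolean[] cancelStubbornTask() throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        AtomicBoolean stop = new AtomicBoolean(false);       // demo-only escape hatch
        AtomicBoolean taskExited = new AtomicBoolean(false);

        Future<?> future = executor.submit(() -> {
            started.countDown();
            // Busy loop that never checks the interrupt flag.
            while (!stop.get()) {
                // simulated stuck work
            }
            taskExited.set(true);
        });

        started.await();                           // make sure the task is running
        boolean cancelled = future.cancel(true);   // calls Thread.interrupt on the worker
        Thread.sleep(200);                         // give the interrupt time to take effect
        boolean stillRunning = !taskExited.get();  // the loop above has not noticed

        stop.set(true);                            // release the worker so the JVM can exit
        executor.shutdown();
        executor.awaitTermination(5, TimeUnit.SECONDS);
        return new boolean[] { cancelled, stillRunning };
    }

    public static void main(String[] args) throws Exception {
        boolean[] result = cancelStubbornTask();
        System.out.println("cancel returned: " + result[0]);
        System.out.println("task still running after cancel: " + result[1]);
    }
}
```

Both values come out true: the Future is marked as cancelled while the worker keeps going until something else makes it stop.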

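Interrupting a thread means that it should stop as soon as possible, but this is not enforced. You can try to make it stop or fail by closing a resource it needs, or something similar. I usually do this with a thread that reads from a socket (waiting for incoming data): I close the socket, which makes it stop waiting.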
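The socket trick could look roughly like the sketch below. It is a generic illustration on a loopback connection, with made-up names (CloseToUnblockDemo, closeUnblocksReader), and it is not specific to HtmlUnit; whether a comparable resource can be closed there is not covered here.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class CloseToUnblockDemo {

    /** Returns true if closing the socket made the blocked reader thread exit. */
    static boolean closeUnblocksReader() throws Exception {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
            Socket peer = server.accept(); // never sends anything, so the reader blocks

            Thread reader = new Thread(() -> {
                try {
                    client.getInputStream().read(); // blocks: no data ever arrives
                } catch (IOException e) {
                    // Expected path: the socket was closed from another thread.
                }
            });
            reader.start();

            Thread.sleep(200);   // let the reader settle into read()
            client.close();      // the blocked read() now fails with a SocketException
            reader.join(5000);   // wait for the reader to finish

            peer.close();
            return !reader.isAlive();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("reader stopped after close: " + closeUnblocksReader());
    }
}
```

The point of the design is that the blocked thread is stopped through the resource it is waiting on rather than through its interrupt flag.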

Regarding java - HtmlUnit WebClient timeout, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/14559746/
