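In my previous questions about HtmlUnit, "Skip particular Javascript execution in HTML unit" and "Fetch Page source using HtmlUnit : URL got stuck",
I mentioned that a URL was getting stuck. I also found that it gets stuck because one of the methods in the HtmlUnit library (parse) never finishes executing.
I have since done further work on this. I wrote code that exits the method if it takes longer than a specified number of timeout seconds to complete.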
import java.io.IOException;
import java.net.MalformedURLException;
import java.util.Date;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HandleHtmlUnitTimeout {

    public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException, InterruptedException, TimeoutException
    {
        Date start = new Date();
        String url = "http://ericaweiner.com/collections/";
        doWorkWithTimeout(url, 60);
    }

    public static void doWorkWithTimeout(final String url, long timeoutSecs) throws InterruptedException, TimeoutException {
        //maintains a thread for executing the doWork method
        ExecutorService executor = Executors.newFixedThreadPool(1);
        //logger.info("Starting method with "+timeoutSecs+" seconds as timeout");
        //set the executor thread working
        final Future<?> future = executor.submit(new Runnable() {
            public void run()
            {
                try
                {
                    getPageSource(url);
                }
                catch (Exception e)
                {
                    throw new RuntimeException(e);
                }
            }
        });
        //check the outcome of the executor thread and limit the time allowed for it to complete
        try {
            future.get(timeoutSecs, TimeUnit.SECONDS);
        } catch (Exception e) {
            //ExecutionException: deliverer threw exception
            //TimeoutException: didn't complete within downloadTimeoutSecs
            //InterruptedException: the executor thread was interrupted
            //interrupts the worker thread if necessary
            future.cancel(true);
            //logger.warn("encountered problem while doing some work", e);
            throw new TimeoutException();
        } finally {
            executor.shutdownNow();
        }
    }

    public static void getPageSource(String productPageUrl)
    {
        try {
            if (productPageUrl == null)
            {
                productPageUrl = "http://ericaweiner.com/collections/";
            }
            WebClient wb = new WebClient(BrowserVersion.FIREFOX_3_6);
            wb.getOptions().setTimeout(120000);
            wb.getOptions().setJavaScriptEnabled(true);
            wb.getOptions().setThrowExceptionOnScriptError(true);
            wb.getOptions().setThrowExceptionOnFailingStatusCode(false);
            HtmlPage page = wb.getPage(productPageUrl);
            wb.waitForBackgroundJavaScript(4000);
            wb.closeAllWindows();
        }
        catch (FailingHttpStatusCodeException e)
        {
            e.printStackTrace();
        }
        catch (MalformedURLException e)
        {
            e.printStackTrace();
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}
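This code does come out of the doWorkWithTimeout(url, 60); method, but the program does not terminate.
By contrast, when I call a similar implementation with the following code: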
import java.util.Date;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

import org.apache.log4j.Logger;

public class HandleScraperTimeOut {

    private static Logger logger = Logger.getLogger(HandleScraperTimeOut.class);

    public void doWork() throws InterruptedException {
        logger.info(new Date() + "Starting worker method ");
        Thread.sleep(20000);
        logger.info(new Date() + "Ending worker method ");
        //perform some long running task here...
    }

    public void doWorkWithTimeout(int timeoutSecs) {
        //maintains a thread for executing the doWork method
        ExecutorService executor = Executors.newFixedThreadPool(1);
        logger.info("Starting method with " + timeoutSecs + " seconds as timeout");
        //set the executor thread working
        final Future<?> future = executor.submit(new Runnable() {
            public void run()
            {
                try
                {
                    doWork();
                }
                catch (Exception e)
                {
                    throw new RuntimeException(e);
                }
            }
        });
        //check the outcome of the executor thread and limit the time allowed for it to complete
        try {
            future.get(timeoutSecs, TimeUnit.SECONDS);
        } catch (Exception e) {
            //ExecutionException: deliverer threw exception
            //TimeoutException: didn't complete within downloadTimeoutSecs
            //InterruptedException: the executor thread was interrupted
            //interrupts the worker thread if necessary
            future.cancel(true);
            logger.warn("encountered problem while doing some work", e);
        }
        executor.shutdown();
    }

    public static void main(String a[])
    {
        HandleScraperTimeOut hcto = new HandleScraperTimeOut();
        hcto.doWorkWithTimeout(30);
    }
}
It would be really helpful if someone could take a look and tell me what the problem is.
For more details about the problem, see "Skip particular Javascript execution in HTML unit" and "Fetch Page source using HtmlUnit : URL got stuck".
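Update 1
The strange thing is that future.cancel(true); returns TRUE in both cases. What I expected was:
- With HtmlUnit, it should return FALSE, since the process is still hanging.
- With the plain Thread.sleep(); it should return TRUE, since the process was cancelled successfully.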
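Update 2
It hangs only with the http://ericaweiner.com/collections/ URL. If I give it any other URL, e.g. http://www.google.com or http://www.yahoo.com, it does not hang. In those cases it throws an InterruptedException and the process exits.
It seems that something in the page source of http://ericaweiner.com/collections/ is causing the problem.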
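Best answer
Future.cancel(boolean) returns:
- false if the task could not be cancelled, typically because it has already completed normally
- true otherwise
Cancelled means that the thread had not finished before the cancellation, the cancelled flag is set to true and, if requested, the thread is interrupted.
Interrupting a thread means that Thread.interrupt is called on it, and nothing more. Future.cancel(boolean) does not check whether the thread actually stopped.
So it is correct for cancel to return true in this case.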
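To make that concrete, here is a minimal sketch (not the code from the question) of a worker that swallows interrupts. The class name CancelDoesNotStop, the runDemo helper and the flags are made up for illustration; only the java.util.concurrent calls are real API. cancel(true) reports success even though the worker keeps running.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical demo class: shows that cancel(true) returning true does not mean the thread stopped.
public class CancelDoesNotStop {

    /** Returns { value returned by cancel(true), whether the worker was still running afterwards }. */
    static boolean[] runDemo() {
        final AtomicBoolean keepRunning = new AtomicBoolean(true);
        final AtomicBoolean finished = new AtomicBoolean(false);
        final CountDownLatch started = new CountDownLatch(1);
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<?> future = executor.submit(new Runnable() {
                public void run() {
                    started.countDown();
                    while (keepRunning.get()) {
                        try {
                            Thread.sleep(50);
                        } catch (InterruptedException ignored) {
                            // Swallow the interrupt and carry on, as uncooperative code may do.
                        }
                    }
                    finished.set(true);
                }
            });
            started.await();                          // make sure the task is actually running
            boolean cancelled = future.cancel(true);  // interrupts the worker thread
            Thread.sleep(300);                        // give the worker time to react, if it were going to
            boolean stillRunning = !finished.get();
            return new boolean[] { cancelled, stillRunning };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            keepRunning.set(false);                   // let the worker exit so the JVM can terminate
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        boolean[] result = runDemo();
        System.out.println("cancel(true) returned: " + result[0]);
        System.out.println("worker still running after cancel: " + result[1]);
    }
}
```

The worker only stops here because the demo clears its own keepRunning flag at the end; the interrupt alone does nothing to it.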
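Interrupting a thread means it should stop as soon as possible, but this is not enforced. You can try to make it stop or fail by closing a resource it needs, or something similar. I usually do this with threads that read from a socket (waiting for incoming data): I close the socket, so the thread stops waiting.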
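The socket trick can be sketched as follows. This is an illustrative example on a loopback connection, not HtmlUnit code; the class name CloseToUnblock and the runDemo helper are assumptions made for the demo. A reader thread blocks in read() because the peer never sends anything, and closing the socket from another thread makes the read fail so the thread can finish.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical demo class: unblocks a thread stuck in a socket read by closing the socket.
public class CloseToUnblock {

    /** Returns { reader was blocked before close, reader finished after close }. */
    static boolean[] runDemo() {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Local server that accepts one connection and never writes to it.
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, server.getLocalPort());
             Socket accepted = server.accept()) {

            Thread reader = new Thread(new Runnable() {
                public void run() {
                    try {
                        client.getInputStream().read(); // blocks: the peer never sends data
                    } catch (IOException e) {
                        // Reached once the socket is closed from the other thread.
                    }
                }
            });
            reader.start();

            reader.join(300);                       // the read cannot complete, so this times out
            boolean blockedBeforeClose = reader.isAlive();

            client.close();                         // closing the resource aborts the blocked read
            reader.join(5000);
            boolean finishedAfterClose = !reader.isAlive();

            return new boolean[] { blockedBeforeClose, finishedAfterClose };
        } catch (IOException e) {
            throw new IllegalStateException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = runDemo();
        System.out.println("reader blocked before close: " + result[0]);
        System.out.println("reader finished after close: " + result[1]);
    }
}
```

Whether the same approach frees a hung HtmlUnit call depends on what the library is blocked on, so treat this as the general pattern rather than a guaranteed fix for the URL in the question.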
Source: the Stack Overflow question "java - HtmlUnit WebClient timeout": https://stackoverflow.com/questions/14559746/