java - Collecting links from web pages using a Java thread pool

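I am writing a link collector that visits a specified number of pages. To make it more efficient I use a fixed-size thread pool. I am new to multithreading, so I am having trouble fixing a few problems. My idea is that every thread does the same thing: connect to a page and collect every URL on it. Those URLs are then added to a queue for the next thread.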

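But it doesn't work. Currently the program first analyzes the base URL itself and adds the URLs found on it. Originally I wanted to just call LinksToVisit.add(baseurl) and run everything through the thread pool, but the loop keeps polling the queue while the threads add nothing new, so the head of the queue is null. I don't know why :(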

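I tried to do this with ArrayBlockingQueue, but without success. Fixing it by analyzing the base URL up front is not a good solution, because when the base URL contains only one link, that link is never followed. So I think I am going about this the wrong way or missing something important. I use Jsoup as the HTML parser. Thanks for your answers.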

Source (unnecessary methods removed):

package collector;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.text.DecimalFormat;
import java.util.Iterator;
import java.util.Map;
import java.util.Scanner;
import java.util.Map.Entry;
import java.util.concurrent.*;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;


public class Collector {

private String baseurl;
private int links;
private int cvlinks;
private double time;
private int chcount;
private static final int NTHREADS = Runtime.getRuntime().availableProcessors()*2;
private ConcurrentLinkedQueue<String> LinksToVisit = new ConcurrentLinkedQueue<String>();
private ConcurrentSkipListMap<String, Double> SortedCharMap = new ConcurrentSkipListMap<String, Double>();
private ConcurrentHashMap<String, Double> CharMap = new ConcurrentHashMap<String, Double>();

public Collector(String url, int links) {
    this.baseurl = url;
    this.links = links;
    this.cvlinks = 0;
    this.chcount = 0;

    try {
        Document html = Jsoup.connect(url).get();

        if(cvlinks != links){
            Elements collectedLinks = html.select("a[href]");
            for(Element link:collectedLinks){
                if(cvlinks == links) break;
                else{
                    String current = link.attr("abs:href");
                    if(!current.equals(url) && current.startsWith(baseurl)&& !current.contains("#")){
                        LinksToVisit.add(current);
                        cvlinks++;
                    }
                }
            }
        }

        AnalyzeDocument(html, url);
    } catch (IOException e) {
        e.printStackTrace();
    }
    CollectFromWeb();
}

private void AnalyzeDocument(Document doc,String url){
    String text = doc.body().text().toLowerCase().replaceAll("[^a-z]", "").trim();
    chcount += text.length();
    String chars[] = text.split("");
    CharCount(chars);

}
private void CharCount(String[] chars) {
    for(int i = 1; i < chars.length; i++) {
        if(!CharMap.containsKey(chars[i]))  
            CharMap.put(chars[i],1.0);
        else
            CharMap.put(chars[i], CharMap.get(chars[i]).doubleValue()+1);
    }
}

private void CollectFromWeb(){
    long startTime = System.nanoTime();
    ExecutorService executor = Executors.newFixedThreadPool(NTHREADS);
    CollectorThread[] workers = new CollectorThread[this.links];
    for (int i = 0; i < this.links; i++) {
        if(!LinksToVisit.isEmpty()){
            int j = i+1;
            System.out.println("Collecting from "+LinksToVisit.peek()+" ["+j+"/"+links+"]");
            //Runnable worker = new CollectorThread(LinksToVisit.poll());   
            workers[i] = new CollectorThread(LinksToVisit.poll());
            executor.execute(workers[i]);
        }
        else break;
    }
    executor.shutdown();
    while (!executor.isTerminated()) {}

    SortedCharMap.putAll(CharMap);

    this.time =(System.nanoTime() - startTime)*10E-10;
}

class CollectorThread implements Runnable{
    private Document html;
    private String url;

    public CollectorThread(String url){
        this.url = url;
        try {
            this.html = Jsoup.connect(url).get();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    @Override
    public void run() {
        if(cvlinks != links){
            Elements collectedLinks = html.select("a[href]");
            for(Element link:collectedLinks){
                if(cvlinks == links) break;
                else{
                    String current = link.attr("abs:href");
                    if(!current.equals(url) && current.startsWith(baseurl)&& !current.contains("#")){
                        LinksToVisit.add(current);
                        cvlinks++;
                    }
                }
            }
        }

        AnalyzeDocument(html, url);
    }
}

}

Best answer

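There is no need for the LinksToVisit queue; just call executor.execute(new CollectorThread(current)) directly from CollectorThread.run(). An ExecutorService has its own internal task queue and runs each queued task as soon as a thread becomes available.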
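
To make the idea concrete, here is a minimal sketch of tasks that submit their own follow-up tasks to the executor. Everything in it is an assumption made for illustration: the class name SelfSubmittingCrawler, the in-memory SITE map that stands in for fetching pages with Jsoup, the example.com URLs, and the maxLinks cap. It also uses a Phaser to notice when the last task has finished, which is one possible way to know when shutting down is safe, not something the original code does.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;

// Hypothetical sketch: every task hands newly found links straight to the executor.
class SelfSubmittingCrawler {
    // Stand-in for the web (made-up data): page URL -> links found on that page.
    static final Map<String, List<String>> SITE = Map.of(
            "http://example.com/", List.of("http://example.com/a", "http://example.com/b"),
            "http://example.com/a", List.of("http://example.com/b", "http://example.com/c"),
            "http://example.com/b", List.of("http://example.com/"),
            "http://example.com/c", List.of());

    private final ExecutorService executor = Executors.newFixedThreadPool(4);
    private final Set<String> visited = ConcurrentHashMap.newKeySet();
    private final Phaser inFlight = new Phaser(1); // the one initial party is the caller
    private final int maxLinks;

    SelfSubmittingCrawler(int maxLinks) {
        this.maxLinks = maxLinks;
    }

    Set<String> crawl(String baseUrl) {
        submit(baseUrl);
        inFlight.arriveAndAwaitAdvance(); // returns once every submitted task has finished
        executor.shutdown();              // safe now: nothing can submit any more
        return visited;
    }

    // synchronized so the "limit reached?" check and the add happen as one step
    private synchronized void submit(String url) {
        if (visited.size() >= maxLinks || !visited.add(url)) {
            return; // limit reached, or URL already scheduled
        }
        inFlight.register(); // count the task before it is queued
        executor.execute(() -> {
            try {
                for (String link : SITE.getOrDefault(url, List.of())) {
                    submit(link); // no separate LinksToVisit queue needed
                }
            } finally {
                inFlight.arriveAndDeregister(); // this task is done
            }
        });
    }

    public static void main(String[] args) {
        Set<String> pages = new SelfSubmittingCrawler(10).crawl("http://example.com/");
        System.out.println(pages.size() + " pages visited: " + pages);
    }
}
```

Because a task registers its children before it deregisters itself, the count of unfinished tasks cannot reach zero while work is still being discovered. In the real collector the loop body would fetch the page with Jsoup and filter links the way run() already does.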

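Another problem here is that calling shutdown() right after the first set of URLs has been added to the queue prevents any new tasks from being added to the executor. You can fix this by having the executor shut itself down once it has emptied its queue: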

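import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// A fixed-size pool that shuts itself down once its task queue has drained.
class Queue extends ThreadPoolExecutor {
    Queue(int nThreads) {
        super(nThreads, nThreads, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<Runnable>());
    }

    @Override
    protected void afterExecute(Runnable r, Throwable t) {
        super.afterExecute(r, t);
        // Nothing is waiting after this task finished: stop accepting new work.
        if (getQueue().isEmpty()) {
            shutdown();
        }
    }
}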
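
One thing to watch with a queue-only check: the queue can be momentarily empty while another worker is still running and about to submit more links, and a submission after shutdown() is rejected. A possible variant, sketched below under my own assumptions, counts every task from the moment it is submitted until it finishes and shuts down only when that count reaches zero. The names SelfClosingExecutor, SelfClosingDemo, spawn and run are hypothetical, and the demo tasks just spawn two children each instead of fetching pages.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical variant: shut down when no task is queued OR still running.
class SelfClosingExecutor extends ThreadPoolExecutor {
    private final AtomicInteger pending = new AtomicInteger();

    SelfClosingExecutor(int nThreads) {
        super(nThreads, nThreads, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<Runnable>());
    }

    @Override
    public void execute(Runnable task) {
        pending.incrementAndGet(); // count the task before it is queued
        try {
            super.execute(task);
        } catch (RuntimeException e) {
            pending.decrementAndGet(); // rejected, so it will never run
            throw e;
        }
    }

    @Override
    protected void afterExecute(Runnable r, Throwable t) {
        super.afterExecute(r, t);
        // Children were counted inside run(), before this decrement happens.
        if (pending.decrementAndGet() == 0) {
            shutdown();
        }
    }
}

class SelfClosingDemo {
    // Each task records itself and, while depth > 0, submits two child tasks.
    static void spawn(SelfClosingExecutor pool, AtomicInteger executed, int depth) {
        pool.execute(() -> {
            executed.incrementAndGet();
            if (depth > 0) {
                spawn(pool, executed, depth - 1);
                spawn(pool, executed, depth - 1);
            }
        });
    }

    // Runs a task tree of the given depth and returns how many tasks executed.
    static int run(int depth) {
        SelfClosingExecutor pool = new SelfClosingExecutor(4);
        AtomicInteger executed = new AtomicInteger();
        spawn(pool, executed, depth);
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("pool did not terminate in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return executed.get();
    }

    public static void main(String[] args) {
        System.out.println("tasks executed: " + run(3));
    }
}
```

Like the class above, this suits a crawl that starts from a single seed task: if the caller submits several seeds one at a time, the first could finish and trigger shutdown() before the second is submitted. Using awaitTermination also avoids the busy-wait loop on isTerminated().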

This question about collecting links from web pages using a Java thread pool comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/8758277/
