java - 如何提高相对简单的 Java 计数方法的效率和/或性能?

标签 java performance methods classification word-count

我正在构建一个必须阅读大量文本文档的分类器,但我发现我的 countWordFrequenties 方法处理的文档越多,速度就越慢。下面的这个方法需要 60 毫秒(在我的 PC 上),而读取、规范化、标记化、更新我的词汇表和均衡不同的整数列表总共只需要 3-5 毫秒(在我的 PC 上)。我的 countWordFrequencies 方法如下:

public List<Integer> countWordFrequencies(String[] tokens) 
{
    List<Integer> wordFreqs = new ArrayList<>(vocabulary.size());
    int counter = 0;

    for (int i = 0; i < vocabulary.size(); i++) 
    {
        for (int j = 0; j < tokens.length; j++)
            if (tokens[j].equals(vocabulary.get(i)))
                counter++;

        wordFreqs.add(i, counter);
        counter = 0;
    }

    return wordFreqs;
}

加快此过程的最佳方法是什么?这种方法有什么问题?

这是我的整个类(class),还有另一个类(class)类别,将它也张贴在这里是个好主意还是你们不需要它?

public class BayesianClassifier 
{
    private Map<String,Integer>  vocabularyWordFrequencies;
    private List<String> vocabulary;
    private List<Category> categories;
    private List<Integer> wordFrequencies;
    private int trainTextAmount;
    private int testTextAmount;
    private GUI gui;

    public BayesianClassifier() 
    {
        this.vocabulary = new ArrayList<>();
        this.categories = new ArrayList<>();
        this.wordFrequencies = new ArrayList<>();
        this.trainTextAmount = 0;
        this.gui = new GUI(this);
        this.testTextAmount = 0;
    }

    public List<Category> getCategories() 
    {
        return categories;
    }

    public List<String> getVocabulary() 
    {
        return this.vocabulary;
    }

    public List<Integer> getWordFrequencies() 
    {
        return  wordFrequencies;
    }

    public int getTextAmount() 
    {
        return testTextAmount + trainTextAmount;
    }

    public void updateWordFrequency(int index, Integer frequency)
    {
        equalizeIntList(wordFrequencies);
        this.wordFrequencies.set(index, wordFrequencies.get(index) + frequency);
    }

    public String readText(String path) 
    {
        BufferedReader br;
        String result = "";

        try 
        {
            br = new BufferedReader(new FileReader(path));

            StringBuilder sb = new StringBuilder();
            String line = br.readLine();

            while (line != null) 
            {
                sb.append(line);
                sb.append("\n");
                line = br.readLine();
            }

            result = sb.toString();
            br.close();
        } 
        catch (IOException e) 
        {
            e.printStackTrace();
        }

        return result;
    }

    public String normalizeText(String text) 
    {
        String fstNormalized = Normalizer.normalize(text, Normalizer.Form.NFD);

        fstNormalized = fstNormalized.replaceAll("[^\\p{ASCII}]","");
        fstNormalized = fstNormalized.toLowerCase();
        fstNormalized = fstNormalized.replace("\n","");
        fstNormalized = fstNormalized.replaceAll("[0-9]","");
        fstNormalized = fstNormalized.replaceAll("[/()!?;:,.%-]","");
        fstNormalized = fstNormalized.trim().replaceAll(" +", " ");

        return fstNormalized;
    }

    public String[] handleText(String path) 
    {
        String text = readText(path);
        String normalizedText = normalizeText(text);

        return tokenizeText(normalizedText);
    }

    public void createCategory(String name, BayesianClassifier bc) 
    {
        Category newCategory = new Category(name, bc);

        categories.add(newCategory);
    }

    public List<String> updateVocabulary(String[] tokens) 
    {
        for (int i = 0; i < tokens.length; i++)
            if (!vocabulary.contains(tokens[i]))
                vocabulary.add(tokens[i]);

        return vocabulary;
    }

    public List<Integer> countWordFrequencies(String[] tokens)
    {
        List<Integer> wordFreqs = new ArrayList<>(vocabulary.size());
        int counter = 0;

        for (int i = 0; i < vocabulary.size(); i++) 
        {
            for (int j = 0; j < tokens.length; j++)
                if (tokens[j].equals(vocabulary.get(i)))
                    counter++;

            wordFreqs.add(i, counter);
            counter = 0;
        }

        return wordFreqs;
    }

    public String[] tokenizeText(String normalizedText) 
    {
        return normalizedText.split(" ");
    }

    public void handleTrainDirectory(String folderPath, Category category) 
    {
        File folder = new File(folderPath);
        File[] listOfFiles = folder.listFiles();

        if (listOfFiles != null) 
        {
            for (File file : listOfFiles) 
            {
                if (file.isFile()) 
                {
                    handleTrainText(file.getPath(), category);
                }
            }
        } 
        else 
        {
            System.out.println("There are no files in the given folder" + " " + folderPath.toString());
        }
    }

    public void handleTrainText(String path, Category category) 
    {
        long startTime = System.currentTimeMillis();

        trainTextAmount++;

        String[] text = handleText(path);

        updateVocabulary(text);
        equalizeAllLists();

        List<Integer> wordFrequencies = countWordFrequencies(text);
        long finishTime = System.currentTimeMillis();

        System.out.println("That took 1: " + (finishTime-startTime)+ " ms");

        long startTime2 = System.currentTimeMillis();

        category.update(wordFrequencies);
        updatePriors();

        long finishTime2 = System.currentTimeMillis();

        System.out.println("That took 2: " + (finishTime2-startTime2)+ " ms");
    }

    public void handleTestText(String path) 
    {
        testTextAmount++;

        String[] text = handleText(path);
        List<Integer> wordFrequencies = countWordFrequencies(text);
        Category category = guessCategory(wordFrequencies);
        boolean correct = gui.askFeedback(path, category);

        if (correct) 
        {
            category.update(wordFrequencies);
            updatePriors();
            System.out.println("Kijk eens aan! De tekst is succesvol verwerkt.");
        } 
        else 
        {
            Category correctCategory = gui.askCategory();
            correctCategory.update(wordFrequencies);
            updatePriors();
            System.out.println("Kijk eens aan! De tekst is succesvol verwerkt.");
        }
    }

    public void updatePriors()
    {
        for (Category category : categories)
        {
            category.updatePrior();
        }
    }

    public Category guessCategory(List<Integer> wordFrequencies) 
    {
        List<Double> chances = new ArrayList<>();

        for (int i = 0; i < categories.size(); i++)
        {
            double chance = categories.get(i).getPrior();

            System.out.println("The prior is:" + chance);

            for(int j = 0; j < wordFrequencies.size(); j++)
            {
                chance = chance * categories.get(i).getWordProbabilities().get(j);
            }

            chances.add(chance);
        }

        double max = getMaxValue(chances);
        int index = chances.indexOf(max);

        System.out.println(max);
        System.out.println(index);
        return categories.get(index);
    }

    public double getMaxValue(List<Double> values)
    {
        Double max = 0.0;

        for (Double dubbel : values)
        {
            if(dubbel > max)
            {
                max = dubbel;
            }
        }

        return max;
    }

    public void equalizeAllLists()
    {
        for(Category category : categories)
        {
            if (category.getWordFrequencies().size() < vocabulary.size())
            {
                category.setWordFrequencies(equalizeIntList(category.getWordFrequencies()));
            }
        }

        for(Category category : categories)
        {
            if (category.getWordProbabilities().size() < vocabulary.size())
            {
                category.setWordProbabilities(equalizeDoubleList(category.getWordProbabilities()));
            }
        }
    }

    public List<Integer> equalizeIntList(List<Integer> list)
    {
        while (list.size() < vocabulary.size())
        {
            list.add(0);
        }

        return list;
    }

    public List<Double> equalizeDoubleList(List<Double> list)
    {
        while (list.size() < vocabulary.size())
        {
            list.add(0.0);
        }

        return list;
    }

    public void selectFeatures()
    {
        for(int i = 0; i < wordFrequencies.size(); i++)
        {
            if(wordFrequencies.get(i) < 2)
            {
                vocabulary.remove(i);
                wordFrequencies.remove(i);

                for(Category category : categories)
                {
                    category.removeFrequency(i);
                }
            }
        }
    }
}

最佳答案

您的方法有 O(n*m) 运行时间(n 是词汇量大小,m 是标记大小)。通过散列,这可以减少到 O(m),这显然更好。

for (String token: tokens) {
  if(!map.containsKey(token)){
      map.put(token,0);
  }
  map.put(token,map.get(token)+1);
}

关于java - 如何提高相对简单的 Java 计数方法的效率和/或性能?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/34500585/

相关文章:

java - 如何在 Java 中将 float 转换为 4 个字节?

java - Scala 如何将模拟对象注入(inject) ScalatraFlatSpec

c# - 将关系(规范化)数据表快速插入 SQL Server 2008 数据库

c# - 如何在 C# 中将动态数据类型作为参数传递?

jQuery 方法链接和 var 含义意外

java - JAVA中如何创建支持长连接的HttpServer?

java - Firebase 助手错误 : Failed to resolve: firebase-messaging-15. 0.0

performance - PixelSearch功能效率很低,如何优化?

linux - 在 `perf stat` 的输出上运行 `perf record` ?

c# - 在类中创建方法来更新类中保存的列表