Java word count: a mediocre implementation

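I implemented a word count program in Java. Basically, the program takes a large file (in my tests, I used a 10 GB data file that contained only numbers) and counts the number of times each "word" appears. In this case, a number such as 23723 might appear 243 times in the file.

Below is my implementation. I want to improve it, mainly with performance in mind but also in a few other respects, and I am looking for some guidance. Here are some of the issues I would like to fix: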

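Currently, the program is threaded and works properly. However, what I do is pass a chunk of memory (500MB/NUM_THREADS) to each thread, and each thread then proceeds to count words. The problem is that I have the main thread wait for ALL the threads to finish before passing more data to each thread. It isn't too much of a problem, but for a period of time a few threads sit waiting and doing nothing. I believe some sort of worker pool or executor service could solve this problem (I have not learned the syntax for this yet).
The program only works for a file that contains integers. That's a problem. I struggled with this a lot, as I didn't know how to iterate through the data without creating loads of unused variables (using a String or even a StringBuilder had awful performance). Currently, I rely on knowing that the input is integers and store the temporary value as an int, so there are no memory problems. I want to be able to use some sort of delimiter, whether that delimiter is a space or several characters.
I am using a global ConcurrentHashMap to store key-value pairs. For example, if a thread finds the number "24624", it searches for that number in the map. If it exists, it increases the value of that key by one. The value of each key at the end represents the number of occurrences of that key. So is this the proper design? Would I gain performance by giving each thread its own HashMap and then merging them all at the end?
Is there any other way of seeking through a file by offset without using the RandomAccessFile class? This class only reads into a byte array, which I then have to convert. I haven't timed this conversion, but maybe it would be faster to use something else.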

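I am open to other possibilities as well; this is just what comes to mind.

Note: Splitting the file is not an option I want to explore, as I might be deploying this on a server where I should not be creating my own files, but if it would really boost performance, I might listen.

Other note: I am new to Java threading, as well as new to StackOverflow. Be gentle.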

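    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.ArrayList;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    public class BigCount2 {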
        public static void main(String[] args) throws IOException, InterruptedException {
            int num, counter;
            long i, j;
            String delimiterString = " ";
            ArrayList<Character> delim = new ArrayList<Character>();
            for (char c : delimiterString.toCharArray()) {
                delim.add(c);
            }
            int counter2 = 0;
            num = Integer.parseInt(args[0]);
            int bytesToRead = 1024 * 1024 * 1024 / 2; //500 MB, size of loop
            int remainder = bytesToRead % num;
            int k = 0;
            bytesToRead = bytesToRead - remainder;
            int byr = bytesToRead / num;
            String filepath = "C:/Users/Daniel/Desktop/int-dataset-10g.dat";
            RandomAccessFile file = new RandomAccessFile(filepath, "r");
            Thread[] t = new Thread [num];//array of threads
            ConcurrentMap<Integer, Integer> wordCountMap = new ConcurrentHashMap<Integer, Integer>(25000);
            byte [] byteArray = new byte [byr]; //buffer for one thread's slice (512 MiB / num bytes)
            char[] newbyte;
            for (i = 0; i < file.length(); i += bytesToRead) {
                counter = 0;
                for (j = 0; j < bytesToRead; j += byr) {
                    file.seek(i + j);
                    file.read(byteArray, 0, byr);
                    newbyte = new String(byteArray).toCharArray();
                    t[counter] = new Thread(
                            new BigCountThread2(counter, 
                                newbyte, 
                                delim, 
                                wordCountMap));//giving each thread t[i] different file fileReader[i] 
                    t[counter].start();
                    counter++;
                    newbyte = null;
                }
                for (k = 0; k < num; k++){
                    t[k].join(); //main thread continues after ALL threads have finished. 
                }
                counter2++;
                System.gc();
            }
            file.close();
            System.exit(0);
        }
    }   

    class BigCountThread2 implements Runnable {
        private final ConcurrentMap<Integer, Integer> wordCountMap;
        char [] newbyte;
        private ArrayList<Character> delim;
        private int threadId; //use for later
        BigCountThread2(int tid, 
                char[] newbyte, 
                ArrayList<Character> delim,
                ConcurrentMap<Integer, Integer> wordCountMap) { 
            this.delim = delim;
            threadId = tid;
            this.wordCountMap = wordCountMap;
            this.newbyte = newbyte;
        }
        public void run() {
            int intCheck = 0;
            int counter = 0; int i = 0; Integer check;  int j =0; int temp = 0; int intbuilder = 0;
            for (i = 0; i < newbyte.length; i++) {
                intCheck = Character.getNumericValue(newbyte[i]);
                if (newbyte[i] == ' ' || intCheck == -1) {    //once a delimiter is found, the current tempArray needs to be added to the MAP
                    check = wordCountMap.putIfAbsent(intbuilder, 1);
                    if (check != null) { //if returns null, then it is the first instance
                        wordCountMap.put(intbuilder, wordCountMap.get(intbuilder) + 1);
                    }

                    intbuilder = 0;
                }

                else {
                    intbuilder = (intbuilder * 10) + intCheck;
                    counter++;
                }

            }
        }
    }

Best answer

Some thoughts on most of the points..

.. I believe some sort of worker pool or executor service could solve this problem (I have not learned the syntax for this yet).

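If all the threads take about the same time to process the same amount of data, then there really isn't that much of a "problem" here.

However, one nice thing about a thread pool is that it makes it fairly simple to tune some basic parameters, such as the number of concurrent workers. In addition, using an executor service with futures provides an additional level of abstraction; this is particularly handy here if each thread returns a map as its result.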
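As a rough illustration of that idea, here is a minimal sketch, assuming Java 8 or later. The class name `PoolCountSketch` and the helpers `countChunk` and `countAll` are made up for this example, and the chunks are plain strings rather than slices of a real file. Each task fills its own private map and returns it through a `Future`; the caller merges the results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PoolCountSketch {
    // Hypothetical helper: counts space-separated words in one chunk,
    // using a map that is private to the task (no shared mutable state).
    static Map<String, Integer> countChunk(String chunk) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : chunk.split(" ")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Hypothetical helper: submits one task per chunk to a fixed-size pool,
    // then merges the per-task maps on the calling thread.
    static Map<String, Integer> countAll(List<String> chunks, int workers) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Map<String, Integer>>> futures = new ArrayList<>();
            for (String chunk : chunks) {
                Callable<Map<String, Integer>> task = () -> countChunk(chunk);
                futures.add(pool.submit(task));
            }
            Map<String, Integer> total = new HashMap<>();
            for (Future<Map<String, Integer>> future : futures) {
                // get() blocks until that task has finished.
                future.get().forEach((word, n) -> total.merge(word, n, Integer::sum));
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countAll(Arrays.asList("1 2 2", "2 3", "3 3 1"), 2));
    }
}
```

Because a free worker picks up the next queued chunk as soon as it finishes, submitting many small chunks keeps the workers busy without the "wait for everyone" barrier from the question.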

The program will only work for a file that contains integers. That's a problem. I struggled with this a lot, as I didn't know how to iterate through the data without creating loads of unused variables (using a String or even StringBuilder had awful performance) ..

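This sounds like an implementation issue. I would first try a StreamTokenizer (because it's already written); if doing it manually, I would check out its source. A good bit of it can be omitted when the notion of a "token" is simplified. (It uses a temporary array to build the token.)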
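A minimal sketch of the StreamTokenizer route, assuming Java 8 or later and that a "word" is any run of non-whitespace characters. The class name `TokenizerCountSketch` and the method `count` are made up for this example. Note that it still allocates one String per token, so it may not cure the allocation cost described in the question.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

class TokenizerCountSketch {
    // Hypothetical helper: counts tokens from any Reader.
    static Map<String, Integer> count(Reader in) throws IOException {
        StreamTokenizer st = new StreamTokenizer(in);
        st.resetSyntax();            // drop the default number, comment and quote handling
        st.wordChars('!', 0xFF);     // characters above space are parts of words
        st.whitespaceChars(0, ' ');  // control characters and space separate words
        Map<String, Integer> counts = new HashMap<>();
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_WORD) {
                counts.merge(st.sval, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(count(new StringReader("23723 abc 23723\nabc 23723")));
    }
}
```

Calling `resetSyntax()` first matters here: by default StreamTokenizer parses digits as floating-point numbers, which is not what a generic word count wants.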

I am using a global ConcurrentHashMap to store key value pairs. .. So is this the proper design? Would I gain in performance by giving each thread its own hashmap, and then merging them all at the end?

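Using a separate map per thread and a merge strategy would reduce locking and likely increase performance. Furthermore, the current implementation is broken, as wordCountMap.put(intbuilder, wordCountMap.get(intbuilder) + 1) is not atomic: two threads can read the same old value and both write back old value + 1, so the operation might undercount. I would use separate maps simply because reducing mutable shared state makes a threaded program much easier to reason about.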
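If the shared map is kept anyway, the lost-update problem can be avoided with an atomic read-modify-write. This is a minimal sketch, assuming Java 8 or later, where `ConcurrentHashMap.merge` performs the whole update atomically; the class name `AtomicIncrementSketch` and the methods `increment` and `hammer` are made up for this example.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class AtomicIncrementSketch {
    // Atomically adds one to the count for key, inserting 1 if the key is absent.
    static void increment(ConcurrentMap<Integer, Integer> map, int key) {
        map.merge(key, 1, Integer::sum);
    }

    // Hypothetical stress helper: several threads increment the same key.
    static int hammer(int threads, int perThread) throws InterruptedException {
        ConcurrentMap<Integer, Integer> map = new ConcurrentHashMap<>();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    increment(map, 24624);
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
        return map.get(24624);
    }

    public static void main(String[] args) throws InterruptedException {
        // Prints 400000: no increments are lost because merge is atomic.
        System.out.println(hammer(4, 100_000));
    }
}
```

This fixes correctness, but all threads still contend on the same map, so the per-thread-map approach is still likely to be the faster of the two.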

Is there any other way of seeking through a file with an offset without using the class RandomAccessFile? This class will only read into a byte array, which I then have to convert. I haven't timed this conversion, but maybe it could be faster to use something else.

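Consider using a FileReader (and BufferedReader) per thread on the same file. This avoids having to first copy the file into an array and slice it up for the individual threads; the total amount read is the same, but it avoids soaking up so much memory. The reading done is not actually random access, but merely sequential (with a "skip") starting at different offsets, and each thread still works on a mutually exclusive range.

Also, the original code with the slicing is broken if an integer value is "cut" in half, as each thread would then read half of the word. One workaround is to have each thread skip the first word if it is a continuation from the previous block (i.e. by scanning one byte sooner) and then read past the end of its range as required to complete the last word.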
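The two points above can be sketched together. This is a minimal illustration, assuming Java 8 or later, whitespace-delimited words, and a single-byte encoding (ISO-8859-1 is used so that character offsets equal byte offsets). The class name `RangeCountSketch` and the methods `countRange` and `skipFully` are made up for this example. Each call opens its own reader, skips to its offset, drops a word that started in the previous range, and reads past its end to finish its last word.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

class RangeCountSketch {
    // Skips n characters, stopping early only at end of file.
    static void skipFully(Reader in, long n) throws IOException {
        while (n > 0) {
            long skipped = in.skip(n);
            if (skipped <= 0) return;
            n -= skipped;
        }
    }

    // Counts every word whose first character lies in [start, end).
    static Map<String, Integer> countRange(Path file, long start, long end) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.ISO_8859_1)) {
            long pos = 0; // index of the next unread character
            if (start > 0) {
                // Look one character before the range: if it is not a delimiter,
                // the word under way belongs to the previous range, so skip it.
                skipFully(in, start - 1);
                pos = start - 1;
                int c = in.read();
                pos++;
                while (c != -1 && !Character.isWhitespace(c)) {
                    c = in.read();
                    pos++;
                }
            }
            StringBuilder word = new StringBuilder();
            while (true) {
                // Stop once past the range, unless a word is still being completed.
                if (word.length() == 0 && pos >= end) break;
                int c = in.read();
                pos++;
                if (c == -1 || Character.isWhitespace(c)) {
                    if (word.length() > 0) {
                        counts.merge(word.toString(), 1, Integer::sum);
                        word.setLength(0);
                    }
                    if (c == -1) break;
                } else {
                    word.append((char) c);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("words", ".txt"); // throwaway demo input
        Files.write(file, "11 22 333 22 11 4444".getBytes(StandardCharsets.ISO_8859_1));
        long cut = 7; // deliberately falls inside "333"
        Map<String, Integer> total = new HashMap<>(countRange(file, 0, cut));
        countRange(file, cut, Files.size(file)).forEach((w, n) -> total.merge(w, n, Integer::sum));
        System.out.println(total);
        Files.delete(file);
    }
}
```

With this rule a word is owned by exactly one range, the one containing its first character, so the ranges can be handed to separate threads and the resulting maps merged without double counting.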

Regarding "Java word count: a mediocre implementation", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/24133037/
