java - Parsing multiple large CSV files and adding all the records to an ArrayList

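Currently I have about 12 CSV files, each with about 1.5 million records.

I'm using univocity-parsers as my CSV reader/parser library.

Using univocity-parsers, I read each file and add all of its records to an ArrayList with the addAll() method. Once all 12 files have been parsed and added to the list, my code prints the size of the list at the end.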

for (int i = 0; i < 12; i++) {
    myList.addAll(parser.parseAll(getReader("file-" + i + ".csv")));
}

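It works fine at first, until I reach my 6th consecutive file. Then it seems to take forever in my IntelliJ IDE output window and never prints the ArrayList size, even after an hour, whereas before the 6th file it was rather fast.

If it helps, I'm running on a MacBook Pro (mid 2014) with OS X Yosemite.

This is a textbook problem on fork and join.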

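Best answer

I'm the creator of this library. If you just want to count rows, use a RowProcessor. You don't even need to count the rows yourself, because the parser does that for you: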

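import com.univocity.parsers.common.ParsingContext;
import com.univocity.parsers.common.processor.AbstractRowProcessor;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;
import java.io.File;
import java.io.FileNotFoundException;

// Let's create our own RowProcessor to analyze the rows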
static class RowCount extends AbstractRowProcessor {

    long rowCount = 0;

    @Override
    public void processEnded(ParsingContext context) {
        // this returns the number of the last valid record.
        rowCount = context.currentRecord();
    }
}

public static void main(String... args) throws FileNotFoundException {
    // let's measure the time roughly
    long start = System.currentTimeMillis();

    //Creates an instance of our own custom RowProcessor, defined above.
    RowCount myRowCountProcessor = new RowCount();

    CsvParserSettings settings = new CsvParserSettings();


    //Here you can select the column indexes you are interested in reading.
    //The parser will return values for the columns you selected, in the order you defined
    //By selecting no indexes here, no String objects will be created
    settings.selectIndexes(/*nothing here*/);

    //When you select indexes, the columns are reordered so they come in the order you defined.
    //By disabling column reordering, you will get the original row, with nulls in the columns you didn't select
    settings.setColumnReorderingEnabled(false);

    //We instruct the parser to send all rows parsed to your custom RowProcessor.
    settings.setRowProcessor(myRowCountProcessor);

    //Finally, we create a parser
    CsvParser parser = new CsvParser(settings);

    //And parse! All rows are sent to your custom RowProcessor (RowCount)
    //I'm using a 150MB CSV file with 3.1 million rows.
    parser.parse(new File("c:/tmp/worldcitiespop.txt"));

    //Nothing else to do. The parser closes the input and does everything for you safely. Let's just get the results:
    System.out.println("Rows: " + myRowCountProcessor.rowCount);
    System.out.println("Time taken: " + (System.currentTimeMillis() - start) + " ms");

}

Output

Rows: 3173959
Time taken: 1062 ms

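Edit: I saw your comment about needing to use the actual data in the rows. In that case, process the rows in the rowProcessed() method of your RowProcessor class. That is the most efficient way to handle this.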
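To illustrate the idea of handling each row as it arrives, instead of collecting everything in one list first, here is a minimal sketch that uses only JDK classes. It does not use univocity-parsers: RowHandler, forEachRow and sumColumn are hypothetical names made up for this example, and the comma split is a naive assumption that ignores quoted fields and embedded line breaks. With the library, the same per-row logic would go inside rowProcessed().

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class StreamingRowDemo {

    /** Hypothetical callback, standing in for the per-row hook of a row processor. */
    interface RowHandler {
        void handle(String[] row);
    }

    /**
     * Reads the input line by line and hands each row to the handler.
     * No row is retained here, so memory use does not grow with the input size.
     * Returns the number of non-empty rows seen.
     */
    static long forEachRow(Reader input, RowHandler handler) throws IOException {
        long rows = 0;
        try (BufferedReader reader = new BufferedReader(input)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                // Naive split (assumption): no quoted fields, no embedded newlines.
                handler.handle(line.split(",", -1));
                rows++;
            }
        }
        return rows;
    }

    /** Example aggregate: sums the integer values found in one column. */
    static long sumColumn(Reader input, int column) throws IOException {
        long[] total = {0}; // single-element array so the lambda can update it
        forEachRow(input, row -> total[0] += Long.parseLong(row[column].trim()));
        return total[0];
    }

    public static void main(String[] args) throws IOException {
        String csv = "a,1\nb,2\nc,3\n";
        System.out.println("Rows: " + forEachRow(new StringReader(csv), row -> { }));
        System.out.println("Sum: " + sumColumn(new StringReader(csv), 1));
    }
}
```

The point of the design is that only the running aggregate (a count or a sum here) stays in memory, so the cost per file stays roughly constant no matter how many files are processed one after another.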

Edit 2:

If you just want to count rows, use getInputDimension from CsvRoutines:

    CsvRoutines csvRoutines = new CsvRoutines();
    InputDimension d = csvRoutines.getInputDimension(new File("/path/to/your.csv"));
    System.out.println(d.rowCount());
    System.out.println(d.columnCount());

Original question on Stack Overflow: https://stackoverflow.com/questions/32547277/
