java - What is a more efficient way to extract a pattern from a large file (over 700MB)

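I have a problem that requires me to parse a text file on my local machine. There are some complications:

The file can be very large (700MB+)
The pattern occurs on multiple lines
I need to store the line information that follows the pattern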

I wrote some simple code using BufferedReader, String.indexOf and String.substring (the latter to handle item 3).

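The file contains a key (the pattern) named code= that occurs many times in different blocks. The program reads the file line by line with BufferedReader.readLine. It uses indexOf to check whether the pattern appears, then extracts the text after the pattern and stores it in a shared (public) string.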

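When I ran my program against a 600MB file, I noticed that performance was very poor while the file was being processed. I read an article on CodeRanch saying that the Scanner class does not perform well on large files.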

Are there any techniques or libraries that could improve the performance?

Thanks in advance.

Here is my source code:

String codeC = "code=[";
String source = "";
try {
    FileInputStream f1 = new FileInputStream("c:\\Temp\\fo1.txt");
    DataInputStream in = new DataInputStream(f1);
    BufferedReader br = new BufferedReader(new InputStreamReader(in));

    String strLine;
    boolean bPrnt = false;
    int ln = 0;
    // Read File Line By Line
    while ((strLine = br.readLine()) != null) {
        // Print the content on the console
        if (strLine.indexOf(codeC) != -1) {
            ln++;
            System.out.println(strLine + " ---- register : " + ln);
            strLine = strLine.substring(codeC.length(), strLine.length());
            source = source + "\n" + strLine;
        }
    }
    System.out.println("");
    System.out.println("Lines :" + ln);
    f1.close();
} catch ( ... ) {
    ...
}

Best answer

This part of your code is highly suspect and is probably at least part of the cause of your performance problem:

FileInputStream f1 = new FileInputStream("c:\\Temp\\fo1.txt");
DataInputStream in = new DataInputStream(f1);
BufferedReader br = new BufferedReader(new InputStreamReader(in));

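You are involving a DataInputStream for no good reason, and using it as the input to a Reader can in fact be considered a case of broken code. Write this instead: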

InputStream f1 = new FileInputStream("c:\\Temp\\fo1.txt");
BufferedReader br = new BufferedReader(new InputStreamReader(f1));
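
For illustration, here is a minimal sketch of the same reader chain wrapped in try-with-resources so that the stream is closed even when reading fails. The class name ReaderSetup, the helper names open and countLines, and the choice of UTF-8 are assumptions made for this example; they do not come from the question, so substitute the file's real encoding.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class ReaderSetup {

    // Builds the reader chain directly on the file stream:
    // FileInputStream -> InputStreamReader -> BufferedReader, no DataInputStream.
    static BufferedReader open(String path, Charset charset) throws IOException {
        return new BufferedReader(
                new InputStreamReader(new FileInputStream(path), charset));
    }

    // Reads the whole file line by line and returns the number of lines.
    // UTF-8 is an assumption; the question does not state the encoding.
    static long countLines(String path) throws IOException {
        try (BufferedReader br = open(path, StandardCharsets.UTF_8)) {
            long count = 0;
            while (br.readLine() != null) {
                count++;
            }
            return count;
        } // br, and the underlying file stream, are closed here
    }
}
```

Passing the charset explicitly also avoids depending on the platform default encoding, which the single-argument InputStreamReader constructor uses.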

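Your use of System.out can seriously hurt performance, especially if you measure performance while running in Eclipse, but even when running from the command line. My guess is that this is the main cause of the bottleneck. When you are after top performance, make sure you print nothing inside the main loop.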
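
As a rough sketch of that advice, the loop below does no console output while scanning and prints only a summary afterwards. The class name CodeExtractor and the method extract are hypothetical. Two choices go beyond what the answer says and are assumptions of this example: matches are collected in a List rather than appended to one growing String, and the text is taken from the position where the key was found rather than from a fixed offset. UTF-8 is also assumed.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical class, for illustration only.
class CodeExtractor {

    // Returns the text that follows `key` on every line containing it.
    // Nothing is printed inside the loop.
    static List<String> extract(String path, String key) throws IOException {
        List<String> found = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(
                new FileInputStream(path), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                int at = line.indexOf(key);
                if (at != -1) {
                    // Keep only what comes after the key on this line.
                    found.add(line.substring(at + key.length()));
                }
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        // The file path is taken from the command line in this sketch.
        List<String> found = extract(args[0], "code=[");
        // Single summary line, printed once after the loop has finished.
        System.out.println("Lines :" + found.size());
    }
}
```

If the per-line output is still needed, one option is to write it to a file through a BufferedWriter after the scan rather than to the console during it.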

Source: this question and answer come from Stack Overflow: https://stackoverflow.com/questions/13632922/
