java - Split a text file into multiple files by a specific character sequence

I have a file in the following format.

.I 1
.T
experimental investigation of the aerodynamics of a
wing in a slipstream . 1989
.A
brenckman,m.
.B
experimental investigation of the aerodynamics of a
wing in a slipstream .
.I 2
.T
simple shear flow past a flat plate in an incompressible fluid of small
viscosity .
.A
ting-yili
.B
some texts...
some more text....
.I 3
...

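".I 1" marks the beginning of the text block for doc ID 1, and ".I 2" marks the beginning of the text block for doc ID 2.

What I need is to read the text between ".I 1" and ".I 2" and save it as a separate file such as "DOC_ID_1.txt", then read the text between ".I 2" and ".I 3" and save it as a separate file such as "DOC_ID_2.txt", and so on. Assume the number of .I # markers is unknown.

I have tried this but could not get it to work. Any help would be appreciated.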

String inputDocFile="C:\\Dropbox\\Data\\cran.all.1400";     
try {
     File inputFile = new File(inputDocFile);
     FileReader fileReader = new FileReader(inputFile);
     BufferedReader bufferedReader = new BufferedReader(fileReader);
     String line=null;
     String outputDocFileSeperatedByID="DOC_ID_";
     //Pattern docHeaderPattern = Pattern.compile(".I ", Pattern.MULTILINE | Pattern.COMMENTS);
     ArrayList<ArrayList<String>> result = new ArrayList<> ();
     int docID =0;
     try {
          StringBuilder sb = new StringBuilder();
          line = bufferedReader.readLine();
          while (line != null) {
              if (line.startsWith(".I"))
              { 
                 result.add(new ArrayList<String>());
                 result.get(docID).add(".I");
                 line = bufferedReader.readLine();

                 while(line != null && !line.startsWith(".I")){
                    line = bufferedReader.readLine();
                    }
                     ++docID;
              }        
              else line = bufferedReader.readLine();
          }

      } finally {
          bufferedReader.close();
      }
   } catch (IOException ex) {
      Logger.getLogger(ReadFile.class.getName()).log(Level.SEVERE, null, ex);
   }

Best answer

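You want to find the lines that match ".I n".

The regular expression you need is: ^.I \d$

^ marks the start of the line. So if there is whitespace or text before ".I", the line will not match the regex.
\d means any digit. For simplicity, I only allow a single digit in this regex.
$ marks the end of the line. So if there are any characters after the digit, the line will not match the expression.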
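
As a quick check of the anchors described above, the sketch below runs a few sample lines through String.matches. The class name RegexCheck, the helper isHeader and the sample strings are made up for illustration. The second pattern (escaped dot, \d+) is an assumed variant for IDs with more than one digit and is not part of the original answer.

```java
public class RegexCheck {
    // Pattern from the answer: unescaped dot, exactly one digit.
    public static final String SINGLE_DIGIT = "^.I \\d$";
    // Assumed variant (not in the answer): literal dot, one or more digits.
    public static final String MULTI_DIGIT = "^\\.I \\d+$";

    // String.matches tests the whole string against the regex.
    public static boolean isHeader(String line, String regex) {
        return line.matches(regex);
    }

    public static void main(String[] args) {
        String[] samples = {".I 1", " .I 1", ".I 1 x", ".I 12", "xI 1", ".T"};
        for (String s : samples) {
            System.out.println("[" + s + "]"
                    + " single=" + isHeader(s, SINGLE_DIGIT)
                    + " multi=" + isHeader(s, MULTI_DIGIT));
        }
    }
}
```

Because the dot is not escaped in the first pattern, a line such as "xI 1" also matches it; the second pattern rejects that line and accepts ".I 12".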

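Now you need to read the file line by line and keep a reference to the file that the current line should be written to.

In Java 8, reading a file line by line is much easier with Files.lines():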

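import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

private String currentFile = "root.txt";

// Note: the '.' is unescaped, so it matches any character, not only a literal period
public static final String REGEX = "^.I \\d$";

public void foo() throws Exception {

  Path path = Paths.get("path/to/your/input/file.txt");
  // try-with-resources closes the underlying file once the stream is consumed
  try (Stream<String> lines = Files.lines(path)) {
    lines.forEach(line -> {
      if (line.matches(REGEX)) {
        // Extract the digit and update currentFile
        currentFile = "DOC_ID_" + line.substring(3) + ".txt";
        System.out.println("Current file is now : " + currentFile);
      } else {
        System.out.println("Writing this line to " + currentFile + " :" + line);
        //Files.write(...);
      }
    });
  }
}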

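Note: to extract the digit I used a plain "".substring(), which I consider evil, but it is easier to understand. You can do this in a better way with Pattern and Matcher:

Use this regular expression: ".I (\\d)" (the same as before, but with parentheses around the part you want to capture). Then: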

Pattern pattern = Pattern.compile(".I (\\d)");
Matcher matcher = pattern.matcher(".I 3");
if(matcher.find()) {
  System.out.println(matcher.group(1));//display "3"
}
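
Putting the pieces together, here is a minimal sketch of a complete splitter that actually writes the output files. It is one possible approach rather than the answer's exact code, and it rests on several assumptions: the class name DocSplitter and the method split are hypothetical, the header pattern uses an escaped dot and \d+ so that multi-digit IDs are accepted, the input is read as UTF-8, the ".I n" header line itself is not copied into the output, and any lines before the first header are skipped.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DocSplitter {
    // Assumed header format: ".I " followed by one or more digits and nothing else.
    private static final Pattern HEADER = Pattern.compile("^\\.I (\\d+)$");

    /** Writes outputDir/DOC_ID_n.txt for each ".I n" block; returns the number of blocks found. */
    public static int split(Path input, Path outputDir) throws IOException {
        Files.createDirectories(outputDir);
        int count = 0;
        BufferedWriter writer = null; // file that receives the current block
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                Matcher m = HEADER.matcher(line);
                if (m.matches()) {
                    // A new block starts: close the previous file and open the next one.
                    if (writer != null) {
                        writer.close();
                    }
                    Path out = outputDir.resolve("DOC_ID_" + m.group(1) + ".txt");
                    writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8);
                    count++;
                } else if (writer != null) {
                    // Body line of the current block (lines before the first header are skipped).
                    writer.write(line);
                    writer.newLine();
                }
            }
        } finally {
            if (writer != null) {
                writer.close();
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java DocSplitter <inputFile> <outputDir>
        split(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Only one output file is open at a time, so the number of ".I n" blocks does not need to be known in advance. If the header line should also appear in each output file, write the line before the else branch instead of skipping it.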

This question, "java - Split a text file into multiple files by a specific character sequence", comes from a similar question found on Stack Overflow: https://stackoverflow.com/questions/30345816/
