java - Custom Java regex: match starting with and ending with

Tags: java regex apache-poi file-processing

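I have been struggling with this for a few days, and I am wondering whether someone could help me.

What I want to accomplish is to process a text file that contains a set of questions and answers. The content of the file (.doc or .docx) looks like this: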

Document Name
1. Question one:
a. Answer one to question one
b. Answer two to question one
c. Answer three to question one
2. Question two:
a. Answer one to question two
c. Answer two to question two
e. Answer three to question two

What I have tried so far:

Reading the document content through Apache POI, like this:

fis = new FileInputStream(new File(FilePath));
XWPFDocument doc = new XWPFDocument(fis);
XWPFWordExtractor extract = new XWPFWordExtractor(doc);
String extractorText = extract.getText();

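So at this point I have the content of the document. Next, I tried to create a regex pattern that matches the number and dot at the start of a question (1., 12.) and continues until it reaches the colon, as follows: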

Pattern regexPattern = Pattern.compile("^(\\d|\\d\\d)+\\.[^:]+:\\s*$", Pattern.MULTILINE);
Matcher regexMatcher = regexPattern.matcher(extractorText);

However, when I try to loop through the result set, I cannot find any question text:

while (regexMatcher.find()) {
    System.out.println("Found");
    for (int i = 0; i < regexMatcher.groupCount() - 2; i += 2) {
        map.put(regexMatcher.group(i + 1), regexMatcher.group(i + 2));
        System.out.println("#" + regexMatcher.group(i + 1) + " >> " + regexMatcher.group(i + 2));
    }
}

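I am not sure where I went wrong, as I am new to Java, and I hope someone can help me.

Also, if anyone has a better approach for building a map of questions and their related answers, that would be much appreciated.

Thank you in advance.

Edit: I am trying to obtain something like a map that holds a key (the question text) and a list of strings representing the set of answers related to that question, for example: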

Map<String, List<String>> desiredResult = new HashMap<>();
desiredResult.entrySet().forEach((entry) -> {
    String       questionText = entry.getKey();
    List<String> answersList  = entry.getValue();

    System.out.println("Now at question: " + questionText);

    answersList.forEach((answerText) -> {
        System.out.println("Now at answer: " + answerText);
    });
});

This would produce the following output:

Now at question: 1. Question one:
Now at answer: a. Answer one to question one
Now at answer: b. Answer two to question one
Now at answer: c. Answer three to question one

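Best answer

After some thought, I came up with an answer. By splitting the document on line breaks, we get an array containing all of the lines.

When iterating over that array, we only need to determine whether a line is a question or an answer. I did this with 2 different regular expressions:

For questions: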

\d{1,2}\..+

For answers:

[a-z]\..+
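
As a quick check of these two patterns, here is a minimal sketch that classifies single lines. The class name `LineKind` and the method `classify` are hypothetical names introduced only for this illustration. It relies on `Matcher.matches()`, which requires the whole line to match, so no `^` or `$` anchors are needed.

```java
import java.util.regex.Pattern;

class LineKind {
    // The two patterns from the answer, compiled once and reused.
    private static final Pattern QUESTION = Pattern.compile("\\d{1,2}\\..+");
    private static final Pattern ANSWER = Pattern.compile("[a-z]\\..+");

    // Hypothetical helper: returns "question", "answer" or "other" for one line.
    static String classify(String line) {
        if (QUESTION.matcher(line).matches()) {
            return "question";
        }
        if (ANSWER.matcher(line).matches()) {
            return "answer";
        }
        return "other";
    }

    public static void main(String[] args) {
        String[] samples = {
            "Document Name",
            "1. Question one:",
            "a. Answer one to question one"
        };
        for (String sample : samples) {
            System.out.println(classify(sample) + ": " + sample);
        }
    }
}
```

Note that the answer pattern only accepts a lowercase letter, so a line such as "A. Some answer" would be reported as "other" unless the character class is widened.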

Based on this, we can decide whether to start a new question or whether the line needs to be added to the result.

The code is as follows:

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// the document text as read from the file
String document = "Document Name\n" +
    "1. Question one:\n" +
    "a. Answer one to question one\n" +
    "b. Answer two to question one\n" +
    "c. Answer three to question one\n" +
    "2. Question two:\n" +
    "a. Answer one to question two\n" +
    "c. Answer two to question two\n" +
    "e. Answer three to question two";

// split into lines, accepting both \n and \r\n line endings
String[] lines = document.split("\r?\n");

// the regex patterns; matches() below requires the whole line to match
Pattern questionPattern = Pattern.compile("\\d{1,2}\\..+");
Pattern answerPattern = Pattern.compile("[a-z]\\..+");

// the most recently seen question, used as the key for the answers that follow
String lastLine = null;

// the result; LinkedHashMap keeps the questions in document order
Map<String, List<String>> result = new LinkedHashMap<>();

for (int lineNumber = 0; lineNumber < lines.length; lineNumber++) {
    String line = lines[lineNumber];

    if (questionPattern.matcher(line).matches()) {
        // a new question starts: register it with an empty answer list
        result.put(line, new ArrayList<>());
        lastLine = line;
    } else if (lastLine != null && answerPattern.matcher(line).matches()) {
        // only attach an answer once a question has been seen
        result.get(lastLine).add(line);
    } else {
        System.out.printf("Line %d is neither a question nor an answer!%n", lineNumber);
    }
}
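
As a side note on the original attempt in the question: that pattern contains a single capturing group, so `groupCount()` returns 1 and the loop condition `i < groupCount() - 2` is false from the start. "Found" can be printed while the loop body never runs. The sketch below shows one way to capture the question number and text with two explicit groups. It assumes the extracted text uses ordinary line breaks, and the class name `QuestionGroups` and the method `findQuestions` are hypothetical names used only for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuestionGroups {
    // Group 1: the one- or two-digit question number.
    // Group 2: the question text up to (not including) the colon, kept on one line.
    private static final Pattern QUESTION =
            Pattern.compile("^(\\d{1,2})\\.\\s*([^:\\r\\n]+):\\s*$", Pattern.MULTILINE);

    // Hypothetical helper: maps question number to question text, in document order.
    static Map<String, String> findQuestions(String text) {
        Map<String, String> found = new LinkedHashMap<>();
        Matcher matcher = QUESTION.matcher(text);
        while (matcher.find()) {
            found.put(matcher.group(1), matcher.group(2).trim());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "Document Name\n"
                + "1. Question one:\n"
                + "a. Answer one to question one\n"
                + "2. Question two:\n"
                + "a. Answer one to question two\n";
        findQuestions(text).forEach((number, question) ->
                System.out.println("#" + number + " >> " + question));
    }
}
```

This only extracts the questions. Attaching the answers still needs a line-by-line pass like the one in the answer above.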

Regarding java - Custom Java regex: match starting with and ending with, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/51404331/
