java - Using a regular expression to extract dialogue snippets from text

I am trying to extract dialogue snippets from the text of a book. For example, if I have the string

"What's the matter with the flag?" inquired Captain MacWhirr. "Seems all right to me."

then I want to extract "What's the matter with the flag?" and "Seems all right to me.".

I found a regular expression for this elsewhere, namely "[^"\\]*(\\.[^"\\]*)*". It works great in Eclipse when I run a Ctrl+F regex search on my book's .txt file, but when I run the following code:

String regex = "\"[^\"\\\\]*(\\\\.[^\"\\\\]*)*\"";
String bookText = "\"What's the matter with the flag?\" inquired Captain MacWhirr. \"Seems all right to me.\"";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(bookText);

if(m.find())
 System.out.println(m.group(1));

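the only thing printed is null. So am I converting the regex to a Java String incorrectly? Do I need to account for the fact that a double quote in a Java String is written as \"?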

Best answer

In natural-language text, a " is unlikely to be escaped by a preceding backslash, so you should be able to use just the pattern "([^"]*)".

As a Java string literal, this is "\"([^\"]*)\"".

In Java:

String regex = "\"([^\"]*)\"";
String bookText = "\"What's the matter with the flag?\" inquired Captain MacWhirr. \"Seems all right to me.\"";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(bookText);

while (m.find()) {
    System.out.println(m.group(1));
}

The above prints:

What's the matter with the flag?
Seems all right to me.
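
If the fragments are needed for further processing rather than printing, the same loop can collect them into a list. The following is a minimal sketch under that assumption; the class name DialogueExtractor and the method extractDialogue are hypothetical names chosen for illustration, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DialogueExtractor {
    // Matches a double-quoted run and captures the text between the quotes.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    // Returns every quoted fragment in the text, in order, without the quotes.
    public static List<String> extractDialogue(String text) {
        List<String> fragments = new ArrayList<>();
        Matcher m = QUOTED.matcher(text);
        while (m.find()) {
            fragments.add(m.group(1));
        }
        return fragments;
    }

    public static void main(String[] args) {
        String bookText = "\"What's the matter with the flag?\" inquired Captain MacWhirr. \"Seems all right to me.\"";
        for (String fragment : extractDialogue(bookText)) {
            System.out.println(fragment);
        }
    }
}
```

Compiling the pattern once into a static field avoids recompiling it for every call, which may matter when scanning a whole book chapter by chapter.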

On escape sequences

Given this declaration:

String s = "\"";
System.out.println(s.length()); // prints "1"

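The string s contains only one character, ". The \" is an escape sequence at the Java source-code level; the string itself contains no backslash.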
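
The same reasoning applies to regex literals: printing the string shows exactly what the regex engine receives. The sketch below is illustrative only; the class name EscapeDemo and its constants are hypothetical.

```java
public class EscapeDemo {
    // Source-level escapes are resolved by the compiler before the regex engine sees the string.
    public static final String SIMPLE = "\"([^\"]*)\"";

    // Four backslashes in source become two characters, which the regex engine
    // reads as "one literal backslash".
    public static final String BACKSLASH = "\\\\";

    public static void main(String[] args) {
        System.out.println(SIMPLE);            // "([^"]*)"
        System.out.println(SIMPLE.length());   // 9

        System.out.println(BACKSLASH.length());        // 2
        System.out.println("\\".matches(BACKSLASH));   // true: a one-backslash string matches
    }
}
```

This is why the pattern [^"\\] from the question turns into [^\"\\\\] once it is written as a Java string literal: each backslash is doubled, and each quote gets a backslash of its own.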

See also

> JLS 3.10.6 Escape Sequences for Character and String Literals

The problem with the original code

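There is actually nothing wrong with the pattern itself, but you are not capturing the right part. Group 1 (\1) does not capture the quoted text; it wraps only the optional part that handles backslash escapes. Here is the pattern with the correct capturing group: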

String regex = "\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\"";
String bookText = "\"What's the matter?\" inquired Captain MacWhirr. \"Seems all right to me.\"";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(bookText);

while (m.find()) {
    System.out.println(m.group(1));
}

For visual comparison, here is the original pattern as a Java string literal:

String regex = "\"[^\"\\\\]*(\\\\.[^\"\\\\]*)*\""
                            ^^^^^^^^^^^^^^^^^
                           why capture this part?
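
To see the effect of that misplaced group, the sketch below runs the original pattern on one quoted sentence and prints both the whole match and group 1. The class name OriginalPatternDemo is a hypothetical name for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OriginalPatternDemo {
    // The pattern from the question, unchanged.
    public static final Pattern ORIGINAL =
            Pattern.compile("\"[^\"\\\\]*(\\\\.[^\"\\\\]*)*\"");

    public static void main(String[] args) {
        Matcher m = ORIGINAL.matcher("\"Seems all right to me.\"");
        if (m.find()) {
            // group() is the whole match, including the surrounding quotes.
            System.out.println(m.group());
            // Group 1 only matches after a backslash; the text has none, so the
            // group never participates and Matcher.group(1) returns null.
            System.out.println(m.group(1));
        }
    }
}
```

So the match itself succeeds; the null comes from asking for a group that matched nothing.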

And here is the modified pattern:

String regex = "\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\""
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                    we want to capture this part!

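That said, as mentioned earlier: natural-language text does not need such a complicated pattern, because it is unlikely to contain escaped quotes.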
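
For input that does contain backslash-escaped quotes (source code or JSON-like text, for instance), the two patterns behave differently. The following sketch compares them on a made-up sample string; the class EscapedQuoteDemo, the helper captures, and the sample text are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapedQuoteDemo {
    public static final Pattern SIMPLE = Pattern.compile("\"([^\"]*)\"");
    public static final Pattern ESCAPE_AWARE =
            Pattern.compile("\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\"");

    // Collects group 1 of every match of the given pattern in the text.
    public static List<String> captures(Pattern p, String text) {
        List<String> result = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // At runtime the text is: "He said \"hi\" to me." and left.
        String text = "\"He said \\\"hi\\\" to me.\" and left.";

        // The simple pattern stops at the first escaped quote and splits the sentence.
        System.out.println(captures(SIMPLE, text));

        // The escape-aware pattern treats \" as part of the quoted text.
        System.out.println(captures(ESCAPE_AWARE, text));
    }
}
```

With the simple pattern the sentence is cut into two partial fragments, while the escape-aware pattern returns it as one fragment with the backslashes still in place.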

See also

> regular-expressions.info/Grouping and backreferences

This question, "java - Using a regular expression to extract dialogue snippets from text", comes from Stack Overflow: https://stackoverflow.com/questions/2947502/
