java - Latin regex with symbols

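I need to split a text and keep only words, numbers and hyphenated compound words. I also need to get Latin words, so I used \p{L}, which gives me é, ú, ü, ã and so on. The example is: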

String myText = "Some latin text with symbols, ? 987 (A la pointe sud-est de l'île se dresse la cathédrale Notre-Dame qui fut lors de son achèvement en 1330 l'une des plus grandes cathédrales d'occident) : ! @ # $ % ^& * ( ) + - _ #$% \"  ' : ; > < / \\  | ,  here some is wrong… * + () e -";

Pattern pattern = Pattern.compile("[^\\p{L}+(\\-\\p{L}+)*\\d]+");
String words[] = pattern.split( myText );

What is wrong with this regex? Why does it match "(", "+", "-", "*" and "|"?

Some of the results are:

dresse     // OK
sud-est    // OK
occident)  // WRONG
987        // OK
()         // WRONG
(a         // WRONG
*          // WRONG
-          // WRONG
+          // WRONG
(          // WRONG
|          // WRONG

My reading of the regex is:

[^\p{L}+(\-\p{L}+)*\d]+

 * Word separator will be:
 *     [^  ...  ]  No sequence in:
 *     \p{L}+        Any latin letter
 *     (\-\p{L}+)*   Optionally hyphenated
 *     \d            or numbers
 *     [ ... ]+      once or more.

Best answer

If I understand your requirement correctly, this regex will match what you want:

"\\p{IsLatin}+(?:-\\p{IsLatin}+)*|\\d+"

It will match:

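A contiguous sequence of Unicode Latin script characters. I restrict it to Latin script because \p{L} will match a letter in any script. If your version of Java doesn't support the syntax, change \\p{IsLatin} to \\pL.
Or several such sequences joined by hyphens
Or a contiguous sequence of decimal digits (0-9)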
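For contrast, it may help to see why the pattern in the question behaves the way it does. Inside a character class [...], characters such as +, (, ) and * are literals rather than quantifiers or grouping, so [^\p{L}+(\-\p{L}+)*\d] simply means "any character that is not a letter, a digit, or one of + ( - ) *". Those symbols are therefore never treated as separators and stay attached to the words. The sketch below reproduces that with the question's pattern; the class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class CharacterClassDemo {
    // The pattern from the question, unchanged. Everything between [^ and ]
    // is a set of single characters, not a sequence with quantifiers.
    static final Pattern QUESTION_PATTERN =
            Pattern.compile("[^\\p{L}+(\\-\\p{L}+)*\\d]+");

    // Splits the text on runs of characters that are outside that set.
    static List<String> splitWithQuestionPattern(String text) {
        return Arrays.asList(QUESTION_PATTERN.split(text));
    }

    public static void main(String[] args) {
        // The parentheses and the asterisk survive because they are
        // listed inside the negated class.
        System.out.println(splitWithQuestionPattern("occident) : ! (a * b"));
    }
}
```

This is why matching the wanted tokens, as the regex in this answer does, is easier than describing the separators with a single negated class.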

To use the regex from this answer, pass it to Pattern.compile, call matcher(String input) to obtain a Matcher object, and loop to find the matches.

Pattern pattern = Pattern.compile("\\p{IsLatin}+(?:-\\p{IsLatin}+)*|\\d+");
Matcher matcher = pattern.matcher(inputString);

while (matcher.find()) {
    System.out.println(matcher.group());
}
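The loop above can also be wrapped into a self-contained program. This is a minimal sketch, assuming Java 7 or later for \p{IsLatin}; the class name LatinWordExtractor, the method extract and the shortened sample text are illustrative and not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LatinWordExtractor {
    // Latin-script words, optionally hyphenated, or runs of digits.
    private static final Pattern TOKEN =
            Pattern.compile("\\p{IsLatin}+(?:-\\p{IsLatin}+)*|\\d+");

    // Collects every match in order of appearance.
    static List<String> extract(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file ASCII-only:
        // U+00EE is i-circumflex, U+00E9 is e-acute.
        String sample =
                "987 (sud-est de l'\u00eele, cath\u00e9drale Notre-Dame en 1330) + - * |";
        System.out.println(extract(sample));
    }
}
```

Note that with this pattern the apostrophe splits l'île into two tokens, l and île, while stray symbols such as + - * | produce no tokens at all.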

If you want to allow words with an apostrophe ':

"\\p{IsLatin}+(?:['\\-]\\p{IsLatin}+)*|\\d+"

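I also escaped - in the character class ['\\-] in case you want to add more characters to it. Strictly speaking, - does not need escaping when it is the first or last character in a character class, but I escape it anyway to be safe.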
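A short sketch of the apostrophe variant, comparing the escaped hyphen with an unescaped hyphen in last position. The class name ApostropheWords, the helper extract and the sample sentence are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ApostropheWords {
    // Hyphen escaped inside the character class, as in the answer.
    static final Pattern ESCAPED =
            Pattern.compile("\\p{IsLatin}+(?:['\\-]\\p{IsLatin}+)*|\\d+");

    // Hyphen left unescaped in last position; it is still a literal hyphen.
    static final Pattern UNESCAPED_LAST =
            Pattern.compile("\\p{IsLatin}+(?:['-]\\p{IsLatin}+)*|\\d+");

    // Collects every match of the given pattern in order of appearance.
    static List<String> extract(Pattern pattern, String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // U+00E9 is e-acute, written as an escape to keep the source ASCII-only.
        String sample = "l'une des cath\u00e9drales d'occident, Notre-Dame - 1330";
        System.out.println(extract(ESCAPED, sample));
        System.out.println(extract(UNESCAPED_LAST, sample));
    }
}
```

Both patterns yield the same tokens here. The escaped form only starts to matter once more characters are appended after the hyphen, where an unescaped - would be read as a range.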

Regarding java - Latin regex with symbols, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/14833001/
