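To clean up my corpus of forum messages, I would like to use two regular expressions: one to remove the leading spaces before a punctuation mark, and one to add a space after it when needed. The latter was no problem ((?<=[.,!?()])(?! )), but I am having trouble with the first.
I used this expression: \s([?.!,;:"](?:\s|$))
But it is not flexible enough:
- It matches even if there is already a space (or more) before the punctuation character
- It does not match if there is no space after the punctuation character
- It does not match any punctuation character that is not listed (though I suppose I could eventually use [:punct:] for that)
Finally, both expressions match decimal points, which they should not.
How can I rewrite the expression to meet my needs?
Example strings and expected output: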
This is the end .Hello world! # This is the end. Hello world! (remove the leading, add the trailing)
This is the end, Hello world! # This is the end, Hello world! (ok!)
This is the end . Hello world! # This is the end. Hello world! (remove the leading, ok the trailing)
This is a .15mm tube # This is a .15 mm tube (ok since it's a decimal point)
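The limitations listed above can be reproduced with a short R sketch. The two patterns are copied from the question; the replacement strings ("\\1" for the first, a single space for the second) and the variable names are assumptions made for illustration, since the question does not show how the patterns were applied.

```r
# Patterns copied from the question, escaped for R string literals.
attempt   <- "\\s([?.!,;:\"](?:\\s|$))"   # meant to drop the leading space
add_space <- "(?<=[.,!?()])(?! )"         # meant to add a trailing space

spaced   <- "This is the end . Hello world!"
unspaced <- "This is the end .Hello world!"
decimal  <- "This is a .15mm tube"

# Assumed replacement "\\1": keep the punctuation and whatever follows it.
fixed_spaced   <- gsub(attempt, "\\1", spaced, perl = TRUE)    # leading space removed
fixed_unspaced <- gsub(attempt, "\\1", unspaced, perl = TRUE)  # no match, string unchanged

# The trailing-space pattern has no notion of digits, so it also
# splits the decimal point from the number.
fixed_decimal <- gsub(add_space, " ", decimal, perl = TRUE)

print(c(fixed_spaced, fixed_unspaced, fixed_decimal))
```

The second call shows the case from the list above where the punctuation is not followed by a space, and the third shows the decimal-point problem.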
Best answer
Use \p{P} to match any Unicode punctuation character. Use \h* rather than \s*, because \s also matches line breaks.
(?<!\d)\h*(\p{P}+)\h*(?!\d)
Replace the matched string with \1<space>, that is, the captured punctuation followed by a single space.
> x <- c('This is the end .Stuff', 'This is the end, Stuff', 'This is the end . Stuff', 'This is a .15mm tube')
> gsub("(?<!\\d)\\h*(\\p{P}+)\\h*(?!\\d)", "\\1 ", x, perl = TRUE)
[1] "This is the end. Stuff" "This is the end, Stuff" "This is the end. Stuff"
[4] "This is a .15mm tube"
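As a rough sketch, the answer's substitution can be wrapped in a small function and run against the sample strings from the question. The function name fix_punct_spacing and the final trimws() call are assumptions, not part of the answer: because the replacement always appends a space, a string that ends in punctuation would otherwise gain a trailing space. The last few lines compare \h* with \s* on a two-line string.

```r
# Hypothetical helper; the pattern and replacement come from the answer.
fix_punct_spacing <- function(x) {
  out <- gsub("(?<!\\d)\\h*(\\p{P}+)\\h*(?!\\d)", "\\1 ", x, perl = TRUE)
  # Assumption (not in the answer): drop the space that is appended
  # when the string ends in punctuation.
  trimws(out, which = "right")
}

samples <- c(
  "This is the end .Hello world!",
  "This is the end, Hello world!",
  "This is the end . Hello world!",
  "This is a .15mm tube"
)
cleaned <- fix_punct_spacing(samples)
print(cleaned)

# \h* only consumes horizontal whitespace, so a line break survives;
# the \s* variant swallows it.
multi  <- "End of line .\nNext line"
with_h <- gsub("(?<!\\d)\\h*(\\p{P}+)\\h*(?!\\d)", "\\1 ", multi, perl = TRUE)
with_s <- gsub("(?<!\\d)\\s*(\\p{P}+)\\s*(?!\\d)", "\\1 ", multi, perl = TRUE)
print(grepl("\n", c(with_h, with_s), fixed = TRUE))
```

One caveat worth keeping in mind: \p{P} also covers apostrophes and hyphens, so words such as "it's" or "well-known" would be split by this pattern. A narrower character class may suit a real corpus better.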
A similar question about using regular expressions to remove leading spaces before punctuation and add a trailing space can be found on Stack Overflow: https://stackoverflow.com/questions/26543166/