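regex - How do I find and remove duplicate lines from a file using regular expressions?

Tags: regex

This question is language-agnostic. Using only regular expressions, can I find and replace duplicate lines in a file?

Consider the following sample input and the output I want:

Input>>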

11
22
22  <-duplicate
33
44
44  <-duplicate
55

Output>>

11
22
33
44
55

Best answer

Regular-expressions.info has a page on Deleting Duplicate Lines From a File.

This basically boils down to searching for this one-liner:

^(.*)(\r?\n\1)+$

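...and replacing it with \1
Note: the dot must not match line breaks, and ^ and $ must match at the start and end of every line (multiline mode), not only at the start and end of the file.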
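
The question is language-agnostic, so the following is only a minimal sketch of the search-and-replace, assuming Python's `re` module. The function name `remove_adjacent_duplicates` is made up for this illustration. `re.MULTILINE` makes `^` and `$` match at every line, and `re.DOTALL` is deliberately left off so the dot stops at line breaks. Note that the regex only collapses duplicates that sit directly next to each other.

```python
import re

# ^ and $ must match at each line, so MULTILINE is required.
# DOTALL is NOT set, so the dot cannot cross a line break.
DUPLICATE_LINES = re.compile(r"^(.*)(\r?\n\1)+$", re.MULTILINE)


def remove_adjacent_duplicates(text: str) -> str:
    """Collapse runs of identical consecutive lines into a single line."""
    return DUPLICATE_LINES.sub(r"\1", text)


sample = "11\n22\n22\n33\n44\n44\n55"
print(repr(remove_adjacent_duplicates(sample)))  # '11\n22\n33\n44\n55'
```

Duplicates that are separated by other lines are left alone, so a file would have to be sorted first if all duplicates should go.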

Explanation:

The caret will match only at the start of a line. So the regex engine will only attempt to match the remainder of the regex there. The dot and star combination simply matches an entire line, whatever its contents, if any. The parentheses store the matched line into the first backreference.

Next we will match the line separator. I put the question mark into \r?\n to make this regex work with both Windows (\r\n) and UNIX (\n) text files. So up to this point we matched a line and the following line break.
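
As a rough check of the line-separator handling, the sketch below runs the same pattern over a UNIX-style and a Windows-style string. It assumes Python's `re` module, and `dedupe` is a hypothetical helper name. One engine-specific detail: in Python the dot also matches `\r`, so on Windows text the captured line may include the trailing `\r`. The result is still the expected one.

```python
import re

DUPLICATE_LINES = re.compile(r"^(.*)(\r?\n\1)+$", re.MULTILINE)


def dedupe(text: str) -> str:
    """Replace each run of identical adjacent lines with one copy."""
    return DUPLICATE_LINES.sub(r"\1", text)


unix_text = "a\na\nb"
windows_text = "a\r\na\r\nb"

print(repr(dedupe(unix_text)))     # 'a\nb'
print(repr(dedupe(windows_text)))  # 'a\r\nb'
```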

Now we need to check if this combination is followed by a duplicate of that same line. We do this simply with \1. This is the first backreference which holds the line we matched. The backreference will match that very same text.

If the backreference fails to match, the regex match and the backreference are discarded, and the regex engine tries again at the start of the next line. If the backreference succeeds, the plus symbol in the regular expression will try to match additional copies of the line. Finally, the dollar symbol forces the regex engine to check if the text matched by the backreference is a complete line. We already know the text matched by the backreference is preceded by a line break (matched by \r?\n). Therefore, we now check if it is also followed by a line break or if it is at the end of the file using the dollar sign.
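
To see why the closing dollar sign matters, the following sketch compares the regex with and without it, again assuming Python's `re` module. The names `STRICT` and `LOOSE` are invented for the comparison. Without `$`, the backreference is satisfied by the first characters of a longer line, and the shorter line is wrongly swallowed.

```python
import re

STRICT = re.compile(r"^(.*)(\r?\n\1)+$", re.MULTILINE)  # regex from the answer
LOOSE = re.compile(r"^(.*)(\r?\n\1)+", re.MULTILINE)    # same, but without $

# "22" is followed by "223", which merely starts with "22".
text = "22\n223\n33"

print(repr(STRICT.sub(r"\1", text)))  # '22\n223\n33' (unchanged, no duplicates)
print(repr(LOOSE.sub(r"\1", text)))   # '223\n33' (the line "22" is lost)
```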

The entire match becomes line\nline (or line\nline\nline etc.). Because we are doing a search and replace, the line, its duplicates, and the line breaks in between them, are all deleted from the file. Since we want to keep the original line, but not the duplicates, we use \1 as the replacement text to put the original line back in.
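
The difference between the entire match and the first backreference can be inspected directly. This is a small sketch assuming Python's `re` module, with a line that occurs three times in a row so the plus sign has more than one copy to consume.

```python
import re

DUPLICATE_LINES = re.compile(r"^(.*)(\r?\n\1)+$", re.MULTILINE)

text = "11\n22\n22\n22\n33"
match = DUPLICATE_LINES.search(text)

# group(0) is everything the replacement removes: the line, its copies,
# and the line breaks between them.
print(repr(match.group(0)))  # '22\n22\n22'

# group(1) is the original line that the replacement text \1 puts back.
print(repr(match.group(1)))  # '22'

print(repr(DUPLICATE_LINES.sub(r"\1", text)))  # '11\n22\n33'
```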

Regarding "regex - How do I find and remove duplicate lines from a file using regular expressions?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/1573361/
