regex - How to optimize a grep regular expression to match URLs

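Background:

I have a directory named "stuff" on Mac OS 10.7.5 that contains 26 files (2 .txt and 24 .rtf).

I am using grep (GNU v2.5.1) to find every string in these 26 files that matches the structure of a URL, and then print those strings to a new file (output.txt).

The regex below does work on a small scale. I ran it on a directory containing 3 files (1 .rtf and 2 .txt), a bunch of dummy text and 30 URLs, and it completed successfully in less than 1 second.

I am using the following regex:

1. grep -iIrPoh 'https?://.+?\s' . --include=*.txt --include=*.rtf > output.txt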

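Problem

My directory "stuff" is currently 180 KB in size and contains 26 files. In Terminal, I cd into this directory (stuff) and run my regex. I waited about 15 minutes and then killed the process because it had not finished. When I looked at the output.txt file, it was a whopping 19.75 GB (screenshot).

Questions

What is causing the output.txt file to be so many orders of magnitude larger than the entire directory?

What else could I add to my regex to cut the processing time?

Thanks in advance for any guidance you can offer. I have been working on many different variations of my regex for almost 16 hours and have read tons of posts online, but nothing seems to help. I am new to writing regex, but with a little hand-holding I think I will get it.

Additional comments

I ran the following command to see what was being recorded in the output.txt (19.75 GB) file. It looks like the regex is finding the right strings, except for what I think are odd characters: curly braces } { and a string like {\fldrslt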

    **TERMINAL**
    $ head -n 100 output.txt
    http://michacardenas.org/\
    http://culturelab.asc.upenn.edu/2013/03/06/calling-all-wearable-electronics-hackers-e-textile-makers-and-fashion-activists/\
    http://www.mumia-themovie.com/"}}{\fldrslt 
    http://www.mumia-themovie.com/}}\
    http://www.youtube.com/watch?v=Rvk2dAYkHW8\
    http://seniorfitnesssite.com/category/senior-fitness-exercises\
    http://www.giac.org/ 
    http://www.youtube.com/watch?v=deOCqGMFFBE"}}{\fldrslt 
    http://www.youtube.com/watch?v=deOCqGMFFBE}}
    https://angel.co/jason-a-hoffman\
    https://angel.co/joyent?save_req=mention_slugs"}}{\fldrslt 
    http://www.cooking-hacks.com/index.php/ehealth-sensors-complete-kit-biometric-medical-arduino-raspberry-pi.html"}}{\fldrslt 
    http://www.cooking-hacks.com/index.php/ehealth-sensors-complete-kit-biometric-medical-arduino-raspberry-pi.html}} 
    http://www.cooking-hacks.com/index.php/documentation/tutorials/ehealth-biometric-sensor-platform-arduino-raspberry-pi-medical"}}{\fldrslt 
    http://www.cooking-hacks.com/index.php/documentation

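Catalog of regex commands tested so far

2. grep -iIrPoh 'https?://\S+' . --include=*.txt --include=*.rtf > output.txt
FAIL: takes 1 second to run and produces a blank file (output_2.txt).

3. grep -iIroh 'https?://\S+' . --include=*.txt --include=*.rtf > output.txt
FAIL: takes 1 second to run and produces a blank file (output_3.txt).

4. grep -iIrPoh 'https?://\S+\s' . --include=*.txt --include=*.rtf > sixth.txt
FAIL: takes 1 second to run and produces a blank file (output_4.txt).

5. grep -iIroh 'https?://' . --include=*.txt --include=*.rtf > output.txt
FAIL: takes 1 second to run and produces a blank file (output_5.txt).

6. grep -iIroh 'https?://\S' . --include=*.txt --include=*.rtf > output.txt
FAIL: takes 1 second to run and produces a blank file (output_6.txt).

7. grep -iIroh 'https?://[\w~#%&_+=,.?/-]+' . --include=*.txt --include=*.rtf > output.txt
FAIL: takes 1 second to run and produces a blank file (output_7.txt).

8. grep -iIrPoh 'https?://[\w~#%&_+=,.?/-]+' . --include=*.txt --include=*.rtf > output.txt
FAIL: I let it run for 10 minutes and killed the process manually; it produced a 20.63 GB file (output_8.txt). On the plus side, the strings captured by this regex were accurate in that they did not include any odd extra characters such as curly braces or the RTF file format syntax {\fldrslt

9. find . -print | grep -iIPoh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf > output_9.txt
FAIL: takes 1 second to run and produces a blank file (output_9.txt).

10. find . -print | grep -iIrPoh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf > output_10.txt
FAIL: takes 1 second to run and produces a blank file (output_10.txt).

11. grep -iIroh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf
Editor's note: this regex only worked properly when I sent the strings to the terminal window. It did not work when I sent the output to the file output_11.txt.
NEAR SUCCESS: all URL strings were cut cleanly, with the whitespace before and after each string removed and all special markup associated with the .RTF format removed. Downside: of the sample URLs I checked for accuracy, some were cut short and lost the end of their structure. I estimate that about 10% of the strings were truncated incorrectly.
Example of a truncated string:
URL structure before the regex: http://www.youtube.com/watch?v=deOCqGMFFBE
URL structure after the regex: http://www.youtube.com/watch?v=de
The question now becomes:
1.) Is there a way to make sure we do not cut off part of the URL string, as in the example above?
2.) Would it help to define an escape command for the regex? (If that is even possible.)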
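12. grep -iIroh 'https?:\/\/[\w~#%&_+=,.?\/-]+' . --include=*.txt --include=*.rtf > output_12.txt
FAIL: takes 1 second to run and produces a blank file (output_12.txt).

13. grep -iIroh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf > tmp/output.txt
FAIL: I let it run for 2 minutes and killed the process manually; it produced a 1 GB file. The purpose of this command was to isolate grep's output file (output.txt) in a subdirectory, to make sure we were not creating an infinite loop in which grep reads back its own output. Nice idea, but no cigar (screenshot).

14. grep -iIroh 'https\?://[a-z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf
FAIL: same result as #11. The command resulted in an infinite loop with truncated strings.

15. grep -Iroh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf
ALMOST A WINNER: this captured the entire URL string. It did result in an infinite loop that created millions of strings in the terminal, but I can manually identify where the first loop starts and ends, so this should be fine. Great job @acheong87! Thank you!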
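16. find . -print | grep -v output.txt | xargs grep -Iroh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' --include=*.txt --include=*.rtf > output.txt
NEAR SUCCESS: I was able to grab the entire URL string, which is good. However, the command turned into an infinite loop. After about 5 seconds of running with output to the terminal, it had produced about 1 million URL strings, all of them duplicates. This would be a good expression if we could figure out how to make it stop after a single pass.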
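17. ls *.rtf *.txt | grep -v 'output.txt' | xargs -J {} grep -iIF 'http' {} grep -iIFo > output.txt
NEAR SUCCESS: this command made a single pass over all files in the directory, which nicely solved the infinite loop problem. However, the URL strings were truncated and included the name of the file they came from.

18. ls *.rtf *.txt | grep -v 'output.txt' | xargs grep -iIohE 'https?://[^[:space:]]+'
NEAR SUCCESS: this expression prevented the infinite loop, which is good, and it created a new file in the directory it was querying that was small, about 30 KB. It captured all the proper characters in each string plus a few that are not needed. As Floris mentioned, where a URL is not terminated by a space, for example http://www.mumia-themovie.com/"}}{\fldrslt, it captured the markup syntax.

19. ls *.rtf *.txt | grep -v 'output.txt' | xargs grep -iIohE 'https?://[a-z./?#=%_-,~&]+'
FAIL: this expression prevented the infinite loop, which is good, but it did not capture the entire URL string.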
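
One likely reason the output kept growing in tests 1, 8 and 13: a recursive grep with --include=*.txt also matches output.txt itself, including a copy placed in a subdirectory of the searched tree, so grep can end up reading what it has just written. The following is a minimal sketch on throwaway data that only lists which files a recursive search would read. It assumes a grep that supports -r and --include; the scratch directory, file names and URL are made up for illustration.

```shell
#!/bin/sh
# Hypothetical scratch tree, for illustration only.
demo=$(mktemp -d)
cd "$demo" || exit 1
mkdir tmp
printf 'visit http://example.com/a\n' > notes.txt
# Stands in for output written by an earlier (or the current) run.
printf 'http://example.com/a\n' > tmp/output.txt

# List the files a recursive search would read. tmp/output.txt is among
# them, so moving the output into a subdirectory (test 13) does not
# isolate it from the search.
searched=$(grep -rlI 'http' . --include='*.txt')
printf '%s\n' "$searched"
```

Writing the result to a path outside the searched tree, or excluding the output file by name, keeps grep from feeding on its own output.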

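Best answer

The expression I gave in the comments (your test 17) was intended to test two things:

1) can we make the infinite loop go away
2) can we loop over all the files in the directory cleanly

I believe we achieved both. So now I am audacious enough to propose a "solution":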

ls *.rtf *.txt | grep -v 'output.txt' | xargs grep -iIohE 'https?://[^[:space:]]+'

Breaking it down:

ls *.rtf *.txt         - list all .rtf and .txt files
grep -v 'output.txt'   - skip 'output.txt' (in case it was left from a previous attempt)
xargs                  - take the file names from the input and append them as
                         arguments at the end of the following command
                         (or use -J xxx to substitute them at the place of xxx anywhere in the command)
grep -i                - case insensitive
     -I                - skip binary (shouldn't have any since we only process .txt and .rtf...)
     -o                - print only the matched bit (not the entire line), i.e. just the URL
     -h                - don't include the name of the source file
     -E                - use extended regular expressions 

     'http             - match starts with http (there are many other URLs possible... but out of scope for this question)
      s?               - next character may be an s, or is not there
      ://              - literal characters that must be there
      [^[:space:]]+    - one or more "non space" characters (greedy... "as many as possible")
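
To see the pieces working together, here is a minimal sketch that runs the same pipeline in a throwaway directory. It assumes a grep that supports -o and -E; the scratch directory, the file names, the sample URLs and the result file urls.out are all made up for illustration. A leftover output.txt is planted on purpose to show that it gets skipped.

```shell
#!/bin/sh
# Hypothetical scratch directory and sample files, for illustration only.
demo=$(mktemp -d)
cd "$demo" || exit 1

printf 'see http://example.com/a and https://example.org/b?x=1 today\n' > notes.txt
printf '{\\rtf1 plain text http://example.net/c here}\n' > letter.rtf
# A leftover from an earlier run; it must not be searched again.
printf 'http://stale.invalid/old\n' > output.txt

# Same pipeline as above: list the inputs, drop output.txt, extract URLs.
# The result goes to urls.out, which matches neither *.rtf nor *.txt.
ls *.rtf *.txt | grep -v 'output.txt' \
  | xargs grep -iIohE 'https?://[^[:space:]]+' > urls.out

cat urls.out
```

The result file should contain the three sample URLs from notes.txt and letter.rtf and nothing from the stale output.txt. Note that piping ls into xargs breaks on file names that contain spaces, so this sketch only suits simple names like the ones here.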

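This seems to work OK on a very simple set of files and URLs. I think that now that the iteration problem is solved, the rest is easy. There are tons of "URL validation" regexes online. Pick any one of them... the expression above really just searches for "everything after http, up to a space". If you end up with odd or missing matches, let us know.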
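
If the "up to a space" rule picks up RTF markup, one option is to whitelist the characters allowed in a URL instead. The sketch below borrows the character class from test 15 of the question and adds sort -u to collapse duplicates; it is an illustration under assumptions, not a complete URL validator. The RTF fragment is a hypothetical reconstruction modeled on the markup visible in the question's output, and the real files may differ.

```shell
#!/bin/sh
# Hypothetical RTF fragment, modeled on the hyperlink markup visible in the
# question's output; the real files may differ.
sample=$(mktemp)
printf '%s\n' '{\field{\*\fldinst{HYPERLINK "http://www.mumia-themovie.com/"}}{\fldrslt http://www.mumia-themovie.com/}}\' > "$sample"

# "Everything up to whitespace" keeps the trailing markup.
loose=$(grep -ohE 'https?://[^[:space:]]+' "$sample")

# Whitelisting URL characters stops at the first quote, brace or backslash;
# sort -u then collapses the duplicate (link target and link text).
strict=$(grep -ohE 'https?://[a-zA-Z0-9~#%&_+=,.?/-]+' "$sample" | sort -u)

printf 'loose:\n%s\n' "$loose"
printf 'strict:\n%s\n' "$strict"
```

The hyphen is kept last inside the brackets so that it is read as a literal character; in the middle, as in the `_-,` of test 19, it is read as a range operator. This class has no colon, so a URL with an explicit port would be cut short; extend the class if your data needs it.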

Regarding "regex - How to optimize a grep regular expression to match URLs", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/20036994/
