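I'm trying to clean up file1.txt, whose lines always have the same format, using file2.txt, which contains the list of IP addresses I want to remove.
I have a working script, but I believe it can be improved to run faster.
My script: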
#!/bin/bash
IFS=$'\n'
for i in $(cat file1.txt); do
    for j in $(cat file2.txt); do
        echo ${i} | grep -v ${j}
    done
done
I tested the script with the following data set:
Number of lines in file1.txt = 10,000
Number of lines in file2.txt = 3
Script execution time:
real 0m31.236s
user 0m0.820s
sys 0m6.816s
file1.txt contents:
I3fSgGYBCBKtvxTb9EMz,1.1.2.3,45,This IP belongs to office space,1539760501,https://myoffice.com
I3fSgGYBCBKtvxTb9EMz,1.2.2.3,45,This IP belongs to office space,1539760502,https://myoffice.com
I3fSgGYBCBKtvxTb9EMz,1.3.2.3,45,This IP belongs to office space,1539760503,https://myoffice.com
I3fSgGYBCBKtvxTb9EMz,1.4.2.3,45,This IP belongs to office space,1539760504,https://myoffice.com
I3fSgGYBCBKtvxTb9EMz,1.5.2.3,45,This IP belongs to office space,1539760505,https://myoffice.com
... lots of other lines in the same format
I3fSgGYBCBKtvxTb9EMz,4.1.2.3,45,This IP belongs to office space,1539760501,https://myoffice.com
file2.txt contents:
1.1.2.3
1.2.2.3
... lots of other IPs here
1.2.3.9
How can I improve these timings?
I expect the files to grow over time. In my case I will run the script from cron every hour, so I'd like to make it faster.
You want to remove every line in file1.txt that contains a substring matching an entry in file2.txt. grep to the rescue:
grep -vFwf file2.txt file1.txt
-w is needed to prevent 11.11.11.11 from matching 111.11.11.111.
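The effect of -w can be checked with a small sketch. The two sample rows and the temporary directory below are made up for illustration and are not part of the question's data:

```shell
#!/bin/bash
# Illustrative data only: two rows whose IPs differ by one extra digit at each end.
tmpdir=$(mktemp -d)
printf '%s\n' 'a,11.11.11.11,45' 'b,111.11.11.111,45' > "$tmpdir/file1.txt"
printf '%s\n' '11.11.11.11' > "$tmpdir/file2.txt"

# Without -w the fixed string is also found inside 111.11.11.111,
# so both rows count as matches and -v prints nothing.
without_w=$(grep -vFf "$tmpdir/file2.txt" "$tmpdir/file1.txt")

# With -w the match must be delimited by non-word characters,
# so only the row holding exactly 11.11.11.11 is removed.
with_w=$(grep -vFwf "$tmpdir/file2.txt" "$tmpdir/file1.txt")

printf 'without -w: [%s]\n' "$without_w"
printf 'with -w:    [%s]\n' "$with_w"
rm -rf "$tmpdir"
```

Because the fields in file1.txt are separated by commas, which are not word characters, -w is enough to delimit the address in this format.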
-F, --fixed-strings, --fixed-regexp
    Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched. (-F is specified by POSIX, --fixed-regexp is an obsoleted alias, please do not use it in new scripts.)
-f FILE, --file=FILE
    Obtain patterns from FILE, one per line. The empty file contains zero patterns and therefore matches nothing. (-f is specified by POSIX.)
-w, --word-regexp
    Select only those lines containing matches that form whole words. The test is that the matching substring must either be at the beginning of the line or preceded by a non-word constituent character. Similarly, it must be either at the end of the line or followed by a non-word constituent character. Word-constituent characters are letters, digits, and the underscore.
source: man grep
As a further note, here are some tips for your script:
- Don't use a for loop to read a file (http://mywiki.wooledge.org/DontReadLinesWithFor).
- Don't use cat (see How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?)
- Use quotes! (see Bash and Quotes)
This allows us to rewrite it as:
#!/bin/bash
# Read both files line by line with while/read instead of for/cat.
while IFS=$'\n' read -r i; do
    while IFS=$'\n' read -r j; do
        echo "$i" | grep -v "$j"
    done < file2.txt
done < file1.txt
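The problem now is that you read file2.txt N times, where N is the number of lines in file1.txt. That is not very efficient. Fortunately, grep has a solution for us (see the top).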
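Since the script is meant to run hourly from cron and grep does not edit files in place, one possible wrapper writes the filtered output to a temporary file and then replaces file1.txt with it. This is only a sketch under assumptions: the function name clean_file and the temporary-file handling are hypothetical and not part of the answer above.

```shell
#!/bin/bash
# Hypothetical helper: remove from $1 every line containing an IP listed in $2.
clean_file() {
    local data=$1 ips=$2 tmp status
    # A missing or empty pattern file means there is nothing to remove.
    [ -s "$ips" ] || return 0
    # Create the temporary file next to the data file so mv stays on one filesystem.
    tmp=$(mktemp "${data}.XXXXXX") || return 1
    grep -vFwf "$ips" "$data" > "$tmp"
    status=$?
    # grep exits with 1 when it selects no lines (here: every line was removed);
    # only a status greater than 1 signals a real error.
    if [ "$status" -gt 1 ]; then
        rm -f "$tmp"
        return "$status"
    fi
    mv "$tmp" "$data"
}

clean_file file1.txt file2.txt
```

Note that mktemp creates the file with restrictive permissions, so a chmod may be needed after the mv if other users have to read file1.txt.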