linux - Fastest way to compare hundreds of thousands of files and create an output results file in bash

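I have the following:

- A values file, values.txt

- A directory structure: ./dataset/label/author/files.txt

- Tens of thousands of files.txt files

- A file named targets.txt that contains the location of each files.txt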

Example targets.txt

./dataset/tallperson/Jabba/awesome.txt
./dataset/fatperson/Detox/toxic.txt

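I have a file named values.txt that contains hundreds of thousands of lines of values. The values look like "aef", "; i", "jfk", and so on: random 3-character lines.

I also have tens of thousands of files, each with several hundred to several thousand lines. Each line is likewise a random 3-character string.

values.txt was built from the values in every files.txt, so no files.txt contains a value that is missing from values.txt. values.txt contains no duplicate values.

Example: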

./dataset/weirdperson/Crooked/file1.txt

LOL
hel
lo 
how
are
you
on 
thi
s f
ine
day

./dataset/awesomeperson/Mild/file2.txt

I a
m v
ery
goo
d. 
Tha
nks
LOL

values.txt

are
you
on 
thi
s f
ine
day
goo
d. 
Tha
hel
lo 
how
I a
m v
ery
nks
LOL

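The above is only sample data. Each file will contain hundreds of lines, and values.txt will contain hundreds of thousands of lines.

My goal is to produce a file in which each line corresponds to one file. Each line will contain N comma-separated values, where N is the number of lines in values.txt and the k-th value is the number of times the file contains the value on line k of values.txt.

The result should look like this. Line 1 is file1.txt and line 2 is file2.txt.

Results.txt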

1,1,1,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,
0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,1,1,1,

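One last thing: after getting this result I want to append a label. The label is the name of the file's Nth parent directory. For this example, assume it is the second parent directory, so the label would be something like "tallperson" or "shortperson". The new Results.txt would then look like this.

Results.txt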

1,1,1,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,weirdperson
0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,1,1,1,awesomeperson
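The "Nth parent directory" rule can be sketched in Python with pathlib. The helper name `label_for` and its parameter `n` are hypothetical, introduced only for illustration; n=2 corresponds to the example above.

```python
from pathlib import PurePosixPath

def label_for(path, n=2):
    """Return the name of the n-th parent directory of path (hypothetical helper).

    n=1 is the directory that holds the file, n=2 is that directory's parent.
    """
    # parents[0] is the containing directory, so the n-th parent is parents[n - 1].
    return PurePosixPath(path).parents[n - 1].name

print(label_for("./dataset/weirdperson/Crooked/file1.txt"))      # weirdperson
print(label_for("./dataset/awesomeperson/Mild/file2.txt", n=1))  # Mild
```

Taking the parent by position from the end of the path, rather than by field number from the start, keeps the label correct even if the dataset sits under a longer prefix.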

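I want a way to do all of this, but it has to be fast because I am working with a very large dataset.

Here is my current code, but it is too slow. The bottleneck is on line 2.

Script. Each file is located at "./dataset/label/author/file.java"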

1  while IFS= read -r file_name; do
2      cat values.txt | xargs -d '\n' -I {} grep -Fc -- "{}" "$file_name" | xargs printf "%d," >> Results.txt;
3      label=$(echo "$file_name" | cut -d '/' -f 3);
4      printf '%s\n' "$label" >> Results.txt;
5  done < targets.txt
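A rough count of process launches suggests why line 2 dominates: `xargs -I {}` runs one `grep` per line of values.txt, and that is repeated for every file in targets.txt. The figures below are assumed round numbers taken from the "tens of thousands" and "hundreds of thousands" ranges mentioned above, not measurements.

```python
# Assumed sizes, chosen only to illustrate the order of magnitude.
num_files = 10_000      # lines in targets.txt (assumption)
num_values = 100_000    # lines in values.txt (assumption)

# xargs -I {} starts one grep process per value, for every target file.
grep_launches = num_files * num_values
print(f"{grep_launches:,} grep processes")  # 1,000,000,000 grep processes
```

Reading each file once and counting its lines in memory would avoid that multiplication.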

----------------

To reproduce the problem, do the following:

mkdir -p dataset/{label1,label2}
touch file1.txt; chmod 777 file1.txt
touch file2.txt; chmod 777 file2.txt
echo "Enter anything here" > file1.txt
echo "Enter something here too" > file2.txt
mv file1.txt ./dataset/label1
mv file2.txt ./dataset/label2
find ./dataset/ -type f -name "*.txt" ! -name "modified-*" | while IFS= read -r file_name; do sed -e "s/.\{3\}/&\n/g" "$file_name" | sort -u > "$(dirname "$file_name")/modified-$(basename "$file_name")"; done
find ./dataset/ -type f -name "modified-*.txt" >> targets.txt
xargs cat < targets.txt | sort -u > values.txt

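With the above left unchanged, you should get a values.txt containing something like the following. If for some reason any line has fewer or more than 3 characters, delete that line.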

any
e
Ent
er 
eth
he
her
ing
ng 
re 
som
thi
too
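For readers who prefer to generate the sample data without sed, the 3-character splitting can be sketched in Python. The function name `chunks` is hypothetical; the filter at the end applies the rule above of dropping any line that is not exactly 3 characters long.

```python
def chunks(text, size=3):
    """Split text into consecutive pieces of `size` characters (the last may be shorter)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

lines = ["Enter anything here", "Enter something here too"]

# Union of all pieces across both inputs, like `sort -u` on the combined output.
values = {piece for line in lines for piece in chunks(line)}

# Keep only pieces that are exactly 3 characters long.
values = {v for v in values if len(v) == 3}

print(sorted(values))
```

Python's `sorted` orders by code point, so the order may differ from what `sort -u` prints under your locale; the set of values is what matters here.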

You should also get a targets.txt file:

./dataset/label2/modified-file2.txt
./dataset/label1/modified-file1.txt

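From here, the goal is to go through each file in targets.txt and count how many times that file contains each value from values.txt, then write the results, together with the label, to Results.txt.

The following script works for this example, but I need it to be much faster at scale.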

while IFS= read -r file_name; do
  cat values.txt | xargs -d '\n' -I {} grep -Fc -- "{}" "$file_name" | xargs printf "%d," >> Results.txt;
  label=$(echo "$file_name" | cut -d '/' -f 3);
  printf '%s\n' "$label" >> Results.txt;
done < targets.txt

Here is one more example.

Example 2:

./dataset/weirdperson/Crooked/file1.txt

LOL
LOL
HAHA

./dataset/awesomeperson/Mild/file2.txt

LOL
LOL
LOL

values.txt

LOL
HAHA

Results.txt

2,1,weirdperson
3,0,awesomeperson
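Example 2 can be checked in memory with a short Python sketch. It assumes a count means the number of lines in a file that are exactly equal to a value, which agrees with the expected Results.txt above; `count_row` is a hypothetical helper name.

```python
from collections import Counter

def count_row(values, file_lines):
    """Return one count per entry of values, in the order of values."""
    seen = Counter(file_lines)          # one pass over the file's lines
    return [seen[v] for v in values]    # a Counter returns 0 for missing keys

values = ["LOL", "HAHA"]
file1 = ["LOL", "LOL", "HAHA"]
file2 = ["LOL", "LOL", "LOL"]

print(",".join(map(str, count_row(values, file1))) + ",weirdperson")    # 2,1,weirdperson
print(",".join(map(str, count_row(values, file2))) + ",awesomeperson")  # 3,0,awesomeperson
```

Each file is scanned once regardless of how many values there are, so the work per file grows with the file's length plus the length of values.txt rather than with their product.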

Best answer

Here is a solution in Python, using its ordered dictionary data type.

import os
from collections import OrderedDict

# Read the samples from values.txt into an OrderedDict.
# Each key is one line of the file with only the newline stripped
# (other whitespace is kept, since values such as "on " end in a space).
# Each count starts at 0.

with open('values.txt', 'r') as f:
  samplecount0 = OrderedDict((sample.rstrip('\n'), 0) for sample in f)

# Get the list of filenames from targets.txt.

with open('targets.txt', 'r') as f:
  targets = [t.rstrip('\n') for t in f]

# For each target:
# read its lines of samples,
# increment the corresponding count in samplecount,
# print samplecount on a single line separated by commas,
# followed by the 2nd-to-last directory component of the target's pathname.

for target in targets:
  with open(target, 'r') as f:
    # Copy samplecount0 so values.txt does not have to be read again.
    samplecount = samplecount0.copy()
    # For each sample in the target file, increment its count.
    for tsample in f:
      tsample = tsample.rstrip('\n')
      # Skip lines that are not in values.txt instead of raising KeyError.
      if tsample in samplecount:
        samplecount[tsample] += 1
    output = ','.join(str(v) for v in samplecount.values())
    output += ',' + os.path.basename(os.path.dirname(os.path.dirname(target)))
    print(output)

Output:

$ python3  doit.py
1,1,1,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,weirdperson
0,0,0,0,0,0,0,1,1,1,0,0,0,1,1,1,1,1,awesomeperson

Regarding "linux - Fastest way to compare hundreds of thousands of files and create an output results file in bash", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56976664/
