linux - sed second occurrence of a string - for all lines in an external file (Linux)

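I have a file in which the values in column 2 need to be renamed. Across the whole ~5 million-line file (with.duplicates) there are about 8k duplicated values (listed in the file list.of.duplicates).

Input dataset: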

with.duplicates

1 rs143225517 0 751756 C T
1 rs146277091 0 752478 G
1 rs3094315 0 752566 G A
1 rs149886465 0 752617 A C
1 rs3131972 0 752721 G
1 rs3131972 0 752721 ATG
1 rs3131971 0 752894 T C
1 rs61770173 0 753405 C A
1 rs2073814 0 753474 CG
1 rs2073813 0 753541 G
1 rs12184325 0 754105 TC

list.of.duplicates

rs3131972
rs4310388
rs7529459
rs905135
rs9786995
rs12065710
rs6426404
rs12759849
rs6603823

Code I have tried

This does exactly what I want, but it is inefficient and only handles a single replacement:

sed -i '0,/rs3131972/! s/rs3131972/qrs3131972/' with.duplicates

But I don't know how to loop over the whole list of duplicated values:

i=0 
while ((i++)); 
read -r snp 
do 
sed -i '0,/${snp}/! s/${snp}/q${snp}/' with.duplicates 
done < list.of.duplicates

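I have found partial answers all over this site, but none that puts everything together into a working script.

Thanks in advance for your help!

Looking for a solution in Linux or R.

Edit:

Desired output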

1 rs143225517 0 751756 C T
1 rs146277091 0 752478 G
1 rs3094315 0 752566 G A
1 rs149886465 0 752617 A C
1 rs3131972 0 752721 G
1 qrs3131972 0 752721 ATG
1 rs3131971 0 752894 T C
1 rs61770173 0 753405 C A
1 rs2073814 0 753474 CG
1 rs2073813 0 753541 G
1 rs12184325 0 754105 TC

Best answer

Well, awk can handle this on its own. You don't need a loop.

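awk '(FNR==NR) { d[$1]; next }          # first file: remember each duplicated ID
     # second file: on the second occurrence of a listed ID, drop it from d, then prefix "q"
     ($2 in d) && !(++d[$2]-2) { delete d[$2]; $2 = "q" $2 }
     1' list.of.duplicates with.duplicates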
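
As a quick check, the approach can be tried on a small sample before running it on the full file. The sketch below is only an illustration: it recreates a few rows of the question's data in a scratch directory (the `tmp` variable, the shortened file contents, and the output name `renamed.txt` are assumptions, not part of the original answer) and writes the result to a new file instead of editing in place. The condition is written as `++d[$2] == 2`, which is equivalent to `!(++d[$2]-2)`.

```shell
#!/bin/sh
# Illustrative run on a shortened copy of the question's data.
tmp=$(mktemp -d)    # hypothetical scratch directory

cat > "$tmp/with.duplicates" <<'EOF'
1 rs3094315 0 752566 G A
1 rs3131972 0 752721 G
1 rs3131972 0 752721 ATG
1 rs3131971 0 752894 T C
EOF

cat > "$tmp/list.of.duplicates" <<'EOF'
rs3131972
rs4310388
EOF

# Prefix "q" to column 2 on the second occurrence of each listed ID.
# The key is deleted before $2 is changed, so later occurrences are left alone.
awk 'FNR==NR { d[$1]; next }
     ($2 in d) && ++d[$2] == 2 { delete d[$2]; $2 = "q" $2 }
     1' "$tmp/list.of.duplicates" "$tmp/with.duplicates" > "$tmp/renamed.txt"

cat "$tmp/renamed.txt"
```

On this sample only the third row changes, to `1 qrs3131972 0 752721 ATG`. Note that assigning to `$2` makes awk rebuild the line with single spaces between fields, which matters only if the real file uses tabs or aligned columns.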

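Can it be modified so that, instead of adding the "q" to the second column of the second occurrence, it adds the "q" to the second column of the longer line?

Yes, but it is less efficient than the version above, because it reads with.duplicates twice and relies on ARGIND, a GNU awk extension.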

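awk '(ARGIND==1) { d[$1]; next }                 # file 1: remember each duplicated ID
     (ARGIND==2) {                               # first read of the data: compare line lengths
         if ($2 in d) {
             # first occurrence longer than the second: start the count at 1
             if ($2 in r) { if (length(r[$2]) > length()) d[$2]++; delete r[$2] }
             else { r[$2] = $0 }
         } next }
     # second read: rename the occurrence that brings the count to 2
     ($2 in d) && !(++d[$2]-2) { delete d[$2]; $2 = "q" $2 }
     1' list.of.duplicates with.duplicates with.duplicates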
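
For awk implementations without ARGIND, the same idea can probably be expressed by counting files as they start. The sketch below is not from the original answer: the function name `mark_longer` and the counter `pass` are hypothetical, and it assumes that neither input file is empty (an empty file would not trigger `FNR == 1`, so the count would be off).

```shell
#!/bin/sh
# Sketch: mark the longer of two duplicate lines without the gawk-only ARGIND.
# "pass" is a hypothetical counter incremented at the first line of each file.
mark_longer() {
    # $1 = list of duplicated IDs, $2 = data file (read twice)
    awk 'FNR == 1 { pass++ }
         pass == 1 { d[$1]; next }                 # list of duplicated IDs
         pass == 2 {                               # first read: compare line lengths
             if ($2 in d) {
                 if ($2 in r) { if (length(r[$2]) > length($0)) d[$2]++; delete r[$2] }
                 else r[$2] = $0
             }
             next
         }
         # second read: rename the occurrence that brings the count to 2
         ($2 in d) && ++d[$2] == 2 { delete d[$2]; $2 = "q" $2 }
         1' "$1" "$2" "$2"
}

# Usage on a two-line sample where the first occurrence is the longer one.
tmp=$(mktemp -d)    # hypothetical scratch directory
printf '%s\n' 'rs3131972' > "$tmp/list.of.duplicates"
printf '%s\n' '1 rs3131972 0 752721 ATG' '1 rs3131972 0 752721 G' > "$tmp/with.duplicates"
mark_longer "$tmp/list.of.duplicates" "$tmp/with.duplicates"
```

Here the first line is the longer one, so it is the one that receives the `q` prefix; with the order from the question (short line first), the second line is renamed instead.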

Regarding "linux - sed second occurrence of a string - for all lines in an external file (Linux)", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/53840154/
