perl - Collapsing lines with multiple fields

I have this code:

awk '!seen[$1,$2]++{a[$1]=(a[$1] ? a[$1]", " : "\t") $2} END{for (i in a) print i a[i]} ' inputfile

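I would like to collapse lines that have more than two fields, always using the first field as the key.

Input file (three tab-separated columns):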

protein_1   membrane    1e-4
protein_1   intracellular   1e-5
protein_2   membrane    1e-50
protein_2   citosol 1e-40

Desired output (three tab-separated columns):

protein_1   membrane, intracellular 1e-4, 1e-5
protein_2   membrane, citosol   1e-50, 1e-40

Thanks!

This is where I am stuck:

awk '!seen[$1,$2]++{a[$1]=(a[$1] ? a[$1]"\t" : "\t") $2};{a[$1]=(a[$1] ? a[$1]", " : "\t") $3} END{for (i in a) print i a[i]} ' 1 inputfile

Best answer

I was really hoping someone would post some awk magic, but for now I'll go ahead and throw out a longer perl script:

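use strict;
use warnings;

# One array ref per data column, holding the values of the current protein.
my @cols = ();
my $lastprotein = '';

# Note: lines for the same protein must be adjacent in the input.
while (<DATA>) {
    chomp;
    my ($protein, @data) = split "\t";

    # The key changed: print the finished group and start a new one.
    if ($protein ne $lastprotein && @cols) {
        print join("\t", $lastprotein, map {join ', ', @$_} @cols), "\n";
        @cols = ();
    }

    # Append each field to the list for its column.
    push @{$cols[$_]}, $data[$_] for (0..$#data);
    $lastprotein = $protein;
}

# Print the last group.
print join("\t", $lastprotein, map {join ', ', @$_} @cols), "\n";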

__DATA__
protein_1   membrane    1e-4
protein_1   intracellular   1e-5
protein_2   membrane    1e-50
protein_2   citosol 1e-40

Output

protein_1       membrane, intracellular 1e-4, 1e-5
protein_2       membrane, citosol       1e-50, 1e-40

Regarding "perl - Collapsing lines with multiple fields", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/22307750/
