unix - How to merge lines that start with a specific substring?

I have a file like this

$ head test
                     gene=ENSECAG00000012421
                     note="synaptonemal complex central element protein 1
                     [Source:HGNC Symbol;Acc:28852]"
                     gene=ENSECAG00000017803
                     note="Uncharacterized protein
                     [Source:UniProtKB/TrEMBL;Acc:F6SNR9]"
                     gene=ENSECAG00000019088
                     note="cytochrome P450 2E1  [Source:RefSeq
                     peptide;Acc:NP_001104773]"
                     gene=ENSECAG00000004229

I would like the file to look like this

ENSECAG00000012421    synaptonemal complex central element protein 1 [Source:HGNC Symbol;Acc:28852]
ENSECAG00000017803    Uncharacterized protein [Source:UniProtKB/TrEMBL;Acc:F6SNR9]

I'm not sure whether the note always spans two lines, so I'd like something like

awk '{if(substr($1,1,4)=="gene") gene=$1; else print gene,$1}'

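but I want it to recognize that a note may span two lines and that there are spaces between the words. So I want it to print everything inside the "" as column 2 (ideally with the two columns separated by \t so they can't get mixed up later). I know how to get rid of gene=, note= and the ", but I'm not sure whether they help with the recognition. I'm happy for it to be a chain of separate commands that first puts the whole note on one line and then combines it with the gene, or for everything to be done in one go, whichever works best.

Also, if you use awk, could you briefly explain what you are doing?

Thanks for your help!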

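Best answer

If you have GNU awk or mawk (this solution relies on a regex-based input record separator, which strictly POSIX-compliant or older awk implementations do not support):

Short version: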

awk -v RS=' *(gene=|note="|")' '
  { gsub("\n", ""); if ($0 == "") next; $1=$1; 
    printf "%s%s", $0, (/^ENSECAG[0-9]+$/ ? "\t" : "\n") }
  ' file

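Annotated version:

-v RS=' *(gene=|note="|")' - RS is the special variable that defines the input record separator. Setting it to a regular expression breaks the input into the records of interest, even when a record spans several lines.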
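To see what that separator does in isolation, here is a small sketch on made-up input (G1 and the note text are placeholders, not data from the question). It only prints each non-empty record in brackets, and it assumes that awk is GNU awk or mawk, as above.

```shell
# Made-up two-line note; G1 is a placeholder ID, not data from the question.
records=$(printf '  gene=G1\n  note="first\n  second"\n' |
  awk -v RS=' *(gene=|note="|")' '
    {
      gsub("\n", "")                  # drop newlines inside the record
      if ($0 != "") print "[" $0 "]"  # show each non-empty record in brackets
    }')
printf '%s\n' "$records"
```

The gene ID and the whole note body each arrive as a single record. The indentation of the second note line is still there as interior spaces, which is what the $1=$1 step in the answer squeezes out.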

awk -v RS=' *(gene=|note="|")' '
  {
    gsub("\n", "")       # remove all newlines from the record
    if ($0 == "") next   # ignore empty records
    $1 = $1              # rebuild the record to compress multiple interior spaces
    # Output:
    #  - Is it a gene record, i.e. is there only 1 field that contains a gene name?
    #    Output it with just a trailing \t, but no trailing \n, so that the next
    #    note record will print on the same line.
    #  - Otherwise it is a note record: print it with a trailing \n, effectively
    #    appending it to the previous gene record.
    printf "%s%s", $0, (/^ENSECAG[0-9]+$/ ? "\t" : "\n")
  }
  ' file
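
If neither GNU awk nor mawk is available, roughly the same result can be had line by line, without a regex RS. This is only a sketch, and it rests on assumptions the question does not confirm: every note starts with note=" at the beginning of a line (after indentation), and the only double quote that ends a line is the one closing the note. The sample file and the names sample, merged and in_note are made up for illustration.

```shell
# Hypothetical sample input that mirrors the layout shown in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
                     gene=ENSECAG00000012421
                     note="synaptonemal complex central element protein 1
                     [Source:HGNC Symbol;Acc:28852]"
                     gene=ENSECAG00000017803
                     note="Uncharacterized protein
                     [Source:UniProtKB/TrEMBL;Acc:F6SNR9]"
EOF

merged=$(awk '
  {
    sub(/^[ \t]+/, ""); sub(/[ \t]+$/, "")   # trim indentation and trailing blanks
    if ($0 ~ /^gene=/) {                     # gene line: remember the ID, print nothing yet
      gene = substr($0, 6)
      next
    }
    if ($0 ~ /^note="/) {                    # first line of a note
      note = substr($0, 7)
      in_note = 1
    } else if (in_note) {                    # continuation line of the current note
      note = note " " $0
    }
    if (in_note && note ~ /"$/) {            # closing quote reached: emit one output line
      sub(/"$/, "", note)                    # drop the closing quote
      gsub(/[ \t]+/, " ", note)              # squeeze runs of blanks into one space
      print gene "\t" note
      in_note = 0
    }
  }' "$sample")
rm -f "$sample"
printf '%s\n' "$merged"
```

Unlike the regex-RS version, this keeps state across lines (gene, note, in_note), so it handles notes of one, two or more lines, and a gene with no note simply produces no output line.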

Regarding "unix - How to merge lines that start with a specific substring?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/23004626/
