AWK 整合文件(后续)

标签 awk

this post 的后续工作: (根据评论中的要求进行更新)

我从实际(伪装的)示例数据以及在该数据上运行脚本的结果开始。

原始数据样本:

目录中前两个文件的标题和前两行。它们是相同的数据,这不是错误。 (这可能会发生在这个数据集中。)

文件 1:

Provider,,,,,,,,,,,,,,
02/01/2018 - 02/28/2018,,,,,,,,,,,,,,
Field1,Field2,Field3,Field4,Field5,Field6,Field7,Field8,Field9,Field10,Field11,Field12,Field13,Field14,Field15
B002H5QQJA,803814064988,803814064988,P2IIPDM5MDTW,P2IIPDM5MDTW,T,Prod_P,,,,1,foo1,bar1,YDtAK,BrandX
B002H5QQTU,803814064988,803814064988,K59C4XR93JOV,K59C4XR93JOV,T,Prod_P,,,,1,foo1,bar1,kmAnC,BrandX
B002H5QR44,803814064988,803814064988,FUBOROFTLW9U,FUBOROFTLW9U,T,Prod_P,,,,1,foo1,bar1,JdLye,BrandX
B002H5QRBC,803814064988,803814064988,KMHRXLF2FRKH,KMHRXLF2FRKH,T,Prod_P,,,,1,foo1,bar1,Biqvo,BrandX
B002H5QSC0,803814064988,803814064988,PCLB5UPGGP9T,PCLB5UPGGP9T,T,Prod_P,,,,1,foo2,bar2,Iwvhe,BrandX
B002H5QU3M,505545471538,505545471538,3K4GDYDEOH1M,3K4GDYDEOH1M,T,Prod_P,,,,1,foo3,bar3,NWsOC,BrandY
B002H5QUAK,417248985349,417248985349,7R40MN9AD9I8,7R40MN9AD9I8,T,Prod_I,1,0,1,0,foo4,bar4,YVQeH,BrandY
B002H5QUBY,417248985349,417248985349,C04GQONG1Z5B,C04GQONG1Z5B,T,Prod_I,1,0,1,0,foo4,bar4,PERMW,BrandY
B002H5QUCI,505545471538,505545471538,4E1ZGIJR1GPR,4E1ZGIJR1GPR,T,Prod_P,,,,1,foo3,bar3,UycEB,BrandY
B002H5QUVO,804699101426,804699101426,51RXKMWGJJ30,51RXKMWGJJ30,T,Prod_P,,,,1,foo5,bar5,Qwyuy,BrandY
B002H5QUZ0,804699101426,804699101426,7L0QBQM8S80L,7L0QBQM8S80L,T,Prod_P,,,,1,foo5,bar5,nqgId,BrandY
B002H5QXF2,803814064988,803814064988,PH0Q5QI34B0R,PH0Q5QI34B0R,T,Prod_P,,,,1,foo6,bar6,hPFiY,BrandX
B002H5QXWK,803814064988,803814064988,PSCLFNIDVZS0,PSCLFNIDVZS0,T,Prod_P,,,,1,foo6,bar6,BCdzF,BrandX

文件 2:

Provider,,,,,,,,,,,,,,,
01/01/2018 - 01/31/2018,,,,,,,,,,,,,,,
Field1,Field2,Field3,Field4,Field5,Field6,Field7,Field8,Field9,Field10,Field11,,Field12,Field13,Field14,Field15
B002H5N3AA,245462777033,245462777033,CFFWR2KSWLR8,CFFWR2KSWLR8,T,Prod_P,,,,1,bar1,foo2,bar1,RkG7D,BrandY
B002H5N3IM,245462777033,245462777033,CYFTO0FGAPSJ,CYFTO0FGAPSJ,T,Prod_P,,,,1,bar1,foo2,bar1,jqiGj,BrandY
B002H5N3R8,245462777033,245462777033,8ZNJHVCVO0A1,8ZNJHVCVO0A1,T,Prod_P,,,,1,bar1,foo2,bar1,Ylrcy,BrandY
B002H5N6X4,766193337142,766193337142,37YX24TRDPNW,37YX24TRDPNW,T,Prod_P,,,,1,bar2,foo3,bar2,WHxLZ,BrandX
B002H5N756,766193337142,766193337142,H56J19KCLFZP,H56J19KCLFZP,T,Prod_P,,,,1,bar2,foo3,bar2,VVw34,BrandX
B002H5N8QO,73612604823,73612604823,HZC9P776G2EP,HZC9P776G2EP,T,Prod_P,,,,1,bar3,foo4,bar3,X48HD,BrandZ
B002H5NA3U,932053704970,932053704970,XFIB2V8RQXN4,XFIB2V8RQXN4,T,Prod_P,,,,1,bar4,foo5,bar4,ghftn,BrandY
B002H5NJ6S,245675038659,245675038659,MUCSMOR5HB7V,MUCSMOR5HB7V,T,Prod_I,11,2,1,0,bar5,foo6,bar5,TVY19,BrandX
B002H5NJ6S,245675038659,245675038659,MUCSMOR5HB7V,MUCSMOR5HB7V,T,Prod_P,,,,2,bar5,foo6,bar5,M2j1i,BrandX
B002H5NJXQ,73612604823,73612604823,RJER36PXDF0T,RJER36PXDF0T,T,Prod_P,,,,1,bar6,foo7,bar6,1UnN3,BrandY
B002H5OU5C,491559514618,491559514618,X9K6BVZEHDDZ,X9K6BVZEHDDZ,T,Prod_P,,,,1,bar7,foo8,bar7,eybpO,BrandX
B002H5OU66,491559514618,491559514618,6510BKD3XD9R,6510BKD3XD9R,T,Prod_P,,,,1,bar7,foo8,bar7,yS9xk,BrandX
B002H5OU6Q,491559514618,491559514618,EFWDVP7FPCFA,EFWDVP7FPCFA,T,Prod_P,,,,1,bar7,foo8,bar7,0IXqS,BrandX

期望的输出:

Filename,Report_Period,Field1,Field2,Field3,Field4,Field5,Field6,Field7,Field8,Field9,Field10,Field11,Field12,Field13,Field14,Field15
FILENAME,02/01/2018 - 02/28/2018,B002H5QQJA,8.03814E+11,8.04E+11,P2IIPDM5MDTW,P2IIPDM5MDTW,T,Prod_P,,,,1,foo1,bar1,YDtAK,BrandX
FILENAME,02/01/2018 - 02/28/2018,B002H5QQTU,8.03814E+11,8.04E+11,K59C4XR93JOV,K59C4XR93JOV,T,Prod_P,,,,1,foo1,bar1,kmAnC,BrandX
FILENAME,02/01/2018 - 02/28/2018,B002H5QR44,8.03814E+11,8.04E+11,FUBOROFTLW9U,FUBOROFTLW9U,T,Prod_P,,,,1,foo1,bar1,JdLye,BrandX
FILENAME,02/01/2018 - 02/28/2018,B002H5QRBC,8.03814E+11,8.04E+11,KMHRXLF2FRKH,KMHRXLF2FRKH,T,Prod_P,,,,1,foo1,bar1,Biqvo,BrandX
FILENAME,02/01/2018 - 02/28/2018,B002H5QSC0,8.03814E+11,8.04E+11,PCLB5UPGGP9T,PCLB5UPGGP9T,T,Prod_P,,,,1,foo2,bar2,Iwvhe,BrandX
FILENAME,02/01/2018 - 02/28/2018,B002H5QU3M,5.05545E+11,5.06E+11,3K4GDYDEOH1M,3K4GDYDEOH1M,T,Prod_P,,,,1,foo3,bar3,NWsOC,BrandY
FILENAME,02/01/2018 - 02/28/2018,B002H5QUAK,4.17249E+11,4.17E+11,7R40MN9AD9I8,7R40MN9AD9I8,T,Prod_I,1,0,1,0,foo4,bar4,YVQeH,BrandY
FILENAME,02/01/2018 - 02/28/2018,B002H5QUBY,4.17249E+11,4.17E+11,C04GQONG1Z5B,C04GQONG1Z5B,T,Prod_I,1,0,1,0,foo4,bar4,PERMW,BrandY
FILENAME,02/01/2018 - 02/28/2018,B002H5QUCI,5.05545E+11,5.06E+11,4E1ZGIJR1GPR,4E1ZGIJR1GPR,T,Prod_P,,,,1,foo3,bar3,UycEB,BrandY
FILENAME,02/01/2018 - 02/28/2018,B002H5QUVO,8.04699E+11,8.05E+11,51RXKMWGJJ30,51RXKMWGJJ30,T,Prod_P,,,,1,foo5,bar5,Qwyuy,BrandY
FILENAME,02/01/2018 - 02/28/2018,B002H5QUZ0,8.04699E+11,8.05E+11,7L0QBQM8S80L,7L0QBQM8S80L,T,Prod_P,,,,1,foo5,bar5,nqgId,BrandY
FILENAME,02/01/2018 - 02/28/2018,B002H5QXF2,8.03814E+11,8.04E+11,PH0Q5QI34B0R,PH0Q5QI34B0R,T,Prod_P,,,,1,foo6,bar6,hPFiY,BrandX
FILENAME,02/01/2018 - 02/28/2018,B002H5QXWK,8.03814E+11,8.04E+11,PSCLFNIDVZS0,PSCLFNIDVZS0,T,Prod_P,,,,1,foo6,bar6,BCdzF,BrandX
FILENAME,02/01/2018 - 02/28/2018,B002H5N3AA,2.45463E+11,2.45E+11,CFFWR2KSWLR8,CFFWR2KSWLR8,T,Prod_P,,,,1,foo2,bar1,RkG7D,BrandY
FILENAME,02/01/2018 - 02/28/2018,B002H5N3IM,2.45463E+11,2.45E+11,CYFTO0FGAPSJ,CYFTO0FGAPSJ,T,Prod_P,,,,1,foo2,bar1,jqiGj,BrandY
FILENAME,02/01/2018 - 02/28/2018,B002H5N3R8,2.45463E+11,2.45E+11,8ZNJHVCVO0A1,8ZNJHVCVO0A1,T,Prod_P,,,,1,foo2,bar1,Ylrcy,BrandY
FILENAME,02/01/2018 - 02/28/2018,B002H5N6X4,7.66193E+11,7.66E+11,37YX24TRDPNW,37YX24TRDPNW,T,Prod_P,,,,1,foo3,bar2,WHxLZ,BrandX
FILENAME,02/01/2018 - 02/28/2018,B002H5N756,7.66193E+11,7.66E+11,H56J19KCLFZP,H56J19KCLFZP,T,Prod_P,,,,1,foo3,bar2,VVw34,BrandX
FILENAME,02/01/2018 - 02/28/2018,B002H5N8QO,73612604823,73612604823,HZC9P776G2EP,HZC9P776G2EP,T,Prod_P,,,,1,foo4,bar3,X48HD,BrandZ
FILENAME,02/01/2018 - 02/28/2018,B002H5NA3U,9.32054E+11,9.32E+11,XFIB2V8RQXN4,XFIB2V8RQXN4,T,Prod_P,,,,1,foo5,bar4,ghftn,BrandY
FILENAME,02/01/2018 - 02/28/2018,B002H5NJ6S,2.45675E+11,2.46E+11,MUCSMOR5HB7V,MUCSMOR5HB7V,T,Prod_I,11,2,1,0,foo6,bar5,TVY19,BrandX
FILENAME,02/01/2018 - 02/28/2018,B002H5NJ6S,2.45675E+11,2.46E+11,MUCSMOR5HB7V,MUCSMOR5HB7V,T,Prod_P,,,,2,foo6,bar5,M2j1i,BrandX
FILENAME,02/01/2018 - 02/28/2018,B002H5NJXQ,73612604823,73612604823,RJER36PXDF0T,RJER36PXDF0T,T,Prod_P,,,,1,foo7,bar6,1UnN3,BrandY
FILENAME,02/01/2018 - 02/28/2018,B002H5OU5C,4.9156E+11,4.92E+11,X9K6BVZEHDDZ,X9K6BVZEHDDZ,T,Prod_P,,,,1,foo8,bar7,eybpO,BrandX
FILENAME,02/01/2018 - 02/28/2018,B002H5OU66,4.9156E+11,4.92E+11,6510BKD3XD9R,6510BKD3XD9R,T,Prod_P,,,,1,foo8,bar7,yS9xk,BrandX
FILENAME,02/01/2018 - 02/28/2018,B002H5OU6Q,4.9156E+11,4.92E+11,EFWDVP7FPCFA,EFWDVP7FPCFA,T,Prod_P,,,,1,foo8,bar7,0IXqS,BrandX

我的脚本(改编 self 的 O.P. 中接受的答案):

它几乎可以工作了。但它包含每个文件的第 1-3 行:

gawk '
function basename(file) {
    sub(".*/", "", file)
    return file
  }
BEGIN { FS=OFS="," }
NR < 3 {
    if ( NR == 2 ) {
        hdr = "Report_Period" OFS
        val = val $1 OFS
    }
    next
}
FNR>3 {
    print "Filename", hdr $0
    next
}
{ print basename(FILENAME), val $0 }
' OFS="," /path/to/input/files/*.csv > ~/path/to/output/file/SampleOutput.csv

实际输出

这是结果文件的全部内容。问题似乎是标题重复:

Sample1.csv,02/01/2018 - 02/28/2018,Field1,Field2,Field3,Field4,Field5,Field6,Field7,Field8,Field9,Field10,Field11,Field12,Field13,Field14,Field15
Filename,Report_Period,B002H5QQJA,803814064988,803814064988,P2IIPDM5MDTW,P2IIPDM5MDTW,T,Prod_P,,,,1,foo1,bar1,YDtAK,BrandX
Filename,Report_Period,B002H5QQTU,803814064988,803814064988,K59C4XR93JOV,K59C4XR93JOV,T,Prod_P,,,,1,foo1,bar1,kmAnC,BrandX
Filename,Report_Period,B002H5QR44,803814064988,803814064988,FUBOROFTLW9U,FUBOROFTLW9U,T,Prod_P,,,,1,foo1,bar1,JdLye,BrandX
Filename,Report_Period,B002H5QRBC,803814064988,803814064988,KMHRXLF2FRKH,KMHRXLF2FRKH,T,Prod_P,,,,1,foo1,bar1,Biqvo,BrandX
Filename,Report_Period,B002H5QSC0,803814064988,803814064988,PCLB5UPGGP9T,PCLB5UPGGP9T,T,Prod_P,,,,1,foo2,bar2,Iwvhe,BrandX
Filename,Report_Period,B002H5QU3M,505545471538,505545471538,3K4GDYDEOH1M,3K4GDYDEOH1M,T,Prod_P,,,,1,foo3,bar3,NWsOC,BrandY
Filename,Report_Period,B002H5QUAK,417248985349,417248985349,7R40MN9AD9I8,7R40MN9AD9I8,T,Prod_I,1,0,1,0,foo4,bar4,YVQeH,BrandY
Filename,Report_Period,B002H5QUBY,417248985349,417248985349,C04GQONG1Z5B,C04GQONG1Z5B,T,Prod_I,1,0,1,0,foo4,bar4,PERMW,BrandY
Filename,Report_Period,B002H5QUCI,505545471538,505545471538,4E1ZGIJR1GPR,4E1ZGIJR1GPR,T,Prod_P,,,,1,foo3,bar3,UycEB,BrandY
Filename,Report_Period,B002H5QUVO,804699101426,804699101426,51RXKMWGJJ30,51RXKMWGJJ30,T,Prod_P,,,,1,foo5,bar5,Qwyuy,BrandY
Filename,Report_Period,B002H5QUZ0,804699101426,804699101426,7L0QBQM8S80L,7L0QBQM8S80L,T,Prod_P,,,,1,foo5,bar5,nqgId,BrandY
Filename,Report_Period,B002H5QXF2,803814064988,803814064988,PH0Q5QI34B0R,PH0Q5QI34B0R,T,Prod_P,,,,1,foo6,bar6,hPFiY,BrandX
Filename,Report_Period,B002H5QXWK,803814064988,803814064988,PSCLFNIDVZS0,PSCLFNIDVZS0,T,Prod_P,,,,1,foo6,bar6,BCdzF,BrandX
Sample2.csv,02/01/2018 - 02/28/2018,Provider,,,,,,,,,,,,,,
Sample2.csv,02/01/2018 - 02/28/2018,01/01/2018 - 01/31/2018,,,,,,,,,,,,,,
Sample2.csv,02/01/2018 - 02/28/2018,Field1,Field2,Field3,Field4,Field5,Field6,Field7,Field8,Field9,Field10,Field11,Field12,Field13,Field14,Field15
Filename,Report_Period,B002H5N3AA,2.45463E+11,2.45463E+11,CFFWR2KSWLR8,CFFWR2KSWLR8,T,Prod_P,,,,1,foo2,bar1,RkG7D,BrandY
Filename,Report_Period,B002H5N3IM,2.45463E+11,2.45463E+11,CYFTO0FGAPSJ,CYFTO0FGAPSJ,T,Prod_P,,,,1,foo2,bar1,jqiGj,BrandY
Filename,Report_Period,B002H5N3R8,2.45463E+11,2.45463E+11,8ZNJHVCVO0A1,8ZNJHVCVO0A1,T,Prod_P,,,,1,foo2,bar1,Ylrcy,BrandY
Filename,Report_Period,B002H5N6X4,7.66193E+11,7.66193E+11,37YX24TRDPNW,37YX24TRDPNW,T,Prod_P,,,,1,foo3,bar2,WHxLZ,BrandX
Filename,Report_Period,B002H5N756,7.66193E+11,7.66193E+11,H56J19KCLFZP,H56J19KCLFZP,T,Prod_P,,,,1,foo3,bar2,VVw34,BrandX
Filename,Report_Period,B002H5N8QO,73612604823,73612604823,HZC9P776G2EP,HZC9P776G2EP,T,Prod_P,,,,1,foo4,bar3,X48HD,BrandZ
Filename,Report_Period,B002H5NA3U,9.32054E+11,9.32054E+11,XFIB2V8RQXN4,XFIB2V8RQXN4,T,Prod_P,,,,1,foo5,bar4,ghftn,BrandY
Filename,Report_Period,B002H5NJ6S,2.45675E+11,2.45675E+11,MUCSMOR5HB7V,MUCSMOR5HB7V,T,Prod_I,11,2,1,0,foo6,bar5,TVY19,BrandX
Filename,Report_Period,B002H5NJ6S,2.45675E+11,2.45675E+11,MUCSMOR5HB7V,MUCSMOR5HB7V,T,Prod_P,,,,2,foo6,bar5,M2j1i,BrandX
Filename,Report_Period,B002H5NJXQ,73612604823,73612604823,RJER36PXDF0T,RJER36PXDF0T,T,Prod_P,,,,1,foo7,bar6,1UnN3,BrandY
Filename,Report_Period,B002H5OU5C,4.9156E+11,4.9156E+11,X9K6BVZEHDDZ,X9K6BVZEHDDZ,T,Prod_P,,,,1,foo8,bar7,eybpO,BrandX
Filename,Report_Period,B002H5OU66,4.9156E+11,4.9156E+11,6510BKD3XD9R,6510BKD3XD9R,T,Prod_P,,,,1,foo8,bar7,yS9xk,BrandX
Filename,Report_Period,B002H5OU6Q,4.9156E+11,4.9156E+11,EFWDVP7FPCFA,EFWDVP7FPCFA,T,Prod_P,,,,1,foo8,bar7,0IXqS,BrandX

再次感谢!

最佳答案

这可能是您想要做的:

$ cat tst.awk
BEGIN { FS=OFS="," }
NR==3 {
    print "Filename", "Report_Period", $0
}
FNR==2 {
    fname = FILENAME
    sub(".*/","",fname)
    period = $1
}
FNR>3 {
    print fname, period, $0
}

我自己编写了数据来测试它(file#_row#_col#):

$ cat file1
Provider,,
02/01/2018 - 02/28/2018,,
Field1,Field2,Field3
f1_r1_c1,f1_r1_c2,f1_r1_c3
f1_r2_c1,f1_r2_c2,f1_r2_c3

$ cat file2
Provider,,
01/01/2018 - 01/31/2018,,
Field1,Field2,Field3
f2_r1_c1,f2_r1_c2,f2_r1_c3
f2_r2_c1,f2_r2_c2,f2_r2_c3

$ awk -f tst.awk file1 file2
Filename,Report_Period,Field1,Field2,Field3
file1,02/01/2018 - 02/28/2018,f1_r1_c1,f1_r1_c2,f1_r1_c3
file1,02/01/2018 - 02/28/2018,f1_r2_c1,f1_r2_c2,f1_r2_c3
file2,01/01/2018 - 01/31/2018,f2_r1_c1,f2_r1_c2,f2_r1_c3
file2,01/01/2018 - 01/31/2018,f2_r2_c1,f2_r2_c2,f2_r2_c3

如果这不是您想要的,请编辑您的问题以澄清您的要求并使用最少的示例输入/输出。

关于AWK 整合文件(后续),我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/51181188/

相关文章:

linux - 提取括号内的单词并将其替换为句子

bash - 如何删除分隔文件中标签之间的空格?

linux - 如何使用 Linux/Unix 实用程序提取每个输入行开头的整数或小数?

unix - 如何对 AWK 中的列求和?

bash - Shell 脚本获取两年之间的值

bash - 使用 ps 和 awk 获取 pid,然后杀死它

bash - 将 CSV 文件的第一行全部改为大写

python - 从文本文件中提取单词

sed - 使用 sed 修剪文本

linux - 如何在 linux 中从 string1 搜索到 string2?