shell - 提取标签 <t> </t> 之间的数据

标签 shell unix

我有如下数据 如何打印两个标签之间的数据 我希望数据是命令分隔的 csv 格式

我的方法是将数据转换为水平格式,然后在每第 4 列后进行剪切并转换为垂直格式

xml文件中的数据

<?xml version="1.0" encoding="UTF-8" standalone="true"?>
-
<sst uniqueCount="12" count="12"
    xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
-
    <si>
        <t>"NAME"</t>
    </si>
-
    <si>
        <t>"Vikas"</t>
    </si>
-
    <si>
        <t>"Vijay"</t>
    </si>
-
    <si>
        <t>"Vilas"</t>
    </si>
-
    <si>
        <t>"AGE"</t>
    </si>
-
    <si>
        <t>"24"</t>
    </si>
-
    <si>
        <t>"34"</t>
    </si>
-
    <si>
        <t>"35"</t>
    </si>
-
    <si>
        <t>"COURSE"</t>
    </si>
-
    <si>
        <t>"MCA"</t>
    </si>
-
    <si>
        <t>"MMS"</t>
    </si>
-
    <si>
        <t>"MBA"</t>
    </si>
</sst>
      

我尝试过以下命令不起作用..

awk '/<t/{flag=1;next}/<t/{flag=0}flag' abc.xml

即使尝试了下面的命令,它也会提供数据,但在单行中

awk -F'(</*t>|</*t>)' 'NF>1{for(i=2;i<NF; i=i+2) printf("%s%s", $i, (i+1==NF)?ORS:OFS)}' OFS=',' demo.xml

我想要以下数据作为输出

NAME,AGE,Course
Vikas,"25",MCA
Prabhash,"34",MBA
Arjun,"21",MMS

最佳答案

仅使用您显示的示例,您可以尝试以下操作。

awk -v OFS="," '
!NF || /^-$/{ next }
/<t>"COURSE"<\/t>/{
  foundAge=foundName=""
  foundCourse=1
  count=0
}
/<t>"AGE"<\/t>/{
  foundAge=1
  foundName=""
  count=0
}
/<t>"NAME"<\/t>/{
  foundName=1
  count=0
}
foundAge && match($0,/>[^<]*/){
  age[++count]=substr($0,RSTART+1,RLENGTH-1)
}
foundName && match($0,/>[^<]*/){
  name[++count]=substr($0,RSTART+1,RLENGTH-1)
}
foundCourse && match($0,/>[^<]*/){
  course[++count]=substr($0,RSTART+1,RLENGTH-1)
}
END{
  for(k=1;k<=count;k++){
    if(name[k]){
      print name[k],age[k],course[k]
    }
  }
}
'  Input_file

说明:为上述内容添加详细说明。

awk -v OFS="," '                                 ##Starting awk program from here.
!NF || /^-$/{ next }                             ##if line is empty or starts with - then skip that line.
/<t>"COURSE"<\/t>/{                              ##Checking if line has <t>"COURSE"</t> then do following.
  foundAge=foundName=""                          ##Nullifying foundAge and foundName here.
  foundCourse=1                                  ##Setting foundCourse to 1 here.
  count=0                                        ##Setting count to 0 here.
}
/<t>"AGE"<\/t>/{                                 ##Checking if line has <t>"AGE"</t> then do following.
  foundAge=1                                     ##Setting foundAge to 1 here.
  foundName=foundCourse=""                       ##Nullifying foundName and foundCourse here.
  count=0                                        ##Setting count to 0 here.
}
/<t>"NAME"<\/t>/{                                ##Checking if line has <t>"NAME"</t> then do following.
  foundName=1                                    ##Setting foundName to 1 here.
  count=0                                        ##Setting count to 0 here.
}
foundAge && match($0,/>[^<]*/){                  ##Checking if foundAge is set and using match function to get values from > to till < here.
  age[++count]=substr($0,RSTART+1,RLENGTH-1)     ##Creating age with index of count and having matched regex value here.
}
foundName && match($0,/>[^<]*/){                 ##Checking if foundName is set and using match function to get values from > to till < here.
  name[++count]=substr($0,RSTART+1,RLENGTH-1)    ##Creating name with index of count and having matched regex value here.
}
foundCourse && match($0,/>[^<]*/){               ##Checking if foundCourse is set and using match function to get values from > to till < here.
  course[++count]=substr($0,RSTART+1,RLENGTH-1)  ##Creating course with index of count and having matched regex value here.
}
END{                                             ##Starting END block of this awk program from here.
  for(k=1;k<=count;k++){                         ##Traversing through all elements of name here.
    if(name[k]){
      print name[k],age[k],course[k]             ##Printing respective array values here.
    }
  }
}
'  Input_file                                    ##Mentioning Input_file name here.


编辑:根据OP的评论,如果一行中需要所有值,请尝试以下操作:

awk -v OFS="," '
!NF || /^-$/{ next }
/<t>"COURSE"<\/t>/{
  foundAge=foundName=""
  foundCourse=1
  count=0
}
/<t>"AGE"<\/t>/{
  foundAge=1
  foundName=""
  count=0
}
/<t>"NAME"<\/t>/{
  foundName=1
  count=0
}
foundAge && match($0,/>[^<]*/){
  age[++count]=substr($0,RSTART+1,RLENGTH-1)
}
foundName && match($0,/>[^<]*/){
  name[++count]=substr($0,RSTART+1,RLENGTH-1)
}
foundCourse && match($0,/>[^<]*/){
  course[++count]=substr($0,RSTART+1,RLENGTH-1)
}
END{
  for(k=1;k<=count;k++){
     if(name[k]){
     nameVal=(nameVal?nameVal OFS:"")name[k]
     ageVal=(ageVal?ageVal OFS:"")age[k]
     courseVal=(courseVal?courseVal OFS:"")course[k]
     }
  }
  print nameVal,ageVal,courseVal
}
'  Input_file

关于shell - 提取标签 <t> </t> 之间的数据,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/67245123/

相关文章:

linux - 执行csh脚本时如何显示行号?

linux - 如何实现一个新的shell命令?

shell - 在k8s中启动一个pod用作集群的终端

linux - 结合 Unix 测试命令 && + ||

bash - 在 AWK 中分配一个 Shell 变量

regex - 使用 Sed 删除前导空格和尾随空格的问题

linux - 可变频率的调度脚本

java - 用 Java 写入 Perl 进程输入流

regex - grep --ignore-case --only

shell - WGET:删除 'filename' 因为它应该被拒绝