bash - Remove specific duplicate lines from an XML file

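I've been reading all over Stack Overflow about removing duplicate lines. There are perl, awk, and sed solutions, but none is as specific as what I need, and I'm at a loss.

I want to use a quick bash/shell or perl command to remove the duplicate <upath> tags from this XML, comparing case-insensitively. All other duplicate lines (such as <start> and <end>) must stay intact!

Input XML: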

  <package>
    <id>1523456789</id>
    <models>
      <model type="A">
        <start>2016-04-20</start>      <------ Duplicate line to keep 
        <end>2017-04-20</end>          <------ Duplicate line to keep
      </model>
      <model type="B">                 
        <start>2016-04-20</start>      <------ Duplicate line to keep
        <end>2017-04-20</end>          <------ Duplicate line to keep
      </model>
    </models>
    <userinterface>
      <upath>/Example/Dir/Here</upath>
      <upath>/Example/Dir/Here2</upath>
      <upath>/example/dir/here</upath>   <------ Duplicate line to REMOVE
    </userinterface>
  </package>

So far I've been able to find the duplicate lines, but I don't know how to delete them. The following command

grep -H path *.[Xx][Mm][Ll] | sort | uniq -id

gives this result:

test.xml:          <upath>/example/dir/here</upath>

Now how do I delete that line?

Running either the perl version or the awk version below also removes the duplicated <start> and <end> dates.

perl -i.bak -ne 'print unless $seen{lc($_)}++' test.xml
awk '!a[tolower($0)]++' test.xml > test.xml.new
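
Both commands apply the "seen" check to every line, which is why the repeated <start> and <end> lines disappear too. One possible narrowing, sketched here under the assumption that each <upath> element sits on its own line with identical surrounding whitespace, is to apply the check only to lines containing <upath>. The temporary directory and sample file are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sample file, written to a temporary directory.
tmp=$(mktemp -d)
cat > "$tmp/test.xml" <<'EOF'
<userinterface>
  <start>2016-04-20</start>
  <start>2016-04-20</start>
  <upath>/Example/Dir/Here</upath>
  <upath>/Example/Dir/Here2</upath>
  <upath>/example/dir/here</upath>
</userinterface>
EOF

# Lines without <upath> are always printed.
# Lines with <upath> are printed only the first time their
# lower-cased text is seen.
awk '!/<upath>/ || !seen[tolower($0)]++' "$tmp/test.xml" > "$tmp/test.xml.new"
cat "$tmp/test.xml.new"
```

The same condition in perl would be `perl -ne 'print unless /<upath>/ && $seen{lc($_)}++' test.xml`. This is still a line-based filter, not an XML parser, so it breaks as soon as two <upath> elements share a line or differ in indentation.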

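Best answer

The script below takes an XML file as its first argument. It uses xmlstarlet (invoked as xml in the script) to parse the XML tree, and an associative array (requires Bash 4) to store the unique <upath> node values.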

#!/bin/bash

input_file=$1
# XPath to retrieve <upath> node value.
xpath_upath_value='//package/userinterface/upath/text()'
# XPath to print XML tree excluding  <userinterface> part.
xpath_exclude_userinterface_tree='//package/*[not(self::userinterface)]'
# Associative array to help us remove duplicated <upath> node values.
declare -A arr

print_userinterface_no_dup() { 
    printf '%s\n' "<userinterface>"
    printf '<upath>%s</upath>\n' "${arr[@]}"
    printf '%s\n' "</userinterface>"
}

# Iterate over each <upath> node value, lower-case it and use it as a key in the associative array.
while read -r upath; do
    key="${upath,,}"
    # We can remove this 'if' statement and simply arr[$key]="$upath"
    # if it doesn't matter whether we remove <upath>foo</upath> or <upath>FOO</upath>
    if [[ ! "${arr[$key]}" ]]; then
        arr[$key]="$upath"
    fi
done < <(xml sel -t -m "$xpath_upath_value" -c \. -n "$input_file")

printf '%s\n' "<package>"

# Print XML tree excluding <userinterface> part.
xml sel -t -m "$xpath_exclude_userinterface_tree" -c \. "$input_file"

# Print <userinterface> tree without duplicates.
print_userinterface_no_dup

printf '%s\n' "</package>"
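
The key step in the script above is using the lower-cased value as the associative-array key, and the `if` guard decides which spelling survives. Here is a minimal sketch of that one idea, separate from xmlstarlet. The array names `first_wins` and `last_wins` are made up for this illustration.

```shell
#!/usr/bin/env bash
# Requires Bash 4+ (associative arrays and the ${var,,} expansion).
declare -A first_wins=() last_wins=()

for upath in /Example/Dir/Here /example/dir/here; do
    key=${upath,,}   # lower-case the value so both spellings share one key

    # Guarded assignment: keep the first spelling seen for this key.
    if [[ -z ${first_wins[$key]+set} ]]; then
        first_wins[$key]=$upath
    fi

    # Unguarded assignment: a later spelling overwrites an earlier one.
    last_wins[$key]=$upath
done

printf 'first wins: %s\n' "${first_wins[/example/dir/here]}"
printf 'last wins:  %s\n' "${last_wins[/example/dir/here]}"
```

With the guard, `/Example/Dir/Here` is kept. Without it, `/example/dir/here` is kept. That is the trade-off the comment inside the loop refers to.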

Test (the script is named sof):

$ ./sof xml_file
<package>
    <id>1523456789</id>
    <models>
      <model type="A">
        <start>2016-04-20</start>
        <end>2017-04-20</end>
      </model>
      <model type="B">                 
        <start>2016-04-20</start>
        <end>2017-04-20</end>
      </model>
    </models>
    <userinterface>
        <upath>/Example/Dir/Here2</upath>
        <upath>/Example/Dir/Here</upath>
    </userinterface>
</package>
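
In the output above the <upath> entries do not come out in input order, because Bash makes no guarantee about the iteration order of an associative array. If order matters, one option is to record each new value in an ordinary indexed array as well. The sketch below assumes that approach. `add_upath`, `print_userinterface_in_order`, `seen` and `ordered` are hypothetical names, not part of the original script.

```shell
#!/usr/bin/env bash
# Requires Bash 4+.
declare -A seen=()     # lower-cased value -> 1
declare -a ordered=()  # unique values, in first-seen order

# Hypothetical helper: remember a value unless its lower-cased
# form has already been seen.
add_upath() {
    local key=${1,,}
    if [[ -z ${seen[$key]+set} ]]; then
        seen[$key]=1
        ordered+=("$1")
    fi
}

# Hypothetical variant of print_userinterface_no_dup that prints
# from the indexed array, so input order is preserved.
print_userinterface_in_order() {
    printf '%s\n' "<userinterface>"
    # Guard: with no arguments, printf would still emit one empty <upath>.
    if (( ${#ordered[@]} )); then
        printf '<upath>%s</upath>\n' "${ordered[@]}"
    fi
    printf '%s\n' "</userinterface>"
}

for p in /Example/Dir/Here /Example/Dir/Here2 /example/dir/here; do
    add_upath "$p"
done
print_userinterface_in_order
```

In the original script, the body of the `while read` loop would call `add_upath "$upath"` instead of assigning to `arr` directly.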

If the comments in the code aren't clear enough, please ask and I'll answer and edit this solution accordingly.

My xmlstarlet version is 1.6.1, compiled against libxml2 2.9.2 and libxslt 1.1.28.

Regarding bash - Remove specific duplicate lines from an XML file, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/36755492/
