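xml - Inserting 1000+ nodes and attributes with XMLStarlet is slow

This is a question about efficiency rather than troubleshooting. I have the following snippet: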

# The -R flag restores malformed XML
xmlstarlet -q fo -R <<<"$xml_content" | \
    # Delete xml_data
    xmlstarlet ed -d "$xml_data" | \
    # Delete index
    xmlstarlet ed -d "$xml_index" | \
    # Delete specific objects
    xmlstarlet ed -d "$xml_nodes/objects" | \
    # Append new node
    xmlstarlet ed -s "$xml_nodes" -t elem -n subnode -v "Hello World" | \
        # Add x attribute to node
        xmlstarlet ed -i "($xml_nodes)[last()]" -t attr -n x -v "0" | \
        # Add y attribute to node
        xmlstarlet ed -i "($xml_nodes)[last()]" -t attr -n y -v "0" | \
        # Add z attribute to node
        xmlstarlet ed -i "($xml_nodes)[last()]" -t attr -n z -v "1" \
            > "$output_file"

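The variable $xml_content holds the XML tree of directories and nodes, read with cat from a 472.6 MB file.

The variable $output_file, as its name suggests, contains the path to the output file.

The remaining variables simply contain the XPaths I want to edit.

The short article that helped me write this code notes: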

This is a bit ineffeciant since the xml file is parsed and written twice.

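In my case, the file is parsed and written far more than twice (eventually over 1000 times, in a loop).

Taking the script above as an example, that short fragment alone takes 4 minutes and 7 seconds to run.

Assuming the excessive, repetitive and probably inefficient piping, together with the file size, is why the code runs slowly, the more subnodes I end up inserting or deleting, the slower it will get.

I apologize in advance if I sound monotonous by repeating myself or by bringing up an old and probably already answered topic, but I would really like to understand how xmlstarlet works in detail with large XML documents.

Update

As @Cyrus suggested in his earlier answer: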

Those two xmlstarlets should do the job:

xmlstarlet -q fo -R <<<"$xml_content" |\
  xmlstarlet ed \
    -d "$xml_data" \
    -d "$xml_index" \
    -d "$xml_nodes/objects" \
    -s "$xml_nodes" -t elem -n subnode -v "Hello World" \
    -i "($xml_nodes)[last()]" -t attr -n x -v "0" \
    -i "($xml_nodes)[last()]" -t attr -n y -v "0" \
    -i "($xml_nodes)[last()]" -t attr -n z -v "1" > "$output_file"

This produced the following errors:

-:691.84: Attribute x redefined

-:691.84: Attribute z redefined

-:495981.9: xmlSAX2Characters: huge text node: out of memory

-:495981.9: Extra content at the end of the document

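Honestly, I do not know how these errors came about, since I changed the code often to test various scenarios and potential alternatives. However, this is what did the trick for me: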

xmlstarlet ed --omit-decl -L \
    -d "$xml_data" \
    -d "$xml_index" \
    -d "$xml_nodes/objects" \
    -s "$xml_nodes" -t elem -n subnode -v "Hello World" \
    "$temp_xml_file"

xmlstarlet ed --omit-decl -L \
    -i "($xml_nodes)[last()]" -t attr -n x -v "0" \
    -i "($xml_nodes)[last()]" -t attr -n y -v "0" \
    -i "($xml_nodes)[last()]" -t attr -n z -v "1" \
    "$temp_xml_file"

As for the data actually being inserted, this is what I start with:

...
<node>
    <subnode>A</subnode>
    <subnode>B</subnode>
    <objects>1</objects>
    <objects>2</objects>
    <objects>3</objects>
    ...
</node>
...

Running the (split) code above gives me what I want:

...
<node>
    <subnode>A</subnode>
    <subnode>B</subnode>
    <subnode x="0" y="0" z="1">Hello World</subnode>
</node>
...

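With the commands split, xmlstarlet is able to insert the attributes into the newly created node. Otherwise it adds them to the last() instance of the selected XPath before the --subnode has even been created. To some extent this is still inefficient, but the code now runs in under a minute.

The following code,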

xmlstarlet ed --omit-decl -L \
    -d "$xml_data" \
    -d "$xml_index" \
    -d "$xml_nodes/objects" \
    -s "$xml_nodes" -t elem -n subnode -v "Hello World" \
    -i "($xml_nodes)[last()]" -t attr -n x -v "0" \
    -i "($xml_nodes)[last()]" -t attr -n y -v "0" \
    -i "($xml_nodes)[last()]" -t attr -n z -v "1" \
    "$temp_xml_file"

however, gives me this:

...
<node>
    <subnode>A</subnode>
    <subnode x="0" y="0" z="1">B</subnode>
    <subnode>Hello World</subnode>
</node>
...

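When the xmlstarlet calls are joined as in this post (also answered by @Cyrus), the attributes are somehow added first, and only then is the --subnode with the inner text Hello World created.

Can anyone explain why this odd behavior occurs?

Here is another reference, which states that "every edit operation is performed in sequence".

The article above explains exactly what I am looking for, but I could not get it all to work in a single xmlstarlet ed \. In addition, I tried:

Replacing ($xml_nodes)[last()] with $xml_nodes[text() = 'Hello World']

Using $prev (or $xstar:prev) as the argument to -i, as in this answer. [Examples]

The temporary element name trick: adding the attr to a temporary node and then renaming that node with -r

All of the above insert the --subnode but leave the new element without attributes.

Note: I am running XMLStarlet 1.6.1 on OS X El Capitan v 10.11.3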
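One thing that may be worth ruling out for the $prev attempt (this is a guess, since the exact command is not shown): inside double quotes, bash expands $prev as its own variable, so xmlstarlet would receive an empty XPath instead of the literal text $prev. A minimal sketch of the difference, with prev set to an empty string as a stand-in for a script that defines no such variable:

```shell
# Assumption: the calling script has no meaningful shell variable named prev.
prev=""

double_quoted="$prev"   # the shell substitutes its own (empty) variable here
single_quoted='$prev'   # the literal text $prev is kept for the program

printf 'double quotes pass: [%s]\n' "$double_quoted"
printf 'single quotes pass: [%s]\n' "$single_quoted"
```

If that is what happened, writing the argument as -i '$prev' would at least hand the intended expression to xmlstarlet.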

Bonus

As I mentioned at the beginning, I want to use a loop like this:

list="$(tr -d '\r' < "$names")"

for name in $list; do
    xmlstarlet ed --omit-decl -L \
    -d "$xml_data" \
    -d "$xml_index" \
    -d "$xml_nodes/objects" \
    -s "$xml_nodes" -t elem -n subnode -v "$name" \
    -i "($xml_nodes)[last()]" -t attr -n x -v "0" \
    -i "($xml_nodes)[last()]" -t attr -n y -v "0" \
    -i "($xml_nodes)[last()]" -t attr -n z -v "1" \
    "$temp_xml_file"
done

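$list contains over a thousand different names, each of which needs to be added along with its respective attributes. The --value of each attribute may also vary from one loop iteration to the next. Given the model above:

What is the fastest and most accurate version of such a loop, given that the attributes must be added to the correct nodes?

Would it be faster to create the list of nodes in an external txt file and then add those XML elements (inside the txt file) to another XML file? If so, how? Perhaps with sed or grep?

Regarding the last question, I am referring to something like this. The node that receives the XML from the txt file must be specific, e.g. at least selectable by XPath, since I only want to edit certain nodes.

Note: The model above is just an example. The actual loop will add 26 --subnodes per iteration and 3 or 4 attr per --subnode. That is why it is important that xmlstarlet adds each attr to the right element and not to some other one. They must be added in order.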
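A rough sketch of how such a loop could be collapsed into a single invocation, so that the document is parsed and written once: collect the whole option list in a bash array first, then call xmlstarlet one time. The XPaths, file name and names below are hypothetical stand-ins. The sketch also assumes that the installed xmlstarlet supports --var together with the $prev variable mentioned earlier; that has not been verified against 1.6.1 or against the real document.

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins for the question's variables (assumptions).
xml_nodes='//node'
xml_data='//data'
xml_index='//index'
temp_xml_file='temp.xml'
names=(Alpha Beta Gamma)   # in the question these come from the $names file

# The deletions only need to run once, so they go in front of the loop.
ed_args=(ed --omit-decl -L
         -d "$xml_data" -d "$xml_index" -d "$xml_nodes/objects")

for name in "${names[@]}"; do
    # Append one subnode, remember it as $new, then attach its attributes.
    # Single quotes stop the shell from expanding xmlstarlet's own variables.
    ed_args+=(-s "$xml_nodes" -t elem -n subnode -v "$name"
              --var new '$prev'
              -i '$new' -t attr -n x -v 0
              -i '$new' -t attr -n y -v 0
              -i '$new' -t attr -n z -v 1)
done

# Show how many arguments the single invocation would receive.
echo "arguments: ${#ed_args[@]}"

# Run only when the tool and the file are present: one parse, one write.
if command -v xmlstarlet >/dev/null 2>&1 && [ -f "$temp_xml_file" ]; then
    xmlstarlet "${ed_args[@]}" "$temp_xml_file"
fi
```

With more than a thousand names the argument list may grow past the operating system's limit (see getconf ARG_MAX), in which case the names could be processed in chunks of a few hundred per invocation.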

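Best answer

Why not use parallel (or sem), so that the jobs can be spread over the number of cores available on your machine?
The code I use parses an array whose entries hold 2 variables each. I declare them local just to make sure the processes stay isolated.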

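run_jobs() {
    # local is only valid inside a function, hence the wrapper
    local array var1 var2
    for array in "${listofarrays[@]}"; do
        # Split the "var1,var2" entry into its two fields
        IFS=, read -r var1 var2 <<< "$array"
        # sem takes the command on the same line; -j +0 means one job per core
        sem -j +0 <code goes here>
    done
    # Block until every queued job has finished
    sem --wait
}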

Regarding "xml - Inserting 1000+ nodes and attributes with XMLStarlet is slow", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/47234551/
