xml - Extract text between two XML tags using sed

I have XML files similar to the following:

<?xml version="1.0" encoding="UTF-8"?>
<OnlineCommentary>
    <doc docid="cnn_210085_comment002" articleURL="http://www.cnn.com/News.asp?NewsID=210085" date="10/07/2010" time="00:21" subtitle="Is Justin Bieber getting special treatment?" author="Zorro75">
        <seg id="1"> They are the same thing. Let's shoot them both. </seg>
    </doc>
    <doc docid="cnn_210092_comment004" articleURL="http://www.cnn.com/News.asp?NewsID=210092" date="06/04/2010" time="17:07" subtitle="Dear Chicago, we love you despite it all" author="MRL1313">
        <seg id="1"> We can't wait for you to move back either. </seg>
        <seg id="2"> You seem quite uptight. </seg>
        <seg id="3"> Does your wife (who is also your sister) not give it up any more? </seg>
    </doc>
</OnlineCommentary>

I would like to run a command on this file that extracts only the content between the opening tag <seg ...> and the closing tag </seg>.

I tried:

sed -n 's:.*<seg id="1">\(.*\)</seg>.*:\1:p' XML-file.xml > output.txt

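My questions are as follows:

-- How can I print all <seg id="*"> elements? My command only prints the content of the first tag (<seg id="*">).

-- Is there a way to print, for example, <seg id="1">, <seg id="2">, <seg id="3"> on the same line, while a doc that contains only <seg id="1"> is printed on a separate line?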

Best answer

To print every <seg id=...> element (one per line), including the <seg ...> tags themselves:

sed -n 's:.*\(<seg id="[0-9]\{1,\}">.*</seg>\).*:\1:p' XML-file.xml > output.txt
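The command above keeps the <seg ...> and </seg> tags in its output. If only the text between the tags is wanted, as the question asks, one possible variant moves the capture group inside the tags. This is a minimal sketch, not part of the original answer: the file name sample.xml, the temporary directory, and the shortened sample content are assumptions for illustration, and it relies on each <seg> element sitting on a single line, as in the question's file.

```shell
# Build a small sample file (hypothetical name and content) shaped like the question's XML.
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<OnlineCommentary>
    <doc docid="a">
        <seg id="1"> They are the same thing. </seg>
    </doc>
    <doc docid="b">
        <seg id="1"> We cannot wait. </seg>
        <seg id="2"> You seem quite uptight. </seg>
    </doc>
</OnlineCommentary>
EOF

# The capture group now sits inside the tags, so only the text is kept.
# " *" on each side drops the padding spaces around the text.
seg_text=$(sed -n 's:.*<seg id="[0-9]\{1,\}"> *\(.*[^ ]\) *</seg>.*:\1:p' "$tmpdir/sample.xml")
printf '%s\n' "$seg_text"

rm -rf "$tmpdir"
```

With the sample above this prints one segment text per line, without tags or surrounding spaces.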

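To print everything on one line, separated by commas: instead of printing each match, append it to the hold buffer (H). On the last line, copy the hold buffer back into the pattern space (g), replace each newline with a comma, remove the leading comma left by the first append, and print the result.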

sed -n '\:.*\(<seg id="[0-9]\{1,\}">.*</seg>\).*:  { s//\1/
   H
   }
$ {g
   s/\n/,/g;s/^,//
   p
   }' XML-file.xml > output.txt
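The script above joins every segment in the file onto a single line. The second question asks for something slightly different: segments of the same <doc> on one line, and each <doc> on its own line. A hedged sketch of one way to do that with the same hold-buffer idea is shown here; it is not from the original answer. It assumes each <seg> and each </doc> is on its own line, and the file name sample.xml and its content are made up for the example.

```shell
# Build a small sample file (hypothetical name and content) shaped like the question's XML.
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<OnlineCommentary>
    <doc docid="a">
        <seg id="1"> They are the same thing. </seg>
    </doc>
    <doc docid="b">
        <seg id="1"> We cannot wait. </seg>
        <seg id="2"> You seem quite uptight. </seg>
    </doc>
</OnlineCommentary>
EOF

grouped=$(sed -n '
# On a seg line: keep only the text and append it to the hold buffer.
/<seg id="[0-9]\{1,\}">/ {
    s:.*<seg id="[0-9]\{1,\}"> *\(.*[^ ]\) *</seg>.*:\1:
    H
}
# On a closing doc line: fetch the collected text, join it with commas,
# print it, then leave an empty hold buffer for the next doc.
/<\/doc>/ {
    x
    s/^\n//
    s/\n/,/g
    p
    s/.*//
    x
}' "$tmpdir/sample.xml")
printf '%s\n' "$grouped"

rm -rf "$tmpdir"
```

Flushing the hold buffer at each </doc> rather than at the last line ($) is what produces one output line per doc. A doc with no <seg> children would produce an empty line with this sketch.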

That said, @Choroba's suggestion to use a proper XML tool is a very good one: it minimizes the risk of processing unwanted data in your file.

Source: a similar question on Stack Overflow, "xml - Extract text between two XML tags using sed": https://stackoverflow.com/questions/25930230/
