regex - Combining a tag-removal regex and blank-line deletion in sed - Unix

Tags: regex xml bash unix sed

Given a markup file like this:

<srcset setid="newstest2015" srclang="any">
<doc sysid="ref" docid="1012-bbc" genre="news" origlang="en">
<p>
<seg id="1">India and Japan prime ministers meet in Tokyo</seg>
<seg id="2">India's new prime minister, Narendra Modi, is meeting his Japanese counterpart, Shinzo Abe, in Tokyo to discuss economic and security ties, on his first major foreign visit since winning May's election.</seg>
<seg id="3">Mr Modi is on a five-day trip to Japan to strengthen economic ties with the third largest economy in the world.</seg>
<seg id="4">High on the agenda are plans for greater nuclear co-operation.</seg>
<seg id="5">India is also reportedly hoping for a deal on defence collaboration between the two nations.</seg>
</p>
</doc>
<doc sysid="ref" docid="1018-lenta.ru" genre="news" origlang="ru">
<p>
<seg id="1">FANO Russia will hold a final Expert Session</seg>
<seg id="2">The Federal Agency of Scientific Organizations (FANO Russia), in joint cooperation with RAS, will hold the third Expert Session on “Evaluating the effectiveness of activities of scientific organizations”.</seg>
<seg id="3">The gathering will be the final one in a series of meetings held by the agency over the course of the year, reports a press release delivered to the editorial offices of Lenta.ru.</seg>
<seg id="4">At the third meeting, it is planned that the results of the work conducted by the Expert Session over the past year will be presented and that a final checklist to evaluate the effectiveness of scientific organizations will be developed.</seg>
<seg id="5">In addition, participants at the event plan to discuss the rules for forming an expert panel, which is responsible for evaluating the work of scientific groups, as well as the criteria for carrying out evaluations.</seg>
<seg id="6">The third Expert Session will be the final meeting in a series of events on the formation of a unified approach for all three academies to the evaluation of the effectiveness of activities of scientific organizations.</seg>
<seg id="7">Over the past five months, we were able to achieve this, and the final version of the regulatory documents is undergoing approval.</seg>
<seg id="8">According to the plans for the upcoming session, we should complete the development of procedures for scientometric and expert analysis, and come to an agreement on the stages and timeframes for the evaluation process”, said the Head of FANO’s Expert-Analytical Department, Elena Aksenova.</seg>
<seg id="9">Representatives from more than one hundred Russian scientific institutes will take part in the event.</seg>
<seg id="10">It is expected that a resolution will be adopted based on its results.</seg>
<seg id="11">The meeting will begin at 10 am, Moscow time, on September 16, 2014, at the following address: 14 Solyanka Street, Moscow.</seg>
</p>
</doc>
</srcset>

I can strip the markup tags using the approach from "Sed remove tags from html file":

sed -e 's/<[^>]*>//g' file.txt 

That leaves blank lines in the output, so I then have to apply the approach from "Delete empty lines using SED":

sed -e 's/<[^>]*>//g' file.txt | sed '/^\s*$/d'

How can I combine the tag-removal and blank-line-removal expressions into a single command?

Best answer

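How about deleting the blank lines in the same pass? sed runs the commands in order on each line, so the delete command sees the line after the tags have already been stripped: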

sed -e 's/<[^>]*>//g;/^\s*$/d' file.txt
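
As a minimal sketch of the same idea, the snippet below runs the combined command on a trimmed copy of the file from the question. The file name sample.txt and the function name strip_tags are made up for illustration. It also swaps \s, which is a GNU sed extension, for the POSIX class [[:space:]], on the assumption that the command may need to run on a non-GNU sed.

```shell
# Hypothetical sample input: a trimmed version of the file in the question.
cat > sample.txt <<'EOF'
<doc sysid="ref" docid="1012-bbc" genre="news" origlang="en">
<p>
<seg id="1">India and Japan prime ministers meet in Tokyo</seg>
<seg id="4">High on the agenda are plans for greater nuclear co-operation.</seg>
</p>
</doc>
EOF

# strip_tags is an illustrative helper name, not part of the original answer.
# First expression: remove every <...> tag on the line.
# Second expression: delete the line if only whitespace (or nothing) is left.
# [[:space:]] is the POSIX spelling of GNU sed's \s.
strip_tags() {
  sed -e 's/<[^>]*>//g' -e '/^[[:space:]]*$/d' "$1"
}

strip_tags sample.txt
```

The order of the two expressions matters: if the delete ran first, lines such as <p> would not yet be empty and would survive as blank lines after the substitution. Two -e options are equivalent to joining the commands with a semicolon, as in the answer above.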

Source: this question and answer come from a similar question on Stack Overflow: https://stackoverflow.com/questions/40230164/
