html - Regex problem matching an HTML tag

So I'm trying to use sed (it has to be sed on these systems, so please don't just recommend Perl) to match an HTML tag and extract its contents. The HTML looks like this:

<div class="SectionText"> Received poor service or think your current mechanic is ripping you off? Get some help from <a href="http://www.union.umd.edu/gradlegalaid/index.htm" target="_blank">Graduate Legal Aid</a> or consult the <a href="http://www.oag.state.md.us/Consumer/index.htm" target="_blank">Maryland Attorney General Office of Consumer Protection</a> at <a href="mailto:consumer@oag.state.md.us">consumer@oag.state.md.us</a> or through their hotline at 410-528-8662 or 888-743-0023.<br /></div>

It is all on one line. So I wrote this, but it doesn't work:

sed 's/<div class=\"SectionText\">\([^<\/div>]*\)<\/div>/\1/g'

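This doesn't change the text at all.

I tried using this site as a guide: http://www.ibm.com/developerworks/linux/library/l-sed2/index.html (under "RegExp Snafus").

The key requirement is that this one-line script must not be greedy and match all the way to the end of the line.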

Best answer

Aside from the fact that you're trying to use a regex on HTML (see "RegEx match open tags except XHTML self-contained tags"), the first problem I see is:

[^<\/div>]*

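This says: match any character that is not <, /, d, i, v, or >. It is a set of individual characters, not the string </div>. Your text clearly contains a d and an i ("Received poor service..."), so the match stops long before the closing tag.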
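To see the character-set behavior in isolation, here is a minimal sketch that wraps whatever the bracket expression matches in square brackets. The variable names text and out are only for illustration and are not part of the original question.

```shell
# [^<\/div>]* is a negated character set: it consumes characters
# until it meets any one of the listed characters.
text='Received poor service'

# Wrap the matched portion in [ ] so it is easy to see where it stops.
out=$(printf '%s\n' "$text" | sed 's/[^<\/div>]*/[&]/')

printf '%s\n' "$out"   # prints: [Rece]ived poor service
```

The match ends at the first i in "Received", which is why the original expression can never reach the closing </div>.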

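If you're set on using a regex for this, and your input is tightly controlled and predictable, you could simply use [^<>]*, assuming the text will never contain those characters. However, that won't work here, because your div has tags inside it...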
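As a rough illustration of that limitation, the sketch below runs the [^<>]* version against two made-up inputs, one with plain text only and one with a nested link. The sample strings and variable names are assumptions for the demo, not the asker's data.

```shell
# Hypothetical inputs: one div with plain text, one with a nested tag.
simple='<div class="SectionText">plain text only</div>'
nested='<div class="SectionText">see <a href="#">this</a> link</div>'

# [^<>]* only matches content that contains no angle brackets at all.
re='s/<div class="SectionText">\([^<>]*\)<\/div>/\1/'

out_simple=$(printf '%s\n' "$simple" | sed "$re")
out_nested=$(printf '%s\n' "$nested" | sed "$re")

printf '%s\n' "$out_simple"   # prints: plain text only
printf '%s\n' "$out_nested"   # unchanged: the nested <a> blocks the match
```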

However, if you do this:

sed 's/<div.class="SectionText">\(.*\)<\/div>/\1/g'

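it should work as long as the line doesn't contain more than one </div>. Note that .* is greedy: it matches up to the last <\/div> on the line, not the first.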
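The following sketch tries that command on a shortened version of the sample line and on a made-up line with two divs, to show where the greedy .* helps and where it overreaches. The inputs and variable names are illustrative assumptions.

```shell
# Shortened stand-in for the sample line, plus a hypothetical two-div line.
one='<div class="SectionText">Get help from <a href="x.htm">Legal Aid</a>.<br /></div>'
two='<div class="SectionText">first</div><p>gap</p><div class="SectionText">second</div>'

# The answer's expression; the . after div matches the single space.
re='s/<div.class="SectionText">\(.*\)<\/div>/\1/g'

out_one=$(printf '%s\n' "$one" | sed "$re")
out_two=$(printf '%s\n' "$two" | sed "$re")

printf '%s\n' "$out_one"   # prints: Get help from <a href="x.htm">Legal Aid</a>.<br />
printf '%s\n' "$out_two"   # prints: first</div><p>gap</p><div class="SectionText">second
```

With a single div per line the wrapper is stripped cleanly. With two, the greedy .* runs to the last </div>, so only the outermost pair is removed. POSIX regular expressions, which sed uses, have no non-greedy quantifier, so a line with several divs would need a different approach.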

Source: the original question on Stack Overflow, https://stackoverflow.com/questions/9744246/
