regex - How to extract values from HTML using RegEx?

Tags: regex html-content-extraction text-extraction

Given the following HTML:

<p><span class="xn-location">OAK RIDGE, N.J.</span>, <span class="xn-chron">March 16, 2011</span> /PRNewswire/ -- Lakeland Bancorp, Inc. (Nasdaq:   <a href='http://studio-5.financialcontent.com/prnews?Page=Quote&Ticker=LBAI' target='_blank' title='LBAI'> LBAI</a>), the holding company for Lakeland Bank, today announced that it redeemed <span class="xn-money">$20 million</span> of the Company's outstanding <span class="xn-money">$39 million</span> in Fixed Rate Cumulative Perpetual Preferred Stock, Series A that was issued to the U.S. Department of the Treasury under the Capital Purchase Program on <span class="xn-chron">February 6, 2009</span>, thereby reducing Treasury's investment in the Preferred Stock to <span class="xn-money">$19 million</span>. The Company paid approximately <span class="xn-money">$20.1 million</span> to the Treasury to repurchase the Preferred Stock, which included payment for accrued and unpaid dividends for the shares. &#160;This second repayment, or redemption, of Preferred Stock will result in annualized savings of <span class="xn-money">$1.2 million</span> due to the elimination of the associated preferred dividends and related discount accretion. &#160;A one-time, non-cash charge of <span class="xn-money">$745 thousand</span> will be incurred in the first quarter of 2011 due to the acceleration of the Preferred Stock discount accretion. &#160;The warrant previously issued to the Treasury to purchase 997,049 shares of common stock at an exercise price of <span class="xn-money">$8.88</span>, adjusted for stock dividends and subject to further anti-dilution adjustments, will remain outstanding.</p>

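I want to get the values inside the <span> elements. I also want to get the value of the class attribute on each <span> element.

Ideally, I could run some HTML through a function and get back a dictionary of the extracted entities (based on the <span> parsing described above).

The snippet above is a fragment of a larger source HTML file that does not parse cleanly with an XML parser. So I am looking for a regular expression that could help extract the information of interest.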

Best answer

Use this tool (free): http://www.radsoftware.com.au/regexdesigner/

Use this regular expression:

"<span[^>]*>(.*?)</span>"

The value in group 1 (for each match) will be the text you need.

In C# it looks like this:

            Regex regex = new Regex("<span[^>]*>(.*?)</span>");
            string toMatch = "<span class=\"ajjsjs\">Some text</span>";
            if (regex.IsMatch(toMatch))
            {
                MatchCollection collection = regex.Matches(toMatch);
                foreach (Match m in collection)
                {
                    string val = m.Groups[1].Value;
                    //Do something with the value
                }
            }

Revised in response to the comments, to capture the class attribute as well:

            Regex regex = new Regex("<span class=\"(.*?)\">(.*?)</span>");
            string toMatch = "<span class=\"ajjsjs\">Some text</span>";
            if (regex.IsMatch(toMatch))
            {
                MatchCollection collection = regex.Matches(toMatch);
                foreach (Match m in collection)
                {
                    string className = m.Groups[1].Value; // "class" is a reserved word in C#
                    string val = m.Groups[2].Value;
                    //Do something with the class and value
                }
            }
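
The question asks for a function that returns a dictionary of the extracted entities. A minimal sketch of how the pattern above might be wrapped for that purpose follows. The names `SpanExtractor` and `Extract` are hypothetical and not part of the original answer. It assumes the spans are not nested and that each span of interest has a quoted class attribute. Unlike the pattern above, it accepts single or double quotes, tolerates other attributes on the tag, and lets the text span multiple lines.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original answer.
public static class SpanExtractor
{
    // Group 1 captures the opening quote so \1 can require the same closing quote.
    // "cls" is the class attribute value, "val" is the text between the tags.
    private static readonly Regex SpanPattern = new Regex(
        "<span\\b[^>]*?\\bclass\\s*=\\s*(['\"])(?<cls>.*?)\\1[^>]*>(?<val>.*?)</span>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns a map from class name to every span text found with that class,
    // in document order. Assumes spans are not nested inside each other.
    public static Dictionary<string, List<string>> Extract(string html)
    {
        var result = new Dictionary<string, List<string>>();
        foreach (Match m in SpanPattern.Matches(html))
        {
            string cls = m.Groups["cls"].Value;
            string val = m.Groups["val"].Value.Trim();
            if (!result.TryGetValue(cls, out List<string> values))
            {
                values = new List<string>();
                result[cls] = values;
            }
            values.Add(val);
        }
        return result;
    }
}
```

Calling `SpanExtractor.Extract(html)` on the snippet from the question should group the texts under keys such as `xn-location`, `xn-chron` and `xn-money`. Because a class like `xn-money` occurs several times, each key maps to a list rather than a single string. A regex of this kind remains fragile on arbitrary HTML, so a tolerant HTML parser may be a safer choice if the input varies much.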

Regarding "regex - How to extract values from HTML using RegEx?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/5327503/
