html - Extracting links that are not inside parentheses from a Nokogiri document

Tags: html ruby nokogiri

I have a Nokogiri::HTML document. It corresponds to the content of a Wikipedia article and might look like this:

James Henry 'Jimmie' Lyons (born in Chicago, Illinois – November 6, 1892 – October 10, 1963) was a baseball player in the Negro Leagues. He pitched and played outfield and between 1910 to 1925.

The corresponding HTML is:

<p><b>James Henry 'Jimmie' Lyons</b> (born in <a href="/wiki/Chicago,_Illinois" title="Chicago, Illinois" class="mw-redirect">Chicago, Illinois</a> – November 6, 1892 – October 10, 1963) was a <a href="/wiki/Baseball" title="Baseball">baseball</a> player in the <a href="/wiki/Negro_League_baseball" title="Negro League baseball" class="mw-redirect">Negro Leagues</a>.<sup id="cite_ref-5" class="reference"><a href="#cite_note-5"><span>[</span>5<span>]</span></a></sup> He <a href="/wiki/Pitcher" title="Pitcher">pitched</a> and played <a href="/wiki/Outfielder" title="Outfielder">outfield</a> and between 1910 to 1925.

I want to extract the href value of the first <a> tag in this document that is not inside parentheses.

In this case, the correct answer is "/wiki/Baseball", the href attribute of the second link, because the first link, whose href is /wiki/Chicago,_Illinois, is inside parentheses.

Note that an <a> tag can itself contain parentheses in its href, so a naive approach like "strip everything in parentheses from the HTML" is incorrect.
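To make the pitfall concrete, here is a small sketch of what the naive approach does to a link whose href contains parentheses. The HTML fragment and the regular expression are made up for illustration; they are not taken from the article above.

```ruby
# Hypothetical fragment: the second link's href itself contains parentheses.
html = %q{<p>Mercury (the <a href="/wiki/Roman_mythology">Roman</a> god) is also a <a href="/wiki/Mercury_(planet)">planet</a>.</p>}

# Naive approach: delete every parenthesized run from the raw markup.
stripped = html.gsub(/\([^()]*\)/, '')

# The parenthetical remark is gone, but so is part of the second href.
puts stripped
```

The remaining link now points at /wiki/Mercury_ instead of /wiki/Mercury_(planet), which is why the parentheses have to be counted in the text only, not in the attributes.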

What is the best way to do this? I'm pretty sure I'll need to use Nokogiri's SAX parser, but I'd love a simpler approach if there is one.

Best answer

You can take the first link for which the preceding text nodes contain the same number of opening and closing parentheses.

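require 'nokogiri'

# Returns the href of the first <a> whose preceding text has balanced
# parentheses, or nil if there is no such link.
def first_non_parenthesized_href(html)
  doc = Nokogiri::HTML(html)
  link = doc.css('a').find do |a|
    # Only text nodes are counted, so parentheses inside href values are ignored.
    previous_text = a.xpath('preceding::text()').map(&:text).join
    previous_text.count('(') == previous_text.count(')')
  end
  link && link['href']
end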

# Original example
html = %q{<p><b>James Henry 'Jimmie' Lyons</b> (born in <a href="/wiki/Chicago,_Illinois" title="Chicago, Illinois" class="mw-redirect">Chicago, Illinois</a> - November 6, 1892 - October 10, 1963) was a <a href="/wiki/Baseball" title="Baseball">baseball</a> player in the <a href="/wiki/Negro_League_baseball" title="Negro League baseball" class="mw-redirect">Negro Leagues</a>.<sup id="cite_ref-5" class="reference"><a href="#cite_note-5"><span>[</span>5<span>]</span></a></sup> He <a href="/wiki/Pitcher" title="Pitcher">pitched</a> and played <a href="/wiki/Outfielder" title="Outfielder">outfield</a> and between 1910 to 1925.}
puts first_non_parenthesized_href(html)
# prints /wiki/Baseball

# Example from the comments
html = %q{<p><b>Science</b> (from <a href="/wiki/Latin_language" title="Latin language" class="mw-redirect">Latin</a> <i>scientia</i>, meaning "knowledge"<sup id="cite_ref-OnlineEtDict_1-0" class="reference"><a href="#cite_note-OnlineEtDict-1"><span>[</span>1<span>]</span></a></sup>) is a systematic enterprise that builds and organizes <a href="/wiki/Knowledge" title="Knowledge">knowledge</a> in the form of testable explanations and predictions about the <a href="/wiki/Universe" title="Universe">universe</a>.<sup id="cite_ref-wilson_2-0" class="reference"><a href="#cite_note-wilson-2"><span>[</span>2<span>]</span></a></sup><sup id="cite_ref-3" class="reference"><a href="#cite_note-3"><span>[</span>3<span>]</span></a></sup> In an older and closely related meaning, "science" also refers to a body of knowledge itself, of the type that can be rationally explained and reliably applied. A practitioner of science is known as a <a href="/wiki/Scientist" title="Scientist">scientist</a>.</p>}
puts first_non_parenthesized_href(html)
# prints /wiki/Knowledge

Regarding html - extracting links that are not inside parentheses from a Nokogiri document, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/19769700/
