ruby - Searching for all elements before an h2 element in hpricot/nokogiri

Tags: ruby parsing nokogiri hpricot wiktionary

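I am trying to parse a Wiktionary entry to retrieve all of its English definitions. I can retrieve every definition, but some of them belong to other languages. What I would like to do is retrieve only the HTML block that contains the English definitions. I have found that, when the page has entries for other languages, the header that follows the English definitions can be retrieved with: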

header = (doc/"h2")[3]

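So I would like to search only the elements that come before this header element. I thought header.preceding_siblings() might do it, but that does not seem to work. Any suggestions?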

Best answer

You can use the visitor pattern with Nokogiri. This code removes everything from the h2 of the other language's definitions onward:

require 'nokogiri'
require 'open-uri'

class Visitor
  def initialize(node)
    @node = node
  end

  def visit(node)
    if @remove || @node == node
      node.remove
      @remove = true
      return
    end
    node.children.each do |child|
      child.accept(self)
    end
  end
end

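# The page is HTML, so use the HTML parser. URI.open is needed on Ruby 3.0+,
# where Kernel#open no longer accepts URLs.
doc = Nokogiri::HTML(URI.open('http://en.wiktionary.org/wiki/pony'))
node = doc.search("h2")[2]  # In this case, the Italian h2 is at index 2. Your page may differ.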

doc.root.accept(Visitor.new(node))  #Removes all page contents starting from node

For "ruby - Searching for all elements before an h2 element in hpricot/nokogiri", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/1452443/
