ruby - Nokogiri next_element with a filter

Suppose I have a malformed HTML page:

<table>
 <thead>
  <th class="what_I_need">Super sweet text<th>
 </thead>
 <tr>
  <td>
    I also need this
  </td>
  <td>
    and this (all td's in this and subsequent tr's)
  </td>
 </tr>
 <tr>
   ...all td's here too
 </tr>
 <tr>
   ...all td's here too
 </tr>
</table>

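With BeautifulSoup, we could get the <th> and then call findNext("td"). Nokogiri has a next_element call, but that may not return what I want (in this case it would return the tr element).

Is there a way to filter Nokogiri's next_element call? For example, next_element("td")?

Edit

To clarify, I will be looking at many sites, most of which are formatted differently.

For example, the next site might be: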

<table>
 <th class="what_I_need">Super sweet text<th>
 <tr>
  <td>
    I also need this
  </td>
  <td>
    and this (all td's in this and subsequent tr's)
  </td>
 </tr>
 <tr>
   ...all td's here too
 </tr>
 <tr>
   ...all td's here too
 </tr>
</table>

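I can't assume any structure other than that there will be tr elements below the item with the class what_I_need.

Best answer

First, note that your closing th tag is malformed: <th>. It should be </th>. Fixing that helps.

One way is to find the th node and then navigate from it using XPath: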

require 'nokogiri'

html = '
<table>
<thead>
  <th class="what_I_need">Super sweet text<th>
</thead>
<tr>
  <td>
    I also need this
  </td>
<tr>
</table>
'

doc = Nokogiri::HTML(html)

th = doc.at('th.what_I_need')
th.text # => "Super sweet text"
td = th.at('../../tr/td')
td.text # => "\n    I also need this\n  "

This takes advantage of Nokogiri's ability to accept either CSS accessors or XPath in the same call, which it handles quite transparently.
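Since the question asks for something that works whether or not the heading sits inside a <thead>, a relative XPath using the following:: axis may be closer to BeautifulSoup's findNext("td") than a fixed '../../tr/td' path. The sketch below is only an illustration: the two HTML samples are cleaned-up, hypothetical versions of the layouts in the question, and the variable names are made up for the example.

require 'nokogiri'

# Two hypothetical layouts: one wraps the heading in <thead>, one does not.
with_thead = <<~HTML
  <table>
    <thead><tr><th class="what_I_need">Super sweet text</th></tr></thead>
    <tbody><tr><td>I also need this</td><td>and this</td></tr></tbody>
  </table>
HTML

without_thead = <<~HTML
  <table>
    <tr><th class="what_I_need">Super sweet text</th></tr>
    <tr><td>I also need this</td><td>and this</td></tr>
  </table>
HTML

results = [with_thead, without_thead].map do |html|
  th = Nokogiri::HTML(html).at('th.what_I_need')
  {
    # first <td> that comes after the <th> in document order
    first: th.at_xpath('following::td').text.strip,
    # every <td> that comes after the <th> in document order
    all: th.xpath('following::td').map { |td| td.text.strip }
  }
end

results.each { |r| puts r[:first], r[:all].inspect }

Note that following:: keeps going past the end of the enclosing table, so cells in any later tables on the page would also match; scope the search to the table if that matters.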

Once you have the <th> node, you can also navigate using some of Node's methods:

th.parent.next_element.at('td').text # => "\n    I also need this\n  "
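As far as I know, next_element itself takes no selector argument, so a "filtered" version has to be written by hand. Here is a minimal sketch that keeps calling next_element until the tag name matches. The helper name next_element_named and the sample row are assumptions made for this example, not part of Nokogiri or of the question's page.

require 'nokogiri'

# Hypothetical helper (not part of Nokogiri): follow next_element until a
# sibling with the requested tag name turns up; returns nil if none does.
def next_element_named(node, name)
  current = node.next_element
  current = current.next_element while current && current.name != name
  current
end

html = <<~HTML
  <table>
    <tr>
      <th class="what_I_need">Label</th>
      <th>Another heading</th>
      <td>first cell</td>
      <td>second cell</td>
    </tr>
  </table>
HTML

th = Nokogiri::HTML(html).at('th.what_I_need')
plain    = th.next_element                 # the neighbouring <th>
filtered = next_element_named(th, 'td')    # skips ahead to the first <td>
missing  = next_element_named(th, 'span')  # no such sibling, so nil

puts plain.name, filtered.text, missing.inspect

This only walks siblings of the starting node. When the cells live under a different parent, as in the question's table, an XPath such as following::td is likely the better fit.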

Another way is to start at the top of the table and look down:

table = doc.at('table')
th = table.at('th')
th.text # => "Super sweet text"
td = table.at('td')
td.text # => "\n    I also need this\n  "

If you need to access all the <td> tags in the table, you can easily iterate over them:

table.search('td').each do |td|
  # do something with the td...
  puts td.text
end

If you want the contents of all the <td> tags grouped by their containing <tr>, iterate over the rows and then over the cells:

table.search('tr').each do |tr|
  cells = tr.search('td').map(&:text)
  # do something with all the cells
end
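Putting the pieces together for pages whose layout varies, one possible approach is to locate the heading by its class, climb to the enclosing table, and collect the cell text row by row. This is a sketch under assumptions: the sample HTML is a hypothetical, well-formed page, and the summary hash is just one convenient shape for the result.

require 'nokogiri'

html = <<~HTML
  <table>
    <thead><tr><th class="what_I_need">Super sweet text</th></tr></thead>
    <tbody>
      <tr><td>I also need this</td><td>and this</td></tr>
      <tr><td>more</td></tr>
    </tbody>
  </table>
HTML

doc   = Nokogiri::HTML(html)
th    = doc.at('th.what_I_need')
table = th.ancestors('table').first   # nearest enclosing <table>

# One array of stripped cell texts per row; rows without any <td>
# (such as the heading row) come back empty and are dropped.
rows = table.search('tr')
            .map { |tr| tr.search('td').map { |td| td.text.strip } }
            .reject(&:empty?)

summary = { heading: th.text.strip, rows: rows }
puts summary.inspect

Because it searches the whole table for tr and td rather than relying on a fixed path, this does not depend on whether the page uses <thead> or <tbody>.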

This question and answer about a Nokogiri next_element with a filter come from Stack Overflow: https://stackoverflow.com/questions/11461055/
