ruby - Searching by text with Mechanize/Nokogiri

I am trying to scrape average GPA figures, along with some other data, from many pages similar to this one:

http://www.ptcas.org/ptcas/public/Listing.aspx?seqn=3200&navid=10737426783

require 'mechanize'

agent = Mechanize.new
page = agent.get('http://www.ptcas.org/ptcas/public/Listing.aspx?seqn=3200&navid=10737426783')
gpa_headers = page.xpath('//h3[contains(text(), "GPA")]')
pp gpa_headers

My problem is that gpa_headers comes back nil, even though the page has at least one h3 element containing "GPA".

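What is causing this? I thought it might be because the page has dynamic elements, which Mechanize has some trouble with, but I can run puts page.body and the output includes: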

... <h3 style="text-align:center;">GPA REQUIREMENT</h3> ...

As I understand it, the XPath I am using should find that element.

If there is a better way to do this, I would like to know that too.

Best answer

This looks like a problem with the site's markup: it contains a malformed tag named style that is never closed. It looks like this:

<td colspan='7'><style='text-align:center;font-style:italic'>The
institution has been granted Candidate for Accreditation status by the
Commission on Accreditation in Physical Therapy Education (1111 North
Fairfax Street, Alexandria, VA, 22314; phone: 703.706.3245; email: <a
href='mailto:accreditation@apta.org'>accreditation@apta.org</a>).
Candidacy is not an accreditation status nor does it assure eventual
accreditation. Candidate for Accreditation is a pre-accreditation
status of affiliation with the Commission on Accreditation in Physical
Therapy Education that indicates the program is progressing toward
accreditation.</td>

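As you can see, the td tag is closed but the style inside it never is. The parser most likely treats everything after it, including the h3 you are looking for, as the content of that style element.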

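If you don't need this part of the markup, I suggest removing it before you work with the response. I have no Ruby experience, but I would do something like this: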

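1. Get the raw body of the response.
2. Replace the part that matches the regular expression '(<style=\'.*)</td>' with an empty string, or close the tag yourself.
3. Use this new response body.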
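
The steps above could be sketched in Ruby roughly as follows. This is only a minimal sketch under a few assumptions: strip_unclosed_style and UNCLOSED_STYLE are hypothetical names, not part of Mechanize or Nokogiri; the sample string is a shortened stand-in for the real page body; and the pattern is a variation on the regular expression above that matches non-greedily across line breaks and leaves the closing </td> in place so the table stays well-formed.

```ruby
# Hypothetical pattern: the malformed <style='...'> pseudo-tag and everything
# after it, up to (but not including) the next </td>. The /m flag lets "."
# match newlines, since the offending text spans several lines.
UNCLOSED_STYLE = /<style='.*?(?=<\/td>)/m

# Hypothetical helper: returns a copy of the HTML with the unclosed
# style fragment removed.
def strip_unclosed_style(html)
  html.gsub(UNCLOSED_STYLE, '')
end

# Shortened stand-in for the real response body (assumption).
sample = <<~HTML
  <table><tr><td colspan='7'><style='text-align:center;font-style:italic'>The
  institution has been granted Candidate for Accreditation status.</td></tr></table>
  <h3 style="text-align:center;">GPA REQUIREMENT</h3>
HTML

cleaned = strip_unclosed_style(sample)
puts cleaned
```

With Mechanize, the same helper could presumably be applied to page.body, and the cleaned string handed to Nokogiri::HTML before running the original XPath query against the resulting document.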

After that, your XPath selector should work.

Source: a similar question on Stack Overflow, https://stackoverflow.com/questions/41254686/
