ruby - Nokogiri/Mechanize: extract div content?

Tags: ruby web-scraping screen-scraping nokogiri mechanize

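I am parsing a review site to find the ratings given to a particular company.

Ratings range from 1 to 5, and they can all be extracted with the following code: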

require 'mechanize'

a = Mechanize.new
page = a.get(url)
reviews = page.search(".reviewcontent")
reviews.each do |r|
  rating = r.at_css(".s1, .s2, .s3, .s4, .s5")
  puts rating          # => <span class="s5" itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
                       #      <meta itemprop="worstRating" content="1">
                       #      <meta itemprop="bestRating" content="5">
                       #      <meta itemprop="ratingValue" content="5"></span>
  puts rating.inspect  # => #<Nokogiri::XML::Element:0x3fe0e108783c name="span" attributes=[#<Nokogiri::XML::Attr:0x3fe0e1087440 name="class" value="s5">, #<Nokogiri::XML::Attr:0x3fe0e108742c name="itemprop" value="reviewRating">, #<Nokogiri::XML::Attr:0x3fe0e1087404 name="itemscope">, #<Nokogiri::XML::Attr:0x3fe0e10873dc name="itemtype" value="http://schema.org/Rating">] children=[#<Nokogiri::XML::Text:0x3fe0e108648c "\r\n            ">, #<Nokogiri::XML::Element:0x3fe0e108634c name="meta" attributes=[#<Nokogiri::XML::Attr:0x3fe0e108625c name="itemprop" value="worstRating">, #<Nokogiri::XML::Attr:0x3fe0e1086248 name="content" value="1">]>, #<Nokogiri::XML::Element:0x3fe0e10898bc name="meta" attributes=[#<Nokogiri::XML::Attr:0x3fe0e10897cc name="itemprop" value="bestRating">, #<Nokogiri::XML::Attr:0x3fe0e10897b8 name="content" value="5">]>, #<Nokogiri::XML::Element:0x3fe0e1088b10 name="meta" attributes=[#<Nokogiri::XML::Attr:0x3fe0e1088994 name="itemprop" value="ratingValue">, #<Nokogiri::XML::Attr:0x3fe0e1088980 name="content" value="5">]>]>
end

I am interested in this line: <meta itemprop="ratingValue" content="5">, and in particular the value of content, which is 5 in this case.

How do I extract this value?

Edit:
puts reviews.to_html gives this result:
<div class="reviewcontent">
    <p class="r-m ">
        <span class="s5" itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
            <meta itemprop="worstRating" content="1">
<meta itemprop="bestRating" content="5">
<meta itemprop="ratingValue" content="5"></span>
    </p>


<time datetime="2011-09-15T18:16:10.0000000+02:00" class="ndate strong" title="15. september 2011 - 18:16:10" pubdate>
    15. september 2011
    <span title="2011-09-15T18:16:10.0000000+02:00"></span>
</time><meta itemprop="dateCreated" content="2011-09-15T18:16:10.0000000+02:00">
<h3 itemprop="headline" class="summary da">
            <a href="http://www.trustpilot.dk/review/scandicfly.dk/4e7240ea00006400020e3b0e" class="showReview">Tip Top</a>
        </h3>
        <p itemprop="reviewBody">
            Bestilte en del fluer, en krogskærper og andre småting.<br>Kom 3 dage efter bestilling og alt var, som det skulle.
        </p>
        <span class="imagezoom">

        </span>
        <div class="actions">

            <input type="hidden" name="ReviewId" value="4e7240ea00006400020e3b0e"><input type="hidden" name="UserName" value="Strit"><a href="http://www.trustpilot.dk/review/scandicfly.dk/4e7240ea00006400020e3b0e#allcomments" class="comments fb-comments-label" id="FB-comment-box-0">
                        <span></span>
                        Kommentar (<comments-count     href="http://trustpilot.com/review/scandicfly.dk#4e7240ea00006400020e3b0e">?</comments-count>)
                </a>
                <a class="useful" data-reviewid="4e7240ea00006400020e3b0e" href="#"><span>    </span>
                    Find nyttig
                </a>

                <a class="replyAsCompany" href="#"><span></span>
                    Svar som firma
                </a>

                <a class="report" data-reviewid="932622" href="#"><span></span>
                    Rapportér
                </a>

        </div>
        <div class="fb-comments-wrapper">
            <div class="social-guidelines"><a href="/social">Sociale retningslinjer</a></div>
        </div>
            <div class="companyComments" id="CompanyComments_932622">
            <div class="companyComments" id="CompanyComments_4e7240ea00006400020e3b0e">    
            </div>
        </div>

    </div><div class="reviewcontent">
    <p class="r-m ">
        <span class="s5" itemprop="reviewRating" itemscope     itemtype="http://schema.org/Rating">
            <meta itemprop="worstRating" content="1">
<meta itemprop="bestRating" content="5">
<meta itemprop="ratingValue" content="5"></span>
    </p>


<time datetime="2011-04-05T16:05:06.0000000+02:00" class="ndate" title="5. april 2011 - 16:05:06" pubdate>
    5. april 2011
    <span title="2011-04-05T16:05:06.0000000+02:00"></span>
</time><meta itemprop="dateCreated" content="2011-04-05T16:05:06.0000000+02:00">
<h3 itemprop="headline" class="summary da">
            <a     href="http://www.trustpilot.dk/review/scandicfly.dk/4d9b3db2000064000209035f"     class="showReview">en god og flot oplevelse</a>
        </h3>
        <p itemprop="reviewBody">
            Købte en fiskestang hos ScandicFly. Faktra ordrebekræftigelse og det hele     præsenteret meget flot. Der kom desuden et notis om min fiskestang var afsendt.<br>Et par dage efter kom min fiskestang med posten forsvarligt pakket ind.
        </p>
        <span class="imagezoom">

        </span>
        <div class="actions">

            <input type="hidden" name="ReviewId" value="4d9b3db2000064000209035f"><input type="hidden" name="UserName" value="Peter Leter"><a href="http://www.trustpilot.dk/review/scandicfly.dk/4d9b3db2000064000209035f#allcomments" class="comments fb-comments-label" id="FB-comment-box-1">
                    <span></span>
                    Kommentar (<comments-count     href="http://trustpilot.com/review/scandicfly.dk#4d9b3db2000064000209035f">?</comments-count>)
                </a>
                <a class="useful" data-reviewid="4d9b3db2000064000209035f" href="#"><span></span>
                    Find nyttig
                </a>

                <a class="replyAsCompany" href="#"><span></span>
                    Svar som firma
                </a>

                <a class="report" data-reviewid="590687" href="#"><span></span>
                    Rapportér
                </a>

        </div>
        <div class="fb-comments-wrapper">
            <div class="social-guidelines"><a href="/social">Sociale retningslinjer</a></div>
        </div>
        <div class="companyComments" id="CompanyComments_590687">
            <div class="companyComments" id="CompanyComments_4d9b3db2000064000209035f">    
            </div>
        </div>

Best answer

You can use the XPath below to get it:

require 'nokogiri'

doc = Nokogiri::HTML::Document.parse <<-_HTML_
<span class="s5" itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
      <meta itemprop="worstRating" content="1">
    <meta itemprop="bestRating" content="5">
    <meta itemprop="ratingValue" content="5">
</span>
_HTML_

doc.at("//meta[@itemprop = 'ratingValue']/@content").to_s
# => "5"

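In your case, write it as follows (note the leading dot, which keeps the XPath relative to the current review rather than the whole document):
r.at_css(".s1, .s2, .s3, .s4, .s5").at(".//meta[@itemprop = 'ratingValue']/@content").to_s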

Regarding "ruby - Nokogiri/Mechanize: extract div content?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/17950144/
