ruby - Random Timeout::Error exception in Ruby with the Mechanize gem

Tags: ruby web-scraping mechanize rails-activerecord mechanize-ruby

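I am building an application with Ruby 1.9.3-p327 that fetches and parses some pages (scraping) and then, depending on certain values, inserts or updates records in a database. The application uses the Mechanize gem for fetching and parsing, and accesses the database (MySQL) through the activerecord gem.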

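The strange problem is that a Timeout::Error exception is raised at random. Sometimes it never happens, then perhaps two days later it occurs again, each time with a different record or page type. The exception log is: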

/root/.rbenv/versions/1.9.3-p327/lib/ruby/1.9.1/net/protocol.rb:146:in `rescue in rbuf_fill': too many connection resets (due to Timeout::Error - Timeout::Error) after 0 requests on 21716860, last used 1378984537.2796552 seconds ago (Net::HTTP::Persistent::Error)
    from /root/.rbenv/versions/1.9.3-p327/lib/ruby/1.9.1/net/protocol.rb:140:in `rbuf_fill'
    from /root/.rbenv/versions/1.9.3-p327/lib/ruby/1.9.1/net/protocol.rb:122:in `readuntil'
    from /root/.rbenv/versions/1.9.3-p327/lib/ruby/1.9.1/net/protocol.rb:132:in `readline'
    from /root/.rbenv/versions/1.9.3-p327/lib/ruby/1.9.1/net/http.rb:2562:in `read_status_line'
    from /root/.rbenv/versions/1.9.3-p327/lib/ruby/1.9.1/net/http.rb:2551:in `read_new'
    from /root/.rbenv/versions/1.9.3-p327/lib/ruby/1.9.1/net/http.rb:1319:in `block in transport_request'
    from /root/.rbenv/versions/1.9.3-p327/lib/ruby/1.9.1/net/http.rb:1316:in `catch'
    from /root/.rbenv/versions/1.9.3-p327/lib/ruby/1.9.1/net/http.rb:1316:in `transport_request'
    from /root/.rbenv/versions/1.9.3-p327/lib/ruby/1.9.1/net/http.rb:1293:in `request'
    from /root/.rbenv/versions/1.9.3-p327/lib/ruby/gems/1.9.1/gems/net-http-persistent-2.9/lib/net/http/persistent.rb:986:in `request'
    from /root/.rbenv/versions/1.9.3-p327/lib/ruby/gems/1.9.1/gems/mechanize-2.7.2/lib/mechanize/http/agent.rb:257:in `fetch'
    from /root/.rbenv/versions/1.9.3-p327/lib/ruby/gems/1.9.1/gems/mechanize-2.7.2/lib/mechanize.rb:432:in `get'
    from /root/notificador-corte/lib/downloader.rb:10:in `fetch'
    from /root/notificador-corte/worker.rb:63:in `fetch_page'
    from /root/notificador-corte/worker.rb:49:in `process_causa'
    from /root/notificador-corte/worker.rb:41:in `block in worker_main_cycle'
    from /root/.rbenv/versions/1.9.3-p327/lib/ruby/gems/1.9.1/gems/activerecord-4.0.0/lib/active_record/relation/delegation.rb:13:in `each'
    from /root/.rbenv/versions/1.9.3-p327/lib/ruby/gems/1.9.1/gems/activerecord-4.0.0/lib/active_record/relation/delegation.rb:13:in `each'
    from /root/notificador-corte/worker.rb:39:in `worker_main_cycle'
    from /root/notificador-corte/worker.rb:26:in `run'
    from /root/notificador-corte/app.rb:12:in `<main>'

Line 10 of downloader.rb is inside the definition of the fetch method:

def fetch(url)
  begin
    @agent.get(url)
  rescue Errno::ETIMEDOUT, Timeout::Error => exception
    # Timeouts are swallowed here, so fetch returns nil in that case.
  end
end

Line 63 of worker.rb contains the call to the fetch method.

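After reading the documentation, I tried setting the read_timeout and open_timeout attributes on the agent (Mechanize), and also experimented with idle_timeout and keep_alive, but the error still occurs at random.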
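For reference, the four settings named above might be applied roughly as in the sketch below. The helper name `configure_timeouts` and the numeric values are assumptions chosen for illustration; the question does not say which values were actually tried, and these settings alone did not remove the error.

```ruby
# Hypothetical helper: applies timeout settings to a Mechanize agent
# (or any object exposing the same four setters).
# The default values are illustrative assumptions, not recommendations.
def configure_timeouts(agent, open_timeout = 10, read_timeout = 30, idle_timeout = 5)
  agent.open_timeout = open_timeout  # seconds to wait while opening the connection
  agent.read_timeout = read_timeout  # seconds to wait for each read from the socket
  agent.idle_timeout = idle_timeout  # reset connections unused for longer than this
  agent.keep_alive   = false         # do not reuse connections between requests
  agent
end

# Usage with the real gem (requires `require 'mechanize'`):
#   @agent = configure_timeouts(Mechanize.new)
```

Disabling keep_alive trades some speed for a fresh connection per request, which may help if stale persistent connections are involved, though the trace alone does not prove that they are.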

The contents of the Gemfile are:

gem 'activerecord', "~> 4.0.0" 
gem 'mechanize', "~> 2.7.1"
gem 'mysql', '~> 2.9.1'
gem 'actionmailer', "~> 4.0.0" 
gem 'rspec', "~> 2.14.1"

Best answer

I don't think this is necessarily a bug in your code or in Mechanize itself. It is most likely a network problem.

I would rather implement a strategy in the rescue clause so that, whenever this error occurs, you can be sure the request is retried later.
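One way such a retry strategy could look is sketched below. The helper name `with_retries`, the attempt limit, and the exponential delay are assumptions made for illustration; the answer itself does not prescribe a particular policy.

```ruby
require 'timeout'

# Hypothetical helper: runs the block and retries it when one of the
# listed errors is raised, sleeping a little longer before each retry.
# After max_attempts failures the last error is re-raised to the caller.
def with_retries(max_attempts = 3, base_delay = 1.0,
                 errors = [Timeout::Error, Errno::ETIMEDOUT])
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue *errors
    raise if attempt >= max_attempts          # give up: re-raise the current error
    sleep(base_delay * (2**(attempt - 1)))    # wait 1x, 2x, 4x ... base_delay
    retry                                     # run the begin block again
  end
end

# Possible use inside fetch (requires the mechanize gem):
#   retryable = [Timeout::Error, Errno::ETIMEDOUT, Net::HTTP::Persistent::Error]
#   with_retries(3, 1.0, retryable) { @agent.get(url) }
```

Note that the exception reported in the trace is Net::HTTP::Persistent::Error, which wraps the underlying Timeout::Error. The rescue clause in the question does not list that class, so it probably needs to be included among the retryable errors, as in the commented usage line.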

Source: the original question, "Random Timeout::Error exception in Ruby with the Mechanize gem", on Stack Overflow: https://stackoverflow.com/questions/18770426/
