ruby - 503 error when using Ruby's open-uri to access a specific site

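I have been using the code below to scrape a website, but I think I may have crawled it too often and gotten myself banned from the site entirely. I can still reach the site in my browser, but any code that uses open-uri against this site throws a 503 Service Unavailable error. I believe this is site-specific, because open-uri still works fine with Google and Facebook. Is there a workaround?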

require 'rubygems'
require 'hpricot'
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open("http://www.quora.com/What-is-the-best-way-to-get-ove$

topic = doc.at('span a.topic_name span').content
puts topic

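Best answer

There are workarounds, but the best approach is to be a good citizen on their terms. You may want to confirm that you are following their Terms of Service: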

If you operate a search engine or robot, or you republish a significant fraction of all Quora Content (as we may determine in our reasonable discretion), you must additionally follow these rules:

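You must use a descriptive user agent header.
You must follow robots.txt at all times.
You must make it clear how to contact you, either in your user agent string, or on your website if you have one.

You can easily set your User-Agent header with OpenURI: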

Additional header fields can be specified by an optional hash argument.

  open("http://www.ruby-lang.org/en/",
    "User-Agent" => "Ruby/#{RUBY_VERSION}",
    "From" => "foo@bar.invalid",
    "Referer" => "http://www.ruby-lang.org/") {|f|
    # ...
  }

Robots.txt can be retrieved from http://www.quora.com/robots.txt. You will need to parse it and honor its settings, or they will ban you again.
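As a rough illustration of what honoring robots.txt can look like, here is a minimal sketch. It is an assumption-laden simplification: it only understands User-agent and Disallow lines (no Allow precedence, wildcards, or Crawl-delay), and the method names, the ExampleBot agent, and the sample rules are invented for this example. They are not Quora's actual rules.

```ruby
# Minimal robots.txt check. A sketch only: handles User-agent and Disallow
# lines and ignores Allow precedence, wildcards, and Crawl-delay.

# Returns the Disallow path prefixes that apply to `agent`,
# falling back to the "*" group when no named group matches.
def disallowed_paths(robots_txt, agent)
  rules = {}
  current_agents = []
  in_rules = false

  robots_txt.each_line do |line|
    line = line.sub(/#.*/, '').strip # drop comments and whitespace
    next if line.empty?

    field, value = line.split(':', 2).map(&:strip)
    next if value.nil?

    case field.downcase
    when 'user-agent'
      # A User-agent line that follows rule lines starts a new group.
      current_agents = [] if in_rules
      in_rules = false
      current_agents << value.downcase
    when 'disallow'
      in_rules = true
      current_agents.each do |name|
        rules[name] ||= []
        # An empty Disallow value means "nothing is disallowed".
        rules[name] << value unless value.empty?
      end
    when 'allow'
      in_rules = true # recognized as a rule line, but not evaluated here
    end
  end

  key = rules.keys.find { |name| name != '*' && agent.downcase.include?(name) }
  rules.fetch(key || '*', [])
end

# True when `path` does not start with any disallowed prefix for `agent`.
def allowed?(robots_txt, agent, path)
  disallowed_paths(robots_txt, agent).none? { |prefix| path.start_with?(prefix) }
end

# Hypothetical rules, for illustration only.
sample_rules = <<~ROBOTS
  User-agent: *
  Disallow: /private/
ROBOTS

puts allowed?(sample_rules, 'ExampleBot/0.1', '/private/page')
```

In a real crawler you would fetch the robots.txt body once, keep it around, and call allowed? before each request. A maintained robots.txt library is likely a better choice than this sketch for anything beyond a quick script.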

Also, you may want to limit the speed of your code by sleeping between loops.
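A minimal sketch of sleeping between requests might look like the following. The helper names, the two-second default delay, and the ExampleBot user agent string with its contact details are all assumptions for illustration; pick a delay and an identity that suit the site you are crawling. It uses URI.open, which is the open-uri entry point on Ruby 2.5 and later.

```ruby
require 'open-uri'

# Hypothetical user agent: descriptive and with contact details, as the
# Terms of Service quoted above ask for.
USER_AGENT = 'ExampleBot/0.1 (+http://example.com/bot; bot@example.com)'

# Yields each URL in turn, sleeping `delay` seconds between consecutive URLs.
def each_throttled(urls, delay: 2.0)
  urls.each_with_index do |url, index|
    sleep(delay) if index > 0
    yield url
  end
end

# Fetches every URL politely and returns a hash of url => response body.
def fetch_pages(urls, delay: 2.0)
  pages = {}
  each_throttled(urls, delay: delay) do |url|
    pages[url] = URI.open(url, 'User-Agent' => USER_AGENT, &:read)
  end
  pages
end
```

Calling fetch_pages with a list of URLs would then issue the requests one at a time, no faster than one every `delay` seconds plus the time each request takes.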

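Also, if you are spidering their site for content, you may want to look into caching pages locally, or using one of the existing spidering packages. Writing a spider is easy. Writing one that plays nicely with a site takes more work, but that is better than not being able to crawl their site at all.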
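One simple way to cache pages locally is to store each response body in a file named after a hash of its URL, so repeated runs do not hit the site again. The sketch below is only one possible approach; the cached_fetch name and the page_cache directory are made up for this example, and it never expires entries.

```ruby
require 'digest'
require 'fileutils'

# Returns the cached body for `url` if one exists in `dir`. Otherwise it
# calls the block to fetch the body, stores it on disk, and returns it.
def cached_fetch(url, dir: 'page_cache')
  FileUtils.mkdir_p(dir)
  path = File.join(dir, Digest::SHA256.hexdigest(url))
  return File.read(path) if File.exist?(path)

  body = yield(url)
  File.write(path, body)
  body
end
```

The block is where the actual request goes, for example an open-uri call that sends your User-Agent header. While developing your parsing code you can then rerun the script as often as you like and each page is downloaded only once.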

This question, "ruby - 503 error when using Ruby's open-uri to access a specific site", corresponds to a similar question found on Stack Overflow: https://stackoverflow.com/questions/8628337/
