ruby-on-rails - How to "crawl" only the root URL with Anemone?

Tags: ruby-on-rails ruby ruby-on-rails-3

In the example below, I want Anemone to run only on the root URL (example.com). I am not sure whether I should use the on_pages_like method and, if so, what pattern I would need.

require 'anemone'
Anemone.crawl("http://www.example.com/") do |anemone|
  anemone.on_pages_like(???) do |page|
    # some code to execute
  end
end
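For the pattern route, here is a rough sketch of what could replace the `???`. As far as I can tell from the Anemone source, `on_pages_like` takes one or more regular expressions and tests them against each page's URL string, so an anchored regex should match the root alone. The constant `ROOT_ONLY` and the sample URLs are made up for illustration, and the snippet uses plain Ruby only, so it does not call Anemone itself.

```ruby
# Hypothetical pattern: matches the site root with or without a trailing slash,
# and nothing below it. \A and \z anchor the whole string.
ROOT_ONLY = %r{\Ahttps?://www\.example\.com/?\z}

# Made-up sample URLs to check the pattern against.
urls = [
  "http://www.example.com/",
  "http://www.example.com",
  "http://www.example.com/about",
  "http://www.example.com/?page=2",
  "http://blog.example.com/"
]

# Keep only the URLs the pattern accepts.
matched = urls.grep(ROOT_ONLY)
puts matched
```

Keep in mind that a pattern like this would only decide which pages the block runs on. The crawler would presumably still fetch every other page it finds, so it does not shorten the crawl.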

Best answer

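require 'anemone'
Anemone.crawl("http://www.example.com/", :depth_limit => 1) do |anemone|
  # register a callback that runs for each page the crawl fetches
  anemone.on_every_page do |page|
    # some code to execute
  end
end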
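A note on what the limit value means. The sketch below is not Anemone code; it is a plain-Ruby walk over a made-up link graph (`LINKS` and `pages_visited` are hypothetical names). It assumes the root counts as depth 0 and that links are not followed from a page whose depth has reached the limit. That is how I read Anemone's source, but it is worth checking against your installed version. Under that assumption, a limit of 1 would fetch the root plus the pages it links to, and a limit of 0 would fetch the root alone.

```ruby
# Hypothetical in-memory link graph standing in for a real site.
LINKS = {
  "/"            => ["/about", "/blog"],
  "/about"       => ["/team"],
  "/blog"        => ["/blog/post-1"],
  "/team"        => [],
  "/blog/post-1" => []
}.freeze

# Breadth-first walk. Links are not followed from a page whose depth
# has reached depth_limit; a limit of false or nil means no limit.
def pages_visited(links, root, depth_limit)
  visited = [root]
  queue = [[root, 0]]
  until queue.empty?
    url, depth = queue.shift
    next if depth_limit && depth >= depth_limit
    links.fetch(url, []).each do |link|
      next if visited.include?(link)
      visited << link
      queue << [link, depth + 1]
    end
  end
  visited
end

p pages_visited(LINKS, "/", 0)     # => ["/"]
p pages_visited(LINKS, "/", 1)     # => ["/", "/about", "/blog"]
p pages_visited(LINKS, "/", false) # all five pages
```

If the root alone is what you want, it may be worth trying `:depth_limit => 0` and checking which pages actually get fetched.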

You can also specify the following in the options hash. The values shown are the defaults:

# run 4 Tentacle threads to fetch pages
:threads => 4,
# disable verbose output
:verbose => false,
# don't throw away the page response body after scanning it for links
:discard_page_bodies => false,
# identify self as Anemone/VERSION
:user_agent => "Anemone/#{Anemone::VERSION}",
# no delay between requests
:delay => 0,
# don't obey the robots exclusion protocol
:obey_robots_txt => false,
# by default, don't limit the depth of the crawl
:depth_limit => false,
# number of times HTTP redirects will be followed
:redirect_limit => 5,
# storage engine defaults to Hash in +process_options+ if none specified
:storage => nil,
# Hash of cookie name => value to send with HTTP requests
:cookies => nil,
# accept cookies from the server and send them back?
:accept_cookies => false,
# skip any link with a query string? e.g. http://foo.com/?u=user
:skip_query_strings => false,
# proxy server hostname
:proxy_host => nil,
# proxy server port number
:proxy_port => false,
# HTTP read timeout in seconds
:read_timeout => nil

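In my personal experience, Anemone is not very fast and has a lot of corner cases. The documentation is lacking (as you can see), and the author does not seem to be maintaining the project. YMMV. I tried Nutch briefly and did not play with it much, but it seemed faster. No benchmarks, sorry.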

Source: Stack Overflow, https://stackoverflow.com/questions/14227555/
