I'm building a simple web spider with Sidekiq and Mechanize.
It works fine when I run it for a single domain. When I run it for multiple domains, it fails. I believe the cause is that web_page gets overwritten when it is instantiated by another Sidekiq worker, but I'm not sure whether that is actually the case or how to fix it.
# my scrape_search controller's create action searches on google.
def create
  @scrape = ScrapeSearch.build(keywords: params[:keywords], profession: params[:profession])
  agent = Mechanize.new
  scrape_search = agent.get('http://google.com/') do |page|
    search_result = page.form...
    search_result.css("h3.r").map do |link|
      result = link.at_css('a')['href'] # Narrowing down to real search results
      @domain = Domain.new(some params)
      ScrapeDomainWorker.perform_async(@domain.url, @domain.id, remaining_keywords)
    end
  end
end
I'm creating one Sidekiq job per domain. Most of the domains I'm targeting should only contain a few pages, so there is no need for per-page sub-jobs.

Here is my worker:
class ScrapeDomainWorker
  include Sidekiq::Worker
  ...

  def perform(domain_url, domain_id, keywords)
    @domain = Domain.find(domain_id)
    @domain_link = @domain.protocol + '://' + domain_url
    @keywords = keywords

    # First we scrape the homepage and get the first links
    @domain.to_parse = ['/'] # to_parse is an array of PATHS to parse for the domain
    mechanize_path('/')
    @domain.verified << '/'  # verified is an Array field containing valid domain paths
    get_paths(@web_page)     # Now we should have to_parse populated with homepage links

    @domain.scraped = 1 # Loop counter
    while @domain.scraped < 100
      @domain.to_parse.each do |path|
        @domain.to_parse.delete(path)
        @domain.scraped += 1
        mechanize_path(path) # We create a Nokogiri HTML doc with Mechanize for the valid path
        ...
        get_paths(@web_page) # Fire this to repopulate to_parse !!!
      end
    end
    @domain.save
  end

  def mechanize_path(path)
    agent = Mechanize.new
    begin
      @web_page = agent.get(@domain_link + path)
    rescue Exception => e
      puts "Mechanize Exception for #{path} :: #{e.message}"
    end
  end

  def get_paths(web_page)
    # This works when I scrape a single domain, but fails with ".gsub for nil"
    # when I scrape a few domains.
    paths = web_page.links.map { |link| link.href.gsub((@domain.protocol + '://' + @domain.url), "") }
    paths.uniq.each do |path|
      @domain.to_parse << path
    end
  end
end
This works when I scrape a single domain, but fails with ".gsub for nil" on web_page when I scrape several domains.
Best answer

You can wrap your code in another class and then create an object of that class inside your worker:
class ScrapeDomainWrapper
  def initialize(domain_url, domain_id, keywords)
    # ...
  end

  def mechanize_path(path)
    # ...
  end

  def get_paths(web_page)
    # ...
  end
end
And your worker:

class ScrapeDomainWorker
  include Sidekiq::Worker

  def perform(domain_url, domain_id, keywords)
    ScrapeDomainWrapper.new(domain_url, domain_id, keywords)
  end
end
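
The point of the wrapper is that every perform call builds its own object, so @domain and @web_page belong to that object alone instead of leaking between jobs. A minimal sketch of how the wrapper could be filled in, reusing the names from the question (the scrape! method, the StandardError rescue, and the nil guards are assumptions of mine, not part of the accepted answer):

require 'mechanize'

class ScrapeDomainWrapper
  def initialize(domain_url, domain_id, keywords)
    @domain      = Domain.find(domain_id)                # Domain model from the question
    @domain_link = "#{@domain.protocol}://#{domain_url}"
    @keywords    = keywords
    scrape!                                              # kick off the crawl immediately
  end

  private

  # Same crawling logic as the original worker, now held on this object:
  # @web_page is per instance, so concurrent jobs cannot overwrite it.
  def scrape!
    @domain.to_parse = ['/']
    mechanize_path('/')
    get_paths(@web_page)
    @domain.save
  end

  def mechanize_path(path)
    @web_page = Mechanize.new.get(@domain_link + path)
  rescue StandardError => e
    puts "Mechanize exception for #{path} :: #{e.message}"
    @web_page = nil # don't leave a stale page from a previous path
  end

  def get_paths(web_page)
    return if web_page.nil?
    web_page.links.each { |link| @domain.to_parse << link.href unless link.href.nil? }
  end
end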
Also, bear in mind that Mechanize::Page#links may be nil.
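
In practice that means get_paths should guard before calling .gsub, since href can also come back nil for links without a target. A hedged sketch of such a guard (my addition, not from the accepted answer):

def get_paths(web_page)
  return if web_page.nil? || web_page.links.nil?

  prefix = "#{@domain.protocol}://#{@domain.url}"
  paths  = web_page.links.map(&:href).compact.map { |href| href.gsub(prefix, '') }
  paths.uniq.each { |path| @domain.to_parse << path }
end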
Source: a similar question on Stack Overflow about Sidekiq/Mechanize instance overwriting: https://stackoverflow.com/questions/37799967/