ruby - How do I look at someone else's forum?

Tags: ruby httpwebrequest mechanize screen-scraping

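My friend has a forum that is full of posts containing information. Sometimes she wants to review the posts in her forum and draw conclusions from them. At the moment she reviews posts by clicking through the forum and building a not-necessarily-accurate picture of the data (in her head), from which she draws her conclusions. My thought today was that I could probably bang out a quick Ruby script to parse the necessary HTML and give her a real idea of what the data is saying.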

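I am using Ruby's net/http library for the first time today, and I have run into a problem. While my browser has no trouble viewing my friend's forum, the method Net::HTTP.new("forumname.net") seems to produce the following error:

No connection could be made because the target machine actively refused it. - connect(2)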

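Googling that error, I learned that it has to do with MySQL (or something like that) not wanting nosy people like me poking around in there remotely, for security reasons. That makes sense to me, but it makes me wonder: how is it that my browser gets to poke around on my friend's forum, but my little Ruby script gets no poking rights? Is there some way for my script to tell the server that it is not a threat, and that I only want reading rights, not writing rights?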

Thanks, everyone,

z.
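
For reference, a "connection refused" error is raised while the TCP connection is being opened, before any HTTP request is sent, so a common first check is whether the script is using the same host name and port as the browser. The following is a minimal sketch using only the standard library; the helper name fetch_page, the User-Agent string, and the timeout values are assumptions made for illustration and do not come from the question.

```ruby
require 'net/http'

# Hypothetical helper: fetch one page and report connection-level failures
# as data instead of letting the exception end the script.
def fetch_page(host, path = '/', port = 80)
  response = Net::HTTP.start(host, port, open_timeout: 5, read_timeout: 10) do |http|
    # The User-Agent value is an arbitrary placeholder.
    http.get(path, 'User-Agent' => 'forum-reader-sketch/0.1')
  end
  { ok: response.is_a?(Net::HTTPSuccess), status: response.code.to_i, body: response.body }
rescue Errno::ECONNREFUSED => e
  # Nothing accepted the TCP connection on this host and port.
  { ok: false, status: nil, error: "connection refused: #{e.message}" }
rescue SocketError, Net::OpenTimeout => e
  # Name lookup failed, or the connection attempt timed out.
  { ok: false, status: nil, error: "#{e.class}: #{e.message}" }
end
```

Calling fetch_page("forumname.net") and inspecting the :error entry would show whether the failure happens at the connection stage; if it does, trying the exact host name shown in the browser's address bar is a reasonable next step.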

Best answer

Scraping a web site? Use mechanize:

#!/usr/bin/env ruby

require 'mechanize'

# Current mechanize releases expose Mechanize at the top level
# (older releases used WWW::Mechanize).
agent = Mechanize.new
page = agent.get("http://xkcd.com")

# Follow links by their visible text, as a person clicking through would.
page = page.link_with(:text => 'Forums').click
page = page.link_with(:text => 'Mathematics').click
page = page.link_with(:text => 'Math Books').click
#puts page.parser.to_html    # If you want to see the html you just got

# page.parser is the Nokogiri document; each post lives in a "postbody" div.
posts = page.parser.xpath("//div[@class='postbody']")
posts.each do |post|
  title = post.at_xpath('h3//text()').to_s
  author = post.at_xpath("p[@class='author']//a//text()").to_s
  body = post.xpath("div[@class='content']//text()").map(&:to_s).join("\n")
  puts '-' * 40
  puts "title: #{title}"
  puts "author: #{author}"
  puts "body:", body
end

The first part of the output:

----------------------------------------
title: Math Books
author: Cleverbeans
body:
This is now the official thread for questions about math books at any level, from high school through advanced college courses.
I'm looking for a good vector calculus text to brush up on what I've forgotten. We used Stewart's Multivariable Calculus as a baseline but I was unable to purchase the text for financial reasons at the time. I figured some things may have changed in the last 12 years, so if anyone can suggest some good texts on this subject I'd appreciate it.
----------------------------------------
title: Re: Multivariable Calculus Text?
author: ThomasS
body:
The textbooks go up in price and new pretty pictures appear. However, Calculus really hasn't changed all that much.
If you don't mind a certain lack of pretty pictures, you might try something like Widder's Advanced Calculus from Dover. it is much easier to carry around than Stewart. It is also written in a style that a mathematician might consider normal. If you think that you might want to move on to real math at some point, it might serve as an introduction to the associated style of writing.
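
Since the goal in the question is to draw conclusions from the posts rather than just print them, the scraped fields could be collected and summarized. This is a rough sketch under stated assumptions: the Post struct and the summarize method are hypothetical names, and counting posts and words per author is only one possible summary.

```ruby
# Hypothetical record for one scraped post (not part of mechanize).
Post = Struct.new(:title, :author, :body)

# Returns one row per author with the number of posts and the total word
# count, sorted by post count (highest first) and then by author name.
def summarize(posts)
  rows = posts.group_by(&:author).map do |author, authored|
    words = authored.sum { |post| post.body.split.size }
    { author: author, posts: authored.size, words: words }
  end
  rows.sort_by { |row| [-row[:posts], row[:author]] }
end
```

Inside the scraping loop, each post could be stored with something like all_posts << Post.new(title, author, body), where all_posts is an array created before the loop, and summarize(all_posts) would then be called once the loop finishes.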

Regarding "ruby - How do I look at someone else's forum?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/2060247/
