ruby - How do I fix a Nokogiri (Yahoo) table scraper?

Tags: ruby csv xpath

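18 months ago we built a small table scraper with Ruby and Nokogiri that writes its output to a CSV file. Changes to the page structure have since made that output unusable. Here is a simplified version of what we use: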

#!/usr/bin/ruby
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'csv'

url = "http://finance.yahoo.com/q/op?s=FISV&date=1426809600" # Mar
doc = Nokogiri::HTML(URI.open(url)) # URI.open: Kernel#open no longer opens URLs in Ruby 3+
csv = CSV.open("output.csv", 'w')
doc.xpath('//table//tr').each do |row|
  tarray = [] # temporary array
  row.xpath('td').each do |cell|
    tarray << cell.text # build an array of that row's data
  end
  csv << tarray # write that row out to the CSV file
  # puts "#{row}"
end

csv.close

Current output:

"^M

^M

^M

✕^M

[modify]^M

                    ^M

                "

"^M

        50.00^M

    ","^M

        FISV150320C00050000^M

    ","^M

        19.70^M

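Needless to say, this kind of output is hard to work with.

After trying many combinations of XPath expressions and the csv library, we realized it was time to ask for help.

Consider the following snippet, which does not use csv: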

#!/usr/bin/ruby
require 'open-uri'
require 'nokogiri'

url = "http://finance.yahoo.com/q/op?s=FISV&date=1426809600" # Mar
# url = "http://finance.yahoo.com/q/op?s=FISV&date=1434672000" # Jun
doc = Nokogiri::HTML(URI.open(url)) # URI.open: Kernel#open no longer opens URLs in Ruby 3+

doc.xpath('//table//tr').each do |row|
  row.xpath('td').each do |cell|
    # Replace newlines, escape double quotes, then collapse runs of whitespace.
    print '"', cell.text.gsub("\n", ' ').gsub('"', '\"').gsub(/(\s){2,}/m, '\1'), "\", "
  end
  print "\n"
end

It produces output similar to this:

" 50.00 ", " FISV150320C00050000 ", " 19.70 ", " 26.90 ", " 30.50 ", " 0.00 ", " 0.00% ", " 5 ", " 0 ", " 83.20% ", 

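What needs to change in the first version (the one that writes to CSV) to make it work better?

Best answer

Assuming you want to dump the data from the "Calls" and "Puts" tables to CSV, you can do this: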

require 'csv'
require 'nokogiri'
require 'open-uri'

def options_to_csv(url)
  CSV.generate do |csv|
    doc = Nokogiri::HTML(URI.open(url)) # URI.open: Kernel#open no longer opens URLs in Ruby 3+
    doc.xpath('//tr[@data-row]').each do |tr|
      csv << tr.xpath('td').map { |td| td.text.strip }
    end
  end
end

url = 'http://finance.yahoo.com/q/op?s=FISV&date=1426809600'
options_to_csv(url) # =>
# 50.00,FISV150320C00050000,19.70,26.90,29.00,0.00,0.00%,5,0,110.06%
# 55.00,FISV150320C00055000,11.91,22.00,24.00,0.00,0.00%,21,21,90.33%
# 60.00,FISV150320C00060000,17.48,18.30,19.00,0.00,0.00%,5,22,71.97%
# 65.00,FISV150320C00065000,10.70,13.30,14.00,0.00,0.00%,26,85,54.49%
# 70.00,FISV150320C00070000,8.90,8.40,8.90,0.00,0.00%,1,504,34.42%
# 75.00,FISV150320C00075000,3.80,3.70,4.10,0.00,0.00%,1,318,22.07%
# 80.00,FISV150320C00080000,0.55,0.45,0.60,0.00,0.00%,24,1435,14.55%
# 50.00,FISV150320P00050000,0.55,0.00,0.15,0.00,0.00%,6,10,83.98%
# 55.00,FISV150320P00055000,0.05,0.00,0.15,0.00,0.00%,3,14,68.16%
# 60.00,FISV150320P00060000,0.15,0.00,0.20,0.00,0.00%,1,84,56.06%
# 65.00,FISV150320P00065000,0.20,0.00,0.20,0.00,0.00%,3,166,47.56%
# 70.00,FISV150320P00070000,0.10,0.00,0.20,0.00,0.00%,14,472,32.13%
# 75.00,FISV150320P00075000,0.20,0.15,0.30,0.00,0.00%,42,557,18.80%
# 80.00,FISV150320P00080000,1.60,1.75,2.00,0.00,0.00%,22,91,15.06%
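
The change that cleans up the output is `td.text.strip`: each cell's text arrives padded with carriage returns, newlines and indentation, and the csv library has to quote any field that contains a line break. Here is a minimal sketch of that effect using only the standard library. The `raw_cells` array is a hypothetical stand-in, modelled on the output shown in the question, for what `cell.text` returns.

```ruby
require 'csv'

# Hypothetical raw cell texts, modelled on the question's output
# (leading/trailing CRLF plus indentation around each value).
raw_cells = [
  "\r\n        50.00\r\n    ",
  "\r\n        FISV150320C00050000\r\n    ",
  "\r\n        19.70\r\n    "
]

# Without stripping, every field contains line breaks, so CSV quotes it
# and the row is spread over several physical lines.
unstripped = CSV.generate_line(raw_cells)

# With stripping, the row becomes one plain, unquoted line.
stripped = CSV.generate_line(raw_cells.map(&:strip))

puts stripped
```

If a cell could also contain line breaks in the middle of its text, something like `gsub(/\s+/, ' ')` before `strip` would collapse those as well; whether the Yahoo cells need that is an assumption to check against the real page.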

Note that these tables also have the IDs "optionsCallsTable" and "optionsPutsTable", so you can use that information to separate the rows easily.

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/28751812/
