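18 months ago we built a small table scraper with Ruby and Nokogiri that writes its output to a CSV file. Changes to the page structure have since made that output unusable. Here is a simplified version of what we use: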
#!/usr/bin/ruby
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'csv'
url = "http://finance.yahoo.com/q/op?s=FISV&date=1426809600" # mar
doc = Nokogiri::HTML(open(url))
csv = CSV.open("output.csv", 'w')
doc.xpath('//table//tr').each do |row|
  tarray = [] # temporary array
  row.xpath('td').each do |cell|
    tarray << cell.text # Build array of that row of data.
  end
  csv << tarray # Write that row out to csv file
  #puts "#{row}"
end
csv.close
Current output:
"^M
^M
^M
✕^M
[modify]^M
^M
"
"^M
50.00^M
","^M
FISV150320C00050000^M
","^M
19.70^M
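Needless to say, this kind of output is hard to work with.
After trying many combinations of XPath and the csv library, we finally realized it was time to ask for help.
For comparison, the following snippet does not use the csv library: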
#!/usr/bin/ruby
require 'open-uri'
require 'nokogiri'
url = "http://finance.yahoo.com/q/op?s=FISV&date=1426809600" # mar
#url = "http://finance.yahoo.com/q/op?s=FISV&date=1434672000" # jun
doc = Nokogiri::HTML(open(url))
doc.xpath('//table//tr').each do |row|
  row.xpath('td').each do |cell|
    print '"', cell.text.gsub("\n", ' ').gsub('"', '\"').gsub(/(\s) {2,}/m, '\1'), "\", "
  end
  print "\n"
end
It produces output similar to this:
" 50.00 ", " FISV150320C00050000 ", " 19.70 ", " 26.90 ", " 30.50 ", " 0.00 ", " 0.00% ", " 5 ", " 0 ", " 83.20% ",
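What needs to change in the first version (the one that writes to CSV) to make it work better?
Best answer
Assuming you want to dump the data from the "Calls" and "Puts" tables to CSV, you could do something like this: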
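require 'csv'
require 'nokogiri'
require 'open-uri'
def options_to_csv(url)
  CSV.generate do |csv|
    # URI.open is the open-uri entry point; a bare open(url) no longer fetches URLs on Ruby 3.0+
    doc = Nokogiri::HTML(URI.open(url))
    # Select only the rows that have a data-row attribute
    doc.xpath('//tr[@data-row]').each do |tr|
      # strip removes the whitespace and line breaks around each cell value
      csv << tr.xpath('td').map { |td| td.text.strip }
    end
  end
end
url = 'http://finance.yahoo.com/q/op?s=FISV&date=1426809600'
options_to_csv(url) # =>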
# 50.00,FISV150320C00050000,19.70,26.90,29.00,0.00,0.00%,5,0,110.06%
# 55.00,FISV150320C00055000,11.91,22.00,24.00,0.00,0.00%,21,21,90.33%
# 60.00,FISV150320C00060000,17.48,18.30,19.00,0.00,0.00%,5,22,71.97%
# 65.00,FISV150320C00065000,10.70,13.30,14.00,0.00,0.00%,26,85,54.49%
# 70.00,FISV150320C00070000,8.90,8.40,8.90,0.00,0.00%,1,504,34.42%
# 75.00,FISV150320C00075000,3.80,3.70,4.10,0.00,0.00%,1,318,22.07%
# 80.00,FISV150320C00080000,0.55,0.45,0.60,0.00,0.00%,24,1435,14.55%
# 50.00,FISV150320P00050000,0.55,0.00,0.15,0.00,0.00%,6,10,83.98%
# 55.00,FISV150320P00055000,0.05,0.00,0.15,0.00,0.00%,3,14,68.16%
# 60.00,FISV150320P00060000,0.15,0.00,0.20,0.00,0.00%,1,84,56.06%
# 65.00,FISV150320P00065000,0.20,0.00,0.20,0.00,0.00%,3,166,47.56%
# 70.00,FISV150320P00070000,0.10,0.00,0.20,0.00,0.00%,14,472,32.13%
# 75.00,FISV150320P00075000,0.20,0.15,0.30,0.00,0.00%,42,557,18.80%
# 80.00,FISV150320P00080000,1.60,1.75,2.00,0.00,0.00%,22,91,15.06%
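The main difference from the version in the question is the call to strip on each cell's text. The raw cell text is padded with carriage returns and spaces from the page markup, and the csv library has to quote any field that contains a line break, which would explain the multi-line quoted fields in the question's output. Here is a minimal sketch of that effect; the cell strings are made up to resemble the question's output and are not live page data:

```ruby
require 'csv'

# Hypothetical raw cell text, padded with "\r\n" and spaces
# the way the question's output suggests.
raw_cells = ["\r\n    50.00\r\n", "\r\n    FISV150320C00050000\r\n", "\r\n    19.70\r\n"]

# Without strip: each field contains line breaks, so CSV wraps it in quotes.
unstripped = CSV.generate { |csv| csv << raw_cells }

# With strip: each field is a plain value and needs no quoting.
stripped = CSV.generate { |csv| csv << raw_cells.map(&:strip) }

puts stripped
```

The stripped variant yields a single comma-separated line, while the unstripped one spreads a single row across several quoted lines.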
Note that these tables also have the IDs "optionsCallsTable" and "optionsPutsTable", so you can use that information to separate the rows easily.
Regarding "ruby - How to fix a Nokogiri (Yahoo) table scraper?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/28751812/