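r - Convert a long string to a data.frame

Tags: r csv dataframe read.table

This is a newbie question, but it is driving me crazy. I have a character vector named bars.list that I downloaded from an FTP server. The vector looks like this: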

"\"\",\"times\",\"open\",\"high\",\"low\",\"close\",\"numEvents\",\"volume\"\r\n\"1\",2015-05-18 06:50:00,23.98,23.98,23.5,23.77,421,0\r\n\"2\",2015-05-18 07:50:00,23.77,23.9,23.34,23.6,720,0\r\n\"3\",2015-05-18 08:50:00,23.6,23.6,23.32,23.42,720,0\r\n\"4\",2015-05-18 09:50:00,23.44,23.91,23.43,23.66,720,0\r\n\"5\",2015-05-18 10:50:00,23.67,24.06,23.59,24.02,720,0\r\n\"6\",2015-05-18 11:50:00,24.02,24.04,23.32,23.33,720,0\r\n\"7\",2015-05-18 12:50:00,23.33,23.42,22.74,22.81,720,0\r\n\"8\",2015-05-18 13:50:00,22.79,22.92,22.49,22.69,720,0\r\n\"9\",2015-05-18 14:50:00,22.69,22.7,22.14,22.14,481,0\r\n\"10\",2015-05-19 06:50:00,21.09,21.49,20.82,21.47,421,0\r\n\"11\",2015-05-19 07:50:00,21.48,21.68,21.46,21.51,720,0\r\n\"12\",2015-05-19 08:50:00,21.51,21.93,21.45,21.92,720,0\r\n\"13\",2015-05-19 09:50:00,21.92,21.92,21.55,21.55,720,0\r\n\"

I need to convert this vector into a usable format, but
> read.table(bars.list, header = TRUE, sep = ",", quote = "", dec = ".")
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
  cannot open file '"","times","open","high","low","close","numEvents","volume"
"1",2015-05-18 06:50:00,23.98,23.98,23.5,23.77,421,0
"2",2015-05-18 07:50:00,23.77,23.9,23.34,23.6,720,0
"3",2015-05-18 08:50:00,23.6,23.6,23.32,23.42,720,0
"4",2015-05-18 09:50:00,23.44,23.91,23.43,23.66,720,0

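I don't understand why R tells me it cannot open a connection, since the object was passed to the function as an argument. The output R shows me in the warning message is already very close to what I need...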

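Best answer

There are two options here. The first fixes your current code; the second looks at a simpler and more efficient alternative.

Option 1: The first argument of read.table() is file. You are reading data from a character vector rather than from a file, so you need to use the text argument instead, as in text = bars.list.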
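To see why the original call fails, here is a minimal sketch using a small made-up string (csv_text and its values are illustrative, not the asker's data). A character string passed as the first argument is taken as a file name, whereas text = makes R parse the string itself.

```r
# Hypothetical three-column CSV held in a single string (illustrative data only).
csv_text <- "id,open,close\n1,23.98,23.77\n2,23.77,23.60\n"

# Passed positionally, the string is treated as a file path, so opening it fails.
res <- tryCatch(
  suppressWarnings(read.csv(csv_text)),
  error = function(e) "error: the string was interpreted as a file name"
)
print(res)

# Passed via text =, the content of the string is parsed directly.
df <- read.csv(text = csv_text)
str(df)
```

The same text argument is available in read.table(), so the fix does not depend on switching to read.csv().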

Also, we can first strip all the quotes with gsub() and then use read.csv() instead of read.table(), because header = TRUE and sep = "," are the defaults there.

read.csv(text = gsub("\"", "", bars.list), row.names = 1)
#                  times  open  high   low close numEvents volume
# 1  2015-05-18 06:50:00 23.98 23.98 23.50 23.77       421      0
# 2  2015-05-18 07:50:00 23.77 23.90 23.34 23.60       720      0
# 3  2015-05-18 08:50:00 23.60 23.60 23.32 23.42       720      0
# 4  2015-05-18 09:50:00 23.44 23.91 23.43 23.66       720      0
# 5  2015-05-18 10:50:00 23.67 24.06 23.59 24.02       720      0
# 6  2015-05-18 11:50:00 24.02 24.04 23.32 23.33       720      0
# 7  2015-05-18 12:50:00 23.33 23.42 22.74 22.81       720      0
# 8  2015-05-18 13:50:00 22.79 22.92 22.49 22.69       720      0
# 9  2015-05-18 14:50:00 22.69 22.70 22.14 22.14       481      0
# 10 2015-05-19 06:50:00 21.09 21.49 20.82 21.47       421      0
# 11 2015-05-19 07:50:00 21.48 21.68 21.46 21.51       720      0
# 12 2015-05-19 08:50:00 21.51 21.93 21.45 21.92       720      0
# 13 2015-05-19 09:50:00 21.92 21.92 21.55 21.55       720      0

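For me, this worked better than using the quote argument in read.csv().

Option 2: fread() from the data.table package also works well. It is faster and the code is cleaner. There is no need to use gsub() with it. We can pass bars.list in directly and drop the first column.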
data.table::fread(bars.list, drop = 1)

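Now, because of the trailing \" quote at the very end, you will get a warning with this approach. You can live with it, or you can get a warning-free result by removing that last quote.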
data.table::fread(sub("\"$", "", bars.list), drop = 1)
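The difference between the two clean-up calls used in this answer can be checked on a short made-up string (x below is illustrative, shaped like the tail of bars.list). As a sketch: gsub() removes every quote, while sub() with the pattern anchored by $ removes only a quote at the very end of the string.

```r
# Hypothetical two-row input ending in a stray quote, like bars.list does.
x <- "\"a\",1\r\n\"b\",2\r\n\""

# gsub() replaces every match, so all double quotes disappear.
all_removed <- gsub("\"", "", x)

# sub() replaces only the first match; "$" pins that match to the end of the string.
last_removed <- sub("\"$", "", x)

print(all_removed)
print(last_removed)
```

Removing only the final quote keeps the quoted header and row names intact, which is enough here because the answer hands the remaining quotes to fread().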

Data:
bars.list <- "\"\",\"times\",\"open\",\"high\",\"low\",\"close\",\"numEvents\",\"volume\"\r\n\"1\",2015-05-18 06:50:00,23.98,23.98,23.5,23.77,421,0\r\n\"2\",2015-05-18 07:50:00,23.77,23.9,23.34,23.6,720,0\r\n\"3\",2015-05-18 08:50:00,23.6,23.6,23.32,23.42,720,0\r\n\"4\",2015-05-18 09:50:00,23.44,23.91,23.43,23.66,720,0\r\n\"5\",2015-05-18 10:50:00,23.67,24.06,23.59,24.02,720,0\r\n\"6\",2015-05-18 11:50:00,24.02,24.04,23.32,23.33,720,0\r\n\"7\",2015-05-18 12:50:00,23.33,23.42,22.74,22.81,720,0\r\n\"8\",2015-05-18 13:50:00,22.79,22.92,22.49,22.69,720,0\r\n\"9\",2015-05-18 14:50:00,22.69,22.7,22.14,22.14,481,0\r\n\"10\",2015-05-19 06:50:00,21.09,21.49,20.82,21.47,421,0\r\n\"11\",2015-05-19 07:50:00,21.48,21.68,21.46,21.51,720,0\r\n\"12\",2015-05-19 08:50:00,21.51,21.93,21.45,21.92,720,0\r\n\"13\",2015-05-19 09:50:00,21.92,21.92,21.55,21.55,720,0\r\n\""

Regarding r - Convert a long string to a data.frame, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/33986471/
