Reading a CSV file with embedded double quotes and commas

Tags: r csv data.table

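I am trying to read a dirty CSV file with the fread() function from the data.table package, but I run into problems when string values contain embedded double quotes and commas, i.e. unescaped double quotes inside a quoted field. The sample data below illustrates the problem. It consists of 3 lines and 6 columns, and the first line contains the column names: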

"SA","SU","CC","CN","POC","PAC"
"NE","R","000","H "B", O","1","8"
"A","A","000","P","E,5","8"
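
For reproducibility, the sample above can be written to disk with base R. This is only a small setup sketch; the variable name sample_lines is illustrative, and the file name example.csv is chosen to match the attempts that follow.

```r
# Write the three sample lines verbatim to example.csv in the working directory
sample_lines <- c(
  '"SA","SU","CC","CN","POC","PAC"',
  '"NE","R","000","H "B", O","1","8"',
  '"A","A","000","P","E,5","8"'
)
writeLines(sample_lines, "example.csv")
```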

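The first problem is on line 2, where a field contains an embedded pair of double quotes and a comma: "H "B", O". The second problem is on line 3, where there is a comma inside the double quotes: "E,5". I have tried the following: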

Attempt 1

library(data.table)
x1 <- fread(file = "example.csv", quote = "\"")

Output:

> x1
     V1 "SA" "SU"   "CC" "CN" "POC" "PAC"
1: "NE"  "R"    0 "H "B"   O"   "1"     8
2:  "A"  "A"    0    "P"   "E    5"     8

Messages:

Found and resolved improper quoting in first 100 rows. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.
Detected 6 column names but the data has 7 columns (i.e. invalid file). Added 1 extra default column name for the first column which is guessed to be row names or an index. Use setnames() afterwards if this guess is not correct, or fix the file write command that created the file to create a valid file.

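Conclusion: the result is incorrect because a new column V1 was added and the fields were split at the embedded commas.

Attempt 2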

x2 <- fread(file = "example.csv", quote = "")

Output:

> x2
     V1 "SA"  "SU"   "CC" "CN" "POC" "PAC"
1: "NE"  "R" "000" "H "B"   O"   "1"   "8"
2:  "A"  "A" "000"    "P"   "E    5"   "8"

Message:

Detected 6 column names but the data has 7 columns (i.e. invalid file). Added 1 extra default column name for the first column which is guessed to be row names or an index. Use setnames() afterwards if this guess is not correct, or fix the file write command that created the file to create a valid file.

Conclusion: the result is incorrect because a new column V1 was added.

Solution?

I am looking for a way to get output similar to the following:

> x3
   SA SU CC       CN POC PAC
1: NE  R  0 H 'B', O   1   8
2:  A  A  0        P E,5   8

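Preferably with fread(), but other suggestions are welcome.

Best answer

You could try cleaning the data beforehand, replacing the double quotes that delimit the fields with single quotes.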

x = readLines('my_file.csv')
y = gsub('","', "','", x)    # swap the "," between fields for ','
y = gsub('^"|"$', "'", y)    # swap the leading and trailing double quote on each line
z = paste(y, collapse='\n')  # join the lines into one string that fread parses as text
df = fread(z, quote="'")     # read with the single quote as the quote character
df

   SA SU CC       CN POC PAC
1: NE  R  0 H "B", O   1   8
2:  A  A  0        P E,5   8

I can't confirm that this will work for you, since I don't know how large your file is, but it may be a worthwhile approach.
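
As an alternative that does not depend on any quote character, the lines can be split by hand. The following is a minimal sketch, not a tested general solution: it assumes that every field is wrapped in double quotes and that the sequence "," never occurs inside a field. Both hold for the sample data but are assumptions about the real file. The variable names are illustrative, and the sample lines stand in for readLines() on the actual file.

```r
library(data.table)

# Sample lines from the question (stand-in for readLines("example.csv"))
lines <- c(
  '"SA","SU","CC","CN","POC","PAC"',
  '"NE","R","000","H "B", O","1","8"',
  '"A","A","000","P","E,5","8"'
)

# Drop the outer double quotes of each line, then split on the
# literal "," sequence that separates two quoted fields
stripped <- gsub('^"|"$', "", lines)
fields <- strsplit(stripped, '","', fixed = TRUE)

# Sanity check: every line must yield as many fields as the header
stopifnot(all(lengths(fields) == length(fields[[1]])))

# Bind the data rows into a character matrix and name the columns
dt <- as.data.table(do.call(rbind, fields[-1]))
setnames(dt, fields[[1]])
dt
```

All columns come back as character here, so "000" keeps its leading zeros; convert columns afterwards if numeric types are needed. Note that strsplit() drops a trailing empty string, so a line whose last field is empty would fail the sanity check rather than be read silently wrong.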

This question about reading a CSV file with embedded double quotes and commas comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/52957453/
