r - Why do I get X. in my column names when reading a data frame?

I asked a question about this a few months back, and I thought the answer had solved my problem, but I have run into it again and that solution does not work for me.

I am importing a CSV:

orders <- read.csv("<file_location>", sep=",", header=T, check.names = FALSE)

Here is the structure of the data frame:

str(orders)

'data.frame':   3331575 obs. of  2 variables:
 $ OrderID  : num  -2034590217 -2034590216 -2031892773 -2031892767 -2021008573 ...
 $ OrderDate: Factor w/ 402 levels "2010-10-01","2010-10-04",..: 263 263 269 268 301 300 300 300 300 300 ...

If I run the length command on the first column, OrderID, I get this:

length(orders$OrderID)
[1] 0

If I run length on OrderDate, it returns the correct value:

length(orders$OrderDate)
[1] 3331575

Here is a copy/paste of the head of the CSV:

OrderID,OrderDate
-2034590217,2011-10-14
-2034590216,2011-10-14
-2031892773,2011-10-24
-2031892767,2011-10-21
-2021008573,2011-12-08
-2021008572,2011-12-07
-2021008571,2011-12-07
-2021008570,2011-12-07
-2021008569,2011-12-07

Now, if I re-run read.csv but take out the check.names option, the first column of the data frame has an X. at the start of its name.

orders2 <- read.csv("<file_location>", sep=",", header=T)

str(orders2)

'data.frame':   3331575 obs. of  2 variables:
 $ X.OrderID: num  -2034590217 -2034590216 -2031892773 -2031892767 -2021008573 ...
 $ OrderDate: Factor w/ 402 levels "2010-10-01","2010-10-04",..: 263 263 269 268 301 300 300 300 300 300 ...

length(orders2$X.OrderID)
[1] 3331575

This works correctly.

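My question is: why does R add X. to the beginning of the first column name? As you can see from the CSV file, there are no special characters. It should be a straightforward load. Adding check.names = FALSE imports the name from the CSV as-is, but then the data does not load correctly for my analysis.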

What can I do to fix this?

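Side note: I realize this is a minor issue. I am mostly frustrated that I thought I was loading the data correctly but did not get the result I expected. I could rename the column with colnames(orders)[1] <- "OrderID", but I would still like to know why it does not load correctly.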

Best answer

read.csv() is a wrapper around the more general read.table() function. The latter has an argument check.names, which is documented as:

check.names: logical.  If ‘TRUE’ then the names of the variables in the
         data frame are checked to ensure that they are syntactically
         valid variable names.  If necessary they are adjusted (by
         ‘make.names’) so that they are, and also to ensure that there
         are no duplicates.

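If your header contains labels that are not syntactically valid, make.names() will replace them with valid names derived from the invalid ones, translating invalid characters to "." and possibly prepending X: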

R> make.names("$Foo")
[1] "X.Foo"

This is documented in ?make.names:

Details:

    A syntactically valid name consists of letters, numbers and the
    dot or underline characters and starts with a letter or the dot
    not followed by a number.  Names such as ‘".2way"’ are not valid,
    and neither are the reserved words.

    The definition of a _letter_ depends on the current locale, but
    only ASCII digits are considered to be digits.

    The character ‘"X"’ is prepended if necessary.  All invalid
    characters are translated to ‘"."’.  A missing value is translated
    to ‘"NA"’.  Names which match R keywords have a dot appended to
    them.  Duplicated values are altered by ‘make.unique’.
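
The rules quoted above can be checked directly at the console. The sketch below is only illustrative: apart from "$Foo", the labels are made up for this example and do not come from the question's file.

```r
# Labels chosen for illustration; only "$Foo" appears in the answer itself.
labels <- c("OrderID", "$Foo", "1abc", "a b", "if")
valid <- make.names(labels)

# Show each original label next to the name make.names() turns it into.
print(data.frame(original = labels, adjusted = valid))

# unique = TRUE additionally de-duplicates the result via make.unique().
deduplicated <- make.names(c("a", "a"), unique = TRUE)
print(deduplicated)
```

A label that is already valid, such as "OrderID", passes through unchanged, so an X. prefix on a column means the header in the file is not exactly what it appears to be.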

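The behaviour you are seeing is entirely consistent with the documented way read.table() loads your data. That suggests you have syntactically invalid labels in the header row of your CSV file. Note the point in ?make.names above that what counts as a letter depends on the locale of your system: the CSV file might contain a character that your text editor displays as valid, but if R is not running in the same locale, that character may not be valid there.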
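
This also explains why length(orders$OrderID) returned 0 in the question. Here is a minimal sketch, assuming a stray "$" in front of the first header label as a stand-in for whatever invalid character the real file holds (the actual character is unknown, and the inline CSV text is hypothetical):

```r
# Hypothetical two-row CSV; the leading "$" stands in for the unknown
# invalid character in the real header.
csv_text <- "$OrderID,OrderDate\n-2034590217,2011-10-14\n-2034590216,2011-10-14"

# check.names = FALSE keeps the header exactly as it appears in the file.
raw_names <- read.csv(text = csv_text, check.names = FALSE)
print(names(raw_names))

# No column is literally called "OrderID", so `$` returns NULL and
# length(NULL) is 0.
print(length(raw_names$OrderID))

# The default check.names = TRUE runs the header through make.names().
fixed_names <- read.csv(text = csv_text)
print(names(fixed_names))
print(length(fixed_names$X.OrderID))
```

With check.names = FALSE the data is loaded; it is just stored under a name that differs from the one typed at the console, which is why the lookup comes back empty.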

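I would look at the CSV file and identify any non-ASCII characters in the header row; there may also be non-visible characters (or escape sequences; \t?) in the header row. A lot can happen between reading in a file with invalid names and displaying it in the console, and that may mask the invalid characters, so don't assume that because nothing looks wrong when check.names = FALSE is used, the file is OK.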
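
One way to do that inspection is to look at the raw bytes at the start of the file, where nothing can be hidden by the console. The sketch below is not a diagnosis of the asker's file: it builds a hypothetical temporary CSV with a UTF-8 byte order mark (bytes EF BB BF) in front of the header, which is one invisible prefix that can produce this symptom, and then flags every byte outside printable ASCII.

```r
# Hypothetical demo file: the byte order mark before the header is an
# assumption made for illustration, not a confirmed cause.
tmp <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("OrderID,OrderDate\n-2034590217,2011-10-14\n")), tmp)

# Read the first bytes of the file without any encoding conversion.
first_bytes <- readBin(tmp, what = "raw", n = 20)
print(first_bytes)

# Positions of bytes outside printable ASCII (0x20 to 0x7E), ignoring
# line endings; tabs and other control characters are flagged too.
codes <- as.integer(first_bytes)
suspicious <- which((codes < 32 | codes > 126) & !(codes %in% c(10, 13)))
print(suspicious)

# If the file does start with a byte order mark, let R strip it on read.
has_bom <- length(first_bytes) >= 3 && identical(first_bytes[1:3], bom)
if (has_bom) {
  orders_clean <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
  print(names(orders_clean))
}
```

For a real file, replace tmp with its path and skip the writeBin() step. If the flagged bytes are something other than a byte order mark, renaming the column after loading, as the question already suggests, remains a simple workaround.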

Posting the output of sessionInfo() would also be useful.

A similar question to "r - Why do I get X. in my column names when reading a data frame?" can be found on Stack Overflow: https://stackoverflow.com/questions/10441437/
