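R data frame: headers based on existing row containing text and numbers

Tags: r dataframe

For reasons specific to my R program, I want to assign column names and row names from an existing row and column of a data frame. That is, the first row must become the column names and the first column must become the row names.

I initially thought this would be simple, using: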

colnames(myDataFrame) <- myDataFrame[1,]
rownames(myDataFrame) <- myDataFrame[,1]

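as is also described in this topic.

However, I have many cases to handle in the first row and first column of the data frame: text only, text with numbers, text or numbers... That is why this sometimes does not work. Here is an example where the first row contains only text:

I first load the data frame with no headers at all: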

> tab <- read.table(file, header = FALSE, sep = "\t")
> tab
         V1   V2  V3   V4   V5     V6  V7   V8   V9
1      TEST this  is only text hoping  it will work
2         I    4   0    0    0      0   0    0    1
3    really    7   6    6    3     10   6   10   10
4      hope  187 141  140  129    130 157  138  168

This is my data frame, without row names or column names. I want "TEST this is only text hoping it will work" to become my column names. This does not work:

> colnames(tab) <- tab[1,]
> tab
          2   10   9    9   10      8   9    8    9
1      TEST this  is only text hoping  it will work
2         I    4   0    0    0      0   0    0    1
3    really    7   6    6    3     10   6   10   10
4      hope  187 141  140  129    130 157  138  168

While this works:

> colnames(tab) <- as.character(unlist(tab[1,]))
> tab
       TEST this  is only text hoping  it will work
1      TEST this  is only text hoping  it will work
2         I    4   0    0    0      0   0    0    1
3    really    7   6    6    3     10   6   10   10
4      hope  187 141  140  129    130 157  138  168

I thought the problem was that R sometimes treats the first column or first row as a factor. But as you can see:

> is.factor(tab[1,])
FALSE

So it can fail even when R has not converted the row to a factor.
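A possible explanation for the FALSE above, as a minimal sketch: tab[1,] is itself a one-row data frame, so is.factor() tests the container rather than the columns inside it. The small tab below is a made-up stand-in, and stringsAsFactors = TRUE is an assumption meant to mimic the read.table() default in R versions before 4.0.

```r
# Hypothetical two-column stand-in for `tab`; stringsAsFactors = TRUE is an
# assumption that reproduces the pre-4.0 read.table() behaviour.
tab <- data.frame(
  V1 = c("TEST", "I"),
  V2 = c("this", "4"),
  stringsAsFactors = TRUE
)

first_row <- tab[1, ]

# The row is a data frame (a list of columns), not a factor ...
row_is_data_frame <- is.data.frame(first_row)
row_is_factor <- is.factor(first_row)

# ... but every column inside it is a factor.
columns_are_factors <- sapply(first_row, is.factor)
```

Checking the columns individually, rather than the row as a whole, is what reveals the factors.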

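I tried using "as.character(unlist())" throughout my program, but it no longer works in some other cases I may encounter. See this example with both text and numbers in the first row: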

> otherTab <- read.table(otherFile, header = FALSE, sep = "\t")
> otherTab
               V1      V2     V3    V4  V5  V6    V7     V8    V9
1            TEST this45 is 486text 725 with ca257 some numbers
2        number45       4      0     0   0   0     0      0     1
3        254every       7      6     6   3  10     6     10    10
4           where     187    141   140 129 130   157    138   168

> colnames(otherTab) <- as.character(unlist(otherTab[1,]))
> otherTab
                6      10      9     7 725   8     9      8     9
1            TEST this45 is 486text 725 with ca257 some numbers
2        number45       4      0     0   0   0     0      0     1
3        254every       7      6     6   3  10     6     10    10
4           where     187    141   140 129 130   157    138   168

So how can I handle these different cases in a simple way (since this seems like such a simple problem)? Thanks a lot.

Best answer

This happens because, in your data frame, V5 is a column of type "int" rather than a factor, so the first row mixes two different column types:

#> str(df)
#'data.frame':  4 obs. of  9 variables:
# $ V1: Factor w/ 4 levels "254every","TEST",..: 2 3 1 4
# $ V2: Factor w/ 4 levels "187","4","7",..: 4 2 3 1
# $ V3: Factor w/ 4 levels "0","141","6",..: 4 1 3 2
# $ V4: Factor w/ 4 levels "0","140","486text",..: 3 1 4 2
# $ V5: int  725 0 3 129
# $ V6: Factor w/ 4 levels "0","10","130",..: 4 1 2 3
# $ V7: Factor w/ 4 levels "0","157","6",..: 4 1 3 2
# $ V8: Factor w/ 4 levels "0","10","138",..: 4 1 2 3
# $ V9: Factor w/ 4 levels "1","10","168",..: 4 1 2 3

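All elements of a vector must have the same type. When you unlist() the row and store the values in a single vector to pass to colnames(), you are actually passing an "int" vector (because R coerces the elements to a common type):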

#> str(unlist(df[1,]))
# Named int [1:9] 2 4 4 3 725 4 4 4 4
# - attr(*, "names")= chr [1:9] "V1" "V2" "V3" "V4" ...
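The coercion can be reproduced in isolation. The sketch below uses hypothetical stand-in cells (text_cell, other_cell, int_cell are illustrative names, not from the question) to contrast a mixed factor/integer row with a factor-only row.

```r
# Hypothetical stand-ins for single cells of the first row.
text_cell  <- factor("486text", levels = c("0", "140", "486text", "6"))
other_cell <- factor("with", levels = c("0", "10", "130", "with"))
int_cell   <- 725L

# Mixed factor + integer: the factor contributes its internal integer code,
# so the label "486text" is lost.
mixed <- unlist(list(V4 = text_cell, V5 = int_cell))

# Factors only: unlist() returns a factor, so as.character() recovers labels.
all_factors <- unlist(list(V4 = text_cell, V6 = other_cell))
labels_kept <- as.character(all_factors)
```

This is why as.character(unlist(...)) worked on the text-only table but not on the table with an integer column.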

If you modify the structure of your data frame so that column V5 is a factor, your initial approach works:

df[,5] <- as.factor(df[,5])
colnames(df) <- unlist(df[1,])

You get:

#> df
#      TEST this45  is 486text 725 with ca257 some numbers
#1     TEST this45  is 486text 725 with ca257 some numbers
#2 number45      4   0       0   0    0     0    0       1
#3 254every      7   6       6   3   10     6   10      10
#4    where    187 141     140 129  130   157  138     168

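If you do not want to modify your column types, you can convert each element to character before coercing to a vector and passing it to colnames():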

colnames(df) <- lapply(df[1,], as.character)

Which results in:

#> df
#      TEST this45  is 486text 725 with ca257 some numbers
#1     TEST this45  is 486text 725 with ca257 some numbers
#2 number45      4   0       0   0    0     0    0       1
#3 254every      7   6       6   3   10     6   10      10
#4    where    187 141     140 129  130   157  138     168
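The question also asks for the first column to become the row names. One way to combine both steps is sketched below; promote_headers is a hypothetical helper name, the example data frame is a cut-down stand-in, and dropping the promoted row and column afterwards is an assumption about the desired result.

```r
# Hypothetical helper: promote the first row to column names and the first
# column to row names, then drop both from the data.
promote_headers <- function(df) {
  # Convert each first-row cell on its own, so factors yield their labels
  # and integers yield their printed value.
  col_names <- unname(vapply(df[1, ], as.character, character(1)))
  row_names <- as.character(df[[1]])

  out <- df[-1, -1, drop = FALSE]
  colnames(out) <- col_names[-1]
  rownames(out) <- row_names[-1]  # row names must be unique
  out
}

# Cut-down stand-in for the answer's data; stringsAsFactors = TRUE is an
# assumption that mimics read.table() before R 4.0.
df <- data.frame(
  V1 = c("TEST", "number45", "254every"),
  V2 = c("this45", "4", "7"),
  V5 = c(725L, 0L, 3L),
  stringsAsFactors = TRUE
)

result <- promote_headers(df)
```

The remaining columns keep their original types in this sketch, so a column that was read as a factor stays a factor.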

Data

structure(list(V1 = structure(c(2L, 3L, 1L, 4L), .Label = c("254every", 
"TEST", "number45", "where"), class = "factor"), V2 = structure(c(4L, 
2L, 3L, 1L), .Label = c("187", "4", "7", "this45"), class = "factor"), 
    V3 = structure(c(4L, 1L, 3L, 2L), .Label = c("0", "141", 
    "6", "is"), class = "factor"), V4 = structure(c(3L, 1L, 4L, 
    2L), .Label = c("0", "140", "486text", "6"), class = "factor"), 
    V5 = c(725L, 0L, 3L, 129L), V6 = structure(c(4L, 1L, 2L, 
    3L), .Label = c("0", "10", "130", "with"), class = "factor"), 
    V7 = structure(c(4L, 1L, 3L, 2L), .Label = c("0", "157", 
    "6", "ca257"), class = "factor"), V8 = structure(c(4L, 1L, 
    2L, 3L), .Label = c("0", "10", "138", "some"), class = "factor"), 
    V9 = structure(c(4L, 1L, 2L, 3L), .Label = c("1", "10", "168", 
    "numbers"), class = "factor")), .Names = c("V1", "V2", "V3", 
"V4", "V5", "V6", "V7", "V8", "V9"), class = "data.frame", row.names = c("1", 
"2", "3", "4"))

Source: a similar question on Stack Overflow, "R data frame: headers based on existing row containing text and numbers": https://stackoverflow.com/questions/28240183/
