r - tidyr pivot_wider: duplicate issue

Tags: r dplyr pivot tidyr

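I am trying to use pivot_wider to reduce the number of rows in my data and add new columns. However, the number of columns increases while the number of rows stays the same. Ideally, each "Indicator" should be a single observation wherever columns such as DataYear, Company, Market and Country are the same. I think the problem may be caused by duplicate observations, but I don't understand why the IndicatorID column doesn't resolve this.

An example of my data: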

    LongTest <- structure(list(DataYear = c(2018L, 2017L, 2016L, 2018L, 2017L, 
2016L, 2018L, 2017L, 2016L), Company = structure(c(1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L), .Label = "One", class = "factor"), Market = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "Total", class = "factor"), 
    Country = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "ALL", class = "factor"), 
    ISO = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "ALL", class = "factor"), 
    Sector = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "Insurance", class = "factor"), 
    Division = c(NA, NA, NA, NA, NA, NA, NA, NA, NA), Furtherdetails1 = c(NA, 
    NA, NA, NA, NA, NA, NA, NA, NA), Furtherdetails2 = c(NA, 
    NA, NA, NA, NA, NA, NA, NA, NA), Indicator = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("Tax Avoidance", 
    "Turnover"), class = "factor"), IndicatorID = c(20L, 20L, 
    20L, 20L, 20L, 20L, 26L, 26L, 26L), InputName = structure(c(3L, 
    3L, 3L, 2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Number of employees", 
    "Profit before tax (Attributable to shareholder profit)", 
    "Tax Paid"), class = "factor"), InputCode = structure(c(2L, 
    2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("InputA", "InputB"
    ), class = "factor"), UnitRequired = structure(c(2L, 2L, 
    2L, 2L, 2L, 2L, 1L, 1L, 1L), .Label = c("#", "GBP"), class = "factor"), 
    Value = c(4.47e+08, 6.2e+08, 6.47e+08, 2.129e+09, 2.003e+09, 
    1.193e+09, 37628, 42431, 39833.44), UniqueID = 1:9), class = "data.frame", row.names = c(NA, 
-9L))

The code I am currently using:

outTest <- pivot_wider(LongTest, names_from = InputCode, values_from = c(Value, UnitRequired, InputName))

When I use the full data frame, I get these warning messages:

Warning messages:
1: Values in `InputName` are not uniquely identified; output will contain list-cols.
* Use `values_fn = list(InputName = list)` to suppress this warning.
* Use `values_fn = list(InputName = length)` to identify where the duplicates arise
* Use `values_fn = list(InputName = summary_fun)` to summarise duplicates 
2: Values in `UnitRequired` are not uniquely identified; output will contain list-cols.
* Use `values_fn = list(UnitRequired = list)` to suppress this warning.
* Use `values_fn = list(UnitRequired = length)` to identify where the duplicates arise
* Use `values_fn = list(UnitRequired = summary_fun)` to summarise duplicates 
3: Values in `Value` are not uniquely identified; output will contain list-cols.
* Use `values_fn = list(Value = list)` to suppress this warning.
* Use `values_fn = list(Value = length)` to identify where the duplicates arise
* Use `values_fn = list(Value = summary_fun)` to summarise duplicates 
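
The warnings above suggest `values_fn = list(Value = length)` as a way to locate the duplicates. The following is a minimal sketch of that diagnostic, not a definitive recipe. The data frame `toy` and its values are hypothetical stand-ins for the full data, built so that one combination of DataYear, Indicator and InputCode occurs twice.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data (not the real data set): the last two rows share
# the same DataYear, Indicator and InputCode, so they collide when pivoted.
toy <- tibble(
  DataYear  = c(2018L, 2017L, 2018L, 2017L, 2018L, 2018L),
  Indicator = c("Tax Avoidance", "Tax Avoidance", "Tax Avoidance",
                "Tax Avoidance", "Turnover", "Turnover"),
  InputCode = c("InputB", "InputB", "InputA", "InputA", "InputB", "InputB"),
  Value     = c(1, 2, 3, 4, 5, 6)
)

# Option 1: count rows per id/name combination and keep the collisions.
dupes <- toy %>%
  count(DataYear, Indicator, InputCode, name = "n") %>%
  filter(n > 1)

# Option 2: the approach named in the warning. Each cell holds the number
# of input rows that landed there; anything above 1 is a duplicate.
counts <- toy %>%
  pivot_wider(names_from = InputCode,
              values_from = Value,
              values_fn = list(Value = length))

print(dupes)
print(counts)
```

Any cell with a count above 1 marks a set of rows that the identifying columns alone cannot tell apart.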

The ideal output would look like this:

    structure(list(DataYear = c(2018L, 2017L, 2016L, 2018L, 2017L, 
2016L), Company = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "One", class = "factor"), 
    Market = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "Total", class = "factor"), 
    Country = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "ALL", class = "factor"), 
    ISO = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "ALL", class = "factor"), 
    Sector = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "Insurance", class = "factor"), 
    Division = c(NA, NA, NA, NA, NA, NA), Furtherdetails1 = c(NA, 
    NA, NA, NA, NA, NA), Furtherdetails2 = c(NA, NA, NA, NA, 
    NA, NA), Indicator = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("Tax Avoidance", 
    "Turnover"), class = "factor"), IndicatorID = c(20L, 20L, 
    20L, 26L, 26L, 26L), Value_InputA = c(2129000000L, 2003000000L, 
    1193000000L, NA, NA, NA), InputName_InputA = structure(c(2L, 
    2L, 2L, 1L, 1L, 1L), .Label = c("", "Profit before tax (Attributable to shareholder profit)"
    ), class = "factor"), UnitRequired_InputA = structure(c(2L, 
    2L, 2L, 1L, 1L, 1L), .Label = c("", "GBP"), class = "factor"), 
    Value_InputB = c(4.47e+08, 6.2e+08, 6.47e+08, 37628, 42431, 
    39833.44), InputName_InputB = structure(c(2L, 2L, 2L, 1L, 
    1L, 1L), .Label = c("Number of employees", "Tax Paid"), class = "factor"), 
    UnitRequired_InputB = structure(c(2L, 2L, 2L, 1L, 1L, 1L), .Label = c("#", 
    "GBP"), class = "factor")), class = "data.frame", row.names = c(NA, 
-6L))

Any help would be greatly appreciated!

Thanks

Best answer

Following @Ronak Shah's suggestion in his comment to create a row column, the following seems to do the job. I added a second grouping column, Indicator.

library(tidyverse)

LongTest %>%
  group_by(InputCode, Indicator) %>% 
  mutate(row = row_number()) %>%
  pivot_wider(id_cols = c(row, Indicator),
              names_from = InputCode, 
              values_from = c(Value, UnitRequired, InputName)) %>%
  select(-row)
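
To illustrate why the number of rows can stay the same and what the helper row column changes, here is a small sketch. It assumes hypothetical toy data (`toy`, with made-up values) that mimics the shape of LongTest, including a per-row UniqueID. By default, pivot_wider treats every column not named in names_from or values_from as an identifier, so a column that is unique per row keeps each input row on its own output row.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two years, one indicator, two input codes,
# plus a per-row UniqueID like the one in LongTest.
toy <- tibble(
  DataYear  = c(2018L, 2017L, 2018L, 2017L),
  Indicator = "Tax Avoidance",
  InputCode = c("InputB", "InputB", "InputA", "InputA"),
  Value     = c(10, 20, 30, 40),
  UniqueID  = 1:4
)

# Default id_cols: DataYear, Indicator and UniqueID all identify a row,
# so nothing collapses and the row count is unchanged.
wide_default <- toy %>%
  pivot_wider(names_from = InputCode, values_from = Value)

# Naming the identifying columns explicitly lets the InputA and InputB
# rows for the same year and indicator merge into one row.
wide_ids <- toy %>%
  pivot_wider(id_cols = c(DataYear, Indicator),
              names_from = InputCode,
              values_from = Value)

# The helper-row approach from the answer: number the rows within each
# InputCode/Indicator group and use that index as the identifier.
# Rows are paired by position, so the order within each group matters.
wide_row <- toy %>%
  group_by(InputCode, Indicator) %>%
  mutate(row = row_number()) %>%
  ungroup() %>%
  pivot_wider(id_cols = c(row, Indicator),
              names_from = InputCode,
              values_from = Value) %>%
  select(-row)

print(nrow(wide_default))
print(wide_ids)
print(wide_row)
```

Note that only the columns listed in id_cols survive the pivot, so in this sketch `wide_row` no longer contains DataYear. If columns such as DataYear or Company are needed in the result, they would have to be added to id_cols.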

For more on "r - tidyr pivot_wider: duplicate issue", see the similar question on Stack Overflow: https://stackoverflow.com/questions/60258879/
