I have a huge data frame df1. A heavily simplified version of it has 3 columns, "Words", "Frequency" and "Letters":
Words Frequency Letters
flower/tree 0.15 a(0.1)
tree 0.67 a(0.4)
planet 0.85 b(0.4)
tree/planet 0.42 c(0.5)
tree 0.89 a(0.6)
flower 0.21 b(0.4)
flower/planet 0.53 b
planet 0.07 a
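Using R (dplyr, the apply family of functions, etc.), I want to count how many times each letter (a, b, c) in the "Letters" column is associated with each word (flower, tree, planet) in the "Words" column, separately for each frequency bin derived from the "Frequency" column. There are 4 bins: [0, 0.25], [0.25, 0.5], [0.5, 0.75], [0.75, 1].
I expect the output data frame df2 to look like this: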
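The four bins as written share their edges (0.25, 0.5 and 0.75 each appear in two bins), so it helps to see how a value sitting exactly on an edge would be assigned. The following is a minimal base R sketch, not part of the original question: the vector `x` is made-up illustration data, and `include.lowest = TRUE` is an assumption added so that a frequency of exactly 0 still lands in the first bin.

```r
# Hypothetical sample values, chosen to hit the bin edges
x <- c(0, 0.07, 0.25, 0.42, 0.5, 0.89, 1)

# Four equal-width bins on [0, 1]; cut() makes intervals closed on the right
# by default, so an edge value such as 0.25 falls into the lower bin
breaks <- seq(0, 1, by = 0.25)
bins <- cut(x, breaks = breaks, include.lowest = TRUE)

# Show each value next to the bin it was assigned to
print(data.frame(x = x, bin = bins))
```

Without `include.lowest = TRUE`, the first interval would be (0, 0.25] and a value of exactly 0 would become `NA`.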
Bin Word Letters count_letters
0-0.25 flower a 1
0-0.25 flower b 1
0-0.25 tree a 1
0-0.25 planet a 1
0.25-0.5 tree c 1
0.25-0.5 planet c 1
0.5-0.75 flower b 1
0.5-0.75 tree a 1
0.5-0.75 planet b 1
0.75-1 tree a 1
0.75-1 planet b 1
Best answer
You can use cut to bin Frequency, substr to clean up Letters, and tidyr::separate_rows to unnest Words. Aggregate with dplyr::count, and you're done:
library(tidyverse)

df1 %>%
    separate_rows(Words) %>%    # one row per word; "/" is matched by the default separator
    count(Frequency = cut(Frequency, breaks = seq(0, 1, .25)),    # bins (0,0.25], ..., (0.75,1]
          Words,
          Letters = substr(Letters, 1, 1))    # use regex if more than one letter
## Source: local data frame [11 x 4]
## Groups: Frequency, Words [?]
##
## Frequency Words Letters n
## <fctr> <chr> <chr> <int>
## 1 (0,0.25] flower a 1
## 2 (0,0.25] flower b 1
## 3 (0,0.25] planet a 1
## 4 (0,0.25] tree a 1
## 5 (0.25,0.5] planet c 1
## 6 (0.25,0.5] tree c 1
## 7 (0.5,0.75] flower b 1
## 8 (0.5,0.75] planet b 1
## 9 (0.5,0.75] tree a 1
## 10 (0.75,1] planet b 1
## 11 (0.75,1] tree a 1
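For anyone who wants to reproduce this end to end, here is a self-contained sketch of the same approach. It is a variation on the answer rather than the answer's exact code: the sample data is rebuilt from the table in the question, the letter is extracted with a regular expression (`sub`) instead of `substr` so that multi-character labels would also survive, and `include.lowest = TRUE` plus the column names `Bin` and `count_letters` are assumptions chosen to match the desired df2 layout.

```r
library(dplyr)
library(tidyr)

# Sample data rebuilt from the table in the question
df1 <- data.frame(
  Words     = c("flower/tree", "tree", "planet", "tree/planet",
                "tree", "flower", "flower/planet", "planet"),
  Frequency = c(0.15, 0.67, 0.85, 0.42, 0.89, 0.21, 0.53, 0.07),
  Letters   = c("a(0.1)", "a(0.4)", "b(0.4)", "c(0.5)",
                "a(0.6)", "b(0.4)", "b", "a"),
  stringsAsFactors = FALSE
)

df2 <- df1 %>%
  # One row per word: "flower/tree" becomes two rows
  separate_rows(Words, sep = "/") %>%
  mutate(
    # Drop the parenthesised weight, e.g. "a(0.1)" -> "a"
    Letters = sub("\\(.*$", "", Letters),
    # Assign each frequency to one of the four bins
    Bin = cut(Frequency, breaks = seq(0, 1, 0.25), include.lowest = TRUE)
  ) %>%
  # Count letter occurrences per bin and word
  count(Bin, Words, Letters, name = "count_letters")

print(df2)
```

On the sample data this gives one row per bin, word and letter combination, in the same shape as the expected df2; only the bin labels differ, since `cut` writes them as intervals such as (0.25,0.5].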
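This post, about counting specific characters in one column in association with the categories of another column, per frequency bin, in R, is based on a similar question on Stack Overflow: https://stackoverflow.com/questions/42237800/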