r - Selecting the most frequent element in a data frame when using table

Tags: r dataframe frequency contingency

I have a list of data frames that I want to use table() on. The list looks like this:

pronouns <- data.frame(pronounciation = c("juː","juː","juː","ju","ju","jə","jə","hɪm","hɪm","hɪm", "həm","ðɛm"), words = c("you","you","you","you","you","you","you","him","him","him","him","them"))
articles <- data.frame(pronounciation = c("ðiː","ði","ði","ðə","ðə","ði","ðə","eɪ","eɪ","æɪ","æɪ","eɪ","eɪ","eɪ","e"), words = c("the","the","the","the","the","the","the","a","a","a","a","a","a","a","a"))
numbers <- data.frame(pronounciation = c("wʌn","wʌn","wʌn","wʌn","wan","wa:n","tuː","tuː","tuː","tuː","tu","tu","tuː","tuː","θɹiː"), words = c("one","one","one","one","one","one","two","two","two","two","two","two","two","two","three"))
ls <- list(pronouns, articles, numbers)

ls[[1]]
   pronounciation words
1             juː   you
2             juː   you
3             juː   you
4              ju   you
5              ju   you
6              jə   you
7              jə   you
8             hɪm   him
9             hɪm   him
10            hɪm   him
11            həm   him
12            ðɛm  them

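From this list of data frames, I want to use table() to extract a contingency table of $words while also selecting the most frequent pronunciation of each word. The desired result is in ls_out: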

pronouns_out <- data.frame(pronounciation = c("juː","hɪm","ðɛm"), words = c("you","him","them"), occurence = c(7,4,1))
articles_out <- data.frame(pronounciation = c("ði","eɪ"), words = c("the","a"), occurence = c(7,8))
numbers_out <- data.frame(pronounciation = c("wʌn","tuː","θɹiː"), words = c("one","two","three"), occurence = c(6,8,1))
ls_out <- list(pronouns_out, articles_out, numbers_out)

ls_out[[1]]
  pronounciation words occurence
1            juː   you         7
2            hɪm   him         4
3            ðɛm  them         1

If two or more pronunciations are equally frequent (such as ði and ðə in ls[[2]]), one of them should be chosen at random.
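As a small illustration of such a tie, the sketch below rebuilds `articles` from above so that it runs on its own and cross-tabulates words against pronunciations. The names `tab`, `the_counts`, and `tied` are introduced here only for illustration.

```r
# Rebuild the `articles` data frame from the question (same data as above).
articles <- data.frame(
  pronounciation = c("ðiː","ði","ði","ðə","ðə","ði","ðə","eɪ","eɪ","æɪ","æɪ","eɪ","eɪ","eɪ","e"),
  words = c("the","the","the","the","the","the","the","a","a","a","a","a","a","a","a")
)

# Rows are words, columns are pronunciations, cells are counts.
tab <- table(articles$words, articles$pronounciation)

# Counts for "the": ði and ðə each occur 3 times, ðiː once.
the_counts <- tab["the", ]

# All pronunciations that share the maximum count for "the".
tied <- names(the_counts)[the_counts == max(the_counts)]
tied
```

Any solution therefore has to decide between the two entries in `tied`, and the question asks for that decision to be random.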

Any suggestions on how to do this are very welcome.

Best answer

Using table (and lapply):

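# For one data frame: cross-tabulate words against pronunciations, then pick
# the most frequent pronunciation of each word, breaking ties at random.
ff <- function(pronounce, word)
{
    tab <- table(word, pronounce)  # rows = words, columns = pronunciations
    data.frame(pronounciation = colnames(tab)[max.col(tab, "random")],  # column holding each row's maximum
               words = rownames(tab),
               occurences = unname(rowSums(tab)))  # total occurrences of each word
}

# Apply ff to every data frame in the list.
lapply(ls, function(x) ff(x$pronounciation, x$words))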

#[[1]]
#  pronounciation words occurences
#1            hɪm   him          4
#2            ðɛm  them          1
#3            juː   you          7
#
#[[2]]
#  pronounciation words occurences
#1             eɪ     a          8
#2             ði   the          7
#
#[[3]]
#  pronounciation words occurences
#1            wʌn   one          6
#2           θɹiː three          1
#3            tuː   two          8
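
Because max.col(..., "random") breaks ties at random, repeated runs can return a different pronunciation for a tied word such as "the". The following is a minimal sketch on a made-up count matrix; the data and the names `counts`, `first_run`, `second_run`, and `picks` are hypothetical and not part of the question. It shows one way to observe this behavior and how `set.seed()` makes a run repeatable.

```r
# Hypothetical count matrix: rows are words, columns are pronunciation variants.
# Row "w1" has a tie between "a" and "b"; row "w2" has a unique maximum at "b".
counts <- matrix(c(3, 3, 1,
                   0, 5, 2),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("w1", "w2"), c("a", "b", "c")))

# Resetting the random number generator to the same seed repeats the same choice.
set.seed(1)
first_run <- colnames(counts)[max.col(counts, ties.method = "random")]
set.seed(1)
second_run <- colnames(counts)[max.col(counts, ties.method = "random")]
identical(first_run, second_run)

# Repeat the selection many times: each column of `picks` is one run.
picks <- replicate(200, colnames(counts)[max.col(counts, ties.method = "random")])
table(picks[1, ])   # "w1": split between the tied variants "a" and "b"
unique(picks[2, ])  # "w2": always "b", because there is no tie
```

With ties.method = "first" or "last" the choice would be deterministic instead, which would not match the random selection the question asks for.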

Regarding r - selecting the most frequent element in a data frame when using table, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/19563511/
