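I'm new to R and have a question about loops.
In my real dataset I have 7000 observations covering 15 sectors and 6 types of organization in 80 countries, but here is a simplified example.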
country <- c("a","a","a","a","a","a","b","b","b","b","b","b",
"c","c","c","c","c","c","d","d","d","d","d","d")
sector <- c("a","a","a","b","c","c","a","b","b","b","c","c",
"b","b","b","b","c","c","a","a","b","b","c","c")
organization <-c("a","b","c","c","b","a","a","b","b","c","b","b",
"c","a","a","b","b","c","c","b","a","a","b","c")
budget <-c(2,4,3,5,9,7,5,4,3,6,1,2,4,5,6,1,5,3,4,2,3,5,4,6)
table <- data.frame(country, sector, organization, budget)
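What I want is:
- The number of organizations of each type in a given sector of a given country.
- The percentage of the sector's total budget that goes to each type of organization.
First I have to make a subset that selects only the information for country "a" and sector "a":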
smalltable <-subset(table, (country == "a") & (sector == "a"))
Then, to answer my first question (how many organizations of each type there are in one sector of one country):
smalltable$count <- table(smalltable$organization)
Then I need to find the share of the budget:
smalltable$percentage <- smalltable$budget / sum(smalltable$budget)
Then I used tapply:
N <- tapply(smalltable$count, smalltable$organization, FUN=sum)
financialshare <- tapply(smalltable$percentage, smalltable$organization, FUN=sum)
And finally I combined the results:
total <- data.frame (smalltable$country,smalltable$sector,smalltable$organization, N,financialshare)
total
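This is the small table I need!
But I need this for all 15 sectors and all 80 countries, so I need some kind of loop that runs through all the sectors and repeats that for each country. I need the tables to be as compact as possible, with all the information about one country (i.e. its 15 sectors) summarized in a single table. Zero values should also be dropped from the table to save space.
How should I proceed?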
Best answer
I'll give a data.table answer:
library(data.table)
my_table=data.table(country, sector, organization, budget)
by_org=my_table[, list(count=.N, budget=sum(budget)),
keyby=list(country, sector, organization)]
total_budgets=my_table[, list(total_budget=sum(budget)),
keyby=list(country, sector)]
joined_table= total_budgets[by_org]
joined_table[,percentage:=budget/total_budget]
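Edit from Matthew: as of v1.8.1, := can be used by group, so no join is needed. This is simpler and faster, and the total_budget column is appended on the right, which is more natural than where it ended up with the join in v1.8.0: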
DT = data.table(country, sector, organization, budget)
ans = DT[, list(count=.N, budget=sum(budget)),
keyby=list(country, sector, organization)]
ans[, total_budget:=sum(budget), by=list(country,sector)]
ans[, percentage:=budget/total_budget]
Result (using v1.8.1):
    country sector organization count budget total_budget percentage
 1:       a      a            a     1      2            9  0.2222222
 2:       a      a            b     1      4            9  0.4444444
 3:       a      a            c     1      3            9  0.3333333
 4:       a      b            c     1      5            5  1.0000000
 5:       a      c            a     1      7           16  0.4375000
 6:       a      c            b     1      9           16  0.5625000
 7:       b      a            a     1      5            5  1.0000000
 8:       b      b            b     2      7           13  0.5384615
 9:       b      b            c     1      6           13  0.4615385
10:       b      c            b     2      3            3  1.0000000
11:       c      b            a     2     11           16  0.6875000
12:       c      b            b     1      1           16  0.0625000
13:       c      b            c     1      4           16  0.2500000
14:       c      c            b     1      5            8  0.6250000
15:       c      c            c     1      3            8  0.3750000
16:       d      a            b     1      2            6  0.3333333
17:       d      a            c     1      4            6  0.6666667
18:       d      b            a     2      8            8  1.0000000
19:       d      c            b     1      4           10  0.4000000
20:       d      c            c     1      6           10  0.6000000
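The question also asks for one compact table per country with no zero rows. A minimal sketch of one way to get that from the result above, assuming a data.table version recent enough to provide the `by` argument of `split` (newer than the v1.8.1 used here); the names `res` and `per_country` are only illustrative:

```r
library(data.table)

# Same example data as in the question
country <- rep(c("a", "b", "c", "d"), each = 6)
sector <- c("a","a","a","b","c","c","a","b","b","b","c","c",
            "b","b","b","b","c","c","a","a","b","b","c","c")
organization <- c("a","b","c","c","b","a","a","b","b","c","b","b",
                  "c","a","a","b","b","c","c","b","a","a","b","c")
budget <- c(2,4,3,5,9,7,5,4,3,6,1,2,4,5,6,1,5,3,4,2,3,5,4,6)

DT <- data.table(country, sector, organization, budget)
res <- DT[, list(count = .N, budget = sum(budget)),
          keyby = list(country, sector, organization)]
res[, total_budget := sum(budget), by = list(country, sector)]
res[, percentage := budget / total_budget]

# One table per country; each element holds all sectors of that country
per_country <- split(res, by = "country")
per_country[["a"]]
```

Because the grouping only produces combinations that actually occur in the data, there are no zero-count rows to remove afterwards.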
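Two things to note here. First, your question is a bit vague and contradictory about counts versus sums, but hopefully my code snippet is self-explanatory as to which calculations I did.
Second, looping over a large number of observations is not idiomatic in R, because it tends to be slow. Most people who have programmed in R for a while tend to use vectorized operations, plyr, data.table, or other similar packages.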
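To illustrate the vectorized style without any extra package, here is a rough base R sketch of the same calculation using `aggregate` and `ave`. It is not part of the original answer; the names `dat` and `by_org` are placeholders.

```r
# Same example data as in the question
country <- rep(c("a", "b", "c", "d"), each = 6)
sector <- c("a","a","a","b","c","c","a","b","b","b","c","c",
            "b","b","b","b","c","c","a","a","b","b","c","c")
organization <- c("a","b","c","c","b","a","a","b","b","c","b","b",
                  "c","a","a","b","b","c","c","b","a","a","b","c")
budget <- c(2,4,3,5,9,7,5,4,3,6,1,2,4,5,6,1,5,3,4,2,3,5,4,6)
dat <- data.frame(country, sector, organization, budget,
                  stringsAsFactors = FALSE)

# Summed budget per country/sector/organization
by_org <- aggregate(budget ~ country + sector + organization,
                    data = dat, FUN = sum)
# Number of observations per group (same grouping, so same row order)
by_org$count <- aggregate(budget ~ country + sector + organization,
                          data = dat, FUN = length)$budget
# Sector total repeated on every row of that country/sector
by_org$total_budget <- ave(by_org$budget, by_org$country, by_org$sector,
                           FUN = sum)
by_org$percentage <- by_org$budget / by_org$total_budget

# Sort for readability
by_org <- by_org[order(by_org$country, by_org$sector, by_org$organization), ]
by_org
```

No explicit loop is written here: the grouping functions apply `sum` and `length` to each group in one call.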
For completeness, though, the loop constructs look like this:
for (item in list)
{
...
}
Or, iterating over a common index:
for (i in seq_along(object))  # safer than 1:length(object) when object is empty
{
...
}
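As a concrete instance of these constructs, the double loop the question asks about might look roughly like the sketch below. It repeats the asker's subset-and-tapply steps for every country and sector. The variable names (`dat`, `results`, `cn`, `sc`) are made up for illustration, and for large data the grouped approaches above are likely to be faster.

```r
# Same example data as in the question
country <- rep(c("a", "b", "c", "d"), each = 6)
sector <- c("a","a","a","b","c","c","a","b","b","b","c","c",
            "b","b","b","b","c","c","a","a","b","b","c","c")
organization <- c("a","b","c","c","b","a","a","b","b","c","b","b",
                  "c","a","a","b","b","c","c","b","a","a","b","c")
budget <- c(2,4,3,5,9,7,5,4,3,6,1,2,4,5,6,1,5,3,4,2,3,5,4,6)
dat <- data.frame(country, sector, organization, budget,
                  stringsAsFactors = FALSE)

results <- list()
for (cn in unique(dat$country)) {
  for (sc in unique(dat$sector)) {
    small <- dat[dat$country == cn & dat$sector == sc, ]
    if (nrow(small) == 0) next  # skip country/sector pairs with no data
    # Count and budget share per organization type within this subset
    n <- tapply(small$budget, small$organization, length)
    share <- tapply(small$budget, small$organization, sum) / sum(small$budget)
    results[[length(results) + 1]] <- data.frame(
      country = cn, sector = sc, organization = names(n),
      N = as.vector(n), financialshare = as.vector(share))
  }
}
# Stack the small tables into one
total <- do.call(rbind, results)
total
```

Collecting the pieces in a list and binding them once at the end avoids growing a data frame inside the loop.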
Regarding "r - double loop in R", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/11099701/