The following `data.frame` needs to be subset by inverse pairs plus a few conditions:
> foo
   ID Day  Period            Start              End
 1 11   1 morning     Central Park Alphabet Village
 2 11   1 morning     Central Park Alphabet Village
 3 11   1 evening Alphabet Village        Grammercy
 4 54   1 morning     Union Square        Chinatown
 5 67   1 morning          Midtown           Harlem
 6 67   1 morning           Harlem          Midtown
 7 69   1 morning       Greenpoint Prospect Heights
 8 54   1 evening        Chinatown     Union Square
 9 77   1 morning       Park Slope     Williamsburg
10 73   1 evening     Williamsburg       Park Slope
11 88   2 morning        Grammercy     Battery Park
12 88   2 morning     Battery Park             SoHo
13 88   2 evening     Battery Park        Grammercy
14 69   2 evening Prospect Heights       Greenpoint
15 88   2 evening        Grammercy     Battery Park
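For example, an inverse pair of `Start` and `End` stations must fall on the same `Day` and have the same `ID`, and the first trip must take place in the morning and the second in the evening.

Edit: Note that each Start-End record can be paired with only one End-Start record. That is, once a pair has been formed, the original Start-End record cannot be used to form another pair. For example, record 15 cannot be paired with record 13, because 13 is already "taken".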
The subset will always have an even number of rows. In this case, it would be:
   ID Day  Period        Start          End
 3 54   1 morning Union Square    Chinatown
 7 54   1 evening    Chinatown Union Square
10 88   2 morning    Grammercy Battery Park
11 88   2 evening Battery Park    Grammercy
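I'm not sure whether `subset()` should be used together with a for loop, or how to build the loop. It should say something like: if `Start` and `End` equal the `End` and `Start` of the next row, and `ID` = `ID`, `Day` = `Day`, and `Period` is "morning" for the first record and "evening" for the second.
I think the code should start with something like `if (foo[i-1, "Start"] == foo[i, "End"] && foo[i-1, "End"] == foo[i, "Start"])`, but I'm not sure. The idea is to keep every inverse pair that meets these conditions. Any guidance on the steps to take, with an explanation, would be appreciated.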
Sample data:
> dput(foo)
structure(list(ID = c(11L, 11L, 11L, 54L, 67L, 67L, 69L, 54L,
77L, 73L, 88L, 88L, 88L, 69L, 88L), Day = c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), Period = structure(c(2L,
2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 1L), .Label = c("evening",
"morning"), class = "factor"), Start = structure(c(3L, 3L, 1L,
11L, 8L, 7L, 6L, 4L, 9L, 12L, 5L, 2L, 2L, 10L, 5L), .Label = c("Alphabet Village",
"Battery Park", "Central Park", "Chinatown", "Grammercy", "Greenpoint",
"Harlem", "Midtown", "Park Slope", "Prospect Heights", "Union Square",
"Williamsburg"), class = "factor"), End = structure(c(1L, 1L,
4L, 3L, 6L, 7L, 9L, 11L, 12L, 8L, 2L, 10L, 4L, 5L, 2L), .Label = c("Alphabet Village",
"Battery Park", "Chinatown", "Grammercy", "Greenpoint", "Harlem",
"Midtown", "Park Slope", "Prospect Heights", "SoHo", "Union Square",
"Williamsburg"), class = "factor")), .Names = c("ID", "Day",
"Period", "Start", "End"), class = "data.frame", row.names = c(NA,
-15L))
Best answer
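After grouping by `ID` and `Day`, keep only the groups that contain more than one unique `Period` value (`n_distinct`), then convert the `factor` columns to `character` and apply a `filter` that matches the conditions in the OP's post.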
library(dplyr)
foo %>%
    group_by(ID, Day) %>%
    # keep groups that have both a morning and an evening trip
    filter(n_distinct(Period) > 1) %>%
    # factors with different level sets cannot be compared with ==
    mutate(Start = as.character(Start), End = as.character(End)) %>%
    # first and last trip of the group must be inverses of each other
    filter(Start[1] == End[n()] & Start[n()] == End[1])
#     ID   Day  Period        Start          End
#  (int) (int)  (fctr)        (chr)        (chr)
#1    54     1 morning Union Square    Chinatown
#2    54     1 evening    Chinatown Union Square
#3    88     2 morning    Grammercy Battery Park
#4    88     2 evening Battery Park    Grammercy
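The conversion to character is there because `Start` and `End` are factors built from different sets of station names (only `End` contains "SoHo", for instance). A minimal base R sketch of what happens when such factors are compared directly; the two one-element factors below are made up for illustration and are not taken from `foo`:

```r
# Two factors holding the same label but built from different level sets,
# similar to Start and End in the question's data (illustrative values).
start_f <- factor("Chinatown", levels = c("Chinatown", "Union Square"))
end_f   <- factor("Chinatown", levels = c("Chinatown", "SoHo"))

# Comparing the factors directly raises an error in base R,
# so the error is caught here and replaced by a marker string.
direct <- tryCatch(start_f == end_f, error = function(e) "error")

# Comparing the labels as character strings works as expected.
as_text <- as.character(start_f) == as.character(end_f)

print(direct)
print(as_text)
```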
In `dplyr` 0.5.0 and later, we can use `mutate_if`:
foo %>%
    group_by(ID, Day) %>%
    filter(n_distinct(Period) > 1) %>%
    mutate_if(is.factor, as.character) %>%
    filter(Start[1] == End[n()] & Start[n()] == End[1])
#     ID   Day  Period        Start          End
#  <int> <int>   <chr>        <chr>        <chr>
#1    54     1 morning Union Square    Chinatown
#2    54     1 evening    Chinatown Union Square
#3    88     2 morning    Grammercy Battery Park
#4    88     2 evening Battery Park    Grammercy
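The filter above compares only the first and last row of each group and then keeps or drops the group as a whole, so it depends on row order and on how many trips a group holds (ID 88 has four on day 2). As an alternative, here is a rough base R sketch of the loop the question asks about, including the rule that a record can be used in only one pair. It is not part of the original answer: `pair_inverse_trips()` is a hypothetical helper name, the data is rebuilt with character columns instead of factors, and taking the first available evening match is an assumption about how ties should be resolved.

```r
# Rebuild the question's data with plain character columns (no factors).
foo <- data.frame(
  ID  = c(11, 11, 11, 54, 67, 67, 69, 54, 77, 73, 88, 88, 88, 69, 88),
  Day = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  Period = c("morning", "morning", "evening", "morning", "morning",
             "morning", "morning", "evening", "morning", "evening",
             "morning", "morning", "evening", "evening", "evening"),
  Start = c("Central Park", "Central Park", "Alphabet Village",
            "Union Square", "Midtown", "Harlem", "Greenpoint", "Chinatown",
            "Park Slope", "Williamsburg", "Grammercy", "Battery Park",
            "Battery Park", "Prospect Heights", "Grammercy"),
  End = c("Alphabet Village", "Alphabet Village", "Grammercy", "Chinatown",
          "Harlem", "Midtown", "Prospect Heights", "Union Square",
          "Williamsburg", "Park Slope", "Battery Park", "SoHo", "Grammercy",
          "Greenpoint", "Battery Park"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not from any package): pairs each morning trip with
# at most one evening trip that reverses it for the same ID and Day.
pair_inverse_trips <- function(df) {
  used <- rep(FALSE, nrow(df))   # TRUE once a row belongs to a pair
  keep <- integer(0)             # row indices of the matched pairs
  for (i in which(df$Period == "morning")) {
    # Unused evening trips with the same ID and Day and swapped stations.
    cand <- which(!used &
                  df$Period == "evening" &
                  df$ID == df$ID[i] & df$Day == df$Day[i] &
                  df$Start == df$End[i] & df$End == df$Start[i])
    if (length(cand) > 0) {
      j <- cand[1]               # assumption: take the first available match
      used[c(i, j)] <- TRUE
      keep <- c(keep, i, j)
    }
  }
  df[keep, ]
}

result <- pair_inverse_trips(foo)
print(result)
```

On the sample data this keeps rows 4, 8, 11 and 13 of `foo`: the ID 54 pair on day 1 and the ID 88 pair on day 2. Record 15 stays unmatched because no unused morning trip reverses it.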
Source: "r - Subsetting data by inverse pairs", a similar question on Stack Overflow: https://stackoverflow.com/questions/41517834/