r - If Sequence <= 3, recode all values to zero and keep certain information

Tags: r select data.table sequence

I asked a similar question before, but I need some further output and decided to post a new question.

I have a data.table object like this:

library(data.table)
cells <- c(100, 1,1980,1,0,1,1,0,1,0,
       150, 1,1980,1,1,1,0,0,0,1,
       99 , 1,1980,1,1,1,1,0,0,0,
       899, 1,1980,0,1,0,1,1,1,1,
       789, 1,1982,1,1,1,0,1,1,1 )
colname <- c("number","sex", "birthy", "2004","2005", "2006", "2007", "2008", "2009","2010")
rowname <- c("1","2","3","4","5")
y <- matrix(cells, nrow=5, ncol=10, byrow=TRUE, dimnames =   list(rowname,colname))
y <- data.table(y, keep.rownames = TRUE)

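A value of 1 in column 2004 means that the person was continuously insured during 2004. People who were insured for 3 consecutive years can be part of the study. I need a subset of this data.table containing all observations that satisfy the following condition: 2004+2005+2006 = 3 or 2005+2006+2007 = 3 or 2006+2007+...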
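The condition above amounts to "the row contains at least three consecutive 1s". As a minimal base R sketch of that check on a single row (the helper name `has_run` is hypothetical and not part of the original code), run-length encoding makes the test a one-liner:

```r
# Hypothetical helper: TRUE if x contains at least n consecutive 1s
has_run <- function(x, n = 3) {
  r <- rle(x)
  any(r$values == 1 & r$lengths >= n)
}

# Yearly flags (2004 to 2010) for rows 1 and 4 of the example data
row1 <- c(1, 0, 1, 1, 0, 1, 0)
row4 <- c(0, 1, 0, 1, 1, 1, 1)

has_run(row1)  # FALSE: the longest run of 1s has length 2
has_run(row4)  # TRUE: 2007 to 2010 is a run of four 1s
```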

# use melt() and rle() to restructure the data
tmp <- melt(y, id = "rn", measure.vars = patterns("^20"),
        variable.factor = FALSE, variable.name = "year")[, rle(value), by = rn]

#subset data based on condition, keeping only the first relevant sequence
tmp2 <- tmp[(values == 1 & lengths >= 3), .(rn,lengths)][, .SD[1,], by=rn]
##selecting only rows with value=1 and min 3 in a row
##keeping only the variable rn
tmp3 <- tmp[values == 1, which(max(lengths) >= 3), by = rn]$rn

## use the row numbers to select observations from the data.table
## and merge in the length of the sequence
dt <- merge(y[as.integer(tmp3)],tmp2, by="rn")

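Is there a way to turn all 1s into 0s if they are not part of a sequence? For example, for rn == 4 the variable "2005" needs to be zero.

I also need a new variable "begy" that contains the year in which the sequence starts. For example, for rn == 5, begy == 2004. Any suggestions would be much appreciated...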

Best answer

New solution:

# define a custom function in order to only keep the sequences
# with 3 (or more) consecutive years
rle3 <- function(x) {
  r <- rle(x)
  r$values[r$lengths < 3 & r$values == 1] <- 0
  inverse.rle(r)
}

# replace all 1s that do not belong to a sequence of at least 3 with 0
# and create the 'begy' variable (start year of the first sequence)
melt(y, id = 1:4, measure.vars = patterns("^20"),
     variable.factor = FALSE, variable.name = "year"
     )[, value := rle3(value), by = rn
       ][, begy := year[value == 1][1], rn
         ][, dcast(.SD[!is.na(begy)], ... ~ year, value.var = "value")]

which gives:

   rn number sex birthy begy 2004 2005 2006 2007 2008 2009 2010
1:  2    150   1   1980 2004    1    1    1    0    0    0    0
2:  3     99   1   1980 2004    1    1    1    1    0    0    0
3:  4    899   1   1980 2007    0    0    0    1    1    1    1
4:  5    789   1   1982 2004    1    1    1    0    1    1    1
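To see what `rle3` does to one person's record, here is a small base R sketch that applies it to rows 4 and 1 of the example data as plain vectors. The vectors `row4`, `row1`, `years` and `cleaned` are introduced only for illustration; the `begy` lookup mirrors the `year[value == 1][1]` expression used above.

```r
# Same helper as in the answer: zero out runs of 1s shorter than 3
rle3 <- function(x) {
  r <- rle(x)
  r$values[r$lengths < 3 & r$values == 1] <- 0
  inverse.rle(r)
}

years <- as.character(2004:2010)

# Row 4: the isolated 1 in 2005 is not part of a sequence of 3
row4 <- c(0, 1, 0, 1, 1, 1, 1)
cleaned <- rle3(row4)
cleaned                    # 0 0 0 1 1 1 1
years[cleaned == 1][1]     # "2007", the value stored in 'begy'

# Row 1 has no run of 3, so every value becomes 0 and 'begy' is NA
row1 <- c(1, 0, 1, 1, 0, 1, 0)
years[rle3(row1) == 1][1]  # NA
```

Because rows without a qualifying sequence end up with `begy` equal to `NA`, the `!is.na(begy)` filter is what drops row 1 from the result.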

Old solution:

# define a custom function in order to only keep the sequences
# with 3 (or more) consecutive years
rle3 <- function(x) {
  r <- rle(x)
  r$values[r$lengths < 3 & r$values == 1] <- 0
  inverse.rle(r)
}

# create a reference 'data.table' with only the row to keep
# and the start year of the (first) sequence (row 5 has 2 sequences of 3)
x <- melt(y, id = "rn", measure.vars = patterns("^20"),
          variable.factor = FALSE, variable.name = "year"
          )[, value := rle3(value), by = rn
            ][value == 1, .SD[1], rn]

# join 'x' with 'y' to add 'begy' and filter out the row with no sequences of 3
y[x, on = "rn", begy := year][!is.na(begy)]

which gives:

   rn number sex birthy 2004 2005 2006 2007 2008 2009 2010 begy
1:  2    150   1   1980    1    1    1    0    0    0    1 2004
2:  3     99   1   1980    1    1    1    1    0    0    0 2004
3:  4    899   1   1980    0    1    0    1    1    1    1 2007
4:  5    789   1   1982    1    1    1    0    1    1    1 2004
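The last step of the old solution is an update join: `begy` is added to `y` by reference for the rows that match `x` on `rn`, and unmatched rows are left as `NA`. A minimal sketch of that pattern on toy data (the tables `main` and `ref` and their contents are hypothetical, not taken from the question):

```r
library(data.table)

# Toy tables (hypothetical data)
main <- data.table(rn = c("1", "2", "3"), number = c(100, 150, 99))
ref  <- data.table(rn = c("2", "3"), year = c("2004", "2005"))

# Update join: add 'begy' to 'main' by reference, matched on 'rn';
# rows of 'main' without a match in 'ref' get NA
main[ref, on = "rn", begy := year]

# Keep only the rows that found a match
main[!is.na(begy)]
```

Note that this variant leaves the yearly columns of `y` untouched, which is why the old output still shows the stray 1s (for example `2005` for rn 4).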

Regarding "r - If Sequence <= 3, recode all values to zero and keep certain information", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/52034420/
