r - Merging data frame rows by parsing strings

Tags: r string text dataframe string-concatenation

I am trying to import a conversation with the following structure into a data frame:

conversation<-data.frame(
             uniquerow=c("01/08/2015 2:49:49 pm: Person 1: Hello",
                         "01/08/2015 2:51:49 pm: Person 2: Nice to meet you",
                         "01/08/2015 2:59:19 pm: Person 1: Same here"))

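This structure would make it relatively easy to parse the date, time, person, and message. In some cases, however, a message contains line breaks, so the data frame ends up with a broken structure like this: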

conversation_errors<-data.frame(
                     uniquerow=c("01/08/2015 2:49:49 pm: Person 1: Hello",
                                 "01/08/2015 2:51:49 pm: Person 2: Nice to meet you",
                                 "01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku: ",
                                 "lend me your arms,",
                                 "fast as thunderbolts,",
                                 "for a pillow on my journey."))

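How would you merge these rows? Is there a package I am not aware of?

The desired function would simply recognise rows that lack the expected structure and "merge" each one with the previous row, so that I would get: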

conversation_fixed<-data.frame(
                    uniquerow=c("01/08/2015 2:49:49 pm: Person 1: Hello",
                                "01/08/2015 2:51:49 pm: Person 2: Nice to meet you",
                                "01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku: lend me your arms, fast as thunderbolts, for a pillow on my journey."))

Any ideas?

Best answer

Assuming you can correctly identify the well-formed rows by their timestamp (expressed in properDataRegex below), this will do it:

mydata <- c("01/08/2015 2:49:49 pm: Person 1: Hello",
            "01/08/2015 2:51:49 pm: Person 2: Nice to meet you",
            "01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku: ",
            "lend me your arms,",
            "fast as thunderbolts,",
            "for a pillow on my journey.",
            "07/07/2015 3:29:00 pm: Person 1: This is not the most efficient method",
            "but it will get the job done.")

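# Well-formed rows start with a dd/dd/dddd date followed by whitespace.
properDataRegex <- "^\\d{2}/\\d{2}/\\d{4}\\s"
improperDataBool <- !grepl(properDataRegex, mydata)
while (any(improperDataBool)) {
    # Continuation rows whose predecessor is well formed.
    mergeWPrevIndex <- which(c(FALSE, !improperDataBool[-length(improperDataBool)]) &
                             improperDataBool)
    # Nothing left to merge (e.g. the very first row is malformed): stop.
    if (length(mergeWPrevIndex) == 0) break
    # Append each such row to the row before it, then drop it.
    mydata[mergeWPrevIndex - 1] <- 
        paste(mydata[mergeWPrevIndex - 1], mydata[mergeWPrevIndex])
    mydata <- mydata[-mergeWPrevIndex]
    improperDataBool <- !grepl(properDataRegex, mydata)
}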

mydata
## [1] "01/08/2015 2:49:49 pm: Person 1: Hello"                                                                                                    
## [2] "01/08/2015 2:51:49 pm: Person 2: Nice to meet you"                                                                                         
## [3] "01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku:  lend me your arms, fast as thunderbolts, for a pillow on my journey."
## [4] "07/07/2015 3:29:00 pm: Person 1: This is not the most efficient method but it will get the job done."
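
The loop above merges one continuation row per message on each pass. As a rough alternative sketch (not part of the original answer), the same grouping can be done without a loop by numbering the messages with cumsum(). The names chat_lines, starts_message, group_id and merged are made up for this illustration, and the input is a shortened example rather than the full data above.

```r
# Shortened example input: rows without a leading date continue the row above.
chat_lines <- c("01/08/2015 2:49:49 pm: Person 1: Hello",
                "01/08/2015 2:59:19 pm: Person 1: Same here, let me tell you a haiku: ",
                "lend me your arms,",
                "fast as thunderbolts,",
                "07/07/2015 3:29:00 pm: Person 1: One more",
                "line.")

# Same assumption as the loop: a row is well formed if it starts with dd/dd/dddd.
starts_message <- grepl("^\\d{2}/\\d{2}/\\d{4}\\s", chat_lines)

# cumsum() starts a new group id at every well-formed row; continuation rows
# inherit the id of the message they belong to.
group_id <- cumsum(starts_message)

# Paste the rows of each group together, in their original order.
merged <- unname(vapply(split(chat_lines, group_id), paste,
                        character(1), collapse = " "))
merged
```

If the very first row is itself a continuation, it ends up alone in group 0 instead of being merged, which may or may not be what you want.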

Here mydata is a character vector, but of course it is now easy to turn it into a data.frame as in your question, or to parse it with read.table() or read.fwf().
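
As one possible way to do that parsing step, here is a minimal sketch using utils::strcapture() (available in base R since version 3.4.0) instead of read.table() or read.fwf(). The pattern and the column names date, time, person and message are assumptions chosen for this example; they are not taken from the original answer.

```r
# Two already-merged rows, used as a small stand-in for the cleaned vector.
merged_rows <- c("01/08/2015 2:49:49 pm: Person 1: Hello",
                 "01/08/2015 2:51:49 pm: Person 2: Nice to meet you")

# Capture groups: date, time with am/pm, person (no colon inside), message.
# The message group is greedy, so colons inside the message are kept.
pattern <- "^(\\d{2}/\\d{2}/\\d{4}) (\\d{1,2}:\\d{2}:\\d{2} [ap]m): ([^:]+): (.*)$"

# proto defines the names and types of the resulting columns.
parsed <- utils::strcapture(
    pattern, merged_rows,
    proto = data.frame(date = character(), time = character(),
                       person = character(), message = character(),
                       stringsAsFactors = FALSE))
parsed
```

The date and time are left as strings here because the question does not say whether 01/08/2015 means day/month or month/day; converting them with as.POSIXct() would need that decision first.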

Source: this question and answer come from Stack Overflow: https://stackoverflow.com/questions/31259941/
