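I have a data frame in which each conversation has a unique conversation ID. Every conversation is made up of unique posts, and each post is either an incoming post or a response, never both (customers write incoming posts, agents write responses). Every post has a sentiment score.
I want to calculate the change in sentiment by measuring the difference between the first incoming post and the last incoming post of each conversation. Below is a sample data frame.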
# A tibble: 11 x 11
conversationID postID postType conversationOrd… incomingOrder responseOrder createdDate closedDate convResponseHan… tar sentence_score
<dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 25455628 4.74e7 Incomin… 2 2 NA 10/07/2019… 10/07/201… NA NA 0
2 25455725 4.74e7 Incomin… 1 1 NA 10/07/2019… 10/10/201… NA NA 0
3 25455725 4.74e7 Incomin… 2 2 NA 10/07/2019… 10/10/201… NA NA 0
4 25455725 4.74e7 Incomin… 3 3 NA 10/07/2019… 10/10/201… NA NA 0
5 25455725 4.18e6 Response 4 NA 1 10/08/2019… 10/10/201… 23.4 748. 0.184
6 25456349 4.74e7 Incomin… 1 1 NA 10/07/2019… 10/08/201… NA NA 0.3
7 25456349 4.18e6 Response 2 NA 1 10/07/2019… 10/08/201… 3.17 5.15 0.440
8 25456349 4.74e7 Incomin… 3 2 NA 10/07/2019… 10/08/201… NA NA 0.113
9 25456349 4.18e6 Response 4 NA 2 10/07/2019… 10/08/201… 0.67 3.03 0.786
10 25456349 4.74e7 Incomin… 5 3 NA 10/07/2019… 10/08/201… NA NA 0.214
11 25456349 4.18e6 Response 6 NA 3 10/07/2019… 10/08/201… 1.58 2.43 0.251
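Ideally, I would like another column called sentimentConversion that indicates whether the conversation (from the customer's point of view) went from positive to negative, from negative to positive, or stayed the same.
Here is the output of dput():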
structure(list(conversationID = c(25455628, 25455725, 25455725,
25455725, 25455725, 25456349, 25456349, 25456349, 25456349, 25456349,
25456349), postID = c(47371258, 47371485, 47371486, 47373371,
4184259, 47373084, 4181224, 47374183, 4181324, 47375140, 4181430
), postType = c("Incoming Post", "Incoming Post", "Incoming Post",
"Incoming Post", "Response", "Incoming Post", "Response", "Incoming Post",
"Response", "Incoming Post", "Response"), conversationOrder = c(2,
1, 2, 3, 4, 1, 2, 3, 4, 5, 6), incomingOrder = c(2, 1, 2, 3,
NA, 1, NA, 2, NA, 3, NA), responseOrder = c(NA, NA, NA, NA, 1,
NA, 1, NA, 2, NA, 3), createdDate = c("10/07/2019 08:45:14 PM -0400",
"10/07/2019 08:48:25 PM -0400", "10/07/2019 08:48:25 PM -0400",
"10/07/2019 09:20:26 PM -0400", "10/08/2019 09:16:24 AM -0400",
"10/07/2019 09:15:45 PM -0400", "10/07/2019 09:20:52 PM -0400",
"10/07/2019 09:35:47 PM -0400", "10/07/2019 09:38:47 PM -0400",
"10/07/2019 09:55:49 PM -0400", "10/07/2019 09:58:13 PM -0400"
), closedDate = c("10/07/2019 08:49:36 PM -0400", "10/10/2019 09:16:44 AM -0400",
"10/10/2019 09:16:44 AM -0400", "10/10/2019 09:16:44 AM -0400",
"10/10/2019 09:16:44 AM -0400", "10/08/2019 09:06:33 PM -0400",
"10/08/2019 09:06:33 PM -0400", "10/08/2019 09:06:33 PM -0400",
"10/08/2019 09:06:33 PM -0400", "10/08/2019 09:06:33 PM -0400",
"10/08/2019 09:06:33 PM -0400"), convResponseHandleTime = c(NA,
NA, NA, NA, 23.42, NA, 3.17, NA, 0.67, NA, 1.58), tar = c(NA,
NA, NA, NA, 748.28, NA, 5.15, NA, 3.03, NA, 2.43), sentence_score = c(0,
0, 0, 0, 0.183532587096449, 0.3, 0.439929079364222, 0.1125, 0.785712147332011,
0.21354963890361, 0.251196909045889)), row.names = c(NA, -11L
), class = c("tbl_df", "tbl", "data.frame"))
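Best answer
Since you only look at "Incoming Post" rows to determine the change in sentiment score, you can filter on postType.
Before taking the first and last score within each group defined by conversationID, I would arrange by conversationOrder to make sure the rows are in order.
sentimentChange is the last score minus the first. sentimentConversion then codes that change as "Up", "Down", or "Same", depending on whether it is greater than, less than, or equal to zero.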
library(tidyverse)

# df is the data frame from the dput() output above
df %>%
  filter(postType == "Incoming Post") %>%         # customer posts only
  arrange(conversationID, conversationOrder) %>%  # chronological within each conversation
  group_by(conversationID) %>%
  summarise(sentimentChange = last(sentence_score) - first(sentence_score)) %>%
  mutate(sentimentConversion = case_when(
    sentimentChange < 0 ~ "Down",
    sentimentChange > 0 ~ "Up",
    sentimentChange == 0 ~ "Same"))
Output
# A tibble: 3 x 3
conversationID sentimentChange sentimentConversion
<dbl> <dbl> <chr>
1 25455628 0 Same
2 25455725 0 Same
3 25456349 -0.0865 Down
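The summary above has one row per conversation, whereas the question asks for sentimentConversion as a column on the original data. One way to get there, sketched below under a few assumptions, is to join the summary back onto the full data frame by conversationID. The names conversion and df_with_conversion are made up for this illustration, and df is rebuilt here with only the four columns the calculation needs so that the snippet runs on its own.

```r
library(dplyr)

# Trimmed copy of the example data: only the columns used in the calculation.
df <- data.frame(
  conversationID = c(25455628, 25455725, 25455725, 25455725, 25455725,
                     25456349, 25456349, 25456349, 25456349, 25456349, 25456349),
  postType = c("Incoming Post", "Incoming Post", "Incoming Post", "Incoming Post",
               "Response", "Incoming Post", "Response", "Incoming Post",
               "Response", "Incoming Post", "Response"),
  conversationOrder = c(2, 1, 2, 3, 4, 1, 2, 3, 4, 5, 6),
  sentence_score = c(0, 0, 0, 0, 0.183532587096449, 0.3, 0.439929079364222,
                     0.1125, 0.785712147332011, 0.21354963890361, 0.251196909045889),
  stringsAsFactors = FALSE
)

# Per-conversation summary, computed from incoming posts only (same logic as the answer).
conversion <- df %>%
  filter(postType == "Incoming Post") %>%
  arrange(conversationID, conversationOrder) %>%
  group_by(conversationID) %>%
  summarise(sentimentChange = last(sentence_score) - first(sentence_score)) %>%
  mutate(sentimentConversion = case_when(
    sentimentChange < 0 ~ "Down",
    sentimentChange > 0 ~ "Up",
    sentimentChange == 0 ~ "Same"))

# Attach the summary to every row of the original data, responses included.
df_with_conversion <- df %>%
  left_join(conversion, by = "conversationID")
```

Every post in a conversation, including the agent responses, then carries that conversation's sentimentChange and sentimentConversion. The "Same" branch relies on an exact comparison with zero; if scores that differ only by floating-point noise should also count as unchanged, dplyr::near() could be used in that branch instead.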
Source: the Stack Overflow question on the difference in sentiment score between the first and last incoming post (tag: r): https://stackoverflow.com/questions/59957926/