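r - How to remove the text that follows certain sentences?

Tags: r dataframe

I have a data frame with n rows that contains some text. Some of the rows contain extra text that I would like to remove, and the extra text always appears right after a few specific sentences.

For example: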

df = structure(list(Text = c("The text you see here is fine, no problem with this.", 
"The text you see here is fine, no problem with this.", "The text you see here is fine, no problem with this. We are now ready to take your questions. Life is great even if it is too hot to work at the moment.", 
"The text you see here is fine, no problem with this.", "The text you see here is fine, no problem with this.", 
"The text you see here is fine, no problem with this. We are now at your disposal for questions. I really need to remove this bit that comes after since I don't need it. Hopefully SE will sort this out.", 
"The text you see here is fine, no problem with this.", "The text you see here is fine, no problem with this.", 
"The text you see here is fine, no problem with this.", "The text you see here is fine, no problem with this. Transcript of the questions asked and the answers. Summertime is nice.", 
"The text you see here is fine, no problem with this.", "The text you see here is fine, no problem with this."
)), class = "data.frame", row.names = c(NA, -12L))

I would like to get:

#                                                               Text
# 1                                                     The text you see here is fine, no problem with this.
# 2                                                     The text you see here is fine, no problem with this.
# 3            The text you see here is fine, no problem with this. We are now ready to take your questions.
# 4                                                     The text you see here is fine, no problem with this.
# 5                                                     The text you see here is fine, no problem with this.
# 6          The text you see here is fine, no problem with this. We are now at your disposal for questions.
# 7                                                     The text you see here is fine, no problem with this.
# 8                                                     The text you see here is fine, no problem with this.
# 9                                                     The text you see here is fine, no problem with this.
# 10 The text you see here is fine, no problem with this. Transcript of the questions asked and the answers.
# 11                                                    The text you see here is fine, no problem with this.
# 12                                                    The text you see here is fine, no problem with this.

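The data frame is a simplified representation of the real one. The extra text (always the same in the example, but varying in the real case) always appears after one of the following three sentences: "We are now at your disposal for questions.", "Transcript of the questions asked and the answers.", and "We are now ready to take your questions."

Can anyone help me with this?

You would really make my day.

Thanks!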

Best answer

We can use sub:

df$Text <- sub("I really need to remove .*", "", df$Text)
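To make the behavior of that call concrete, here is a minimal sketch on a single string. The string `x` is a made-up stand-in for one row of `df$Text`, and the `trimws` step is an addition that is not part of the answer: it only drops the space left behind in front of the removed text.

```r
# Hypothetical single string standing in for one row of df$Text
x <- "Keep this. We are now at your disposal for questions. I really need to remove this bit."

# sub() replaces the first match; ".*" extends the match to the end of the string
out <- sub("I really need to remove .*", "", x)

# The space that preceded the removed text is still there, so trim it
# (an added step, not in the original answer)
out_trimmed <- trimws(out)
print(out_trimmed)
```

This only helps when the unwanted text itself is known in advance, which is the case in the example but, according to the question, not in the real data.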

We can also create a vector of patterns and use a for loop:

patvec <- c("We are now at your disposal for questions.",
            "Transcript of the questions asked and the answers.",
            "We are now ready to take your questions.",
            "I really need to remove this bit that comes after since I don't need it.")

# loop over the sequence of the pattern vector
for (i in seq_along(patvec)) {
  # create a regex pattern that captures everything up to
  # and including the current pattern vector element
  tmppat <- paste0("^(.*", patvec[i], ").*")
  # use sub to replace the whole match with the captured group, i.e. the string inside (..)
  # and assign the result back to the column Text
  df$Text <- sub(tmppat, "\\1", df$Text)
}

-output

df
                                                                                                      #Text
#1                                                     The text you see here is fine, no problem with this.
#2                                                     The text you see here is fine, no problem with this.
#3            The text you see here is fine, no problem with this. We are now ready to take your questions.
#4                                                     The text you see here is fine, no problem with this.
#5                                                     The text you see here is fine, no problem with this.
#6          The text you see here is fine, no problem with this. We are now at your disposal for questions.
#7                                                     The text you see here is fine, no problem with this.
#8                                                     The text you see here is fine, no problem with this.
#9                                                     The text you see here is fine, no problem with this.
#10 The text you see here is fine, no problem with this. Transcript of the questions asked and the answers.
#11                                                    The text you see here is fine, no problem with this.
#12                                                    The text you see here is fine, no problem with this.
 

Note: this should work even if the pattern vector has hundreds of thousands of elements.
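As a possible variation on the loop, the three sentences from the question can be combined into one alternation and applied in a single `sub` call. The sketch below is an assumption-laden alternative, not the answer's method: the names `markers`, `onepat`, `demo` and `cleaned` are made up for illustration, `demo` is a small stand-in for `df$Text`, and `perl = TRUE` is used so that `\Q...\E` quotes each sentence literally (otherwise the final `.` in each sentence acts as a regex wildcard).

```r
# The three marker sentences named in the question
markers <- c(
  "We are now at your disposal for questions.",
  "Transcript of the questions asked and the answers.",
  "We are now ready to take your questions."
)

# \Q...\E (PCRE syntax, hence perl = TRUE below) quotes each sentence literally
alternation <- paste0("\\Q", markers, "\\E", collapse = "|")

# Capture everything up to and including a marker; the trailing .* matches the rest
onepat <- paste0("^(.*(?:", alternation, ")).*")

# Hypothetical stand-in for df$Text
demo <- c(
  "Fine text.",
  "Fine text. We are now ready to take your questions. Extra bit.",
  "Fine text. Transcript of the questions asked and the answers. Summertime is nice."
)

# Strings without a marker do not match and are returned unchanged
cleaned <- sub(onepat, "\\1", demo, perl = TRUE)
print(cleaned)
```

On the real data the same call would be `df$Text <- sub(onepat, "\\1", df$Text, perl = TRUE)`. Like the loop, the greedy `.*` keeps text up to the last marker found in a string.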

Regarding "r - How to remove the text that follows certain sentences?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/63206707/
