python - How to split long strings in a Pandas column by punctuation

I have a df that looks like this:

words                                              col_a   col_b  
I guess, because I have thought over that. Um,       1       0 
That? yeah.                                          1       1
I don't always think you're up to something.         0       1

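I want to split df.words into separate rows wherever a punctuation character (.,?!:;) occurs. However, each new row should keep the col_a and col_b values of the original row. For example, the df above should end up looking like this: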

words                                              col_a   col_b  
I guess,                                             1       0
because I have thought over that.                    1       0
Um,                                                  1       0 
That?                                                1       1
yeah.                                                1       1
I don't always think you're up to something.         0       1

Best answer

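One approach is to use str.findall with the pattern (.*?[.,?!:;]), which non-greedily matches a run of characters up to and including one of these punctuation marks, and then explode the resulting lists: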

(df.assign(words=df.words.str.findall(r'(.*?[.,?!:;])'))
   .explode('words')
   .reset_index(drop=True))

                                          words  col_a  col_b
0                                      I guess,      1      0
1             because I have thought over that.      1      0
2                                           Um,      1      0
3                                         That?      1      1
4                                         yeah.      1      1
5  I don't always think you're up to something.      0      1
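
To run the answer end to end, a minimal self-contained sketch follows. The frame construction is reconstructed from the table in the question, and the extra str.strip step is an addition, not part of the original answer. It is there because every match after the first starts with the space that followed the previous punctuation mark, which the right-aligned printout above does not reveal.

```python
import pandas as pd

# Rebuild the example frame from the question (values copied from the table).
df = pd.DataFrame({
    "words": [
        "I guess, because I have thought over that. Um,",
        "That? yeah.",
        "I don't always think you're up to something.",
    ],
    "col_a": [1, 1, 0],
    "col_b": [0, 1, 1],
})

# Shortest run of characters that ends in one of the punctuation marks.
pattern = r".*?[.,?!:;]"

result = (
    df.assign(words=df["words"].str.findall(pattern))  # list of segments per row
      .explode("words")                                # one row per segment
      .reset_index(drop=True)                          # replace the repeated index labels
      # Added step (assumption: surrounding whitespace is unwanted).
      .assign(words=lambda d: d["words"].str.strip())
)

print(result)
```

Two caveats with this pattern: any text after the last punctuation mark in a string is not matched and is therefore dropped, and a string with no punctuation at all yields an empty list, which explode turns into NaN. DataFrame.explode requires pandas 0.25 or later.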

This question, python - How to split long strings in a Pandas column by punctuation, corresponds to a similar question on Stack Overflow: https://stackoverflow.com/questions/61331415/
