python - Extract the common terms between each pair of rows

I have a dataframe like this:

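import numpy as np
import pandas as pd

# Four rows, matching the frame shown below (row 3 shares no words with the others).
df = pd.DataFrame(np.array(['This here is text', 'My Text was here', 'This was not ready', 'nothing common']), columns=['Text'])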

                 Text
0   This here is text
1    My Text was here
2  This was not ready
3      nothing common

I want to create a new dataframe with the following result:

row1 row2    common_text
  0    1        here,text
  0    2        this
  1    2        was

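That is, a new dataframe containing all the words common to each pair of rows. If two rows have nothing in common, the pair should be skipped, as with pairs 1,3 and 0,3.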

My question: is there a faster or more Pythonic way to do this than iterating over all the rows twice to extract the common terms and store them together?

Best answer

from itertools import combinations

result = []

# Iterate through each pair of rows.
for row_1, row_2 in combinations(df['Text'].index, 2):
    # Find set of lower case words stripped of whitespace for each row in pair.
    s1, s2  = [set(df.loc[row, 'Text'].lower().strip().split()) for row in (row_1, row_2)]
    # Find the common words to the pair of rows.
    common = s1.intersection(s2)
    if common:
        # If there are words in common, append them to the results as a comma-separated string
        # (could also append the set or a list of words). Set order is arbitrary, so word order may vary.
        result.append([row_1, row_2, ",".join(common)])

>>> pd.DataFrame(result, columns=['row1', 'row2', 'common_text'])
   row1  row2 common_text
0     0     1   text,here
1     0     2        this
2     1     2         was
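
A possible variation on the loop above, offered as a sketch rather than as part of the original answer: each row's text is tokenized once up front, so the pair loop only does set intersections instead of re-splitting the same string for every pair. The function name `common_terms` and its `column` parameter are made up for illustration, and `sorted()` is added only so that the joined string comes out in a stable order.

```python
from itertools import combinations

import pandas as pd


def common_terms(df, column="Text"):
    """Hypothetical helper: common lower-cased words for every pair of rows."""
    # Tokenize each row exactly once, keyed by its index label.
    tokens = {idx: set(text.lower().split()) for idx, text in df[column].items()}

    rows = []
    # combinations() walks the dict keys in insertion order, i.e. row order.
    for row_1, row_2 in combinations(tokens, 2):
        common = tokens[row_1] & tokens[row_2]
        if common:
            # sorted() makes the output deterministic; plain sets have no fixed order.
            rows.append([row_1, row_2, ",".join(sorted(common))])

    return pd.DataFrame(rows, columns=["row1", "row2", "common_text"])


df = pd.DataFrame(
    {"Text": ["This here is text", "My Text was here", "This was not ready", "nothing common"]}
)
result = common_terms(df)
print(result)
```

This still visits every pair of rows, so the number of comparisons grows quadratically with the row count; it only removes the repeated tokenizing inside the loop.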

Regarding python - extracting the common terms between each pair of rows, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/46314492/
