So I have a data frame like this:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array(['This here is text', 'My Text was here', 'This was not ready', 'nothing common']), columns=['Text'])
                 Text
0   This here is text
1    My Text was here
2  This was not ready
3      nothing common
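I want to create a new data frame with the following result:
row1  row2  common_text
   0     1  here,text
   0     2  this
   1     2  was
That is, a new data frame containing all the words shared by each pair of rows. If two rows have nothing in common, the pair is left out, as with 1,3 and 0,3.
My question is: is there a faster or more Pythonic way to do this than iterating over all the rows twice to extract the common terms and store them together?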
Best answer
from itertools import combinations

result = []
# Iterate through each pair of rows.
for row_1, row_2 in combinations(df['Text'].index, 2):
    # Build the set of lower-case words for each row in the pair.
    s1, s2 = [set(df.loc[row, 'Text'].lower().split()) for row in (row_1, row_2)]
    # Find the words common to both rows.
    common = s1.intersection(s2)
    if common:
        # If there are words in common, append them as a comma-separated string
        # (you could also append the set or a list of words). Sets are unordered,
        # so sort to keep the output stable.
        result.append([row_1, row_2, ",".join(sorted(common))])

>>> pd.DataFrame(result, columns=['row1', 'row2', 'common_text'])
   row1  row2 common_text
0     0     1   here,text
1     0     2        this
2     1     2         was
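
The loop above lower-cases and splits each row's text again for every pair it appears in. A possible variation, shown here only as a sketch and not part of the original answer, is to tokenize each row once and then intersect the precomputed sets. The names word_sets and common_df are made up for illustration. It still visits every pair of rows, so it reduces repeated string work rather than the number of comparisons.

```python
from itertools import combinations

import pandas as pd

# Same sample data as in the question (assumed to include the fourth row).
df = pd.DataFrame({'Text': ['This here is text', 'My Text was here',
                            'This was not ready', 'nothing common']})

# Tokenize every row exactly once: lower-case, split on whitespace, make a set.
word_sets = df['Text'].str.lower().str.split().map(set)

# Intersect the precomputed sets for each pair; keep only non-empty overlaps.
result = [
    (row_1, row_2, ",".join(sorted(s1 & s2)))
    for (row_1, s1), (row_2, s2) in combinations(word_sets.items(), 2)
    if s1 & s2
]

common_df = pd.DataFrame(result, columns=['row1', 'row2', 'common_text'])
print(common_df)
```

Sorting the shared words before joining keeps the common_text column reproducible, since set iteration order is not guaranteed.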
Regarding python - extracting common terms between each pair of rows, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/46314492/