python - Comparing 2 columns group by group in pandas or Python

I have a dataset and I'm not sure how to check whether its groups have similar values. Here is a sample of my dataset:

type   value
a       1
a       2
a       3
a       4

b       2
b       3
b       4
b       5

c       1
c       3
c       4



d       2
d       3
d       4

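I want to know which types are similar, in the sense that all the values of one type also exist in another type. For example, type d has the values 2, 3, 4 and type a has the values 1, 2, 3, 4, so they are "similar", or can be considered the same, and I would like the output to tell me that d is similar to a.

The expected output would look something like this: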


type   value            similarity
a       1         A is similar to B and D
a       2
a       3
a       4

b       2         b is similar to a and d
b       3
b       4
b       5

c       1         c is similar to a 
c       3
c       4



d       2         d is similar to a and b
d       3
d       4

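I'm not sure whether this can be done in Python or pandas, but I would really appreciate some guidance, because I'm lost and don't know where to start.

The output also doesn't have to look like my example above; it could just be another CSV that tells me which types are similar and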

Best answer

I would use set operations.
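As a quick illustration of the set operations the two approaches below rely on, here is a minimal sketch using plain Python sets. The variable names are only illustrative; the values are the ones listed for types a, c and d in the question.

```python
# values of three of the types from the question, as plain Python sets
a = {1, 2, 3, 4}
c = {1, 3, 4}
d = {2, 3, 4}

# intersection: the values two types have in common
common_a_d = a & d
print(common_a_d)

# size of the intersection, useful for an "at least N in common" rule
print(len(a & c))

# subset test: are all values of d also present in a?
print(d.issubset(a))

# c and d share only two values and neither contains the other
print(len(c & d), c.issubset(d), d.issubset(c))
```

The intersection size drives the first approach, and the subset test drives the second.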

Assuming similarity means having at least N values in common:

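import pandas as pd
from itertools import combinations

# sample data from the question
df = pd.DataFrame({'type': list('aaaabbbbcccddd'),
                   'value': [1, 2, 3, 4, 2, 3, 4, 5, 1, 3, 4, 2, 3, 4]})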

# define minimum number of common items
N = 3

# aggregate as sets
s = df.groupby('type')['value'].agg(set)

# generate all combinations of sets
# and check if the intersection has at least N items
out = (pd.Series([len(a&b)>=N for a, b in combinations(s, 2)],
                 index=pd.MultiIndex.from_tuples(combinations(s.index, 2)))
      )

# concat and add the reversed combinations (a/b -> b/a)
# we could have used a product in the first part but this
# would have required performing the computations twice
similarity = (
 pd.concat([out, out.swaplevel()])
   .loc[lambda x: x].reset_index(-1)
   .groupby(level=0)['level_1'].apply(lambda g: f"{g.name} is similar to {', '.join(g)}")
)

# update the first row of each group with the string
df.loc[~df['type'].duplicated(), 'similarity'] = df['type'].map(similarity)

print(df)

Output:

   type  value               similarity
0     a      1  a is similar to b, c, d
1     a      2                      NaN
2     a      3                      NaN
3     a      4                      NaN
4     b      2     b is similar to d, a
5     b      3                      NaN
6     b      4                      NaN
7     b      5                      NaN
8     c      1        c is similar to a
9     c      3                      NaN
10    c      4                      NaN
11    d      2     d is similar to a, b
12    d      3                      NaN
13    d      4                      NaN
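Since the question mentions that a separate CSV listing the similar types would also be acceptable, here is a pandas-free sketch of that idea, assuming the same "at least N values in common" rule. The column names `type_1`, `type_2` and `n_common` are made up for illustration, and the data is hard-coded from the question's sample.

```python
import csv
import io
from collections import defaultdict
from itertools import combinations

# sample data from the question as (type, value) pairs
rows = [("a", 1), ("a", 2), ("a", 3), ("a", 4),
        ("b", 2), ("b", 3), ("b", 4), ("b", 5),
        ("c", 1), ("c", 3), ("c", 4),
        ("d", 2), ("d", 3), ("d", 4)]

N = 3  # minimum number of shared values

# group the values into one set per type
groups = defaultdict(set)
for type_, value in rows:
    groups[type_].add(value)

# one record per similar pair: (type_1, type_2, number of shared values)
pairs = [(x, y, len(groups[x] & groups[y]))
         for x, y in combinations(sorted(groups), 2)
         if len(groups[x] & groups[y]) >= N]

# write to an in-memory buffer; pass an open file object instead to save to disk
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["type_1", "type_2", "n_common"])  # hypothetical column names
writer.writerows(pairs)
print(buffer.getvalue())
```

Each pair appears once here. If both directions are needed, as in the string column above, the reversed pairs can be appended before writing.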

Assuming similarity means one group is a subset of the other:

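import pandas as pd
from itertools import combinations

# df is the DataFrame from the question (columns 'type' and 'value')

# aggregate as sets
s = df.groupby('type')['value'].agg(set)

# for each pair of types, check whether either set is a subset of the other
out = (pd.Series([a.issubset(b) or b.issubset(a) for a, b in combinations(s, 2)],
                 index=pd.MultiIndex.from_tuples(combinations(s.index, 2)))
      )

# add the reversed pairs (a/b -> b/a), keep only the matching pairs,
# and build one string per type
similarity = (
 pd.concat([out, out.swaplevel()])
   .loc[lambda x: x].reset_index(-1)
   .groupby(level=0)['level_1'].apply(lambda g: f"{g.name} is similar to {', '.join(g)}")
)

# update the first row of each group with the string
df.loc[~df['type'].duplicated(), 'similarity'] = df['type'].map(similarity)

print(df)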

Output:

   type  value            similarity
0     a      1  a is similar to c, d
1     a      2                   NaN
2     a      3                   NaN
3     a      4                   NaN
4     b      2     b is similar to d
5     b      3                   NaN
6     b      4                   NaN
7     b      5                   NaN
8     c      1     c is similar to a
9     c      3                   NaN
10    c      4                   NaN
11    d      2  d is similar to a, b
12    d      3                   NaN
13    d      4                   NaN
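The subset check above is symmetric: a is reported as similar to d because d is contained in a. If only the direction described in the question matters (all values of one type exist in another), a directional variant could look like the sketch below. It uses a plain dict of sets built by hand from the sample data, and the name `contained_in` is just illustrative.

```python
from itertools import permutations

# one set of values per type, taken from the question's sample data
groups = {
    "a": {1, 2, 3, 4},
    "b": {2, 3, 4, 5},
    "c": {1, 3, 4},
    "d": {2, 3, 4},
}

# for each type, collect the other types that contain all of its values
contained_in = {name: [] for name in groups}
for x, y in permutations(groups, 2):  # ordered pairs, so direction matters
    if groups[x] <= groups[y]:        # x is a subset of y
        contained_in[x].append(y)

for name, others in contained_in.items():
    if others:
        print(f"{name} is contained in {', '.join(others)}")
    else:
        print(f"{name} is not contained in any other type")
```

With a DataFrame like the one above, the dict could be obtained with `df.groupby('type')['value'].agg(set).to_dict()` instead of being typed by hand.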

Regarding "python - Comparing 2 columns group by group in pandas or Python", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/75308744/
