I have a dataset and I'm not sure how to compare whether the groups have similar values. Here is a sample of my dataset:
type value
a 1
a 2
a 3
a 4
b 2
b 3
b 4
b 5
c 1
c 3
c 4
d 2
d 3
d 4
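I want to know which types are similar, in the sense that all of the values in one type also exist in another type. For example, type d has the values 2, 3, 4 and type a has the values 1, 2, 3, 4, so they are "similar", or can be considered the same, and I would like the output to tell me that d is similar to a.
The expected output should look something like this: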
type value similarity
a 1 a is similar to b and d
a 2
a 3
a 4
b 2 b is similar to a and d
b 3
b 4
b 5
c 1 c is similar to a
c 3
c 4
d 2 d is similar to a and b
d 3
d 4
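I'm not sure whether this can be done in Python or pandas, but guidance is much appreciated, as I'm really lost and don't know where to start.
The output also doesn't have to be exactly what I showed in the example; it can just be another CSV that tells me which types are similar.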
Best answer
I would use set operations.
Assuming similarity means having at least N items in common:
from itertools import combinations

import pandas as pd

# df is the DataFrame from the question, with columns 'type' and 'value'

# minimum number of common items for two types to count as similar
N = 3

# aggregate the values of each type into a set
s = df.groupby('type')['value'].agg(set)

# generate all pairs of sets and check whether
# each intersection has at least N items
out = pd.Series(
    [len(a & b) >= N for a, b in combinations(s, 2)],
    index=pd.MultiIndex.from_tuples(combinations(s.index, 2)),
)

# concat and add the reversed combinations (a/b -> b/a);
# we could have used a product in the first part, but this
# would have required performing the computations twice
similarity = (
    pd.concat([out, out.swaplevel()])
    .loc[lambda x: x].reset_index(-1)
    .groupby(level=0)['level_1'].apply(lambda g: f"{g.name} is similar to {', '.join(g)}")
)

# write the string on the first row of each group
df.loc[~df['type'].duplicated(), 'similarity'] = df['type'].map(similarity)
print(df)
Output:
type value similarity
0 a 1 a is similar to b, c, d
1 a 2 NaN
2 a 3 NaN
3 a 4 NaN
4 b 2 b is similar to d, a
5 b 3 NaN
6 b 4 NaN
7 b 5 NaN
8 c 1 c is similar to a
9 c 3 NaN
10 c 4 NaN
11 d 2 d is similar to a, b
12 d 3 NaN
13 d 4 NaN
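To make the set logic concrete without pandas, here is a minimal plain-Python sketch of the same "at least N values in common" idea. It assumes the rows are available as (type, value) tuples; the names `rows`, `groups`, and `similar` are illustrative and not part of the answer above.

```python
from collections import defaultdict
from itertools import combinations

# sample rows from the question, as (type, value) pairs
rows = [
    ("a", 1), ("a", 2), ("a", 3), ("a", 4),
    ("b", 2), ("b", 3), ("b", 4), ("b", 5),
    ("c", 1), ("c", 3), ("c", 4),
    ("d", 2), ("d", 3), ("d", 4),
]

N = 3  # minimum number of shared values (same threshold as above)

# collect the values of each type into a set
groups = defaultdict(set)
for key, value in rows:
    groups[key].add(value)

# check each unordered pair once and record it in both directions
similar = defaultdict(list)
for x, y in combinations(sorted(groups), 2):
    if len(groups[x] & groups[y]) >= N:
        similar[x].append(y)
        similar[y].append(x)

for key in sorted(similar):
    print(f"{key} is similar to {', '.join(similar[key])}")
```

The pairs found are the same as in the pandas output above; only the order in which the partners are listed for b may differ (a, d here instead of d, a).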
Assuming similarity means that one set is a subset of the other:
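from itertools import combinations

import pandas as pd

# aggregate the values of each type into a set
s = df.groupby('type')['value'].agg(set)

# for each pair, check whether either set is a subset of the other
out = pd.Series(
    [a.issubset(b) or b.issubset(a) for a, b in combinations(s, 2)],
    index=pd.MultiIndex.from_tuples(combinations(s.index, 2)),
)

# add the reversed pairs, keep the True ones, and build one string per type
similarity = (
    pd.concat([out, out.swaplevel()])
    .loc[lambda x: x].reset_index(-1)
    .groupby(level=0)['level_1'].apply(lambda g: f"{g.name} is similar to {', '.join(g)}")
)

# write the string on the first row of each group
df.loc[~df['type'].duplicated(), 'similarity'] = df['type'].map(similarity)
print(df)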
Output:
type value similarity
0 a 1 a is similar to c, d
1 a 2 NaN
2 a 3 NaN
3 a 4 NaN
4 b 2 b is similar to d
5 b 3 NaN
6 b 4 NaN
7 b 5 NaN
8 c 1 c is similar to a
9 c 3 NaN
10 c 4 NaN
11 d 2 d is similar to a, b
12 d 3 NaN
13 d 4 NaN
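The question also mentions that a separate CSV listing the related types would be acceptable, and it describes similarity as all values of one type existing in another, which is a directional relationship. One possible sketch under those assumptions follows; the column names `type` and `contained_in` and the variable names `pairs`, `result`, and `csv_text` are made up for illustration.

```python
from itertools import permutations

import pandas as pd

# sample data from the question
df = pd.DataFrame({
    "type": list("aaaabbbbcccddd"),
    "value": [1, 2, 3, 4, 2, 3, 4, 5, 1, 3, 4, 2, 3, 4],
})

# one set of values per type
s = df.groupby("type")["value"].agg(set)

# keep the ordered pairs (x, y) where every value of x also occurs in y
pairs = [(x, y) for x, y in permutations(s.index, 2) if s[x] <= s[y]]

# hypothetical output layout: one row per containment relationship
result = pd.DataFrame(pairs, columns=["type", "contained_in"])

# to_csv() without a path returns the CSV text; pass a filename to write a file
csv_text = result.to_csv(index=False)
print(csv_text)
```

Unlike the symmetric version above, this lists only the direction in which the containment holds, so d appears as contained in a and b, but a is not listed as contained in anything.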
On the topic "python - comparing 2 columns group by group in pandas or Python", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/75308744/