python - Compare 2 columns group by group in pandas or Python

Tags: python pandas

I currently have a dataset and I'm not sure how to check whether its groups have similar values. Here is a sample of my dataset:

type   value
a       1
a       2
a       3
a       4

b       2
b       3
b       4
b       5

c       1
c       3
c       4



d       2
d       3
d       4


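I want to know which types are similar, in the sense that all of the values in one type are also present in another type. For example, type d has the values 2, 3, 4 and type a has the values 1, 2, 3, 4, so they are "similar", or can be considered the same, and I would like the output to tell me that d is similar to a.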

The expected output should look something like this:


type   value            similarity
a       1         A is similar to B and D
a       2
a       3
a       4

b       2         b is similar to a and d
b       3
b       4
b       5

c       1         c is similar to a 
c       3
c       4



d       2         d is similar to a and b
d       3
d       4


I'm not sure whether this can be done in Python or pandas, but I would really appreciate some guidance, because I'm lost and don't know where to start.

The output also doesn't have to look like the example I just gave; it could simply be another CSV that tells me which types are similar.

Best answer

I would use set operations.
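As a quick illustration of the set operations involved, here is a minimal sketch that rebuilds the question's sample data and compares two groups by hand. The DataFrame construction and the variable names `common`, `d_in_a` and `a_in_d` are assumptions made for illustration; they are not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "type": list("aaaabbbbcccddd"),
    "value": [1, 2, 3, 4, 2, 3, 4, 5, 1, 3, 4, 2, 3, 4],
})

# One set of values per type
s = df.groupby("type")["value"].agg(set)

common = s["a"] & s["b"]          # intersection: values shared by a and b
d_in_a = s["d"].issubset(s["a"])  # every value of d also appears in a
a_in_d = s["a"].issubset(s["d"])  # a contains 1, which d lacks
```

The two approaches below apply these same two operations, intersection and subset, to every pair of types.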

Assuming similarity means having at least N items in common:

from itertools import combinations

import pandas as pd

# define minimum number of common items
N = 3

# aggregate as sets
s = df.groupby('type')['value'].agg(set)

# generate all combinations of sets
# and check if the intersection has at least N items
out = (pd.Series([len(a&b)>=N for a, b in combinations(s, 2)],
                 index=pd.MultiIndex.from_tuples(combinations(s.index, 2)))
      )

# concat and add the reversed combinations (a/b -> b/a)
# we could have used a product in the first part but this
# would have required performing the computations twice
similarity = (
 pd.concat([out, out.swaplevel()])
   .loc[lambda x: x].reset_index(-1)
   .groupby(level=0)['level_1'].apply(lambda g: f"{g.name} is similar to {', '.join(g)}")
)

# update the first row of each group with the string
df.loc[~df['type'].duplicated(), 'similarity'] = df['type'].map(similarity)

print(df)

Output:

   type  value               similarity
0     a      1  a is similar to b, c, d
1     a      2                      NaN
2     a      3                      NaN
3     a      4                      NaN
4     b      2     b is similar to d, a
5     b      3                      NaN
6     b      4                      NaN
7     b      5                      NaN
8     c      1        c is similar to a
9     c      3                      NaN
10    c      4                      NaN
11    d      2     d is similar to a, b
12    d      3                      NaN
13    d      4                      NaN
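When picking the threshold N, it may help to look at the size of each pairwise overlap first. The following is a minimal sketch under the assumption that the data matches the question's sample; `sizes` and `similar_pairs` are hypothetical names, not part of the original answer.

```python
from itertools import combinations

import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "type": list("aaaabbbbcccddd"),
    "value": [1, 2, 3, 4, 2, 3, 4, 5, 1, 3, 4, 2, 3, 4],
})
s = df.groupby("type")["value"].agg(set)

# Number of shared values for every unordered pair of types
sizes = {(x, y): len(s[x] & s[y]) for x, y in combinations(s.index, 2)}

# Pairs that pass the threshold used above
N = 3
similar_pairs = [pair for pair, n in sizes.items() if n >= N]
```

With this data, b/c and c/d share only two values each, which is why neither pair is reported as similar for N = 3.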

Assuming similarity means that one group is a subset of the other:

from itertools import combinations

import pandas as pd

# aggregate as sets
s = df.groupby('type')['value'].agg(set)

# check, for each pair, whether either set is a subset of the other
out = (pd.Series([a.issubset(b) or b.issubset(a) for a, b in combinations(s, 2)],
                 index=pd.MultiIndex.from_tuples(combinations(s.index, 2)))
      )

similarity = (
 pd.concat([out, out.swaplevel()])
   .loc[lambda x: x].reset_index(-1)
   .groupby(level=0)['level_1'].apply(lambda g: f"{g.name} is similar to {', '.join(g)}")
)

df.loc[~df['type'].duplicated(), 'similarity'] = df['type'].map(similarity)

print(df)

Output:

   type  value            similarity
0     a      1  a is similar to c, d
1     a      2                   NaN
2     a      3                   NaN
3     a      4                   NaN
4     b      2     b is similar to d
5     b      3                   NaN
6     b      4                   NaN
7     b      5                   NaN
8     c      1     c is similar to a
9     c      3                   NaN
10    c      4                   NaN
11    d      2  d is similar to a, b
12    d      3                   NaN
13    d      4                   NaN
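The question also mentions that a separate CSV listing the similar types would be acceptable. The following is a sketch of one way that might look, keeping the direction of the subset relation (which type is contained in which) rather than treating it as symmetric. The column names `subset` and `superset` and the variable names are assumptions for illustration only.

```python
from itertools import permutations

import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "type": list("aaaabbbbcccddd"),
    "value": [1, 2, 3, 4, 2, 3, 4, 5, 1, 3, 4, 2, 3, 4],
})
s = df.groupby("type")["value"].agg(set)

# Ordered pairs (x, y) where every value of x also appears in y
rows = [(x, y) for x, y in permutations(s.index, 2) if s[x] <= s[y]]
pairs = pd.DataFrame(rows, columns=["subset", "superset"])

# Without a path, to_csv returns the CSV text instead of writing a file
csv_text = pairs.to_csv(index=False)
```

Passing a file path as the first argument of `to_csv` would write the same table to disk.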

Regarding python - Compare 2 columns group by group in pandas or Python, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/75308744/
