python - Using pandas, how to filter rows that have similar values in two columns

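I have a large data frame (about 10 million rows). Each row has:

category
start position
end position

If two rows belong to the same category and their start and end positions overlap within a tolerance of +-5, I want to keep only one of them. For example: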

1, cat1, 10, 20
2, cat1, 12, 21
3, cat2, 10, 25

I want to filter out either row 1 or row 2.

What I am doing now is not very efficient:

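import pandas as pd

# Column names are assumed here; the file itself has no header row.
df = pd.read_csv('data.csv', sep='\t', header=None, names=['category', 'start', 'end'])
params = {'min_distance': 5}  # the +-5 tolerance described above

# Pre-split the frame by category so each lookup only scans one category.
dfs = {}
for seq in df.category.unique():
    dfs[seq] = df[df.category == seq]

discard = set()  # index labels of rows that matched an already kept row
rows = []        # rows to keep
for index, row in df.iterrows():
    if index in discard:
        continue
    df_2 = dfs[row.category]
    res = df_2[((df_2.start - row.start).abs() <= params['min_distance'])
               & ((df_2.end - row.end).abs() <= params['min_distance'])]
    if len(res.index) > 1:
        discard.update(res.index.values)
    rows.append(row)
df = pd.DataFrame(rows)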

I also tried a different approach that uses a sorted version of the data frame.

# Assumes df is already sorted and has the columns category, start and end;
# params['min_distance'] is the tolerance (5 in the example above).
my_index = 0
indexes = []  # positions of the rows to keep
total_len = len(df.index)
while my_index < total_len:
    row = df.iloc[my_index]
    next_index = 1
    # Skip the following rows for as long as they match the current row.
    while my_index + next_index < total_len:
        second_row = df.iloc[my_index + next_index]
        c1 = row.category == second_row.category
        c2 = abs(second_row.start - row.start) <= params['min_distance']
        c3 = abs(second_row.end - row.end) <= params['min_distance']
        if not (c1 and c2 and c3):
            break
        next_index += 1
    indexes.append(my_index)
    my_index += next_index

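The problem is that this solution is not perfect: it sometimes misses a row, because the overlapping row may be several rows earlier rather than the immediately following one.

I am looking for any ideas on how to solve this in a more pandas-friendly way (if one exists).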

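Best answer

The approach here should be something like this:

pandas.groupby by category
agg(Func) on the groupby result
Func should implement the logic for finding the best ranges within a category (sorted search, balanced tree, or anything else)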
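
The steps above can be sketched roughly as follows. This is only one possible reading of the answer, not its author's code: the helper names keep_mask and drop_near_duplicates, the tol parameter, and the column names category, start and end are all assumptions made for illustration. The sketch iterates over the groups instead of calling agg, and it uses a greedy rule: within a category sorted by start, a row is kept unless an already kept row lies within the tolerance on both start and end.

```python
import numpy as np
import pandas as pd


def keep_mask(starts, ends, tol):
    """Return a boolean array marking the rows to keep.

    `starts` must be sorted in ascending order (hypothetical helper).
    """
    keep = np.zeros(len(starts), dtype=bool)
    window = []  # positions of kept rows whose start is still within tol
    for i in range(len(starts)):
        # Kept rows that are too far to the left can never match again.
        window = [j for j in window if starts[i] - starts[j] <= tol]
        # Keep row i only if no kept row in the window also matches on end.
        if not any(abs(ends[i] - ends[j]) <= tol for j in window):
            keep[i] = True
            window.append(i)
    return keep


def drop_near_duplicates(df, tol=5):
    """Keep one row per cluster of near-identical (start, end) per category."""
    df = df.sort_values(['category', 'start', 'end'])
    parts = []
    for _, group in df.groupby('category', sort=False):
        mask = keep_mask(group['start'].to_numpy(), group['end'].to_numpy(), tol)
        parts.append(group[mask])
    return pd.concat(parts) if parts else df


# The three example rows from the question.
sample = pd.DataFrame({
    'category': ['cat1', 'cat1', 'cat2'],
    'start': [10, 12, 10],
    'end': [20, 21, 25],
})
result = drop_near_duplicates(sample)
print(result)  # rows 0 and 2 remain; row 1 is dropped as a near duplicate of row 0
```

Because the window only holds kept rows whose start is within the tolerance, a match that sits several rows earlier is still found, which is the case the adjacent-row loop in the question misses. The inner loop is plain Python, so on around 10 million rows it may still be slow and could need further optimization.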

Regarding "python - Using pandas, how to filter rows that have similar values in two columns", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55455841/
