Given this example:

import pandas as pd

df = pd.DataFrame({'col1':['id1','id2','id3'],
                  'col2':['name1','foobar','name3'],
                  'col3':[{'am', 'e1', 'me', 'na'},{'ar', 'ba', 'fo', 'ob', 'oo'},{'am', 'e3', 'me', 'na'}]})

    col1    col2    col3
0   id1     name1   {na, e1, me, am}
1   id2     foobar  {ar, fo, ba, oo, ob}
2   id3     name3   {na, e3, me, am}

The goal is to subset df to all rows whose col3 set meets a matching threshold, measured on its intersection with a second set.

My solution:

def subset_by_intersection_threshold(set_1, set_2, threshold):
    intersection = len(list(set_1.intersection(set_2)))
    union = (len(set_1) + len(set_2)) - intersection
    return float(intersection / union)>threshold

Using this Jaccard function with pandas apply, I filter all rows that meet the condition (a 0.4 match in this example).

set_words=set(['na','me'])

df[df.col3.apply(lambda x: subset_by_intersection_threshold(set(x), set_words,0.4))]
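To make the 0.4 threshold concrete, here is a small worked check of the Jaccard score for each of the three example rows against set_words. The helper name jaccard is an assumption introduced only for illustration; the post's own function returns the comparison result directly rather than the score.

```python
# Hypothetical helper (not in the original post): returns the raw Jaccard
# score instead of the boolean comparison against a threshold.
def jaccard(set_1, set_2):
    intersection = len(set_1 & set_2)
    union = len(set_1) + len(set_2) - intersection
    return intersection / union

set_words = {'na', 'me'}
rows = [
    {'am', 'e1', 'me', 'na'},
    {'ar', 'ba', 'fo', 'ob', 'oo'},
    {'am', 'e3', 'me', 'na'},
]

scores = [jaccard(row, set_words) for row in rows]
print(scores)                     # [0.5, 0.0, 0.5]
print([s > 0.4 for s in scores])  # [True, False, True]
```

Rows 0 and 2 share two elements with set_words out of a union of four, so they score 0.5 and pass; row 1 shares none. Note that this sketch would raise ZeroDivisionError if both sets were empty.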

Since this solution feels somewhat brute-force, I am asking this question to learn about more efficient alternatives in terms of execution time.

Adding benchmark results measured on my personal laptop:

From slowest to fastest:

%timeit df.col3.apply(lambda x: original(set(x), set_words, 0.4))  # 74 ms per loop
%timeit df.col3.apply(lambda x: jpp(x, set_words, 0.4))            # 32.3 ms per loop
%timeit list(map(lambda x: jpp(x, set_words, 0.4), df['col3']))    # 13.9 ms
%timeit [jpp(x, set_words, 0.4) for x in df['col3']]               # 12.2 ms

Best answer

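By avoiding the unnecessary list creation and the float / set conversions, you can improve performance by roughly 2x. For an extra boost, you can index with a list of bool values built with a list comprehension. In general, pd.Series.apply tends to underperform a regular loop inside a list comprehension.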
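As a minimal sketch of the indexing step, assuming the example df from the question: the list comprehension builds one bool per row, and that plain list is passed straight to df[...]. The names mask and result are illustrative only; jpp is the answer's own function, repeated here so the block runs standalone.

```python
import pandas as pd

def jpp(set_1, set_2, threshold):
    intersection = len(set_1 & set_2)
    union = (len(set_1) + len(set_2)) - intersection
    return (intersection / union) > threshold

df = pd.DataFrame({'col1': ['id1', 'id2', 'id3'],
                   'col2': ['name1', 'foobar', 'name3'],
                   'col3': [{'am', 'e1', 'me', 'na'},
                            {'ar', 'ba', 'fo', 'ob', 'oo'},
                            {'am', 'e3', 'me', 'na'}]})
set_words = {'na', 'me'}

# One bool per row, in row order; no Series.apply involved.
mask = [jpp(x, set_words, 0.4) for x in df['col3']]

# A list of bools with the same length as df works as a boolean indexer.
result = df[mask]
print(result['col1'].tolist())  # ['id1', 'id3']
```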

def original(set_1, set_2, threshold):
    intersection = len(list(set_1.intersection(set_2)))
    union = (len(set_1) + len(set_2)) - intersection
    return float(intersection / union)>threshold

def jpp(set_1, set_2, threshold):
    intersection = len(set_1 & set_2)
    union = (len(set_1) + len(set_2)) - intersection
    return (intersection / union) > threshold

set_words = {'na', 'me'}

df = pd.concat([df]*10000)

%timeit df.col3.apply(lambda x: original(set(x), set_words, 0.4))  # 74 ms per loop
%timeit df.col3.apply(lambda x: jpp(x, set_words, 0.4))            # 32.3 ms per loop
%timeit [jpp(x, set_words, 0.4) for x in df['col3']]               # 23.4 ms per loop
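%timeit is an IPython magic, so the lines above only run in IPython or Jupyter. The following is a rough sketch of a comparable measurement with the standard-library timeit module. It is an assumption-laden stand-in rather than the benchmark used in the post: a plain list (col3) replaces the DataFrame column, the row count is reduced to 3,000, and the wrapper names run_original and run_jpp are made up for illustration. Absolute timings will differ from the figures quoted above.

```python
import timeit

def original(set_1, set_2, threshold):
    intersection = len(list(set_1.intersection(set_2)))
    union = (len(set_1) + len(set_2)) - intersection
    return float(intersection / union) > threshold

def jpp(set_1, set_2, threshold):
    intersection = len(set_1 & set_2)
    union = (len(set_1) + len(set_2)) - intersection
    return (intersection / union) > threshold

set_words = {'na', 'me'}

# Plain list standing in for df['col3']; 3,000 rows keeps the run short.
col3 = [{'am', 'e1', 'me', 'na'},
        {'ar', 'ba', 'fo', 'ob', 'oo'},
        {'am', 'e3', 'me', 'na'}] * 1000

def run_original():
    return [original(set(x), set_words, 0.4) for x in col3]

def run_jpp():
    return [jpp(x, set_words, 0.4) for x in col3]

# Both variants must agree before their timings are worth comparing.
assert run_original() == run_jpp()

for fn in (run_original, run_jpp):
    # Best of 3 repeats, 10 calls each, reported per call.
    best = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f"{fn.__name__}: {best * 1e3:.2f} ms per loop")
```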

Source: the Stack Overflow question "python - Performance of selecting rows where the condition is a percentage match with a set": https://stackoverflow.com/questions/51947860/
