My DataFrame:
import pandas as pd

df = pd.DataFrame({'company': ['Chipotle', 'Branchburg Chipotle', 'Chipotle NJ', 'Chipotle 8853', 'The Home Depot', 'Home Depot', '28211 Home Depot', 'Wendys', 'BJs', 'Buffalo wings'],
                   'address': ['123 Main Street Branchburg NJ 08853',
                               '123 Main Street Branchburg NJ 08853',
                               '123 Main Street Branchburg NJ 08853',
                               '123 Main Street Branchburg NJ 08853',
                               '1220 N Wendover Rd Charlotte NC 28211',
                               '1220 N Wendover Rd Charlotte NC 28211',
                               '1220 N Wendover Rd Charlotte NC 28211',
                               '2805 Whitson St Selma CA 93662',
                               '2805 Whitson St Selma CA 93662',
                               '2805 Whitson St Selma CA 93662']})
               company                                address
0             Chipotle    123 Main Street Branchburg NJ 08853
1  Branchburg Chipotle    123 Main Street Branchburg NJ 08853
2          Chipotle NJ    123 Main Street Branchburg NJ 08853
3        Chipotle 8853    123 Main Street Branchburg NJ 08853
4       The Home Depot  1220 N Wendover Rd Charlotte NC 28211
5           Home Depot  1220 N Wendover Rd Charlotte NC 28211
6     28211 Home Depot  1220 N Wendover Rd Charlotte NC 28211
7               Wendys         2805 Whitson St Selma CA 93662
8                  BJs         2805 Whitson St Selma CA 93662
9        Buffalo wings         2805 Whitson St Selma CA 93662
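I need to group by address, find the words that are common to every company name in the group, and write how many there are into a new column, "count". For the first address the common word is chipotle, so the count is 1; for the second address the common words are home depot, so the count is 2; for the third address there is no common word, so the count is 0.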
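To make "common words" concrete, here is a minimal sketch for a single address group using plain Python sets. The helper name common_words is hypothetical, and the comparison is assumed to be case-sensitive and split on whitespace, which the question does not spell out.

```python
from functools import reduce

def common_words(names):
    """Return the set of words that appear in every name (hypothetical helper)."""
    # One set of whitespace-separated words per company name
    word_sets = [set(name.split()) for name in names]
    # Intersect them all; this assumes `names` is non-empty
    return reduce(set.intersection, word_sets)

chipotle = ['Chipotle', 'Branchburg Chipotle', 'Chipotle NJ', 'Chipotle 8853']
home_depot = ['The Home Depot', 'Home Depot', '28211 Home Depot']
selma = ['Wendys', 'BJs', 'Buffalo wings']

for group in (chipotle, home_depot, selma):
    shared = common_words(group)
    # The count the question asks for is just the size of the intersection
    print(sorted(shared), len(shared))
```

The three groups give counts of 1, 2 and 0, matching the description above.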
Expected output
          company                                address  count
0        Chipotle    123 Main Street Branchburg NJ 08853      1
1  The Home Depot  1220 N Wendover Rd Charlotte NC 28211      2
2          Wendys         2805 Whitson St Selma CA 93662      0
I could iterate over the DataFrame and use set intersection, but that is too slow. Is there a pandas way to do this?
Best answer
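from functools import reduce
import operator

def log(x):
    # Intersect the word sets of every company name in the group
    inters = reduce(operator.and_, [set(r) for r in x.str.split()])
    if inters:
        # Sets are unordered, so the joined words may not keep their original order
        return (' '.join(inters), len(inters))
    # No common word: fall back to the first company name with a count of 0
    return (x.iloc[0], 0)

df.groupby('address').agg(log).company.apply(pd.Series).rename({0: 'company', 1: 'count'}, axis=1)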
                                          company  count
address
1220 N Wendover Rd Charlotte NC 28211  Home Depot      2
123 Main Street Branchburg NJ 08853      Chipotle      1
2805 Whitson St Selma CA 93662             Wendys      0
On pandas 0.20, where rename does not accept the axis argument, end the chain with this instead:
.rename(columns={0: 'company', 1: 'count'})
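If the goal is the exact layout from the question (first company name per address, original row order, a plain integer count), one possible variant is sketched below. It is not part of the answer above: the helper count_common is a hypothetical name, and the named-aggregation syntax it relies on assumes pandas 0.25 or later.

```python
import pandas as pd

df = pd.DataFrame({
    'company': ['Chipotle', 'Branchburg Chipotle', 'Chipotle NJ', 'Chipotle 8853',
                'The Home Depot', 'Home Depot', '28211 Home Depot',
                'Wendys', 'BJs', 'Buffalo wings'],
    'address': ['123 Main Street Branchburg NJ 08853'] * 4
               + ['1220 N Wendover Rd Charlotte NC 28211'] * 3
               + ['2805 Whitson St Selma CA 93662'] * 3,
})

def count_common(names):
    """Number of words shared by every company name in one group (hypothetical helper)."""
    word_sets = [set(name.split()) for name in names]
    return len(set.intersection(*word_sets))

result = (
    df.groupby('address', sort=False)   # sort=False keeps first-appearance order
      .agg(company=('company', 'first'), count=('company', count_common))
      .reset_index()[['company', 'address', 'count']]
)
print(result)
```

Each aggregation here returns a scalar, so there is no tuple to unpack afterwards with apply(pd.Series). The Python-level function still runs once per group, so this is tidier rather than fundamentally faster.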
Regarding "python - Pandas group by and find the number of common strings", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/51310599/