Given a DataFrame df1 that maps IDs to names:
           id
names
a      535159
b      248909
c      548731
d      362555
e      398829
f      688939
g      674128
and a second DataFrame, df2, which contains lists of names:
       names  foo
0  [a, b, c]    9
1     [d, e]   16
2        [f]    2
3        [g]    3
What is a vectorized way to retrieve the ID from df1 for every list item in each row, like this?
       names  foo                       ids
0  [a, b, c]    9  [535159, 248909, 548731]
1     [d, e]   16          [362555, 398829]
2        [f]    2                  [688939]
3        [g]    3                  [674128]
Here is a working approach that achieves the same result using apply:
import pandas as pd
import numpy as np
mock_uids = np.random.randint(100000, 999999, size=7)
df1=pd.DataFrame({'id':mock_uids, 'names': ['a','b','c','d','e','f','g'] })
df2=pd.DataFrame({'names':[['a','b','c'],['d','e'],['f'],['g']],'foo':[9,16,2,3]})
df1 = df1.set_index('names')
def with_apply(row):
    row['ids'] = [ df1.loc[name]['id'] for name in row['names'] ]
    return row
df2 = df2.apply(with_apply, axis=1)
Best answer
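I think vectorizing this is really hard. One idea to improve performance is to map through a dictionary. The version with if y in d keeps working when a name has no match in the dictionary: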
df1 = df1.set_index('names')
d = df1['id'].to_dict()
df2['ids2'] = [[d[y] for y in x if y in d] for x in df2['names']]
If all values match:
d = df1['id'].to_dict()
df2['ids2'] = [[d[y] for y in x] for x in df2['names']]
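To make the difference between the two variants concrete, here is a minimal sketch. It reuses the first three IDs from the question and assumes a hypothetical name 'z' that is not present in df1; the guarded comprehension skips it, while the unguarded one raises KeyError.

```python
import pandas as pd

# First three IDs copied from the question's df1 table.
df1 = pd.DataFrame(
    {"id": [535159, 248909, 548731], "names": ["a", "b", "c"]}
).set_index("names")

# 'z' is a hypothetical name with no entry in df1 (an assumption for this demo).
df2 = pd.DataFrame({"names": [["a", "b"], ["c", "z"]], "foo": [9, 16]})

# Plain dict lookup: name -> id.
d = df1["id"].to_dict()

# Guarded variant: names missing from the dictionary are silently skipped.
df2["ids2"] = [[d[y] for y in x if y in d] for x in df2["names"]]

# Unguarded variant: a missing name raises KeyError.
try:
    [[d[y] for y in x] for x in df2["names"]]
    raised = False
except KeyError:
    raised = True

print(df2)
print("KeyError raised without the guard:", raised)
```

Note that the guarded variant drops unknown names without any warning, so the resulting list can be shorter than the input list.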
Test with 4k rows:
np.random.seed(2020)
mock_uids = np.random.randint(100000, 999999, size=7)
df1=pd.DataFrame({'id':mock_uids, 'names': ['a','b','c','d','e','f','g'] })
df2=pd.DataFrame({'names':[['a','b','c'],['d','e'],['f'],['g']],'foo':[9,16,2,3]})
df2 = pd.concat([df2] * 1000, ignore_index=True)
df1 = df1.set_index('names')
def with_apply(row):
    row['ids'] = [ df1.loc[name]['id'] for name in row['names'] ]
    return row
In [8]: %%timeit
...: df2.apply(with_apply, axis=1)
...:
928 ms ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [9]: %%timeit
...: d = df1['id'].to_dict()
...: df2['ids2'] = [[d[y] for y in x if y in d] for x in df2['names']]
...:
4.25 ms ± 47.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [10]: %%timeit
...: df2['ids3'] = list(df1.loc[name]['id'].values for name in df2['names'])
...:
...:
1.66 s ± 19.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
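For readers who still want to stay inside pandas operations, one possible alternative (not part of the answer above, and not benchmarked here) is to flatten the lists with Series.explode, map the names through df1, and regroup by the original row index. This is only a sketch using the IDs from the question; whether it beats the dictionary comprehension would need to be measured on real data.

```python
import pandas as pd

# IDs copied from the question's df1 table.
df1 = pd.DataFrame(
    {
        "id": [535159, 248909, 548731, 362555, 398829, 688939, 674128],
        "names": ["a", "b", "c", "d", "e", "f", "g"],
    }
).set_index("names")

df2 = pd.DataFrame(
    {"names": [["a", "b", "c"], ["d", "e"], ["f"], ["g"]], "foo": [9, 16, 2, 3]}
)

# One row per list element; the original row label is repeated in the index.
exploded = df2["names"].explode()

# Look up each name in df1 (Series.map accepts a Series and matches on its index).
ids = exploded.map(df1["id"])

# Collect the IDs back into one list per original row.
df2["ids"] = ids.groupby(level=0).agg(list)

print(df2)
```

Unlike the guarded comprehension, this sketch turns unmatched names into NaN rather than dropping them, so it behaves differently when df1 is incomplete.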
Regarding python - a vectorized way to map lists from one DataFrame row to another, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/65229666/