python - Improving pandas DataFrame performance

Tags: python python-3.x pandas performance

I am trying to encode the person_id values. First I build a dictionary that stores the person_id values, then I write the encoded values into a new column. Processing about 70K rows takes roughly 25 minutes.

Dataset: https://www.kaggle.com/gspmoreira/articles-sharing-reading-from-cit-deskdrop

import pandas as pd

interactions_df = pd.read_csv('./users_interactions.csv')

# Build a dict mapping an integer code to each unique personId
personId_map = {}
personId_len = range(0, len(set(interactions_df['personId'])))

for i in zip(personId_len, set(interactions_df['personId'])):
    personId_map[i[0]] = i[1]

Run:

%%time

def transform_person_id(row):
    # Reverse lookup: scan the whole dict for every row to find the code for this personId
    if row['personId'] in personId_map.values():
        return int([k for k, v in personId_map.items() if v == row['personId']][0])

interactions_df['new_personId'] = interactions_df.apply(lambda x: transform_person_id(x), axis=1)
interactions_df.head(3)

Elapsed time:

CPU times: user 25min 46s, sys: 1.58 s, total: 25min 48s
Wall time: 25min 50s

How can I optimize the code above?

Best answer

If there is no special ordering requirement, use factorize:

interactions_df['new_personId'] = pd.factorize(interactions_df.personId)[0]

If you also need the dictionary:

i, v = pd.factorize(interactions_df.personId)
personId_map = dict(zip(i, v[i]))
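
If you still want the dictionary-based lookup style of the original code, a vectorized Series.map avoids the per-row reverse scan. This is a minimal sketch, assuming the same users_interactions.csv file; inverse_map is just an illustrative helper name:

import pandas as pd

interactions_df = pd.read_csv('./users_interactions.csv', usecols=['personId'])

# code -> personId, in order of first appearance (same as factorize)
codes, uniques = pd.factorize(interactions_df['personId'])
personId_map = dict(enumerate(uniques))

# personId -> code, built once and reused for the whole column
inverse_map = {v: k for k, v in personId_map.items()}

# Vectorized lookup instead of DataFrame.apply over every row
interactions_df['new_personId'] = interactions_df['personId'].map(inverse_map)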

Data - the first 20 rows, used for testing:

interactions_df = pd.read_csv('./users_interactions.csv', nrows=20, usecols=['personId'])

#print (interactions_df)

personId_map = {}
personId_len = range(0,len(set(interactions_df['personId'])))

for i in zip(personId_len, set(interactions_df['personId'])):
    personId_map[i[0]] = i[1]

#print (personId_map)

def transform_person_id(row):
    if row['personId'] in personId_map.values():
        return int([k for k,v in personId_map.items() if v == row['personId']][0])

interactions_df['new_personId'] = interactions_df.apply(lambda x: transform_person_id(x), axis=1)
interactions_df['new_personId1'] = pd.factorize(interactions_df.personId)[0]
print (interactions_df)
               personId  new_personId  new_personId1
0  -8845298781299428018             3              0
1  -1032019229384696495             5              1
2  -1130272294246983140             9              2
3    344280948527967603             6              3
4   -445337111692715325             0              4
5  -8763398617720485024            10              5
6   3609194402293569455             4              6
7   4254153380739593270             8              7
8    344280948527967603             6              3
9   3609194402293569455             4              6
10  3609194402293569455             4              6
11  1908339160857512799            11              8
12  1908339160857512799            11              8
13  1908339160857512799            11              8
14  7781822014935525018             1              9
15  8239286975497580612             2             10
16  8239286975497580612             2             10
17  -445337111692715325             0              4
18  2766187446275090740             7             11
19  1908339160857512799            11              8
i, v = pd.factorize(interactions_df.personId)
d = dict(zip(i, v[i]))
print (d)
{0: -8845298781299428018, 1: -1032019229384696495, 2: -1130272294246983140, 
 3: 344280948527967603, 4: -445337111692715325, 5: -8763398617720485024, 
 6: 3609194402293569455, 7: 4254153380739593270, 8: 1908339160857512799,
 9: 7781822014935525018, 10: 8239286975497580612, 11: 2766187446275090740}
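
As a side note, a categorical dtype gives similar integer codes in one line. A sketch, assuming the same column, and noting that cat.codes assigns codes in sorted category order rather than order of first appearance:

import pandas as pd

interactions_df = pd.read_csv('./users_interactions.csv', usecols=['personId'])

# Integer codes from a categorical dtype (sorted order, not appearance order)
interactions_df['new_personId2'] = interactions_df['personId'].astype('category').cat.codes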

Performance:

interactions_df = pd.read_csv('./users_interactions.csv', usecols=['personId'])

#print (interactions_df)

In [243]: %timeit interactions_df['new_personId'] = pd.factorize(interactions_df.personId)[0]
2.03 ms ± 15.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
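
For a rough comparison with the original row-wise approach, one option (a sketch, not a measured result) is to time the apply version on a small sample, since running it over the full 70K rows takes minutes:

import timeit
import pandas as pd

sample = pd.read_csv('./users_interactions.csv', usecols=['personId'], nrows=1000)
personId_map = dict(enumerate(pd.factorize(sample['personId'])[1]))

def transform_person_id(row):
    # reverse lookup scans the whole dict for every row -- the slow part
    if row['personId'] in personId_map.values():
        return int([k for k, v in personId_map.items() if v == row['personId']][0])

print(timeit.timeit(lambda: sample.apply(transform_person_id, axis=1), number=1))
print(timeit.timeit(lambda: pd.factorize(sample['personId'])[0], number=1))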

This answer comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/55649690/
