I have a pandas DataFrame df that looks like this:
student_id category_id count
1 111 10
2 111 5
3 222 8
4 333 5
5 111 6
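It continues like this for 20 million rows.
I want to compute a rating for each student_id. For example, consider category_id 111. This category contains three student_ids: 1, 2 and 5. student_id 1 has a count of 10, student_id 2 has a count of 5, and student_id 5 has a count of 6. The rating of each student_id within a category_id is computed with the following formula: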
(count per student_id / total number of counts per category_id) * 5
For student_id 1 -> 10/21 * 5 = 2.38
For student_id 2 -> 5/21 * 5 = 1.19
For student_id 5 -> 6/21 * 5 = 1.43
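The arithmetic above can be checked with a few lines of plain Python. This is only a minimal sketch of the formula for category_id 111; the names `counts`, `total` and `ratings` are made up for illustration and do not appear in the question's code.

```python
# Counts for category_id 111, keyed by student_id (values from the example table).
counts = {1: 10, 2: 5, 5: 6}

# Total count for the category: 10 + 5 + 6 = 21.
total = sum(counts.values())

# rating = (count for this student_id / total count for the category_id) * 5
ratings = {sid: round(c / total * 5, 2) for sid, c in counts.items()}
print(ratings)  # {1: 2.38, 2: 1.19, 5: 1.43}
```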
Below are the functions I currently use for the calculation:
countPerStudentID = datasetPandas.groupby('student_id').agg(list)
countPerCategoryID = datasetPandas.groupby('category_id').agg(list)

studentIDMap = dict()

def func1(student_id):
    if student_id in studentIDMap:
        return studentIDMap[student_id]
    runningSum = 0
    countList = countPerStudentID.loc[student_id, 'count']
    for count in countList:
        runningSum += count
    studentIDMap[student_id] = runningSum
    return studentIDMap[student_id]

# Similar to the above function
categoryIDMap = dict()

def func2(category_id):
    if category_id in categoryIDMap:
        return categoryIDMap[category_id]
    runningSum = 0
    countList = countPerCategoryID.loc[category_id, 'count']
    for count in countList:
        runningSum += count
    categoryIDMap[category_id] = runningSum
    return categoryIDMap[category_id]
Finally, I call both functions as follows:
# Calculating rating category-wise
rating = []
for index, row in df.iterrows():
    # func2 sums the counts per category_id, func1 per student_id
    totalCountPerCategoryID = func2(row['category_id'])
    totalCountPerStudentID = func1(row['student_id'])
    rating.append((totalCountPerStudentID / totalCountPerCategoryID) * 5)
df['rating'] = rating
Desired output:
student_id category_id count rating
1 111 10 2.38
2 111 5 1.19
3 222 8 5
4 333 5 5
5 111 6 1.43
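Because the dataset is so large, this takes a very long time to run. I would like to know how to optimize this calculation.
Thanks in advance.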
Best answer
You don't need a loop; this is a groupby case:
df['rating'] = df['count']/df.groupby('category_id')['count'].transform('sum') * 5
Output:
student_id category_id count rating
0 1 111 10 2.380952
1 2 111 5 1.190476
2 3 222 8 5.000000
3 4 333 5 5.000000
4 5 111 6 1.428571
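To make the one-liner easier to follow, here is a self-contained sketch that rebuilds the five sample rows and splits the calculation into two steps. The intermediate name `category_total` is introduced only for illustration. The key point is that `transform('sum')` returns a Series with the same index as df, where each row holds the total count of its own category_id, so the division is a plain vectorized column operation rather than a Python loop.

```python
import pandas as pd

# The five sample rows from the question.
df = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5],
    "category_id": [111, 111, 222, 333, 111],
    "count": [10, 5, 8, 5, 6],
})

# One total per row, aligned with df's index:
# rows in category 111 get 21, the row in 222 gets 8, the row in 333 gets 5.
category_total = df.groupby("category_id")["count"].transform("sum")

# Vectorized version of (count / total count per category_id) * 5.
df["rating"] = df["count"] / category_total * 5
print(df)
```

Because every step works on whole columns, the same code applies unchanged to the full 20-million-row frame.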
Regarding "python - How to optimize a function that contains a for loop over a DataFrame with 20 million rows", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/62208118/