I have a large CSV file that is a log of caller data.
A short snippet of my file:
CompanyName High Priority QualityIssue
Customer1 Yes User
Customer1 Yes User
Customer2 No User
Customer3 No Equipment
Customer1 No Neither
Customer3 No User
Customer3 Yes User
Customer3 Yes Equipment
Customer4 No User
I want to sort the entire list by how often each customer occurs, so it would look like:
CompanyName High Priority QualityIssue
Customer3 No Equipment
Customer3 No User
Customer3 Yes User
Customer3 Yes Equipment
Customer1 Yes User
Customer1 Yes User
Customer1 No Neither
Customer2 No User
Customer4 No User
I tried groupby, but that only prints the company name and its frequency, not the other columns. I also tried
df['Totals']= [sum(df['CompanyName'] == df['CompanyName'][i]) for i in xrange(len(df))]
and
df = [sum(df['CompanyName'] == df['CompanyName'][i]) for i in xrange(len(df))]
but these give me the error:
ValueError: The wrong number of items passed 1, indices imply 24
I have seen things like this:
for key, value in sorted(mydict.iteritems(), key=lambda (k,v): (v,k)):
    print "%s: %s" % (key, value)
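but this only prints two columns, and I want to sort my entire CSV. My output should be the whole CSV, sorted by the first column.
Thanks in advance for your help!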
Best answer
This seems to do what you want: add a count column by performing a groupby and a transform with value_counts, and then sort on that column:
df['count'] = df.groupby('CompanyName')['CompanyName'].transform(pd.Series.value_counts)
df.sort_values('count', ascending=False)
Output:
CompanyName HighPriority QualityIssue count
5 Customer3 No User 4
3 Customer3 No Equipment 4
7 Customer3 Yes Equipment 4
6 Customer3 Yes User 4
0 Customer1 Yes User 3
4 Customer1 No Neither 3
1 Customer1 Yes User 3
8 Customer4 No User 1
2 Customer2 No User 1
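For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same idea. It makes a few assumptions that are not in the answer above: the sample rows are typed in by hand instead of being read from the CSV, the built-in "count" aggregation is swapped in for pd.Series.value_counts, and kind="mergesort" is passed so that rows with equal counts keep their original relative order.

```python
import pandas as pd

# Sample rows copied from the question (assumption: the real data
# would come from pd.read_csv on the actual file).
df = pd.DataFrame({
    "CompanyName": ["Customer1", "Customer1", "Customer2", "Customer3",
                    "Customer1", "Customer3", "Customer3", "Customer3",
                    "Customer4"],
    "HighPriority": ["Yes", "Yes", "No", "No", "No", "No", "Yes", "Yes", "No"],
    "QualityIssue": ["User", "User", "User", "Equipment", "Neither",
                     "User", "User", "Equipment", "User"],
})

# Per-row frequency of the row's company; transform keeps the
# original index, so the result lines up with df row by row.
df["count"] = df.groupby("CompanyName")["CompanyName"].transform("count")

# mergesort is a stable sort, so ties keep their original order.
sorted_df = df.sort_values("count", ascending=False, kind="mergesort")
print(sorted_df)
```

With a stable sort the rows inside each customer stay in file order, which is what the desired output in the question shows.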
You can use df.drop to remove the extra column:
df.drop('count', axis=1)
Output:
CompanyName HighPriority QualityIssue
5 Customer3 No User
3 Customer3 No Equipment
7 Customer3 Yes Equipment
6 Customer3 Yes User
0 Customer1 Yes User
4 Customer1 No Neither
1 Customer1 Yes User
8 Customer4 No User
2 Customer2 No User
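If you would rather not create and drop a helper column at all, one possible variation is to compute the frequencies inside the sort itself. This is only a sketch, not the method from the answer above: it assumes pandas 1.1 or newer (where sort_values accepts a key argument) and uses a small hand-typed frame in place of the real CSV.

```python
import pandas as pd

# Small hand-typed sample (assumption: stands in for the real CSV).
df = pd.DataFrame({
    "CompanyName": ["Customer1", "Customer2", "Customer3",
                    "Customer1", "Customer3", "Customer3"],
    "QualityIssue": ["User", "User", "Equipment",
                     "Neither", "User", "User"],
})

# The key function replaces each company name with its frequency
# for sorting purposes only; no extra column is added to the frame.
result = df.sort_values(
    "CompanyName",
    key=lambda names: names.map(names.value_counts()),
    ascending=False,
    kind="mergesort",  # stable, so equal counts keep file order
)
print(result)
```

The result can then be written back out with result.to_csv(..., index=False) if a sorted CSV file is what you need.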
Regarding python - sorting an entire CSV by frequency of occurrence in one column, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/30787391/