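I want to get some basic statistics from a few CSV files without loading the whole file into memory. I do this in two ways: a seemingly "smart" way using pandas, and a casual way using csv. I expected the pandas way to be faster, but the csv way is actually much faster. I would like to know why.
Here is my code: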
import pandas as pd
import csv

movies = pd.read_csv('movies.csv')  # movieId,title,genres
movie_count = movies.shape[0]       # 9742
movieId_min = movies.movieId.min()
movieId_max = movies.movieId.max()
# Map a dense 0..n-1 index to the sparse movieId values, and back
movieId_disperse = movies.movieId.sort_values().to_dict()
movieId_squeeze = {v: k for k, v in movieId_disperse.items()}

def get_ratings_stats():
    gp_by_user = []
    gp_by_movie = [0] * movie_count
    top_rator = (0, 0)  # (idx, value)
    top_rated = (0, 0)  # (idx, value)
    rating_count = 0
    user_count = 0
    last_user = -1
    for row in csv.DictReader(open('ratings.csv')):
        user = int(row['userId'])-1
        movie = movieId_squeeze[int(row['movieId'])]
        if last_user != user:
            last_user = user
            user_count += 1
            gp_by_user += [0]
        rating_count += 1
        gp_by_user[user] += 1
        gp_by_movie[movie] += 1
        top_rator = (user, gp_by_user[user]) if gp_by_user[user] > top_rator[1] else top_rator
        top_rated = (movie, gp_by_movie[movie]) if gp_by_movie[movie] > top_rated[1] else top_rated
    top_rator = (top_rator[0]+1, top_rator[1])
    top_rated = (movieId_disperse[top_rated[0]], top_rated[1])
    return rating_count, top_rator, top_rated
Now if I replace the line:
    for row in csv.DictReader(open('ratings.csv')):
with:
    for chunk in pd.read_csv('ratings.csv', chunksize=1000):
        for _, row in chunk.iterrows():
the code actually becomes 10 times slower.
Here are the timing results:
> %timeit get_ratings_stats() # with csv
325 ms ± 9.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
> %timeit get_ratings_stats() # with pandas
3.45 s ± 67.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
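Any comments on how to make this code better, faster, or more readable would be much appreciated.
Best answer
I think the point is that you shouldn't use pandas if you are then going to treat the big, expensive data structure like a dict. The question shouldn't be how to make pandas better at that, but how to write your code with pandas so that it does what you want.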
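As a small illustration of that point, the sketch below counts ratings per user in two ways: row by row with iterrows(), which constructs a new Series object for every row, and with a single column-wise call. The sample DataFrame and the variable names are made up for the example and only stand in for ratings.csv.

```python
import pandas as pd

# Hypothetical sample data standing in for ratings.csv
ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3],
    "movieId": [10, 20, 30, 10, 30, 30],
})

# Dict-style access: iterrows() builds a new Series for every row,
# which is where most of the per-row overhead comes from
loop_counts = {}
for _, row in ratings.iterrows():
    user = int(row["userId"])
    loop_counts[user] = loop_counts.get(user, 0) + 1

# Column-style access: one vectorized call over the whole column
vector_counts = ratings["userId"].value_counts().to_dict()

print(loop_counts == vector_counts)
```

Both approaches produce the same counts; the difference is that the second one does its looping inside pandas rather than in Python.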
import pandas as pd

def get_ratings_stats():
    movie_rating_data = pd.read_csv('ratings.csv')
    # Get the movie with the best rating
    top_movie = movie_rating_data.loc[:, ['movieId', 'rating']].groupby('movieId').agg('max').sort_values(by='rating', ascending=False).iloc[:, 0]
    # Get the user with the best rating
    top_user = movie_rating_data.loc[:, ['userId', 'rating']].groupby('userId').agg('max').sort_values(by='rating', ascending=False).iloc[:, 0]
    return movie_rating_data.shape[0], top_movie, top_user

def get_ratings_stats_slowly():
    # Collect per-chunk aggregates in lists and concatenate once at the end
    # (DataFrame.append was removed in pandas 2.0)
    movie_parts = []
    user_parts = []
    data_size = 0
    for chunk in pd.read_csv('ratings.csv', chunksize=1000):
        # reset_index() keeps movieId/userId as ordinary columns for the final groupby
        movie_parts.append(chunk.loc[:, ['movieId', 'rating']].groupby('movieId').agg('max').reset_index())
        user_parts.append(chunk.loc[:, ['userId', 'rating']].groupby('userId').agg('max').reset_index())
        data_size += chunk.shape[0]
    movies = pd.concat(movie_parts, ignore_index=True)
    users = pd.concat(user_parts, ignore_index=True)
    top_movie = movies.loc[:, ['movieId', 'rating']].groupby('movieId').agg('max').sort_values(by='rating', ascending=False).iloc[:, 0]
    top_user = users.loc[:, ['userId', 'rating']].groupby('userId').agg('max').sort_values(by='rating', ascending=False).iloc[:, 0]
    return data_size, top_movie, top_user
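I'm not entirely sure this is what you are trying to do, since your code is hard to follow, but this should be a good starting point. You can replace .agg('max') with .count() if you are interested in the number of ratings, for example.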
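Since the question is really about counts (which user rated the most, which movie received the most ratings) and about not holding the whole file in memory, a minimal sketch along those lines might look like the following. It assumes the file has userId and movieId columns as in the question; the function name count_ratings_in_chunks, the in-memory sample data, and the use of collections.Counter to merge per-chunk counts are illustrative choices, not part of the original answer.

```python
from collections import Counter
import io

import pandas as pd


def count_ratings_in_chunks(source, chunksize=1000):
    """Count ratings per user and per movie, reading the CSV chunk by chunk."""
    per_user, per_movie = Counter(), Counter()
    rating_count = 0
    # usecols limits parsing to the two columns the counts actually need
    reader = pd.read_csv(source, usecols=["userId", "movieId"], chunksize=chunksize)
    for chunk in reader:
        # value_counts() is vectorized; Counter.update() adds to the running totals
        per_user.update(chunk["userId"].value_counts().to_dict())
        per_movie.update(chunk["movieId"].value_counts().to_dict())
        rating_count += len(chunk)
    top_rater = per_user.most_common(1)[0]   # (userId, number of ratings)
    top_rated = per_movie.most_common(1)[0]  # (movieId, number of ratings)
    return rating_count, top_rater, top_rated


# Hypothetical sample with the same header as ratings.csv
sample = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,10,4.0,0\n"
    "1,20,3.5,0\n"
    "1,30,5.0,0\n"
    "2,10,2.0,0\n"
    "2,30,4.5,0\n"
    "3,30,1.0,0\n"
)
result = count_ratings_in_chunks(sample, chunksize=2)
print(result)
```

In real use the first argument would be the path 'ratings.csv'. Each chunk is counted with a single vectorized call, so the Python-level work grows with the number of distinct users and movies per chunk rather than with the number of rows.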
Regarding "python - How to efficiently perform row-wise operations using pandas?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/53894014/