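python - Bottleneck in an item-based collaborative filter using a Pandas dataframe and nested loops in Python

Tags: python performance pandas combinations collaborative-filtering

I have an input dataset (CSV format) with 100246 rows and 7 columns. It is movie rating data taken from http://grouplens.org/datasets/movielens/. The head of my dataframe is: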

In [5]: df.head()
Out[5]: 
   movieId                                       genres  userId      rating  \
0        1  Adventure|Animation|Children|Comedy|Fantasy       1       5   
1        1  Adventure|Animation|Children|Comedy|Fantasy       2       3   
2        1  Adventure|Animation|Children|Comedy|Fantasy       5       4   
3        1  Adventure|Animation|Children|Comedy|Fantasy       6       4   
4        1  Adventure|Animation|Children|Comedy|Fantasy       8       3   

 imdbId       title  relDate  
0  114709  Toy Story      1995  
1  114709  Toy Story      1995  
2  114709  Toy Story      1995  
3  114709  Toy Story      1995  
4  114709  Toy Story      1995

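Using this dataset, I compute a similarity score between each pair of movies from the Euclidean distance between user ratings (i.e., if a sample of users rates two movies similarly, the two movies are highly correlated). Currently this is done by iterating over all movie pairs and using an if statement to pick out only those pairs that contain the movie of current interest: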

  from itertools import combinations

  movie_ids = df['movieId'].unique()  # compute the unique ids once, not on every iteration
  for i, item in enumerate(movie_ids):
      for j, item_comb in enumerate(combinations(movie_ids, 2)):
          if item in item_comb:
              # calculate the similarity score between item i and the other item in item_comb
              pass

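However, given that there are 8927 distinct movies in the dataset, the number of pairs is about 40M. This is a major bottleneck. So my question is: what are some ways to speed up my code?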
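To put numbers on the bottleneck: the pair count follows from C(8927, 2), and because the inner loop regenerates every pair for each movie, the nested version takes 8927 times that many steps. The snippet below is only a sketch of that arithmetic plus a single-pass alternative; `pairs_once` is a hypothetical helper name, not something from the original code.

```python
from itertools import combinations
from math import comb  # Python 3.8+

n_movies = 8927                 # distinct movies, as stated in the question
n_pairs = comb(n_movies, 2)     # 8927 * 8926 / 2 = 39,841,201 unordered pairs
# The nested loop walks all pairs once per movie:
nested_iterations = n_movies * n_pairs


def pairs_once(movie_ids):
    """Hypothetical helper: yield every unordered pair exactly once."""
    return combinations(movie_ids, 2)


# Toy check with four ids: each pair appears once, so a single pass can
# score pair (a, b) and record the result for both a and b.
toy_pairs = list(pairs_once([1, 2, 3, 4]))
```

Iterating over `pairs_once(...)` a single time already removes the factor of 8927, although roughly 40M Python-level iterations may still be slow.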

Best answer

According to this link (collaborative-filtering scalability), MongoDB can apparently be used to apply collaborative filtering to very large datasets.

Spark (collaborative-filter with Apache Spark) may also be suitable.
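Before reaching for a separate system, it may be worth checking whether the pairwise distances can be computed with matrix operations instead of a Python loop. The following is a minimal sketch, not the asker's method: the function name `movie_similarity`, the `min_common` parameter, and the `1 / (1 + distance)` similarity are assumptions made for illustration; only the column names `userId`, `movieId` and `rating` come from the dataframe in the question. The distance for each pair is taken over the users who rated both movies.

```python
import numpy as np
import pandas as pd


def movie_similarity(df, min_common=1):
    """Pairwise movie similarity from Euclidean distance over common raters.

    Assumption: similarity = 1 / (1 + distance); pairs with fewer than
    `min_common` shared raters get NaN.
    """
    # users x movies matrix; NaN where a user did not rate a movie
    ratings = df.pivot_table(index="userId", columns="movieId", values="rating")
    X = ratings.to_numpy(dtype=float)
    mask = ~np.isnan(X)
    R = np.where(mask, X, 0.0)   # ratings with missing entries set to 0
    M = mask.astype(float)       # 1.0 where a rating exists

    # For movies i, j, summed over users who rated both:
    #   sum (r_ui - r_uj)^2 = sum r_ui^2 m_uj + sum m_ui r_uj^2 - 2 sum r_ui r_uj
    sq = (R ** 2).T @ M
    d2 = np.maximum(sq + sq.T - 2.0 * (R.T @ R), 0.0)  # clip rounding noise
    n_common = M.T @ M           # number of users who rated both movies

    sim = np.where(n_common >= min_common, 1.0 / (1.0 + np.sqrt(d2)), np.nan)
    return pd.DataFrame(sim, index=ratings.columns, columns=ratings.columns)


# Toy data: movies 10 and 20 share users 1 and 2; movies 10 and 30 share nobody.
toy = pd.DataFrame({
    "userId":  [1, 2, 1, 2, 3, 3],
    "movieId": [10, 10, 20, 20, 20, 30],
    "rating":  [5, 3, 4, 3, 2, 5],
})
sim = movie_similarity(toy)
```

With 8927 movies each movie-by-movie matrix has about 79.7M entries, roughly 0.6 GB as float64, and several are alive at once, so this only helps if that fits in memory; otherwise the columns could be processed in blocks.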

A similar question on Stack Overflow: https://stackoverflow.com/questions/33739282/
