pandas - One-hot encoding for large datasets

Tags: pandas scikit-learn one-hot-encoding apriori mlxtend

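I want to build a recommender system from association rules, using the apriori algorithm implemented in the mlxtend library. My sales data covers about 36 million transactions and 50,000 unique products. I tried sklearn's OneHotEncoder and pandas get_dummies(), but both fail with an out-of-memory error because they cannot allocate a frame of shape (36 million, 50k):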

MemoryError: Unable to allocate 398. GiB for an array with shape (36113798, 50087) and data type uint8

Is there any other way to do this?

Best answer

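Like you, I initially ran into out-of-memory errors with mlxtend, but the small change below solved the problem completely: ask TransactionEncoder for a sparse matrix and wrap it in a sparse DataFrame.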

    from mlxtend.preprocessing import TransactionEncoder
    import pandas as pd

    # itemSetList: a list of transactions, each a list of product identifiers
    te = TransactionEncoder()

    # Dense version that ran out of memory:
    # te_ary = te.fit(itemSetList).transform(itemSetList)
    # df = pd.DataFrame(te_ary, columns=te.columns_)

    # Sparse version: transform() returns a SciPy sparse matrix, not a dense array
    fitted = te.fit(itemSetList)
    te_ary = fitted.transform(itemSetList, sparse=True)
    df = pd.DataFrame.sparse.from_spmatrix(te_ary, columns=te.columns_)

    # now you can call mlxtend's fpgrowth() followed by association_rules()
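
The snippet above assumes that itemSetList already exists. As a minimal sketch, assuming the sales data sits in a long-format DataFrame with one row per purchased item, the list could be built with a groupby. The column names `order_id` and `product` and the toy rows are hypothetical and are not taken from the question.

```python
import pandas as pd

# Hypothetical long-format sales table: one row per (order, product) pair.
sales = pd.DataFrame(
    {
        "order_id": [1, 1, 2, 3, 3, 3],
        "product": ["milk", "bread", "milk", "bread", "butter", "milk"],
    }
)

# Collect the products of each order into one list per transaction.
# groupby sorts by order_id and keeps the row order inside each group.
item_set_list = sales.groupby("order_id")["product"].apply(list).tolist()

print(item_set_list)
```

The result is a plain list of lists, which is the input shape that TransactionEncoder.fit() and transform() expect.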

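You should also use fpgrowth instead of apriori on large transaction datasets, because apriori is too primitive for data of this size. fpgrowth is a smarter, more modern algorithm than apriori, yet it returns the same frequent itemsets. The mlxtend library supports both apriori and fpgrowth.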

Source: a similar question on Stack Overflow about pandas and one-hot encoding for large datasets: https://stackoverflow.com/questions/64136633/
