For some reason, RandomForestClassifier.fit from sklearn.ensemble
uses only 2.5 GB of RAM on my local machine but almost 7 GB on my server, with exactly the same training set.
The code (imports omitted) is roughly this:
y_train = data_train['train_column']
x_train = data_train.drop('train_column', axis=1)
# Difference in memory consumption starts here
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf = clf.fit(x_train, y_train)
preds = clf.predict(data_test)
My local machine is a MacBook Pro with 16 GB of RAM and a 4-core CPU; my server is an Ubuntu droplet on DigitalOcean with 8 GB of RAM and a 4-core CPU.
The sklearn version is 0.18 and the Python version is 3.5.2.
I can't even imagine a possible cause; any help would be greatly appreciated.
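Not from the original post, but one way to pin down where the two machines diverge is to log the process's peak memory around the fit call. A stdlib-only sketch (Unix only, since the `resource` module is unavailable on Windows):

```python
import resource
import sys

def peak_rss_mib():
    """Peak resident set size of the current process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS but in kibibytes on Linux
    if sys.platform == "darwin":
        peak //= 1024
    return peak / 1024.0

print("peak RSS before fit: %.1f MiB" % peak_rss_mib())
# clf.fit(x_train, y_train) would run here
print("peak RSS after fit:  %.1f MiB" % peak_rss_mib())
```

Running this on both machines around the same fit call would show whether the extra memory is allocated during tree building or already present beforehand (e.g. from loading the data).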
Update
The memory error occurs in this code inside the fit
method:
# Parallel loop: we use the threading backend as the Cython code
# for fitting the trees is internally releasing the Python GIL
# making threading always more efficient than multiprocessing in
# that case.
trees = Parallel(n_jobs=self.n_jobs, verbose=self.verbose,
backend="threading")(
delayed(_parallel_build_trees)(
t, self, X, y, sample_weight, i, len(trees),
verbose=self.verbose, class_weight=self.class_weight)
for i, t in enumerate(trees))
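The comment in that snippet is the key detail: the Cython tree builder releases the GIL, so joblib uses threads, which share one address space, rather than processes. A stdlib analogue of that fan-out, with a hypothetical `build_tree` standing in for `_parallel_build_trees`:

```python
from concurrent.futures import ThreadPoolExecutor

def build_tree(seed):
    # Stand-in for sklearn's _parallel_build_trees; the real work happens
    # in Cython with the GIL released, so threads run truly in parallel.
    return seed * seed

# Threads share the process's memory, so X and y are not copied per
# worker, unlike a multiprocessing backend.
with ThreadPoolExecutor(max_workers=4) as pool:
    trees = list(pool.map(build_tree, range(8)))
print(trees)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

Because the training data is shared across threads, the per-machine memory difference is unlikely to come from the parallel loop itself; the trees each worker builds are the dominant allocation.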
Update 2
Information about my system:
# local
Darwin-16.1.0-x86_64-i386-64bit
Python 3.5.2 (default, Oct 11 2016, 05:05:28)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.38)]
NumPy 1.11.2
SciPy 0.18.1
Scikit-Learn 0.18
# server
Linux-3.13.0-57-generic-x86_64-with-Ubuntu-16.04-xenial
Python 3.5.1 (default, Dec 18 2015, 00:00:00)
[GCC 4.8.4]
NumPy 1.11.2
SciPy 0.18.1
Scikit-Learn 0.18
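The dumps above can be produced on both machines with a short script; a sketch using only the stdlib plus whichever of the three libraries happen to be importable:

```python
import platform
import sys

def env_report():
    """Platform, interpreter, and library versions, one per line."""
    lines = [platform.platform(),
             "Python " + sys.version.replace("\n", " ")]
    for name in ("numpy", "scipy", "sklearn"):
        try:
            mod = __import__(name)
            lines.append("%s %s" % (name, getattr(mod, "__version__", "?")))
        except ImportError:
            lines.append("%s not installed" % name)
    return "\n".join(lines)

print(env_report())
```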
And my numpy configuration:
# server
>>> np.__config__.show()
blas_opt_info:
libraries = ['openblas', 'openblas']
define_macros = [('HAVE_CBLAS', None)]
library_dirs = ['/usr/local/lib']
language = c
openblas_info:
libraries = ['openblas', 'openblas']
define_macros = [('HAVE_CBLAS', None)]
library_dirs = ['/usr/local/lib']
language = c
lapack_opt_info:
libraries = ['openblas', 'openblas']
define_macros = [('HAVE_CBLAS', None)]
library_dirs = ['/usr/local/lib']
language = c
blas_mkl_info:
NOT AVAILABLE
openblas_lapack_info:
libraries = ['openblas', 'openblas']
define_macros = [('HAVE_CBLAS', None)]
library_dirs = ['/usr/local/lib']
language = c
# local
>>> np.__config__.show()
blas_opt_info:
extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
define_macros = [('NO_ATLAS_INFO', 3), ('HAVE_CBLAS', None)]
extra_compile_args = ['-msse3', '-I/System/Library/Frameworks/vecLib.framework/Headers']
blas_mkl_info:
NOT AVAILABLE
atlas_threads_info:
NOT AVAILABLE
lapack_mkl_info:
NOT AVAILABLE
openblas_lapack_info:
NOT AVAILABLE
atlas_info:
NOT AVAILABLE
atlas_3_10_blas_info:
NOT AVAILABLE
lapack_opt_info:
extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
define_macros = [('NO_ATLAS_INFO', 3), ('HAVE_CBLAS', None)]
extra_compile_args = ['-msse3']
openblas_info:
NOT AVAILABLE
atlas_3_10_blas_threads_info:
NOT AVAILABLE
atlas_3_10_threads_info:
NOT AVAILABLE
atlas_3_10_info:
NOT AVAILABLE
atlas_blas_threads_info:
NOT AVAILABLE
atlas_blas_info:
NOT AVAILABLE
The repr of the clf
object is identical on both machines:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=100, n_jobs=1, oob_score=False, random_state=42,
verbose=0, warm_start=False)
Best Answer
One possible explanation is that your server is running an older scikit-learn. Until not long ago, it was a known problem that sklearn random forests were very memory-hungry; if I remember correctly, this was fixed in 0.17.
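Not part of the original answer, but that hypothesis is easy to check at runtime by comparing the installed version against 0.17. A minimal sketch with a hand-rolled `at_least` helper (hypothetical, not a library function; assumes a plain numeric version string):

```python
def at_least(version, minimum):
    """Compare dotted numeric version strings: at_least('0.18', '0.17') is True."""
    parse = lambda v: tuple(int(p) for p in v.split("."))
    return parse(version) >= parse(minimum)

try:
    import sklearn
    ok = at_least(sklearn.__version__.split("+")[0], "0.17")
    print("scikit-learn %s, >= 0.17: %s" % (sklearn.__version__, ok))
except ImportError:
    print("scikit-learn is not installed")
```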
Regarding python - RandomForestClassifier.fit using different amounts of RAM on different machines, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/40293169/