python - Why does GridSearchCV in scikit-learn spawn so many threads

Here is the pstree output of the GridSearch I am currently running. I am curious which processes are at work, and there are some things I cannot explain yet.

 ├─bash─┬─perl───20*[bash───python─┬─5*[python───31*[{python}]]]
 │      │                          └─11*[{python}]]
 │      └─tee
 └─bash───pstree

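I removed the irrelevant parts. Curly braces denote threads.

perl appears because I start my python jobs with parallel -j 20. As you can see, 20* does indeed show 20 processes.
The bash process in front of each python process comes from activating the Anaconda virtual environment with source activate venv.
Inside each python process, another 5 python processes (5*) are spawned. That is because I specified n_jobs=5 for GridSearchCV.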
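On Linux, the thread counts that pstree shows in braces can be cross-checked from Python. The following is a minimal sketch, assuming a Linux-style /proc filesystem; count_threads is a hypothetical helper name, not part of any library. Note that /proc/<pid>/task also lists the main thread, so the result should be one higher than the number pstree prints in braces.

```python
import os


def count_threads(pid):
    """Return the number of threads of process `pid`, or None if /proc is unavailable.

    Assumption: a Linux-style /proc filesystem, where every thread of a
    process has its own entry under /proc/<pid>/task (main thread included).
    """
    task_dir = f"/proc/{pid}/task"
    try:
        return len(os.listdir(task_dir))
    except OSError:
        # No /proc (e.g. macOS), or the process has already exited.
        return None


if __name__ == "__main__":
    # Inspect the current interpreter; pass a worker's PID to inspect a worker.
    print(os.getpid(), count_threads(os.getpid()))
```

Passing the PID of one of the GridSearchCV worker processes instead of os.getpid() would show the count for that worker.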

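That is as far as my understanding goes.

Question: can anyone explain why there are another 11 python threads (11*[{python}]) alongside the grid search, and why 31 python threads (31*[{python}]) are spawned inside each of the 5 grid search jobs?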

Update: added the code that calls GridSearchCV

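import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Xs, ys: feature matrix and labels, defined elsewhere
Cs = 10 ** np.arange(-2, 2, 0.1)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
clf = LogisticRegression()
gs = GridSearchCV(
    clf,
    param_grid={'C': Cs, 'penalty': ['l1'],
                'tol': [1e-10], 'solver': ['liblinear']},
    cv=skf,
    scoring='neg_log_loss',
    n_jobs=5,
    verbose=1,
    refit=True)
gs.fit(Xs, ys)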

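Update (2017-09-27):

I have put together some test code that you can easily reproduce, if you are interested.

I tested the same code on a Mac Pro and on several Linux machines, and reproduced @igrinis's result, but only on the Mac Pro. On the Linux machines I get numbers that differ from before, but they are consistent. So the number of threads spawned may depend on the particular data fed to GridSearchCV.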

python─┬─5*[python───31*[{python}]]
       └─3*[{python}]

Note that the pstree installed by homebrew/linuxbrew is not the same on the Mac Pro and on the Linux machines. Here are the exact versions I used:

Mac:

pstree $Revision: 2.39 $ by Fred Hucht (C) 1993-2015
EMail: fred AT thp.uni-due.de

Linux:

pstree (PSmisc) 22.20
Copyright (C) 1993-2009 Werner Almesberger and Craig Small

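The Mac version does not seem to have an option for showing threads, which I think may be why they do not appear in the results. I have not yet found an easy way to inspect threads on the Mac Pro. If you happen to know one, please comment.

Update (2017-10-12)

In another set of experiments, I confirmed that setting the environment variable OMP_NUM_THREADS makes a difference.

Before export OMP_NUM_THREADS=1, there are many threads (63 in this case) whose purpose is unclear, as described above: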

bash───python─┬─23*[python───63*[{python}]]
              └─3*[{python}]

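parallel is not used here. n_jobs=23.

After export OMP_NUM_THREADS=1, the worker processes no longer spawn any threads, but the 3 threads under the parent Python process (shown in braces) are still there, and I still do not know what they are for.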

bash───python─┬─23*[python]
              └─3*[{python}]
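The same setting can also be applied from inside the script rather than from the shell. The following is a minimal sketch, assuming the variables are set before numpy or scikit-learn are imported, since native thread pools typically read them when they are initialised. Only OMP_NUM_THREADS is confirmed by the experiment above; OPENBLAS_NUM_THREADS and MKL_NUM_THREADS are included on the assumption that numpy may be linked against OpenBLAS or MKL. limit_native_threads is a hypothetical helper name.

```python
import os

# OMP_NUM_THREADS is the variable tested above. The other two are assumptions
# for numpy builds linked against OpenBLAS or Intel MKL.
THREAD_ENV_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")


def limit_native_threads(n=1):
    """Set the thread-count environment variables for this process.

    Child processes started afterwards inherit the values.
    """
    for name in THREAD_ENV_VARS:
        os.environ[name] = str(n)


limit_native_threads(1)

# Import numpy / sklearn only after this point, so the native libraries
# see the variables when they start up.
```

Child processes inherit the environment, so workers started by GridSearchCV afterwards should see the same values.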

I originally came across OMP_NUM_THREADS because it was causing some of my GridSearchCV jobs to fail, with an error message like this:

OMP: Error #34: System unable to allocate necessary resources for OMP thread:
OMP: System error #11: Resource temporarily unavailable
OMP: Hint: Try decreasing the value of OMP_NUM_THREADS.
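A rough sketch of why the thread counts might matter for this error: on Linux, system error 11 is EAGAIN, which thread creation returns when a resource limit is reached, and every thread counts against the per-user RLIMIT_NPROC limit. Whether that limit is what triggered the error above is an assumption. The code adds up the processes and threads from the pstree outputs and reads the current soft limit for comparison; total_tasks and nproc_soft_limit are hypothetical helper names.

```python
try:
    import resource  # Unix-only standard library module
except ImportError:
    resource = None


def total_tasks(n_parents, workers_per_parent, threads_per_worker,
                extra_threads_per_parent):
    """Count every process and thread in a pstree layout like the ones above."""
    # Each parent and each worker is a process (i.e. has one main thread).
    processes = n_parents * (1 + workers_per_parent)
    # Extra threads are the ones pstree prints in braces.
    threads = n_parents * (extra_threads_per_parent
                           + workers_per_parent * threads_per_worker)
    return processes + threads


def nproc_soft_limit():
    """Return the soft RLIMIT_NPROC value, or None where it is not available."""
    if resource is None or not hasattr(resource, "RLIMIT_NPROC"):
        return None
    soft, _hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return soft  # may equal resource.RLIM_INFINITY, meaning "no limit"


# First layout: 20 parents, 5 workers each, 31 threads per worker, 11 extra.
print(total_tasks(20, 5, 31, 11))   # 3440
# Second layout: 1 parent, 23 workers, 63 threads per worker, 3 extra.
print(total_tasks(1, 23, 63, 3))    # 1476
print(nproc_soft_limit())
```

If the total is close to the reported limit, reducing OMP_NUM_THREADS or n_jobs lowers it directly.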

Best answer

From the sklearn.GridSearchCV documentation:

n_jobs : int, default=1
    Number of jobs to run in parallel.

pre_dispatch : int, or string, optional
    Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:
    - None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs
    - An int, giving the exact number of total jobs that are spawned
    - A string, giving an expression as a function of n_jobs, as in '2*n_jobs'

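If I understand the documentation correctly, GridSearchCV spawns a bunch of threads, as many as there are grid points, and only runs n_jobs of them at the same time. I think the number 31 is some kind of cap on your 40 possible values. Try playing with the value of the pre_dispatch parameter.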
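For reference, the numbers involved can be worked out without running the search. The following is a plain-Python sketch that expands the grid from the question by hand; it is a stand-in for illustration and does not call scikit-learn, and the list comprehension approximates 10 ** np.arange(-2, 2, 0.1). The last line assumes the default pre_dispatch='2*n_jobs'.

```python
from itertools import product

# Hand-written copy of the param_grid from the question.
param_grid = {
    "C": [10 ** (-2 + 0.1 * i) for i in range(40)],  # stand-in for the numpy array
    "penalty": ["l1"],
    "tol": [1e-10],
    "solver": ["liblinear"],
}
n_splits = 10  # StratifiedKFold(n_splits=10, ...)
n_jobs = 5

# Every combination of parameter values is one candidate.
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)

# Each candidate is fitted once per cross-validation fold.
n_fits = n_candidates * n_splits

# With the default pre_dispatch='2*n_jobs', this many tasks are queued up front.
pre_dispatched = 2 * n_jobs

print(n_candidates, n_fits, pre_dispatched)  # 40 400 10
```

Changing pre_dispatch only changes the last of these three numbers.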

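I think the other 11 threads have nothing to do with GridSearchCV itself, since they are shown at the same level. I think they are left over from other commands.

By the way, I do not observe this behavior on a Mac (I see only the 5 processes spawned by GridSearchCV, as one would expect), so it may come from incompatible libraries. Try updating sklearn and numpy manually.

Here is my pstree output (parts of the paths removed for privacy reasons):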

 └─┬= 00396 *** -fish
   └─┬= 21743 *** python /Users/***/scratch_5.py
     ├─── 21775 *** python /Users/***/scratch_5.py
     ├─── 21776 *** python /Users/***/scratch_5.py
     ├─── 21777 *** python /Users/***/scratch_5.py
     ├─── 21778 *** python /Users/***/scratch_5.py
     └─── 21779 *** python /Users/***/scratch_5.py

Reply to the second comment:

It is actually your code. I just generated a separable one-dimensional binary classification problem:

N = 50000
Xs = np.concatenate( (np.random.random(N) , 3+np.random.random(N)) ).reshape(-1, 1)
ys = np.concatenate( (np.zeros(N), np.ones(N)) )

100,000 samples are enough to keep the CPU busy for a minute or so.

Regarding "python - Why does GridSearchCV in scikit-learn spawn so many threads", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/46351157/
