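python - Parallel Sklearn model building with Dask or Joblib

Tags: python scikit-learn dask dask-distributed

I have a large set of sklearn pipelines that I would like to build in parallel with Dask. Here is a simple but naive sequential approach: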

from sklearn.naive_bayes import MultinomialNB 
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target, test_size=0.2)

pipe_nb = Pipeline([('clf', MultinomialNB())])
pipe_lr = Pipeline([('clf', LogisticRegression())])
pipe_rf = Pipeline([('clf', RandomForestClassifier())])

pipelines = [pipe_nb, pipe_lr, pipe_rf]  # In reality, this would include many more different types of models with varying but specific parameters

for pl in pipelines:
    pl.fit(X_train, Y_train)

Note that this is not a GridSearchCV or RandomizedSearchCV problem.

For RandomizedSearchCV, I know how to parallelize it with Dask:

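import joblib
import scipy.stats
from dask.distributed import Client  # requires the `distributed` package
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

dask_client = Client('tcp://some.host.com:8786')

clf_rf = RandomForestClassifier()
param_dist = {'n_estimators': scipy.stats.randint(100, 500)}
search_rf = RandomizedSearchCV(
    clf_rf,
    param_distributions=param_dist,
    n_iter=100,
    scoring='f1',  # note: plain 'f1' assumes binary targets
    cv=10,
    error_score=0,
    verbose=3,
)

# Route joblib's internal parallelism to the Dask cluster.
with joblib.parallel_backend('dask'):
    search_rf.fit(X_train, Y_train)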

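However, I am not interested in hyperparameter tuning, and it is not clear to me how to modify this code to fit, in parallel with Dask, a set of several different models that each have their own specific parameters.

Best answer

dask.delayed is probably the simplest solution here.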

from sklearn.naive_bayes import MultinomialNB 
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target, test_size=0.2)

pipe_nb = Pipeline([('clf', MultinomialNB())])
pipe_lr = Pipeline([('clf', LogisticRegression())])
pipe_rf = Pipeline([('clf', RandomForestClassifier())])

pipelines = [pipe_nb, pipe_lr, pipe_rf]  # In reality, this would include many more different types of models with varying but specific parameters

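# Use dask.delayed instead of a for loop.
import dask

# Wrapping each pipeline in dask.delayed makes .fit() lazy; nothing runs yet.
pipelines_ = [dask.delayed(pl).fit(X_train, Y_train) for pl in pipelines]
# dask.compute runs the fits in parallel and returns a tuple of fitted pipelines.
fit_pipelines = dask.compute(*pipelines_)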
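
Since the question also mentions Joblib, here is a minimal sketch of the same idea using joblib.Parallel directly. It is not part of the original answer. The `max_iter`, `n_estimators`, `random_state` and `n_jobs` values are assumptions chosen only to keep the example small and repeatable. The key point is to use the returned objects: with process-based workers each fit happens on a copy, so the original pipeline objects in the parent process may remain unfitted.

```python
from joblib import Parallel, delayed
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

iris = load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0
)

# Heterogeneous pipelines, each with its own fixed parameters (illustrative values).
pipelines = [
    Pipeline([("clf", MultinomialNB())]),
    Pipeline([("clf", LogisticRegression(max_iter=1000))]),
    Pipeline([("clf", RandomForestClassifier(n_estimators=50, random_state=0))]),
]

# One task per pipeline. fit() returns the fitted estimator, and joblib
# returns the results in the same order as the input list.
# To run the tasks on a Dask cluster instead of local workers, create a
# dask.distributed Client and wrap this call in
# `with joblib.parallel_backend("dask"):`.
fitted_pipelines = Parallel(n_jobs=2)(
    delayed(pl.fit)(X_train, Y_train) for pl in pipelines
)

# Evaluate each fitted pipeline on the held-out split (mean accuracy).
scores = [pl.score(X_test, Y_test) for pl in fitted_pipelines]
```

Compared with dask.delayed, this keeps the code inside the joblib API that scikit-learn already uses, so switching between local workers and a Dask cluster only changes the backend context manager.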

Regarding python - Parallel Sklearn model building with Dask or Joblib, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/54355236/
