apache-spark - PySpark: Is there a way to perform .fit() and .transform() in one operation?

Tags: apache-spark pyspark apache-spark-mllib

I'm trying to figure out how to optimize .fit() and .transform() in PySpark.

I have:

from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[topic_vectorizer_A, cat_vectorizer_A,
                            topic_vectorizer_B, cat_vectorizer_B,
                            fil_top_a_vect, fil_top_b_vect,
                            fil_cat_a_vect, fil_cat_b_vect,
                            fil_ent_a_vect, fil_ent_b_vect,
                            assembler])

# Note that all the operations in the pipeline are transforms only.
model = pipeline.fit(cleaned)

# wait 12 hours
vectorized_df = model.transform(cleaned)

# wait another XX hours
# save to parquet.

I've seen things like this:

vectorized_df = pipeline.fit(cleaned).transform(cleaned)

But I'm not sure whether this is equivalent, or whether it somehow optimizes the operation.

Best answer

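There is nothing to be done here. Pipeline.fit handles each stage according to its type:

If the stage is an Estimator (like CountVectorizer), it is trained in Pipeline.fit.
If the stage is a Transformer (like HashingTF), it is returned as is.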
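
The rule above can be illustrated with a small pure-Python model. This is only a sketch of the stage handling described in the answer, not Spark's actual code: ToyTransformer, ToyEstimator, ToyPipelineModel and toy_pipeline_fit are hypothetical stand-ins invented for this example, and plain lists of numbers stand in for DataFrames.

```python
class ToyTransformer:
    """Stand-in for a Transformer such as HashingTF: it has nothing to learn."""

    def __init__(self, offset):
        self.offset = offset

    def transform(self, rows):
        return [r + self.offset for r in rows]


class ToyEstimator:
    """Stand-in for an Estimator such as CountVectorizer: fit() must scan the data."""

    def __init__(self):
        self.fit_calls = 0

    def fit(self, rows):
        self.fit_calls += 1
        mean = sum(rows) / len(rows)  # the "training" pass over the data
        return ToyTransformer(-mean)  # the fitted model is itself a transformer


class ToyPipelineModel:
    """Stand-in for the fitted pipeline: applies every stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


def toy_pipeline_fit(stages, rows):
    """Train estimators, pass transformers through unchanged."""
    last_estimator = max(
        (i for i, s in enumerate(stages) if hasattr(s, "fit")), default=-1
    )
    fitted = []
    for i, stage in enumerate(stages):
        model = stage.fit(rows) if hasattr(stage, "fit") else stage
        fitted.append(model)
        # Data only needs to flow forward while an estimator is still ahead.
        if i < last_estimator:
            rows = model.transform(rows)
    return ToyPipelineModel(fitted)


data = [1.0, 2.0, 3.0]

# Transformer-only pipeline, as in the question: fit() trains nothing.
transform_only = [ToyTransformer(10), ToyTransformer(5)]
model = toy_pipeline_fit(transform_only, data)
two_step = model.transform(data)
chained = toy_pipeline_fit(transform_only, data).transform(data)

# Pipeline containing an estimator: fit() has to scan the data once.
estimator = ToyEstimator()
fitted_model = toy_pipeline_fit([ToyTransformer(10), estimator], data)
centered = fitted_model.transform(data)
```

In this model, the chained call and the two-step call do exactly the same work, so chaining is only a shorter way to write it. Only a stage with a fit() method adds a training pass.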

Source question on Stack Overflow: https://stackoverflow.com/questions/40469243/
