apache-spark - Spark : TreeAgregate at IDF is taking ages

I am using Spark 1.6.1 and have a DataFrame like the following:

+-----------+---------------------+-------------------------------+
|ID         |dateTime             |title                          |
+-----------+---------------------+-------------------------------+
|809907895  |2017-01-21 23:00:01.0|                               |
|1889481973 |2017-01-21 23:00:06.0|Man charged with murder of ... |
|979847722  |2017-01-21 23:00:09.0|Munster cruise to home Cham... |
|18894819734|2017-01-21 23:00:11.0|Man charged with murder of ... |
|17508023137|2017-01-21 23:00:15.0|Emily Ratajkowski hits the ... |
|10321187627|2017-01-21 23:00:17.0|Gardai urge public to remai... |
|979847722  |2017-01-21 23:00:19.0|Sport                          |
|19338946129|2017-01-21 23:00:33.0|                               |
|979847722  |2017-01-21 23:00:35.0|Rassie Erasmus reveals the ... |
|1836742863 |2017-01-21 23:00:49.0|NAMA sold flats which could... |
+-----------+---------------------+-------------------------------+

I perform the following operation:

val aggDF = df.groupBy($"ID")
              .agg(concat_ws(" ", collect_list($"title")) as "titlesText")

Then, on the aggDF DataFrame, I fit a pipeline that extracts TF-IDF features from the titlesText column (by applying a Tokenizer, a StopWordsRemover, HashingTF, and then IDF).

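When I call pipeline.fit(aggDF), the job reaches the stage treeAggregate at IDF.scala:54 (I can see it in the UI) and then hangs there: no progress and no errors. I waited a long time and nothing changed, and the UI shows no useful information.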

Here is an example of what I see in the UI (unchanged for a long time):

What could be the cause?
How can I trace and debug this kind of problem?
Is there another way to extract the same features?

Best answer

Did you specify a maximum number of features in your HashingTF?

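The amount of data IDF has to process is proportional to the number of features HashingTF generates, so with a large feature space it will most likely have to spill a lot of data to disk, which costs time.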
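The suggestion above can be sketched as follows. This is a minimal sketch, not the asker's actual pipeline: the intermediate column names (words, filtered, rawFeatures, features) and the value 4096 are illustrative assumptions, and it assumes aggDF from the question is in scope (for example in spark-shell).

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, IDF, StopWordsRemover, Tokenizer}

// Split the concatenated titles into lowercase, whitespace-separated tokens.
val tokenizer = new Tokenizer()
  .setInputCol("titlesText")
  .setOutputCol("words")          // hypothetical column name

// Drop common stop words before hashing.
val remover = new StopWordsRemover()
  .setInputCol("words")
  .setOutputCol("filtered")       // hypothetical column name

// Cap the hashed feature space explicitly. 4096 (2^12) is an arbitrary
// example value, not a recommendation; if left unset, spark.ml's HashingTF
// defaults to 2^18 = 262144 features.
val hashingTF = new HashingTF()
  .setInputCol("filtered")
  .setOutputCol("rawFeatures")    // hypothetical column name
  .setNumFeatures(1 << 12)

// IDF rescales the term-frequency vectors; its fit step is where the
// treeAggregate stage from the question runs.
val idf = new IDF()
  .setInputCol("rawFeatures")
  .setOutputCol("features")       // hypothetical column name

val pipeline = new Pipeline()
  .setStages(Array(tokenizer, remover, hashingTF, idf))

// aggDF is the aggregated DataFrame from the question.
val model = pipeline.fit(aggDF)
```

A smaller feature count shrinks the vectors IDF has to aggregate, but it also increases hash collisions between distinct terms, so the value is a trade-off to tune against your vocabulary size.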

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/41853941/
