python - Cannot import name LDA from MLlib in Spark

Tags: python apache-spark pyspark lda apache-spark-mllib

I am trying to implement LDA with Spark and I am getting this error. I am completely new to Spark, so any help is greatly appreciated.

[root@sandbox ~]# spark-submit ./lda.py
Traceback (most recent call last):
  File "/root/./lda.py", line 3, in <module>
    from pyspark.mllib.clustering import LDA, LDAModel
ImportError: cannot import name LDA

Here is the code:

from pyspark.sql import SQLContext
from pyspark import SparkContext
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors
import numpy
sc = SparkContext(appName="PythonLDA")
data = sc.textFile("/tutorial/input/askreddit20150801.txt")
parsedData = data.map(lambda line: Vectors.dense([float(x) for x in line.strip().split(' ')]))
# Index documents with unique IDs
corpus = parsedData.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()

# Cluster the documents into three topics using LDA
ldaModel = LDA.train(corpus, k=3)

# Output topics. Each is a distribution over words (matching word count vectors)
print("Learned topics (as distributions over vocab of " + str(ldaModel.vocabSize()) + " words):")
topics = ldaModel.topicsMatrix()
for topic in range(3):
    print("Topic " + str(topic) + ":")
    for word in range(0, ldaModel.vocabSize()):
        print(" " + str(topics[word][topic]))

# Save and load model
ldaModel.save(sc, "myModelPath")
sameModel = LDAModel.load(sc, "myModelPath")

When I try to install pyspark.mllib.clustering:

[root@sandbox ~]# pip install spark.mllib.clustering
Collecting spark.mllib.clustering
/usr/lib/python2.6/site-packages/pip/_vendor/requests/packages/urllib3/util/ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning
  Could not find a version that satisfies the requirement spark.mllib.clustering (from versions: )
No matching distribution found for spark.mllib.clustering

Best Answer

The PySpark wrapper for LDA was introduced in Spark 1.5.0. Assuming your installation is not broken, you are most likely running Spark <= 1.4.x.
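A quick way to confirm which Spark version your job is actually running against is to print it from the SparkContext before attempting the import. This is only a minimal sketch; the app name is arbitrary:

from pyspark import SparkContext

sc = SparkContext(appName="VersionCheck")
# LDA was added to pyspark.mllib.clustering in Spark 1.5.0,
# so the import below only succeeds on 1.5.0 or later.
print(sc.version)
sc.stop()

Alternatively, spark-submit --version prints the version from the command line. Once the cluster is on Spark 1.5.0 or newer, the original import (from pyspark.mllib.clustering import LDA, LDAModel) works unchanged; pyspark.mllib ships with the Spark distribution itself, so there is nothing to install with pip.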

A similar question about "python - Cannot import name LDA from MLlib in Spark" can be found on Stack Overflow: https://stackoverflow.com/questions/33533429/
