python-2.7 - Running pyspark.mllib on Ubuntu

Tags: python-2.7 ubuntu apache-spark pyspark apache-spark-mllib

I am trying to use Spark from Python. The code below is test.py, which I put under ~/spark/python:

from pyspark import SparkContext, SparkConf
from pyspark.mllib.fpm import FPGrowth

# appName and master must be defined before building the configuration;
# the values below are placeholders.
appName = "FPGrowthExample"
master = "local[*]"

conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)

# Each line of the sample file is one space-separated transaction.
data = sc.textFile("data/mllib/sample_fpgrowth.txt")
transactions = data.map(lambda line: line.strip().split(' '))

# Train an FP-Growth model and print the frequent itemsets it found.
model = FPGrowth.train(transactions, minSupport=0.2, numPartitions=10)
result = model.freqItemsets().collect()
for fi in result:
    print(fi)

When I run python test.py, I get this error message:

Exception in thread "main" java.lang.IllegalStateException: Library directory '/home/user/spark/lib_managed/jars' does not exist.
        at org.apache.spark.launcher.CommandBuilderUtils.checkState(CommandBuilderUtils.java:249)
        at org.apache.spark.launcher.AbstractCommandBuilder.buildClassPath(AbstractCommandBuilder.java:208)
        at org.apache.spark.launcher.AbstractCommandBuilder.buildJavaCommand(AbstractCommandBuilder.java:119)
        at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitCommand(SparkSubmitCommandBuilder.java:195)
        at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(SparkSubmitCommandBuilder.java:121)
        at org.apache.spark.launcher.Main.main(Main.java:86)
Traceback (most recent call last):
  File "test.py", line 6, in <module>
    conf = SparkConf().setAppName(appName).setMaster(master)
  File "/home/user/spark/python/pyspark/conf.py", line 104, in __init__
    SparkContext._ensure_initialized()
  File "/home/user/spark/python/pyspark/context.py", line 245, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway()
  File "/home/user/spark/python/pyspark/java_gateway.py", line 94, in launch_gateway
    raise Exception("Java gateway process exited before sending the driver its port number")
Exception: Java gateway process exited before sending the driver its port number


When I move test.py to ~/spark, I get:

Traceback (most recent call last):
  File "test.py", line 1, in <module>
    from pyspark import SparkContext, SparkConf
ImportError: No module named pyspark


I cloned the Spark project from the official website.
OS: Ubuntu
Java version: 1.7.0_79
Python version: 2.7.11

Can anyone give me some hints on how to solve this problem?

Best Answer

Spark programs must be submitted via spark-submit. More information: Documentation.

You should try running $SPARK_HOME/bin/spark-submit test.py instead of python test.py.
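For example, a minimal invocation might look like this (a sketch assuming SPARK_HOME points at your Spark directory and test.py sits in the current directory; --master local[*] is just an illustrative choice):

$SPARK_HOME/bin/spark-submit --master local[*] test.py

spark-submit sets up the environment PySpark needs, including the JVM classpath and the Py4J gateway the Python driver talks to, which the plain python interpreter does not do on its own.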

Regarding python-2.7 - running pyspark.mllib on Ubuntu, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/38323267/
