python - Error with LabeledPoint objects in pyspark

Tags: python apache-spark pyspark apache-spark-sql

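I am writing a function that

takes an RDD as input,
splits each line on commas,
and then converts each row into a LabeledPoint object,

finally returning the output as a DataFrame.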

Code:

def parse_points(raw_rdd):

    cleaned_rdd = raw_rdd.map(lambda line: line.split(","))
    new_df = cleaned_rdd.map(lambda line:LabeledPoint(line[0],[line[1:]])).toDF()
    return new_df


output = parse_points(input_rdd)

Up to this point, the code runs without any errors.

But when I add the line

 output.take(5)

I get this error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 129.0 failed 1 times, most recent failure: Lost task 0.0 in stage 129.0 (TID 152, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):

Py4JJavaError       Traceback (most recent call last)
<ipython-input-100-a68c448b64b0> in <module>()
 20 
 21 output = parse_points(raw_rdd)
 ---> 22 print output.show()

Can you tell me what is causing this error?

Best answer

The reason there is no error until an action is executed:
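Spark transformations such as `map` are lazy: they only record what should be done, and the work (including any failing conversion) runs when an action such as `take()` or `show()` asks for results. The sketch below is only an analogy in plain Python, not Spark code. It uses a generator expression to mimic a lazy transformation, and the names `to_float`, `rows`, and `pipeline` are made up for illustration.

```python
# Plain-Python analogy (not Spark): a generator expression is lazy,
# much like an RDD transformation that has not yet been triggered by an action.

def to_float(value):
    # Raises ValueError for non-numeric text, standing in for a bad row.
    return float(value)

rows = ["1.0", "2.5", "oops"]

# Building the pipeline runs nothing yet, so no error is raised here.
pipeline = (to_float(r) for r in rows)

# Consuming the pipeline plays the role of an action such as take() or show().
try:
    results = list(pipeline)
    error = None
except ValueError as exc:
    results = None
    error = exc

print("error raised only on consumption:", error is not None)
```

In the same way, defining `parse_points` and calling it only builds up the computation; a bad row would surface once something like `output.take(5)` forces Spark to evaluate it.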