I have the following code, in which I am trying to create a DataFrame from a PipelinedRDD:
print type(simulation)
sqlContext.createDataFrame(simulation)
The print statement outputs the following:
<class 'pyspark.rdd.PipelinedRDD'>
However, on the next line I get this error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
The error comes with the following traceback:
---> 13 sqlContext.createDataFrame(simulation)
/databricks/spark/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio)
421
422 if isinstance(data, RDD):
--> 423 rdd, schema = self._createFromRDD(data, schema, samplingRatio)
424 else:
425 rdd, schema = self._createFromLocal(data, schema)
/databricks/spark/python/pyspark/sql/context.py in _createFromRDD(self, rdd, schema, samplingRatio)
308 """
309 if schema is None or isinstance(schema, (list, tuple)):
--> 310 struct = self._inferSchema(rdd, samplingRatio)
Best answer
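It looks like the schema cannot be inferred from your data. If you do not specify a sampling ratio, only the first row is used to determine the types. Try passing a non-zero sampling ratio, or specify the schema explicitly, like this: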
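from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([StructField("int_field", IntegerType()),
                     StructField("string_field", StringType())])
df = sqlContext.createDataFrame(simulation, schema)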
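To see why inferring types from the first row alone can fail, here is a toy sketch in plain Python. It is not Spark's inference code. The helper names (`infer_types`, `merge_types`, `infer_from_rows`) and the sample rows are hypothetical and chosen only for illustration. It assumes the failure comes from a field whose type cannot be determined from the first row, for example because that field is None.

```python
# Toy illustration only: this is NOT Spark's schema inference code.
# All function names and the sample rows below are hypothetical.

def infer_types(row):
    """Map each field of one row to a type name; None means 'unknown'."""
    return [None if value is None else type(value).__name__ for value in row]

def merge_types(a, b):
    """Combine two per-field type lists, filling unknowns and rejecting conflicts."""
    merged = []
    for left, right in zip(a, b):
        if left is not None and right is not None and left != right:
            raise TypeError("conflicting types: %s vs %s" % (left, right))
        merged.append(left if left is not None else right)
    return merged

def infer_from_rows(rows, first_row_only=True):
    """Infer field types from the first row only, or from every row given."""
    types = infer_types(rows[0])
    if not first_row_only:
        for row in rows[1:]:
            types = merge_types(types, infer_types(row))
    if any(t is None for t in types):
        raise ValueError("some types cannot be determined: %r" % (types,))
    return types

# The first row has a None in its second field, so its type is unknown there.
rows = [(1, None), (2, "b"), (3, "c")]

first_row_error = None
try:
    infer_from_rows(rows, first_row_only=True)
except ValueError as exc:
    first_row_error = str(exc)

# Looking at more rows fills in the missing type.
all_rows_types = infer_from_rows(rows, first_row_only=False)
```

In Spark itself, the two remedies from the answer would be to pass the `samplingRatio` argument that appears in the `createDataFrame(self, data, schema, samplingRatio)` signature in the traceback, for instance `sqlContext.createDataFrame(simulation, samplingRatio=0.2)` (the value 0.2 is arbitrary), or to pass an explicit schema so that no inference is needed.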
This is based on a similar question found on Stack Overflow, "python - Error when creating a DataFrame from an RDD": https://stackoverflow.com/questions/38220666/