apache-spark - Can not infer schema for type: <type 'unicode'> when converting RDD to DataFrame

Tags: apache-spark dataframe pyspark rdd apache-spark-sql

When I try to convert an RDD to a DataFrame in Spark, I get the exception "Can not infer schema for type:".

Example:

>> rangeRDD.take(1).foreach(println)
(301,301,10)
>> sqlContext.inferSchema(rangeRDD)
Can not infer schema for type: <type 'unicode'>

Is there any way to fix this? I even tried supplying the schema myself through sqlContext.createDataFrame(rdd, schema):

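from pyspark.sql.types import StructType, StructField, IntegerType

# Three nullable integer columns: x, y, z
schema = StructType([
    StructField("x", IntegerType(), True),
    StructField("y", IntegerType(), True),
    StructField("z", IntegerType(), True)])
df = sqlContext.createDataFrame(rangeRDD, schema)
print(df.first())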

But that ends in the runtime error "ValueError: Unexpected tuple u'(301,301,10)' with StructType".

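Best answer

Parse the data first. Each element of the RDD is a unicode string such as u'(301,301,10)', not a tuple of integers, so Spark can neither infer a schema from it nor match it against the StructType. Strip the parentheses, split on commas, and convert each field to int before creating the DataFrame: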

>>> rangeRDD = sc.parallelize([ u'(301,301,10)'])
>>> tupleRangeRDD = rangeRDD.map(lambda x: x[1:-1]) \
...                        .map(lambda x: x.split(",")) \
...                        .map(lambda x: [int(y) for y in x])
>>> df = sqlContext.createDataFrame(tupleRangeRDD, schema)
>>> df.first()
Row(x=301, y=301, z=10)
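
The three chained map calls above can also be folded into one named function, which is easier to test outside Spark. The following is a minimal sketch, not part of the original answer: parse_triple is a hypothetical helper name, and it assumes every record has exactly the "(a,b,c)" shape shown in the question.

```python
def parse_triple(line):
    """Turn a string such as u'(301,301,10)' into a tuple of three ints."""
    # Drop surrounding whitespace and the enclosing parentheses, then split.
    fields = line.strip().strip("()").split(",")
    if len(fields) != 3:
        # Assumption: exactly three fields, matching the x, y, z schema.
        raise ValueError("expected 3 comma-separated fields, got %r" % (line,))
    # int() tolerates whitespace around each number.
    return tuple(int(field) for field in fields)


# Plain-Python check of the helper; no Spark session is needed for this part.
rows = [parse_triple(s) for s in [u"(301,301,10)", u"( 1, 2, 3 )"]]
print(rows)

# With Spark (assumption: rangeRDD, sqlContext and schema as in the question):
# df = sqlContext.createDataFrame(rangeRDD.map(parse_triple), schema)
```

Raising ValueError on malformed records makes bad input fail loudly; if the real data may contain such lines, filtering them out before the map is another option.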

Source: the original question on Stack Overflow: https://stackoverflow.com/questions/36586719/
