python - DataFrame - ValueError: Unexpected tuple with StructType

I am trying to create a manual schema for a dataframe. The data I pass in is an RDD created from JSON. Here is my initial data:

json2 = sc.parallelize(['{"name": "mission", "pandas": {"attributes": "[0.4, 0.5]", "pt": "giant", "id": "1", "zip": "94110", "happy": "True"}}'])

And here is how I specify the schema:

schema = StructType(fields=[
    StructField(
        name='name',
        dataType=StringType(),
        nullable=True
    ),
    StructField(
        name='pandas',
        dataType=ArrayType(
            StructType(
                fields=[
                    StructField(
                        name='id',
                        dataType=StringType(),
                        nullable=False
                    ),
                    StructField(
                        name='zip',
                        dataType=StringType(),
                        nullable=True
                    ),
                    StructField(
                        name='pt',
                        dataType=StringType(),
                        nullable=True
                    ),
                    StructField(
                        name='happy',
                        dataType=BooleanType(),
                        nullable=False
                    ),
                    StructField(
                        name='attributes',
                        dataType=ArrayType(
                            elementType=DoubleType(),
                            containsNull=False
                        ),
                        nullable=True

                    )
                ]
            ),
            containsNull=True
        ),
        nullable=True
    )
])

When I call sqlContext.createDataFrame(json2, schema) and then try to run show() on the resulting dataframe, I get the following error:

ValueError: Unexpected tuple '{"name": "mission", "pandas": {"attributes": "[0.4, 0.5]", "pt": "giant", "id": "1", "zip": "94110", "happy": "True"}}' with StructType

Best answer

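First of all, json2 is just an RDD[String]. Spark has no special knowledge of the serialization format used to encode the data. Moreover, it expects an RDD of Row or of some product type, and that is clearly not the case here.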

In Scala you could use

sqlContext.read.schema(schema).json(rdd)

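with an RDD[String], but there are two problems:

- This method is not directly accessible in PySpark.
- Even if it were, the schema you created is simply invalid for this data:
  - pandas is a struct, not an array
  - pandas.happy is a string, not a boolean
  - pandas.attributes is a string, not an array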
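The three mismatches listed above can be checked without Spark at all. The following is a minimal sketch that uses only the standard json module on the record from the question; the variable names (record, pandas_field, field_types) are made up for illustration.

```python
import json

# The record from the question, exactly as it is stored in the RDD (one JSON string).
record = '{"name": "mission", "pandas": {"attributes": "[0.4, 0.5]", "pt": "giant", "id": "1", "zip": "94110", "happy": "True"}}'

doc = json.loads(record)
pandas_field = doc["pandas"]

# Map each nested field to the Python type that json.loads produced for it.
field_types = {key: type(value).__name__ for key, value in pandas_field.items()}

print(type(pandas_field).__name__)  # dict: a single object (struct), not a list
print(field_types["happy"])         # str: the text "True", not a boolean
print(field_types["attributes"])    # str: the text "[0.4, 0.5]", not a list
```

Every value inside pandas comes back as a plain string, which is why a schema declaring BooleanType or ArrayType for those fields cannot match this data as it stands.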

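A schema is used only to avoid type inference, not for type casting or any other transformation. If you want to transform the data, you have to parse it first: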

def parse(s: str) -> Row:
    return ...

rdd.map(parse).toDF(schema)
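One possible way to fill in parse is sketched below. It is not the answer's own implementation. It assumes every record looks like the sample in the question (all nested values are strings) and, to stay runnable without Spark, it returns plain nested tuples rather than Row objects. The field order (attributes, happy, id, pt, zip) is chosen to match the nested struct in the corrected schema given in this answer.

```python
import json


def parse(s):
    """Turn one JSON string into a nested tuple:
    (name, (attributes, happy, id, pt, zip)).

    Assumes the input has the same shape as the sample record in the question.
    """
    doc = json.loads(s)
    p = doc["pandas"]
    return (
        doc["name"],
        (
            # "[0.4, 0.5]" is itself JSON text, so decode it a second time.
            [float(x) for x in json.loads(p["attributes"])],
            # "True" / "true" -> True, anything else -> False.
            p["happy"].strip().lower() == "true",
            p["id"],
            p["pt"],
            p["zip"],
        ),
    )


record = '{"name": "mission", "pandas": {"attributes": "[0.4, 0.5]", "pt": "giant", "id": "1", "zip": "94110", "happy": "True"}}'
parsed = parse(record)
print(parsed)  # ('mission', ([0.4, 0.5], True, '1', 'giant', '94110'))
```

In a Spark session this would presumably be applied as json2.map(parse).toDF(schema) with a matching schema; that step is not exercised here.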

Assuming you have JSON like this (with the types fixed):

{"name": "mission", "pandas": {"attributes": [0.4, 0.5], "pt": "giant", "id": "1", "zip": "94110", "happy": true}}

the correct schema would look like this:

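from pyspark.sql.types import (
    ArrayType, BooleanType, DoubleType, StringType, StructField, StructType)

# pandas is a nested struct (not an array) whose fields use native JSON types
schema = StructType([
    StructField("name", StringType(), True),
    StructField("pandas", StructType([
        StructField("attributes", ArrayType(DoubleType(), True), True),
        StructField("happy", BooleanType(), True),
        StructField("id", StringType(), True),
        StructField("pt", StringType(), True),
        StructField("zip", StringType(), True)]),
    True)])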

Regarding python - DataFrame - ValueError: Unexpected tuple with StructType, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/37399209/
