dictionary - pyspark : Change nested column datatype

Tags: dictionary, pyspark

How can we change the data type of a nested column in PySpark? For example, how can I change the data type of value from string to int?

Reference: How to change a Dataframe column from String type to Double type in pyspark

{
    "x": "12",
    "y": {
        "p": {
            "name": "abc",
            "value": "10"
        },
        "q": {
            "name": "pqr",
            "value": "20"
        }
    }
}

Best Answer

You can read the JSON data with:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # sc is an existing SparkContext
data_df = sqlContext.read.json("data.json", multiLine=True)

data_df.printSchema()

Output

root
 |-- x: string (nullable = true)
 |-- y: struct (nullable = true)
 |    |-- p: struct (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- value: string (nullable = true)
 |    |-- q: struct (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- value: string (nullable = true)

Now you can access the data in the y column:

data_df.select("y.p.name", "y.p.value").show()

Output

+----+-----+
|name|value|
+----+-----+
| abc|   10|
+----+-----+

OK, so the solution is to add a new nested column with the correct schema and then drop the column with the wrong schema.

from pyspark.sql.functions import udf
from pyspark.sql import Row
from pyspark.sql.types import StructType

df3 = spark.read.json("data.json", multiLine=True)  # spark is the active SparkSession

# build the corrected schema from the old one: copy y's schema JSON,
# rename it to z, and flip both nested value fields to long
c = df3.schema['y'].jsonValue()
c['name'] = 'z'
c['type']['fields'][0]['type']['fields'][1]['type'] = 'long'
c['type']['fields'][1]['type']['fields'][1]['type'] = 'long'

y_schema = StructType.fromJson(c['type'])

# define a udf to populate the new column. Rows are immutable, so the
# new value has to be built from scratch.

def foo(row):
    d = row.asDict()
    y = {}
    y["p"] = {}
    y["p"]["name"] = d["p"]["name"]
    y["p"]["value"] = int(d["p"]["value"])
    y["q"] = {}
    y["q"]["name"] = d["q"]["name"]
    y["q"]["value"] = int(d["p"]["value"])

    return(y)
map_foo = udf(foo, y_schema)

# add the column
df3_new  = df3.withColumn("z", map_foo("y"))

# delete the column
df4 = df3_new.drop("y")


df4.printSchema()

Output

root
 |-- x: string (nullable = true)
 |-- z: struct (nullable = true)
 |    |-- p: struct (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- value: long (nullable = true)
 |    |-- q: struct (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- value: long (nullable = true)


df4.show()

Output

+---+-------------------+
|  x|                  z|
+---+-------------------+
| 12|[[abc,10],[pqr,20]]|
+---+-------------------+

Regarding dictionary - pyspark: Change nested column datatype, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/45824403/
