I start with a Spark DataFrame `df_spark`:
from pyspark.sql import SparkSession
import pandas as pd
import numpy as np
import pyspark.sql.functions as F

spark = (SparkSession.builder
         .master("local")
         .appName("Word Count")
         .config("spark.some.config.option", "some-value")
         .getOrCreate())

np.random.seed(0)
rows = 6
df_pandas = pd.DataFrame({
    'color': pd.Categorical(np.random.choice(["blue", "orange", "red"], rows)),
    'animal': [['cat', 'dog'], ['cat', 'monkey'], ['monkey', 'cat'],
               ['dog', 'monkey'], ['cat', 'dog'], ['monkey', 'dog']]})
print(df_pandas)

df_spark = spark.createDataFrame(df_pandas)
df_spark.show()
I want to end up with a new Spark DataFrame `df_results_spark` that counts, for each category `red`, `blue`, `orange`, how many times each string `cat`, `monkey`, `dog` appears in the array column:
df_results_pandas = pd.DataFrame({
    'color': ['red', 'blue', 'orange'],
    'cat': [0, 2, 2],
    'dog': [1, 1, 2],
    'monkey': [1, 1, 2]})
print(df_results_pandas)

df_results_spark = spark.createDataFrame(df_results_pandas)
df_results_spark.show()
Best answer
You can use the explode() function to create one row for each element in the array.
df_spark_exploded = df_spark.selectExpr("color","explode(animal) as animal")
df_spark_exploded.show()
+------+------+
| color|animal|
+------+------+
| blue| cat|
| blue| dog|
|orange| cat|
|orange|monkey|
| blue|monkey|
| blue| cat|
|orange| dog|
|orange|monkey|
|orange| cat|
|orange| dog|
| red|monkey|
| red| dog|
+------+------+
Then use pivot() to reshape the DataFrame and apply the count aggregation to get the number of occurrences of each animal.
df_results_spark = df_spark_exploded.groupby("color").pivot("animal").count().fillna(0)
df_results_spark.show()
+------+---+---+------+
| color|cat|dog|monkey|
+------+---+---+------+
|orange| 2| 2| 2|
| red| 0| 1| 1|
| blue| 2| 1| 1|
+------+---+---+------+
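As a quick cross-check on the driver side, the same explode-then-pivot logic can be reproduced in plain pandas with `explode()` and `crosstab()`. This is only a sketch: the `color` list below is written out explicitly to match the sample data above, rather than regenerated with `np.random.seed`.

```python
import pandas as pd

# Same sample data as above, with the colors hardcoded
df_pandas = pd.DataFrame({
    'color': ['blue', 'orange', 'blue', 'orange', 'orange', 'red'],
    'animal': [['cat', 'dog'], ['cat', 'monkey'], ['monkey', 'cat'],
               ['dog', 'monkey'], ['cat', 'dog'], ['monkey', 'dog']]})

# Explode the list column into one row per animal,
# then cross-tabulate color vs. animal to count occurrences
exploded = df_pandas.explode('animal')
counts = pd.crosstab(exploded['color'], exploded['animal'])
print(counts)
```

This yields the same counts as the PySpark `pivot` result; missing combinations such as red/cat are filled with 0 automatically by `crosstab`.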
Regarding pyspark - counting how many times an array contains a string per category in PySpark, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/53800524/