python - How to split CSV values in a PySpark DataFrame cell into separate new columns and their values

Tags: python apache-spark pyspark

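I have a Spark DataFrame in which one column holds CSV values at the cell level, and I am trying to break them out into new columns. Sample DataFrame: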

    a_id                                    features
1   2020     "a","b","c","d","constant1","1","0.1","aa"
2   2021     "a","b","c","d","constant2","1","0.2","ab"
3   2022     "a","b","c","d","constant3","1","0.3","ac","a","b","c","d","constant3","1.1","3.3","acx"
4   2023     "a","b","c","d","constant4","1","0.4","ad"
5   2024     "a","b","c","d","constant5","1","0.5","ae","a","b","c","d","constant5","1.2","6.3","xwy","a","b","c","d","constant5","2.2","8.3","bunr"
6   2025     "a","b","c","d","constant6","1","0.6","af"

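The features column contains multiple CSV values in which a, b, c, d act as headers. These headers repeat in some cells (rows 3 and 5), and I want to extract each header only once, together with its respective values. The expected output DataFrame is shown below.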

Output Spark DataFrame

    a_id       a        d
1   2020   constant1   ["aa"]
2   2021   constant2   ["ab"]
3   2022   constant3   ["ac","acx"]
4   2023   constant4   ["ad"]
5   2024   constant5   ["ae","xwy","bunr"]
6   2025   constant6   ["af"]

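As shown, I only want to extract the a and d headers as new columns, where a is a constant and d has multiple values that are collected into a list.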

Please help me with how to do this transformation in PySpark. The DataFrame above is a real-time streaming DataFrame.

Best answer

Using only PySpark / Spark SQL functions:

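  • remove the header from the string
  • use regexp_extract_all to split the string into substrings after every fourth ,
  • explode the result and remove empty rows
  • split the result again, so that each CSV value becomes one element of an array
  • create the columns a and d from the first and fourth elements of the array
  • group by a_id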
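
from pyspark.sql import functions as F

# The repeated header (with its trailing comma) and the number of fields per record.
header = '"a","b","c","d",'
num_headers = header.count(",")

# `df` is the input DataFrame with the columns a_id and features.
# regexp_extract_all is available as a SQL function from Spark 3.1 on.
result = (
    # Remove every repetition of the header.
    df.withColumn("features", F.expr(f"replace(features, '{header}')"))
    # Cut the rest into chunks of num_headers fields. The f-string renders the
    # quantifier as \{4}; with Spark's default string-literal escaping this becomes {4}.
    .withColumn("features", F.expr(f"regexp_extract_all(features, '(([^,]*,?)\\{{{num_headers}}})')"))
    # One row per chunk; the regex also yields an empty match, which is dropped.
    .withColumn("features", F.explode("features"))
    .filter("features != ''")
    # Each CSV value becomes one array element; pick the values under a and d.
    .withColumn("features", F.split("features", ","))
    .withColumn("a", F.expr("features[0]"))
    .withColumn("d", F.expr("features[3]"))
    .groupBy("a_id")
    .agg(F.first("a").alias("a"), F.collect_list("d").alias("d"))
)

# show() works on a batch DataFrame; a streaming DataFrame has to be written with writeStream instead.
result.show(truncate=False)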
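
To make the chunking steps easier to follow, here is a minimal sketch that traces a single features cell (row 3 of the sample data) through the same steps in plain Python, without Spark. It uses Python's re module in place of Spark's regexp_extract_all, so the pattern is written with a non-capturing inner group. The helper name extract_a_and_d is hypothetical and exists only for this illustration.

```python
import re

HEADER = '"a","b","c","d",'
NUM_HEADERS = HEADER.count(",")  # 4 fields per record


def extract_a_and_d(features: str):
    """Hypothetical helper: mimic the Spark steps on one cell value."""
    # 1. Drop every repetition of the header.
    body = features.replace(HEADER, "")
    # 2. Cut the rest into chunks of NUM_HEADERS comma-separated fields.
    #    The pattern can also match an empty string at the end of the input.
    pattern = r"((?:[^,]*,?){%d})" % NUM_HEADERS
    chunks = re.findall(pattern, body)
    # 3. Discard empty chunks, then split each chunk into its fields.
    records = [chunk.split(",") for chunk in chunks if chunk]
    # 4. Field 0 is the value under header "a", field 3 the value under "d".
    a = records[0][0]
    d = [record[3] for record in records]
    return a, d


row3 = ('"a","b","c","d","constant3","1","0.3","ac",'
        '"a","b","c","d","constant3","1.1","3.3","acx"')
a, d = extract_a_and_d(row3)
print(a, d)
```

As in the Spark version, the extracted values keep their surrounding double quotes, because none of the steps removes them.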

Output:

+----+----------+---------------------+
|a_id|a         |d                    |
+----+----------+---------------------+
|2020|"constant"|["aa"]               |
|2022|"constant"|["ac", "acx"]        |
|2025|"constant"|["af"]               |
|2023|"constant"|["ad"]               |
|2021|"constant"|["ab"]               |
|2024|"constant"|["ae", "xwy", "bunr"]|
+----+----------+---------------------+
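
The final groupBy / agg step collapses the exploded rows back into one row per a_id. The sketch below gives a rough plain-Python picture of that aggregation. It also strips the surrounding double quotes, on the assumption that they are unwanted; the Spark code in this answer keeps them. The input rows are derived from rows 3 and 4 of the sample data, and the names exploded and grouped are made up for illustration.

```python
# Exploded rows as (a_id, a, d), one per record chunk (assumed intermediate state).
exploded = [
    (2022, '"constant3"', '"ac"'),
    (2022, '"constant3"', '"acx"'),
    (2023, '"constant4"', '"ad"'),
]

grouped = {}
for a_id, a, d in exploded:
    # Keep the first "a" seen for each a_id and collect every "d" into a list.
    entry = grouped.setdefault(a_id, {"a": a.strip('"'), "d": []})
    entry["d"].append(d.strip('"'))

print(grouped)
```

In Spark, first and collect_list are documented as non-deterministic after a shuffle, so the order of the elements in d is not guaranteed to match the order in the original cell.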

This question and answer come from a similar question found on Stack Overflow: https://stackoverflow.com/questions/69453197/
