python - Substring with a length limit (pyspark.sql.Column.substr)

I have a code, for example C78907. I want to split it into:

C78     # Level 1
C789    # Level 2
C7890   # Level 3
C78907  # Level 4

So far I have been using:

from pyspark.sql.functions import concat

# Each level is the first N characters of the code column
Df3 = Df2.withColumn('Level_One', concat(Df2.code.substr(1, 3)))
Df4 = Df3.withColumn('Level_Two', concat(Df3.code.substr(1, 4)))
Df5 = Df4.withColumn('Level_Three', concat(Df4.code.substr(1, 5)))
Df6 = Df5.withColumn('Level_Four', concat(Df5.code.substr(1, 6)))

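The problem is that, when I look at the results, the level 4 column (which should hold 6 characters) may contain the level 1, level 2, or level 3 code instead, because substr simply returns whatever characters are available when the code is shorter than the requested length.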

721 7213    7213    7213
758 7580    7580    7580
724 7242    7242    7242
737 7373    73730   73730
789 7895    78959   78959
V06 V061    V061    V061
381 3810    38100   38100

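Ideally, a length constraint would help. What I mean is:

For level 1, keep exactly 3 characters.
For level 2, 4 characters and no fewer.
For level 3, 5 characters and no fewer.
For level 4, 6 characters and no fewer.
If the code does not have the required number of characters, put null instead of filling the column with the previous level.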
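
To make the rule above concrete, here is a minimal sketch in plain Python (not PySpark). The names prefix_or_none and split_levels are hypothetical and exist only for this illustration; the widths 3 to 6 come from the levels listed above.

```python
from typing import List, Optional, Sequence


def prefix_or_none(code: str, width: int) -> Optional[str]:
    """Return the first `width` characters of `code`, or None if it is too short."""
    return code[:width] if len(code) >= width else None


def split_levels(code: str, widths: Sequence[int] = (3, 4, 5, 6)) -> List[Optional[str]]:
    """Apply the prefix rule once per level width (hypothetical helper)."""
    return [prefix_or_none(code, w) for w in widths]


# Print the levels for a few of the sample codes from the question
for sample in ["7213", "73730", "38100D"]:
    print(sample, split_levels(sample))
```

The key point is that the length check is made against the full code, so a short code yields None rather than a repeat of the previous level.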

Desired output:

Initial_code   level1  level2   level3   level4        
 7213           721    7213     null      null
 7580           758    7580     null      null
 7242           724    7242     null      null
 73730          737    7373     73730     null
 38100D         381    3810     38100     38100D

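Best answer

You can get the desired output using pyspark.sql.functions.when() and pyspark.sql.functions.length(). When creating each column, check whether the substring has the expected length. If it does not, set the column to None using pyspark.sql.functions.lit().

For example: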

import pyspark.sql.functions as f
df.withColumn('Level_One', f.when(
        f.length(f.col('code').substr(1, 3)) == 3,
        f.col('code').substr(1, 3)
    ).otherwise(f.lit(None)))\
    .withColumn('Level_Two', f.when(
        f.length(f.col('code').substr(1, 4)) == 4,
        f.col('code').substr(1, 4)
    ).otherwise(f.lit(None)))\
    .withColumn('Level_Three', f.when(
        f.length(f.col('code').substr(1, 5)) == 5,
        f.col('code').substr(1, 5)
    ).otherwise(f.lit(None)))\
    .withColumn('Level_Four', f.when(
        f.length(f.col('code').substr(1, 6)) == 6,
        f.col('code').substr(1, 6)
    ).otherwise(f.lit(None)))\
    .show()

Output:

+------+---------+---------+-----------+----------+
|  Code|Level_One|Level_Two|Level_Three|Level_Four|
+------+---------+---------+-----------+----------+
|  7213|      721|     7213|       null|      null|
|  7580|      758|     7580|       null|      null|
|  7242|      724|     7242|       null|      null|
| 73730|      737|     7373|      73730|      null|
|38100D|      381|     3810|      38100|    38100D|
+------+---------+---------+-----------+----------+

Regarding python - Substring with a length limit (pyspark.sql.Column.substr), we found a similar question on Stack Overflow: https://stackoverflow.com/questions/49432950/
