python - pyspark - select() function ignores if statement

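Thanks to user @DerekO, whose example below correctly returns the max length of only the varchar columns. However, when I use the same example with a df loaded from a csv file, it ignores the if statement and computes the max length of all columns (including integer, double, etc.).

Question: Without creating a custom schema, how can we improve Example 2 below so that it displays the max length of only the varchar columns?

Example 1: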

from pyspark.sql.functions import col, length, max
from pyspark.sql.types import StringType
    
df = spark.createDataFrame(
    [
        (1, '2', '1'),
        (1, '4', '82'),
        (1, '2', '3'),
    ],
    ['col1','col2','col3']
)

df.select([
    max(length(col(schema.name))).alias(f'{schema.name}_max_length') 
    for schema in df.schema 
    if schema.dataType == StringType()
])
    
+---------------+---------------+
|col2_max_length|col3_max_length|
+---------------+---------------+
|              1|              2|
+---------------+---------------+

Example 2:

from pyspark.sql.functions import col, length, max
from pyspark.sql.types import StringType

df = (
    spark.read
    .option("delimiter", ',')
    .option("header", 'true')
    .option("escape", '"')
    .option("inferSchema", 'true')
    .csv("abfss://myContainer@myStorageAccountName" + '.dfs.core.windows.net/' + myFile_path)
)

df = df.select([max(length(col(schema.name))).alias(f'{schema.name}')
    for schema in df.schema 
    if schema.dataType == StringType()
])

display(df)

# The code above displays lengths for all columns, even though the csv file also contains non-varchar columns, as shown below:

for schema in df.schema:
  print(schema.name+" , "+str(schema.dataType))

# Output: the csv has about 80 columns; for brevity, only a few are shown here
Field , StringType
Field2 , StringType
Field3 , StringType
Field4 , IntegerType
Field5 , DoubleType
Field6 , LongType
Field7 , StringType
Field8 , StringType
Field9 , DoubleType
.....
.....

Best answer

Honestly, I'm only guessing at this point, but maybe comparing with == is not best practice here, and we should use isinstance instead.

df.select([
    max(length(col(schema.name))).alias(f'{schema.name}_max_length') 
    for schema in df.schema 
    if isinstance(schema.dataType, StringType)
])
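
If the filter still seems to be ignored, it may help to check which columns actually pass it before running the select. Below is a minimal sketch of that idea, not a confirmed fix. It filters on the type strings from `df.dtypes` (a list of `(name, type)` pairs, where string columns are reported as `'string'`) instead of comparing `dataType` objects. The helper names `string_column_names` and `max_string_lengths`, and the sample type pairs, are made up for illustration. The pyspark import sits inside the helper and uses the `F` alias so that pyspark's `max` never shadows Python's builtin `max`.

```python
def string_column_names(dtypes):
    """Return the names of columns whose Spark type string is 'string'.

    `dtypes` has the same shape as DataFrame.dtypes: a list of (name, type) pairs.
    """
    return [name for name, dtype in dtypes if dtype == "string"]


def max_string_lengths(df):
    """Hypothetical helper: max(length(col)) for the string columns of `df` only."""
    # Imported here, under an alias, so pyspark's max does not replace the builtin.
    from pyspark.sql import functions as F

    names = string_column_names(df.dtypes)
    print("columns passing the filter:", names)  # inspect before selecting
    return df.select(
        [F.max(F.length(F.col(name))).alias(f"{name}_max_length") for name in names]
    )


# Illustrative type pairs mirroring part of the schema printed in the question.
sample_dtypes = [
    ("Field", "string"),
    ("Field4", "int"),
    ("Field5", "double"),
    ("Field6", "bigint"),
    ("Field7", "string"),
]
print(string_column_names(sample_dtypes))  # ['Field', 'Field7']
```

With a real DataFrame, `max_string_lengths(df)` would be called on the df as loaded from the csv, before any other select replaces it. If the printed list already contains the integer or double columns, the problem lies in the inferred schema rather than in the comprehension.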

Original question on Stack Overflow: https://stackoverflow.com/questions/75525825/
