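I have a very large table representing points (more than 30 million of them). It can have two or three columns, representing x, y, z.

Unfortunately, some of these columns can contain strings ('nan', 'nulo', 'vazio', etc.). The strings can differ from file to file, but they are constant within a single table.

I need a way to remove these strings and either replace them with null values or drop the affected rows.

What I have done so far is in the picture and in the code below. Is there a better way? A more flexible one? (This code only works for 3D.)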

from pyspark.sql.functions import regexp_replace

def import_file(self, file_path: str, sep: str = ',', null_values: str = ''):

    # read the table; the file has no header, so name the columns explicitly
    table = self.spark.read.load(path=file_path,
                                 format='csv',
                                 sep=sep,
                                 header=False).toDF('x', 'y', 'z')

    # change the letters to ''; withColumn returns a new DataFrame, so reassign it
    table = table.withColumn('x', regexp_replace('x', '[a-z]', ''))
    table = table.withColumn('y', regexp_replace('y', '[a-z]', ''))
    table = table.withColumn('z', regexp_replace('z', '[a-z]', ''))

    # replace '' with nulls or TODO: remove rows
    table = table.replace('', None)

    return table
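
To illustrate what the `[a-z]` pattern does to individual cells, here is a small plain-Python sketch. It assumes Python's `re.sub` behaves like `regexp_replace` for this simple character class; the helper name `strip_lowercase` and the sample values are hypothetical.

```python
import re


def strip_lowercase(cell: str) -> str:
    """Remove lowercase ASCII letters, as the '[a-z]' replacement does."""
    return re.sub(r'[a-z]', '', cell)


# Hypothetical cell values for illustration.
samples = ['nan', 'nulo', '12.5', 'NaN', '1e5']
for cell in samples:
    print(repr(cell), '->', repr(strip_lowercase(cell)))
```

Placeholders written entirely in lowercase collapse to an empty string, which the later replace step can turn into null. Uppercase letters survive, and a value in scientific notation such as `1e5` would silently become `15`, so the pattern is worth checking against the real files.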

Best answer

Another approach is to use a UDF to flag the string values. From there you can easily drop rows based on whichever combination of columns you care about.

import pyspark.sql.functions as F
from pyspark.sql.types import BooleanType, StringType

@F.udf(returnType=BooleanType())
def mark_strings(inp):
    # Flag letters-only strings; nulls and numeric values are not flagged
    if isinstance(inp, str) and inp.isalpha():
        return True
    return False


@F.udf(returnType=StringType())
def replace_strings(inp):
    # Replace letters-only strings with None (null in Spark);
    # numeric values are returned as is
    if isinstance(inp, str) and inp.isalpha():
        return None
    return inp
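
To see which cell values the `isalpha()` test inside these UDFs would flag, the same check can be run as plain Python, without Spark. This is only a minimal sketch: `is_alpha_token` is a hypothetical helper name, and the sample values are made up for illustration.

```python
def is_alpha_token(value):
    """Mirror the UDF check: True only for non-empty, letters-only strings."""
    return isinstance(value, str) and value.isalpha()


# Hypothetical sample cells, as they might arrive from a CSV read as strings.
samples = ["nan", "nulo", "vazio", "1.5", "-3", "", "n/a", "12abc", None]

flags = {repr(s): is_alpha_token(s) for s in samples}
for value, flagged in flags.items():
    print(f"{value:>8} -> {flagged}")
```

Only the letters-only tokens are flagged. Mixed tokens such as 'n/a' or '12abc' are left alone by this check, so if such values can occur in the files, a stricter numeric test may be needed.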

Removing rows

table = table.withColumn('x_str_bool',mark_strings(F.col('x')))
table = table.withColumn('y_str_bool',mark_strings(F.col('y')))
table = table.withColumn('z_str_bool',mark_strings(F.col('z')))

# Assuming you only want to drop rows where x or y holds a string

table_filter = table.filter((F.col('x_str_bool') == False) &
(F.col('y_str_bool') == False))

Replacing values

table = table.withColumn('x',replace_strings(F.col('x')))
table = table.withColumn('y',replace_strings(F.col('y')))
table = table.withColumn('z',replace_strings(F.col('z')))

This answer on "python - Removing letters from an integer column in PySpark" comes from a similar question found on Stack Overflow: https://stackoverflow.com/questions/67846590/
