apache-spark - regexp_replace on a PySpark DataFrame

Tags: apache-spark hadoop pyspark apache-spark-sql pyspark-dataframes

I ran regexp_replace on a PySpark DataFrame, and afterwards the data types of all the affected columns changed to string. Why does this happen?
Here is the schema of my table before applying regexp_replace:

root
 |-- account_id: long (nullable = true)
 |-- credit_card_limit: long (nullable = true)
 |-- credit_card_number: long (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- phone_number: long (nullable = true)
 |-- amount: long (nullable = true)
 |-- date: string (nullable = true)
 |-- shop: string (nullable = true)
 |-- transaction_code: string (nullable = true)
Schema after applying regexp_replace:
root
 |-- date_type: date (nullable = true)
 |-- c_phone_number: string (nullable = true)
 |-- c_account_id: string (nullable = true)
 |-- c_credit_card_limit: string (nullable = true)
 |-- c_credit_card_number: string (nullable = true)
 |-- c_amount: string (nullable = true)
 |-- c_full_name: string (nullable = true)
 |-- c_transaction_code: string (nullable = true)
 |-- c_shop: string (nullable = true)
The code I used:
from pyspark.sql.functions import regexp_replace

df=df.withColumn('c_phone_number',regexp_replace("phone_number","[^0-9]","")).drop('phone_number')
df=df.withColumn('c_account_id',regexp_replace("account_id","[^0-9]","")).drop('account_id')
df=df.withColumn('c_credit_card_limit',regexp_replace("credit_card_limit","[^0-9]","")).drop('credit_card_limit')
df=df.withColumn('c_credit_card_number',regexp_replace("credit_card_number","[^0-9]","")).drop('credit_card_number')
df=df.withColumn('c_amount',regexp_replace("amount","[^0-9 ]","")).drop('amount')
df=df.withColumn('c_full_name',regexp_replace("full_name","[^a-zA-Z ]","")).drop('full_name')
df=df.withColumn('c_transaction_code',regexp_replace("transaction_code","[^a-zA-Z]","")).drop('transaction_code')
df=df.withColumn('c_shop',regexp_replace("shop","[^a-zA-Z ]","")).drop('shop')
Why does this happen? Is there a way to convert the columns back to their original data types, or should I just cast them again?

Best answer

You may want to look at the implementation of regexp_replace in the Spark source (from the Spark git repository):

override def nullSafeEval(s: Any, p: Any, r: Any): Any = {
    if (!p.equals(lastRegex)) {
      // regex value changed
      lastRegex = p.asInstanceOf[UTF8String].clone()
      pattern = Pattern.compile(lastRegex.toString)
    }
    if (!r.equals(lastReplacementInUTF8)) {
      // replacement string changed
      lastReplacementInUTF8 = r.asInstanceOf[UTF8String].clone()
      lastReplacement = lastReplacementInUTF8.toString
    }
    val m = pattern.matcher(s.toString())
    result.delete(0, result.length())

    while (m.find) {
      m.appendReplacement(result, lastReplacement)
    }
    m.appendTail(result)

    UTF8String.fromString(result.toString)
  }
  • The code above accepts the input expressions as Any and then calls toString() on them
  • Finally, it converts the result back to a string:

    UTF8String.fromString(result.toString)

    Reference: spark-git
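
In other words, regexp_replace always returns a StringType column no matter what the input type was, so casting the result back is the expected approach. A minimal sketch of what that could look like, using a few of the column names from the question (restoring them to long is an assumption based on the original schema):

from pyspark.sql.functions import regexp_replace

# regexp_replace always produces a string column, so cast back afterwards
df = df.withColumn('c_phone_number',
                   regexp_replace("phone_number", "[^0-9]", "").cast("long")) \
       .drop('phone_number')
df = df.withColumn('c_account_id',
                   regexp_replace("account_id", "[^0-9]", "").cast("long")) \
       .drop('account_id')
df = df.withColumn('c_credit_card_limit',
                   regexp_replace("credit_card_limit", "[^0-9]", "").cast("long")) \
       .drop('credit_card_limit')

df.printSchema()
# c_phone_number, c_account_id and c_credit_card_limit are long again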

A similar question about regexp_replace on a PySpark DataFrame can be found on Stack Overflow: https://stackoverflow.com/questions/62699239/
