python - PySpark - string matching to create a new column

Tags: python regex apache-spark pyspark apache-spark-sql

I have a dataframe like this:

ID             Notes
2345          Checked by John
2398          Verified by Stacy
3983          Double Checked on 2/23/17 by Marsha 

For example, suppose there are only 3 employees to check for: John, Stacy, or Marsha. I'd like to create a new column like this:

ID                Notes                              Employee
2345          Checked by John                          John
2398         Verified by Stacy                        Stacy
3983     Double Checked on 2/23/17 by Marsha          Marsha

Would regex or grep be better here? What kind of function should I try? Thanks!

Edit: I've been trying a number of solutions, but nothing seems to work. Should I give up and instead create a column with a binary value for each employee? i.e.:

ID                Notes                             John       Stacy    Marsha
2345          Checked by John                        1            0       0
2398         Verified by Stacy                       0            1       0
3983     Double Checked on 2/23/17 by Marsha         0            0       1
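If you do go the indicator-column route, the matching logic is just a per-name containment check. Below is a minimal sketch of that logic in plain Python with the stdlib re module (outside Spark, so it is easy to verify); in PySpark the same idea could use when(col('Notes').contains(name), 1).otherwise(0) per employee name:

```python
import re

notes = [
    'Checked by John',
    'Verified by Stacy',
    'Double Checked on 2/23/17 by Marsha',
]
employees = ['John', 'Stacy', 'Marsha']

# One indicator per employee: 1 if the name occurs in the note, else 0.
# \b word boundaries keep 'John' from matching inside e.g. 'Johnson'.
# (Sketch only: names here are plain words; use re.escape for arbitrary input.)
indicators = [
    {name: int(bool(re.search(r'\b%s\b' % name, note))) for name in employees}
    for note in notes
]

for note, row in zip(notes, indicators):
    print(note, row)
```

Each dict in `indicators` is one row of the binary table sketched above.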

Best Answer

In short:

regexp_extract(col('Notes'), r'(.)(by)(\s+)(\w+)', 4)

This expression extracts the employee name from any position in the text column (col('Notes')) where it appears after 'by' followed by one or more spaces.


In detail:

Create a sample dataframe

data = [('2345', 'Checked by John'),
        ('2398', 'Verified by Stacy'),
        ('2328', 'Verified by Srinivas than some random text'),
        ('3983', 'Double Checked on 2/23/17 by Marsha')]

df = sc.parallelize(data).toDF(['ID', 'Notes'])

df.show()

+----+--------------------+
|  ID|               Notes|
+----+--------------------+
|2345|     Checked by John|
|2398|   Verified by Stacy|
|2328|Verified by Srini...|
|3983|Double Checked on...|
+----+--------------------+

Do the required imports

from pyspark.sql.functions import regexp_extract, col

Use regexp_extract(column_name, regex, group_number) on df to extract the Employee name from the column.

Here the regex ('(.)(by)(\s+)(\w+)') means

  • (.) - any single character (except a newline)
  • (by) - the literal text 'by'
  • (\s+) - one or more whitespace characters
  • (\w+) - one or more alphanumeric or underscore characters

and group_number is 4 because the group (\w+) is in the 4th position in the expression
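The group numbering can be sanity-checked with Python's stdlib re module, which numbers capturing groups the same way (left to right by opening parenthesis):

```python
import re

pattern = r'(.)(by)(\s+)(\w+)'
m = re.search(pattern, 'Double Checked on 2/23/17 by Marsha')

print(repr(m.group(1)))  # the single character right before 'by' (a space here)
print(m.group(2))        # the literal 'by'
print(m.group(4))        # the 4th group holds the name: 'Marsha'
```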

result = df.withColumn('Employee', regexp_extract(col('Notes'), r'(.)(by)(\s+)(\w+)', 4))

result.show()

+----+--------------------+--------+
|  ID|               Notes|Employee|
+----+--------------------+--------+
|2345|     Checked by John|    John|
|2398|   Verified by Stacy|   Stacy|
|2328|Verified by Srini...|Srinivas|
|3983|Double Checked on...|  Marsha|
+----+--------------------+--------+

Databricks notebook

Note:

regexp_extract(col('Notes'), r'.by\s+(\w+)', 1) is a much cleaner version; check the regex in use here.
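This cleaner pattern can likewise be checked with the stdlib re module (its behavior matches Spark's Java regex for these simple patterns): with a single capturing group, the name is group 1.

```python
import re

pattern = r'.by\s+(\w+)'
notes = ['Checked by John',
         'Verified by Stacy',
         'Double Checked on 2/23/17 by Marsha']

# Group 1 is the only capturing group: the word after 'by' plus whitespace.
names = [re.search(pattern, n).group(1) for n in notes]
print(names)  # ['John', 'Stacy', 'Marsha']
```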

For python - PySpark - string matching to create a new column, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/46410887/
