I have a DataFrame like this:
ID    Notes
2345  Checked by John
2398  Verified by Stacy
3983  Double Checked on 2/23/17 by Marsha
For example, suppose only 3 employees do the checking: John, Stacy, or Marsha. I'd like to create a new column like this:
ID    Notes                                Employee
2345  Checked by John                      John
2398  Verified by Stacy                    Stacy
3983  Double Checked on 2/23/17 by Marsha  Marsha
Which is better here, regex or grep? What kind of function should I try? Thanks!
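Edit: I've been trying a number of solutions, but nothing seems to work. Should I give up and instead create a column of binary values for each employee? That is:
ID    Notes                                John  Stacy  Marsha
2345  Checked by John                      1     0      0
2398  Verified by Stacy                    0     1      0
3983  Double Checked on 2/23/17 by Marsha  0     0      1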
Best answer
In short:
regexp_extract(col('Notes'), r'(.)(by)(\s+)(\w+)', 4)
This expression extracts the employee name from any position in the text column (col('Notes')) where it follows "by" and one or more whitespace characters.
In detail:
Create a sample DataFrame
# `sc` is assumed to be an existing SparkContext (as in the pyspark shell)
data = [('2345', 'Checked by John'),
        ('2398', 'Verified by Stacy'),
        ('2328', 'Verified by Srinivas than some random text'),
        ('3983', 'Double Checked on 2/23/17 by Marsha')]
df = sc.parallelize(data).toDF(['ID', 'Notes'])
df.show()
+----+--------------------+
|  ID|               Notes|
+----+--------------------+
|2345|     Checked by John|
|2398|   Verified by Stacy|
|2328|Verified by Srini...|
|3983|Double Checked on...|
+----+--------------------+
Do the necessary imports
from pyspark.sql.functions import regexp_extract, col
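Use regexp_extract(column_name, regex, group_number) on df to extract the Employee name from the column.
Here the regex ('(.)(by)(\s+)(\w+)') means:
- (.) - any single character (except a newline)
- (by) - the literal text "by"
- (\s+) - one or more whitespace characters
- (\w+) - one or more alphanumeric or underscore characters
and group_number is 4 because the group (\w+) is in the 4th position in the expression.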
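To see what each of the four groups captures, here is a minimal sketch that runs the same pattern through Python's built-in re module instead of Spark. This is only an approximation for checking the pattern locally: Spark evaluates regexp_extract with Java's regex engine, which I'd expect to agree with re for a pattern this simple. The names PATTERN, notes and employees are made up for the illustration.

```python
import re

# Same pattern as in the answer, written as a raw string so the
# backslashes reach the regex engine unchanged.
PATTERN = r'(.)(by)(\s+)(\w+)'

notes = [
    'Checked by John',
    'Verified by Stacy',
    'Verified by Srinivas than some random text',
    'Double Checked on 2/23/17 by Marsha',
]

for note in notes:
    match = re.search(PATTERN, note)
    # groups() returns the four captured pieces in order:
    # (char before "by", "by", whitespace, name)
    print(match.groups())

# Group 4 is the (\w+) part, i.e. the word right after "by".
employees = [re.search(PATTERN, note).group(4) for note in notes]
print(employees)
```

Only group 4 is actually needed; the other three groups are captured but never used.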
# raw string: keeps \s and \w intact and avoids Python's invalid-escape warning
result = df.withColumn('Employee', regexp_extract(col('Notes'), r'(.)(by)(\s+)(\w+)', 4))
result.show()
+----+--------------------+--------+
|  ID|               Notes|Employee|
+----+--------------------+--------+
|2345|     Checked by John|    John|
|2398|   Verified by Stacy|   Stacy|
|2328|Verified by Srini...|Srinivas|
|3983|Double Checked on...|  Marsha|
+----+--------------------+--------+
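Since the question says only three employees are expected, a different approach (not the one used in this answer) would be to match the known names directly instead of relying on the word "by". The sketch below builds such a pattern and checks it with Python's re module; KNOWN_EMPLOYEES, NAME_PATTERN and find_employee are hypothetical names for this illustration. It returns an empty string when nothing matches, mirroring how regexp_extract behaves on a non-match.

```python
import re

# The three names come from the question; the list itself is an assumption
# about how one might hold them.
KNOWN_EMPLOYEES = ['John', 'Stacy', 'Marsha']

# \b on both sides keeps "John" from matching inside a longer word
# such as "Johnson".
NAME_PATTERN = r'\b(' + '|'.join(re.escape(name) for name in KNOWN_EMPLOYEES) + r')\b'


def find_employee(note):
    """Return the first known employee name in note, or '' if none is found."""
    match = re.search(NAME_PATTERN, note)
    return match.group(1) if match else ''


print(NAME_PATTERN)
print(find_employee('Double Checked on 2/23/17 by Marsha'))
print(repr(find_employee('Verified by Srinivas than some random text')))
```

In Spark, the same pattern string could presumably be passed as regexp_extract(col('Notes'), NAME_PATTERN, 1). Rows mentioning someone outside the list, like Srinivas above, would then get an empty Employee value.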
Note:
regexp_extract(col('Notes'), r'.by\s+(\w+)', 1)
seems to be a much cleaner version of the same regex: only the name is captured, so group_number is 1.
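One thing to watch for with either form: the leading "." requires some character before "by" but does not require a word break, so a note that starts with "by", or one that contains a word ending in "by", may not give the expected result. The sketch below compares the pattern from the note with a word-boundary variant, again using Python's re module as a stand-in for Spark's Java regex engine. BOUNDED is my own suggestion, not part of the original answer, and extract is a hypothetical helper.

```python
import re

SIMPLE = r'.by\s+(\w+)'     # the pattern from the note above
BOUNDED = r'\bby\s+(\w+)'   # assumed variant: "by" must start at a word boundary


def extract(pattern, note):
    """Return capture group 1 of the first match, or '' if there is no match."""
    match = re.search(pattern, note)
    return match.group(1) if match else ''


samples = [
    'Checked by John',
    'by John',                                  # nothing precedes "by"
    'Left on standby until checked by John',    # "standby" ends in "by"
]

for note in samples:
    print(repr(note), '->', repr(extract(SIMPLE, note)), repr(extract(BOUNDED, note)))
```

For the three rows in the question, both patterns return the same names, so the difference only matters if the Notes text is less regular than the sample.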
Source: the Stack Overflow question "python - PySpark - String matching to create new column": https://stackoverflow.com/questions/46410887/