python - Regex to find all strings that do not contain _ (underscore) or : (colon) in a PySpark DataFrame column

Tags: python python-3.x regex apache-spark pyspark

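I have a column named "tags" in a DataFrame. I need to extract values from it based on a condition: an extracted value must not contain _ (underscore) or : (colon).
For example:
"tags": "hai, hello, amount_10, amount_90, total:100"
Expected result:
"new_column": "hai, hello"
For your reference:
This is how I extracted all the amount tags: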

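import re
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
# For every amount_<value> tag, keep the part after the first underscore ('amount_10' -> '10')
collectAmount = udf(lambda s: [amount.split('_')[1] for amount in re.findall(r'(amount_\w+)', s)],
                    ArrayType(StringType()))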

productsDF = productsDF.withColumn('amount_tag', collectAmount('tags'))
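
For checking the extraction logic outside Spark, the lambda can be sketched as a plain Python function. The name extract_amounts and the None handling are assumptions added for illustration; they are not part of the original question.

```python
import re

# Hypothetical helper mirroring the lambda wrapped by udf in the question.
def extract_amounts(tags):
    """Return the part after the first underscore of every amount_<value> tag."""
    if tags is None:  # assumption: treat a null column value as "no tags"
        return []
    return [tag.split('_')[1] for tag in re.findall(r'amount_\w+', tags)]

print(extract_amounts("hai, hello, amount_10, amount_90, total:100"))
```

A function like this could be passed to udf in place of the lambda, which makes the logic easier to unit test.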

Best answer

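Try this. It splits tags on commas, runs each element through a regex with two negative lookaheads (no _ and no : anywhere in the element), removes the empty strings that regexp_extract returns for non-matching elements, and joins what is left. Note that transform and array_remove require Spark 2.4 or later.

from pyspark.sql.functions import expr

df.withColumn('new_column', expr('''
    concat_ws(',',
        array_remove(
            transform(split(tags, ','), x -> regexp_extract(x, '^(?!.*_)(?!.*:).+$', 0)),
            ''))
''')).show(2, False)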

+-------------------------------------------+----------+
|tags                                       |new_column|
+-------------------------------------------+----------+
|hai, hello, amount_10, amount_90, total:100|hai, hello|
|hai, hello, amount_10, amount_90, total:100|hai, hello|
+-------------------------------------------+----------+
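
To see what the lookahead pattern does without a Spark session, here is a minimal pure-Python sketch of the same split, filter, and join steps. The function name filter_tags is hypothetical, and Python's re module stands in for the Java regex engine that Spark uses; the two agree for this pattern.

```python
import re

# Same pattern as in the answer: each negative lookahead rejects the
# element if '_' (first) or ':' (second) appears anywhere in it, and
# '.+' requires the element to be non-empty.
NO_UNDERSCORE_OR_COLON = re.compile(r'^(?!.*_)(?!.*:).+$')

def filter_tags(tags):
    """Keep only the comma-separated elements without '_' or ':'."""
    parts = tags.split(',')
    kept = [p for p in parts if NO_UNDERSCORE_OR_COLON.match(p)]
    return ','.join(kept)

print(filter_tags("hai, hello, amount_10, amount_90, total:100"))
```

Because the string is split on ',' only, each element after the first keeps its leading space, which is why joining with ',' reproduces the "hai, hello" spacing seen in the output above.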

This question about finding all strings that do not contain _ (underscore) or : (colon) in a PySpark DataFrame column comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/63009059/
