scala - Spark: Extract domain from email address in dataframe

Tags: scala apache-spark dataframe

I am having trouble extracting the domain from an email address. I have the following dataframe.

+---+----------------+
|id |email           |
+---+----------------+
|1  |ii@koko.com     |
|2  |lol@fsa.org     |
|3  |kokojambo@mon.eu|
+---+----------------+

Now I want a new domain field, so that I get:

+---+----------------+------+
|id |email           |domain|
+---+----------------+------+
|1  |ii@koko.com     |koko  |
|2  |lol@fsa.org     |fsa   |
|3  |kokojambo@mon.eu|mon   |
+---+----------------+------+

I tried something like this:

val test = df_short.withColumn("email", split($"email", "@."))

but got the wrong output. Can anyone point me in a better direction?

Best Answer

You can simply use the built-in regexp_extract function to get the domain name from the email address.

import org.apache.spark.sql.functions.regexp_extract
import spark.implicits._

//create an example dataframe
val df = Seq((1, "ii@koko.com"),
  (2, "lol@fsa.org"),
  (3, "kokojambo@mon.eu"))
  .toDF("id", "email")

//original dataframe
df.show(false)
//output
//    +---+----------------+
//    |id |email           |
//    +---+----------------+
//    |1  |ii@koko.com     |
//    |2  |lol@fsa.org     |
//    |3  |kokojambo@mon.eu|
//    +---+----------------+

//extract the domain name with a regex
df.withColumn("domain",
  regexp_extract($"email", "(?<=@)[^.]+(?=\\.)", 0))
  .show(false)

//output
//    +---+----------------+------+
//    |id |email           |domain|
//    +---+----------------+------+
//    |1  |ii@koko.com     |koko  |
//    |2  |lol@fsa.org     |fsa   |
//    |3  |kokojambo@mon.eu|mon   |
//    +---+----------------+------+
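As a plain-Scala sketch (no Spark session needed; the object name DomainDemo is illustrative), the same regex can be checked on bare strings, which also shows why the original split attempt misbehaved: in a Java/Scala regex, "@." means '@' followed by any one character, so split swallows the first letter of the domain.

```scala
object DomainDemo {
  // The asker's pattern: "@." matches '@' plus the NEXT character,
  // so the first letter of the domain is consumed by the delimiter.
  def brokenSplit(email: String): Array[String] = email.split("@.")

  // The answer's pattern: lookbehind for '@', then everything up to
  // the first dot (the lookahead keeps the dot out of the match).
  private val DomainRe = "(?<=@)[^.]+(?=\\.)".r

  def domain(email: String): Option[String] = DomainRe.findFirstIn(email)

  def main(args: Array[String]): Unit = {
    println(brokenSplit("ii@koko.com").mkString(", ")) // ii, oko.com
    println(domain("ii@koko.com"))                     // Some(koko)
    println(domain("kokojambo@mon.eu"))                // Some(mon)
  }
}
```

The same pattern string can then be dropped into regexp_extract unchanged, since Spark uses Java regex syntax.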

On scala - Spark: Extract domain from email address in dataframe, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/50923036/
