scala - Spark: Extract domain from email address in dataframe

Tags: scala apache-spark dataframe

I am having trouble extracting the domain from an email address. I have the following dataframe.

+---+----------------+
|id |email           |
+---+----------------+
|1  |ii@koko.com     |
|2  |lol@fsa.org     |
|3  |kokojambo@mon.eu|
+---+----------------+

Now I want a new domain field, so that I get:

+---+----------------+------+
|id |email           |domain|
+---+----------------+------+
|1  |ii@koko.com     |koko  |
|2  |lol@fsa.org     |fsa   |
|3  |kokojambo@mon.eu|mon   |
+---+----------------+------+

I tried something like this:

val test = df_short.withColumn("email", split($"email", "@."))

but got the wrong output. Can anyone point me in a better direction?

Best Answer

You can simply use the built-in regexp_extract function to get the domain name from the email address.

import org.apache.spark.sql.functions.regexp_extract
import spark.implicits._

//create an example dataframe
val df = Seq((1, "ii@koko.com"),
  (2, "lol@fsa.org"),
  (3, "kokojambo@mon.eu"))
  .toDF("id", "email")

//original dataframe
df.show(false)
//output
//    +---+----------------+
//    |id |email           |
//    +---+----------------+
//    |1  |ii@koko.com     |
//    |2  |lol@fsa.org     |
//    |3  |kokojambo@mon.eu|
//    +---+----------------+

//extract the domain name with a regex
df.withColumn("domain",
  regexp_extract($"email", "(?<=@)[^.]+(?=\\.)", 0))
  .show(false)

//output
//    +---+----------------+------+
//    |id |email           |domain|
//    +---+----------------+------+
//    |1  |ii@koko.com     |koko  |
//    |2  |lol@fsa.org     |fsa   |
//    |3  |kokojambo@mon.eu|mon   |
//    +---+----------------+------+
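As a plain-Scala sketch (no Spark session needed; the object name DomainDemo is illustrative), the same regex can be checked on bare strings, which also shows why the original split attempt misbehaved: in a Java/Scala regex, "@." means '@' followed by any one character, so split swallows the first letter of the domain.

```scala
object DomainDemo {
  // The asker's pattern: "@." matches '@' plus the NEXT character,
  // so the first letter of the domain is consumed by the delimiter.
  def brokenSplit(email: String): Array[String] = email.split("@.")

  // The answer's pattern: lookbehind for '@', then everything up to
  // the first dot (the lookahead keeps the dot out of the match).
  private val DomainRe = "(?<=@)[^.]+(?=\\.)".r

  def domain(email: String): Option[String] = DomainRe.findFirstIn(email)

  def main(args: Array[String]): Unit = {
    println(brokenSplit("ii@koko.com").mkString(", ")) // ii, oko.com
    println(domain("ii@koko.com"))                     // Some(koko)
    println(domain("kokojambo@mon.eu"))                // Some(mon)
  }
}
```

The same pattern string can then be dropped into regexp_extract unchanged, since Spark uses Java regex syntax.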

On scala - Spark: Extract domain from email address in dataframe, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/50923036/
