我在提取电子邮件域时遇到困难。我有以下数据框。
+---+----------------+
|id |email |
+---+----------------+
|1 |<a href="https://stackoverflow.com/cdn-cgi/l/email-protection" class="__cf_email__" data-cfemail="345d5d745f5b5f5b1a575b59" rel="noreferrer noopener nofollow">[email protected]</a> |
|2 |<a href="https://stackoverflow.com/cdn-cgi/l/email-protection" class="__cf_email__" data-cfemail="ec808380ac8a9f8dc2839e8b" rel="noreferrer noopener nofollow">[email protected]</a> |
|3 |<a href="https://stackoverflow.com/cdn-cgi/l/email-protection" class="__cf_email__" data-cfemail="472c282c282d262a2528072a2829692232" rel="noreferrer noopener nofollow">[email protected]</a>|
+---+----------------+
现在我想要一个新的域字段,我将得到:
+---+----------------+------+
|id |email |domain|
+---+----------------+------+
|1 |<a href="https://stackoverflow.com/cdn-cgi/l/email-protection" class="__cf_email__" data-cfemail="ed8484ad86828682c38e8280" rel="noreferrer noopener nofollow">[email protected]</a> |koko |
|2 |<a href="https://stackoverflow.com/cdn-cgi/l/email-protection" class="__cf_email__" data-cfemail="0a6665664a6c796b2465786d" rel="noreferrer noopener nofollow">[email protected]</a> |fsa |
|3 |<a href="https://stackoverflow.com/cdn-cgi/l/email-protection" class="__cf_email__" data-cfemail="3b50545054515a5659547b565455155e4e" rel="noreferrer noopener nofollow">[email protected]</a>|mon |
+---+----------------+------+
我尝试做这样的事情:
val test = df_short.withColumn("email", split($"email", "@."))
但得到了错误的输出。有人可以更好地指导我吗?
最佳答案
您可以简单地使用内置的regexp_extract
函数从电子邮件地址获取您的域名。
//create an example dataframe
val df = Seq((1, "<a href="https://stackoverflow.com/cdn-cgi/l/email-protection" class="__cf_email__" data-cfemail="b6dfdff6ddd9ddd998d5d9db" rel="noreferrer noopener nofollow">[email protected]</a>"),
(2, "<a href="https://stackoverflow.com/cdn-cgi/l/email-protection" class="__cf_email__" data-cfemail="0b6764674b6d786a2564796c" rel="noreferrer noopener nofollow">[email protected]</a>"),
(3, "<a href="https://stackoverflow.com/cdn-cgi/l/email-protection" class="__cf_email__" data-cfemail="c6ada9ada9aca7aba4a986aba9a8e8a3b3" rel="noreferrer noopener nofollow">[email protected]</a>"))
.toDF("id", "email")
//original dataframe
df.show(false)
//output
// +---+----------------+
// |id |email |
// +---+----------------+
// |1 |<a href="https://stackoverflow.com/cdn-cgi/l/email-protection" class="__cf_email__" data-cfemail="90f9f9d0fbfffbffbef3fffd" rel="noreferrer noopener nofollow">[email protected]</a> |
// |2 |<a href="https://stackoverflow.com/cdn-cgi/l/email-protection" class="__cf_email__" data-cfemail="d7bbb8bb97b1a4b6f9b8a5b0" rel="noreferrer noopener nofollow">[email protected]</a> |
// |3 |<a href="https://stackoverflow.com/cdn-cgi/l/email-protection" class="__cf_email__" data-cfemail="69020602060308040b0629040607470c1c" rel="noreferrer noopener nofollow">[email protected]</a>|
// +---+----------------+
//using regex get the domain name
df.withColumn("domain",
regexp_extract($"email", "(?<=@)[^.]+(?=\\.)", 0))
.show(false)
//output
// +---+----------------+------+
// |id |email |domain|
// +---+----------------+------+
// |1 |<a href="https://stackoverflow.com/cdn-cgi/l/email-protection" class="__cf_email__" data-cfemail="9cf5f5dcf7f3f7f3b2fff3f1" rel="noreferrer noopener nofollow">[email protected]</a> |koko |
// |2 |<a href="https://stackoverflow.com/cdn-cgi/l/email-protection" class="__cf_email__" data-cfemail="74181b18341207155a1b0613" rel="noreferrer noopener nofollow">[email protected]</a> |fsa |
// |3 |<a href="https://stackoverflow.com/cdn-cgi/l/email-protection" class="__cf_email__" data-cfemail="046f6b6f6b6e6569666b44696b6a2a6171" rel="noreferrer noopener nofollow">[email protected]</a>|mon |
// +---+----------------+------+
关于scala - Spark : Extaract domain from email address in dataframe,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/50923036/