I have this DataFrame:
+------+-----------+--------------------+
|NewsId| expNews| transArr|
+------+-----------+--------------------+
| 1| House|[house, HH, AW1, S] |
| 1|Republicans|[republicans, R, ...|
| 1| Fret|[fret, F, R, EH1, T]|
| 1| About|[about, AH0, B, A...|
I want to remove the element at index 0 from every array in the transArr column. Expected result:
+------+-----------+--------------+
|NewsId| expNews| transArr|
+------+-----------+--------------+
| 1| House|[HH, AW1, S] |
| 1|Republicans|[R, ... |
| 1| Fret|[F, R, EH1, T]|
| 1| About|[AH0, B, A... |
Is there a simple way to do this with Spark and Scala?
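Best answer
Check the code below, which uses array_except instead of the slice function and is claimed to be faster (no benchmark is given). Note that array_except is a set operation: it removes every element equal to the first one and also drops duplicates from the result, so it matches "remove index 0" only when the elements of each array are distinct.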
scala> df.show(false)
+------+-----------+---------------------+
|NewsId|expNews |transArr |
+------+-----------+---------------------+
|1 |House |[house, HH, AW1, S] |
|1 |Republicans|[republicans, R, ...]|
|1 |Fret |[fret, F, R, EH1, T] |
|1 |About |[about, AH0, B, A...]|
+------+-----------+---------------------+
scala> df.withColumn(
     |   "modified_transArr",
     |   array_except($"transArr", array($"transArr"(0)))
     | ).show(false)
+------+-----------+---------------------+-----------------+
|NewsId|expNews |transArr |modified_transArr|
+------+-----------+---------------------+-----------------+
|1 |House |[house, HH, AW1, S] |[HH, AW1, S] |
|1 |Republicans|[republicans, R, ...]|[R, ...] |
|1 |Fret |[fret, F, R, EH1, T] |[F, R, EH1, T] |
|1 |About |[about, AH0, B, A...]|[AH0, B, A...] |
+------+-----------+---------------------+-----------------+
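If the arrays can contain repeated values (phoneme lists often do), a position-based approach may be safer than array_except. Below is a minimal sketch using the SQL slice function through expr, assuming Spark 2.4 or later. The object name DropFirstElement, the helper dropFirst, and the third sample row are made up for illustration and are not part of the original question or answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.expr

object DropFirstElement {
  // Hypothetical helper: overwrites `arrayCol` with the same array minus its element at index 0.
  // SQL slice is 1-based, so position 2 is the second element. size(...) is used as the
  // length only as an upper bound; slice stops at the end of the array.
  def dropFirst(df: DataFrame, arrayCol: String): DataFrame =
    df.withColumn(arrayCol, expr(s"slice($arrayCol, 2, size($arrayCol))"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("drop-first-element")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1, "House", Seq("house", "HH", "AW1", "S")),
      (1, "Fret", Seq("fret", "F", "R", "EH1", "T")),
      (2, "Dup", Seq("a", "B", "a", "B")) // made-up row with repeated values
    ).toDF("NewsId", "expNews", "transArr")

    dropFirst(df, "transArr").show(false)
    spark.stop()
  }
}
```

For the made-up row, slice keeps the remaining elements in order with their duplicates, [B, a, B], whereas array_except would return just [B]. Whether this is faster or slower than array_except on your data is something you would have to measure.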
Source: the original Stack Overflow question on removing an array element by index in a Spark DataFrame (Scala): https://stackoverflow.com/questions/64433703/