我有一个像这样的数据集:
+----------+------------+
|id |event |
+----------+------------+
| 1 |A |
| 2 |B |
| 3 |C |
| 4 |C |
| 5 |A |
| 6 |D |
| 7 |B |
+----------+------------+
I want to modify id, or add another column, so that all equal values in the event column get the same id. I want the rows to keep the same order they have now.
This is how I want the data to end up (the actual values of id don't matter, as long as they are unique per event):
+----------+------------+
|id |event |
+----------+------------+
| 1 |A |
| 2 |B |
| 3 |C |
| 3 |C |
| 1 |A |
| 4 |D |
| 2 |B |
+----------+------------+
Best Answer
Update
Add monotonically_increasing_id() so the data can still be viewed in its original input order after the id is set. From the Spark docs for monotonically_increasing_id():
The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.
from pyspark.sql import Window
import pyspark.sql.functions as f

output_df = (input_df
             .withColumn('order', f.monotonically_increasing_id())  # remember the original row order
             .withColumn('id', f.first('id').over(Window.partitionBy('event'))))  # same id for every row of an event
output_df.sort('order').show()
+---+-----+-----------+
| id|event| order|
+---+-----+-----------+
| 1| A| 8589934592|
| 2| B|17179869184|
| 3| C|25769803776|
| 3| C|34359738368|
| 1| A|42949672960|
| 6| D|51539607552|
| 2| B|60129542144|
+---+-----+-----------+
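As an aside, the order values above follow the bit layout from the quoted docs. A minimal sketch in plain Python (the sample values are taken from the output above) decodes them:

# Decode monotonically_increasing_id() values using the documented
# 31-bit partition id / 33-bit record-number split.
for mid in [8589934592, 17179869184, 25769803776]:
    partition_id = mid >> 33             # upper bits: partition id
    record_no = mid & ((1 << 33) - 1)    # lower 33 bits: record number within the partition
    print(mid, partition_id, record_no)  # e.g. 8589934592 -> partition 1, record 0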
Old
To "preserve" the dataframe order, create another column and keep id intact, so you can sort on it whenever needed:
from pyspark.sql import Window
import pyspark.sql.functions as f
input_df = spark.createDataFrame([
[1, 'A'],
[2, 'B'],
[3, 'C'],
[4, 'C'],
[5, 'A'],
[6, 'D'],
[7, 'B']
], ['id', 'event'])
output_df = input_df.withColumn('group_id', f.first('id').over(Window.partitionBy('event')))
output_df.sort('id').show()
+---+-----+--------+
| id|event|group_id|
+---+-----+--------+
| 1| A| 1|
| 2| B| 2|
| 3| C| 3|
| 4| C| 3|
| 5| A| 1|
| 6| D| 6|
| 7| B| 2|
+---+-----+--------+
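If you want the consecutive group ids shown in the question's expected output (D becomes 4 rather than 6), one option is to rank the groups by their smallest original id. A minimal sketch, assuming the input_df defined above; note that Window.orderBy without partitionBy pulls all rows into a single partition, which is fine for small data but does not scale:

output_df = (input_df
             .withColumn('first_id', f.min('id').over(Window.partitionBy('event')))  # smallest original id per event
             .withColumn('group_id', f.dense_rank().over(Window.orderBy('first_id')))  # renumber groups 1, 2, 3, ...
             .drop('first_id'))
output_df.sort('id').show()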
This answer is for a similar question on Stack Overflow, "dataframe - Pyspark: how to set the same id to all the rows that have the same value on another column?": https://stackoverflow.com/questions/69028786/