My work environment is mainly PySpark, but from some Googling it looks like transposing is quite complicated in PySpark. I'd prefer to keep this in PySpark, but if it's much easier in Pandas I'll convert the Spark dataframe to a Pandas dataframe. The dataset isn't large enough for performance to be a concern, I think.
I want to convert a dataframe with several columns into rows:
Input:
import pandas as pd

df = pd.DataFrame({'Record': {0: 1, 1: 2, 2: 3},
                   'Hospital': {0: 'Red Cross', 1: 'Alberta Hospital', 2: 'General Hospital'},
                   'Hospital Address': {0: '1234 Street 429',
                                        1: '553 Alberta Road 441',
                                        2: '994 Random Street 923'},
                   'Medicine_1': {0: 'Effective', 1: 'Effecive', 2: 'Normal'},
                   'Medicine_2': {0: 'Effective', 1: 'Normal', 2: 'Effective'},
                   'Medicine_3': {0: 'Normal', 1: 'Normal', 2: 'Normal'},
                   'Medicine_4': {0: 'Effective', 1: 'Effective', 2: 'Effective'}})
Record Hospital Hospital Address Medicine_1 Medicine_2 Medicine_3 Medicine_4
1 Red Cross 1234 Street 429 Effective Effective Normal Effective
2 Alberta Hospital 553 Alberta Road 441 Effecive Normal Normal Effective
3 General Hospital 994 Random Street 923 Normal Effective Normal Effective
Output:
Record Hospital Hospital Address Name Value
0 1 Red Cross 1234 Street 429 Medicine_1 Effective
1 2 Red Cross 1234 Street 429 Medicine_2 Effective
2 3 Red Cross 1234 Street 429 Medicine_3 Normal
3 4 Red Cross 1234 Street 429 Medicine_4 Effective
4 5 Alberta Hospital 553 Alberta Road 441 Medicine_1 Effecive
5 6 Alberta Hospital 553 Alberta Road 441 Medicine_2 Normal
6 7 Alberta Hospital 553 Alberta Road 441 Medicine_3 Normal
7 8 Alberta Hospital 553 Alberta Road 441 Medicine_4 Effective
8 9 General Hospital 994 Random Street 923 Medicine_1 Normal
9 10 General Hospital 994 Random Street 923 Medicine_2 Effective
10 11 General Hospital 994 Random Street 923 Medicine_3 Normal
11 12 General Hospital 994 Random Street 923 Medicine_4 Effective
Looking at a PySpark example, it seems complicated: PySpark Dataframe melt columns into rows
Looking at Pandas examples, it seems much easier. But there are many different answers on Stack Overflow, some saying to use pivot, melt, stack, unstack, and more, which ends up being confusing.
So if anyone has a simple way to do this in PySpark, I'm all ears. If not, I'll happily accept a Pandas answer.
Thanks a lot for your help!
Best Answer
You can use .melt and specify id_vars; every other column is then treated as a value_vars column. The number of value_vars columns multiplies the row count of the dataframe by that number, stacking the information from all four Medicine columns into a single column and duplicating the id_vars columns, which gives you the format you want:
Dataframe setup:
import pandas as pd

df = pd.DataFrame({'Record': {0: 1, 1: 2, 2: 3},
                   'Hospital': {0: 'Red Cross', 1: 'Alberta Hospital', 2: 'General Hospital'},
                   'Hospital Address': {0: '1234 Street 429',
                                        1: '553 Alberta Road 441',
                                        2: '994 Random Street 923'},
                   'Medicine_1': {0: 'Effective', 1: 'Effecive', 2: 'Normal'},
                   'Medicine_2': {0: 'Effective', 1: 'Normal', 2: 'Effective'},
                   'Medicine_3': {0: 'Normal', 1: 'Normal', 2: 'Normal'},
                   'Medicine_4': {0: 'Effective', 1: 'Effective', 2: 'Effective'}})
Code:
df = (df.melt(id_vars=['Record','Hospital', 'Hospital Address'],
var_name='Name',
value_name='Value')
.sort_values('Record')
.reset_index(drop=True))
df['Record'] = df.index+1
df
Out[1]:
Record Hospital Hospital Address Name Value
0 1 Red Cross 1234 Street 429 Medicine_1 Effective
1 2 Red Cross 1234 Street 429 Medicine_2 Effective
2 3 Red Cross 1234 Street 429 Medicine_3 Normal
3 4 Red Cross 1234 Street 429 Medicine_4 Effective
4 5 Alberta Hospital 553 Alberta Road 441 Medicine_1 Effecive
5 6 Alberta Hospital 553 Alberta Road 441 Medicine_2 Normal
6 7 Alberta Hospital 553 Alberta Road 441 Medicine_3 Normal
7 8 Alberta Hospital 553 Alberta Road 441 Medicine_4 Effective
8 9 General Hospital 994 Random Street 923 Medicine_1 Normal
9 10 General Hospital 994 Random Street 923 Medicine_2 Effective
10 11 General Hospital 994 Random Street 923 Medicine_3 Normal
11 12 General Hospital 994 Random Street 923 Medicine_4 Effective
About "pandas - Stack, unstack, melt, pivot, transpose? What is the simple method to convert multiple columns into rows (PySpark or Pandas)?" — a similar question on Stack Overflow: https://stackoverflow.com/questions/64179626/