pandas - Stack, unstack, melt, pivot, transpose? What is the simple way to convert multiple columns into rows (PySpark or Pandas)?

Tags: pandas pyspark pivot transform melt

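My working environment mostly uses PySpark, but from Googling around, transposing in PySpark looks very complex. I would like to keep this in PySpark, but if it is much easier to do in Pandas, I will convert the Spark dataframe to a Pandas dataframe. I don't think the dataset is large enough for performance to be an issue.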

I would like to convert a dataframe with multiple columns into rows:

Input:

import pandas as pd
df = pd.DataFrame({'Record': {0: 1, 1: 2, 2: 3},
 'Hospital': {0: 'Red Cross', 1: 'Alberta Hospital', 2: 'General Hospital'},
 'Hospital Address': {0: '1234 Street 429',
  1: '553 Alberta Road 441',
  2: '994 Random Street 923'},
 'Medicine_1': {0: 'Effective', 1: 'Effecive', 2: 'Normal'},
 'Medicine_2': {0: 'Effective', 1: 'Normal', 2: 'Effective'},
 'Medicine_3': {0: 'Normal', 1: 'Normal', 2: 'Normal'},
 'Medicine_4': {0: 'Effective', 1: 'Effective', 2: 'Effective'}})

Record          Hospital       Hospital Address Medicine_1 Medicine_2 Medicine_3 Medicine_4  
     1         Red Cross        1234 Street 429  Effective  Effective     Normal  Effective    
     2  Alberta Hospital   553 Alberta Road 441   Effecive     Normal     Normal  Effective
     3  General Hospital  994 Random Street 923     Normal  Effective     Normal  Effective

Output:

    Record          Hospital       Hospital Address        Name      Value
0        1         Red Cross        1234 Street 429  Medicine_1  Effective
1        2         Red Cross        1234 Street 429  Medicine_2  Effective
2        3         Red Cross        1234 Street 429  Medicine_3     Normal
3        4         Red Cross        1234 Street 429  Medicine_4  Effective
4        5  Alberta Hospital   553 Alberta Road 441  Medicine_1   Effecive
5        6  Alberta Hospital   553 Alberta Road 441  Medicine_2     Normal
6        7  Alberta Hospital   553 Alberta Road 441  Medicine_3     Normal
7        8  Alberta Hospital   553 Alberta Road 441  Medicine_4  Effective
8        9  General Hospital  994 Random Street 923  Medicine_1     Normal
9       10  General Hospital  994 Random Street 923  Medicine_2  Effective
10      11  General Hospital  994 Random Street 923  Medicine_3     Normal
11      12  General Hospital  994 Random Street 923  Medicine_4  Effective

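After looking at the PySpark examples, it seems complicated: PySpark Dataframe melt columns into rows

The Pandas examples look much easier. But there are many different Stack Overflow answers, some saying to use pivot, melt, stack, unstack, and more, which ends up being confusing.

So if anyone has an easy way to do this in PySpark, I am all ears. If not, I will happily take a Pandas answer.

Thank you very much for your help!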

Best answer

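You can use .melt and specify id_vars. Every other column is treated as value_vars. The number of value_vars columns multiplies the number of rows in the dataframe: here the contents of the four Medicine columns are stacked into a single column, and the id_vars columns are repeated for each of them, which gives the format you want: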

Dataframe setup:

import pandas as pd
df = pd.DataFrame({'Record': {0: 1, 1: 2, 2: 3},
 'Hospital': {0: 'Red Cross', 1: 'Alberta Hospital', 2: 'General Hospital'},
 'Hospital Address': {0: '1234 Street 429',
  1: '553 Alberta Road 441',
  2: '994 Random Street 923'},
 'Medicine_1': {0: 'Effective', 1: 'Effecive', 2: 'Normal'},
 'Medicine_2': {0: 'Effective', 1: 'Normal', 2: 'Effective'},
 'Medicine_3': {0: 'Normal', 1: 'Normal', 2: 'Normal'},
 'Medicine_4': {0: 'Effective', 1: 'Effective', 2: 'Effective'}})

Code:

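# Unpivot the Medicine_* columns into Name/Value pairs, repeating the id columns
df = (df.melt(id_vars=['Record', 'Hospital', 'Hospital Address'],
              var_name='Name',
              value_name='Value')
      # a stable sort keeps Medicine_1..Medicine_4 in order within each record
      .sort_values('Record', kind='mergesort')
      .reset_index(drop=True))
# Renumber Record as 1..12 to match the requested output
df['Record'] = df.index + 1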
df
Out[1]: 
    Record          Hospital       Hospital Address        Name      Value
0        1         Red Cross        1234 Street 429  Medicine_1  Effective
1        2         Red Cross        1234 Street 429  Medicine_2  Effective
2        3         Red Cross        1234 Street 429  Medicine_3     Normal
3        4         Red Cross        1234 Street 429  Medicine_4  Effective
4        5  Alberta Hospital   553 Alberta Road 441  Medicine_1   Effecive
5        6  Alberta Hospital   553 Alberta Road 441  Medicine_2     Normal
6        7  Alberta Hospital   553 Alberta Road 441  Medicine_3     Normal
7        8  Alberta Hospital   553 Alberta Road 441  Medicine_4  Effective
8        9  General Hospital  994 Random Street 923  Medicine_1     Normal
9       10  General Hospital  994 Random Street 923  Medicine_2  Effective
10      11  General Hospital  994 Random Street 923  Medicine_3     Normal
11      12  General Hospital  994 Random Street 923  Medicine_4  Effective
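
As a quick check of the row-count claim in the answer, here is a minimal sketch that passes value_vars explicitly and compares the result with a stack-based version, since the question mentions both. The names id_cols, value_cols, melted, melted_sorted and stacked are made up for this illustration and do not come from the original answer. Recent pandas versions may print a FutureWarning about stack; the result should be the same for this data.

```python
import pandas as pd

# Same data as in the question, written with plain lists
df = pd.DataFrame({
    'Record': [1, 2, 3],
    'Hospital': ['Red Cross', 'Alberta Hospital', 'General Hospital'],
    'Hospital Address': ['1234 Street 429', '553 Alberta Road 441',
                         '994 Random Street 923'],
    'Medicine_1': ['Effective', 'Effecive', 'Normal'],
    'Medicine_2': ['Effective', 'Normal', 'Effective'],
    'Medicine_3': ['Normal', 'Normal', 'Normal'],
    'Medicine_4': ['Effective', 'Effective', 'Effective'],
})

# Hypothetical helper names: columns to keep vs. columns to unpivot
id_cols = ['Record', 'Hospital', 'Hospital Address']
value_cols = [c for c in df.columns if c not in id_cols]

# melt with value_vars spelled out instead of inferred
melted = df.melt(id_vars=id_cols, value_vars=value_cols,
                 var_name='Name', value_name='Value')

# Each input row yields one output row per value column
assert len(melted) == len(df) * len(value_cols)

# Stable sort so the Medicine_* order is preserved within each record
melted_sorted = (melted.sort_values('Record', kind='mergesort')
                       .reset_index(drop=True))

# stack-based alternative: move the id columns into the index,
# then stack the remaining columns into a single 'Value' column
stacked = (df.set_index(id_cols)
             .rename_axis(columns='Name')
             .stack()
             .rename('Value')
             .reset_index())

print(melted_sorted)
print(stacked)
```

Unlike the answer's code, this sketch leaves Record as the original record id (1, 2, 3) instead of renumbering it, so it is easy to see which input row each output row came from.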

This Q&A is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/64179626/
