python - How to sort a DataFrame by the number of occurrences in a column in Python (pandas)

Tags: python sorting pandas dataframe

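I am trying to build a DataFrame with pandas in Python from my data (scores between chemicals and proteins).

I want the DataFrame to show the proteins with the most occurrences first, so I sorted the data beforehand. But when I build the DataFrame, I do not get the expected result.

Here is a sample of my data: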

chemicals   prots   scores
CID000000006    10116.ENSRNOP00000003921    196
CID000000051    10116.ENSRNOP00000003921    246
CID000000085    10116.ENSRNOP00000003921    196
CID000000119    10116.ENSRNOP00000003921    247
CID000000134    10116.ENSRNOP00000008952    159
CID000000135    10116.ENSRNOP00000008952    157
CID000000174    10116.ENSRNOP00000008952    439
CID000000175    10116.ENSRNOP00000001021    858
CID000000177    10116.ENSRNOP00000004027    760

As you can see, "10116.ENSRNOP00000003921" is the protein that occurs most often in my data.

So I would like to get something like this:

             10116.ENSRNOP00000003921     10116.ENSRNOP00000008952  
CID000000006   196                 
CID000000051   246 
CID000000085   196 
CID000000119   247 
CID000000134                                  159   
CID000000135                                  157   
CID000000174                                  439

Here is my code:

import pandas as pd

# header=0 takes the first row as the header; recent pandas versions reject a boolean header
df_rat = pd.read_csv("dt_matrix_rat.csv", sep="\t", header=0)
df_rat.columns = ['chemicals','proteins','scores']
df_rat1 = df_rat.pivot(index='chemicals', columns='proteins', values='scores')

df_rat1.to_csv("rat_matrix.csv", sep='\t', index=True  )

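Best answer

I think you need notnull with sum to count the non-null values in each column, then sort_values, and take the index of the result as cols. Finally, select the columns with that subset: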

df1 = df.pivot(index='chemicals', columns='proteins', values='scores')

cols = df1.notnull().sum(axis=0).sort_values(ascending=False).index
print(cols)
Index(['10116.ENSRNOP00000003921', '10116.ENSRNOP00000008952',
       '10116.ENSRNOP00000004027', '10116.ENSRNOP00000001021'],
      dtype='object', name='proteins')

print(df1[cols])
proteins      10116.ENSRNOP00000003921  10116.ENSRNOP00000008952  \
chemicals                                                          
CID000000006                     196.0                       NaN   
CID000000051                     246.0                       NaN   
CID000000085                     196.0                       NaN   
CID000000119                     247.0                       NaN   
CID000000134                       NaN                     159.0   
CID000000135                       NaN                     157.0   
CID000000174                       NaN                     439.0   
CID000000175                       NaN                       NaN   
CID000000177                       NaN                       NaN   

proteins      10116.ENSRNOP00000004027  10116.ENSRNOP00000001021  
chemicals                                                         
CID000000006                       NaN                       NaN  
CID000000051                       NaN                       NaN  
CID000000085                       NaN                       NaN  
CID000000119                       NaN                       NaN  
CID000000134                       NaN                       NaN  
CID000000135                       NaN                       NaN  
CID000000174                       NaN                       NaN  
CID000000175                       NaN                     858.0  
CID000000177                     760.0                       NaN  
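The steps above can be reproduced end to end with the sample rows from the question. This is a minimal sketch, not the answerer's exact script: it assumes the middle column is already named proteins (the question renames it), reads the sample from an in-memory string instead of the real file, and uses Python 3 syntax. Proteins with the same count (here the two that occur once) may come out in either order.

```python
import io

import pandas as pd

# Sample rows from the question; the in-memory string stands in for the real file.
raw = """chemicals proteins scores
CID000000006 10116.ENSRNOP00000003921 196
CID000000051 10116.ENSRNOP00000003921 246
CID000000085 10116.ENSRNOP00000003921 196
CID000000119 10116.ENSRNOP00000003921 247
CID000000134 10116.ENSRNOP00000008952 159
CID000000135 10116.ENSRNOP00000008952 157
CID000000174 10116.ENSRNOP00000008952 439
CID000000175 10116.ENSRNOP00000001021 858
CID000000177 10116.ENSRNOP00000004027 760
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Long to wide: one row per chemical, one column per protein.
df1 = df.pivot(index="chemicals", columns="proteins", values="scores")

# Count the non-null scores in each protein column.
counts = df1.notnull().sum(axis=0)

# Column labels ordered from most to fewest occurrences.
cols = counts.sort_values(ascending=False).index

# Reorder the columns by selecting them in that order.
result = df1[cols]
print(result)
```
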

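Another option is reindex_axis. Note that reindex_axis was deprecated in pandas 0.21 and removed in 1.0; on current versions the equivalent call is reindex with axis=1:

print(df1.reindex(cols, axis=1))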
proteins      10116.ENSRNOP00000003921  10116.ENSRNOP00000008952  \
chemicals                                                          
CID000000006                     196.0                       NaN   
CID000000051                     246.0                       NaN   
CID000000085                     196.0                       NaN   
CID000000119                     247.0                       NaN   
CID000000134                       NaN                     159.0   
CID000000135                       NaN                     157.0   
CID000000174                       NaN                     439.0   
CID000000175                       NaN                       NaN   
CID000000177                       NaN                       NaN   

proteins      10116.ENSRNOP00000004027  10116.ENSRNOP00000001021  
chemicals                                                         
CID000000006                       NaN                       NaN  
CID000000051                       NaN                       NaN  
CID000000085                       NaN                       NaN  
CID000000119                       NaN                       NaN  
CID000000134                       NaN                       NaN  
CID000000135                       NaN                       NaN  
CID000000174                       NaN                       NaN  
CID000000175                       NaN                     858.0  
CID000000177                     760.0                       NaN  
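A variation, offered only as a sketch and not taken from the answer: the column order can also be computed on the long table before pivoting, using value_counts, and then applied with reindex(columns=...). The DataFrame below is rebuilt by hand from the question's sample, and the names order and wide are illustrative. One difference to keep in mind is that value_counts counts rows per protein, whereas notnull().sum() counts non-null scores, so the two only agree when no score is missing.

```python
import pandas as pd

# Long-format sample rebuilt from the question's data.
df = pd.DataFrame({
    "chemicals": [
        "CID000000006", "CID000000051", "CID000000085", "CID000000119",
        "CID000000134", "CID000000135", "CID000000174",
        "CID000000175", "CID000000177",
    ],
    "proteins": [
        "10116.ENSRNOP00000003921", "10116.ENSRNOP00000003921",
        "10116.ENSRNOP00000003921", "10116.ENSRNOP00000003921",
        "10116.ENSRNOP00000008952", "10116.ENSRNOP00000008952",
        "10116.ENSRNOP00000008952",
        "10116.ENSRNOP00000001021", "10116.ENSRNOP00000004027",
    ],
    "scores": [196, 246, 196, 247, 159, 157, 439, 858, 760],
})

# value_counts returns the proteins sorted by row count, most frequent first.
order = df["proteins"].value_counts().index

# Pivot to wide form, then put the columns in that order.
wide = df.pivot(index="chemicals", columns="proteins", values="scores")
wide = wide.reindex(columns=order)
print(wide)
```
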

Regarding python - How to sort a DataFrame by the number of occurrences in a column in Python (pandas), we found a similar question on Stack Overflow: https://stackoverflow.com/questions/36402748/
