python - Quickly creating arrays from a dataframe with a large number of values?

I have a large dataframe (259 rows × 27 columns), imported from a CSV file with pandas. The index is the months from January 1996 to July 2017.

I want to split each column by year, e.g. K37L: 1996, 1997, 1998, 1999, 2000, etc.; K37M: 1996, 1997, 1998, 1999, 2000, etc.
This is my current code:

#Importing CSV
import pandas as pd
import numpy as np
df = pd.read_csv('file.csv', index_col=0, skipinitialspace=True)

#Calling a column
K37L = df['K37L']

#Filtering this column by year (from 1996 to 2017)
K37L96 = K37L.filter(regex = '1996', axis = 0); npK37L96 = np.array(K37L96)
...
...
...
K37L17 = K37L.filter(regex = '2017', axis = 0); npK37L17 = np.array(K37L17)

This produces the result I want (K37L filtered by 1996).

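However, this is tedious: I have to type out every year and every column name to get what I want, which takes a long time. Is there a faster or more elegant way to do this?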

Edit: here is the requested df.head() output:

          K37L  K37M  K37N  K37P  K37Q  K37R  K37S  K37T  K37U  K37V  ...   \
1996 Jan  78.9  79.4  71.7  36.7   0.0  88.7  94.1  90.7  80.2  98.9  ...    
1996 Feb  79.3  81.0  72.7  36.7   0.0  88.7  94.3  90.9  79.8  98.7  ...    
1996 Mar  79.8  80.4  72.7  36.7   0.0  89.0  94.6  91.0  79.6  98.6  ...    
1996 Apr  80.4  80.7  72.9  36.7   0.0  89.0  94.6  91.3  79.2  97.9  ...    
1996 May  80.6  80.7  72.9  36.7   0.0  89.1  94.7  91.9  79.2  96.6  ...    

          K385  K386   K387  K388  K389  K38A  K38B  K38C  K38D  K38E  
1996 Jan  70.9  78.7  257.8  83.9  79.7  92.2  73.8  86.4  79.6  74.0  
1996 Feb  70.7  78.7  257.2  83.9  79.8  92.6  73.7  86.6  79.9  73.9  
1996 Mar  70.9  78.7  257.3  83.9  80.1  92.6  73.8  87.2  80.1  74.0  
1996 Apr  70.8  78.9  256.6  83.9  80.4  92.7  73.9  87.9  80.7  74.0  
1996 May  70.9  78.9  256.3  83.9  80.5  92.9  73.9  88.0  80.7  74.1  

[5 rows x 27 columns]

Best answer

You can use the following approach. First, some sample data with the same shape and index labels:

np.random.seed(458)
cols = ['K37L', 'K37M', 'K37N', 'K37P', 'K37Q', 'K37R', 'K37S', 'K37T', 'K37U','K37V', 'K37W', 'K37X', 'K37Y', 'K37Z', 'K382', 'K383', 'K384', 'K385', 'K386', 'K387', 'K388', 'K389', 'K38A', 'K38B', 'K38C', 'K38D', 'K38E']
idx = pd.date_range('1996-01-01', periods=259, freq='MS').strftime('%Y %b')
df = pd.DataFrame(np.random.randint(20, size=(259,27)), index=idx, columns=cols)
print (df.head(3))
          K37L  K37M  K37N  K37P  K37Q  K37R  K37S  K37T  K37U  K37V  ...   \
1996 Jan     8    13    18     1     6     2     1    11    13     0  ...    
1996 Feb    12     0    14     0    11     0     1    10     3     4  ...    
1996 Mar     5     8     8     8     5     5     2     8     1     7  ...    

          K385  K386  K387  K388  K389  K38A  K38B  K38C  K38D  K38E  
1996 Jan    18    16     0    11    18    18    11    18    11    17  
1996 Feb     9    12    15     7     7     0    17     3     6     7  
1996 Mar    13     9     0     9     2    17    13     1    12     9  

[3 rows x 27 columns]

Create a DatetimeIndex with to_datetime:

df.index = pd.to_datetime(df.index, format='%Y %b')
print (df.head(3))
            K37L  K37M  K37N  K37P  K37Q  K37R  K37S  K37T  K37U  K37V  ...   \
1996-01-01     8    13    18     1     6     2     1    11    13     0  ...    
1996-02-01    12     0    14     0    11     0     1    10     3     4  ...    
1996-03-01     5     8     8     8     5     5     2     8     1     7  ...    

            K385  K386  K387  K388  K389  K38A  K38B  K38C  K38D  K38E  
1996-01-01    18    16     0    11    18    18    11    18    11    17  
1996-02-01     9    12    15     7     7     0    17     3     6     7  
1996-03-01    13     9     0     9     2    17    13     1    12     9  

[3 rows x 27 columns]
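
As a quick check of the conversion step, here is a minimal sketch on a hypothetical three-row frame (the name `small` and its values are illustrative, copied from the question's first rows). It assumes an English locale, since `%b` matches abbreviated month names.

```python
import pandas as pd

# Hypothetical three-row frame using the same "YYYY Mon" index labels as the question.
small = pd.DataFrame({'K37L': [78.9, 79.3, 79.8]},
                     index=['1996 Jan', '1996 Feb', '1996 Mar'])

# %Y is the four-digit year, %b the abbreviated month name (locale-dependent).
small.index = pd.to_datetime(small.index, format='%Y %b')

# Once the index is a DatetimeIndex, year and month are available as attributes.
years = small.index.year.tolist()
months = small.index.month.tolist()
print(years, months)
```

With a DatetimeIndex in place, the year no longer has to be matched by regex against string labels.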

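So, to select rows by year, use partial string indexing through .loc; to select a column, use plain []:

#selecting rows with year 2000 (older pandas also accepted df['2000']; newer releases treat that as a column lookup)
print (df.loc['2000'])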
            K37L  K37M  K37N  K37P  K37Q  K37R  K37S  K37T  K37U  K37V  ...   
2000-01-01    12    15     8    14     2     0    17     0     8    14  ...    
2000-02-01    14    10    11     4    18     1     3    12     9    11  ...    
2000-03-01     4     5    17    16    13     6    18     6    12    12  ...    
2000-04-01     2    15     3     5     6     6    17     3     1     3  ...    
2000-05-01     6    14    14     9     4     0     4    10    14    15  ...    


#selecting column K37P
print (df['K37P'])
1996-01-01     1
1996-02-01     0
1996-03-01     8
1996-04-01    11
1996-05-01    14
1996-06-01    12
1996-07-01    12
1996-08-01    14
1996-09-01     2
1996-10-01     1

To select the column first and then the year:

print (df['K37L']['2000'])
2000-01-01    12
2000-02-01    14
2000-03-01     4
2000-04-01     2
2000-05-01     6
2000-06-01    10
2000-07-01     2
2000-08-01    13
2000-09-01    18
2000-10-01     4
2000-11-01    12
2000-12-01    11
Name: K37L, dtype: int32

For a numpy array, use:

print (df['K37L']['2000'].values)
[12 14  4  2  6 10  2 13 18  4 12 11]
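
The column-then-year selection can also be written as a single `.loc` call. The sketch below assumes a stand-in frame `demo` (hypothetical, filled with 0..258 so the result is easy to verify) and uses `to_numpy()`, which is interchangeable with `.values` here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 259 monthly rows starting January 1996, values 0..258.
idx = pd.date_range('1996-01-01', periods=259, freq='MS')
demo = pd.DataFrame({'K37L': np.arange(259)}, index=idx)

# Row selection by year (partial string indexing) and column selection in one step.
arr_2000 = demo.loc['2000', 'K37L'].to_numpy()
print(arr_2000)
```

Selecting rows and the column in one `.loc` call avoids chained indexing, which matters mostly if you later assign to the selection.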

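If you need a dictionary keyed by year:

Select each year with partial string indexing, convert the result to an array with values, and collect everything in a dict comprehension:

d = {x: df.loc[str(x)].values for x in range(1996, 2018)}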

print (d[2000])
[[12 15  8 14  2  0 17  0  8 14 17 15  2  3 14 17 19  2  8  7  5  7 12 13
  17  7  4]
 [14 10 11  4 18  1  3 12  9 11  8  3 12 19 19 15  7 19 14 12  5 19 14 15
   7 11  7]
 [ 4  5 17 16 13  6 18  6 12 12  7 15  3 16  2 18 14 18 15  8  5  9  3  7
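
If one array per column and per year is needed, as in the question, a nested dictionary is one option. This is a sketch under assumptions, not part of the original answer: `demo`, `arrays`, and the three-column subset are hypothetical, and grouping on `index.year` is swapped in for the per-year string lookup.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with three of the question's column names.
idx = pd.date_range('1996-01-01', periods=259, freq='MS')
cols = ['K37L', 'K37M', 'K37N']
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.integers(0, 20, size=(259, 3)), index=idx, columns=cols)

# Outer key: column name. Inner key: year. Value: that column's values for that year.
arrays = {
    col: {int(year): grp.to_numpy()
          for year, grp in demo[col].groupby(demo.index.year)}
    for col in demo.columns
}

print(arrays['K37L'][1996])
```

`arrays['K37L'][1996]` then plays the role of `npK37L96` from the question, without a separate variable for each column and year.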

This question and answer come from Stack Overflow: https://stackoverflow.com/questions/46217242/
