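I have a large dataframe with many values (259 rows × 27 columns), imported from a CSV file with pandas. The index is the month, running from January 1996 to July 2017.
I want to split each column by year, e.g. K37L: 1996, 1997, 1998, 1999, 2000, etc.; K37M: 1996, 1997, 1998, 1999, 2000, etc.
This is my current code: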
# Importing CSV
import pandas as pd
import numpy as np
df = pd.read_csv('file.csv', index_col=0, skipinitialspace=True)
# Calling a column
K37L = df['K37L']
# Filtering this column by year (from 1996 to 2017)
K37L96 = K37L.filter(regex='1996', axis=0)
npK37L96 = np.array(K37L96)
...
...
...
K37L17 = K37L.filter(regex='2017', axis=0)
npK37L17 = np.array(K37L17)
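This produces the result I want: K37L filtered by 1996
However, it is a tedious process, because I have to type out every year and every column name to get what I want, which will take a long time. Is there a faster or more elegant way to do this?
Edit: here is the requested df.head() output: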
K37L K37M K37N K37P K37Q K37R K37S K37T K37U K37V ... \
1996 Jan 78.9 79.4 71.7 36.7 0.0 88.7 94.1 90.7 80.2 98.9 ...
1996 Feb 79.3 81.0 72.7 36.7 0.0 88.7 94.3 90.9 79.8 98.7 ...
1996 Mar 79.8 80.4 72.7 36.7 0.0 89.0 94.6 91.0 79.6 98.6 ...
1996 Apr 80.4 80.7 72.9 36.7 0.0 89.0 94.6 91.3 79.2 97.9 ...
1996 May 80.6 80.7 72.9 36.7 0.0 89.1 94.7 91.9 79.2 96.6 ...
K385 K386 K387 K388 K389 K38A K38B K38C K38D K38E
1996 Jan 70.9 78.7 257.8 83.9 79.7 92.2 73.8 86.4 79.6 74.0
1996 Feb 70.7 78.7 257.2 83.9 79.8 92.6 73.7 86.6 79.9 73.9
1996 Mar 70.9 78.7 257.3 83.9 80.1 92.6 73.8 87.2 80.1 74.0
1996 Apr 70.8 78.9 256.6 83.9 80.4 92.7 73.9 87.9 80.7 74.0
1996 May 70.9 78.9 256.3 83.9 80.5 92.9 73.9 88.0 80.7 74.1
[5 rows x 27 columns]
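Best answer
You can use the following. First, sample data with the same shape and index labels:
import numpy as np
import pandas as pd
np.random.seed(458)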
cols = ['K37L', 'K37M', 'K37N', 'K37P', 'K37Q', 'K37R', 'K37S', 'K37T', 'K37U','K37V', 'K37W', 'K37X', 'K37Y', 'K37Z', 'K382', 'K383', 'K384', 'K385', 'K386', 'K387', 'K388', 'K389', 'K38A', 'K38B', 'K38C', 'K38D', 'K38E']
idx = pd.date_range('1996-01-01', periods=259, freq='MS').strftime('%Y %b')
df = pd.DataFrame(np.random.randint(20, size=(259,27)), index=idx, columns=cols)
print (df.head(3))
K37L K37M K37N K37P K37Q K37R K37S K37T K37U K37V ... \
1996 Jan 8 13 18 1 6 2 1 11 13 0 ...
1996 Feb 12 0 14 0 11 0 1 10 3 4 ...
1996 Mar 5 8 8 8 5 5 2 8 1 7 ...
K385 K386 K387 K388 K389 K38A K38B K38C K38D K38E
1996 Jan 18 16 0 11 18 18 11 18 11 17
1996 Feb 9 12 15 7 7 0 17 3 6 7
1996 Mar 13 9 0 9 2 17 13 1 12 9
[3 rows x 27 columns]
Create a DatetimeIndex with to_datetime:
df.index = pd.to_datetime(df.index, format='%Y %b')
print (df.head(3))
K37L K37M K37N K37P K37Q K37R K37S K37T K37U K37V ... \
1996-01-01 8 13 18 1 6 2 1 11 13 0 ...
1996-02-01 12 0 14 0 11 0 1 10 3 4 ...
1996-03-01 5 8 8 8 5 5 2 8 1 7 ...
K385 K386 K387 K388 K389 K38A K38B K38C K38D K38E
1996-01-01 18 16 0 11 18 18 11 18 11 17
1996-02-01 9 12 15 7 7 0 17 3 6 7
1996-03-01 13 9 0 9 2 17 13 1 12 9
[3 rows x 27 columns]
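The conversion step can be checked on a tiny frame. The sketch below is only an illustration: the three-row frame `small` is a hypothetical stand-in for the real CSV data, and it assumes an English locale, since `%b` matches abbreviated month names in the current locale.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the real CSV data
small = pd.DataFrame({'K37L': [78.9, 79.3, 79.8]},
                     index=['1996 Jan', '1996 Feb', '1996 Mar'])

# Parse "YYYY Mon" labels; each label becomes the first day of that month
small.index = pd.to_datetime(small.index, format='%Y %b')

# The index is now a DatetimeIndex, so date attributes such as .year work
is_datetime = isinstance(small.index, pd.DatetimeIndex)
years = small.index.year.tolist()
print(is_datetime, years)
```

Once the index is a DatetimeIndex, the year-based selections in the rest of the answer become possible.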
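So, to select rows by year, use partial string indexing, and to select columns, use []. In older pandas versions both worked with the same df[...] syntax, but selecting rows with a date string through df[...] was deprecated and, as of pandas 2.0, removed, so .loc is used for rows here:
# selecting rows with year 2000
print (df.loc['2000'])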
K37L K37M K37N K37P K37Q K37R K37S K37T K37U K37V ...
2000-01-01 12 15 8 14 2 0 17 0 8 14 ...
2000-02-01 14 10 11 4 18 1 3 12 9 11 ...
2000-03-01 4 5 17 16 13 6 18 6 12 12 ...
2000-04-01 2 15 3 5 6 6 17 3 1 3 ...
2000-05-01 6 14 14 9 4 0 4 10 14 15 ...
#selecting column K37P
print (df['K37P'])
1996-01-01 1
1996-02-01 0
1996-03-01 8
1996-04-01 11
1996-05-01 14
1996-06-01 12
1996-07-01 12
1996-08-01 14
1996-09-01 2
1996-10-01 1
To select a column first and then a year:
print (df['K37L']['2000'])
2000-01-01 12
2000-02-01 14
2000-03-01 4
2000-04-01 2
2000-05-01 6
2000-06-01 10
2000-07-01 2
2000-08-01 13
2000-09-01 18
2000-10-01 4
2000-11-01 12
2000-12-01 11
Name: K37L, dtype: int32
For a numpy array, use:
print (df['K37L']['2000'].values)
[12 14 4 2 6 10 2 13 18 4 12 11]
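As a possible variation, the row and column selection can be done in a single .loc call, and `to_numpy()` can be used in place of `.values`. This is a minimal sketch on a hypothetical four-row frame named `demo`, not the data from the question:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: Nov 1999 to Feb 2000, two of the question's column names
idx = pd.date_range('1999-11-01', periods=4, freq='MS')
demo = pd.DataFrame({'K37L': [1, 2, 3, 4], 'K37M': [5, 6, 7, 8]}, index=idx)

# Rows for the year 2000 and column K37L in one step, then convert to an array
arr = demo.loc['2000', 'K37L'].to_numpy()
print(arr)
```

Selecting through one .loc call avoids chained indexing, which matters mostly when assigning values rather than reading them.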
If you need a dictionary keyed by year, use a dict comprehension: select each year via partial string indexing, then convert it to an array with values:
d = {x: df.loc[str(x)].values for x in range(1996, 2018)}
print (d[2000])
[[12 15 8 14 2 0 17 0 8 14 17 15 2 3 14 17 19 2 8 7 5 7 12 13
17 7 4]
[14 10 11 4 18 1 3 12 9 11 8 3 12 19 19 15 7 19 14 12 5 19 14 15
7 11 7]
[ 4 5 17 16 13 6 18 6 12 12 7 15 3 16 2 18 14 18 15 8 5 9 3 7
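To get one array per column and per year without typing any year or column name, one option is to group by the year of the index. The sketch below is an assumption-laden illustration rather than part of the original answer: `demo`, `by_year` and `arrays` are hypothetical names, and only three of the 27 columns are generated.

```python
import numpy as np
import pandas as pd

# Hypothetical data: same 259 monthly rows as the question, three columns only
idx = pd.date_range('1996-01-01', periods=259, freq='MS')
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.integers(0, 20, size=(259, 3)),
                    index=idx, columns=['K37L', 'K37M', 'K37N'])

# One sub-frame per calendar year, keyed by a plain Python int
by_year = {int(year): grp for year, grp in demo.groupby(demo.index.year)}

# Nested dict: arrays[column][year] is a 1-D numpy array
arrays = {col: {year: grp[col].to_numpy() for year, grp in by_year.items()}
          for col in demo.columns}

print(arrays['K37L'][2000])
```

With this layout, `arrays['K37L'][1996]` plays the role of `npK37L96` from the question, and the last year holds only the seven months up to July 2017.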
Regarding "python - Quickly creating arrays from a dataframe with a large number of values?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/46217242/