python - Pandas DataFrame join by group

Tags: python join dataframe pandas

Given the following DataFrame, where the same STATION has 2 unique DBKEYs, I need to create a new DataFrame with 2 separate VAL columns (VAL1, VAL2) that belong to the same STATION.

    DBKEY  STATION           DAILY_DATE      VAL   
0   T9947  G377C_C  2011-10-01 00:00:00   17.123   
1   T9947  G377C_C  2011-10-02 00:00:00      NaN  
2   T9947  G377C_C  2011-10-03 00:00:00      NaN  
3   T9947  G377C_C  2011-10-04 00:00:00      NaN  
4   T9947  G377C_C  2011-10-05 00:00:00      NaN   
5   T9947  G377C_C  2011-10-06 00:00:00      NaN   
6   T9947  G377C_C  2011-10-07 00:00:00      NaN     
7   T9947  G377C_C  2011-10-08 00:00:00      NaN     
8   T9947  G377C_C  2011-10-09 00:00:00   92.734   
9   T9947  G377C_C  2011-10-10 00:00:00   48.975   
10  T9947  G377C_C  2011-10-11 00:00:00   17.463   
11  T9947  G377C_C  2011-10-12 00:00:00      NaN  
12  T9947  G377C_C  2011-10-13 00:00:00      NaN   
13  T9947  G377C_C  2011-10-14 00:00:00   12.870   
14  T9947  G377C_C  2011-10-15 00:00:00      NaN    
15  T9947  G377C_C  2011-10-16 00:00:00   48.138   
16  T9947  G377C_C  2011-10-17 00:00:00    0.413   
17  T9947  G377C_C  2011-10-18 00:00:00   39.058  
18  T9947  G377C_C  2011-10-19 00:00:00  235.617  
19  T9947  G377C_C  2011-10-20 00:00:00  182.989  
20  T9947  G377C_C  2011-10-21 00:00:00  132.193  
21  T9947  G377C_C  2011-10-22 00:00:00   19.557   
22  T9947  G377C_C  2011-10-23 00:00:00      NaN   
23  T9947  G377C_C  2011-10-24 00:00:00   80.552  
24  T9947  G377C_C  2011-10-25 00:00:00      NaN   
25  T9947  G377C_C  2011-10-26 00:00:00      NaN   
26  T9947  G377C_C  2011-10-27 00:00:00   39.258   
27  T9947  G377C_C  2011-10-28 00:00:00      NaN    
28  T9947  G377C_C  2011-10-29 00:00:00  253.969  
29  T9947  G377C_C  2011-10-30 00:00:00  319.685  
30  T9947  G377C_C  2011-10-31 00:00:00  303.855  
31  W3972  G377C_C  2011-10-01 00:00:00   17.120   
32  W3972  G377C_C  2011-10-02 00:00:00      NaN    
33  W3972  G377C_C  2011-10-03 00:00:00      NaN   
34  W3972  G377C_C  2011-10-04 00:00:00      NaN    
35  W3972  G377C_C  2011-10-05 00:00:00      NaN    
36  W3972  G377C_C  2011-10-06 00:00:00      NaN    
37  W3972  G377C_C  2011-10-07 00:00:00      NaN    
38  W3972  G377C_C  2011-10-08 00:00:00      NaN    
39  W3972  G377C_C  2011-10-09 00:00:00   92.730  
40  W3972  G377C_C  2011-10-10 00:00:00   48.980  
41  W3972  G377C_C  2011-10-11 00:00:00   17.460   
42  W3972  G377C_C  2011-10-12 00:00:00      NaN    
43  W3972  G377C_C  2011-10-13 00:00:00      NaN    
44  W3972  G377C_C  2011-10-14 00:00:00   12.870   
45  W3972  G377C_C  2011-10-15 00:00:00      NaN    
46  W3972  G377C_C  2011-10-16 00:00:00   48.140   
47  W3972  G377C_C  2011-10-17 00:00:00    0.410   
48  W3972  G377C_C  2011-10-18 00:00:00   39.060   
49  W3972  G377C_C  2011-10-19 00:00:00  235.620  
50  W3972  G377C_C  2011-10-20 00:00:00  182.990  
51  W3972  G377C_C  2011-10-21 00:00:00  132.190  
52  W3972  G377C_C  2011-10-22 00:00:00   19.560   
53  W3972  G377C_C  2011-10-23 00:00:00      NaN  
54  W3972  G377C_C  2011-10-24 00:00:00   80.550   
55  W3972  G377C_C  2011-10-25 00:00:00      NaN   
56  W3972  G377C_C  2011-10-26 00:00:00      NaN    
57  W3972  G377C_C  2011-10-27 00:00:00   39.260   
58  W3972  G377C_C  2011-10-28 00:00:00      NaN    
59  W3972  G377C_C  2011-10-29 00:00:00  253.970  
60  W3972  G377C_C  2011-10-30 00:00:00  319.690  
61  W3972  G377C_C  2011-10-31 00:00:00  303.860  

So, I need the result to only have 31 rows, with STATION and VAL1 (first set of DBKEYs) and VAL2 (second set of DBKEYs).

STATION     DAILY_DATE  VAL1      VAL2
G377C_C     10/1/2011   17.123    17.12
G377C_C     10/2/2011   NaN       NaN
G377C_C     10/3/2011   NaN       NaN
G377C_C     10/4/2011   NaN       NaN
G377C_C     10/5/2011   NaN       NaN
G377C_C     10/6/2011   NaN       NaN
G377C_C     10/7/2011   NaN       NaN
G377C_C     10/8/2011   NaN       NaN
G377C_C     10/9/2011   92.734    92.73
G377C_C     10/10/2011  48.975    48.98
G377C_C     10/11/2011  17.463    17.46
G377C_C     10/12/2011  NaN       NaN
G377C_C     10/13/2011  NaN       NaN
G377C_C     10/14/2011  12.87     12.87
G377C_C     10/15/2011  NaN       NaN
G377C_C     10/16/2011  48.138    48.14
G377C_C     10/17/2011  0.413     0.41
G377C_C     10/18/2011  39.058    39.06
G377C_C     10/19/2011  235.617   235.62
G377C_C     10/20/2011  182.989   182.99
G377C_C     10/21/2011  132.193   132.19
G377C_C     10/22/2011  19.557    19.56
G377C_C     10/23/2011  NaN       NaN
G377C_C     10/24/2011  80.552    80.55
G377C_C     10/25/2011  NaN       NaN
G377C_C     10/26/2011  NaN       NaN
G377C_C     10/27/2011  39.258    39.26
G377C_C     10/28/2011  NaN       NaN
G377C_C     10/29/2011  253.969   253.97
G377C_C     10/30/2011  319.685   319.69
G377C_C     10/31/2011  303.855   303.86

Best answer

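If I understand you correctly, this looks quite straightforward. Set STATION, DBKEY and DAILY_DATE as the index, then unstack() the DBKEY level so that each DBKEY becomes its own column: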

In [1]: import numpy as np; from pandas import DataFrame

In [2]: df = DataFrame({
   ...:     "DAILY_DATE": ['2011-10-01 00:00:00', '2011-10-02 00:00:00', '2011-10-03 00:00:00',
   ...:                    '2011-10-01 00:00:00', '2011-10-02 00:00:00', '2011-10-03 00:00:00'],
   ...:     "DBKEY": ['T9947', 'T9947', 'T9947', 'W3972', 'W3972', 'W3972'],
   ...:     "STATION": ['G377C_C', 'G377C_C', 'G377C_C', 'G377C_C', 'G377C_C', 'G377C_C'],
   ...:     # use real floats and np.nan rather than the strings 'NaN' / '17.120'
   ...:     "VAL": [17.123, np.nan, np.nan, 17.120, np.nan, np.nan]})
In [3]: df
Out[3]:
            DAILY_DATE  DBKEY  STATION     VAL
0  2011-10-01 00:00:00  T9947  G377C_C  17.123
1  2011-10-02 00:00:00  T9947  G377C_C     NaN
2  2011-10-03 00:00:00  T9947  G377C_C     NaN
3  2011-10-01 00:00:00  W3972  G377C_C  17.120
4  2011-10-02 00:00:00  W3972  G377C_C     NaN
5  2011-10-03 00:00:00  W3972  G377C_C     NaN

In [4]: df2 = df.set_index(["STATION", "DBKEY", "DAILY_DATE"])
In [5]: df2
Out[5]:
                                      VAL
STATION DBKEY DAILY_DATE                 
G377C_C T9947 2011-10-01 00:00:00  17.123
              2011-10-02 00:00:00     NaN
              2011-10-03 00:00:00     NaN
        W3972 2011-10-01 00:00:00  17.120
              2011-10-02 00:00:00     NaN
              2011-10-03 00:00:00     NaN

In [6]: df3 = df2.unstack(level=1)
In [7]: df3
Out[7]: 
                                VAL        
DBKEY                         T9947   W3972
STATION DAILY_DATE                         
G377C_C 2011-10-01 00:00:00  17.123  17.120
        2011-10-02 00:00:00     NaN     NaN
        2011-10-03 00:00:00     NaN     NaN
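
The unstacked result above still carries the DBKEY values as a second column level rather than the VAL1 / VAL2 names asked for in the question. A minimal sketch of one way to get there follows. The `slot` column and the `widen_by_station` helper are hypothetical names introduced for this illustration, and the sketch assumes each (STATION, DBKEY, DAILY_DATE) combination occurs at most once, since `unstack()` raises on duplicate index entries:

```python
import numpy as np
import pandas as pd

# Same sample rows as in the answer above, with DAILY_DATE parsed as datetimes.
df = pd.DataFrame({
    "DBKEY": ["T9947"] * 3 + ["W3972"] * 3,
    "STATION": ["G377C_C"] * 6,
    "DAILY_DATE": pd.to_datetime(["2011-10-01", "2011-10-02", "2011-10-03"] * 2),
    "VAL": [17.123, np.nan, np.nan, 17.120, np.nan, np.nan],
})


def widen_by_station(frame):
    """Hypothetical helper: one row per (STATION, DAILY_DATE), one VALn column per DBKEY."""
    # Number each DBKEY within its STATION by order of first appearance (1, 2, ...).
    keys = frame[["STATION", "DBKEY"]].drop_duplicates()
    keys["slot"] = keys.groupby("STATION").cumcount() + 1
    numbered = frame.merge(keys, on=["STATION", "DBKEY"])

    # Selecting "VAL" first yields a Series, so unstack gives flat column labels.
    wide = numbered.set_index(["STATION", "DAILY_DATE", "slot"])["VAL"].unstack("slot")
    wide.columns = [f"VAL{n}" for n in wide.columns]
    return wide.reset_index()


wide = widen_by_station(df)
print(wide)
```

Numbering the DBKEYs per STATION, instead of unstacking DBKEY directly, is a design choice for the case where several stations each have their own pair of DBKEYs: unstacking DBKEY itself would then produce one column per distinct DBKEY across the whole frame rather than just VAL1 and VAL2.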

Regarding "python - Pandas DataFrame join by group", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/16241402/
