python - How do I combine a boolean indexer with a MultiIndex in Pandas?

Tags: python pandas dataframe multi-index

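I have a MultiIndex DataFrame and want to extract a subset based on both index values and a boolean criterion. I want to use MultiIndex keys together with a boolean indexer to select the records to modify, and then overwrite their values with a specific new value.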

import numpy as np
import pandas as pd

years        = [1994, 1995, 1996]
householdIDs = list(range(1, 100))  # household IDs 1..99

# Two-level row index: every (Year, HouseholdID) combination
midx = pd.MultiIndex.from_product([years, householdIDs], names=['Year', 'HouseholdID'])

# Random sample data, one row per index entry
householdIncomes = np.random.randint(10000, 100000, size=len(midx))
householdSize    = np.random.randint(1, 5, size=len(midx))
df = pd.DataFrame({'HouseholdIncome': householdIncomes, 'HouseholdSize': householdSize}, index=midx)
df.sort_index(inplace=True)

The sample data looks like this...

  df.head()
=>                   HouseholdIncome  HouseholdSize
Year HouseholdID                                
1994 1                      23866              3
     2                      57956              3
     3                      21644              3
     4                      71912              4
     5                      83663              3

I can successfully query the DataFrame using index and column labels.

This example gives me the HouseholdSize of household 3 in 1996:

   df.loc[  (1996,3 ) , 'HouseholdSize' ]
=> 1

However, I cannot combine a boolean selection with a MultiIndex query...

The pandas docs on multi-indexing say there is a way to combine boolean indexing with a MultiIndex, and give an example...

In [52]: idx = pd.IndexSlice
In [56]: mask = dfmi[('a','foo')]>200

In [57]: dfmi.loc[idx[mask,:,['C1','C3']],idx[:,'foo']]
Out[57]: 
lvl0           a    b
lvl1         foo  foo
A3 B0 C1 D1  204  206
      C3 D0  216  218
         D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

...but I cannot seem to reproduce it on my own DataFrame:

    idx = pd.IndexSlice
    housholdSizeAbove2 = ( df.HouseholdSize > 2 )
    df.loc[ idx[ housholdSizeAbove2, 1996, :] , 'HouseholdSize' ] 
Traceback (most recent call last):
  File "python", line 1, in <module>
KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (3), lexsort depth (2)'

In this example, I want to see all households in 1996 whose household size is greater than 2.

Best answer

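DataFrame.query() should work in this case, because its expression can refer to index level names and column names alike. The demo below filters on the HouseholdID index level; using the HouseholdSize column in the same position gives the household-size filter asked for in the question.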

df.query("Year == 1996 and HouseholdID > 2")

Demo:

In [326]: with pd.option_context('display.max_rows',20):
     ...:     print(df.query("Year == 1996 and HouseholdID > 2"))
     ...:
                  HouseholdIncome  HouseholdSize
Year HouseholdID
1996 3                      28664              4
     4                      11057              1
     5                      36321              2
     6                      89469              4
     7                      35711              2
     8                      85741              1
     9                      34758              3
     10                     56085              2
     11                     32275              4
     12                     77096              4
...                           ...            ...
     90                     40276              4
     91                     10594              2
     92                     61080              4
     93                     65334              2
     94                     21477              4
     95                     83112              4
     96                     25627              2
     97                     24830              4
     98                     85693              1
     99                     84653              4

[97 rows x 2 columns]
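
As a minimal sketch of the same pattern applied to the HouseholdSize column from the question: the frame below is rebuilt with a seeded generator so the run is repeatable (the seed and the name `large_1996` are assumptions for illustration, not part of the original post).

```python
import numpy as np
import pandas as pd

# Rebuild a frame shaped like the one in the question (seeded so it is repeatable).
rng = np.random.default_rng(0)
years = [1994, 1995, 1996]
household_ids = list(range(1, 100))
midx = pd.MultiIndex.from_product([years, household_ids], names=["Year", "HouseholdID"])
df = pd.DataFrame(
    {
        "HouseholdIncome": rng.integers(10000, 100000, size=len(midx)),
        "HouseholdSize": rng.integers(1, 5, size=len(midx)),
    },
    index=midx,
)

# "Year" is an index level and "HouseholdSize" is a column;
# query() resolves both by name inside one expression.
large_1996 = df.query("Year == 1996 and HouseholdSize > 2")
print(large_1996.head())
```

Every row returned should belong to 1996 and have a household size of 3 or 4.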

Update:

Is there a way to select a specific column?

In [333]: df.loc[df.eval("Year == 1996 and HouseholdID > 2"), 'HouseholdIncome']
Out[333]:
Year  HouseholdID
1996  3              28664
      4              11057
      5              36321
      6              89469
      7              35711
      8              85741
      9              34758
      10             56085
      11             32275
      12             77096
                     ...
      90             40276
      91             10594
      92             61080
      93             65334
      94             21477
      95             83112
      96             25627
      97             24830
      98             85693
      99             84653
Name: HouseholdIncome, dtype: int32
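
The KeyError in the question reports a tuple of length 3 against a lexsort depth of 2, which is consistent with passing three slicers to an index that has only two levels. One way to sidestep IndexSlice altogether, sketched here under the assumption of a seeded copy of the question's frame, is to build a single boolean mask from `get_level_values` and the column condition and hand it to `.loc` (the names `year_is_1996`, `size_above_2` and `sizes` are illustrative).

```python
import numpy as np
import pandas as pd

# Seeded stand-in for the question's DataFrame.
rng = np.random.default_rng(0)
midx = pd.MultiIndex.from_product(
    [[1994, 1995, 1996], list(range(1, 100))], names=["Year", "HouseholdID"]
)
df = pd.DataFrame(
    {
        "HouseholdIncome": rng.integers(10000, 100000, size=len(midx)),
        "HouseholdSize": rng.integers(1, 5, size=len(midx)),
    },
    index=midx,
)

# Condition on an index level: compare the level's values row by row.
year_is_1996 = df.index.get_level_values("Year") == 1996
# Condition on a column: an ordinary boolean comparison, as a plain array.
size_above_2 = (df["HouseholdSize"] > 2).to_numpy()

# Both arrays have one entry per row, so they combine with & and
# can be passed to .loc together with a column label.
sizes = df.loc[year_is_1996 & size_above_2, "HouseholdSize"]
print(sizes.head())
```

This selects the same rows as the eval-based mask, without building an expression string.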

and ultimately I want to overwrite the data on the dataframe.

In [331]: df.loc[df.eval("Year == 1996 and HouseholdID > 2"), 'HouseholdSize'] *= 10

In [332]: df.loc[df.eval("Year == 1996 and HouseholdID > 2")]
Out[332]:
                  HouseholdIncome  HouseholdSize
Year HouseholdID
1996 3                      28664             40
     4                      11057             10
     5                      36321             20
     6                      89469             40
     7                      35711             20
     8                      85741             10
     9                      34758             30
     10                     56085             20
     11                     32275             40
     12                     77096             40
...                           ...            ...
     90                     40276             40
     91                     10594             20
     92                     61080             40
     93                     65334             20
     94                     21477             40
     95                     83112             40
     96                     25627             20
     97                     24830             40
     98                     85693             10
     99                     84653             40

[97 rows x 2 columns]
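
The question also asks about overwriting the selected records with a specific new value rather than scaling them. A minimal sketch, assuming a seeded copy of the frame and an arbitrary replacement value of 2 chosen only for illustration:

```python
import numpy as np
import pandas as pd

# Seeded stand-in for the question's DataFrame.
rng = np.random.default_rng(0)
midx = pd.MultiIndex.from_product(
    [[1994, 1995, 1996], list(range(1, 100))], names=["Year", "HouseholdID"]
)
df = pd.DataFrame(
    {
        "HouseholdIncome": rng.integers(10000, 100000, size=len(midx)),
        "HouseholdSize": rng.integers(1, 5, size=len(midx)),
    },
    index=midx,
)
original = df.copy()  # kept only so the change can be inspected afterwards

# Compute the mask once, before modifying the column it depends on.
mask = df.eval("Year == 1996 and HouseholdSize > 2")
n_selected = int(mask.sum())

# Assign a single value to the selected rows of one column, in place.
df.loc[mask, "HouseholdSize"] = 2
print(n_selected, "rows overwritten")
```

Storing the mask in a variable first matters here: re-evaluating the expression after the assignment would no longer match the rows that were just changed.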

Update 2:

I want to pass a variable year instead of a specific value. Is there a cleaner way to do it than "Year == " + str(year) + " and HouseholdID > " + str(householdSize) ?

In [5]: year = 1996

In [6]: household_ids = [1, 2, 98, 99]

In [7]: df.loc[df.eval("Year == @year and HouseholdID in @household_ids")]
Out[7]:
                  HouseholdIncome  HouseholdSize
Year HouseholdID
1996 1                      42217              1
     2                      66009              3
     98                     33121              4
     99                     45489              3
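
The `@` prefix also picks up local variables inside a function, so the filter can be wrapped in a small helper. The function `select_households` and its parameters are hypothetical names introduced for this sketch, and the frame is again a seeded stand-in for the question's data.

```python
import numpy as np
import pandas as pd

# Seeded stand-in for the question's DataFrame.
rng = np.random.default_rng(0)
midx = pd.MultiIndex.from_product(
    [[1994, 1995, 1996], list(range(1, 100))], names=["Year", "HouseholdID"]
)
df = pd.DataFrame(
    {
        "HouseholdIncome": rng.integers(10000, 100000, size=len(midx)),
        "HouseholdSize": rng.integers(1, 5, size=len(midx)),
    },
    index=midx,
)


def select_households(frame, year, min_size):
    """Hypothetical helper: rows of `frame` for `year` with HouseholdSize above `min_size`."""
    # @year and @min_size refer to this function's arguments,
    # so no string concatenation is needed.
    return frame.query("Year == @year and HouseholdSize > @min_size")


result = select_households(df, 1995, 3)
print(result.head())
```

Passing values through `@` keeps the expression string constant and avoids formatting numbers into it by hand.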

Regarding "python - How do I combine a boolean indexer with a MultiIndex in Pandas?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42193544/
