python - Pandas: How to select only the groups with a small within-group standard deviation?

Tags: python pandas

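I have a dataframe with roughly 100 rows per group ID. I want to group by group ID and then keep only the groups where the standard deviation of certain columns is below a threshold. I am using the following code: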

# df is the dataframe with all rows
# group on groupID
df_grouped = df.groupby('groupID')

# this gives a table with groupID and the std within a group 
df_grouped_std = df_grouped.std() 

# from the df with standard deviations, I select only the groups 
# where the standard deviation is within limits
selection = df_grouped_std[df_grouped_std['col1']<1][df_grouped_std['col2']<0.05]

# now I try to select from the original dataframe 'df_grouped' the groups that were selected in the previous step.
df_plot = df_grouped[selection]

Stack trace:

   Traceback (most recent call last):

  File "<ipython-input-72-2cd045ecb262>", line 1, in <module>
    runfile('C:/Documents and Settings/a708818/Desktop/coloredByRol.py', wdir='C:/Documents and Settings/a708818/Desktop')

  File "C:\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 682, in runfile
    execfile(filename, namespace)

  File "C:\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 71, in execfile
    exec(compile(scripttext, filename, 'exec'), glob, loc)

  File "C:/Documents and Settings/a708818/Desktop/coloredByRol.py", line 50, in <module>
    df_plot = df_grouped[selection]

  File "C:\Anaconda\lib\site-packages\pandas\core\groupby.py", line 3170, in __getitem__
    if key not in self.obj:

  File "C:\Anaconda\lib\site-packages\pandas\core\generic.py", line 688, in __contains__
    return key in self._info_axis

  File "C:\Anaconda\lib\site-packages\pandas\core\index.py", line 885, in __contains__
    hash(key)

  File "C:\Anaconda\lib\site-packages\pandas\core\generic.py", line 647, in __hash__
    ' hashed'.format(self.__class__.__name__))

TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed

I don't know how to select the data I want. Any hints?

Best answer

I think you can use:

df_grouped = df.groupby('groupID')
# get the std of every column per group
df_grouped_std = df_grouped.std()
print (df_grouped_std)
# combine both conditions into one boolean mask with &
selection = df_grouped_std[(df_grouped_std['col1'] < 1) & (df_grouped_std['col2'] < 0.05)]
print (selection)

# select all rows of the original df whose groupID is in the index of 'selection'
df_plot = df[df.groupID.isin(selection.index)]
print (df_plot)

Sample:

import pandas as pd

df = pd.DataFrame({'groupID':[1,1,1,2,3,3,2],
                   'col1':[5,3,6,4,7,8,9],
                   'col2':[7,8,9,1,2,3,8]})

print (df)
   col1  col2  groupID
0     5     7        1
1     3     8        1
2     6     9        1
3     4     1        2
4     7     2        3
5     8     3        3
6     9     8        2
df_grouped = df.groupby('groupID')
# standard deviation of every column within each group
df_grouped_std = df_grouped.std() 
print (df_grouped_std)
             col1      col2
groupID                    
1        1.527525  1.000000
2        3.535534  4.949747
3        0.707107  0.707107

# conditions changed for this small sample only
selection = df_grouped_std[ (df_grouped_std['col1']>1) & (df_grouped_std['col2']>3)]
print (selection)
             col1      col2
groupID                    
2        3.535534  4.949747

# keep the rows of df whose groupID is in the index of 'selection'
df_plot = df[df.groupID.isin(selection.index)]
print (df_plot)
   col1  col2  groupID
3     4     1        2
6     9     8        2
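
A variant of the same idea, offered as a sketch and not as part of the original answer: `transform('std')` returns the per-group standard deviation repeated on every row, so the mask lines up with `df` directly and no `isin` step is needed. The thresholds below (`< 1` for both columns) are an assumption chosen so that the small sample keeps one group; they are not the limits from the question.

```python
import pandas as pd

df = pd.DataFrame({'groupID': [1, 1, 1, 2, 3, 3, 2],
                   'col1': [5, 3, 6, 4, 7, 8, 9],
                   'col2': [7, 8, 9, 1, 2, 3, 8]})

# std of each column within the row's own group, aligned to df's index
row_std = df.groupby('groupID')[['col1', 'col2']].transform('std')

# illustrative thresholds (assumed for this sample, not the question's limits)
mask = (row_std['col1'] < 1) & (row_std['col2'] < 1)

# rows whose group satisfies both conditions
df_plot = df[mask]
print(df_plot)
```

With this sample only group 3 passes both conditions, so `df_plot` holds rows 4 and 5.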

Edit:

Another possible solution is to use filter:

print (df.groupby('groupID')
         .filter(lambda x: (x.col1.std() > 1) & (x.col2.std() > 3)))

   col1  col2  groupID
3     4     1        2
6     9     8        2
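
If the limits vary between runs, the `filter` call can be wrapped in a small function. This is only a sketch: `keep_low_std_groups` and its `limits` dictionary are hypothetical names introduced here, and the limits used in the call are assumed values for the sample data, not the ones from the question.

```python
import pandas as pd

def keep_low_std_groups(df, key, limits):
    """Return the rows of df belonging to groups whose std is below every limit.

    limits maps a column name to the maximum allowed std (hypothetical helper).
    """
    def within_limits(group):
        # True only if every listed column has a std below its limit
        return all(group[col].std() < limit for col, limit in limits.items())

    return df.groupby(key).filter(within_limits)

df = pd.DataFrame({'groupID': [1, 1, 1, 2, 3, 3, 2],
                   'col1': [5, 3, 6, 4, 7, 8, 9],
                   'col2': [7, 8, 9, 1, 2, 3, 8]})

# assumed limits for the sample; for the question they would be 1 and 0.05
df_plot = keep_low_std_groups(df, 'groupID', {'col1': 1, 'col2': 1})
print(df_plot)
```

Groups with a single row have a std of NaN, and a comparison with NaN is False, so such groups are dropped by this function.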

This post is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/40926610/
