python - How to remove 1 row from the first 2 classes of a 3-class dataframe?

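I have a dataframe with 3 classes (1, 2, 3). Each class has 4 samples. But I want classes 1 and 2 to have only 3 samples each, so I need to remove 1 row from each of those two classes. It can be any row.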

With my attempt I can only remove the first row of the first class. How can I improve it?

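import numpy as np
import pandas as pd

# The dataframe
df = pd.DataFrame(np.random.rand(12, 5))
label = np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])

df['label'] = label


# My attempt
# (`id` is Python's built-in function here, so `id == 1` is just False
# and no rows are selected by class)
df1 = df.drop(id == 1)
df1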

Edit: Alternatively, I could sample 3 rows from class 1 and from class 2, and keep all 4 samples from class 3. For that case, my attempt is:

df1 = pd.concat(g.sample(3) for idx, g in df.groupby('label'))

But... it samples 3 rows from every class!

Original dataframe: [image not preserved]

What I need: [image not preserved]

Best answer

A better and simpler solution is to filter with an if ... else expression inside the generator expression passed to pd.concat:

df1 = pd.concat(g.sample(3) if g.label.isin([1,2]).all() else g 
                for idx, g in df.groupby('label') )
print (df1)
           0         1         2         3         4  label
3   0.978624  0.811683  0.171941  0.816225  0.274074      1
1   0.121569  0.670749  0.825853  0.136707  0.575093      1
0   0.543405  0.278369  0.424518  0.844776  0.004719      1
4   0.431704  0.940030  0.817649  0.336112  0.175410      2
7   0.890412  0.980921  0.059942  0.890546  0.576901      2
5   0.372832  0.005689  0.252426  0.795663  0.015255      2
8   0.742480  0.630184  0.581842  0.020439  0.210027      3
9   0.544685  0.769115  0.250695  0.285896  0.852395      3
10  0.975006  0.884853  0.359508  0.598859  0.354796      3
11  0.340190  0.178081  0.237694  0.044862  0.505431      3
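
If the number of rows to keep differs between classes, the same if ... else idea can be driven by a lookup table. The following is only a sketch under assumptions: the dictionary n_per_class and the random_state value are illustrative names and values, not part of the original answer.

```python
import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.rand(12, 5))
df['label'] = np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])

# Hypothetical mapping: class label -> number of rows to keep.
# Classes not listed here (class 3) are kept in full.
n_per_class = {1: 3, 2: 3}

df1 = pd.concat(
    # random_state is fixed only to make the sampled rows repeatable
    g.sample(n_per_class[key], random_state=0) if key in n_per_class else g
    for key, g in df.groupby('label')
)

# Number of rows left in each class
print(df1['label'].value_counts().sort_index())
```

Using the group key instead of g.label.isin(...).all() avoids scanning each group, since every row in a group shares the same label.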


Another solution is to build a mask with groupby and cumcount on the rows selected by isin, and then add True values for the remaining rows with reindex.

Finally, filter with boolean indexing:

import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.rand(12,5))
label=np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])

df['label'] = label
#print (df)

N = 3
vals = [1,2]
s = df.loc[df.label.isin(vals), 'label']
mask = s.groupby(s).cumcount() < N
mask = mask.reindex(df.index, fill_value=True)
print (mask)
0      True
1      True
2      True
3     False
4      True
5      True
6      True
7     False
8      True
9      True
10     True
11     True
dtype: bool

print (df[mask])
           0         1         2         3         4  label
0   0.543405  0.278369  0.424518  0.844776  0.004719      1
1   0.121569  0.670749  0.825853  0.136707  0.575093      1
2   0.891322  0.209202  0.185328  0.108377  0.219697      1
4   0.431704  0.940030  0.817649  0.336112  0.175410      2
5   0.372832  0.005689  0.252426  0.795663  0.015255      2
6   0.598843  0.603805  0.105148  0.381943  0.036476      2
8   0.742480  0.630184  0.581842  0.020439  0.210027      3
9   0.544685  0.769115  0.250695  0.285896  0.852395      3
10  0.975006  0.884853  0.359508  0.598859  0.354796      3
11  0.340190  0.178081  0.237694  0.044862  0.505431      3
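
The mask can probably also be written without reindex, by computing cumcount on the full DataFrame and combining it with the negated isin test. This is a sketch of that variant, not the method used above; the name pos is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.rand(12, 5))
df['label'] = np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])

N = 3
vals = [1, 2]

# Position of each row within its class: 0, 1, 2, 3 for every label.
pos = df.groupby('label').cumcount()

# Keep a row if its class is not being trimmed, or if it is among
# the first N rows of its class.
mask = ~df['label'].isin(vals) | (pos < N)

print(df[mask].index.tolist())
# [0, 1, 2, 4, 5, 6, 8, 9, 10, 11]
```

Because the mask is built on the full index from the start, there are no missing rows to fill in afterwards.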

A step-by-step explanation of the mask:

# select the label values of the classes that must be trimmed to N rows
s = df.loc[df.label.isin(vals), 'label']
print (s)
0    1
1    1
2    1
3    1
4    2
5    2
6    2
7    2
Name: label, dtype: int32

# cumcount runs on the filtered Series, so the mask is shorter than the original df
mask = s.groupby(s).cumcount() < N
print (mask)
0     True
1     True
2     True
3    False
4     True
5     True
6     True
7    False
dtype: bool

# add the missing rows with reindex; fill_value=True marks them as rows to keep
mask = mask.reindex(df.index, fill_value=True)
print (mask)
0      True
1      True
2      True
3     False
4      True
5      True
6      True
7     False
8      True
9      True
10     True
11     True
dtype: bool
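
Since the question started from df.drop, it may help to see the same result expressed as a drop. This is a sketch under the assumption that removing the last row of each trimmed class is acceptable (the question says any row will do); to_drop is an illustrative name.

```python
import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.rand(12, 5))
df['label'] = np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])

vals = [1, 2]

# Pick one row per class to remove: here the last row of each
# class listed in vals.
to_drop = df[df['label'].isin(vals)].groupby('label').tail(1).index

# drop expects index labels, not a boolean condition
df1 = df.drop(to_drop)

print(to_drop.tolist())
# [3, 7]
print(df1['label'].value_counts().sort_index())
```

The dropped labels 3 and 7 are the same rows that the mask marks as False.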

Regarding "python - How to remove 1 row from the first 2 classes of a 3-class dataframe?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/42193354/
