python - How to repeat the same operation on subsets of a dataset

I have this pandas DataFrame:

data = DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'], 'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'], 'C' : random.randn(8), 'D' : random.randn(8)})

Out[84]: 
     A      B         C         D
0  foo    one  0.007861 -0.451943
1  bar    one -1.341386 -0.799740
2  foo    two -0.290606 -0.445757
3  bar  three  0.519251 -0.404406
4  foo    two -0.627547 -0.784901
5  bar    two  0.309421  0.234292
6  foo    one -2.156879  0.898375
7  foo  three -1.669896  0.498978

What I did was apply this function to get the count of repeated elements in B:

data['Counts'] = data.groupby(['B'])['B'].transform('count')

This gives me:

Out[87]: 
     A      B         C         D  Counts
0  foo    one  0.007861 -0.451943       3
1  bar    one -1.341386 -0.799740       3
2  foo    two -0.290606 -0.445757       3
3  bar  three  0.519251 -0.404406       2
4  foo    two -0.627547 -0.784901       3
5  bar    two  0.309421  0.234292       3
6  foo    one -2.156879  0.898375       3
7  foo  three -1.669896  0.498978       2
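
For readers less familiar with `transform`, the sketch below contrasts it with a plain aggregation. It assumes a small stand-in frame `df` containing only the B column from the question; the names `df`, `per_group`, and `per_row` are illustrative and not part of the original post.

```python
import pandas as pd

# Illustrative stand-in for the question's column B (assumed, not the full frame).
df = pd.DataFrame({'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three']})

# Aggregation: one value per distinct entry of B.
per_group = df.groupby('B')['B'].count()

# transform: one value per original row, aligned with df's index,
# so the result can be assigned straight back as a new column.
per_row = df.groupby('B')['B'].transform('count')

df['Counts'] = per_row
print(per_group)
print(df)
```

Because the transformed result keeps the original index, no merge is needed to attach it to the frame.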

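Then I created a new column as a boolean classifier, where 1 marks rows whose B value is repeated at least once and 0 marks rows whose B value is not repeated (there are no 0s in this example):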

data.loc[data.Counts >= 2, 'Repeat'] = 1
data.loc[data.Counts <= 1, 'Repeat'] = 0

Out[89]: 
     A      B         C         D  Counts  Repeat
0  foo    one  0.007861 -0.451943       3       1
1  bar    one -1.341386 -0.799740       3       1
2  foo    two -0.290606 -0.445757       3       1
3  bar  three  0.519251 -0.404406       2       1
4  foo    two -0.627547 -0.784901       3       1
5  bar    two  0.309421  0.234292       3       1
6  foo    one -2.156879  0.898375       3       1
7  foo  three -1.669896  0.498978       2       1

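What I want is a further count column that counts how many times an element of B is repeated among rows that share the same value in A, and then to classify the rows with a boolean classifier accordingly. That would be: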

Out[89]: 
     A      B         C         D  Counts  Repeat CountsInsideA RepeatInsideA
0  foo    one  0.007861 -0.451943       3       1             2              1
1  bar    one -1.341386 -0.799740       3       1             1              0
2  foo    two -0.290606 -0.445757       3       1             2              1
3  bar  three  0.519251 -0.404406       2       1             1              0
4  foo    two -0.627547 -0.784901       3       1             2              1
5  bar    two  0.309421  0.234292       3       1             1              0
6  foo    one -2.156879  0.898375       3       1             2              1
7  foo  three -1.669896  0.498978       2       1             1              0

Best answer

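Take a look at the following. First, you can use np.where to build the Repeat column, which is a bit more concise. Second, to count how many times a specific A-B combination repeats, we probably need groupby, merging the result back into the original DataFrame: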

In [19]:

import numpy as np
import pandas as pd

data = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'], 
                     'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'], 
                     'C' : np.random.randn(8), 'D' : np.random.randn(8)})
In [20]:

data['Counts'] = data.groupby(['B'])['B'].transform('count')
print(data)
     A      B         C         D  Counts
0  foo    one -0.973299 -0.248367       3
1  bar    one  0.518526  0.987810       3
2  foo    two -0.031224  0.340774       3
3  bar  three -0.146824 -0.751124       2
4  foo    two -0.748681 -0.128536       3
5  bar    two  0.744051  0.604505       3
6  foo    one -0.513386  1.262674       3
7  foo  three  0.044814  0.810772       2
In [21]:

data['Repeat'] = np.where(data.Counts>1, 1, 0)
print(data)
     A      B         C         D  Counts  Repeat
0  foo    one -0.973299 -0.248367       3       1
1  bar    one  0.518526  0.987810       3       1
2  foo    two -0.031224  0.340774       3       1
3  bar  three -0.146824 -0.751124       2       1
4  foo    two -0.748681 -0.128536       3       1
5  bar    two  0.744051  0.604505       3       1
6  foo    one -0.513386  1.262674       3       1
7  foo  three  0.044814  0.810772       2       1
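
As a side note, the same 0/1 flag can also be sketched without `np.where` by casting the boolean comparison to integers. The `counts` Series below simply copies the Counts values from the table above; the names `counts`, `repeat`, and `repeat_np` are illustrative only.

```python
import numpy as np
import pandas as pd

# Counts values copied from the table above (illustrative stand-in).
counts = pd.Series([3, 3, 3, 2, 3, 3, 3, 2], name='Counts')

# Comparing gives a boolean Series; astype(int) turns True/False into 1/0.
repeat = (counts > 1).astype(int)

# The np.where form from the answer yields the same values as a NumPy array.
repeat_np = np.where(counts > 1, 1, 0)

print(repeat.tolist())
print(repeat_np.tolist())
```

Either form works; `np.where` is more flexible when the two outcomes are not simply 1 and 0.
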
In [23]:

data = pd.merge(left=data,
                # size() counts the rows of each (A, B) group;
                # reset_index(name=...) turns the result into a DataFrame.
                right=data.groupby(['A', 'B']).size()
                          .reset_index(name='CountsInsideA'),
                on=['A', 'B'],
                how='left')
print(data)
     A      B         C         D  Counts  Repeat  CountsInsideA
0  foo    one -0.973299 -0.248367       3       1              2
1  bar    one  0.518526  0.987810       3       1              1
2  foo    two -0.031224  0.340774       3       1              2
3  bar  three -0.146824 -0.751124       2       1              1
4  foo    two -0.748681 -0.128536       3       1              2
5  bar    two  0.744051  0.604505       3       1              1
6  foo    one -0.513386  1.262674       3       1              2
7  foo  three  0.044814  0.810772       2       1              1
In [25]:

data['RepeatInsideA'] = np.where(data.CountsInsideA>1, 1, 0)
print(data)
     A      B         C         D  Counts  Repeat  CountsInsideA  RepeatInsideA
0  foo    one -0.973299 -0.248367       3       1              2              1 
1  bar    one  0.518526  0.987810       3       1              1              0
2  foo    two -0.031224  0.340774       3       1              2              1
3  bar  three -0.146824 -0.751124       2       1              1              0
4  foo    two -0.748681 -0.128536       3       1              2              1
5  bar    two  0.744051  0.604505       3       1              1              0
6  foo    one -0.513386  1.262674       3       1              2              1
7  foo  three  0.044814  0.810772       2       1              1              0
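
Since the question is about repeating the same operation on subsets, one possible alternative to the merge is to reuse `transform('count')` with both A and B as grouping keys. The sketch below is not the answer author's method; `add_repeat_columns` is a hypothetical helper introduced here for illustration, and columns C and D are omitted because they do not affect the counts.

```python
import pandas as pd

def add_repeat_columns(df, keys, count_col, flag_col):
    """Hypothetical helper: add a per-group row count and a 0/1 repeat flag.

    keys      -- list of column names that define a group
    count_col -- name of the new count column
    flag_col  -- name of the new 0/1 column (1 when the group has more than one row)
    """
    out = df.copy()
    # Count rows per group and broadcast the count back to every row.
    out[count_col] = out.groupby(keys)[keys[-1]].transform('count')
    out[flag_col] = (out[count_col] > 1).astype(int)
    return out

data = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                     'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three']})

# Same operation twice: once grouped by B alone, once by the (A, B) pair.
data = add_repeat_columns(data, ['B'], 'Counts', 'Repeat')
data = add_repeat_columns(data, ['A', 'B'], 'CountsInsideA', 'RepeatInsideA')
print(data)
```

This keeps the original row order and index, which a merge does not always guarantee, at the cost of hiding the intermediate per-group table that the merge approach makes explicit.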

Regarding python - How to repeat the same operation on subsets of a dataset, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/31811817/
