python - How to repeat the same operation on subsets of a dataset

Tags: python pandas

I have this pandas DataFrame:

data = DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'], 'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'], 'C' : random.randn(8), 'D' : random.randn(8)})

Out[84]: 
     A      B         C         D
0  foo    one  0.007861 -0.451943
1  bar    one -1.341386 -0.799740
2  foo    two -0.290606 -0.445757
3  bar  three  0.519251 -0.404406
4  foo    two -0.627547 -0.784901
5  bar    two  0.309421  0.234292
6  foo    one -2.156879  0.898375
7  foo  three -1.669896  0.498978

What I did was apply this function to get the count of repeated elements in B:

data['Counts'] = data.groupby(['B'])['B'].transform('count')

This gives me:

Out[87]:
     A      B         C         D  Counts
0  foo    one  0.007861 -0.451943       3
1  bar    one -1.341386 -0.799740       3
2  foo    two -0.290606 -0.445757       3
3  bar  three  0.519251 -0.404406       2
4  foo    two -0.627547 -0.784901       3
5  bar    two  0.309421  0.234292       3
6  foo    one -2.156879  0.898375       3
7  foo  three -1.669896  0.498978       2

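Then I created a new column as a boolean classifier, where 1 marks rows whose B value is repeated at least once and 0 marks rows whose B value is not repeated (there are no 0s in this example):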

data.loc[data.Counts >= 2, 'Repeat'] = 1
data.loc[data.Counts <= 1, 'Repeat'] = 0

Out[89]: 
     A      B         C         D  Counts  Repeat
0  foo    one  0.007861 -0.451943       3       1
1  bar    one -1.341386 -0.799740       3       1
2  foo    two -0.290606 -0.445757       3       1
3  bar  three  0.519251 -0.404406       2       1
4  foo    two -0.627547 -0.784901       3       1
5  bar    two  0.309421  0.234292       3       1
6  foo    one -2.156879  0.898375       3       1
7  foo  three -1.669896  0.498978       2       1

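What I would like to obtain is a further count column that counts how many times an element of B is repeated among rows with the same value in A, and then classify the rows with a boolean classifier accordingly. The result would be: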

Out[89]: 
     A      B         C         D  Counts  Repeat CountsInsideA  RepeatInsideA
0  foo    one  0.007861 -0.451943       3       1             2              1
1  bar    one -1.341386 -0.799740       3       1             1              0
2  foo    two -0.290606 -0.445757       3       1             2              1
3  bar  three  0.519251 -0.404406       2       1             1              0
4  foo    two -0.627547 -0.784901       3       1             2              1
5  bar    two  0.309421  0.234292       3       1             1              0
6  foo    one -2.156879  0.898375       3       1             2              1
7  foo  three -1.669896  0.498978       2       1             1              0

Best answer

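Check this out. First, you can use np.where to build the Repeat column, which is more concise than two separate assignments. Second, to count the repeats for a specific (A, B) combination, we can use groupby and merge the result back into the original DataFrame: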

In [19]:

import numpy as np
import pandas as pd

data = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'], 
                     'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'], 
                     'C' : np.random.randn(8), 'D' : np.random.randn(8)})
In [20]:

data['Counts'] = data.groupby(['B'])['B'].transform('count')
print(data)
     A      B         C         D  Counts
0  foo    one -0.973299 -0.248367       3
1  bar    one  0.518526  0.987810       3
2  foo    two -0.031224  0.340774       3
3  bar  three -0.146824 -0.751124       2
4  foo    two -0.748681 -0.128536       3
5  bar    two  0.744051  0.604505       3
6  foo    one -0.513386  1.262674       3
7  foo  three  0.044814  0.810772       2
In [21]:

data['Repeat'] = np.where(data.Counts>1, 1, 0)
print(data)
     A      B         C         D  Counts  Repeat
0  foo    one -0.973299 -0.248367       3       1
1  bar    one  0.518526  0.987810       3       1
2  foo    two -0.031224  0.340774       3       1
3  bar  three -0.146824 -0.751124       2       1
4  foo    two -0.748681 -0.128536       3       1
5  bar    two  0.744051  0.604505       3       1
6  foo    one -0.513386  1.262674       3       1
7  foo  three  0.044814  0.810772       2       1
In [23]:

# Count rows per (A, B) pair, then attach that count to every matching row.
data = pd.merge(left=data,
                right=data.groupby(['A', 'B']).size().reset_index(name='CountsInsideA'),
                on=['A', 'B'],
                how='left')
print(data)
     A      B         C         D  Counts  Repeat  CountsInsideA
0  foo    one -0.973299 -0.248367       3       1              2
1  bar    one  0.518526  0.987810       3       1              1
2  foo    two -0.031224  0.340774       3       1              2
3  bar  three -0.146824 -0.751124       2       1              1
4  foo    two -0.748681 -0.128536       3       1              2
5  bar    two  0.744051  0.604505       3       1              1
6  foo    one -0.513386  1.262674       3       1              2
7  foo  three  0.044814  0.810772       2       1              1
In [25]:

data['RepeatInsideA'] = np.where(data.CountsInsideA>1, 1, 0)
print(data)
     A      B         C         D  Counts  Repeat  CountsInsideA  RepeatInsideA
0  foo    one -0.973299 -0.248367       3       1              2              1 
1  bar    one  0.518526  0.987810       3       1              1              0
2  foo    two -0.031224  0.340774       3       1              2              1
3  bar  three -0.146824 -0.751124       2       1              1              0
4  foo    two -0.748681 -0.128536       3       1              2              1
5  bar    two  0.744051  0.604505       3       1              1              0
6  foo    one -0.513386  1.262674       3       1              2              1
7  foo  three  0.044814  0.810772       2       1              1              0
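
Since the question is about repeating the same operation on different subsets, the count-and-flag step can also be wrapped in a small function and called once per grouping. The following is only a sketch, not part of the original answer: the helper name add_repeat_columns and its parameters are hypothetical, and it reuses the groupby(...).transform('count') call from the question instead of a merge. It assumes the grouping columns contain no missing values, because 'count' skips NaN.

```python
import numpy as np
import pandas as pd

# Same A and B columns as in the example; C and D are random filler.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': rng.standard_normal(8),
    'D': rng.standard_normal(8),
})


def add_repeat_columns(df, keys, count_col, flag_col):
    """Hypothetical helper: add a per-group row count and a 0/1 repeat flag."""
    out = df.copy()
    # transform('count') broadcasts each group's count back onto its rows,
    # so no merge is needed and the original row order is preserved.
    out[count_col] = out.groupby(keys)[keys[0]].transform('count')
    # 1 if the key combination occurs more than once, otherwise 0.
    out[flag_col] = (out[count_col] > 1).astype(int)
    return out


# Group by B alone, then by the (A, B) combination.
data = add_repeat_columns(data, ['B'], 'Counts', 'Repeat')
data = add_repeat_columns(data, ['A', 'B'], 'CountsInsideA', 'RepeatInsideA')
print(data)
```

Calling the helper with a different list of key columns is all that is needed to repeat the operation for another subset definition.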

Regarding python - How to repeat the same operation on subsets of a dataset, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/31811817/
