I have this pandas DataFrame:
import numpy as np
import pandas as pd
data = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'], 'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'], 'C' : np.random.randn(8), 'D' : np.random.randn(8)})
Out[84]:
A B C D
0 foo one 0.007861 -0.451943
1 bar one -1.341386 -0.799740
2 foo two -0.290606 -0.445757
3 bar three 0.519251 -0.404406
4 foo two -0.627547 -0.784901
5 bar two 0.309421 0.234292
6 foo one -2.156879 0.898375
7 foo three -1.669896 0.498978
What I did was apply this function to count how many times each element of B occurs:
data['Counts'] = data.groupby(['B'])['B'].transform('count')
This gives me:
Out[87]:
A B C D Counts
0 foo one 0.007861 -0.451943 3
1 bar one -1.341386 -0.799740 3
2 foo two -0.290606 -0.445757 3
3 bar three 0.519251 -0.404406 2
4 foo two -0.627547 -0.784901 3
5 bar two 0.309421 0.234292 3
6 foo one -2.156879 0.898375 3
7 foo three -1.669896 0.498978 2
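For readers unfamiliar with transform, here is a minimal sketch of why it yields one count per row rather than one per group. It uses only column B from the frame above; the names df, per_group and per_row are illustrative and not part of the original post.

```python
import pandas as pd

# Column B from the question; the other columns are not needed here.
df = pd.DataFrame({'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three']})

# size() returns one value per group (three groups here).
per_group = df.groupby('B').size()

# transform('count') broadcasts each group's count back onto every row,
# so the result has the same length and index as df.
per_row = df.groupby('B')['B'].transform('count')
```

Because per_row is aligned with the original index, it can be assigned straight to a new column, which is what the Counts line above does.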
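I then created a new column as a boolean flag, where 1 marks rows whose B value occurs at least twice and 0 marks rows whose B value is not repeated (there are no 0s in this example):
data.loc[data.Counts >= 2, 'Repeat'] = 1
data.loc[data.Counts <= 1, 'Repeat'] = 0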
Out[89]:
A B C D Counts Repeat
0 foo one 0.007861 -0.451943 3 1
1 bar one -1.341386 -0.799740 3 1
2 foo two -0.290606 -0.445757 3 1
3 bar three 0.519251 -0.404406 2 1
4 foo two -0.627547 -0.784901 3 1
5 bar two 0.309421 0.234292 3 1
6 foo one -2.156879 0.898375 3 1
7 foo three -1.669896 0.498978 2 1
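The two conditional assignments can probably be collapsed into a single comparison. The sketch below assumes the Counts values shown above and casts the boolean result to integers; the standalone Series named counts is only a stand-in for data['Counts'].

```python
import pandas as pd

# Stand-in for data['Counts'], copied from the output above.
counts = pd.Series([3, 3, 3, 2, 3, 3, 3, 2], name='Counts')

# The comparison gives True/False per row; astype(int) turns that into 1/0.
repeat = (counts >= 2).astype(int)
```

In the original frame this would read data['Repeat'] = (data['Counts'] >= 2).astype(int).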
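What I want is a further count column that counts how many times each element of B is repeated among the rows that share the same value of A, together with a boolean flag derived from it. The result would be: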
Out[89]:
A B C D Counts Repeat CountsInsideA RepeatInsideA
0 foo one 0.007861 -0.451943 3 1 2 1
1 bar one -1.341386 -0.799740 3 1 1 0
2 foo two -0.290606 -0.445757 3 1 2 1
3 bar three 0.519251 -0.404406 2 1 1 0
4 foo two -0.627547 -0.784901 3 1 2 1
5 bar two 0.309421 0.234292 3 1 1 0
6 foo one -2.156879 0.898375 3 1 2 1
7 foo three -1.669896 0.498978 2 1 1 0
Best answer
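Take a look at this. First, you can use np.where to build the Repeat column, which is more concise than two separate assignments. Second, to count the repetitions of each particular A-B combination, we can use groupby and merge the result back into the original DataFrame: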
In [19]:
import numpy as np
import pandas as pd
data = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                     'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                     'C' : np.random.randn(8), 'D' : np.random.randn(8)})
In [20]:
data['Counts'] = data.groupby(['B'])['B'].transform('count')
print(data)
A B C D Counts
0 foo one -0.973299 -0.248367 3
1 bar one 0.518526 0.987810 3
2 foo two -0.031224 0.340774 3
3 bar three -0.146824 -0.751124 2
4 foo two -0.748681 -0.128536 3
5 bar two 0.744051 0.604505 3
6 foo one -0.513386 1.262674 3
7 foo three 0.044814 0.810772 2
In [21]:
data['Repeat'] = np.where(data.Counts>1, 1, 0)
print(data)
A B C D Counts Repeat
0 foo one -0.973299 -0.248367 3 1
1 bar one 0.518526 0.987810 3 1
2 foo two -0.031224 0.340774 3 1
3 bar three -0.146824 -0.751124 2 1
4 foo two -0.748681 -0.128536 3 1
5 bar two 0.744051 0.604505 3 1
6 foo one -0.513386 1.262674 3 1
7 foo three 0.044814 0.810772 2 1
In [23]:
# Count rows per (A, B) pair, then attach that count to every matching row.
data = pd.merge(left=data,
                right=data.groupby(['A', 'B']).size().reset_index(name='CountsInsideA'),
                on=['A', 'B'],
                how='left')
print(data)
A B C D Counts Repeat CountsInsideA
0 foo one -0.973299 -0.248367 3 1 2
1 bar one 0.518526 0.987810 3 1 1
2 foo two -0.031224 0.340774 3 1 2
3 bar three -0.146824 -0.751124 2 1 1
4 foo two -0.748681 -0.128536 3 1 2
5 bar two 0.744051 0.604505 3 1 1
6 foo one -0.513386 1.262674 3 1 2
7 foo three 0.044814 0.810772 2 1 1
In [25]:
data['RepeatInsideA'] = np.where(data.CountsInsideA>1, 1, 0)
print(data)
A B C D Counts Repeat CountsInsideA RepeatInsideA
0 foo one -0.973299 -0.248367 3 1 2 1
1 bar one 0.518526 0.987810 3 1 1 0
2 foo two -0.031224 0.340774 3 1 2 1
3 bar three -0.146824 -0.751124 2 1 1 0
4 foo two -0.748681 -0.128536 3 1 2 1
5 bar two 0.744051 0.604505 3 1 1 0
6 foo one -0.513386 1.262674 3 1 2 1
7 foo three 0.044814 0.810772 2 1 1 0
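As a possible alternative to the merge, the same transform call used for Counts should also work when grouping on both A and B. This is a sketch rather than part of the original answer; it omits the random C and D columns because they do not affect the counts.

```python
import numpy as np
import pandas as pd

# Columns A and B from the question.
data = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
                     'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three']})

# Group on the (A, B) pair and broadcast each group's row count back to its rows.
data['CountsInsideA'] = data.groupby(['A', 'B'])['B'].transform('count')

# Flag rows whose (A, B) pair occurs more than once.
data['RepeatInsideA'] = np.where(data['CountsInsideA'] > 1, 1, 0)
```

This keeps the original row order and index, and avoids building and joining an intermediate frame.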
On the topic "python - how to repeat the same operation on subsets of a dataset", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/31811817/