I have two simple DataFrames:
a = homes_in.copy()
b = homes.copy()
a['have'] = [True,]*a.shape[0]
b['have'] = [True,]*b.shape[0]
a = a['have'].to_frame()
b = b['have'].to_frame()
print(a.shape)
print(b.shape)
a.reset_index(inplace=True)
b.reset_index(inplace=True)
idx_cols = ['State', 'RegionName']
c = pd.merge(a, b, how='outer', left_on=idx_cols, right_on=idx_cols, suffixes=['_a', '_b'])
print(c.shape)
print(sum(c['have_a']))
print(sum(c['have_b']))
Output:
(10730, 1)
(10592, 1)
(10730, 4)
10730
10730
where a.head() is:
                     have
State RegionName
NY    New York       True
CA    Los Angeles    True
IL    Chicago        True
PA    Philadelphia   True
AZ    Phoenix        True
The problem: every value in the have_a and have_b columns is True, even though b has fewer rows than a.
I tried to reproduce the behavior with fake data, but could not:
import pandas as pd

col = ['first', 'second', 'third']
a = pd.DataFrame.from_records([('a','b',1), ('a','c',1), ('a','d', 1)], columns=col)
b = pd.DataFrame.from_records([('a','b',2), ('a','c',2)], columns=col)
pd.merge(a,b,how='outer',left_on=['first','second'],right_on=['first', 'second'])
Best answer
I think there are duplicates in the key columns:
import pandas as pd

col = ['first', 'second', 'third']
a = pd.DataFrame.from_records([('a','b',True), ('a','c',True), ('a','c', True)], columns=col)
b = pd.DataFrame.from_records([('a','b',True), ('a','c',True)], columns=col)
c = pd.merge(a,b,how='outer',left_on=['first','second'],right_on=['first', 'second'])
print(a)
  first second  third
0     a      b   True
1     a      c   True   <- duplicate key (a, c)
2     a      c   True   <- duplicate key (a, c)
print(b)
  first second  third
0     a      b   True
1     a      c   True
print(c)
  first second  third_x  third_y
0     a      b     True     True
1     a      c     True     True
2     a      c     True     True
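To see why no missing values show up, one option is to ask the merge to label where each row came from. The sketch below reuses the toy frames from this answer and passes `indicator=True`; the names `keys` and `sources` are only illustrative.

```python
import pandas as pd

col = ['first', 'second', 'third']
a = pd.DataFrame.from_records(
    [('a', 'b', True), ('a', 'c', True), ('a', 'c', True)], columns=col)
b = pd.DataFrame.from_records(
    [('a', 'b', True), ('a', 'c', True)], columns=col)

keys = ['first', 'second']  # illustrative name for the join columns

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'
c = pd.merge(a, b, how='outer', on=keys, indicator=True)

# Count how many rows came from each side
sources = c['_merge'].value_counts()
print(len(c))
print(sources)
```

Every row is labelled `both` here, so the extra row comes from the repeated key on the left side and not from a key that is missing on the right. That would explain a result with as many rows as `a` and no NaN in either column.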
You can look for the duplicates:
print(a[a.duplicated(['first','second'], keep=False)])
  first second  third
1     a      c   True
2     a      c   True
print (b[b.duplicated(['first','second'], keep=False)])
Empty DataFrame
Columns: [first, second, third]
Index: []
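As a guard, `pd.merge` also accepts a `validate` argument that raises an error when the keys are not unique. A minimal sketch, assuming a one-to-one relationship is what you expect between the two frames (the variable names `keys`, `n_dupes` and `raised` are illustrative):

```python
import pandas as pd
from pandas.errors import MergeError

col = ['first', 'second', 'third']
a = pd.DataFrame.from_records(
    [('a', 'b', True), ('a', 'c', True), ('a', 'c', True)], columns=col)
b = pd.DataFrame.from_records(
    [('a', 'b', True), ('a', 'c', True)], columns=col)

keys = ['first', 'second']

# Number of rows in `a` whose key already appeared earlier in the frame
n_dupes = int(a.duplicated(keys).sum())
print(n_dupes)

# validate='one_to_one' makes the merge fail instead of silently
# multiplying rows when a key is repeated on either side
raised = False
try:
    pd.merge(a, b, how='outer', on=keys, validate='one_to_one')
except MergeError as exc:
    raised = True
    print(exc)
```

This does not fix the data, but it turns a silent surprise into an explicit error at the point of the merge.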
The solution is to remove the duplicates with drop_duplicates:
a = a.drop_duplicates(['first','second'])
b = b.drop_duplicates(['first','second'])
c = pd.merge(a,b,how='outer',left_on=['first','second'],right_on=['first', 'second'])
print(a)
  first second  third
0     a      b   True
1     a      c   True
print(b)
  first second  third
0     a      b   True
1     a      c   True
print(c)
  first second  third_x  third_y
0     a      b     True     True
1     a      c     True     True
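One thing worth checking before dropping rows: `drop_duplicates` keeps the first row for each key by default, so if the duplicated keys carry different values in the other columns, the later rows are silently discarded. A small sketch with made-up numbers (not the data from the question) that lists such conflicting keys first:

```python
import pandas as pd

# Hypothetical frame: key (a, c) appears twice with different 'third' values
a = pd.DataFrame({
    'first': ['a', 'a', 'a'],
    'second': ['b', 'c', 'c'],
    'third': [1, 2, 3],
})
keys = ['first', 'second']

# Among rows with a repeated key, count distinct 'third' values per key
dupes = a[a.duplicated(keys, keep=False)]
distinct = dupes.groupby(keys)['third'].nunique()
conflicting_keys = distinct[distinct > 1]
print(conflicting_keys)

# Default keep='first' retains the first row seen for each key
deduped = a.drop_duplicates(keys)
print(deduped)
```

If `conflicting_keys` is empty, dropping duplicates loses nothing. Otherwise you may prefer to aggregate the rows (for example with `groupby`) before merging.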
Regarding python - unexpected values in a Pandas DataFrame merge, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/51126475/