python - Pandas DataFrame merge gives unexpected values

Tags: python pandas numpy

I have two simple DataFrames:

import pandas as pd

# homes_in and homes are existing DataFrames indexed by (State, RegionName)
a = homes_in.copy()
b = homes.copy()

a['have'] = [True,]*a.shape[0]
b['have'] = [True,]*b.shape[0]

a = a['have'].to_frame()
b = b['have'].to_frame()

print(a.shape)
print(b.shape)

a.reset_index(inplace=True)
b.reset_index(inplace=True)
idx_cols = ['State', 'RegionName']

c = pd.merge(a, b, how='outer', left_on=idx_cols, right_on=idx_cols, suffixes=['_a', '_b'])
print(c.shape)
print(sum(c['have_a']))
print(sum(c['have_b']))

Output:

(10730, 1)
(10592, 1)
(10730, 4)
10730
10730

where a.head() is:

                    have
State RegionName        
NY    New York      True
CA    Los Angeles   True
IL    Chicago       True
PA    Philadelphia  True
AZ    Phoenix       True

Problem: every value in both the have_a and have_b columns is True.

I tried to reproduce the behavior with fake data, but failed:

col = ['first', 'second', 'third']
a = pd.DataFrame.from_records([('a','b',1), ('a','c',1), ('a','d', 1)], columns=col)
b = pd.DataFrame.from_records([('a','b',2), ('a','c',2)], columns=col)
pd.merge(a,b,how='outer',left_on=['first','second'],right_on=['first', 'second'])
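
For comparison, here is a minimal sketch of what the fake data above produces when the merge keys are unique on both sides. The indicator=True argument and the variable name unmatched are additions for illustration and are not part of the original attempt.

```python
import pandas as pd

col = ['first', 'second', 'third']
a = pd.DataFrame.from_records([('a', 'b', 1), ('a', 'c', 1), ('a', 'd', 1)], columns=col)
b = pd.DataFrame.from_records([('a', 'b', 2), ('a', 'c', 2)], columns=col)

# indicator=True adds a '_merge' column whose values are
# 'both', 'left_only' or 'right_only' for each output row.
c = pd.merge(a, b, how='outer', on=['first', 'second'], indicator=True)
print(c)

# Rows that exist in only one of the two frames; here the key ('a', 'd')
# is only in a, so its third_y value is missing (NaN).
unmatched = c[c['_merge'] != 'both']
print(unmatched)
```

With unique keys the outer merge leaves a missing value for the key that only one frame contains, which is why this data does not reproduce the problem.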

Best answer

I think there are duplicate keys:

import pandas as pd

col = ['first', 'second', 'third']
a = pd.DataFrame.from_records([('a','b',True), ('a','c',True), ('a','c', True)], columns=col)
b = pd.DataFrame.from_records([('a','b',True), ('a','c',True)], columns=col)
c = pd.merge(a,b,how='outer',left_on=['first','second'],right_on=['first', 'second'])
print (a)
  first second  third
0     a      b   True
1     a      c   True <-duplicates a,c
2     a      c   True <-duplicates a,c

print (b)
  first second  third
0     a      b   True
1     a      c   True

print (c)
  first second  third_x  third_y
0     a      b     True     True
1     a      c     True     True
2     a      c     True     True
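
One way to have pandas flag this situation is the validate argument of pd.merge. The sketch below is not part of the original answer. It assumes a pandas version that supports validate, and the variable keys_unique is a name chosen here for illustration.

```python
import pandas as pd
from pandas.errors import MergeError

col = ['first', 'second', 'third']
a = pd.DataFrame.from_records([('a', 'b', True), ('a', 'c', True), ('a', 'c', True)], columns=col)
b = pd.DataFrame.from_records([('a', 'b', True), ('a', 'c', True)], columns=col)

try:
    # validate='one_to_one' asks pandas to check that the merge keys
    # are unique in both the left and the right frame.
    pd.merge(a, b, how='outer', on=['first', 'second'], validate='one_to_one')
    keys_unique = True
except MergeError as exc:
    # Raised here because the key ('a', 'c') appears twice in a.
    keys_unique = False
    print('merge keys are not unique:', exc)
```

This turns a silent row multiplication into an explicit error, which can be easier to notice than a surprising row count.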

You can look for duplicates:

print (a[a.duplicated(['first','second'], keep=False)])
  first second  third
1     a      c   True
2     a      c   True

print (b[b.duplicated(['first','second'], keep=False)])
Empty DataFrame
Columns: [first, second, third]
Index: []
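
In the question the keys live in the index (State, RegionName) rather than in columns, so the same check can be run on the index before calling reset_index. This is a sketch on a small hypothetical stand-in for the question's frame: the rows and the names dup_mask and n_dup_keys are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the question's frame, indexed by (State, RegionName).
idx = pd.MultiIndex.from_tuples(
    [('NY', 'New York'), ('CA', 'Los Angeles'), ('CA', 'Los Angeles')],
    names=['State', 'RegionName'],
)
a = pd.DataFrame({'have': [True, True, True]}, index=idx)

# keep=False marks every occurrence of a repeated index entry, not only the later ones.
dup_mask = a.index.duplicated(keep=False)
print(a[dup_mask])

# Number of distinct keys that occur more than once.
n_dup_keys = len(a.index[dup_mask].unique())
print(n_dup_keys)
```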

The solution is to remove the duplicates with drop_duplicates:

a = a.drop_duplicates(['first','second'])
b = b.drop_duplicates(['first','second'])

c = pd.merge(a,b,how='outer',left_on=['first','second'],right_on=['first', 'second'])
print (a)
  first second  third
0     a      b   True
1     a      c   True

print (b)
  first second  third
0     a      b   True
1     a      c   True

print (c)
  first second  third_x  third_y
0     a      b     True     True
1     a      c     True     True
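
Once the keys are unique, an outer merge leaves missing values for keys found in only one frame, so the two flag columns can be counted separately. The sketch below uses hypothetical data with a 'have' column and an extra key ('a', 'd') that exists only in a. Neither comes from the original answer.

```python
import pandas as pd

col = ['first', 'second', 'have']
a = pd.DataFrame.from_records(
    [('a', 'b', True), ('a', 'c', True), ('a', 'c', True), ('a', 'd', True)],
    columns=col,
)
b = pd.DataFrame.from_records([('a', 'b', True), ('a', 'c', True)], columns=col)

keys = ['first', 'second']
# drop_duplicates keeps the first row for each key and drops the rest.
a = a.drop_duplicates(keys)
b = b.drop_duplicates(keys)

c = pd.merge(a, b, how='outer', on=keys, suffixes=('_a', '_b'))
print(c)

# Count the rows that actually came from each side; NaN marks a missing match.
print(c['have_a'].notna().sum())  # 3
print(c['have_b'].notna().sum())  # 2
```

Note that drop_duplicates discards the extra rows silently, so it is worth checking first whether the duplicated keys carry different data in other columns.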

Regarding "python - Pandas DataFrame merge gives unexpected values", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/51126475/
