Insert or update in a pandas DataFrame
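I want to merge storage_df and processed_df as shown below. Assume Phone is the primary key:
1. If the value exists, update the fields (and create any remaining columns, such as Gender in the example below).
2. If the value does not exist, insert it into the final DataFrame, like 382837371 in the example.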
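Note that the number of columns keeps growing as we process more information. There is, however, a limit of 32 columns, up to which processed_df/storage_df will keep growing.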
storage_df
________________________
Phone      Name
918348483  Sumit
874647474  Saurabh
238362633  NA

processed_df
_______________________________
Phone      Name     Gender
874647474  Saurabh  Male
238362633  NA       Female
382837371  NA       Male

final_df
_______________________________
Phone      Name     Gender
918348483  Sumit    NA
874647474  Saurabh  Male
238362633  NA       Female
382837371  NA       Male
To do this, I used pandas' combine_first:
final_df = processed_df.set_index('Phone').combine_first(storage_df.set_index('Phone'))
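But as the DataFrames grow, the system runs out of memory (16 GB of RAM) and cannot combine a frame of shape (88488, 6) with one of shape (7307, 8).
One option is to store both DataFrames in SQL using sqlite and then use UPSERT. Can you guide me on the syntax for doing this? That said, I would really prefer to do it in memory rather than in a database.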
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 5364, in combine_first
return self.combine(other, combiner, overwrite=False)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 5229, in combine
this, other = self.align(other, copy=False)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 3792, in align
broadcast_axis=broadcast_axis)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 8423, in align
fill_axis=fill_axis)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 8459, in _align_frame
allow_dups=True)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py", line 4490, in _reindex_with_indexers
copy=copy)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 1220, in reindex_indexer
self._consolidate_inplace()
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 929, in _consolidate_inplace
self.blocks = tuple(_consolidate(self.blocks))
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 1899, in _consolidate
_can_consolidate=_can_consolidate)
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 3146, in _merge_blocks
new_values = np.vstack([b.values for b in blocks])
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/numpy/core/shape_base.py", line 283, in vstack
return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
MemoryError
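For the sqlite route mentioned in the question, the following is a minimal sketch of what an UPSERT could look like, not a tested solution for the real data. It assumes a hypothetical table named storage with Phone as its primary key, uses the sample rows from the tables above as plain tuples, and needs SQLite 3.24 or newer for the ON CONFLICT ... DO UPDATE clause. The COALESCE rule (keep the stored value when the incoming one is NULL) is also an assumption about the desired behaviour.

```python
import sqlite3

import pandas as pd

# Sample rows from the question; None stands for the NA entries.
storage_rows = [(918348483, "Sumit"), (874647474, "Saurabh"), (238362633, None)]
processed_rows = [
    (874647474, "Saurabh", "Male"),
    (238362633, None, "Female"),
    (382837371, None, "Male"),
]

conn = sqlite3.connect(":memory:")  # in-memory database; a file path also works
# Phone must be PRIMARY KEY (or UNIQUE) so ON CONFLICT has a constraint to target.
# The table name "storage" is a placeholder chosen for this sketch.
conn.execute(
    "CREATE TABLE storage (Phone INTEGER PRIMARY KEY, Name TEXT, Gender TEXT)"
)
conn.executemany("INSERT INTO storage (Phone, Name) VALUES (?, ?)", storage_rows)

# UPSERT: insert unseen phones, update existing ones. COALESCE keeps the
# stored value whenever the incoming value is NULL.
conn.executemany(
    """
    INSERT INTO storage (Phone, Name, Gender) VALUES (?, ?, ?)
    ON CONFLICT(Phone) DO UPDATE SET
        Name = COALESCE(excluded.Name, storage.Name),
        Gender = COALESCE(excluded.Gender, storage.Gender)
    """,
    processed_rows,
)
conn.commit()

# Read the merged table back into a DataFrame.
final_df = pd.read_sql_query("SELECT * FROM storage ORDER BY Phone", conn)
print(final_df)
```

Because new columns keep appearing, the table would need an ALTER TABLE ... ADD COLUMN step before each UPSERT that introduces one; that step is left out of the sketch.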
Best answer
You can try a pandas outer join.
final_df = storage_df.merge(processed_df, on='Phone', how='outer', suffixes=('', '_y'))
final_df.drop(list(final_df.filter(regex=r'.*_y$').columns), axis=1, inplace=True)
- Join the DataFrames
- Drop the extra columns produced by the merge
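One thing to keep in mind with the merge above: for columns present in both frames, the processed_df copy receives the _y suffix and is dropped, so the storage_df values win and a Name that exists only in processed_df would be lost. If the processed values should take priority, as in the combine_first call from the question, an alternative sketch is to stack both frames and take the first non-null value per Phone. This is an illustration built on the sample data, not the answer's method, and it has not been measured against the real frame sizes.

```python
import pandas as pd

# Sample data from the question; None stands for the NA entries.
storage_df = pd.DataFrame({
    "Phone": [918348483, 874647474, 238362633],
    "Name": ["Sumit", "Saurabh", None],
})
processed_df = pd.DataFrame({
    "Phone": [874647474, 238362633, 382837371],
    "Name": ["Saurabh", None, None],
    "Gender": ["Male", "Female", "Male"],
})

# Put processed_df first so its values take priority, then keep the first
# non-null value of every column within each Phone group.
stacked = pd.concat([processed_df, storage_df], ignore_index=True)
final_df = stacked.groupby("Phone", sort=False).first().reset_index()
print(final_df)
```

GroupBy.first skips missing values column by column, so a row from storage_df fills any gap left by processed_df, and columns that exist in only one frame are created automatically by concat. Repeated Phone values within either frame simply collapse into one row here; whether such duplicates played a part in the MemoryError is not stated in the question.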
Regarding python - merging two DataFrames - UPSERT in Python, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57887684/