I have two similar tables. The first ("hist.csv"):
Historical :
id | url | url2 | url3 | Time
1  | A   | B    | C    | 5
2  | D   | E    | F    | 8
and the second ("new.csv"):
New :
id | url | url2 | url3 | Time
1  | A   | Z    | K    | 9
2  | G   | H    | I    | 11
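I want to update the New.Time column with the Historical.Time value wherever the "url" column matches. That is, the desired output here, where only the row with url "A" is updated: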
New2 :
id | url | url2 | url3 | Time
1  | A   | Z    | K    | 5
2  | G   | H    | I    | 11
I tried the following:
Historical = pd.DataFrame.from_csv("hist.csv", index_col='id', sep='\t', encoding='utf-8')
New = pd.DataFrame.from_csv("new.csv", index_col='id', sep='\t', encoding='utf-8')
for index, row in New.iterrows():
    New.loc[index,'Time']=Historical.loc[Historical['url'] == row['url'],'Time']
New.to_csv("new2.csv", sep='\t', encoding='utf-8')
which raises:
ValueError: Must have equal len keys and value when setting with an iterable
PS: I found this post: Updating a DataFrame based on another DataFrame, but the suggested solution using "merge" does not seem to fit my needs, because I have many columns?
Best answer
The basic issue is that Historical.loc[Historical['url'] == row['url'],'Time'] returns a Series, even when only one row, or no row at all, matches the condition Historical['url'] == row['url']. Example -
In [15]: df
Out[15]:
A B
0 1 2
1 2 3
In [16]: df.loc[df['A']==1,'B']
Out[16]:
0 2
Name: B, dtype: int64
You are then trying to set this Series into a single cell of your New dataframe, which is what causes the error.
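To make the point concrete, here is a minimal sketch using the same toy frame as the example above. The names one_match, no_match and scalar are only illustrative. It shows that a boolean-mask .loc lookup yields a Series whether one row or no row matches, and that a scalar has to be extracted explicitly before it can go into a single cell.

```python
import pandas as pd

# Same toy frame as in the example above
df = pd.DataFrame({"A": [1, 2], "B": [2, 3]})

# A boolean mask with .loc always returns a Series, never a bare scalar
one_match = df.loc[df["A"] == 1, "B"]   # Series with one element
no_match = df.loc[df["A"] == 99, "B"]   # empty Series

print(type(one_match).__name__, len(one_match))
print(type(no_match).__name__, len(no_match))

# A scalar must be pulled out explicitly, and only when a match exists
scalar = one_match.iloc[0]
```

Because the no-match case produces an empty Series, any positional access such as .iloc[0] or .values[0] needs to be guarded by an existence check.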
Since you said in the comments -
I may have several rows with "url" in Historical, but they will have the same Time value. In that case, I should consider the first occurence/match.
A quick fix for your code is to check whether row['url'] exists in the other DataFrame and, only if it does, take the first matching value -
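for index, row in New.iterrows():
    # Only update when this url also appears in Historical
    if row['url'] in Historical['url'].values:
        # Write through New.loc (assigning to row would only change a copy); take the first match
        New.loc[index, 'Time'] = Historical.loc[Historical['url'] == row['url'], 'Time'].values[0]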
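As an alternative to looping, the same update can be sketched in vectorized form. This is not part of the original answer. It assumes, as stated in the comment quoted above, that the first occurrence of each url in Historical is the one to use. The frames are built inline from the question's sample tables instead of being read from the CSV files, and lookup is a hypothetical helper name.

```python
import pandas as pd

# Sample data mirroring the tables in the question (built inline, not read from CSV)
Historical = pd.DataFrame(
    {"url": ["A", "D"], "url2": ["B", "E"], "url3": ["C", "F"], "Time": [5, 8]},
    index=pd.Index([1, 2], name="id"),
)
New = pd.DataFrame(
    {"url": ["A", "G"], "url2": ["Z", "H"], "url3": ["K", "I"], "Time": [9, 11]},
    index=pd.Index([1, 2], name="id"),
)

# url -> Time lookup, keeping only the first occurrence of each url
lookup = Historical.drop_duplicates("url").set_index("url")["Time"]

# map() yields NaN for urls missing from Historical; fall back to the existing
# Time there, then restore the original integer dtype
New["Time"] = New["url"].map(lookup).fillna(New["Time"]).astype(New["Time"].dtype)

print(New)
```

Only the url and Time columns take part in the lookup, so the number of other columns in either table does not matter.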
Regarding python + Pandas: Update ONE column in csv based on another csv, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/33052922/