python + Pandas : Update ONE column in csv based on another csv

I have two similar tables ("hist.csv"):

Historical :
id | url | url2 | url3 | Time
1    A      B      C      5
2    D      E      F      8

and ("new.csv"):

New :
id | url | url2 | url3 | Time
1    A      Z      K      9
2    G      H      I      11

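Where the "url" column matches, I want to update the New.Time column with the Historical.Time value. That is, the desired output here, where the row with url "A" has been updated: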

New2 :
id | url | url2 | url3 | Time
1    A      Z      K      5
2    G      H      I      11

I tried the following:

Historical = pd.DataFrame.from_csv("hist.csv", index_col='id', sep='\t', encoding='utf-8')
New = pd.DataFrame.from_csv("new.csv", index_col='id', sep='\t', encoding='utf-8')

for index, row in New.iterrows():
    New.loc[index,'Time']=Historical.loc[Historical['url'] == row['url'],'Time']

New.to_csv("new2.csv", sep='\t', encoding='utf-8')

This raises:

 ValueError: Must have equal len keys and value when setting with an iterable

PS: I found this thread: Updating a DataFrame based on another DataFrame, but the suggested solution using "merge" does not seem to really fit my needs, since I have a lot of columns?

Best answer

The basic problem is that Historical.loc[Historical['url'] == row['url'],'Time'] returns a Series, even when only one row, or no row at all, matches the condition Historical['url'] == row['url']. Example -

In [15]: df
Out[15]:
   A  B
0  1  2
1  2  3

In [16]: df.loc[df['A']==1,'B']
Out[16]:
0    2
Name: B, dtype: int64

You are then trying to set this Series into a single cell of your New DataFrame, which is what causes the problem.
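
To make the point concrete, here is a minimal sketch using the same toy df as above. The names matched, unmatched and first_value are only illustrative. It shows that the boolean-mask lookup gives a Series both when one row matches and when none does, and that .values[0] is one way to pull out a scalar when there is at least one match.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [2, 3]})

# One matching row: the result is still a Series, not a scalar.
matched = df.loc[df["A"] == 1, "B"]

# No matching row: the result is an empty Series.
unmatched = df.loc[df["A"] == 99, "B"]

# Take the first matching value as a plain scalar.
# This would fail with an IndexError on an empty result such as `unmatched`.
first_value = matched.values[0]
```

This is why the lookup has to be guarded (or the empty case handled some other way) before indexing into .values.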

Since you say in the comments -

I may have several rows with "url" in Historical, but they will have the same Time value. In that case, I should consider the first occurence/match.

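A quick fix for your code would be to check whether row['url'] exists in the other DataFrame and, only if it does, take the first matching value from it and write it into New -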

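for index, row in New.iterrows():
    if row['url'] in Historical['url'].values:
        # Assign through New.loc: the row yielded by iterrows() may be a copy,
        # so setting row['Time'] is not guaranteed to change New.
        New.loc[index, 'Time'] = Historical.loc[Historical['url'] == row['url'], 'Time'].values[0]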
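
If the row-by-row loop turns out to be slow, a loop-free variant may be worth trying. The following is only a sketch and not part of the original answer: it builds the two tables inline from the question's sample data instead of reading the CSV files, and the names historical, new, lookup and mapped are illustrative. It keeps the first occurrence of each url, as requested in the comment, and leaves every column other than Time untouched.

```python
import pandas as pd

# Sample data copied from the question's tables (stand-ins for hist.csv / new.csv).
historical = pd.DataFrame(
    {"url": ["A", "D"], "url2": ["B", "E"], "url3": ["C", "F"], "Time": [5, 8]},
    index=pd.Index([1, 2], name="id"),
)
new = pd.DataFrame(
    {"url": ["A", "G"], "url2": ["Z", "H"], "url3": ["K", "I"], "Time": [9, 11]},
    index=pd.Index([1, 2], name="id"),
)

# Keep only the first occurrence of each url, then build a url -> Time lookup.
lookup = historical.drop_duplicates(subset="url", keep="first").set_index("url")["Time"]

# Map each url in `new` to its historical Time; urls with no match become NaN.
mapped = new["url"].map(lookup)

# Where there was no match, fall back to the Time already present in `new`,
# then restore the original integer dtype that the NaN step turned into float.
new["Time"] = mapped.fillna(new["Time"]).astype(new["Time"].dtype)

print(new)
```

Only the Time column is assigned, so this sidesteps the concern about having many columns that a full merge would otherwise have to carry along.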

The original question, python + Pandas : Update ONE column in csv based on another csv, can be found on Stack Overflow: https://stackoverflow.com/questions/33052922/
