I have the following two DataFrames:
df1 = pd.DataFrame({'date':['2012-12-31', '2013-12-31', '9999-12-31'],
'value':[4, 5, 6]})
df2 = pd.DataFrame({'date':['2013-12-31', '2012-12-31', '2010-12-31'],
'value':[14, 55, 36]})
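The problem with df1 is that the ['date'] column contains a value that cannot be parsed directly into a timestamp, so I used the following function:
def to_datetime(x):
    try:
        res = pd.to_datetime(x)
    except:
        res = x
    return res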
My new columns are then created like this:
df1['date_new'] = df1['date'].apply(to_datetime)
df2['date_new'] = df2['date'].apply(to_datetime)
I want to merge the two DataFrames on ['date_new'], but no values match.
df3 = pd.merge(df1, df2, how = 'inner', on = ['date_new'])
However,
df1['date_new'][0] == df2['date_new'][1]
returns True. The full code is below:
import pandas as pd
def to_datetime(x):
    try:
        res = pd.to_datetime(x)
    except:
        res = x
    return res
df1 = pd.DataFrame({'date':['2012-12-31', '2013-12-31', '9999-12-31'],
'value':[4, 5, 6]})
df2 = pd.DataFrame({'date':['2013-12-31', '2012-12-31', '2010-12-31'],
'value':[14, 55, 36]})
df1['date_new'] = df1['date'].apply(to_datetime)
df2['date_new'] = df2['date'].apply(to_datetime)
df3 = pd.merge(df1, df2, how = 'inner', on = ['date_new'])
Please tell me why this happens. Thanks!
Best answer
pd.to_datetime has a convenient errors parameter, which you can set to 'coerce'. With that, your code appears to work:
df1['date_new'] = pd.to_datetime(df1['date'], errors='coerce')
df2['date_new'] = pd.to_datetime(df2['date'], errors='coerce')
df3 = pd.merge(df1, df2, how = 'inner', on = ['date_new'])
>>> df3
       date_x  value_x   date_new      date_y  value_y
0  2012-12-31        4 2012-12-31  2012-12-31       55
1  2013-12-31        5 2013-12-31  2013-12-31       14
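Note that because the dates are coerced, any value that does not fit a date format becomes NaT, and these NaT values will match each other in the merge. For example: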
df1 = pd.DataFrame({'date':['2012-12-31', '2013-12-31', '9999-12-31','xyz'],
'value':[4, 5, 6, 14]})
df2 = pd.DataFrame({'date':['2013-12-31', '2012-12-31', '2010-12-31','sss'],
'value':[14, 55, 36, 12]})
df1['date_new'] = pd.to_datetime(df1['date'], errors='coerce')
df2['date_new'] = pd.to_datetime(df2['date'], errors='coerce')
df3 = pd.merge(df1, df2, how = 'inner', on = ['date_new'])
results in this:
>>> df3
       date_x  value_x   date_new      date_y  value_y
0  2012-12-31        4 2012-12-31  2012-12-31       55
1  2013-12-31        5 2013-12-31  2013-12-31       14
2  9999-12-31        6        NaT         sss       12
3         xyz       14        NaT         sss       12
To avoid this, you can merge only the subsets of the DataFrames where date_new is not null:
df3 = pd.merge(df1.loc[df1.date_new.notnull()], df2.loc[df2.date_new.notnull()], how = 'inner', on = ['date_new'])
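The filtering step above can be tried end to end with a small self-contained sketch. The sample rows here are illustrative only ('xyz' and 'sss' stand in for any unparseable date), and Series.notna() is used as an equivalent spelling of notnull():

```python
import pandas as pd

# Illustrative sample data; 'xyz' and 'sss' stand in for unparseable dates.
df1 = pd.DataFrame({'date': ['2012-12-31', '2013-12-31', 'xyz'],
                    'value': [4, 5, 6]})
df2 = pd.DataFrame({'date': ['2013-12-31', '2012-12-31', 'sss'],
                    'value': [14, 55, 36]})

# Parse both key columns the same way; unparseable strings become NaT.
for df in (df1, df2):
    df['date_new'] = pd.to_datetime(df['date'], errors='coerce')

# Keep only the rows whose key parsed successfully, then merge.
left = df1[df1['date_new'].notna()]
right = df2[df2['date_new'].notna()]
df3 = pd.merge(left, right, how='inner', on='date_new')
print(df3)
```

Only the two rows with real dates should remain, since the NaT rows are removed before they get a chance to pair up.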
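As for why this happens in your code: when your function hits a date it cannot parse, it returns the original string, so the resulting Series mixes Timestamps and strings and ends up with dtype: object: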
df1['date_new'] = df1['date'].apply(to_datetime)
>>> df1['date_new']
0 2012-12-31 00:00:00
1 2013-12-31 00:00:00
2 9999-12-31
Name: date_new, dtype: object
But when all the dates parse correctly, you get a Series with dtype: datetime64[ns]:
df2['date_new'] = df2['date'].apply(to_datetime)
>>> df2['date_new']
0 2013-12-31
1 2012-12-31
2 2010-12-31
Name: date_new, dtype: datetime64[ns]
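The two key columns therefore have different dtypes (object versus datetime64[ns]), so they cannot be merged correctly, even though individual elements compare equal.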
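A quick way to confirm this diagnosis is to compare the dtypes of the two key columns before merging. The following is a minimal sketch, not the asker's exact code: parse_or_keep is a hypothetical stand-in for the question's to_datetime helper, and 'xyz' is an assumed unparseable value.

```python
import pandas as pd

def parse_or_keep(x):
    # Same idea as the question's helper: fall back to the raw value on failure.
    try:
        return pd.to_datetime(x)
    except Exception:
        return x

raw_left = pd.Series(['2012-12-31', 'xyz'])        # 'xyz' cannot be parsed
raw_right = pd.Series(['2013-12-31', '2012-12-31'])

# Element-wise parsing: one side mixes Timestamps and strings, the other does not.
left = raw_left.apply(parse_or_keep)
right = raw_right.apply(parse_or_keep)
print(left.dtype, right.dtype)

# Parsing whole columns with errors='coerce' gives both sides a datetime dtype.
left_fixed = pd.to_datetime(raw_left, errors='coerce')
right_fixed = pd.to_datetime(raw_right, errors='coerce')
same_dtype = left_fixed.dtype == right_fixed.dtype
print(same_dtype)
```

Checking that the key dtypes agree before calling pd.merge is a cheap guard against this kind of silent mismatch.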
This answer comes from a similar question on Stack Overflow about a pandas merge problem after parsing a date column: https://stackoverflow.com/questions/49216237/