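I am trying to merge a set of DataFrames on a common timestamp across various assets. The dataset contains hourly data, but the timestamp for each hour differs slightly from asset to asset. So I convert the timestamps from epoch to datetime and strip the minutes and seconds: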
market_trading_pair ohlcv_start_date next_future_timestep_return
7073 Poloniex_DOGE_BTC 1445392800 -0.023256
7074 Poloniex_DOGE_BTC 1445396400 0.023810
7075 Poloniex_DOGE_BTC 1445400000 0.000000
7076 Poloniex_DOGE_BTC 1445403600 -0.023256
7077 Poloniex_DOGE_BTC 1445407200 0.000000
Using this code:
TS = 'ohlcv_start_date'
df[TS] = pd.to_datetime(df[TS], unit='s').dt.strftime('%Y-%m-%d %H:00:00')
print(df.groupby('market_trading_pair').get_group('Poloniex_DOGE_BTC').head()[['market_trading_pair','ohlcv_start_date']])
market_trading_pair ohlcv_start_date next_future_timestep_return
7073 Poloniex_DOGE_BTC 2015-10-21 02:00:00 -0.023256
7074 Poloniex_DOGE_BTC 2015-10-21 03:00:00 0.023810
7075 Poloniex_DOGE_BTC 2015-10-21 04:00:00 0.000000
7076 Poloniex_DOGE_BTC 2015-10-21 05:00:00 -0.023256
7077 Poloniex_DOGE_BTC 2015-10-21 06:00:00 0.000000
I create a new DataFrame with the data I need:
timestamp DOGE
7073 2015-10-21 02:00:00 -0.023256
7074 2015-10-21 03:00:00 0.023810
7075 2015-10-21 04:00:00 0.000000
7076 2015-10-21 05:00:00 -0.023256
7077 2015-10-21 06:00:00 0.000000
Then I create a "skeleton" timeframe DataFrame into which I will be able to merge all the DataFrames, and merge one frame into it as a test.
timeframe = pd.date_range(start=min_time, end=max_time, freq='H')
test = pd.DataFrame(timeframe, columns=['timestamp'])
timestamp
0 2015-10-21 02:00:00
1 2015-10-21 03:00:00
2 2015-10-21 04:00:00
3 2015-10-21 05:00:00
4 2015-10-21 06:00:00
test = pd.merge(left=test, right=to_merge, left_on='timestamp',right_on='timestamp',how='left')
timestamp DOGE
0 2015-10-21 02:00:00 NaN
1 2015-10-21 03:00:00 NaN
2 2015-10-21 04:00:00 NaN
3 2015-10-21 05:00:00 NaN
The result is NaN fields, which I think might be caused by a formatting error? But I compared the timestamp strings, and the result was "True".
Best answer
The problem is the dtypes: a column of type string cannot be matched against a column of type datetime in a merge, so the output is NaN:
print(df)
timestamp DOGE
7073 2015-10-21 02:00:00 -0.023256
7074 2015-10-21 03:00:00 0.023810
7075 2015-10-21 04:00:00 0.000000
7076 2015-10-21 05:00:00 -0.023256
7077 2015-10-21 06:00:00 0.000000
print(df.dtypes)
timestamp datetime64[ns]
DOGE float64
dtype: object
min_time = df['timestamp'].min()
max_time = df['timestamp'].max()
df['timestamp'] = df['timestamp'].dt.strftime('%Y-%m-%d %H:00:00')
print(df)
timestamp DOGE
7073 2015-10-21 02:00:00 -0.023256
7074 2015-10-21 03:00:00 0.023810
7075 2015-10-21 04:00:00 0.000000
7076 2015-10-21 05:00:00 -0.023256
7077 2015-10-21 06:00:00 0.000000
print(df.dtypes)
timestamp object **************
DOGE float64
dtype: object
timeframe = pd.date_range(start=min_time, end=max_time, freq='H')
test = pd.DataFrame(timeframe, columns=['timestamp'])
print(test)
timestamp
0 2015-10-21 02:00:00
1 2015-10-21 03:00:00
2 2015-10-21 04:00:00
3 2015-10-21 05:00:00
4 2015-10-21 06:00:00
print(test.dtypes)
timestamp datetime64[ns] ****************
dtype: object
print(pd.merge(left=test, right=df, on='timestamp', how='left'))
timestamp DOGE
0 2015-10-21 02:00:00 NaN
1 2015-10-21 03:00:00 NaN
2 2015-10-21 04:00:00 NaN
3 2015-10-21 05:00:00 NaN
4 2015-10-21 06:00:00 NaN
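A quick way to confirm this kind of mismatch is to check the dtype of the key column on each side before merging, since the printed values look identical either way. The following is only a minimal sketch: the frames `left` and `right` are hypothetical stand-ins for `test` and `df`, and depending on the pandas version a merge on mismatched key types may raise an error instead of returning NaN.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Hypothetical frames mirroring the example above.
left = pd.DataFrame(
    {"timestamp": pd.to_datetime(["2015-10-21 02:00:00", "2015-10-21 03:00:00"])}
)
right = pd.DataFrame(
    {
        "timestamp": ["2015-10-21 02:00:00", "2015-10-21 03:00:00"],  # strings, not datetimes
        "DOGE": [-0.023256, 0.023810],
    }
)

# The values print the same, but only one key column is a datetime.
left_is_datetime = is_datetime64_any_dtype(left["timestamp"])
right_is_datetime = is_datetime64_any_dtype(right["timestamp"])
print(left_is_datetime, right_is_datetime)

# Parse the strings back into datetimes so both keys share a dtype, then merge.
right["timestamp"] = pd.to_datetime(right["timestamp"])
fixed = left.merge(right, on="timestamp", how="left")
print(fixed)
```

Checking the dtypes first is cheaper than debugging an all-NaN column after the fact, and it works the same for any pair of key columns.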
Solution
Remove the conversion of the datetime column to string:
Change:
df[TS] = pd.to_datetime(df[TS], unit='s').dt.strftime('%Y-%m-%d %H:00:00')
to:
df[TS] = pd.to_datetime(df[TS], unit='s')
This means (I commented out the conversion to string):
print(df.dtypes)
timestamp datetime64[ns] ***********
DOGE float64
dtype: object
min_time = df['timestamp'].min()
max_time = df['timestamp'].max()
#df['timestamp'] = df['timestamp'].dt.strftime('%Y-%m-%d %H:00:00')
#print(df)
#print(df.dtypes)
timeframe = pd.date_range(start=min_time, end=max_time, freq='H')
test = pd.DataFrame(timeframe, columns=['timestamp'])
print(test.dtypes)
timestamp datetime64[ns] ***********
dtype: object
print(pd.merge(left=test, right=df, on='timestamp', how='left'))
timestamp DOGE
0 2015-10-21 02:00:00 -0.023256
1 2015-10-21 03:00:00 0.023810
2 2015-10-21 04:00:00 0.000000
3 2015-10-21 05:00:00 -0.023256
4 2015-10-21 06:00:00 0.000000
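Dropping `strftime` keeps the datetime dtype, but it also drops the truncation to the hour that the question relied on. One possible way to keep both is to floor the timestamps with `Series.dt.floor`, which stays in the datetime dtype. This is a sketch under assumptions, not the answer's own code: the sample epoch values are made up to be a few seconds or minutes off the hour, and the names `raw`, `skeleton` and `merged` are hypothetical.

```python
import pandas as pd

# Hypothetical sample: epoch seconds slightly off the hour for two of the rows.
raw = pd.DataFrame(
    {
        "market_trading_pair": ["Poloniex_DOGE_BTC"] * 3,
        "ohlcv_start_date": [1445392805, 1445396460, 1445400000],
        "next_future_timestep_return": [-0.023256, 0.023810, 0.0],
    }
)

# Convert to datetime and truncate to the hour without converting to strings.
raw["timestamp"] = pd.to_datetime(raw["ohlcv_start_date"], unit="s").dt.floor("h")

# Keep only the columns needed for the merge, named after the asset.
to_merge = raw[["timestamp", "next_future_timestep_return"]].rename(
    columns={"next_future_timestep_return": "DOGE"}
)

# Hourly skeleton spanning the observed range.
skeleton = pd.DataFrame(
    {
        "timestamp": pd.date_range(
            start=to_merge["timestamp"].min(),
            end=to_merge["timestamp"].max(),
            freq="h",
        )
    }
)

merged = skeleton.merge(to_merge, on="timestamp", how="left")
print(merged)
```

Because both key columns remain datetimes, the left merge matches every hour in the skeleton, and further assets could be merged into `merged` in the same way.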
Regarding "python - How to merge DataFrames with slightly different merge fields", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/35332634/