I am using pandas 0.17.0 and have a df similar to this one:
df.head()
Out[339]:
A B C
DATE_TIME
2016-10-08 13:57:00 in 5.61 1
2016-10-08 14:02:00 in 8.05 1
2016-10-08 14:07:00 in 7.92 0
2016-10-08 14:12:00 in 7.98 0
2016-10-08 14:17:00 out 8.18 0
df.tail()
Out[340]:
A B C
DATE_TIME
2016-11-08 13:42:00 in 8.00 0
2016-11-08 13:47:00 in 7.99 0
2016-11-08 13:52:00 out 7.97 0
2016-11-08 13:57:00 in 8.14 1
2016-11-08 14:02:00 in 8.16 1
with the following dtypes:
print (df.dtypes)
A object
B float64
C int64
dtype: object
When I reindex the df to one-minute intervals, every int64 column changes to float64.
index = pd.date_range(df.index[0], df.index[-1], freq="min")
df2 = df.reindex(index)
print (df2.dtypes)
A object
B float64
C float64
dtype: object
Also, if I try to resample
df3 = df.resample('Min')
the int64 columns turn into float64 and, for some reason, I lose my object column.
print (df3.dtypes)
B float64
C float64
dtype: object
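In a later step (after joining the df with another df) I want to interpolate the columns differently depending on this distinction, so I need them to keep their original dtypes. My real df has many more columns of each type, so I am looking for a solution that does not depend on calling the columns individually by label.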
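Is there a way to preserve the dtypes during reindexing? Or is there a way to assign the dtypes back afterwards (these are the only columns that, apart from NaN, consist solely of integers)?
Can anyone help me?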
Best answer
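That is impossible, because as soon as a column contains at least one NaN value, int is converted to float: NaN is a floating-point value and int64 has no way to represent it.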
index = pd.date_range(df.index[0], df.index[-1], freq="min")
df2 = df.reindex(index)
print (df2)
A B C
2016-10-08 13:57:00 in 5.61 1.0
2016-10-08 13:58:00 NaN NaN NaN
2016-10-08 13:59:00 NaN NaN NaN
2016-10-08 14:00:00 NaN NaN NaN
2016-10-08 14:01:00 NaN NaN NaN
2016-10-08 14:02:00 in 8.05 1.0
2016-10-08 14:03:00 NaN NaN NaN
2016-10-08 14:04:00 NaN NaN NaN
2016-10-08 14:05:00 NaN NaN NaN
2016-10-08 14:06:00 NaN NaN NaN
2016-10-08 14:07:00 in 7.92 0.0
2016-10-08 14:08:00 NaN NaN NaN
2016-10-08 14:09:00 NaN NaN NaN
2016-10-08 14:10:00 NaN NaN NaN
2016-10-08 14:11:00 NaN NaN NaN
2016-10-08 14:12:00 in 7.98 0.0
2016-10-08 14:13:00 NaN NaN NaN
2016-10-08 14:14:00 NaN NaN NaN
2016-10-08 14:15:00 NaN NaN NaN
2016-10-08 14:16:00 NaN NaN NaN
2016-10-08 14:17:00 out 8.18 0.0
print (df2.dtypes)
A object
B float64
C float64
dtype: object
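To isolate the effect, here is a minimal sketch on a single integer Series. The data and the names `s`, `minute_idx` and `upsampled` are made up for illustration and are not part of the original question; it also assumes a reasonably recent pandas rather than 0.17.0.

```python
import numpy as np
import pandas as pd

# A small integer Series on a 5-minute index (illustrative data only).
idx = pd.date_range("2016-10-08 13:57", periods=3, freq="5min")
s = pd.Series([1, 1, 0], index=idx, dtype="int64")

# Reindexing to 1-minute steps creates rows with no source value;
# pandas fills them with NaN, which forces the upcast to float64.
minute_idx = pd.date_range(idx[0], idx[-1], freq="min")
upsampled = s.reindex(minute_idx)

print(s.dtype)                  # int64
print(upsampled.dtype)          # float64
print(upsampled.isna().sum())   # 8
```

The three original values survive, and the eight new minute rows are the NaN entries that trigger the dtype change.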
But if you pass the fill_value parameter to reindex, the dtypes do not change:
index = pd.date_range(df.index[0], df.index[-1], freq="min")
df2 = df.reindex(index, fill_value=0)
print (df2)
A B C
2016-10-08 13:57:00 in 5.61 1
2016-10-08 13:58:00 0 0.00 0
2016-10-08 13:59:00 0 0.00 0
2016-10-08 14:00:00 0 0.00 0
2016-10-08 14:01:00 0 0.00 0
2016-10-08 14:02:00 in 8.05 1
2016-10-08 14:03:00 0 0.00 0
2016-10-08 14:04:00 0 0.00 0
2016-10-08 14:05:00 0 0.00 0
2016-10-08 14:06:00 0 0.00 0
2016-10-08 14:07:00 in 7.92 0
2016-10-08 14:08:00 0 0.00 0
2016-10-08 14:09:00 0 0.00 0
2016-10-08 14:10:00 0 0.00 0
2016-10-08 14:11:00 0 0.00 0
2016-10-08 14:12:00 in 7.98 0
2016-10-08 14:13:00 0 0.00 0
2016-10-08 14:14:00 0 0.00 0
2016-10-08 14:15:00 0 0.00 0
2016-10-08 14:16:00 0 0.00 0
2016-10-08 14:17:00 out 8.18 0
print (df2.dtypes)
A object
B float64
C int64
dtype: object
A better approach is to use method='ffill' in reindex:
index = pd.date_range(df.index[0], df.index[-1], freq="min")
df2 = df.reindex(index, method='ffill')
print (df2)
A B C
2016-10-08 13:57:00 in 5.61 1
2016-10-08 13:58:00 in 5.61 1
2016-10-08 13:59:00 in 5.61 1
2016-10-08 14:00:00 in 5.61 1
2016-10-08 14:01:00 in 5.61 1
2016-10-08 14:02:00 in 8.05 1
2016-10-08 14:03:00 in 8.05 1
2016-10-08 14:04:00 in 8.05 1
2016-10-08 14:05:00 in 8.05 1
2016-10-08 14:06:00 in 8.05 1
2016-10-08 14:07:00 in 7.92 0
2016-10-08 14:08:00 in 7.92 0
2016-10-08 14:09:00 in 7.92 0
2016-10-08 14:10:00 in 7.92 0
2016-10-08 14:11:00 in 7.92 0
2016-10-08 14:12:00 in 7.98 0
2016-10-08 14:13:00 in 7.98 0
2016-10-08 14:14:00 in 7.98 0
2016-10-08 14:15:00 in 7.98 0
2016-10-08 14:16:00 in 7.98 0
2016-10-08 14:17:00 out 8.18 0
print (df2.dtypes)
A object
B float64
C int64
dtype: object
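A self-contained way to check this behaviour is sketched below. The three-row frame only mimics the shape of the data in the question, and `minute_idx` and `filled` are illustrative names. Because every new timestamp has an earlier row to copy from, no NaN is ever introduced and nothing needs to be upcast.

```python
import pandas as pd

# Three rows shaped like the question's data (illustrative values).
idx = pd.to_datetime(["2016-10-08 13:57", "2016-10-08 14:02", "2016-10-08 14:07"])
df = pd.DataFrame(
    {"A": ["in", "in", "out"], "B": [5.61, 8.05, 7.92], "C": [1, 1, 0]},
    index=idx,
)

# Forward-fill while reindexing: each new minute copies the last known row.
minute_idx = pd.date_range(df.index[0], df.index[-1], freq="min")
filled = df.reindex(minute_idx, method="ffill")

# Compare the dtypes column by column with the original frame.
print((filled.dtypes == df.dtypes).all())
```

Note that method='ffill' requires a sorted (monotonic) index, which a time series like this one normally has.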
If you use resample, you can get the A column back with unstack and stack, but unfortunately the float problem remains:
# pandas 0.17 syntax; from pandas 0.18 on, use .resample('Min').ffill() instead
df3 = (df.set_index('A', append=True)
         .unstack()
         .resample('Min', fill_method='ffill')
         .stack()
         .reset_index(level=1))
print (df3)
A B C
DATE_TIME
2016-10-08 13:57:00 in 5.61 1.0
2016-10-08 13:58:00 in 5.61 1.0
2016-10-08 13:59:00 in 5.61 1.0
2016-10-08 14:00:00 in 5.61 1.0
2016-10-08 14:01:00 in 5.61 1.0
2016-10-08 14:02:00 in 8.05 1.0
2016-10-08 14:03:00 in 8.05 1.0
2016-10-08 14:04:00 in 8.05 1.0
2016-10-08 14:05:00 in 8.05 1.0
2016-10-08 14:06:00 in 8.05 1.0
2016-10-08 14:07:00 in 7.92 0.0
2016-10-08 14:08:00 in 7.92 0.0
2016-10-08 14:09:00 in 7.92 0.0
2016-10-08 14:10:00 in 7.92 0.0
2016-10-08 14:11:00 in 7.92 0.0
2016-10-08 14:12:00 in 7.98 0.0
2016-10-08 14:13:00 in 7.98 0.0
2016-10-08 14:14:00 in 7.98 0.0
2016-10-08 14:15:00 in 7.98 0.0
2016-10-08 14:16:00 in 7.98 0.0
2016-10-08 14:17:00 out 8.18 0.0
print (df3.dtypes)
A object
B float64
C float64
dtype: object
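For readers on a newer pandas than the 0.17.0 used in the question, the following sketch may be relevant. It assumes pandas 0.18 or later, where resample() returns a deferred Resampler object and the fill is selected by a method call rather than by fill_method. The small frame and the name `upsampled` are illustrative.

```python
import pandas as pd

# Three rows shaped like the question's data (illustrative values).
idx = pd.to_datetime(["2016-10-08 13:57", "2016-10-08 14:02", "2016-10-08 14:07"])
df = pd.DataFrame(
    {"A": ["in", "in", "out"], "B": [5.61, 8.05, 7.92], "C": [1, 1, 0]},
    index=idx,
)

# Assumption: pandas >= 0.18. Upsample to one-minute steps and forward-fill
# in a single step, without aggregating the rows first.
upsampled = df.resample("min").ffill()

print(upsampled.dtypes)
```

In this form no aggregation such as a mean is applied, so the non-numeric column A is kept and the unstack/stack detour should not be needed.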
I tried adapting the previous answer to cast the columns back to int:
int_cols = df.select_dtypes(['int64']).columns
print (int_cols)
Index(['C'], dtype='object')
index = pd.date_range(df.index[0], df.index[-1], freq="s")
df2 = df.reindex(index)
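for col in df2:
    if col in int_cols:
        # originally integer columns: forward-fill the gaps, then cast back to int
        df2[col] = df2[col].ffill().astype(int)
    elif df2[col].dtype == float:
        # float columns: interpolate between the known values
        df2[col] = df2[col].interpolate()
    else:
        # remaining (object) columns: forward-fill
        df2[col] = df2[col].ffill()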
#print (df2)
print (df2.dtypes)
A object
B float64
C int32
dtype: object
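The same idea can be wrapped in a helper that never refers to columns by label and casts back to the exact original dtype instead of the platform default int. This is only a sketch: the function name `upsample_keep_dtypes` and the sample frame are hypothetical, and it assumes a pandas version that provides `pd.api.types` (not available in 0.17.0).

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype


def upsample_keep_dtypes(df, freq="min"):
    """Reindex df to a regular grid, fill per dtype, restore integer dtypes."""
    original_dtypes = df.dtypes
    new_index = pd.date_range(df.index[0], df.index[-1], freq=freq)
    out = df.reindex(new_index)  # integer columns become float64 here
    for col in out.columns:
        dtype = original_dtypes[col]
        if is_integer_dtype(dtype):
            # Forward-fill removes every NaN, so the cast back is safe.
            out[col] = out[col].ffill().astype(dtype)
        elif is_float_dtype(dtype):
            # Linear interpolation between the known values.
            out[col] = out[col].interpolate()
        else:
            # Anything else (for example strings): forward-fill.
            out[col] = out[col].ffill()
    return out


# Hypothetical usage with three rows shaped like the question's data.
idx = pd.to_datetime(["2016-10-08 13:57", "2016-10-08 14:02", "2016-10-08 14:07"])
df = pd.DataFrame(
    {"A": ["in", "in", "out"], "B": [5.61, 8.05, 7.92], "C": [1, 1, 0]},
    index=idx,
)
result = upsample_keep_dtypes(df)
print(result.dtypes)
```

Recording `df.dtypes` before reindexing is what makes the cast independent of the platform: the C column comes back as whatever integer type it started with, rather than as int32 on systems where the default int is 32 bits.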
Regarding "python - Is there a way to prevent dtype from changing from Int64 to float64 when reindexing/upsampling a time series?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/39219023/