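This is my first time trying Pandas. I think I have a reasonable use case, but I am stumbling. I want to load a tab-delimited file into a Pandas DataFrame, group it by Symbol, and plot it with the x-axis indexed by the TimeStamp column. Here is a subset of the data: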
Symbol,Price,M1,M2,Volume,TimeStamp
TBET,2.19,3,8.05,1124179,9:59:14 AM
FUEL,3.949,9,1.15,109674,9:59:11 AM
SUNH,4.37,6,0.09,24394,9:59:09 AM
FUEL,3.9099,8,1.11,105265,9:59:09 AM
TBET,2.18,2,8.03,1121629,9:59:05 AM
ORBC,3.4,2,0.22,10509,9:59:02 AM
FUEL,3.8599,7,1.07,102116,9:58:47 AM
FUEL,3.8544,6,1.05,100116,9:58:40 AM
GBR,3.83,4,0.46,64251,9:58:24 AM
GBR,3.8,3,0.45,63211,9:58:20 AM
XRA,3.6167,3,0.12,42310,9:58:08 AM
GBR,3.75,2,0.34,47521,9:57:52 AM
MPET,1.42,3,0.26,44600,9:57:52 AM
Note two things about the TimeStamp column:
- it has duplicate values, and
- the intervals are irregular.
I thought I could do something like this...
from pandas import *
import pylab as plt
df = read_csv('data.txt',index_col=5)
df.sort(ascending=False)
df.plot()
plt.show()
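But the read_csv method raises an exception: "Tried columns 1-X as index but found duplicates". Is there an option that lets me specify an index column with duplicate values?
I would also like to align my irregular timestamp intervals to one-second resolution. I still want to plot multiple events within a given second, but maybe I could introduce a unique index and then align my prices to it?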
Best answer
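I just created several issues to track some features and conveniences that I think would be nice to have: GH-856, GH-857, GH-858
We're currently working on a revamp of the time series capabilities, and aligning to one-second resolution is possible now (though not with duplicates, so some functions would need to be written for that). I also want to support duplicate timestamps in a better way. However, this is really panel (3D) data, so one way you might restructure things is the following: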
In [29]: df.pivot('Symbol', 'TimeStamp').stack()
Out[29]:
                   M1    M2   Price   Volume
Symbol TimeStamp
FUEL   9:58:40 AM   6  1.05  3.8544   100116
       9:58:47 AM   7  1.07  3.8599   102116
       9:59:09 AM   8  1.11  3.9099   105265
       9:59:11 AM   9  1.15  3.9490   109674
GBR    9:57:52 AM   2  0.34  3.7500    47521
       9:58:20 AM   3  0.45  3.8000    63211
       9:58:24 AM   4  0.46  3.8300    64251
MPET   9:57:52 AM   3  0.26  1.4200    44600
ORBC   9:59:02 AM   2  0.22  3.4000    10509
SUNH   9:59:09 AM   6  0.09  4.3700    24394
TBET   9:59:05 AM   2  8.03  2.1800  1121629
       9:59:14 AM   3  8.05  2.1900  1124179
XRA    9:58:08 AM   3  0.12  3.6167    42310
Note that this created a MultiIndex. Another way I could have gotten this:
In [32]: df.set_index(['Symbol', 'TimeStamp'])
Out[32]:
                    Price  M1    M2   Volume
Symbol TimeStamp
TBET   9:59:14 AM  2.1900   3  8.05  1124179
FUEL   9:59:11 AM  3.9490   9  1.15   109674
SUNH   9:59:09 AM  4.3700   6  0.09    24394
FUEL   9:59:09 AM  3.9099   8  1.11   105265
TBET   9:59:05 AM  2.1800   2  8.03  1121629
ORBC   9:59:02 AM  3.4000   2  0.22    10509
FUEL   9:58:47 AM  3.8599   7  1.07   102116
       9:58:40 AM  3.8544   6  1.05   100116
GBR    9:58:24 AM  3.8300   4  0.46    64251
       9:58:20 AM  3.8000   3  0.45    63211
XRA    9:58:08 AM  3.6167   3  0.12    42310
GBR    9:57:52 AM  3.7500   2  0.34    47521
MPET   9:57:52 AM  1.4200   3  0.26    44600
In [33]: df.set_index(['Symbol', 'TimeStamp']).sortlevel(0)
Out[33]:
                    Price  M1    M2   Volume
Symbol TimeStamp
FUEL   9:58:40 AM  3.8544   6  1.05   100116
       9:58:47 AM  3.8599   7  1.07   102116
       9:59:09 AM  3.9099   8  1.11   105265
       9:59:11 AM  3.9490   9  1.15   109674
GBR    9:57:52 AM  3.7500   2  0.34    47521
       9:58:20 AM  3.8000   3  0.45    63211
       9:58:24 AM  3.8300   4  0.46    64251
MPET   9:57:52 AM  1.4200   3  0.26    44600
ORBC   9:59:02 AM  3.4000   2  0.22    10509
SUNH   9:59:09 AM  4.3700   6  0.09    24394
TBET   9:59:05 AM  2.1800   2  8.03  1121629
       9:59:14 AM  2.1900   3  8.05  1124179
XRA    9:58:08 AM  3.6167   3  0.12    42310
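For readers on a recent pandas, here is a minimal sketch of the same two steps. It assumes that sortlevel is no longer available (as far as I know, later releases replaced it with sort_index) and loads a handful of rows from the question through io.StringIO instead of a file; the variable names are only for illustration.

```python
import io

import pandas as pd

# A few rows copied from the question's data (comma-separated, as shown there).
CSV = """Symbol,Price,M1,M2,Volume,TimeStamp
TBET,2.19,3,8.05,1124179,9:59:14 AM
FUEL,3.949,9,1.15,109674,9:59:11 AM
SUNH,4.37,6,0.09,24394,9:59:09 AM
FUEL,3.9099,8,1.11,105265,9:59:09 AM
TBET,2.18,2,8.03,1121629,9:59:05 AM
"""

df = pd.read_csv(io.StringIO(CSV))

# Build the (Symbol, TimeStamp) MultiIndex, then sort starting from level 0.
# With the default sort_remaining=True, ties on Symbol are ordered by TimeStamp.
indexed = df.set_index(["Symbol", "TimeStamp"]).sort_index(level=0)

print(indexed)
```

The timestamps are still strings here, so the ordering within a symbol is lexicographic; that happens to match chronological order for this sample because every time falls in the same hour.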
You can get this data into a true panel format like so:
In [35]: df.set_index(['TimeStamp', 'Symbol']).sortlevel(0).to_panel()
Out[35]:
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 11 (major) x 7 (minor)
Items: Price to Volume
Major axis: 9:57:52 AM to 9:59:14 AM
Minor axis: FUEL to XRA
In [36]: panel = df.set_index(['TimeStamp', 'Symbol']).sortlevel(0).to_panel()
In [37]: panel['Price']
Out[37]:
Symbol        FUEL   GBR  MPET  ORBC  SUNH  TBET     XRA
TimeStamp
9:57:52 AM     NaN  3.75  1.42   NaN   NaN   NaN     NaN
9:58:08 AM     NaN   NaN   NaN   NaN   NaN   NaN  3.6167
9:58:20 AM     NaN  3.80   NaN   NaN   NaN   NaN     NaN
9:58:24 AM     NaN  3.83   NaN   NaN   NaN   NaN     NaN
9:58:40 AM  3.8544   NaN   NaN   NaN   NaN   NaN     NaN
9:58:47 AM  3.8599   NaN   NaN   NaN   NaN   NaN     NaN
9:59:02 AM     NaN   NaN   NaN   3.4   NaN   NaN     NaN
9:59:05 AM     NaN   NaN   NaN   NaN   NaN  2.18     NaN
9:59:09 AM  3.9099   NaN   NaN   NaN  4.37   NaN     NaN
9:59:11 AM  3.9490   NaN   NaN   NaN   NaN   NaN     NaN
9:59:14 AM     NaN   NaN   NaN   NaN   NaN  2.19     NaN
You can then generate some plots from that data.
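Panel was removed from pandas in later releases (0.25 and up), so the session above will not run there. The sketch below gets the same wide Price table with DataFrame.pivot and wraps the plotting in a helper. The helper name plot_prices is hypothetical, and matplotlib is assumed to be installed for the plotting part only.

```python
import io

import pandas as pd

# The full sample from the question.
CSV = """Symbol,Price,M1,M2,Volume,TimeStamp
TBET,2.19,3,8.05,1124179,9:59:14 AM
FUEL,3.949,9,1.15,109674,9:59:11 AM
SUNH,4.37,6,0.09,24394,9:59:09 AM
FUEL,3.9099,8,1.11,105265,9:59:09 AM
TBET,2.18,2,8.03,1121629,9:59:05 AM
ORBC,3.4,2,0.22,10509,9:59:02 AM
FUEL,3.8599,7,1.07,102116,9:58:47 AM
FUEL,3.8544,6,1.05,100116,9:58:40 AM
GBR,3.83,4,0.46,64251,9:58:24 AM
GBR,3.8,3,0.45,63211,9:58:20 AM
XRA,3.6167,3,0.12,42310,9:58:08 AM
GBR,3.75,2,0.34,47521,9:57:52 AM
MPET,1.42,3,0.26,44600,9:57:52 AM
"""

df = pd.read_csv(io.StringIO(CSV))

# One row per timestamp, one column per symbol, Price as the cell value.
# pivot raises if any (TimeStamp, Symbol) pair occurs more than once.
wide = df.pivot(index="TimeStamp", columns="Symbol", values="Price")


def plot_prices(prices):
    """Plot one line per symbol (hypothetical helper, needs matplotlib)."""
    import matplotlib.pyplot as plt  # imported lazily; only needed when plotting

    # Markers matter here: most symbols have isolated points separated by NaN,
    # and a bare line would not draw them.
    ax = prices.plot(marker="o")
    ax.set_xlabel("TimeStamp")
    ax.set_ylabel("Price")
    plt.show()
    return ax


print(wide)
```

Calling plot_prices(wide) then draws the chart. The x-axis is still made of strings at this point, so the points are spaced evenly rather than by elapsed time.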
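Note that the timestamps are still strings. I guess they could be converted to Python datetime.time objects, which might make them a bit easier to work with. I don't have many plans to provide a lot of support for raw times as opposed to timestamps (date + time), but if enough people need it I suppose I can be convinced :)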
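A small sketch of that conversion, assuming the strings always follow the 12-hour "H:MM:SS AM/PM" pattern seen in the sample. The trading date is not in the data, so TRADE_DATE below is a made-up placeholder, and the Time and DateTime column names are only for illustration.

```python
import datetime as dt
import io

import pandas as pd

# A few rows copied from the question's data.
CSV = """Symbol,Price,M1,M2,Volume,TimeStamp
TBET,2.19,3,8.05,1124179,9:59:14 AM
FUEL,3.949,9,1.15,109674,9:59:11 AM
GBR,3.75,2,0.34,47521,9:57:52 AM
"""

df = pd.read_csv(io.StringIO(CSV))

# Parse the 12-hour clock strings. No date is present in the input,
# so the parsed values carry the default date 1900-01-01.
parsed = pd.to_datetime(df["TimeStamp"], format="%I:%M:%S %p")

# Option 1: plain datetime.time objects, as suggested in the answer.
df["Time"] = parsed.dt.time

# Option 2: full timestamps on an assumed date (placeholder, not from the source).
TRADE_DATE = pd.Timestamp("2012-01-02")
time_of_day = parsed - parsed.dt.normalize()  # Timedelta since midnight
df["DateTime"] = TRADE_DATE + time_of_day

print(df[["Symbol", "Time", "DateTime"]])
```

Full timestamps are usually the more convenient of the two, since they sort chronologically and can be used as a time-based index.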
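If you have multiple observations within the same second for a single symbol, then some of the above methods will not work. I'd like to build in better support for that in upcoming releases of pandas, so knowing your use cases will be helpful to me. Consider joining the mailing list (pystatsmodels).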
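One possible way to cope with such repeats, sketched with current pandas: number the repeats inside each (Symbol, TimeStamp) group so the index becomes unique, or collapse each second to a single value. The third data row below is invented purely to create a duplicate, and the Seq column name and the choice of the mean are assumptions, not something from the answer.

```python
import io

import pandas as pd

# Rows 1, 2 and 4 come from the question. Row 3 is a MADE-UP duplicate
# (same symbol, same second) added only to demonstrate the problem.
CSV = """Symbol,Price,M1,M2,Volume,TimeStamp
FUEL,3.949,9,1.15,109674,9:59:11 AM
FUEL,3.9099,8,1.11,105265,9:59:09 AM
FUEL,3.9050,8,1.10,105000,9:59:09 AM
TBET,2.18,2,8.03,1121629,9:59:05 AM
"""

df = pd.read_csv(io.StringIO(CSV))

# Approach 1: keep every observation. cumcount numbers repeated
# (Symbol, TimeStamp) pairs 0, 1, 2, ... in file order, which makes
# the three-level index unique.
df["Seq"] = df.groupby(["Symbol", "TimeStamp"]).cumcount()
unique = df.set_index(["Symbol", "TimeStamp", "Seq"]).sort_index()

# Approach 2: collapse each second to one value per symbol.
# pivot_table aggregates duplicates instead of raising; the mean is one choice.
mean_price = df.pivot_table(
    index="TimeStamp", columns="Symbol", values="Price", aggfunc="mean"
)

print(unique)
print(mean_price)
```

The first approach suits plots that should show every event; the second suits alignment to a one-second grid, where each symbol needs exactly one value per second.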
Source: the Stack Overflow question "python - Pandas DataFrame - desired index has duplicate values": https://stackoverflow.com/questions/9556892/