I have a DataFrame df like:
Col1
Date
2015-01-01 00:00:00 1
2015-01-01 00:00:01 1
2015-01-01 00:00:01 1
2015-01-01 00:00:01 1
2015-01-01 00:00:02 1
2015-01-01 00:00:04 1
2015-01-01 00:00:04 1
2015-01-01 00:00:06 1
2015-01-01 00:00:07 1
2015-01-01 00:00:07 1
It was created with:
import pandas as pd
from io import StringIO  # cStringIO on Python 2
dat = """Date,Col1
2015-01-01 00:00:00,1
2015-01-01 00:00:01,1
2015-01-01 00:00:01,1
2015-01-01 00:00:01,1
2015-01-01 00:00:02,1
2015-01-01 00:00:04,1
2015-01-01 00:00:04,1
2015-01-01 00:00:06,1
2015-01-01 00:00:07,1
2015-01-01 00:00:07,1"""
df = pd.read_csv(StringIO(dat))
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
This DataFrame does not have a unique index:
>>> df.index.is_unique
False
I would like to build a unique index by adding 1 millisecond (or less) to the duplicates, to get something like:
Col1
Date
2015-01-01 00:00:00.000 1
2015-01-01 00:00:01.000 1
2015-01-01 00:00:01.001 1
2015-01-01 00:00:01.002 1
2015-01-01 00:00:02.000 1
2015-01-01 00:00:04.000 1
2015-01-01 00:00:04.001 1
2015-01-01 00:00:06.000 1
2015-01-01 00:00:07.000 1
2015-01-01 00:00:07.001 1
I am looking for a vectorized solution (not a for loop), because I have a lot of data to process.
Best Answer
You can groupby the cumulative sum of the differences between the shifted Date column and the original one, number the rows inside each group with cumcount, and convert that counter to nanoseconds.
Nanoseconds (1E-9) are better than milliseconds (1E-3), because with milliseconds you can create new duplicate rows, whereas with nanoseconds you cannot (the original data has millisecond resolution, e.g. 0 2015-11-02 00:00:01.072 EUR/USD 1.10294 1.10296).
df = df.reset_index()
#number duplicates within each group and add the counter as nanoseconds
df['Date'] = df['Date'] + (df['Date'].groupby((df['Date'] != df['Date'].shift()).cumsum())
                                     .cumcount()).values.astype('timedelta64[ns]')
print(df)
Date Col1
0 2015-01-01 00:00:00.000000000 1
1 2015-01-01 00:00:01.000000000 1
2 2015-01-01 00:00:01.000000001 1
3 2015-01-01 00:00:01.000000002 1
4 2015-01-01 00:00:02.000000000 1
5 2015-01-01 00:00:04.000000000 1
6 2015-01-01 00:00:04.000000001 1
7 2015-01-01 00:00:06.000000000 1
8 2015-01-01 00:00:07.000000000 1
9 2015-01-01 00:00:07.000000001 1
#set column Date as index
df = df.set_index('Date')
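The millisecond-vs-nanosecond point above can be illustrated with a small self-contained sketch (toy timestamps, assumed for illustration only): bumping a duplicate by 1 ms can collide with an existing later stamp, while a 1 ns bump cannot.

```python
import pandas as pd

# millisecond-resolution data with a duplicate followed by the "next" ms
dates = pd.to_datetime(['2015-11-02 00:00:01.072',
                        '2015-11-02 00:00:01.072',
                        '2015-11-02 00:00:01.073'])
s = pd.Series(dates)

# per-group counter: 0, 1, 0
counter = s.groupby((s != s.shift()).cumsum()).cumcount()

by_ms = s + pd.to_timedelta(counter, unit='ms')  # second row becomes ...01.073
by_ns = s + pd.to_timedelta(counter, unit='ns')  # second row becomes ...01.072000001

assert not by_ms.is_unique  # millisecond bump created a new duplicate
assert by_ns.is_unique      # nanosecond bump did not
```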
The fastest solution uses nanoseconds, and it works as long as the number of duplicated rows is less than 1000000 (1E6), so the added nanoseconds can never spill over into the next millisecond.
So if you work with the csv (3898069 rows), first check this count, because if it is above 1E6 the fast method is not safe:
import pandas as pd
df = pd.read_csv('test/EURUSD-2015-11.csv', header=None, parse_dates=[1],
names =['eurusd','Date','a','b'], sep=",")
#sort values if not sorted
df = df.sort_values('Date')
print(df.head())
print(df[df['Date'] == df['Date'].shift()])
eurusd Date a b
1996 EUR/USD 2015-11-02 00:51:18.198 1.10323 1.10327
2944 EUR/USD 2015-11-02 01:00:03.844 1.10321 1.10326
6450 EUR/USD 2015-11-02 01:37:35.898 1.10319 1.10324
11429 EUR/USD 2015-11-02 02:24:29.945 1.10301 1.10306
19468 EUR/USD 2015-11-02 03:13:40.575 1.10326 1.10333
20074 EUR/USD 2015-11-02 03:17:03.607 1.10282 1.10288
36618 EUR/USD 2015-11-02 04:36:01.357 1.10213 1.10217
40235 EUR/USD 2015-11-02 04:49:05.946 1.10075 1.10082
42930 EUR/USD 2015-11-02 05:01:37.955 1.10034 1.10042
43269 EUR/USD 2015-11-02 05:03:21.360 1.10070 1.10073
47043 EUR/USD 2015-11-02 05:22:59.811 1.10142 1.10149
47526 EUR/USD 2015-11-02 05:25:45.474 1.10143 1.10150
53398 EUR/USD 2015-11-02 05:58:23.674 1.10294 1.10299
59899 EUR/USD 2015-11-02 06:44:55.266 1.10145 1.10150
64480 EUR/USD 2015-11-02 07:30:27.091 1.10211 1.10217
70576 EUR/USD 2015-11-02 08:14:04.318 1.10329 1.10336
75662 EUR/USD 2015-11-02 08:54:35.138 1.10485 1.10486
75724 EUR/USD 2015-11-02 08:55:00.577 1.10504 1.10507
93917 EUR/USD 2015-11-02 10:55:20.863 1.10345 1.10349
94603 EUR/USD 2015-11-02 10:57:56.289 1.10352 1.10356
98046 EUR/USD 2015-11-02 11:16:24.127 1.10272 1.10278
98433 EUR/USD 2015-11-02 11:19:14.109 1.10281 1.10286
100582 EUR/USD 2015-11-02 11:31:57.891 1.10247 1.10252
105627 EUR/USD 2015-11-02 12:11:01.900 1.10243 1.10246
106789 EUR/USD 2015-11-02 12:19:45.974 1.10183 1.10190
115219 EUR/USD 2015-11-02 14:06:47.229 1.10194 1.10200
116808 EUR/USD 2015-11-02 14:35:50.693 1.10204 1.10211
124436 EUR/USD 2015-11-02 17:06:48.286 1.10125 1.10144
124532 EUR/USD 2015-11-02 17:07:56.048 1.10160 1.10174
124734 EUR/USD 2015-11-02 17:11:51.609 1.10123 1.10142
... ... ... ... ...
3893816 EUR/USD 2015-11-30 20:59:38.304 1.05651 1.05655
3893818 EUR/USD 2015-11-30 20:59:39.341 1.05650 1.05653
3893819 EUR/USD 2015-11-30 20:59:39.976 1.05651 1.05653
3893820 EUR/USD 2015-11-30 20:59:45.170 1.05652 1.05653
3895397 EUR/USD 2015-11-30 20:59:51.605 1.05654 1.05658
3895398 EUR/USD 2015-11-30 20:59:51.707 1.05655 1.05659
3893838 EUR/USD 2015-11-30 20:59:51.767 1.05656 1.05657
3893841 EUR/USD 2015-11-30 20:59:51.816 1.05658 1.05662
3895401 EUR/USD 2015-11-30 20:59:52.073 1.05659 1.05663
3895402 EUR/USD 2015-11-30 20:59:52.229 1.05660 1.05664
3893847 EUR/USD 2015-11-30 20:59:52.818 1.05659 1.05663
3895404 EUR/USD 2015-11-30 20:59:52.915 1.05660 1.05664
3893852 EUR/USD 2015-11-30 20:59:53.106 1.05661 1.05662
3893855 EUR/USD 2015-11-30 20:59:57.031 1.05662 1.05664
3895407 EUR/USD 2015-11-30 20:59:57.084 1.05664 1.05668
3895416 EUR/USD 2015-11-30 21:00:00.816 1.05664 1.05665
3895718 EUR/USD 2015-11-30 21:05:45.605 1.05666 1.05670
3895857 EUR/USD 2015-11-30 21:12:38.965 1.05659 1.05663
3895866 EUR/USD 2015-11-30 21:12:44.505 1.05666 1.05666
3895899 EUR/USD 2015-11-30 21:13:07.805 1.05669 1.05673
3895931 EUR/USD 2015-11-30 21:13:55.007 1.05675 1.05677
3896093 EUR/USD 2015-11-30 21:25:27.988 1.05658 1.05663
3896097 EUR/USD 2015-11-30 21:25:28.002 1.05661 1.05665
3896209 EUR/USD 2015-11-30 21:28:25.906 1.05655 1.05660
3896307 EUR/USD 2015-11-30 21:32:32.490 1.05653 1.05658
3896342 EUR/USD 2015-11-30 21:35:40.005 1.05654 1.05660
3896393 EUR/USD 2015-11-30 21:40:40.182 1.05648 1.05652
3896849 EUR/USD 2015-11-30 22:19:34.582 1.05670 1.05684
3897168 EUR/USD 2015-11-30 22:40:27.108 1.05675 1.05686
3897389 EUR/USD 2015-11-30 22:50:46.825 1.05705 1.05717
[35636 rows x 4 columns]
print(len(df[df['Date'] == df['Date'].shift()]))
35636
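Since the EURUSD csv itself is not reproduced here, the same duplicate check can be sketched on the small frame from the question: a row counts as a duplicate when its Date equals the previous row's Date.

```python
import pandas as pd
from io import StringIO

dat = """Date,Col1
2015-01-01 00:00:00,1
2015-01-01 00:00:01,1
2015-01-01 00:00:01,1
2015-01-01 00:00:01,1
2015-01-01 00:00:02,1
2015-01-01 00:00:04,1
2015-01-01 00:00:04,1
2015-01-01 00:00:06,1
2015-01-01 00:00:07,1
2015-01-01 00:00:07,1"""

df = pd.read_csv(StringIO(dat), parse_dates=['Date'])

# rows equal to their predecessor: two extra 00:00:01, one 00:00:04, one 00:00:07
n_dup = len(df[df['Date'] == df['Date'].shift()])
print(n_dup)  # 4, far below the 1E6 limit
```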
So 35636 is less than 1000000, and the counter added to the duplicated rows stays at or below 999999 nanoseconds:
df.loc[df['Date'] == df['Date'].shift(), 'Date'] = df['Date'] + \
    ((df['Date'] == df['Date'].shift()).cumsum()).astype('timedelta64[ns]')
print(df)
Date Col1
0 2015-01-01 00:00:00.000000000 1
1 2015-01-01 00:00:01.000000000 1
2 2015-01-01 00:00:01.000000001 1
3 2015-01-01 00:00:01.000000002 1
4 2015-01-01 00:00:02.000000000 1
5 2015-01-01 00:00:04.000000000 1
6 2015-01-01 00:00:04.000000003 1
7 2015-01-01 00:00:06.000000000 1
8 2015-01-01 00:00:07.000000000 1
9 2015-01-01 00:00:07.000000004 1
...
99945 2015-01-01 23:59:09.000999999 1
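The faster masked variant can be sketched end to end on the question's small frame (a minimal sketch; note the counter is cumulative over the whole column, which is why the second 00:00:04 row gets +3 ns rather than +1 ns):

```python
import pandas as pd
from io import StringIO

dat = """Date,Col1
2015-01-01 00:00:01,1
2015-01-01 00:00:01,1
2015-01-01 00:00:01,1
2015-01-01 00:00:04,1
2015-01-01 00:00:04,1"""

df = pd.read_csv(StringIO(dat), parse_dates=['Date'])

# mask of rows equal to their predecessor: [F, T, T, F, T]
dup = df['Date'] == df['Date'].shift()

# global counter over the mask: [0, 1, 2, 2, 3]; only masked rows are changed
df.loc[dup, 'Date'] = df['Date'] + pd.to_timedelta(dup.cumsum(), unit='ns')

assert df['Date'].is_unique
```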
Comparison:
import pandas as pd
df = pd.read_csv('test/EURUSD-2015-11.csv', header=None, parse_dates=[1],
names =['eurusd','Date','a','b'], sep=",")
#sort values if not sorted
df = df.sort_values('Date')
print(df.head())
#print(df[df['Date'] == df['Date'].shift()])
#print(len(df[df['Date'] == df['Date'].shift()]))
df3 = df.copy()
def ori(df):
    df['Date'] = df['Date'] + (df['Date'].groupby((df['Date'] != df['Date'].shift())
                               .cumsum()).cumcount()).values.astype('timedelta64[ns]')
    return df

def new(df):
    df.loc[df['Date'] == df['Date'].shift(), 'Date'] = df['Date'] + \
        ((df['Date'] == df['Date'].shift()).cumsum()).astype('timedelta64[ns]')
    return df
df1 = ori(df)
df2 = new(df3)
print(df1.head())
print(df2.head())
The timings are much better for the new solution:
In [81]: %timeit ori(df)
1 loops, best of 3: 2min 22s per loop
Compiler time: 0.10 s
In [82]: %timeit new(df)
1 loops, best of 3: 758 ms per loop
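As a side note, on current pandas the per-group counter can also be written directly against the index with groupby(level=0), which groups equal index values; this is equivalent to the shift/cumsum trick as long as the data is sorted (a sketch under that assumption, not from the original answer):

```python
import pandas as pd
from io import StringIO

dat = """Date,Col1
2015-01-01 00:00:01,1
2015-01-01 00:00:01,1
2015-01-01 00:00:01,1
2015-01-01 00:00:02,1"""

df = pd.read_csv(StringIO(dat), parse_dates=['Date'], index_col='Date')

# number rows within each run of equal index values, then add as nanoseconds
ns = df.groupby(level=0).cumcount().to_numpy()
df.index = df.index + pd.to_timedelta(ns, unit='ns')

assert df.index.is_unique
```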
Regarding python - Create DataFrame with DatetimeIndex with unique values by adding a timedelta, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/34575126/