python - Creating a DataFrame with a unique-valued DatetimeIndex by adding a timedelta

Tags: python pandas indexing

I have a DataFrame df like this:

                     Col1
Date
2015-01-01 00:00:00     1
2015-01-01 00:00:01     1
2015-01-01 00:00:01     1
2015-01-01 00:00:01     1
2015-01-01 00:00:02     1
2015-01-01 00:00:04     1
2015-01-01 00:00:04     1
2015-01-01 00:00:06     1
2015-01-01 00:00:07     1
2015-01-01 00:00:07     1

It was created with the following:

import pandas as pd
from io import StringIO  # cStringIO only exists in Python 2

dat = """Date,Col1
2015-01-01 00:00:00,1
2015-01-01 00:00:01,1
2015-01-01 00:00:01,1
2015-01-01 00:00:01,1
2015-01-01 00:00:02,1
2015-01-01 00:00:04,1
2015-01-01 00:00:04,1
2015-01-01 00:00:06,1
2015-01-01 00:00:07,1
2015-01-01 00:00:07,1"""

df = pd.read_csv(StringIO(dat))
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')

This DataFrame does not have a unique index:

>>> df.index.is_unique
False
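To see which rows cause this, one option is `Index.duplicated`. The sketch below uses a shortened, hypothetical copy of the data above (the name `toy` is only for illustration):

```python
import pandas as pd

# Shortened copy of the question's index: 00:00:01 appears three times.
idx = pd.to_datetime([
    "2015-01-01 00:00:00",
    "2015-01-01 00:00:01",
    "2015-01-01 00:00:01",
    "2015-01-01 00:00:01",
    "2015-01-01 00:00:02",
])
toy = pd.DataFrame({"Col1": 1}, index=idx)

# True for every repeat after the first occurrence of a timestamp.
dup_mask = toy.index.duplicated(keep="first")

print(toy.index.is_unique)   # False
print(int(dup_mask.sum()))   # 2
```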

I would like to build a unique index by adding 1 millisecond (or less) to the repeated values, to get something like this:

                         Col1
Date
2015-01-01 00:00:00.000     1
2015-01-01 00:00:01.000     1
2015-01-01 00:00:01.001     1
2015-01-01 00:00:01.002     1
2015-01-01 00:00:02.000     1
2015-01-01 00:00:04.000     1
2015-01-01 00:00:04.001     1
2015-01-01 00:00:06.000     1
2015-01-01 00:00:07.000     1
2015-01-01 00:00:07.001     1

I'm looking for a vectorized solution (no for loops), because I have a lot of data to process.

Best answer

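You can group the rows into runs of equal timestamps: compare column Date with its shifted copy and take the cumsum of the differences to label each run, then use groupby with cumcount to number the rows inside each run, and convert that number to nanoseconds.

Nanoseconds (1E-9) are better than milliseconds (1E-3): adding milliseconds can create new duplicate rows, whereas adding nanoseconds cannot, because the raw data has millisecond resolution (for example: 0 2015-11-02 00:00:01.072 EUR/USD 1.10294 1.10296).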
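A small sketch of why the unit matters, using three made-up millisecond timestamps (they are an assumption for illustration, not rows from the real file). `pd.to_timedelta` is used here in place of the `astype` cast:

```python
import pandas as pd

# Hypothetical data: one duplicate, followed by the very next millisecond tick.
dates = pd.Series(pd.to_datetime([
    "2015-11-02 00:00:01.072",
    "2015-11-02 00:00:01.072",
    "2015-11-02 00:00:01.073",
]))

# Label runs of equal timestamps, then number the rows inside each run: 0, 1, 0.
run_id = (dates != dates.shift()).cumsum()
offset = dates.groupby(run_id).cumcount()

with_ms = dates + pd.to_timedelta(offset, unit="ms")
with_ns = dates + pd.to_timedelta(offset, unit="ns")

print(with_ms.is_unique)  # False: .072 + 1 ms collides with the existing .073
print(with_ns.is_unique)  # True
```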

df = df.reset_index()
# add a nanosecond offset (0, 1, 2, ...) within each run of equal dates
df['Date'] =  df['Date'] + (df['Date'].groupby((df['Date'] != df['Date'].shift()).cumsum())
                                      .cumcount()).values.astype('timedelta64[ns]')
print(df)

                          Date  Col1
0 2015-01-01 00:00:00.000000000     1
1 2015-01-01 00:00:01.000000000     1
2 2015-01-01 00:00:01.000000001     1
3 2015-01-01 00:00:01.000000002     1
4 2015-01-01 00:00:02.000000000     1
5 2015-01-01 00:00:04.000000000     1
6 2015-01-01 00:00:04.000000001     1
7 2015-01-01 00:00:06.000000000     1
8 2015-01-01 00:00:07.000000000     1
9 2015-01-01 00:00:07.000000001     1

#set column Date as index
df = df.set_index('Date')
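The same idea can be wrapped in a helper that works on the index directly, without the reset_index / set_index round trip. This is only a sketch: the function name `add_ns_offsets` is hypothetical, and it assumes the index is already sorted.

```python
import pandas as pd

def add_ns_offsets(index):
    """Nudge repeated values of a sorted DatetimeIndex by 0, 1, 2, ... ns."""
    s = index.to_series().reset_index(drop=True)
    run_id = (s != s.shift()).cumsum()       # label runs of equal timestamps
    offset = s.groupby(run_id).cumcount()    # position of each row inside its run
    shifted = s + pd.to_timedelta(offset, unit="ns")
    return pd.DatetimeIndex(shifted, name=index.name)

# The question's ten timestamps.
idx = pd.to_datetime([
    "2015-01-01 00:00:00", "2015-01-01 00:00:01", "2015-01-01 00:00:01",
    "2015-01-01 00:00:01", "2015-01-01 00:00:02", "2015-01-01 00:00:04",
    "2015-01-01 00:00:04", "2015-01-01 00:00:06", "2015-01-01 00:00:07",
    "2015-01-01 00:00:07",
]).rename("Date")
toy = pd.DataFrame({"Col1": 1}, index=idx)

toy.index = add_ns_offsets(toy.index)
print(toy.index.is_unique)  # True
```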

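The fastest solution also uses nanoseconds, and it can be used if the maximum length of the duplicated data is less than 1000000 (1E6).

So if you read a csv (3898069 rows here), check this length first when df has more than 1E6 rows: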

import pandas as pd

df = pd.read_csv('test/EURUSD-2015-11.csv', header=None, parse_dates=[1],
                  names =['eurusd','Date','a','b'], sep=",")

#sort values if not sorted
df = df.sort_values('Date')
print(df.head())
# rows whose Date equals the Date of the previous row
print(df[df['Date'] == df['Date'].shift()])
          eurusd                    Date        a        b
1996     EUR/USD 2015-11-02 00:51:18.198  1.10323  1.10327
2944     EUR/USD 2015-11-02 01:00:03.844  1.10321  1.10326
6450     EUR/USD 2015-11-02 01:37:35.898  1.10319  1.10324
11429    EUR/USD 2015-11-02 02:24:29.945  1.10301  1.10306
19468    EUR/USD 2015-11-02 03:13:40.575  1.10326  1.10333
20074    EUR/USD 2015-11-02 03:17:03.607  1.10282  1.10288
36618    EUR/USD 2015-11-02 04:36:01.357  1.10213  1.10217
40235    EUR/USD 2015-11-02 04:49:05.946  1.10075  1.10082
42930    EUR/USD 2015-11-02 05:01:37.955  1.10034  1.10042
43269    EUR/USD 2015-11-02 05:03:21.360  1.10070  1.10073
47043    EUR/USD 2015-11-02 05:22:59.811  1.10142  1.10149
47526    EUR/USD 2015-11-02 05:25:45.474  1.10143  1.10150
53398    EUR/USD 2015-11-02 05:58:23.674  1.10294  1.10299
59899    EUR/USD 2015-11-02 06:44:55.266  1.10145  1.10150
64480    EUR/USD 2015-11-02 07:30:27.091  1.10211  1.10217
70576    EUR/USD 2015-11-02 08:14:04.318  1.10329  1.10336
75662    EUR/USD 2015-11-02 08:54:35.138  1.10485  1.10486
75724    EUR/USD 2015-11-02 08:55:00.577  1.10504  1.10507
93917    EUR/USD 2015-11-02 10:55:20.863  1.10345  1.10349
94603    EUR/USD 2015-11-02 10:57:56.289  1.10352  1.10356
98046    EUR/USD 2015-11-02 11:16:24.127  1.10272  1.10278
98433    EUR/USD 2015-11-02 11:19:14.109  1.10281  1.10286
100582   EUR/USD 2015-11-02 11:31:57.891  1.10247  1.10252
105627   EUR/USD 2015-11-02 12:11:01.900  1.10243  1.10246
106789   EUR/USD 2015-11-02 12:19:45.974  1.10183  1.10190
115219   EUR/USD 2015-11-02 14:06:47.229  1.10194  1.10200
116808   EUR/USD 2015-11-02 14:35:50.693  1.10204  1.10211
124436   EUR/USD 2015-11-02 17:06:48.286  1.10125  1.10144
124532   EUR/USD 2015-11-02 17:07:56.048  1.10160  1.10174
124734   EUR/USD 2015-11-02 17:11:51.609  1.10123  1.10142
...          ...                     ...      ...      ...
3893816  EUR/USD 2015-11-30 20:59:38.304  1.05651  1.05655
3893818  EUR/USD 2015-11-30 20:59:39.341  1.05650  1.05653
3893819  EUR/USD 2015-11-30 20:59:39.976  1.05651  1.05653
3893820  EUR/USD 2015-11-30 20:59:45.170  1.05652  1.05653
3895397  EUR/USD 2015-11-30 20:59:51.605  1.05654  1.05658
3895398  EUR/USD 2015-11-30 20:59:51.707  1.05655  1.05659
3893838  EUR/USD 2015-11-30 20:59:51.767  1.05656  1.05657
3893841  EUR/USD 2015-11-30 20:59:51.816  1.05658  1.05662
3895401  EUR/USD 2015-11-30 20:59:52.073  1.05659  1.05663
3895402  EUR/USD 2015-11-30 20:59:52.229  1.05660  1.05664
3893847  EUR/USD 2015-11-30 20:59:52.818  1.05659  1.05663
3895404  EUR/USD 2015-11-30 20:59:52.915  1.05660  1.05664
3893852  EUR/USD 2015-11-30 20:59:53.106  1.05661  1.05662
3893855  EUR/USD 2015-11-30 20:59:57.031  1.05662  1.05664
3895407  EUR/USD 2015-11-30 20:59:57.084  1.05664  1.05668
3895416  EUR/USD 2015-11-30 21:00:00.816  1.05664  1.05665
3895718  EUR/USD 2015-11-30 21:05:45.605  1.05666  1.05670
3895857  EUR/USD 2015-11-30 21:12:38.965  1.05659  1.05663
3895866  EUR/USD 2015-11-30 21:12:44.505  1.05666  1.05666
3895899  EUR/USD 2015-11-30 21:13:07.805  1.05669  1.05673
3895931  EUR/USD 2015-11-30 21:13:55.007  1.05675  1.05677
3896093  EUR/USD 2015-11-30 21:25:27.988  1.05658  1.05663
3896097  EUR/USD 2015-11-30 21:25:28.002  1.05661  1.05665
3896209  EUR/USD 2015-11-30 21:28:25.906  1.05655  1.05660
3896307  EUR/USD 2015-11-30 21:32:32.490  1.05653  1.05658
3896342  EUR/USD 2015-11-30 21:35:40.005  1.05654  1.05660
3896393  EUR/USD 2015-11-30 21:40:40.182  1.05648  1.05652
3896849  EUR/USD 2015-11-30 22:19:34.582  1.05670  1.05684
3897168  EUR/USD 2015-11-30 22:40:27.108  1.05675  1.05686
3897389  EUR/USD 2015-11-30 22:50:46.825  1.05705  1.05717

[35636 rows x 4 columns]
print(len(df[df['Date'] == df['Date'].shift()]))
35636

35636 is less than 1000000, so you can give each duplicated row a running nanosecond count (which may go up to 999999):

# parentheses are needed so the expression can continue over several lines
df.loc[df['Date'] == df['Date'].shift(), 'Date'] = (
    df['Date'] +
    (df['Date'] == df['Date'].shift()).cumsum().astype('timedelta64[ns]')
)

print(df)

                           Date  Col1
0 2015-01-01 00:00:00.000000000     1
1 2015-01-01 00:00:01.000000000     1
2 2015-01-01 00:00:01.000000001     1
3 2015-01-01 00:00:01.000000002     1
4 2015-01-01 00:00:02.000000000     1
5 2015-01-01 00:00:04.000000000     1
6 2015-01-01 00:00:04.000000003     1
7 2015-01-01 00:00:06.000000000     1
8 2015-01-01 00:00:07.000000000     1
9 2015-01-01 00:00:07.000000004     1

.
.
.
99945 2015-01-01 23:59:09.000999999     1
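The 1E6 limit follows from the millisecond resolution of the raw data: two distinct timestamps are at least 1000000 ns apart, so a running counter below that value cannot push a row onto the next timestamp. A sketch of the same approach with this check made explicit; the function name `nudge_duplicates` is hypothetical, and `Series.where` is swapped in for the `.loc` assignment:

```python
import pandas as pd

NS_PER_MS = 1_000_000  # nanoseconds between two adjacent millisecond ticks

def nudge_duplicates(dates):
    """Add a running duplicate counter (in ns) to repeated values of a sorted Series."""
    is_dup = dates == dates.shift()
    if int(is_dup.sum()) >= NS_PER_MS:
        raise ValueError("too many duplicates for a global nanosecond counter")
    offset = pd.to_timedelta(is_dup.cumsum(), unit="ns")
    # keep first occurrences untouched, shift only the repeats
    return dates.where(~is_dup, dates + offset)

# The question's ten timestamps.
dates = pd.Series(pd.to_datetime([
    "2015-01-01 00:00:00", "2015-01-01 00:00:01", "2015-01-01 00:00:01",
    "2015-01-01 00:00:01", "2015-01-01 00:00:02", "2015-01-01 00:00:04",
    "2015-01-01 00:00:04", "2015-01-01 00:00:06", "2015-01-01 00:00:07",
    "2015-01-01 00:00:07",
]))
result = nudge_duplicates(dates)
print(result.is_unique)  # True
```

Because the counter is global rather than per run, the offsets keep growing (1, 2, 3, 4 ns here), which matches the .000000003 and .000000004 values in the output above.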

Comparison:

import pandas as pd

df = pd.read_csv('test/EURUSD-2015-11.csv', header=None, parse_dates=[1], 
                 names =['eurusd','Date','a','b'], sep=",")

#sort values if not sorted
df = df.sort_values('Date')
print(df.head())

#print(df[df['Date'] == df['Date'].shift()])
#print(len(df[df['Date'] == df['Date'].shift()]))

df3 = df.copy()

def ori(df):
    df['Date']=df['Date']+(df['Date'].groupby((df['Date'] != df['Date'].shift())
                                     .cumsum()).cumcount()).values.astype('timedelta64[ns]')
    return df


def new(df):
    # parentheses are needed so the expression can continue on the next line
    df.loc[df['Date'] == df['Date'].shift(), 'Date'] = (
        df['Date'] +
        (df['Date'] == df['Date'].shift()).cumsum().astype('timedelta64[ns]'))

    return df

df1 = ori(df)
df2 = new(df3)


print(df1.head())
print(df2.head())

The timing is better:

In [81]: %timeit ori(df)
1 loops, best of 3: 2min 22s per loop
Compiler time: 0.10 s

In [82]: %timeit new(df)
1 loops, best of 3: 758 ms per loop
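These numbers depend on the pandas version and the machine, so it may be worth repeating the measurement. A sketch on synthetic data (randomly generated millisecond timestamps, not the EUR/USD file; the functions are non-mutating variants of ori and new that take a Series):

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic, sorted millisecond offsets with repeats (an assumption, not real ticks).
ms = np.sort(rng.integers(0, 50_000, size=20_000))
base = pd.Series(pd.Timestamp("2015-11-02") + pd.to_timedelta(ms, unit="ms"))

def ori(dates):
    # per-run counter via groupby + cumcount
    run_id = (dates != dates.shift()).cumsum()
    return dates + pd.to_timedelta(dates.groupby(run_id).cumcount(), unit="ns")

def new(dates):
    # global running counter, applied only to the repeated rows
    is_dup = dates == dates.shift()
    return dates.where(~is_dup, dates + pd.to_timedelta(is_dup.cumsum(), unit="ns"))

for func in (ori, new):
    seconds = timeit.timeit(lambda: func(base), number=5)
    print(func.__name__, round(seconds, 4), "s for 5 runs")
```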

Regarding python - Creating a DataFrame with a unique-valued DatetimeIndex by adding a timedelta, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/34575126/
