python-2.7 - Multi-indexing in pandas based on a column condition

I have a huge GPS data set in a CSV file.
It looks like this.

12,1999-09-08 12:12:12, 116.3426, 32.5678

12,1999-09-08 12:12:17, 116.34234, 32.5678

.
.
.

where the columns are, in order,
id, timestamp, longitude, latitude

I am using pandas to import the file into a DataFrame, and this is the code I have written so far.

import pandas as pd
import numpy as np
# Import the columns and use the timestamp values as the row index
df = pd.read_csv('/home/abc/Downloads/..../366.txt', delimiter=',',
                 names=['id', 'stamp', 'longitude', 'latitude'],
                 index_col='stamp')
#removes repeated entries due to gps errors. 
df = df.groupby(df.index).first()

Sometimes there are 2 or 3 entries for the same timestamp; the extra ones should be removed.

I get something like this

                       id  longitude  latitude
1999-09-08 12:12:12    12  116.3426   32.5678
1999-09-08 12:12:17    12  116.34234  32.5678
# and so on with redundant entries removed

Now I want rows that have the same latitude and longitude to share one sequential index value. That is, what I have in mind is

                      id  longitude  latitude
0 1999-09-08 12:12:12 12  116.3426    32.5678
1 1999-09-08 12:12:17 12  116.34234   32.5678
2 1999-09-08 12:12:22 12  116.342341  32.5678
  1999-09-08 12:12:27 12  116.342341  32.5678
  1999-09-08 12:12:32 12  116.342341  32.5678
  ....
  1999-09-08 12:19:37 12  116.342341  32.5678
3 1999-09-08 12:19:42 12  116.34234   32.56123
  and so on..

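That is, rows with the same latitude and longitude values should be indexed sequentially as one group. How can I achieve this? I am a beginner with pandas, so I don't know much about it. Please help!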

Best answer

You should make use of DataFrame.duplicated and do a bit of arithmetic with it:

idx = df.duplicated(['longitude', 'latitude'])
idx *= -1
idx += 1
idx.iloc[0] = 0
df = df.set_index(idx.cumsum(), append=True).swaplevel(0,1)
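
The snippet above assumes a df already loaded from the file. As a minimal self-contained sketch, the same steps can be tried on a few in-memory rows modelled on the question's sample. The variable names `raw`, `step`, `group` and `indexed` are illustrative and not part of the original answer, and the multiplication and addition are written as one expression rather than in place.

```python
import io

import pandas as pd

# Illustrative rows modelled on the sample in the question.
raw = io.StringIO(
    "12,1999-09-08 12:12:12,116.3426,32.5678\n"
    "12,1999-09-08 12:12:17,116.34234,32.5678\n"
    "12,1999-09-08 12:12:22,116.342341,32.5678\n"
    "12,1999-09-08 12:12:27,116.342341,32.5678\n"
    "12,1999-09-08 12:12:32,116.342341,32.5678\n"
    "12,1999-09-08 12:19:37,116.342341,32.5678\n"
    "12,1999-09-08 12:19:42,116.34234,32.56123\n"
)
df = pd.read_csv(
    raw,
    names=["id", "stamp", "longitude", "latitude"],
    index_col="stamp",
)

# 1 for the first occurrence of a coordinate pair, 0 for a repeat.
step = df.duplicated(["longitude", "latitude"]) * -1 + 1
step.iloc[0] = 0  # the counter should start at zero

# The running total only grows on rows that are not repeats.
group = step.cumsum()

# Append the counter as an index level, then move it to the outside.
indexed = df.set_index(group, append=True).swaplevel(0, 1)
print(indexed)
```

The outer index level of `indexed` holds the group counter and the inner level holds the timestamp.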

How the code works

Starting from the df you obtained:

In [215]: df
Out[215]: 
                     id   longitude  latitude
stamp                                        
1999-09-08T12:12:12  12  116.342600  32.56780
1999-09-08T12:12:17  12  116.342340  32.56780
1999-09-08T12:12:22  12  116.342341  32.56780
1999-09-08T12:12:27  12  116.342341  32.56780
1999-09-08T12:12:32  12  116.342341  32.56780
1999-09-08T12:19:37  12  116.342341  32.56780
1999-09-08T12:19:42  12  116.342340  32.56123

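First flag the duplicated (longitude, latitude) tuples. Note that DataFrame.duplicated marks every row whose pair has already appeared earlier in the frame, adjacent or not; in this sample those are exactly the consecutive repeats: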

In [216]: idx = df.duplicated(['longitude', 'latitude'])

In [217]: idx
Out[217]: 
stamp
1999-09-08T12:12:12    False
1999-09-08T12:12:17    False
1999-09-08T12:12:22    False
1999-09-08T12:12:27     True
1999-09-08T12:12:32     True
1999-09-08T12:19:37     True
1999-09-08T12:19:42    False

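Next we want to use cumsum to create a zero-based index that does not increment on duplicates. To prepare for that, do a bit of arithmetic to get zeros for the duplicated rows and ones for the others: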

In [218]: idx *= -1
In [219]: idx += 1


In [220]: idx
Out[220]: 
stamp
1999-09-08T12:12:12    1
1999-09-08T12:12:17    1
1999-09-08T12:12:22    1
1999-09-08T12:12:27    0
1999-09-08T12:12:32    0
1999-09-08T12:19:37    0
1999-09-08T12:19:42    1
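
The multiply-then-add step is simply a way of inverting the boolean flags and turning them into integers. As a small sketch on a hand-written flag series (the names `dup`, `arithmetic` and `negated` are illustrative), logical negation followed by a cast should give the same values:

```python
import pandas as pd

# Flags in the same pattern as the duplicated() output above.
dup = pd.Series([False, False, False, True, True, True, False])

# Multiply by -1, then add 1: True -> 0, False -> 1.
arithmetic = dup * -1 + 1

# Logical NOT, then cast the booleans to integers.
negated = (~dup).astype(int)

print(arithmetic.tolist())
print(negated.tolist())
```

Either form can feed cumsum; the negated form arguably states the intent more directly.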

Since we want a zero-based index, we set the first cell to 0, then append the cumulative sum to df's index to create a MultiIndex:

In [221]: idx.iloc[0] = 0
In [222]: df = df.set_index(idx.cumsum(), append=True)

With append=True, set_index adds the new index as a level below the existing one. We finish by swapping the levels of the timestamp and the appended index:

In [223]: df = df.swaplevel(0,1)

In [224]: df
Out[224]: 
                       id   longitude  latitude
  stamp                                        
0 1999-09-08T12:12:12  12  116.342600  32.56780
1 1999-09-08T12:12:17  12  116.342340  32.56780
2 1999-09-08T12:12:22  12  116.342341  32.56780
  1999-09-08T12:12:27  12  116.342341  32.56780
  1999-09-08T12:12:32  12  116.342341  32.56780
  1999-09-08T12:19:37  12  116.342341  32.56780
3 1999-09-08T12:19:42  12  116.342340  32.56123
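
One caveat: DataFrame.duplicated compares each row with all earlier rows, so a position that is revisited later in the track would not start a new group. If runs of strictly consecutive identical coordinates are wanted, one possible alternative (a different technique from the answer above, shown on hypothetical data) is to compare each row with the one directly before it using shift:

```python
import pandas as pd

# Hypothetical track: the last row returns to the first position.
df = pd.DataFrame(
    {
        "longitude": [116.3426, 116.3426, 116.34234, 116.3426],
        "latitude": [32.5678, 32.5678, 32.5678, 32.5678],
    },
    index=pd.to_datetime(
        [
            "1999-09-08 12:12:12",
            "1999-09-08 12:12:17",
            "1999-09-08 12:12:22",
            "1999-09-08 12:12:27",
        ]
    ).rename("stamp"),
)

coords = df[["longitude", "latitude"]]

# True wherever a row differs from the row directly above it.
changed = coords.ne(coords.shift()).any(axis=1)
changed.iloc[0] = False  # the first row opens run 0

# Each change starts a new run, so the running total labels the runs.
run_id = changed.cumsum().rename("run")

by_run = df.set_index(run_id, append=True).swaplevel(0, 1)
print(by_run.index.get_level_values("run").tolist())
```

Here the revisited position in the last row gets its own run label, whereas the duplicated-based counter would not increment for it.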

Regarding python-2.7 - multi-indexing in pandas based on a column condition, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/15462996/
