The sample CSV looks like this:
user_id lat lon
1 19.111841 72.910729
1 19.111342 72.908387
2 19.111542 72.907387
2 19.137815 72.914085
2 19.119677 72.905081
2 19.129677 72.905081
3 19.319677 72.905081
3 19.120217 72.907121
4 19.420217 72.807121
4 19.520217 73.307121
5 19.319677 72.905081
5 19.419677 72.805081
5 19.629677 72.705081
5 19.111860 72.911347
5 19.111860 72.931346
5 19.219677 72.605081
6 19.319677 72.805082
6 19.419677 72.905086
I know I can use the haversine formula for the distance calculation (Python also has a haversine package):
import math

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees).
    Source: http://gis.stackexchange.com/a/56589/15183
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(math.radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = math.sin(dlat/2)**2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon/2)**2
    c = 2 * math.asin(math.sqrt(a))
    km = 6371 * c  # 6371 km is the mean Earth radius
    return km
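As a quick sanity check, the function can be applied to the first two rows of user 1 from the sample data. This is only a sketch: the function body is repeated so that the snippet runs on its own, and the variable name `d` is made up for illustration. Note that this version takes longitude first, then latitude.

```python
import math

def haversine(lon1, lat1, lon2, lat2):
    """Great circle distance in km between two points given in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 6371 * 2 * math.asin(math.sqrt(a))

# First two rows of user 1; argument order is (lon, lat, lon, lat)
d = haversine(72.910729, 19.111841, 72.908387, 19.111342)
print(round(d, 3))  # about 0.252 km
```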
However, I only want to calculate distances between rows that share the same ID. So the expected answer looks like this:
user_id lat lon result
1 19.111841 72.910729 NaN
1 19.111342 72.908387 xx*
2 19.111542 72.907387 NaN
2 19.137815 72.914085 xx
2 19.119677 72.905081 xx
2 19.129677 72.905081 xx
3 19.319677 72.905081 NaN
3 19.120217 72.907121 xx
4 19.420217 72.807121 NaN
4 19.520217 73.307121 xx
5 19.319677 72.905081 NaN
5 19.419677 72.805081 xx
5 19.629677 72.705081 xx
5 19.111860 72.911347 xx
5 19.111860 72.931346 xx
5 19.219677 72.605081 xx
6 19.319677 72.805082 NaN
6 19.419677 72.905086 xx
*: xx is the distance in kilometers.
How can I do this?
Best answer
Try this approach:
import pandas as pd
import numpy as np

# parse CSV to DataFrame. You may want to specify the separator (`sep='...'`)
df = pd.read_csv('/path/to/file.csv')

# vectorized haversine function
def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    """
    slightly modified version of http://stackoverflow.com/a/29546836/2901002
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees or in radians)
    All (lat, lon) coordinates must have numeric dtypes and be of equal length.
    """
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = np.sin((lat2-lat1)/2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2
    return earth_radius * 2 * np.arcsin(np.sqrt(a))
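The `read_csv` call leaves the separator open. As displayed in the question, the columns appear to be separated by whitespace, so loading the data might look like the sketch below. It assumes the file really is whitespace-delimited, uses an in-memory string (`sample`) in place of the real file path, and renames `user_id` to `id` only to match the column name used in this answer.

```python
import io
import pandas as pd

# A few rows from the question, standing in for the real file
sample = """user_id lat lon
1 19.111841 72.910729
1 19.111342 72.908387
2 19.111542 72.907387
2 19.137815 72.914085
"""

# sep=r"\s+" splits on any run of whitespace (assumption about the file format)
df = pd.read_csv(io.StringIO(sample), sep=r"\s+")

# The answer's code refers to the grouping column as `id`
df = df.rename(columns={"user_id": "id"})
print(df.dtypes)
```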
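Now we can calculate the distance between each coordinate and the previous one within the same id (group). Note that the code below uses a column named `id`, whereas the header in the question is `user_id`: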
df['dist'] = \
    np.concatenate(df.groupby('id')
                     .apply(lambda x: haversine(x['lat'], x['lon'],
                                                x['lat'].shift(), x['lon'].shift())).values)
Result:
In [105]: df
Out[105]:
id lat lon dist
0 1 19.111841 72.910729 NaN
1 1 19.111342 72.908387 0.252243
2 2 19.111542 72.907387 NaN
3 2 19.137815 72.914085 3.004976
4 2 19.119677 72.905081 2.227658
5 2 19.129677 72.905081 1.111949
6 3 19.319677 72.905081 NaN
7 3 19.120217 72.907121 22.179974
8 4 19.420217 72.807121 NaN
9 4 19.520217 73.307121 53.584504
10 5 19.319677 72.905081 NaN
11 5 19.419677 72.805081 15.286775
12 5 19.629677 72.705081 25.594890
13 5 19.111860 72.911347 61.509917
14 5 19.111860 72.931346 2.101215
15 5 19.219677 72.605081 36.304756
16 6 19.319677 72.805082 NaN
17 6 19.419677 72.905086 15.287063
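One caveat: the `np.concatenate` step writes the per-group results back in group order, so it lines up with the rows only when the DataFrame is already sorted by `id`, as it is in the sample. A possible alternative, sketched here rather than taken from the original answer, is to shift the coordinates per group with `groupby(...).shift()`, which keeps the original index. The haversine helper is a simplified copy that always converts from degrees, and the rows are deliberately interleaved to show that ordering does not matter.

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2, earth_radius=6371):
    """Vectorized great circle distance in km; inputs in decimal degrees."""
    lat1, lon1, lat2, lon2 = (np.radians(np.asarray(v, dtype=float))
                              for v in (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2.0) ** 2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2
    return earth_radius * 2 * np.arcsin(np.sqrt(a))

# Rows of ids 1 and 2 from the sample, interleaved on purpose
df = pd.DataFrame({
    "id":  [1, 2, 1, 2],
    "lat": [19.111841, 19.111542, 19.111342, 19.137815],
    "lon": [72.910729, 72.907387, 72.908387, 72.914085],
})

# Previous point within each id; the first row of every group becomes NaN
prev = df.groupby("id")[["lat", "lon"]].shift()

df["dist"] = haversine(df["lat"], df["lon"], prev["lat"], prev["lon"])
print(df)
```

The first row of each group has no predecessor, so its distance is NaN, matching the expected output in the question.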
Source: Stack Overflow question "Pandas: calculate haversine distance within each group of rows", https://stackoverflow.com/questions/43577086/