I'm working on a university project and have two pandas DataFrames:
# Libraries
import pandas as pd
from geopy import distance
# DataFrames
df1 = pd.DataFrame({'id': [1, 2, 3],
                    'lat': [-23.48, -22.94, -23.22],
                    'long': [-46.36, -45.40, -45.80]})
df2 = pd.DataFrame({'id': [100, 200, 300],
                    'lat': [-28.48, -22.94, -23.22],
                    'long': [-46.36, -46.40, -45.80]})
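I need to compute the distances between the latitude/longitude coordinates of the two DataFrames, so I used geopy. If the distance between any pair of coordinates is below a threshold of 100 meters, I must assign the value 1 in the 'nearby' column. I wrote the following code: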
threshold = 100  # meters
df1['nearby'] = 0
for i in range(0, len(df1)):
    for j in range(0, len(df2)):
        coord_geo_1 = (df1['lat'].iloc[i], df1['long'].iloc[i])
        coord_geo_2 = (df2['lat'].iloc[j], df2['long'].iloc[j])
        var_distance = (distance.distance(coord_geo_1, coord_geo_2).km) * 1000
        if(var_distance < threshold):
            df1['nearby'].iloc[i] = 1
The code works, although it raises a warning. However, I would like to find a way to replace the for loops. Is that possible?
# Output:
id lat long nearby
1 -23.48 -46.36 0
2 -22.94 -45.40 0
3 -23.22 -45.80 1
Best answer
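If you can use the scikit-learn library, its haversine_distances function computes the pairwise great-circle distances between two sets of coordinates. So you get: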
import numpy as np
from sklearn.metrics.pairwise import haversine_distances

# threshold in meters; change as needed
threshold = 100  # meters
# mean Earth radius, used to convert radians to meters
earth_radius = 6371000  # meters

df1['nearby'] = (
    # pairwise distances between all points of the two DataFrames
    haversine_distances(
        # inputs must be [lat, long] in radians, hence * np.pi / 180
        X=df1[['lat', 'long']].to_numpy() * np.pi / 180,
        Y=df2[['lat', 'long']].to_numpy() * np.pi / 180)
    # convert the distances to meters
    * earth_radius
    # compare with the threshold
    < threshold
    # is any point of df2 near each point of df1?
).any(axis=1).astype(int)

print(df1)
#    id    lat   long  nearby
# 0   1 -23.48 -46.36       0
# 1   2 -22.94 -45.40       0
# 2   3 -23.22 -45.80       1
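To make it clearer what quantity is being compared with the threshold, here is a rough plain-Python sketch of the haversine formula. It is only an illustration, not scikit-learn's implementation: the helper names haversine_m and nearby_flags are made up for this example, and the 6371000 m radius is the same spherical approximation as above. Note that geopy's default distance.distance is a geodesic distance on an ellipsoid, so the two results can differ by a fraction of a percent, which should only matter for pairs right at the threshold.

```python
import math

EARTH_RADIUS_M = 6371000  # mean Earth radius in meters (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearby_flags(points_a, points_b, threshold_m=100):
    """For each point in points_a, 1 if any point of points_b is closer than threshold_m, else 0."""
    return [
        int(any(haversine_m(lat_a, lon_a, lat_b, lon_b) < threshold_m
                for lat_b, lon_b in points_b))
        for lat_a, lon_a in points_a
    ]


# Same coordinates as df1 and df2 in the question, as (lat, long) tuples
points_df1 = [(-23.48, -46.36), (-22.94, -45.40), (-23.22, -45.80)]
points_df2 = [(-28.48, -46.36), (-22.94, -46.40), (-23.22, -45.80)]

flags = nearby_flags(points_df1, points_df2)
print(flags)
```

This still visits every pair in pure Python, so it is meant for checking the formula rather than for speed; the vectorized call above does the same comparison on whole arrays at once.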
EDIT: the OP asked for a version that uses geopy's distance, so here is one way.
import numpy as np  # if not already imported
df1['nearby'] = (np.array(
    [[distance.distance(coord1, coord2).km
      for coord2 in df2[['lat', 'long']].to_numpy()]
     for coord1 in df1[['lat', 'long']].to_numpy()]
    ) * 1000 < threshold
).any(axis=1).astype(int)
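As for the warning mentioned in the question: it is not quoted there, so this is a guess, but it is most likely pandas' SettingWithCopyWarning caused by the chained assignment df1['nearby'].iloc[i] = 1. Both versions above avoid it because they assign the whole column at once. If you do keep a loop, a minimal sketch of a single-step assignment could look like this (the row position i is a hypothetical stand-in for a match found by the distance check):

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3],
                    'lat': [-23.48, -22.94, -23.22],
                    'long': [-46.36, -45.40, -45.80]})
df1['nearby'] = 0

# Hypothetical: positional index of a row found to be within the threshold
i = 2

# One indexing call on the DataFrame itself (row position, column position),
# instead of the chained df1['nearby'].iloc[i] = 1
df1.iloc[i, df1.columns.get_loc('nearby')] = 1

print(df1)
```

The point of the design is that the write goes through a single indexer on df1, so pandas never has to decide whether an intermediate object is a view or a copy.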
Source: a similar question on Stack Overflow, "pandas - How to get the distance between two geographic coordinates of two different dataframes?": https://stackoverflow.com/questions/70941094/