python - Selecting rows from a Pandas dataframe using a numpy 2D array on multiple columns

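Data

I have a dataframe with 5 columns:

origin latitude and longitude (origin_lat, origin_lng)
destination latitude and longitude (dest_lat, dest_lng)
a score computed from the other fields

I also have a matrix M containing pairs of origin and destination latitude/longitude. Some of these pairs exist in the dataframe; others do not.

Goal

My goal is twofold:

Select all pairs from M that are not present in the first four columns of the dataframe, apply the function func to them (to compute the score column), and append the results to the existing dataframe. Note: the score should not be recomputed for rows that already exist.
After adding the missing rows, select from the new dataframe dfs all the rows defined by the selection matrix M.

Example code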

import numpy as np
import pandas as pd

# STEP 1: Generate example data
ctr_lat = 40.676762
ctr_lng = -73.926420
N = 12
N2 = 3

data = np.array([ctr_lat+np.random.random((N))/10,
                 ctr_lng+np.random.random((N))/10,
                 ctr_lat+np.random.random((N))/10,
                 ctr_lng+np.random.random((N))/10]).transpose()

# Example function - does not matter what it does
def func(x):
    return np.random.random()

# Create dataframe
geocols = ['origin_lat','origin_lng','dest_lat','dest_lng']
df = pd.DataFrame(data,columns=geocols)
df['score'] = df.apply(func,axis=1)

This gives me a dataframe df like this:

    origin_lat  origin_lng   dest_lat   dest_lng     score
0    40.684887  -73.924921  40.758641 -73.847438  0.820080
1    40.703129  -73.885330  40.774341 -73.881671  0.104320
2    40.761998  -73.898955  40.767681 -73.865001  0.564296
3    40.736863  -73.859832  40.681693 -73.907879  0.605974
4    40.761298  -73.853480  40.696195 -73.846205  0.779520
5    40.712225  -73.892623  40.722372 -73.868877  0.628447
6    40.683086  -73.846077  40.730014 -73.900831  0.320041
7    40.726003  -73.909059  40.760083 -73.829180  0.903317
8    40.748258  -73.839682  40.713100 -73.834253  0.457138
9    40.761590  -73.923624  40.746552 -73.870352  0.867617
10   40.748064  -73.913599  40.746997 -73.894851  0.836674
11   40.771164  -73.855319  40.703426 -73.829990  0.010908

I can then artificially create the selection matrix M, containing 3 rows that exist in the dataframe and 3 that do not.

# STEP 2: Generate data to select
# As an example, I select 3 rows that are part of the dataframe, and 3 that are not
data2 = np.array([ctr_lat+np.random.random((N2))/10,
                  ctr_lng+np.random.random((N2))/10,
                  ctr_lat+np.random.random((N2))/10,
                  ctr_lng+np.random.random((N2))/10]).transpose()

M = np.concatenate((data[4:7,:],data2))

The matrix M looks like this:

array([[ 40.7612977 , -73.85348031,  40.69619549, -73.84620489],
       [ 40.71222463, -73.8926234 ,  40.72237185, -73.86887696],
       [ 40.68308567, -73.84607722,  40.73001434, -73.90083107],
       [ 40.7588412 , -73.87128079,  40.76750639, -73.91945371],
       [ 40.74686156, -73.84804047,  40.72378653, -73.92207075],
       [ 40.6922673 , -73.88275402,  40.69708748, -73.87905543]])

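From here, I don't know how to find which rows of M are absent from df and add them. I also don't know how to select all the rows of df that are in M.

Idea

My idea is to identify the missing rows, append them to df with a nan score, and then recompute the score for the nan rows only. However, I don't know how to select those rows efficiently without looping over every element of the matrix M.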
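
For illustration, the idea above can be sketched without a Python-level loop by comparing M against the coordinate columns with NumPy broadcasting. This is only a minimal sketch: the small df and M and the constant-returning func below are hypothetical stand-ins for the real data, the comparison is exact floating-point equality, and the intermediate boolean array has len(M) x len(df) x 4 entries, so it suits moderate sizes only.

```python
import numpy as np
import pandas as pd

geocols = ['origin_lat', 'origin_lng', 'dest_lat', 'dest_lng']

# Hypothetical stand-ins for the real df and M
df = pd.DataFrame(
    [[40.70, -73.90, 40.75, -73.85, 0.5],
     [40.71, -73.91, 40.76, -73.86, 0.6]],
    columns=geocols + ['score'])
M = np.array([[40.71, -73.91, 40.76, -73.86],   # already in df
              [40.72, -73.92, 40.77, -73.87]])  # not in df

def func(row):
    # Hypothetical placeholder for the real scoring function
    return 1.0

existing = df[geocols].to_numpy()
# present[i] is True when row i of M equals some row of df (exact float match)
present = (M[:, None, :] == existing[None, :, :]).all(axis=2).any(axis=1)

# Append the missing rows with a nan score
missing = pd.DataFrame(M[~present], columns=geocols)
missing['score'] = np.nan
df = pd.concat([df, missing], ignore_index=True)

# Recompute the score for the nan rows only
nan_rows = df['score'].isna()
df.loc[nan_rows, 'score'] = df.loc[nan_rows].apply(func, axis=1)
```

The existing scores (0.5 and 0.6 here) are left untouched; only the appended row is passed to func.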

Any suggestions? Thank you very much for your help!

Best answer

Is there any reason not to use merge?

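# Turn M into a dataframe with the same coordinate columns
df2 = pd.DataFrame(M, columns=geocols)
# Outer merge on the shared columns: rows found only in M get a nan score
df = df.merge(df2, how='outer')
# Score only the rows that have no score yet
ix = df.score.isnull()
df.loc[ix, 'score'] = df.loc[ix].apply(func, axis=1)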

It does exactly what you suggested: it adds the missing rows to df with nan scores, identifies the nans, and computes the scores for those rows.
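
The second goal, selecting the rows defined by M, is not covered by the snippet above. One possible way, sketched here as an assumption rather than as part of the original answer, is a second merge with M on the left, which keeps M's row order. The sketch also uses merge's indicator=True option to mark the rows that came only from M, as an alternative to testing the score for nan. The small df and M and the constant func are hypothetical stand-ins. Note that merging on float columns matches only exactly equal values, and duplicate rows in df or M would be multiplied by the merge.

```python
import numpy as np
import pandas as pd

geocols = ['origin_lat', 'origin_lng', 'dest_lat', 'dest_lng']

# Hypothetical stand-ins for the real df and M
df = pd.DataFrame(
    [[40.70, -73.90, 40.75, -73.85, 0.5],
     [40.71, -73.91, 40.76, -73.86, 0.6]],
    columns=geocols + ['score'])
M = np.array([[40.71, -73.91, 40.76, -73.86],   # already in df
              [40.72, -73.92, 40.77, -73.87]])  # not in df

def func(row):
    # Hypothetical placeholder for the real scoring function
    return 1.0

df2 = pd.DataFrame(M, columns=geocols)

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'
merged = df.merge(df2, how='outer', on=geocols, indicator=True)
new_rows = merged['_merge'] == 'right_only'
merged.loc[new_rows, 'score'] = merged.loc[new_rows].apply(func, axis=1)
merged = merged.drop(columns='_merge')

# Goal 2: pick the rows defined by M, in the order they appear in M
selected = df2.merge(merged, how='left', on=geocols)
```

Here selected has one row per row of M, each with its score, while merged holds the full dataframe including the newly scored rows.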

A similar question about python - Selecting rows from a Pandas dataframe using a numpy 2D array on multiple columns can be found on Stack Overflow: https://stackoverflow.com/questions/46161097/
