python - Why does Python code take so long to find close points in a dataset?

I am trying to write efficient Python code that, given a set of data points of the form

a1 a2 a3 ... an (point 1)
b1 b2 b3 ... bn (point 2)
         .
         .
         .

finds which of them are too close to each other (within a threshold).
I already have the following Fortran routine that solves this problem:

program check_closeness                                           
    implicit none
    integer, parameter :: npoints = 10000, ndis = 20
    real(8), parameter :: r = 1e0, maxv = 3e0, minv = 0e0
    real(8), dimension(npoints,ndis) :: dis
    real(8) :: start, ende

    call RANDOM_NUMBER(dis)
    dis = (maxv - minv) * dis + minv

    call cpu_time(start)
    call remove_close(npoints, ndis,dis, r)
    call cpu_time(ende)
    write(*,*) 'Time elapsed', ende - start
endprogram

subroutine remove_close(npoints, ndis, points, r)
    implicit none
    integer, intent(in) :: npoints, ndis
    integer :: i, i_check
    real(8), dimension(npoints, ndis), intent(in) :: points
    real(8), intent(in) :: r
    logical :: is_close

    do i=1,npoints
       is_close = .FALSE.
       i_check = i

       do while (.not. is_close .and. i_check < npoints)
          i_check = i_check + 1
          is_close = all(abs(points(i,:) - points(i_check,:)) < r)
       enddo
    enddo
end subroutine

Running this on my computer takes: Time elapsed 1.0651889999999999 (seconds).
Now, I have written exactly the same code in Python (or so I think):

 import numpy as np                                       
 import time
 
 def check_close(a, r):
     npoints = a.shape[0]
     for i, vec1 in enumerate(a):
         is_close = False
         icheck = i
 
         while (not is_close and icheck < npoints - 1):
             icheck +=1
             vec2 = a[icheck,:]
             is_close = all(np.abs(vec1 - vec2) < r)
 
 maxv, minv = 3, 0
 a = np.random.rand(10000, 20)
 a = (maxv - minv) * a + minv
 r = 1e0
 
 ini = time.time()
 check_close(a, r)
 fin = time.time()
 print('Time elapsed {}'.format(fin - ini))

This run takes Time elapsed 102.60617995262146 (seconds), which is much slower than Fortran.
I tried another Python version of this routine, which is much faster but still not close to the Fortran version:

import numpy as np                        
import time

def check_close(a, r):
    for i, vec1 in enumerate(a[:-1]):
        d = np.abs(a[i+1:,:] - vec1)
        is_close = np.any(np.all(d < r, axis=1))

maxv, minv = 3, 0
a = np.random.rand(10000, 20)
a = (maxv - minv) * a + minv
r = 1e0

ini = time.time()
check_close(a, r)
fin = time.time()
print('Time elapsed {}'.format(fin - ini))

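In this case, the run takes Time elapsed 3.4987785816192627 (seconds). From this, I assume the improvement comes from vectorizing the implementation and removing the while loop. On the other hand, this implementation does not benefit from stopping the search as soon as a close point is found.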
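The trade-off described above (vectorized comparison versus early exit) can be combined by scanning the later points in chunks and stopping at the first chunk that contains a close point. The following is only a sketch: the function name `has_close_later_point` and the `chunk` parameter are made up for illustration, and no timing is claimed for it.

```python
import numpy as np

def has_close_later_point(a, i, r, chunk=256):
    """Return True if some row j > i differs from row i by less than r
    in every coordinate (the same test as in the question)."""
    vec1 = a[i]
    # Scan the later rows in chunks; each chunk is compared in one NumPy call.
    for start in range(i + 1, a.shape[0], chunk):
        block = a[start:start + chunk]
        if np.any(np.all(np.abs(block - vec1) < r, axis=1)):
            return True  # early exit: no need to look at further chunks
    return False

# Small usage example with three 2-D points.
pts = np.array([[0.0, 0.0], [5.0, 5.0], [0.5, 0.5]])
print(has_close_later_point(pts, 0, 1.0))  # True (row 2 is within r of row 0)
```

Whether the early exit pays off depends on how often close points actually occur in the data, so it is worth measuring on the real dataset.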
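My questions are:

What is taking so much time in the Python implementation?

Is there any way to rewrite the Python code so that it is (almost) as fast as the Fortran code?

[Edit]
I was asked to use time.perf_counter() to measure the execution time of the Python code.
I now use: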

ini = time.perf_counter()
check_close(a, r)
fin = time.perf_counter()
print('Time elapsed {}'.format(fin - ini))

to measure them, and the updated times are:
Python implementation 1: Time elapsed 98.63728801719844 (seconds)
Python implementation 2: Time elapsed 3.3923211600631475 (seconds)

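Best answer

NumPy exists precisely because plain Python is very slow for numerical loops. As you have shown, the Fortran code is much faster even though the algorithm is essentially the same.
So, to answer your questions:

Python is a scripting language and was not built around speed. In your first version, every pair of points costs one iteration of an interpreted loop plus several small NumPy calls, and that per-iteration overhead is most likely what dominates. If you need speed, use libraries such as NumPy, as you are already doing, or move the hot parts of your program into native C/C++ code.

To get it as fast as Fortran, try to find a way to use only NumPy routines and get rid of the for loops. That will greatly improve performance.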

Try to think about how to solve your problems without loops; that is the key to unlocking speed in Python.
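As an illustration of reducing the number of interpreted iterations, here is a minimal sketch that compares a block of rows against all later rows in one broadcasted operation. The name `close_flags_blocked` and the `block` parameter are assumptions made for this example, it has not been benchmarked against the Fortran routine, and a fully loop-free version is avoided on purpose because an all-pairs array for 10000 points would need a lot of memory.

```python
import numpy as np

def close_flags_blocked(a, r, block=16):
    """flags[i] is True if some later row j > i differs from row i
    by less than r in every coordinate."""
    n = a.shape[0]
    flags = np.zeros(n, dtype=bool)
    for i0 in range(0, n - 1, block):
        i1 = min(i0 + block, n)
        rows = a[i0:i1]       # shape (b, d): the rows handled in this step
        rest = a[i0 + 1:]     # shape (m, d): candidate rows j >= i0 + 1
        # Broadcast to (b, m, d), then reduce over coordinates to (b, m).
        close = np.all(np.abs(rows[:, None, :] - rest[None, :, :]) < r, axis=2)
        # Keep only pairs with j > i, so a row is never compared with
        # itself or with an earlier row.
        i_idx = np.arange(i0, i1)[:, None]
        j_idx = np.arange(i0 + 1, n)[None, :]
        close &= j_idx > i_idx
        flags[i0:i1] = close.any(axis=1)
    return flags

rng = np.random.default_rng(0)
a = 3.0 * rng.random((200, 5))
flags = close_flags_blocked(a, 1.0)
print(flags.sum(), "points have a later neighbour within r")
```

A larger `block` means fewer Python-level iterations but a bigger temporary array of shape (block, m, d), so the value is a memory/speed trade-off to tune by measurement.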
By the way, I am not sure what you intend with np.abs(), but it does not compute the Euclidean distance between two points. Use np.linalg.norm() for that.
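The difference between the two criteria can be seen on a single pair of points. This small example uses made-up values `p`, `q` and `r`; note that the Fortran routine in the question applies the same per-coordinate test, so that criterion may well be the intended one.

```python
import numpy as np

p = np.array([0.0, 0.0, 0.0])
q = np.array([0.9, 0.9, 0.9])
r = 1.0

# Test used in the question: every coordinate differs by less than r.
# This is the same as requiring the max-norm (Chebyshev) distance to be below r.
close_per_coordinate = bool(np.all(np.abs(p - q) < r))

# Euclidean test: the straight-line distance must be below r.
close_euclidean = bool(np.linalg.norm(p - q) < r)

print(close_per_coordinate, close_euclidean)  # True False
```

Here each coordinate differs by 0.9, which passes the per-coordinate test, while the Euclidean distance is 0.9 * sqrt(3), roughly 1.56, which does not.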

Regarding "python - Why does Python code take so long to find close points in a dataset?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/66258485/
