Python: selecting n well-distributed points from a set of points

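I have a numpy array of points in the XY plane, for example: [image: distribution]

From all of these points I want to select n points (say, 100) that are better distributed. That is, I want the density of points to be constant everywhere.

Like this:

[image: desired result]

Is there a pythonic way, or a numpy/scipy function, to do this?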

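Best answer

@EMS is quite right that you should think carefully about exactly what you want.

There are more sophisticated ways to do this (EMS's suggestions are very good!), but a brute-force approach is to bin the points onto a regular rectangular grid and draw one random point from each bin.

The main drawback is that you won't get exactly the number of points you ask for. Instead, you'll get somewhat fewer.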
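To see how far short the grid approach falls for a given request, you can count the occupied cells before sampling. This is only a rough sketch: the seed, the sample sizes, and the variable name `n_occupied` are illustrative assumptions, and `np.histogram2d` is used here just as a convenient way to count points per cell.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed, only for repeatability
x, y = rng.normal(0, 1, (2, 100_000))

subset_num = 1000                        # number of points we would like
nbins = int(np.sqrt(subset_num))         # 31, so the grid has 31 * 31 = 961 cells

# Count how many points land in each cell of an nbins-by-nbins grid.
counts, _, _ = np.histogram2d(x, y, bins=nbins)

# One point is drawn per occupied cell, so this is the size of the subset.
n_occupied = int(np.count_nonzero(counts))
print(nbins ** 2, n_occupied)
```

The subset can never be larger than `nbins ** 2`, and it shrinks further whenever some cells contain no points.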

Some creative indexing with pandas makes this "gridding" approach very easy, though you can certainly do it with "pure" numpy as well.

As an example of the simplest possible brute-force grid approach (there's a lot we could do better here):

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

total_num = 100000
x, y = np.random.normal(0, 1, (2, total_num))

# We'll always get fewer than this number for two reasons.
# 1) We're choosing a square grid, and "subset_num" may not be a perfect square
# 2) There won't be data in every cell of the grid
subset_num = 1000

# Bin points onto a rectangular grid with approximately "subset_num" cells
nbins = int(np.sqrt(subset_num))
xbins = np.linspace(x.min(), x.max(), nbins+1)
ybins = np.linspace(y.min(), y.max(), nbins+1)

# Make a dataframe indexed by the grid coordinates.
i, j = np.digitize(y, ybins), np.digitize(x, xbins)
df = pd.DataFrame(dict(x=x, y=y), index=[i, j])

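# Group by which cell the points fall into and choose a random point from each.
# Sampling whole rows keeps each x paired with its own y; aggregating the
# columns separately would shuffle x and y independently within a cell.
# (GroupBy.sample requires pandas >= 1.1.)
new = df.groupby(level=[0, 1]).sample(n=1)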

# Plot the results
fig, axes = plt.subplots(ncols=2, sharex=True, sharey=True)
axes[0].plot(x, y, 'k.')
axes[0].set_title('Original $(n={})$'.format(total_num))
axes[1].plot(new.x, new.y, 'k.')
axes[1].set_title('Subset $(n={})$'.format(len(new)))
# 'box-forced' was removed from matplotlib; 'box' now works with shared axes.
plt.setp(axes, aspect=1, adjustable='box')
fig.tight_layout()
plt.show()
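The answer notes that the same gridding can be done in "pure" numpy. One possible sketch is below. The function name `grid_subsample` and its parameters are hypothetical, not part of the original answer. It shuffles the points and keeps the first point seen in each cell, which amounts to picking one point uniformly at random per occupied cell.

```python
import numpy as np

def grid_subsample(x, y, nbins, rng=None):
    """Return indices of one randomly chosen point per occupied grid cell."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x)
    y = np.asarray(y)

    # Cell index along each axis, clipped so the maximum lands in the last cell.
    xbins = np.linspace(x.min(), x.max(), nbins + 1)
    ybins = np.linspace(y.min(), y.max(), nbins + 1)
    ix = np.clip(np.digitize(x, xbins) - 1, 0, nbins - 1)
    iy = np.clip(np.digitize(y, ybins) - 1, 0, nbins - 1)

    # Flatten the 2D cell coordinates into a single integer id per point.
    cell = iy * nbins + ix

    # Visit the points in random order and keep the first one seen in each cell.
    order = rng.permutation(len(x))
    _, first = np.unique(cell[order], return_index=True)
    return order[first]

rng = np.random.default_rng(0)
x, y = rng.normal(0, 1, (2, 100_000))
idx = grid_subsample(x, y, nbins=31, rng=rng)
x_sub, y_sub = x[idx], y[idx]
```

Because the function returns indices rather than coordinates, each selected `(x, y)` pair is guaranteed to be one of the original points.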


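Loosely based on @EMS's suggestion in a comment, here is another approach.

We'll use a kernel density estimate to compute the density of points, then use its inverse as the probability that a given point is chosen.

scipy.stats.gaussian_kde is not optimized for this use case (or for large numbers of points in general), and it is the bottleneck here. A more optimized version for this specific use case could be written in several ways (approximations, special-casing pairwise distances, etc.), but that is beyond the scope of this particular question. Just be aware that for this example with 1e5 points, it takes a minute or two to run.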
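As one example of the kind of approximation mentioned above, the density at each point can be replaced by the number of points sharing its histogram cell. This is a swapped-in technique, not a kernel density estimate and not what the answer's code does. The helper name `histogram_density` and the choice of 50 bins are assumptions made for illustration.

```python
import numpy as np

def histogram_density(x, y, bins=50):
    """Approximate the density at each point by the count of its histogram cell."""
    counts, xedges, yedges = np.histogram2d(x, y, bins=bins)
    # Find the cell of each point; clip so the maximum value stays in the last cell.
    ix = np.clip(np.searchsorted(xedges, x, side='right') - 1, 0, counts.shape[0] - 1)
    iy = np.clip(np.searchsorted(yedges, y, side='right') - 1, 0, counts.shape[1] - 1)
    return counts[ix, iy]

rng = np.random.default_rng(0)
x, y = rng.normal(0, 1, (2, 100_000))

dens = histogram_density(x, y)

# Inverse density as the selection probability, normalized to sum to 1.
weight = 1.0 / dens
weight /= weight.sum()

# Draw distinct point indices with those probabilities.
idx = rng.choice(len(x), size=1000, replace=False, p=weight)
x_sub, y_sub = x[idx], y[idx]
```

This runs in a fraction of a second for 1e5 points, at the cost of a much coarser density estimate whose quality depends on the bin count.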

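The advantage of this method is that you get exactly the number of points you asked for. The disadvantage is that you're likely to have local clusters of selected points.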

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

total_num = 100000
subset_num = 1000
x, y = np.random.normal(0, 1, (2, total_num))

# Let's approximate the PDF of the point distribution with a kernel density
# estimate. scipy.stats.gaussian_kde is slow for large numbers of points, so
# you might want to use another implementation in some cases.
xy = np.vstack([x, y])
dens = gaussian_kde(xy)(xy)

# Try playing around with this weight. Compare 1/dens,  1-dens, and (1-dens)**2
weight = 1 / dens
weight /= weight.sum()

# Draw a sample using np.random.choice with the specified probabilities.
# We'll need to view things as a structured array because np.random.choice
# expects a 1D array. replace=False avoids picking the same point twice.
dat = xy.T.ravel().view([('x', float), ('y', float)])
subset = np.random.choice(dat, subset_num, replace=False, p=weight)

# Plot the results
fig, axes = plt.subplots(ncols=2, sharex=True, sharey=True)
axes[0].scatter(x, y, c=dens, edgecolor='none')
axes[0].set_title('Original $(n={})$'.format(total_num))
axes[1].plot(subset['x'], subset['y'], 'k.')
axes[1].set_title('Subset $(n={})$'.format(len(subset)))
# 'box-forced' was removed from matplotlib; 'box' now works with shared axes.
plt.setp(axes, aspect=1, adjustable='box')
fig.tight_layout()
plt.show()


For more on selecting n well-distributed points from a set of points in Python, see the similar question on Stack Overflow: https://stackoverflow.com/questions/21409664/
