python - Measuring the structure of xy points

Tags: python pandas scipy

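I am trying to measure the overall structure of a set of xy points that represent a recurring particle formation. I would like to use a pairwise approach that determines the structure from each point's position relative to its neighboring points, rather than averaging the raw Cartesian coordinates.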

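To do this, I want to compute the vectors between each point and its neighboring points at every timestamp. Averaging these vectors for each pair of points should then give the overall structure.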
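As a rough illustration of the first step only (the raw vectors between every pair of points in one timestamp), a minimal sketch using NumPy broadcasting might look like this. The helper name `pairwise_vectors` and the three-point toy data are assumptions made for the example, not part of the original code.

```python
import numpy as np
import pandas as pd


def pairwise_vectors(group: pd.DataFrame) -> np.ndarray:
    """Return an (n, n, 2) array; entry [i, j] is the vector from point i to point j."""
    xy = group[["X", "Y"]].to_numpy()
    # Broadcasting: xy[None, :, :] varies along j, xy[:, None, :] varies along i
    return xy[None, :, :] - xy[:, None, :]


# Toy data (hypothetical): the same triangle, shifted right by 2 in the second frame
toy = pd.DataFrame({
    "Time": [1, 1, 1, 2, 2, 2],
    "id": ["A", "B", "C", "A", "B", "C"],
    "X": [0.0, 1.0, 0.0, 2.0, 3.0, 2.0],
    "Y": [0.0, 0.0, 2.0, 0.0, 0.0, 2.0],
})

# One (n, n, 2) array of vectors per timestamp
vectors = {t: pairwise_vectors(g) for t, g in toy.groupby("Time")}
```

Because the rows are indexed by position within each group, averaging these arrays across timestamps only makes sense when the row order means the same thing in every frame.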

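Note: the structure will not be identified correctly if the vectors are hard-coded between specific points. If points swap positions, or a point is replaced by a different one while the structure stays the same, the end result will be inaccurate. I want the function to determine the overall structure from neighboring points alone.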

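The final structure should therefore follow a pairwise approach, where the final spatial distribution is found by: 1) setting the centroid of the structure to the position of the point in the densest part of the structure, as determined by the average distance to the third-nearest neighbor; 2) identifying the relative position of that point's nearest neighbor, then the relative position of that neighbor's nearest neighbor, and so on, until the positions of all points have been determined.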
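The two steps above can be sketched for a single frame as follows. This is only one possible reading: it uses the distance to the third-nearest neighbor within one frame (not averaged over time), and the function names `densest_point_index` and `nearest_neighbor_chain` as well as the sample coordinates are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import cdist


def densest_point_index(xy: np.ndarray, k: int = 3) -> int:
    """Index of the point with the smallest distance to its k-th nearest neighbor."""
    distances = cdist(xy, xy)
    # After sorting each row, column 0 is the point itself (distance 0),
    # so column k holds the distance to the k-th nearest neighbor.
    kth = np.sort(distances, axis=1)[:, k]
    return int(np.argmin(kth))


def nearest_neighbor_chain(xy: np.ndarray, start: int) -> list:
    """Visit every point once, always stepping to the nearest unvisited point."""
    distances = cdist(xy, xy)
    order = [start]
    remaining = [i for i in range(len(xy)) if i != start]
    while remaining:
        current = order[-1]
        # Ties are broken by the lowest index, which is an arbitrary choice
        nxt = min(remaining, key=lambda j: distances[current, j])
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Hypothetical cross-shaped frame: four outer points around a center point
xy = np.array([[1.0, 1.0], [3.0, 1.0], [2.0, 0.0], [2.0, 1.0], [2.0, 2.0]])
start = densest_point_index(xy)
order = nearest_neighbor_chain(xy, start)
```

Neither function looks at point ids, so swapping labels between frames does not change the result; ties between equally distant neighbors still need an explicit rule.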

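I generate two example dfs below. Using the first one (df below), frame 1 shows the vectors between the points at the first timestamp. Frame 2 moves some points to new positions and swaps others (points A and B swap positions between frames). The final frame shows every vector from all frames, while the points show the average structure.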

import pandas as pd
from sklearn.neighbors import KernelDensity
from scipy.spatial.distance import pdist, squareform
import matplotlib.pyplot as plt
import numpy as np

# Example 1:
df = pd.DataFrame({   
    'Time' : [1,1,1,1,1,2,2,2,2,2],             
    'id' : ['A','B','C','D','E','B','A','C','D','E'],                 
    'X' : [1.0,2.8,4.0,2.0,2.0,1.5,3.0,5.0,3.0,2.5],
    'Y' : [1.0,1.0,0.0,0.0,2.0,1.0,1.0,0.0,0.0,2.0],
    })

def calculate_distances(group):
    group_distances = pd.DataFrame(
        squareform(pdist(group[["X", "Y"]].to_numpy())),  # Default is Euclidean distance
        columns=group["id"],
        index=group["id"],
    )

    return group_distances

# Calculate the distances between the points, per timeframe
df_distances = df.groupby("Time").apply(calculate_distances)

# Create a placeholder to store the relative positions at every timestamp
relative_positions = {timestamp: [] for timestamp in df["Time"].values}

# Go over the timeframes
for timestamp, group in df.groupby("Time"):

    # ---
    # "... first, we set the centroid of the structure to be the position of the point in the densest part of the structure ..."

    # Determine the density of the group, within this timeframe
    kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(group[["X", "Y"]])
    log_density = kde.score_samples(group[["X", "Y"]])

    # Centroid is the most dense point in the structure
    centroid = group.iloc[np.argmax(log_density)]

    # Make a list of the other points to keep track of which points we've handled
    other_points = group["id"].to_list()

    # Start by making the centroid the active point
    active_point_id = centroid["id"]

    # ---
    # "... the relative position of that point’s nearest neighbor (ignoring any point already considered
    # in the process) and so on, until the positions of all points in the team have been determined."

    # Keep handling the next point until there are no points left
    while len(other_points) > 1:

        # Remove the active point from the list
        other_points = [point for point in other_points if point != active_point_id]

        # For the active point, get the nearest neighbor
        nearest_neighbor = df_distances.loc[[timestamp]][active_point_id].droplevel(0).loc[other_points].sort_values().reset_index().iloc[0]["id"]

        # ---
        # "... We then identify the relative position of his nearest neighbor ..."

        # Determine the relative position of the nearest neighbor (compared to the active point)
        active_point_coordinates = group.loc[group["id"] == active_point_id, ["X", "Y"]].iloc[0].values
        nearest_neighbor_coordinates = group.loc[group["id"] == nearest_neighbor, ["X", "Y"]].iloc[0].values
        relative_position = active_point_coordinates - nearest_neighbor_coordinates

        # Add the relative position to the list, for this timestamp
        relative_positions[timestamp].append(relative_position)

        # The neighbor becomes the active point
        active_point_id = nearest_neighbor

# ---
# "... averaging the vectors between each pair of points over a specified time interval to gain a
# clear measure of their designated relative positions ..."

# Take the average vector, across timeframes
averages = np.mean([t for t in relative_positions.values()], axis=0)

# Plot the relative positions, NOTE: The centroid is always at (0, 0), and is not plotted

plt.scatter(averages[:,0], averages[:,1])

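If I manually plot the centroid at (0, 0), the output is:

[image]

Point structure, frame 1:

[image]

Point structure, frame 2:

[image]

The vectors from both frames are highlighted. The average point structure of these should therefore be:

[image]

If I generate the same point structure but shift the points to the right for the subsequent frame, the underlying point structure should be the same.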

df2 = pd.DataFrame({   
    'Time' : [1,1,1,1,1,2,2,2,2,2],             
    'id' : ['A','B','C','D','E','B','A','C','D','E'],                 
    'X' : [1.0,3.0,4.0,2.0,2.0,3.0,5.0,6.0,4.0,4.0],
    'Y' : [1.0,1.0,0.0,0.0,2.0,1.0,1.0,0.0,0.0,2.0],
    })

Expected structure:

[image]
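A quick way to check that the two frames of df2 really share one underlying structure, without relying on ids or absolute positions, is to compare the sorted pairwise distances of each frame. The helper name `distance_signature` is an assumption for this sketch. Matching signatures are a necessary condition rather than proof, since sorted distances are also unchanged by rotation and reflection.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist

df2 = pd.DataFrame({
    "Time": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "id": ["A", "B", "C", "D", "E", "B", "A", "C", "D", "E"],
    "X": [1.0, 3.0, 4.0, 2.0, 2.0, 3.0, 5.0, 6.0, 4.0, 4.0],
    "Y": [1.0, 1.0, 0.0, 0.0, 2.0, 1.0, 1.0, 0.0, 0.0, 2.0],
})


def distance_signature(group: pd.DataFrame) -> np.ndarray:
    """Sorted pairwise distances: unchanged by translation and by relabeling points."""
    return np.sort(pdist(group[["X", "Y"]].to_numpy()))


signatures = {t: distance_signature(g) for t, g in df2.groupby("Time")}
same_structure = np.allclose(signatures[1], signatures[2])
print(same_structure)
```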

Best answer

I've tried to follow the paper you referenced to a T, but their description of the algorithm is quite vague. Here is my solution:

import numpy
import pandas
import random

from sklearn.neighbors import KernelDensity
from scipy.spatial.distance import pdist, squareform

# From the paper:
# ---------------
# Formations are measured by calculating the vectors between each player and the rest of his
# teammates at successive instants during a match, averaging the vectors between each pair of
# players over a specified time interval to gain a clear measure of their designated relative positions.
# The final spatial distribution of the outfield players is determined by the following algorithm:
# first, we set the centroid of the formation to be the position of the player in the densest part of the
# team, as determined by the average distance to the third-nearest neighbor. We then identify the
# relative position of his nearest neighbor, the relative position of that player’s nearest neighbor
# (ignoring any player already considered in the process) and so on, until the positions of all players
# in the team have been determined.


# Your data, I've added some randomness to get a more realistic setting
df = pandas.DataFrame(
    {
        "Time": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
        "id": ["A", "B", "C", "D", "E", "A", "B", "C", "D", "E"],
        "Y": [element + random.random() * 0.25 for element in [1.0, 1.0, 0.0, 1.25, 2.0, 1.0, 1.0, 0.0, 1.25, 2.0]],
        "X": [element + random.random() * 0.25 for element in [1.0, 3.0, 2.0, 2.25, 2.0, 3.0, 5.0, 4.0, 4.25, 4.0]],
    }
)

# Plot the different timeframes (for reference)
for timestamp in df["Time"].unique():
    df.loc[df["Time"] == timestamp].plot(kind="scatter", x="X", y="Y")


def calculate_distances(group: pandas.DataFrame) -> pandas.DataFrame:
    """ Calculate the distances between the players, within a specific timeframe.

    Args:
        group (pandas.DataFrame): The data from a specific timeframe

    Returns:
        pandas.DataFrame: The distances
    """
    group_distances = pandas.DataFrame(
        squareform(pdist(group[["X", "Y"]].to_numpy())),  # Default is Euclidean distance
        columns=group["id"],
        index=group["id"],
    )
    return group_distances


# Calculate the distances between the points, per timeframe
df_distances = df.groupby("Time").apply(calculate_distances)

# Create a placeholder to store the relative positions at every timestamp
relative_positions = {timestamp: [] for timestamp in df["Time"].values}

# Go over the timeframes
for timestamp, group in df.groupby("Time"):

    # ---
    # "... first, we set the centroid of the formation to be the position of the player in the densest part of the team ..."

    # Determine the density of the group, within this timeframe
    kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(group[["X", "Y"]])
    log_density = kde.score_samples(group[["X", "Y"]])

    # Centroid is the most dense point in the formation
    centroid = group.iloc[numpy.argmax(log_density)]

    # Make a list of the other players to keep track of which players we've handled
    other_players = group["id"].to_list()

    # Start by making the centroid the active player
    active_player_id = centroid["id"]

    # ---
    # "... the relative position of that player’s nearest neighbor (ignoring any player already considered
    # in the process) and so on, until the positions of all players in the team have been determined."

    # Keep handling the next player until there are no players left
    while len(other_players) > 1:

        # Remove the active player from the list
        other_players = [player for player in other_players if player != active_player_id]

        # For the active player, get the nearest neighbor
        nearest_neighbor = df_distances.loc[[timestamp]][active_player_id].droplevel(0).loc[other_players].sort_values().reset_index().iloc[0]["id"]

        # ---
        # "... We then identify the relative position of his nearest neighbor ..."

        # Determine the relative position of the nearest neighbor (compared to the active player)
        active_player_coordinates = group.loc[group["id"] == active_player_id, ["X", "Y"]].iloc[0].values
        nearest_neighbor_coordinates = group.loc[group["id"] == nearest_neighbor, ["X", "Y"]].iloc[0].values
        relative_position = active_player_coordinates - nearest_neighbor_coordinates

        # Add the relative position to the list, for this timestamp
        relative_positions[timestamp].append(relative_position)

        # The neighbor becomes the active player
        active_player_id = nearest_neighbor


# ---
# "... averaging the vectors between each pair of players over a specified time interval to gain a
# clear measure of their designated relative positions ..."

# Take the average vector, across timeframes
averages = numpy.mean([t for t in relative_positions.values()], axis=0)

# Plot the relative positions, NOTE: The centroid is always at (0, 0), and is not plotted
pandas.DataFrame(averages, columns=["X", "Y"]).plot(kind="scatter", x="X", y="Y")
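Each vector stored in the loop above is the offset between two consecutive points of the chain, not an offset from the centroid. If positions relative to the centroid are wanted, one option (an interpretation, not something the paper or the code above states) is to accumulate the link vectors along the chain. The `links` array below is made-up data that follows the same sign convention as the code above, where each entry is the active point minus its nearest neighbor.

```python
import numpy as np

# Hypothetical link vectors for one frame: (active point) - (nearest neighbor)
links = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 0.0]])

# neighbor = active - link, so starting from the centroid at (0, 0) the
# position of each chain member is the running sum of the negated links
positions = np.cumsum(-links, axis=0)
print(positions)
```

With these made-up links the chain walks one unit right, one unit up, then one unit left, so the last point ends at (0, 1) relative to the centroid.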

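Previous answer:

The first part (fixing your code example) is not too hard. scipy has a function called pdist, which computes the pairwise distances between a set of points in multiple dimensions (2 in this case). If you want to do this per timeframe, you just need to use groupby.

The second part is harder, because it is not entirely clear what you want to achieve. The "densest" point of the formation can be found without the earlier distance calculation; sklearn has a KernelDensity class for this. Beyond that, I can't really do what you describe, because there is no single nearest neighbor in your formation (all other points are equally far from the centroid, so every neighbor is equally close). However, I think you can use the mean distance matrix (df_distances_mean) for this purpose, since it contains all the distances. You just pick the next point with the smallest distance from the centroid.

Below is the code that computes the distances and finds the centroid using the KernelDensity class: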

import numpy
import pandas
from sklearn.neighbors import KernelDensity
from scipy.spatial.distance import pdist, squareform

# Your data
df = pandas.DataFrame(
    {
        "Time": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
        "id": ["A", "B", "C", "D", "E", "A", "B", "C", "D", "E"],
        "X": [1.0, 3.0, 2.0, 2.0, 2.0, 3.0, 5.0, 4.0, 4.0, 4.0],
        "Y": [1.0, 1.0, 0.0, 1.0, 2.0, 1.0, 1.0, 0.0, 1.0, 2.0],
    }
)


def calculate_distances(group: pandas.DataFrame) -> pandas.DataFrame:
    group_distances = pandas.DataFrame(
        squareform(pdist(group[["X", "Y"]].to_numpy())),  # Default is Euclidean distance
        columns=group["id"],
        index=group["id"],
    )
    return group_distances


# Calculate the distances between the points, per timeframe
df_distances = df.groupby("Time").apply(calculate_distances)

# Take the mean distance across timeframes (since your points are just shifted right over time, this should be constant)
df_distances_mean = pandas.DataFrame(
    numpy.mean([group.to_numpy() for _, group in df_distances.groupby("Time")], axis=0),
    columns=df_distances.columns,
    index=df_distances.columns,
)
# df_distances_mean will now contain the mean distance between points

# =====================
# Not a 100% sure what you want to achieve from this point forward

# Determining the density of the formation (at each time step)
for _, group in df.groupby("Time"):

    # Determine the density
    kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(group[["X", "Y"]])
    log_density = kde.score_samples(group[["X", "Y"]])

    # Centroid is the most dense point in the formation (?)
    centroid = group.iloc[numpy.argmax(log_density)]
    print("Centroid based on density:", centroid)

Output:

Centroid based on density: Time    1
id      D
X       2
Y       1
Name: 3, dtype: object
Centroid based on density: Time    2
id      D
X       4
Y       1
Name: 8, dtype: object

print(df_distances_mean)

id         A         B         C    D         E
id                                             
A   0.000000  2.000000  1.414214  1.0  1.414214
B   2.000000  0.000000  1.414214  1.0  1.414214
C   1.414214  1.414214  0.000000  1.0  2.000000
D   1.000000  1.000000  1.000000  0.0  1.000000
E   1.414214  1.414214  2.000000  1.0  0.000000
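Picking "the next closest to the centroid" from a distance matrix like the one above could be sketched as follows. The matrix is rebuilt here from the same cross-shaped coordinates so the example is self-contained; `centroid_id` and `ranking` are names chosen for the illustration.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

ids = ["A", "B", "C", "D", "E"]
xy = np.array([[1.0, 1.0], [3.0, 1.0], [2.0, 0.0], [2.0, 1.0], [2.0, 2.0]])
mean_distances = pd.DataFrame(squareform(pdist(xy)), index=ids, columns=ids)

centroid_id = "D"

# Distances from the centroid to every other point, smallest first.
# A stable sort keeps the original order among ties.
ranking = mean_distances[centroid_id].drop(centroid_id).sort_values(kind="stable")
print(ranking)
```

For this formation every entry is 1.0, which shows the tie described above: the order among A, B, C and E is then set by the tie-breaking rule, not by the geometry.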

Regarding python - measuring the structure of xy points, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/65001543/
