python - Rank groups based on size

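Sample data:

What I want to do is replace the ID of the largest cluster with 0, the ID of the second largest with 1, and so on. The output would look like this.

I'm not quite sure where to start. Any help would be greatly appreciated.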

Best answer

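The goal is to relabel the groups defined in the 'cluster' column so that each group's new label is its rank by the number of rows it has in that column. Let's break this down into several steps:

Integer factorization. Find an integer representation in which each unique value in the column gets its own integer. We'll start from zero.
Then we need the count of each unique value.
We need to rank the unique values by their counts.
We assign the ranks back to the positions of the original column.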
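The four steps can be sketched end to end as follows. This is a minimal sketch rather than the answer's exact code: the cluster values are copied from the arrays printed further down in this answer, the names codes, counts, order and rank are my own, and the ranking step inverts the argsort permutation explicitly, which I'm assuming is the safe general form.

```python
import numpy as np
import pandas as pd

# Cluster labels as they appear in the answer's printed arrays (assumed sample data).
df = pd.DataFrame({
    "id": range(1, 14),
    "cluster": [3, 3, 3, 3, 1, 1, 2, 2, 2, 4, 4, 5, 6],
})

# Steps 1 and 2: integer codes (starting at zero) and a count per unique value.
uniques, codes, counts = np.unique(
    df["cluster"].to_numpy(), return_inverse=True, return_counts=True
)

# Step 3: rank the unique values by count, largest first.
order = np.argsort(-counts, kind="stable")  # order[r] = code holding rank r
rank = np.empty_like(order)
rank[order] = np.arange(order.size)         # rank[code] = rank of that code

# Step 4: push each row's rank back onto the column.
result = df.assign(cluster=rank[codes])
print(result["cluster"].tolist())
```

The explicit inversion (rank[order] = arange) does the same job as applying argsort a second time; it is written out here only to make the direction of the mapping visible.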

Method 1
Using NumPy's numpy.unique + argsort

TL;DR

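import numpy as np

# u: sorted unique cluster IDs, i: integer code of each row, c: count per unique ID
u, i, c = np.unique(
    df.cluster.values,
    return_inverse=True,
    return_counts=True
)
# argsort of -c orders the codes by descending count;
# the second argsort turns that order into a rank per code
(-c).argsort().argsort()[i]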

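It turns out that numpy.unique performs both the integer factorization and the counting in one go. Along the way we also get the unique values, but we don't really need those. The integer factorization isn't obvious, though, because numpy.unique calls the return value we're after the inverse. It's called the inverse because it's meant as a way to get back the original array given the array of unique values. So if we let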

u, i, c = np.unique(
    df.cluster.values,
    return_inverse=True,
    return_counts=True
)

you'll see that i looks like:

array([2, 2, 2, 2, 0, 0, 1, 1, 1, 3, 3, 4, 5])

If we evaluate u[i], we get back the original df.cluster.values

array([3, 3, 3, 3, 1, 1, 2, 2, 2, 4, 4, 5, 6])

But we're going to use it as the integer factorization.
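As a quick check of the inverse property, the sketch below rebuilds the original array from u and i. The values array is typed in by hand from the output shown above and stands in for df.cluster.values.

```python
import numpy as np

# Stand-in for df.cluster.values, copied from the array printed above.
values = np.array([3, 3, 3, 3, 1, 1, 2, 2, 2, 4, 4, 5, 6])

u, i = np.unique(values, return_inverse=True)

print(u)  # sorted unique values
print(i)  # for each element of values, its index into u

# Indexing u with i reproduces the original array element by element.
assert np.array_equal(u[i], values)
```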

Next, we need the counts c

array([2, 3, 4, 2, 1, 1])

I was going to suggest argsort, but it's confusing, so I'll try to show it:

np.vstack([c, (-c).argsort()])

array([[2, 3, 4, 2, 1, 1],
       [2, 1, 0, 3, 4, 5]])

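In general, argsort returns, for each spot in the sorted order, the position in the original array to draw from. The top spot (spot 0) therefore holds the position of the value that sorts first.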

#            position 2
#            is best
#                |
#                v
# array([[2, 3, 4, 2, 1, 1],
#        [2, 1, 0, 3, 4, 5]])
#         ^
#         |
#     top spot
#     from
#     position 2

#        position 1
#        goes to
#        second spot
#            |
#            v
# array([[2, 3, 4, 2, 1, 1],
#        [2, 1, 0, 3, 4, 5]])
#            ^
#            |
#        second spot
#        from
#        position 1
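To make the picture concrete, the sketch below checks that indexing c with the argsort result yields the counts in descending order. The name order is my own, and kind="stable" is an added assumption so that tied counts keep their original relative order.

```python
import numpy as np

c = np.array([2, 3, 4, 2, 1, 1])

# Negating the counts makes an ascending sort behave like a descending one.
order = np.argsort(-c, kind="stable")
print(order)     # positions of c, listed from largest count to smallest

# Spot 0 draws from position order[0], spot 1 from position order[1], ...
print(c[order])  # the counts in descending order
```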

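What this lets us do is index the ranking with our integer factorization, which gives every row the rank of its group. Note that the ranking has to be the inverse permutation of the argsort result, (-c).argsort().argsort(); here [2, 1, 0, 3, 4, 5] happens to be its own inverse, so the two arrays are identical.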

#     i is
#        [2 2 2 2 0 0 1 1 1 3 3 4 5]

#     (-c).argsort() is 
#        [2 1 0 3 4 5]

# argsort
# slice
#      \   / This is our integer factorization
#       a i
#     [[0 2]  <-- 0 is second position in argsort
#      [0 2]  <-- 0 is second position in argsort
#      [0 2]  <-- 0 is second position in argsort
#      [0 2]  <-- 0 is second position in argsort
#      [2 0]  <-- 2 is zeroth position in argsort
#      [2 0]  <-- 2 is zeroth position in argsort
#      [1 1]  <-- 1 is first position in argsort
#      [1 1]  <-- 1 is first position in argsort
#      [1 1]  <-- 1 is first position in argsort
#      [3 3]  <-- 3 is third position in argsort
#      [3 3]  <-- 3 is third position in argsort
#      [4 4]  <-- 4 is fourth position in argsort
#      [5 5]] <-- 5 is fifth position in argsort
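An argsort result says which position fills each spot, while a rank says which spot each position gets. The sketch below uses made-up counts (not from the question) for which the two differ, so indexing the plain argsort result would assign the wrong labels. The names order and rank are my own.

```python
import numpy as np

# Hypothetical counts for three codes: code 1 is largest, then code 2, then code 0.
c = np.array([1, 3, 2])

order = np.argsort(-c, kind="stable")    # order[r] = code that holds rank r
rank = np.argsort(order, kind="stable")  # rank[code] = rank of that code

print(order)  # [1 2 0]
print(rank)   # [2 0 1]

# Code 0 has the smallest count, so it must get the last rank (2), not order[0].
assert rank[0] == 2 and rank[1] == 0 and rank[2] == 1
```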

Then we can use pd.DataFrame.assign to put it into the column

u, i, c = np.unique(
    df.cluster.values,
    return_inverse=True,
    return_counts=True
)
df.assign(cluster=(-c).argsort().argsort()[i])

    id  cluster
0    1        0
1    2        0
2    3        0
3    4        0
4    5        2
5    6        2
6    7        1
7    8        1
8    9        1
9   10        3
10  11        3
11  12        4
12  13        5

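Method 2
I'll use the same concept. However, I'll use pandas.factorize for the integer factorization and numpy.bincount to count the values. The reason for this approach is that NumPy's unique actually sorts the values while factorizing and counting; pandas.factorize doesn't. For larger datasets, big O is our friend: the factorize and bincount steps stay O(n), whereas the NumPy approach is O(n log n).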

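import numpy as np
import pandas as pd

i, u = pd.factorize(df.cluster.values)  # codes in order of first appearance
c = np.bincount(i)                      # count per code
df.assign(cluster=(-c).argsort().argsort()[i])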

    id  cluster
0    1        0
1    2        0
2    3        0
3    4        0
4    5        2
5    6        2
6    7        1
7    8        1
8    9        1
9   10        3
10  11        3
11  12        4
12  13        5
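The difference in ordering between the two factorizations can be seen directly. In this sketch the values array is typed in from the arrays shown earlier, and ranks_by_size is a hypothetical helper of my own that uses the double argsort as the assumed general form. The codes differ between the two approaches, yet for this data the final ranks agree.

```python
import numpy as np
import pandas as pd

values = np.array([3, 3, 3, 3, 1, 1, 2, 2, 2, 4, 4, 5, 6])

# pandas.factorize numbers the values in order of first appearance.
codes_f, uniques_f = pd.factorize(values)
# numpy.unique numbers them in sorted order.
uniques_u, codes_u = np.unique(values, return_inverse=True)

print(uniques_f)  # order of first appearance
print(uniques_u)  # sorted order


def ranks_by_size(codes):
    """Rank each code by descending frequency and map the ranks onto the rows."""
    counts = np.bincount(codes)
    order = np.argsort(-counts, kind="stable")
    return np.argsort(order, kind="stable")[codes]


print(ranks_by_size(codes_f))
print(ranks_by_size(codes_u))
```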

Regarding python - Rank groups based on size, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47402346/
