python - Reducing the dimensions of a DataFrame in Python

I have a DataFrame with three columns, and I want to reduce its dimensions.

import pandas as pd

data = [[1, 876, 0.98], [1, 888, 0.58], [1, 976, 0.48], [1, 648, 0.98], [2, 765, 0.28], [2, 986, 0.28], [2, 765, 1.0], [2, 876, 0.45]]
sample = pd.DataFrame(data, columns=['col1', 'col2', 'col3'])
   col1  col2  col3
0     1   876  0.98
1     1   888  0.58
2     1   976  0.48
3     1   648  0.98
4     2   765  0.28
5     2   986  0.28
6     2   765  1.00
7     2   876  0.45

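I want the output shown below, subject to these conditions:

1. There should be one row for each value in col1, and col4 should be a list of (col2, col3) tuples.
2. col4 should contain only the top two tuples, ranked by the value in col3. For example, 765 appears twice in col2 of the sample DataFrame; the final DataFrame should keep the tuples with the highest and second-highest values in col3.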

data = [[1, [(876, 0.98), (648, 0.98)]], [2, [(876, 0.45), (765, 1.0)]]]
desired_output = pd.DataFrame(data, columns=['col1', 'col4'])

   col1                        col4
0     1  [(876, 0.98), (648, 0.98)]
1     2   [(876, 0.45), (765, 1.0)]

I want to store the result as a list of tuples so that I can use it for other purposes. This is only one part of a larger problem.

Best answer

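Sort by col1 and by col3 in descending order, then take the first two rows of each group with groupby and head: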

sample = sample.sort_values(['col1', 'col3'], ascending=[True, False])
sample.groupby('col1')[['col2', 'col3']].apply(
    lambda d: [*d.head(2).itertuples(index=False)]
).reset_index(name='col4')

   col1                        col4
0     1  [(876, 0.98), (648, 0.98)]
1     2   [(765, 1.0), (876, 0.45)]
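To see what the elements of col4 actually are, the result can be inspected directly. The following is a minimal sketch that rebuilds the sample data from the question; the variable names result and first are introduced here for illustration only and are not part of the original answer.

import pandas as pd

```python
import pandas as pd

data = [[1, 876, 0.98], [1, 888, 0.58], [1, 976, 0.48], [1, 648, 0.98],
        [2, 765, 0.28], [2, 986, 0.28], [2, 765, 1.0], [2, 876, 0.45]]
sample = pd.DataFrame(data, columns=['col1', 'col2', 'col3'])

# Same steps as the answer: sort, then keep the first two rows per group.
sample = sample.sort_values(['col1', 'col3'], ascending=[True, False])
result = sample.groupby('col1')[['col2', 'col3']].apply(
    lambda d: [*d.head(2).itertuples(index=False)]
).reset_index(name='col4')

# Look at the first element of the list stored for col1 == 2.
first = result.loc[1, 'col4'][0]
print(type(first).__name__)    # Pandas (the default name used by itertuples)
print(first._fields)           # ('col2', 'col3')
print(first.col2, first.col3)  # fields are accessible by column name
print(isinstance(first, tuple))  # True: it still behaves like a tuple
```

The printed DataFrame shows these elements the same way it shows plain tuples, so the difference only becomes visible when an element is inspected on its own.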

The elements of col4 will be named tuples. You can pass name=None to itertuples to avoid this:

sample = sample.sort_values(['col1', 'col3'], ascending=[True, False])
sample.groupby('col1')[['col2', 'col3']].apply(
    lambda d: [*d.head(2).itertuples(index=False, name=None)]
).reset_index(name='col4')
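The question also mentions that a col2 value can appear more than once within a group. On the sample data this does not matter, but if both rows of a repeated col2 value ranked in the top two, head(2) would return the same col2 twice. The sketch below assumes the intent is to keep only the highest col3 for each (col1, col2) pair; that reading of the requirement, the helper name top_two, and the small dup frame are all hypothetical and not taken from the original answer.

```python
import pandas as pd

def top_two(df, dedupe=True):
    """Return one row per col1 with the top two (col2, col3) tuples by col3."""
    ordered = df.sort_values(['col1', 'col3'], ascending=[True, False])
    if dedupe:
        # After the descending sort, the first row of each (col1, col2)
        # pair is the one with the highest col3, so keep only that one.
        ordered = ordered.drop_duplicates(['col1', 'col2'])
    return (
        ordered.groupby('col1')[['col2', 'col3']]
        .apply(lambda d: list(d.head(2).itertuples(index=False, name=None)))
        .reset_index(name='col4')
    )

# Hypothetical data in which 765 holds both of the two highest col3 values.
dup = pd.DataFrame({'col1': [2, 2, 2],
                    'col2': [765, 765, 876],
                    'col3': [1.0, 0.9, 0.45]})

print(top_two(dup, dedupe=False).loc[0, 'col4'])  # [(765, 1.0), (765, 0.9)]
print(top_two(dup).loc[0, 'col4'])                # [(765, 1.0), (876, 0.45)]
```

If repeated col2 values are acceptable in the output, the drop_duplicates step can simply be left out, which gives the same behavior as the answer above.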

Regarding python - Reducing the dimensions of a DataFrame in Python, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/52957686/
