python - How to group by an interval index, aggregate the mean over a list of lists, and join to another dataframe?

Tags: python pandas numpy

I have two dataframes. They look like this:

df_a
     Framecount                                        probability
0           0.0  [0.00019486549333333332, 4.883635666666667e-06...
1           1.0  [0.00104359155, 3.9232405e-05, 0.0015722045000...
2           2.0  [0.00048501002666666667, 1.668179e-05, 0.00052...
3           3.0  [4.994969500000001e-05, 4.0931635e-07, 0.00011...
4           4.0  [0.0004808829, 5.389742e-05, 0.002522127933333...
..          ...                                                ...
906       906.0  [1.677140566666667e-05, 1.1745095666666665e-06...
907       907.0  [1.5164155000000002e-05, 7.66629575e-07, 0.000...
908       908.0  [8.1334184e-05, 0.00012675669636333335, 0.0028...
909       909.0  [0.00014893802999999998, 1.0407592500000001e-0...
910       910.0  [4.178489e-05, 2.17477925e-06, 0.02094931, 0.0...

and:

df_b
     start    stop
0     12.12   12.47
1     13.44   20.82
2     20.88   29.63
3     31.61   33.33
4     33.44   42.21
..      ...     ...
228  880.44  887.92
229  888.63  892.07
230  892.13  895.30
231  895.31  900.99
232  907.58  908.35

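I want to join df_a.probability onto df_b wherever df_a.Framecount falls between df_b.start and df_b.stop. The aggregate statistic for df_a.probability should be the mean, but I am running into errors because each value in df_a.probability is an np array.

I am trying to use this code (here df_text is df_b and df_vid is df_a):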

idx = pd.IntervalIndex.from_arrays(df_text['start'], df_text['stop'])
df_text.join(df_vid.groupby(idx.get_indexer_non_unique(df_vid['Framecount']))['probability'].apply(np.mean), how='left')

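Line 1 creates the index that determines the grouping. In line 2, I am trying to group by that index and aggregate all the df_a.probability values that fall into each group. I want one array per group, equal to the mean of all the arrays in that group. This code gives me this error: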

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-271-19c7d58fb664> in <module>
      1 idx = pd.IntervalIndex.from_arrays(df_text['start'], df_text['stop'])
      2 f = lambda x: np.mean(np.array(x.tolist()), axis=0)
----> 3 df_text.join(df_vid.groupby(idx.get_indexer_non_unique(df_vid['Framecount']))['probability'].apply(np.mean), how='left')

~/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed)
   5808             group_keys=group_keys,
   5809             squeeze=squeeze,
-> 5810             observed=observed,
   5811         )
   5812 

~/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in __init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, observed, mutated)
    407                 sort=sort,
    408                 observed=observed,
--> 409                 mutated=self.mutated,
    410             )
    411 

~/anaconda3/lib/python3.7/site-packages/pandas/core/groupby/grouper.py in get_grouper(obj, key, axis, level, sort, observed, mutated, validate)
    588 
    589         elif is_in_axis(gpr):  # df.groupby('name')
--> 590             if gpr in obj:
    591                 if validate:
    592                     obj._check_label_or_level_ambiguity(gpr, axis=axis)

~/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in __contains__(self, key)
   1848     def __contains__(self, key) -> bool_t:
   1849         """True if the key is in the info axis"""
-> 1850         return key in self._info_axis
   1851 
   1852     @property

~/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in __contains__(self, key)
   3898     @Appender(_index_shared_docs["contains"] % _index_doc_kwargs)
   3899     def __contains__(self, key) -> bool:
-> 3900         hash(key)
   3901         try:
   3902             return key in self._engine

TypeError: unhashable type: 'numpy.ndarray'

I have tried several aggregation specifications, including:

df_text.join(df_vid.groupby(idx.get_indexer_non_unique(df_vid['Framecount']))['probability'].apply(lambda x: np.mean(np.array(x.tolist()), axis=0)), how='left')

df_text.join(df_vid.groupby(idx.get_indexer_non_unique(df_vid['Framecount']))['probability'].apply((np.mean), how='left')

df_text.join(df_vid.groupby(idx.get_indexer_non_unique(df_vid['Framecount']))['probability'].mean(), how='left')

I get the same error each time.

How can I do this?

Best answer

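  • The error occurs because idx.get_indexer_non_unique(df_vid['Framecount']) returns a tuple, and you cannot group by a tuple in this way.
    • df_vid.groupby(idx.get_indexer_non_unique(df_vid['Framecount'])[0]) works, because [0] selects the first array in the tuple.
  • idx.get_indexer(df_a.fc) produces a single array holding, for each fc, the position of the interval it belongs to. If no interval matches, the position is -1.
  • df_a.groupby(idx.get_indexer(df_a.fc)) groups by that array of positions.
  • .agg({'prob': list}) collects all the lists in each group into a single list.
    • The result for each group is a list of lists.
  • .prob.map(np.mean) returns the overall mean of all the lists in a group.
  • .prob.apply(lambda x: [np.mean(v) for v in x]) returns a list with one mean per list.
  • No 'fc' value falls into the 12.12 - 12.47 bin, so that row is NaN in the results below.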
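
To make the first points above concrete, here is a minimal sketch on made-up data. The two intervals and the `frames` values are illustrative assumptions, not taken from the question. It shows that `get_indexer_non_unique` returns a tuple of two arrays, while `get_indexer` returns one array of interval positions, with -1 for values outside every interval.

```python
import pandas as pd

# Hypothetical toy data: two non-overlapping intervals and four frame numbers.
idx = pd.IntervalIndex.from_arrays([1.0, 5.0], [3.0, 8.0], closed='both')
frames = pd.Series([0.0, 2.0, 6.0, 9.0])

# Returns a tuple: (interval positions, positions of the unmatched targets).
non_unique = idx.get_indexer_non_unique(frames)
print(type(non_unique))

# Returns a single array of interval positions, usable directly as a groupby key.
codes = idx.get_indexer(frames)
print(codes)  # [-1  0  1 -1]
```

Passing the whole tuple to groupby is what triggers the unhashable type error in the question; either `non_unique[0]` or `codes` is a valid grouping key.
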
import pandas as pd
import numpy as np

# setup df with start and stop ranges
data = {'start': [12.12, 13.44, 20.88, 31.61, 33.44, 880.44, 888.63, 892.13, 895.31, 907.58], 'stop': [12.47, 20.82, 29.63, 33.33, 42.21, 887.92, 892.07, 895.3, 900.99, 908.35]}
df = pd.DataFrame(data)

# setup sample df_a with Framecount as fc, and probability as prob
np.random.seed(365)
df_a = pd.DataFrame({'fc': range(911), 'prob': np.random.randint(1, 100, (911, 14)).tolist()})

# this will convert the column to np.arrays instead of lists; the remainder of the code works regardless
# df_a.prob = df_a.prob.map(np.array)

# create an IntervalIndex from df start and stop
idx = pd.IntervalIndex.from_arrays(df.start, df.stop, closed='both')
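
As a side note on `closed='both'`: by default pandas builds right-closed intervals, so a value exactly equal to a start would not match. The sketch below uses a single made-up interval (an assumption for illustration only) to show the difference at the endpoints.

```python
import pandas as pd

# Hypothetical single interval from 13.0 to 20.0, built two ways.
right_closed = pd.IntervalIndex.from_arrays([13.0], [20.0])  # default: (13.0, 20.0]
both_closed = pd.IntervalIndex.from_arrays([13.0], [20.0], closed='both')  # [13.0, 20.0]

# Look up both endpoints; -1 means "not inside any interval".
right_codes = right_closed.get_indexer([13.0, 20.0])
both_codes = both_closed.get_indexer([13.0, 20.0])
print(right_codes)  # [-1  0]
print(both_codes)   # [0 0]
```
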

This creates a list of means along axis=0 (one mean per position across all the lists in a group)

dfg = df_a.groupby(idx.get_indexer(df_a.fc)).agg({'prob': list}).prob.apply(lambda x: np.mean(x, axis=0))

# join df with dfg
dfj = df.join(dfg)

# display(dfj) for list of means
    start    stop                                                                                  prob
0   12.12   12.47                                                                                   NaN
1   13.44   20.82  [49.3, 57.1, 51.4, 45.9, 47.1, 45.9, 45.9, 55.3, 32.6, 48.0, 42.0, 45.0, 50.4, 54.4]
2   20.88   29.63  [42.7, 42.6, 46.0, 45.9, 54.1, 55.9, 50.1, 55.2, 51.7, 54.0, 37.6, 60.9, 49.2, 45.6]
3   31.61   33.33  [87.5, 49.0, 46.5, 54.5, 75.0, 47.0, 24.0, 40.5, 52.5, 21.0, 51.0, 72.5, 34.5, 50.5]
4   33.44   42.21  [48.6, 66.2, 45.8, 64.7, 43.1, 69.0, 54.4, 52.1, 52.6, 59.6, 51.1, 42.1, 43.3, 38.0]
5  880.44  887.92  [51.9, 50.6, 63.7, 47.7, 51.3, 34.9, 51.3, 53.0, 53.4, 65.1, 38.6, 49.4, 48.1, 44.1]
6  888.63  892.07  [45.2, 23.5, 67.2, 68.0, 38.2, 47.2, 50.2, 75.8, 35.2, 46.8, 55.0, 57.5, 44.2, 78.0]
7  892.13  895.30  [61.3, 44.0, 43.3, 36.3, 63.7, 89.7, 51.7, 57.0, 50.0, 68.7, 80.7, 46.3, 66.7, 11.3]
8  895.31  900.99  [68.2, 44.6, 50.8, 35.2, 53.2, 40.4, 34.8, 77.4, 61.0, 35.2, 26.0, 47.8, 30.4, 55.4]
9  907.58  908.35     [17.0, 78.0, 24.0, 33.0, 88.0, 3.0, 43.0, 2.0, 36.0, 48.0, 8.0, 87.0, 36.0, 34.0]
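
The groupby above also forms a group for the key -1 (frames that match no interval); it is simply dropped by the join. A possible variation, sketched here on small made-up data, filters those frames out before grouping and stacks each group into a 2-D array with `np.stack`. The toy values and the names `matched` and `means` are assumptions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three intervals, five frames with 2-element lists.
df = pd.DataFrame({'start': [0.5, 2.5, 10.0], 'stop': [2.0, 4.0, 11.0]})
df_a = pd.DataFrame({
    'fc': [0.0, 1.0, 2.0, 3.0, 4.0],
    'prob': [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]],
})

idx = pd.IntervalIndex.from_arrays(df['start'], df['stop'], closed='both')
codes = idx.get_indexer(df_a['fc'])  # -1 marks frames outside every interval

# Keep only matched frames, then average each group position by position.
matched = codes != -1
means = {
    int(code): np.mean(np.stack(group.to_list()), axis=0).tolist()
    for code, group in df_a.loc[matched, 'prob'].groupby(codes[matched])
}

# Intervals with no frames (row 2 here) end up as NaN after the join.
dfj = df.join(pd.Series(means, name='prob'))
print(dfj)
```
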

This creates a single overall mean for each group

dfg = df_a.groupby(idx.get_indexer(df_a.fc)).agg({'prob': list}).prob.map(np.mean)

# join df with dfg
dfj = df.join(dfg)

# display(dfj) for overall mean
    start    stop       prob
0   12.12   12.47        NaN
1   13.44   20.82  47.877551
2   20.88   29.63  49.380952
3   31.61   33.33  50.428571
4   33.44   42.21  52.182540
5  880.44  887.92  50.224490
6  888.63  892.07  52.303571
7  892.13  895.30  55.047619
8  895.31  900.99  47.171429
9  907.58  908.35  38.357143
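
The difference between the aggregations mentioned in this answer is easiest to see on one tiny group. The sketch below uses a made-up list of lists (an assumption, not data from the question) and compares the per-position mean, the overall mean, and the per-list mean.

```python
import numpy as np

# Hypothetical group: two lists of three values each.
group = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]

# Mean along axis=0: one value per position, averaged across the lists.
per_position = np.mean(group, axis=0)
print(per_position)  # [2. 3. 4.]

# Overall mean: every value in the group collapsed to one number.
overall = np.mean(group)
print(overall)  # 3.0

# Per-list mean: one value for each list in the group.
per_list = [np.mean(v) for v in group]
print(per_list)
```

The question asks for one array per group, which corresponds to the per-position (axis=0) variant.
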

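Regarding "python - How to group by an interval index, aggregate the mean over a list of lists, and join to another dataframe?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/64019052/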
