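TL;DR: I need to add the length (number of items) of each group to a new column. How can I achieve this?
I am working with experimental results that contain multiple repetitions (the same experiment is repeated with the same settings to improve the statistical robustness of the data). Each experiment has an identifier, and each repetition/run has an index "within" its experiment (see the code snippet for illustration).
For upcoming data processing, and for displaying the run number relative to the total number of runs per experiment (e.g. run 1/3, run 2/3, run 3/3), I need to add two columns containing
- the "human-readable" run number (basically the 1-based run index) and
- the total number of runs per experiment.
The first is easily achieved by incrementing each run's run_id: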
import io
import pandas as pd
import numpy as np
DATA_STRING = """
experiment_id run_id value other_data
9tfc6d 0 0.448 0.883
9tfc6d 1 0.963 0.230
9tfc6d 2 0.711 0.724
q9tqjq 0 0.748 0.959
q9tqjq 1 0.662 0.772
q9tqjq 2 0.530 0.834
jsxp2m 0 0.087 0.346
jsxp2m 1 0.362 0.569
jsxp2m 2 0.124 0.206
"""
file_like = io.StringIO(DATA_STRING)
df = pd.read_csv(file_like, sep=r'\s+')  # raw string avoids an invalid-escape warning
df['run_number'] = df['run_id'] + 1
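However, I am having trouble generating the second column. Conceptually, the approach should be:
- Group the data frame df by experiment_id to access each experiment block individually.
- Apply a function to each group that determines the number of runs in that group (equivalent to the length or row count of the group).
- Return a Series of the same length as the group, containing the group length as integer values.
- Concatenate all these Series into a single Series and assign it as a new column to the data frame df.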
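The steps above can be sketched quite literally. This is only an illustration under assumptions: the frame below is a trimmed copy of the question's data (only experiment_id and run_id), and the name pieces is made up for the example.

```python
import pandas as pd

# Trimmed copy of the question's data: identifiers and run indices only.
df = pd.DataFrame({
    "experiment_id": ["9tfc6d"] * 3 + ["q9tqjq"] * 3 + ["jsxp2m"] * 3,
    "run_id": [0, 1, 2] * 3,
})

pieces = []  # hypothetical name: one Series per experiment
for _, group in df.groupby("experiment_id"):
    # Same index as the group, every value equal to the group's row count.
    pieces.append(pd.Series(len(group), index=group.index))

# Concatenate the per-group Series; the assignment aligns on the index,
# so the (sorted) group order does not matter.
df["total_runs"] = pd.concat(pieces)
print(df)
```

This is essentially what a grouped transform does in one call, which is why the manual version is rarely needed.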
Creating the new column with an ugly for loop looks like this:
for name, group in df.groupby('experiment_id'):
    group.loc[:, 'total_runs'] = group['run_id'].count()
    print(group, end='\n\n')
Since this is an ugly approach, I do not want to go down that rabbit hole, especially given the warning Pandas issues:
C:\Users\Albert\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\indexing.py:376: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
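As far as I can tell, the loop above only modifies the group objects handed out by groupby, which are separate from df, so df itself never receives the column. A minimal sketch of a loop that writes back into df instead, assuming a trimmed copy of the question's data:

```python
import pandas as pd

# Trimmed copy of the question's data: identifiers and run indices only.
df = pd.DataFrame({
    "experiment_id": ["9tfc6d"] * 3 + ["q9tqjq"] * 3 + ["jsxp2m"] * 3,
    "run_id": [0, 1, 2] * 3,
})

for name, group in df.groupby("experiment_id"):
    # group.index holds the row labels of this experiment inside df,
    # so the assignment targets df directly rather than the group.
    df.loc[group.index, "total_runs"] = len(group)

# The column starts out as float (it is filled group by group), so cast it.
df["total_runs"] = df["total_runs"].astype(int)
print(df)
```

It works, but it is still an explicit Python loop over the groups.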
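Since my idea was to call a function on each group and return a Series containing the relevant information, I looked at the documentation. Reading the guide "Groupby: Split, Apply, Combine" from the docs, I came across .transform(), which I took a more detailed look at.
Calling .transform() like this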
df.groupby('experiment_id').transform(lambda x: len(x))
produces the desired output:
run_id value other_data run_number
0 3 3 3 3
1 3 3 3 3
2 3 3 3 3
3 3 3 3 3
4 3 3 3 3
5 3 3 3 3
6 4 4 4 4
7 4 4 4 4
8 4 4 4 4
9 4 4 4 4
However, using the exact same call to create a new column
df['total_runs'] = df.groupby('experiment_id').transform(lambda x: len(x))
raises several KeyErrors and a ValueError:
KeyError Traceback (most recent call last) ~\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance) 2896 try:
-> 2897 return self._engine.get_loc(key) 2898 except KeyError:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'total_runs'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last) ~\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\internals\managers.py in set(self, item, value) 1068 try:
-> 1069 loc = self.items.get_loc(item) 1070 except KeyError:
~\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance) 2898 except KeyError:
-> 2899 return self._engine.get_loc(self._maybe_cast_indexer(key)) 2900 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'total_runs'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last) <ipython-input-11-874dd354de5d> in <module>
----> 1 df['total_runs'] = df.groupby('experiment_id').transform(lambda x: len(x))
~\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\frame.py in __setitem__(self, key, value) 3485 else: 3486
# set column
-> 3487 self._set_item(key, value) 3488 3489 def _setitem_slice(self, key, value):
~\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\frame.py in _set_item(self, key, value) 3563 self._ensure_valid_index(value) 3564 value = self._sanitize_column(key, value)
-> 3565 NDFrame._set_item(self, key, value) 3566 3567 # check if we are modifying a copy
~\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\generic.py in _set_item(self, key, value) 3379 3380 def
_set_item(self, key, value):
-> 3381 self._data.set(key, value) 3382 self._clear_item_cache() 3383
~\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\internals\managers.py in set(self, item, value) 1070 except KeyError: 1071
# This item wasn't present, just insert at end
-> 1072 self.insert(len(self.items), item, value) 1073 return 1074
~\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\internals\managers.py in insert(self, loc, item, value, allow_duplicates) 1179 new_axis = self.items.insert(loc, item) 1180
-> 1181 block = make_block(values=value, ndim=self.ndim, placement=slice(loc, loc + 1)) 1182 1183 for blkno, count in _fast_count_smallints(self._blknos[loc:]):
~\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\internals\blocks.py in make_block(values, placement, klass, ndim, dtype, fastpath) 3282 values = DatetimeArray._simple_new(values, dtype=dtype) 3283
-> 3284 return klass(values, ndim=ndim, placement=placement) 3285 3286
~\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\internals\blocks.py in __init__(self, values, placement, ndim)
126 raise ValueError(
127 "Wrong number of items passed {val}, placement implies "
--> 128 "{mgr}".format(val=len(self.values), mgr=len(self.mgr_locs))
129 )
130
ValueError: Wrong number of items passed 4, placement implies 1
Since a KeyError is raised first, I implemented a small (really ugly) hack in order to move on:
df['total_runs'] = np.zeros_like(df['run_id'])
df['total_runs'] = df.groupby('experiment_id').transform(lambda x: len(x))
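In the end, this worked. However, I would like to get rid of this ugly hack and generate the desired column using Pandas' excellent GroupBy object functionality. How can I achieve this?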
Best answer
The command df.groupby('experiment_id').transform(lambda x: len(x)) returns 4 columns.
So when you try to store the output of that command in total_runs, which is a single column, it naturally fails:
df['total_runs'] = df.groupby('experiment_id').transform(lambda x: len(x))
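The difference is easy to check by looking at what each call returns. The sketch below rebuilds the question's frame and compares the two results; the names whole and single are only illustrative.

```python
import pandas as pd

# The question's data, including the derived run_number column.
df = pd.DataFrame({
    "experiment_id": ["9tfc6d"] * 3 + ["q9tqjq"] * 3 + ["jsxp2m"] * 3,
    "run_id": [0, 1, 2] * 3,
    "value": [0.448, 0.963, 0.711, 0.748, 0.662, 0.530, 0.087, 0.362, 0.124],
    "other_data": [0.883, 0.230, 0.724, 0.959, 0.772, 0.834, 0.346, 0.569, 0.206],
})
df["run_number"] = df["run_id"] + 1

# Transforming the whole GroupBy yields one output column per non-key column.
whole = df.groupby("experiment_id").transform(lambda x: len(x))
print(type(whole).__name__, whole.shape)

# Selecting a single column first yields a Series, which fits into one column.
single = df.groupby("experiment_id")["run_id"].transform(lambda x: len(x))
print(type(single).__name__, single.shape)
```

The first result is a DataFrame, the second a Series aligned with df's index.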
Instead, do the following:
In [1517]: df['total_runs'] = df.groupby('experiment_id')['run_number'].transform('count')
In [1518]: df
Out[1518]:
experiment_id run_id value other_data run_number total_runs
0 9tfc6d 0 0.448 0.883 1 3
1 9tfc6d 1 0.963 0.230 2 3
2 9tfc6d 2 0.711 0.724 3 3
3 q9tqjq 0 0.748 0.959 1 3
4 q9tqjq 1 0.662 0.772 2 3
5 q9tqjq 2 0.530 0.834 3 3
6 jsxp2m 0 0.087 0.346 1 3
7 jsxp2m 1 0.362 0.569 2 3
8 jsxp2m 2 0.124 0.206 3 3
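One caveat worth keeping in mind: 'count' only counts non-null values, whereas 'size' counts rows. For the data in the question both give the same result, but they differ once the selected column contains missing values. A small sketch with made-up data (the frame and the column names n_count and n_size are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in group "a".
df = pd.DataFrame({
    "experiment_id": ["a", "a", "a", "b", "b"],
    "value": [0.1, np.nan, 0.3, 0.4, 0.5],
})

grouped = df.groupby("experiment_id")["value"]
df["n_count"] = grouped.transform("count")  # non-null values per group
df["n_size"] = grouped.transform("size")    # rows per group, NaN included
print(df)
```

If the goal is the number of runs per experiment regardless of missing measurements, 'size' is probably the safer choice.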
Regarding "python - assigning the length/number of items in a group to a new column", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/66972316/