I have a set of records, each labeled with a cluster value.
Original DataFrame, df:
+-------------+---------+
| measurement | cluster |
+-------------+---------+
| M1 | 6 |
| M2 | 6 |
| M3 | 6 |
| M4 | 12 |
| M5 | 12 |
| M6 | 12 |
| M7 | 2 |
| M8 | 9 |
| M9 | 9 |
| M10 | 9 |
| M11 | 9 |
+-------------+---------+
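How can I renumber the clusters with consecutive new numbers, based on whether the current cluster value equals the previous or the next one, while assigning 'x' to rows whose cluster value equals neither the previous nor the next?
Desired df: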
+-------------+---------+-------------+
| measurement | cluster | new_cluster |
+-------------+---------+-------------+
| M1 | 6 | 1 |
| M2 | 6 | 1 |
| M3 | 6 | 1 |
| M4 | 12 | 2 |
| M5 | 12 | 2 |
| M6 | 12 | 2 |
| M7 | 2 | x |
| M8 | 9 | 3 |
| M9 | 9 | 3 |
| M10 | 9 | 3 |
| M11 | 9 | 3 |
+-------------+---------+-------------+
Best answer
Use pandas.factorize on the values selected by a boolean mask:
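import pandas as pd
# True for rows whose cluster value equals the previous or the next row's value
m = df['cluster'].ne(df['cluster'].shift()).cumsum().duplicated(keep=False)
# Number the selected cluster values 1, 2, 3, ... in order of first appearance
df.loc[m, 'new_cluster'] = pd.factorize(df.loc[m, 'cluster'])[0] + 1
print(df)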
measurement cluster new_cluster
0 M1 6 1.0
1 M2 6 1.0
2 M3 6 1.0
3 M4 12 2.0
4 M5 12 2.0
5 M6 12 2.0
6 M7 2 NaN
7 M8 9 3.0
8 M9 9 3.0
9 M10 9 3.0
10 M11 9 3.0
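For readers who want to run the answer end to end, here is a minimal self-contained sketch. The DataFrame is rebuilt from the table in the question; the intermediate name `run_id` is introduced here for readability and is not part of the original answer.

```python
import pandas as pd

# Example data copied from the question.
df = pd.DataFrame({
    "measurement": [f"M{i}" for i in range(1, 12)],
    "cluster": [6, 6, 6, 12, 12, 12, 2, 9, 9, 9, 9],
})

# Give each run of consecutive equal values its own id
# (run_id is an illustrative name, not from the original answer).
run_id = df["cluster"].ne(df["cluster"].shift()).cumsum()

# A run id that occurs more than once marks a row with an equal neighbour.
m = run_id.duplicated(keep=False)

# factorize returns 0-based codes in order of first appearance, so add 1.
# Rows outside the mask are left as NaN, which makes the column float.
df.loc[m, "new_cluster"] = pd.factorize(df.loc[m, "cluster"])[0] + 1
```

Because the unmatched row stays NaN, the new column is stored as float, which is why the labels print as 1.0, 2.0 and 3.0.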
To replace NaN with x:
df['new_cluster'] = df['new_cluster'].fillna('x')
print(df)
measurement cluster new_cluster
0 M1 6 1
1 M2 6 1
2 M3 6 1
3 M4 12 2
4 M5 12 2
5 M6 12 2
6 M7 2 x
7 M8 9 3
8 M9 9 3
9 M10 9 3
10 M11 9 3
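Since fillna is applied to a float column, the numeric labels remain floats inside the resulting object column, and depending on the pandas version they may display as 1.0 rather than 1. One possible alternative, which is not from the original answer, is to build the labels in a NumPy object array so that the numbers stay integers. The name `labels` is illustrative only.

```python
import numpy as np
import pandas as pd

# Example data copied from the question.
df = pd.DataFrame({
    "measurement": [f"M{i}" for i in range(1, 12)],
    "cluster": [6, 6, 6, 12, 12, 12, 2, 9, 9, 9, 9],
})
m = df["cluster"].ne(df["cluster"].shift()).cumsum().duplicated(keep=False)

# Start with the placeholder 'x' everywhere (object dtype holds mixed types).
labels = np.full(len(df), "x", dtype=object)

# Overwrite the masked positions with 1-based integer codes.
labels[m.to_numpy()] = pd.factorize(df.loc[m, "cluster"])[0] + 1

df["new_cluster"] = labels
```

The trade-off is that the column holds a mix of integers and strings, so numeric operations on it need the 'x' rows filtered out first.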
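Details of the boolean mask: first create a helper Series that numbers each run of consecutive equal values, then call duplicated with keep=False so that every row belonging to a run longer than one is flagged as a duplicate: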
print(df['cluster'].ne(df['cluster'].shift()).cumsum())
0 1
1 1
2 1
3 2
4 2
5 2
6 3
7 4
8 4
9 4
10 4
Name: cluster, dtype: int32
print(m)
0 True
1 True
2 True
3 True
4 True
5 True
6 False
7 True
8 True
9 True
10 True
Name: cluster, dtype: bool
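One point worth checking against your own data: the answer factorizes the cluster values, so if the same value shows up again in a later, separate run, it receives the same new number. The question does not say which behaviour is wanted. The sketch below uses hypothetical data (not from the question) to compare that with factorizing the run ids from the helper Series, which gives every run its own number.

```python
import pandas as pd

# Hypothetical data: the value 6 appears in two separate runs.
s = pd.Series([6, 6, 12, 12, 2, 6, 6], name="cluster")

# Helper Series of run ids and the mask, built as in the answer.
run_id = s.ne(s.shift()).cumsum()
m = run_id.duplicated(keep=False)

# Factorizing the cluster values: both runs of 6 share one label.
by_value = pd.factorize(s[m])[0] + 1

# Factorizing the run ids instead: each run gets its own label.
by_run = pd.factorize(run_id[m])[0] + 1

print(by_value.tolist())  # [1, 1, 2, 2, 1, 1]
print(by_run.tolist())    # [1, 1, 2, 2, 3, 3]
```

For the data in the question the two variants give the same result, because no cluster value repeats in a later run.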
Source: the Stack Overflow question "python - How to increment a counter based on a condition in a pandas DataFrame?": https://stackoverflow.com/questions/51791903/