python - Pandas one-hot encoding: bundling together less frequent categories

Tags: python pandas scikit-learn one-hot-encoding

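I am one-hot encoding a categorical column that has about 18 distinct values. I want to create new columns only for the values that appear more often than some threshold (say 1%), plus one more column named other values that is 1 whenever the value is not one of those frequent values.

I am using pandas and scikit-learn. I have explored pandas get_dummies and scikit-learn's one hot encoder, but can't figure out how to bundle the less frequent values into a single column.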

Best answer

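Plan

pd.get_dummies to one-hot encode as normal
sum() < threshold to identify the columns that get aggregated
- I use value_counts with the parameter normalize=True to get the percentage of occurrence.
join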
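
The threshold test in the second step can be tried on its own before it is wired into a function. What follows is only a minimal sketch: the Series `colors`, the names `freq`, `rare` and `rare_labels`, and the threshold of 0.25 are all illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Hypothetical example data: 'red' x5, 'blue' x4, 'green' x1
colors = pd.Series(['red'] * 5 + ['blue'] * 4 + ['green'])

# Share of rows taken by each category (the shares sum to 1)
freq = colors.value_counts(normalize=True)

# True for categories whose share falls below the assumed threshold
rare = freq < 0.25

# Labels that would be bundled into the catch-all column
rare_labels = sorted(rare[rare].index)
print(rare_labels)
```

Here only 'green' (1 row out of 10) falls below the threshold, so it is the only label that would be bundled.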

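import numpy as np
import pandas as pd

def hot_mess(s, thresh):
    # One-hot encode every category; dtype=int keeps the 0/1 output shown below
    d = pd.get_dummies(s, dtype=int)
    # True for categories whose relative frequency is below the threshold
    f = s.value_counts(sort=False, normalize=True) < thresh
    if f.sum() == 0:
        # Nothing is rare enough to bundle
        return d
    # Keep the frequent columns and join the row-wise sum of the rare ones as 'other'
    return d.loc[:, ~f].join(d.loc[:, f].sum(axis=1).rename('other'))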

Consider the pd.Series s

s = pd.Series(np.repeat(list('abcdef'), range(1, 7)))

s

0     a
1     b
2     b
3     c
4     c
5     c
6     d
7     d
8     d
9     d
10    e
11    e
12    e
13    e
14    e
15    f
16    f
17    f
18    f
19    f
20    f
dtype: object

hot_mess(s, 0)

    a  b  c  d  e  f
0   1  0  0  0  0  0
1   0  1  0  0  0  0
2   0  1  0  0  0  0
3   0  0  1  0  0  0
4   0  0  1  0  0  0
5   0  0  1  0  0  0
6   0  0  0  1  0  0
7   0  0  0  1  0  0
8   0  0  0  1  0  0
9   0  0  0  1  0  0
10  0  0  0  0  1  0
11  0  0  0  0  1  0
12  0  0  0  0  1  0
13  0  0  0  0  1  0
14  0  0  0  0  1  0
15  0  0  0  0  0  1
16  0  0  0  0  0  1
17  0  0  0  0  0  1
18  0  0  0  0  0  1
19  0  0  0  0  0  1
20  0  0  0  0  0  1

hot_mess(s, .1)

    c  d  e  f  other
0   0  0  0  0      1
1   0  0  0  0      1
2   0  0  0  0      1
3   1  0  0  0      0
4   1  0  0  0      0
5   1  0  0  0      0
6   0  1  0  0      0
7   0  1  0  0      0
8   0  1  0  0      0
9   0  1  0  0      0
10  0  0  1  0      0
11  0  0  1  0      0
12  0  0  1  0      0
13  0  0  1  0      0
14  0  0  1  0      0
15  0  0  0  1      0
16  0  0  0  1      0
17  0  0  0  1      0
18  0  0  0  1      0
19  0  0  0  1      0
20  0  0  0  1      0
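
The same result can probably also be reached by collapsing the rare labels before encoding, so that get_dummies only ever sees the final set of categories. This is a sketch of that alternative, not the answer's method. The function name `bundle_rare` and its `other` parameter are hypothetical, and it assumes no existing category is already called 'other'.

```python
import numpy as np
import pandas as pd

def bundle_rare(s, thresh, other='other'):
    """Replace categories rarer than `thresh` with `other`, then one-hot encode."""
    # Map every row to the relative frequency of its own category
    row_freq = s.map(s.value_counts(normalize=True))
    # Keep frequent labels, overwrite the rest with the placeholder label
    collapsed = s.where(row_freq >= thresh, other)
    return pd.get_dummies(collapsed, dtype=int)

s = pd.Series(np.repeat(list('abcdef'), range(1, 7)))
encoded = bundle_rare(s, 0.1)
print(encoded.columns.tolist())
```

With the example Series, a (1/21) and b (2/21) are below 0.1, so their three rows end up in the other column. One difference to keep in mind: in this sketch a row with a missing value also lands in other, because its frequency lookup yields NaN and fails the comparison.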

For python - Pandas one-hot encoding: bundling together less frequent categories, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43334222/
