python - Label encoding across multiple columns with the same attributes in scikit-learn

Suppose I have the following two columns:

Origin  Destination  
China   USA  
China   Turkey  
USA     China  
USA     Turkey  
USA     Russia  
Russia  China

How would I perform label encoding while ensuring that the labels in the Origin column match those in the Destination column, i.e.

Origin  Destination  
0   1  
0   3  
1   0  
1   0  
1   0  
2   1

If I encode each column separately, the algorithm will treat China in column 1 as different from China in column 2, which is not the case.
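
To make the problem concrete, here is a minimal sketch of what happens when each column is encoded on its own. The way `df` is constructed is an assumption (the question only shows its contents), and `pd.factorize` stands in for whichever per-column encoder is used.

```python
import pandas as pd

# Example frame rebuilt from the table in the question (construction is assumed).
df = pd.DataFrame({
    "Origin": ["China", "China", "USA", "USA", "USA", "Russia"],
    "Destination": ["USA", "Turkey", "China", "Turkey", "Russia", "China"],
})

# Encode each column independently: every column gets its own code table.
separate = df.apply(lambda col: pd.factorize(col)[0])

# Look up the code that "China" received in each column.
origin_china = separate.loc[df["Origin"] == "China", "Origin"].iloc[0]
dest_china = separate.loc[df["Destination"] == "China", "Destination"].iloc[0]

print(origin_china, dest_china)  # the two codes differ
```

Because the two columns are factorized separately, the same country ends up with different integer codes, which is exactly the mismatch described above.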

Best answer

`stack`

df.stack().pipe(lambda s: pd.Series(pd.factorize(s.values)[0], s.index)).unstack()

   Origin  Destination
0       0            1
1       0            2
2       1            0
3       1            2
4       1            3
5       3            0
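
The one-liner above can be unpacked into separate steps, which may be easier to follow. This is only a sketch: the construction of `df` and the intermediate names (`stacked`, `codes`, `uniques`, `encoded`) are assumptions introduced for illustration.

```python
import pandas as pd

# Example frame rebuilt from the table in the question (construction is assumed).
df = pd.DataFrame({
    "Origin": ["China", "China", "USA", "USA", "USA", "Russia"],
    "Destination": ["USA", "Turkey", "China", "Turkey", "Russia", "China"],
})

# 1. Stack both columns into one long Series indexed by (row, column).
stacked = df.stack()

# 2. Factorize the combined values so both columns share one code table.
codes, uniques = pd.factorize(stacked.values)

# 3. Reattach the (row, column) index and pivot back to the original shape.
encoded = pd.Series(codes, index=stacked.index).unstack()

print(encoded)
```

Since all values pass through a single `pd.factorize` call, a country gets the same code no matter which column it appears in.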

`factorize` and `reshape`

pd.DataFrame(
    pd.factorize(df.values.ravel())[0].reshape(df.shape),
    df.index, df.columns
)

   Origin  Destination
0       0            1
1       0            2
2       1            0
3       1            2
4       1            3
5       3            0
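
`pd.factorize` also returns the unique values, so the label-to-code table can be kept for later decoding. The following sketch assumes the same example frame; the names `mapping` and `decoded` are hypothetical and not part of the original answer.

```python
import pandas as pd

# Example frame rebuilt from the table in the question (construction is assumed).
df = pd.DataFrame({
    "Origin": ["China", "China", "USA", "USA", "USA", "Russia"],
    "Destination": ["USA", "Turkey", "China", "Turkey", "Russia", "China"],
})

# Factorize the flattened values; keep both the codes and the unique labels.
codes, uniques = pd.factorize(df.values.ravel())
encoded = pd.DataFrame(codes.reshape(df.shape), df.index, df.columns)

# Label -> code table shared by both columns.
mapping = {label: code for code, label in enumerate(uniques)}

# Decode by indexing the unique labels with the integer codes.
decoded = pd.DataFrame(uniques[encoded.values], encoded.index, encoded.columns)

print(mapping)
```

Keeping `uniques` around makes it possible to map model outputs or encoded rows back to country names.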

`np.unique` and `reshape`

pd.DataFrame(
    np.unique(df.values.ravel(), return_inverse=True)[1].reshape(df.shape),
    df.index, df.columns
)

   Origin  Destination
0       0            3
1       0            2
2       3            0
3       3            2
4       3            1
5       1            0
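
Since the question asks about scikit-learn, the same idea can be sketched with `sklearn.preprocessing.LabelEncoder`: fit a single encoder on the values of both columns, then transform each column with it. This is not part of the original answer, and the construction of `df` is an assumption. Like `np.unique`, `LabelEncoder` sorts its classes, so the codes should match the output above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Example frame rebuilt from the table in the question (construction is assumed).
df = pd.DataFrame({
    "Origin": ["China", "China", "USA", "USA", "USA", "Russia"],
    "Destination": ["USA", "Turkey", "China", "Turkey", "Russia", "China"],
})

# Fit one encoder on the values of both columns combined.
le = LabelEncoder()
le.fit(df.values.ravel())

# Apply the same fitted encoder to every column.
encoded = df.apply(le.transform)

# The fitted encoder can also map codes back to labels.
origin_labels = le.inverse_transform(encoded["Origin"])

print(encoded)
```

A single fitted encoder has the advantage that it can be reused on new data with the same code table.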

Yucky option

I couldn't stop trying things... sorry!

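import itertools

# The defaults y={} and c=itertools.count() are created once, so their state persists across calls.
df.applymap(
    lambda x, y={}, c=itertools.count():
        y.get(x) if x in y else y.setdefault(x, next(c))
)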

   Origin  Destination
0       0            1
1       0            3
2       1            0
3       1            3
4       1            2
5       2            0
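
The `applymap` trick relies on mutable default arguments to carry state between calls. A more explicit sketch of the same idea is shown below. The helper `build_mapping` is hypothetical, and it assumes values are visited column by column, which is what the output above suggests. Note also that `DataFrame.applymap` is deprecated in recent pandas releases in favor of `DataFrame.map`; the sketch avoids both by using `Series.map`.

```python
import itertools
import pandas as pd

# Example frame rebuilt from the table in the question (construction is assumed).
df = pd.DataFrame({
    "Origin": ["China", "China", "USA", "USA", "USA", "Russia"],
    "Destination": ["USA", "Turkey", "China", "Turkey", "Russia", "China"],
})


def build_mapping(frame):
    """Assign codes in order of first appearance, going column by column."""
    mapping = {}
    counter = itertools.count()
    for column in frame.columns:
        for value in frame[column]:
            if value not in mapping:
                mapping[value] = next(counter)
    return mapping


mapping = build_mapping(df)

# Replace every label with its shared code, one column at a time.
encoded = df.apply(lambda col: col.map(mapping))

print(encoded)
```

Here the codes differ from the `factorize` results only because the values are visited column by column instead of row by row.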

As cᴏʟᴅsᴘᴇᴇᴅ pointed out,

you can shorten this by assigning back to the dataframe:

df[:] = pd.factorize(df.values.ravel())[0].reshape(df.shape)

This question and answer about label encoding across multiple columns with the same attributes in scikit-learn come from Stack Overflow: https://stackoverflow.com/questions/50264334/
