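I would like to break a pandas column consisting of lists of elements into as many columns as there are unique elements, i.e. one-hot-encode them (with 1 meaning a given element is present in a row and 0 meaning it is absent).

For example, take the DataFrame df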

Col1   Col2         Col3
 C      33     [Apple, Orange, Banana]
 A      2.5    [Apple, Grape]
 B      42     [Banana]
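
For reproducibility, the example frame can be rebuilt roughly as follows. This is a minimal sketch that assumes Col3 holds real Python lists (not strings that merely look like lists); the values are copied from the table above.

```python
import pandas as pd

# Rebuild the example frame from the question; each Col3 cell is a Python list.
df = pd.DataFrame({
    "Col1": ["C", "A", "B"],
    "Col2": [33, 2.5, 42],
    "Col3": [["Apple", "Orange", "Banana"], ["Apple", "Grape"], ["Banana"]],
})
```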

I would like to convert it to:

Col1   Col2   Apple   Orange   Banana   Grape
 C      33     1        1        1       0
 A      2.5    1        0        0       1
 B      42     0        0        1       0

How can I use pandas/sklearn to achieve this?

Best answer

We can also use sklearn.preprocessing.MultiLabelBinarizer:

With real-world data we often want a sparse DataFrame, because it can save a lot of memory.

Sparse solution (for pandas v0.25.0+)

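import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# sparse_output=True makes fit_transform return a SciPy sparse matrix
mlb = MultiLabelBinarizer(sparse_output=True)

# pop() removes Col3 from df in place and returns it;
# the encoded columns are then joined back on the original index
df = df.join(
    pd.DataFrame.sparse.from_spmatrix(
        mlb.fit_transform(df.pop('Col3')),
        index=df.index,
        columns=mlb.classes_))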

Result:

In [38]: df
Out[38]:
  Col1  Col2  Apple  Banana  Grape  Orange
0    C  33.0      1       1      0       1
1    A   2.5      1       0      1       0
2    B  42.0      0       1      0       0

In [39]: df.dtypes
Out[39]:
Col1                object
Col2               float64
Apple     Sparse[int32, 0]
Banana    Sparse[int32, 0]
Grape     Sparse[int32, 0]
Orange    Sparse[int32, 0]
dtype: object

In [40]: df.memory_usage()
Out[40]:
Index     128
Col1       24
Col2       24
Apple      16    #  <--- NOTE!
Banana     16    #  <--- NOTE!
Grape       8    #  <--- NOTE!
Orange      8    #  <--- NOTE!
dtype: int64
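
With only three rows the savings shown above are tiny. The following sketch compares the two representations on a larger synthetic frame; the label names, row count, and labels per row are made up for illustration, and the actual byte counts depend on platform and library versions.

```python
import random

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Synthetic data (illustrative assumption): 1000 rows, each with 3 of 200 labels.
random.seed(0)
labels = [f"item_{i}" for i in range(200)]
rows = [random.sample(labels, 3) for _ in range(1000)]

# Dense encoding: one integer cell per (row, label) pair.
dense_mlb = MultiLabelBinarizer()
dense = pd.DataFrame(dense_mlb.fit_transform(rows), columns=dense_mlb.classes_)

# Sparse encoding: only the non-zero cells are stored.
sparse_mlb = MultiLabelBinarizer(sparse_output=True)
sparse = pd.DataFrame.sparse.from_spmatrix(
    sparse_mlb.fit_transform(rows), columns=sparse_mlb.classes_
)

dense_bytes = int(dense.memory_usage(index=False).sum())
sparse_bytes = int(sparse.memory_usage(index=False).sum())
print(f"dense: {dense_bytes} bytes, sparse: {sparse_bytes} bytes")
```

Because each row carries only a few labels, almost every cell is 0, which is the case where the sparse layout pays off.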

Dense solution

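import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Default sparse_output=False: fit_transform returns a regular NumPy array
mlb = MultiLabelBinarizer()
df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('Col3')),
                          columns=mlb.classes_,
                          index=df.index))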

Result:

In [77]: df
Out[77]:
  Col1  Col2  Apple  Banana  Grape  Orange
0    C  33.0      1       1      0       1
1    A   2.5      1       0      1       0
2    B  42.0      0       1      0       0
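
Note that df.pop('Col3') removes the list column from df in place. If the original frame should stay untouched, the dense approach could be wrapped in a small helper along these lines. This is only a sketch: the function name one_hot_list_column is hypothetical and not part of pandas or scikit-learn.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer


def one_hot_list_column(frame, column):
    """Return a new frame with `column` (lists) replaced by 0/1 indicator columns.

    Hypothetical helper for illustration; the input frame is left unmodified.
    """
    mlb = MultiLabelBinarizer()
    encoded = pd.DataFrame(
        mlb.fit_transform(frame[column]),
        columns=mlb.classes_,
        index=frame.index,  # keep the original index so join() aligns rows
    )
    return frame.drop(columns=column).join(encoded)


df = pd.DataFrame({
    "Col1": ["C", "A", "B"],
    "Col2": [33, 2.5, 42],
    "Col3": [["Apple", "Orange", "Banana"], ["Apple", "Grape"], ["Banana"]],
})
result = one_hot_list_column(df, "Col3")
print(result)
```

The indicator columns come out in sorted label order, matching the column order in the results shown in the answer.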

For "python - How to one-hot-encode from a pandas column containing lists?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/45312377/
