python - How to consistently one-hot encode dataframes whose values change?

I receive a stream of content as dataframes, and each batch has different values in its columns. For example, one batch might look like this:

day1_data = {'state': ['MS', 'OK', 'VA', 'NJ', 'NM'], 
            'city': ['C', 'B', 'G', 'Z', 'F'], 
            'age': [27, 19, 63, 40, 93]}

And another one like this:

day2_data = {'state': ['AL', 'WY', 'VA'], 
            'city': ['A', 'B', 'E'], 
            'age': [42, 52, 73]}

How can I one-hot encode the columns so that every batch returns the same number of columns?

If I call pandas' get_dummies() on each batch, it returns a different number of columns:

import pandas as pd

df1 = pd.get_dummies(pd.DataFrame(day1_data))
df2 = pd.get_dummies(pd.DataFrame(day2_data))

len(df1.columns) == len(df2.columns)  # False: 11 columns vs. 7

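I can obtain all possible values for each column. The question is: even with that information, what is the simplest way to one-hot encode each daily batch so that the number of columns stays consistent?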

Best answer

OK, since all possible values are known in advance, here is a slightly hacky way to do it.

import numpy as np
import pandas as pd

# This is a one time process
# Keep all the possible data here in lists
# Can add other categorical variables too which have this type of data
all_possible_states = ['AL', 'MS', 'OK', 'VA', 'NJ', 'NM', 'CD', 'WY']
all_possible_cities = ['A', 'B', 'C', 'D', 'E', 'G', 'Z', 'F']

# Declare our transformer class
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

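class MyOneHotEncoder(BaseEstimator, TransformerMixin):

    def __init__(self, all_possible_values):
        # Keep the constructor argument so BaseEstimator.get_params() works
        self.all_possible_values = all_possible_values
        self.le = LabelEncoder()
        self.ohe = OneHotEncoder()
        # Fit both encoders once, on the full list of possible values
        self.ohe.fit(self.le.fit_transform(all_possible_values).reshape(-1, 1))

    def fit(self, X, y=None):
        # Already fitted in __init__; nothing is learned from a single batch
        return self

    def transform(self, X, y=None):
        return self.ohe.transform(self.le.transform(X).reshape(-1, 1)).toarray()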

# Allow the transformer to see all the data here
encoders = {}
encoders['state'] = MyOneHotEncoder(all_possible_states)
encoders['city'] = MyOneHotEncoder(all_possible_cities)
# Do this for all categorical columns

# Now this is our method which will be used on the incoming data 
def encode(df):

    tup = (encoders['state'].transform(df['state']), 
           encoders['city'].transform(df['city']),
           # Add all other columns which are not to be transformed
           df[['age']])

    return np.hstack(tup)

# Testing:
day1_data = pd.DataFrame({'state': ['MS', 'OK', 'VA', 'NJ', 'NM'], 
        'city': ['C', 'B', 'G', 'Z', 'F'], 
        'age': [27, 19, 63, 40, 93]})

print(encode(day1_data))
[[  0.   0.   1.   0.   0.   0.   0.   0.   0.   0.   1.   0.   0.   0.
    0.   0.  27.]
 [  0.   0.   0.   0.   0.   1.   0.   0.   0.   1.   0.   0.   0.   0.
    0.   0.  19.]
 [  0.   0.   0.   0.   0.   0.   1.   0.   0.   0.   0.   0.   0.   0.
    1.   0.  63.]
 [  0.   0.   0.   1.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   1.  40.]
 [  0.   0.   0.   0.   1.   0.   0.   0.   0.   0.   0.   0.   0.   1.
    0.   0.  93.]]


day2_data = pd.DataFrame({'state': ['AL', 'WY', 'VA'], 
            'city': ['A', 'B', 'E'], 
            'age': [42, 52, 73]})

print(encode(day2_data))
[[  1.   0.   0.   0.   0.   0.   0.   0.   1.   0.   0.   0.   0.   0.
    0.   0.  42.]
 [  0.   0.   0.   0.   0.   0.   0.   1.   0.   1.   0.   0.   0.   0.
    0.   0.  52.]
 [  0.   0.   0.   0.   0.   0.   1.   0.   0.   0.   0.   0.   1.   0.
    0.   0.  73.]]
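
As a possible simplification: scikit-learn 0.20 and later let OneHotEncoder take string columns directly and accept a fixed `categories` list, so the LabelEncoder step may not be needed there. The following is a minimal sketch under that assumption, not part of the original answer. `encode_batch` is a hypothetical helper name, and `handle_unknown='ignore'` is an added choice that encodes any unseen value as an all-zero row instead of raising an error.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Full vocabularies, sorted so the column order matches the output above
all_possible_states = sorted(['AL', 'MS', 'OK', 'VA', 'NJ', 'NM', 'CD', 'WY'])
all_possible_cities = sorted(['A', 'B', 'C', 'D', 'E', 'G', 'Z', 'F'])

# One encoder for both columns; categories are fixed up front
ohe = OneHotEncoder(categories=[all_possible_states, all_possible_cities],
                    handle_unknown='ignore')

# fit() still needs some data, but the categories come from the lists above
ohe.fit(pd.DataFrame({'state': all_possible_states,
                      'city': all_possible_cities}))

def encode_batch(df):
    # Hypothetical helper: one-hot encode the categoricals, append 'age'
    onehot = ohe.transform(df[['state', 'city']]).toarray()
    return np.hstack([onehot, df[['age']].to_numpy()])

day1 = pd.DataFrame({'state': ['MS', 'OK', 'VA', 'NJ', 'NM'],
                     'city': ['C', 'B', 'G', 'Z', 'F'],
                     'age': [27, 19, 63, 40, 93]})
day2 = pd.DataFrame({'state': ['AL', 'WY', 'VA'],
                     'city': ['A', 'B', 'E'],
                     'age': [42, 52, 73]})

out1 = encode_batch(day1)
out2 = encode_batch(day2)
print(out1.shape, out2.shape)
```

Both batches should come out with the same width: 8 state columns, 8 city columns, and the age column.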

Please read the comments in the code carefully, and ask me if anything is still unclear.
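
Since the question starts from get_dummies(), a pandas-only alternative may also be worth considering. It is not part of the original answer. The sketch below assumes the same lists of possible values and converts each column to a categorical dtype with fixed categories before calling get_dummies(), which then emits a column for every category whether or not it appears in the batch. `encode_with_categories` is a hypothetical helper name.

```python
import pandas as pd

# Assumed full vocabularies for each categorical column
categories = {
    'state': ['AL', 'CD', 'MS', 'NJ', 'NM', 'OK', 'VA', 'WY'],
    'city': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'Z'],
}

def encode_with_categories(df):
    # Hypothetical helper: fix the category set, then one-hot encode
    df = df.copy()
    for col, values in categories.items():
        df[col] = pd.Categorical(df[col], categories=values)
    return pd.get_dummies(df)

day1 = pd.DataFrame({'state': ['MS', 'OK', 'VA', 'NJ', 'NM'],
                     'city': ['C', 'B', 'G', 'Z', 'F'],
                     'age': [27, 19, 63, 40, 93]})
day2 = pd.DataFrame({'state': ['AL', 'WY', 'VA'],
                     'city': ['A', 'B', 'E'],
                     'age': [42, 52, 73]})

enc1 = encode_with_categories(day1)
enc2 = encode_with_categories(day2)
print(list(enc1.columns) == list(enc2.columns))
```

One design difference from the approach above: the result stays a DataFrame with named columns such as state_AL and city_A, and a value outside the listed categories becomes a row of all zeros rather than an error.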

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/48034035/
