python - How to have categorical (factor) variables in Python

Tags: python pandas dataframe categorical-data

    age          income  student  credit_rating  Class_buys_computer
0   youth        high    no       fair           no
1   youth        high    no       excellent      no
2   middle_aged  high    no       fair           yes
3   senior       medium  no       fair           yes
4   senior       low     yes      fair           yes
5   senior       low     yes      excellent      no
6   middle_aged  low     yes      excellent      yes
7   youth        medium  no       fair           no
8   youth        low     yes      fair           yes
9   senior       medium  yes      fair           yes
10  youth        medium  yes      excellent      yes
11  middle_aged  medium  no       excellent      yes
12  middle_aged  high    yes      fair           yes
13  senior       medium  no       excellent      no

I am working with this dataset and want variables such as age and income to behave like factor variables in R. How can I do that in Python?

Best answer

You can use astype with the argument 'category':

# Columns to convert to the pandas 'category' dtype
cols = ['age', 'income', 'student']

for col in cols:
    df[col] = df[col].astype('category')

print(df.dtypes)
age                    category
income                 category
student                category
credit_rating            object
Class_buys_computer      object
dtype: object
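
For readers coming from R, it may help to see what a converted column exposes. The following is a minimal sketch, assuming a small DataFrame built from the first five rows of the question's table (the names age_levels and age_codes are illustrative only). It shows the .cat accessor, which plays roughly the role of levels() and the integer codes behind an R factor:

```python
import pandas as pd

# Sample data modelled on the first five rows of the question's table.
df = pd.DataFrame({
    "age": ["youth", "youth", "middle_aged", "senior", "senior"],
    "income": ["high", "high", "high", "medium", "low"],
    "student": ["no", "no", "no", "no", "yes"],
})

# Convert each column to the 'category' dtype, as in the answer.
for col in ["age", "income", "student"]:
    df[col] = df[col].astype("category")

# Distinct categories; without an explicit order they are sorted lexically.
age_levels = list(df["age"].cat.categories)

# Zero-based integer code of each row's category (R factors are one-based).
age_codes = df["age"].cat.codes.tolist()

print(age_levels)  # ['middle_aged', 'senior', 'youth']
print(age_codes)   # [2, 2, 0, 1, 1]
```

Note that the default category order is alphabetical, not the logical youth < middle_aged < senior order.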

If you need to convert all columns:

for col in df.columns:
    df[col] = df[col].astype('category')

print (df.dtypes)
age                    category
income                 category
student                category
credit_rating          category
Class_buys_computer    category
dtype: object

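The loop was needed in the pandas version current when this answer was written, because converting the whole DataFrame at once:

df = df.astype('category')

raised:

NotImplementedError: > 1 ndim Categorical are not supported at this time

Newer pandas releases (0.23 and later, to my knowledge) accept df.astype('category') and convert each column separately.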

See the pandas documentation on categorical data.
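
If the explicit loop feels verbose, one alternative is to pass astype a dictionary mapping column names to dtypes, which converts column by column. This is a sketch on a made-up three-row frame, not part of the original answer:

```python
import pandas as pd

# Hypothetical miniature of the question's data.
df = pd.DataFrame({
    "age": ["youth", "senior", "middle_aged"],
    "income": ["high", "low", "medium"],
    "credit_rating": ["fair", "excellent", "fair"],
})

cols = ["age", "income"]

# A dict passed to astype is applied per column, so only the listed
# columns change dtype; the others are left untouched.
df = df.astype({col: "category" for col in cols})

print(df.dtypes)
```

Columns left out of the dictionary, such as credit_rating here, keep their original dtype.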

Edit based on comments:

If you need the categories to be ordered, use pandas.Categorical instead:

df['age'] = pd.Categorical(df['age'], categories=["youth", "middle_aged", "senior"], ordered=True)

print (df.age)
0           youth
1           youth
2     middle_aged
3          senior
4          senior
5          senior
6     middle_aged
7           youth
8           youth
9          senior
10          youth
11    middle_aged
12    middle_aged
13         senior
Name: age, dtype: category
Categories (3, object): [youth < middle_aged < senior]
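
An ordered categorical supports more than sorting. The sketch below, using a short made-up Series rather than the full dataset, shows that comparisons and min/max follow the declared category order, much like an ordered factor in R:

```python
import pandas as pd

# Hypothetical sample values; the category order matches the answer.
raw = ["youth", "senior", "middle_aged", "youth"]
age = pd.Series(
    pd.Categorical(raw, categories=["youth", "middle_aged", "senior"], ordered=True)
)

# Comparisons use the declared order, not alphabetical order.
older_than_youth = (age > "youth").tolist()

# min() and max() are defined only for ordered categoricals.
youngest = age.min()
oldest = age.max()

print(older_than_youth)  # [False, True, True, False]
print(youngest, oldest)  # youth senior
```

The comparison value must be one of the declared categories; comparing against any other string raises an error.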

Then you can sort the DataFrame by the age column:

df = df.sort_values('age')
print (df)
            age  income student credit_rating Class_buys_computer
0         youth    high      no          fair                  no
1         youth    high      no     excellent                  no
7         youth  medium      no          fair                  no
8         youth     low     yes          fair                 yes
10        youth  medium     yes     excellent                 yes
2   middle_aged    high      no          fair                 yes
6   middle_aged     low     yes     excellent                 yes
11  middle_aged  medium      no     excellent                 yes
12  middle_aged    high     yes          fair                 yes
3        senior  medium      no          fair                 yes
4        senior     low     yes          fair                 yes
5        senior     low     yes     excellent                  no
9        senior  medium     yes          fair                 yes
13       senior  medium      no     excellent                  no
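
To see why the explicit category order matters for sorting, here is a small sketch on invented rows that compares the result for plain strings with the result for the ordered categorical:

```python
import pandas as pd

# Hypothetical rows, deliberately out of order.
df = pd.DataFrame({
    "age": ["senior", "youth", "middle_aged", "youth"],
    "income": ["low", "high", "medium", "low"],
})

# Plain string columns sort alphabetically.
lexical = df.sort_values("age")["age"].tolist()

# After conversion, sorting follows the declared category order.
df["age"] = pd.Categorical(
    df["age"], categories=["youth", "middle_aged", "senior"], ordered=True
)
logical = df.sort_values("age")["age"].tolist()

print(lexical)  # ['middle_aged', 'senior', 'youth', 'youth']
print(logical)  # ['youth', 'youth', 'middle_aged', 'senior']
```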

Source: the Stack Overflow question "How to have categorical (factor) variables in Python": https://stackoverflow.com/questions/38954525/
