I have a genre column in a pandas DataFrame. Each value is a comma-separated string of genres.
>>> df['genres_omdb']
0 Crime, Drama
1 Adventure, Family, Fantasy
2 Drama, Mystery
3 Horror, Mystery, Thriller
5 Action, Adventure, Sci-Fi
6 Drama, Romance
8 Drama
9 Animation, Adventure, Comedy
10 Animation, Adventure, Comedy
11 Drama, Sci-Fi
12 Drama
13 Drama, Romance, War
14 Comedy, Drama, Family
16 Comedy, Musical, Romance
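Initially I split it into three columns and ran get_dummies on each of them. That produced duplicate columns (e.g. genre1_Adventure and genre2_Adventure).
Then I tried collecting every unique genre, creating one column per genre, and manually iterating over the rows, setting the value to 1 if that genre is in the row's list.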
all_genres = set()  # collects every distinct genre label
genre1_keys = df['genre1'].value_counts().keys()
genre2_keys = df['genre2'].value_counts().keys()
genre3_keys = df['genre3'].value_counts().keys()
for genre in genre1_keys:
    all_genres.add(genre.strip())
for genre in genre2_keys:
    all_genres.add(genre.strip())
for genre in genre3_keys:
    all_genres.add(genre.strip())
# one indicator column per genre, initialized to 0
for genre in all_genres:
    df[genre] = 0
for i, row in df.iterrows():
    genres = row['genres_omdb'].split(',')
    for genre in genres:
        genre = genre.strip()
        # iterrows() yields copies, so write through df.at instead of row[genre]
        df.at[i, genre] = 1
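This is very messy, and I know there must be a better way to do it. Any help cleaning up this code would be greatly appreciated.
Best answer
I think you just need str.get_dummies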
df['genres_omdb'].str.get_dummies(sep=', ')
Out[115]:
Action Adventure Animation Comedy Crime Drama Family Fantasy \
0 0 0 0 0 1 1 0 0
1 0 1 0 0 0 0 1 1
2 0 0 0 0 0 1 0 0
3 0 0 0 0 0 0 0 0
5 1 1 0 0 0 0 0 0
6 0 0 0 0 0 1 0 0
8 0 0 0 0 0 1 0 0
9 0 1 1 1 0 0 0 0
10 0 1 1 1 0 0 0 0
11 0 0 0 0 0 1 0 0
12 0 0 0 0 0 1 0 0
13 0 0 0 0 0 1 0 0
14 0 0 0 1 0 1 1 0
16 0 0 0 1 0 0 0 0
Horror Musical Mystery Romance Sci-Fi Thriller War
0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0
2 0 0 1 0 0 0 0
3 1 0 1 0 0 1 0
5 0 0 0 0 1 0 0
6 0 0 0 1 0 0 0
8 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0
11 0 0 0 0 1 0 0
12 0 0 0 0 0 0 0
13 0 0 0 1 0 0 1
14 0 0 0 0 0 0 0
16 0 1 0 1 0 0 0
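As a minimal end-to-end sketch of the `str.get_dummies` approach, the snippet below builds a small made-up frame with three of the rows from the question (the frame and the names `dummies` and `encoded` are illustrative assumptions, not part of the original post). It assumes the separator is a comma followed by a single space, as in the sample data, and then joins the indicator columns back onto the original frame.

```python
import pandas as pd

# Hypothetical sample mirroring a few rows of the question's column.
df = pd.DataFrame(
    {"genres_omdb": ["Crime, Drama", "Adventure, Family, Fantasy", "Drama"]},
    index=[0, 1, 8],
)

# Split on ", " (comma plus space) so the labels carry no leading whitespace.
dummies = df["genres_omdb"].str.get_dummies(sep=", ")

# Attach the indicator columns to the original frame; the index is preserved.
encoded = df.join(dummies)
print(encoded)
```

Note that `str.get_dummies` does not strip whitespace around the labels, so splitting this data on a bare `','` would yield separate `'Drama'` and `' Drama'` columns.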
Regarding python - encoding a single column into multiple columns with pandas, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48794354/