python - Split a pipe-delimited Series, group by a separate Series, and return counts of each split value in new columns

Given a DataFrame with a pipe-delimited Series:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'year': [1960, 1960, 1961, 1961, 1961],
                   'genre': ['Drama|Romance|Thriller',
                             'Spy|Mystery|Bio',
                             'Drama|Romance',
                             'Drama|Romance',
                             'Drama|Spy']})

or, printed as a table:

   year                   genre
0  1960  Drama|Romance|Thriller
1  1960         Spy|Mystery|Bio
2  1961           Drama|Romance
3  1961           Drama|Romance
4  1961               Drama|Spy

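I can split the genre Series with str.split (as many similar questions on SO show).
But I also want to group by year and return the counts of Drama, Romance, Thriller, and so on in new columns, with one row per unique year.
My initial attempt: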

df_split = df.groupby('year')['genre'].apply(lambda x: x.str.split('|', expand=True).reset_index(drop=True))

            0        1         2
year                            
1960 0  Drama  Romance  Thriller
     1    Spy  Mystery       Bio
1961 0  Drama  Romance       NaN
     1  Drama  Romance       NaN
     2  Drama      Spy       NaN

But how do I get the count of each unique genre, by year, in its own column?
I can get the unique genres with

genres = pd.unique(df['genre'].str.split('|', expand=True).stack())

but I am still not sure how to turn these genres into separate Series counted by year.
The final output I want is:

      Drama  Romance  Thriller  Spy  Mystery  Bio
1960      1        1         1    1        1    1
1961      3        2         0    1        0    0

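where each unique genre has its own Series holding the corresponding count per year.
This may well be an X-Y problem. My end goal is a percent stacked area chart. Assuming df_split has the desired transformation, I want to do: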

df_perc = df_split.divide(df_split.sum(axis=1), axis=0)

         Drama   Romance  Thriller       Spy   Mystery       Bio
1960  0.166667  0.166667  0.166667  0.166667  0.166667  0.166667
1961  0.500000  0.333333  0.000000  0.166667  0.000000  0.000000

and then

# DataFrame.iteritems() was removed in pandas 2.0; items() is the equivalent
plt.stackplot(df_perc.index, *[ts for col, ts in df_perc.items()],
              labels=df_perc.columns)
plt.gca().set_xticks(df_perc.index)
plt.margins(0)
plt.legend()

which produces the percent stacked area chart.

Best answer

We can get the result you want with some simple reshaping and aggregation:

(df.assign(genre=df['genre'].str.split('|'))
   .explode('genre')
   .groupby('year')['genre']
   .value_counts(normalize=True)
   .unstack(fill_value=0))     
 
genre       Bio     Drama   Mystery   Romance       Spy  Thriller
year                                                             
1960   0.166667  0.166667  0.166667  0.166667  0.166667  0.166667
1961   0.000000  0.500000  0.000000  0.333333  0.166667  0.000000
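The table above holds proportions, while the question's desired output shows raw counts. A minimal sketch, assuming the same sample data as the question: leaving out normalize=True yields the counts table instead. The variable name counts is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'year': [1960, 1960, 1961, 1961, 1961],
                   'genre': ['Drama|Romance|Thriller',
                             'Spy|Mystery|Bio',
                             'Drama|Romance',
                             'Drama|Romance',
                             'Drama|Spy']})

counts = (df.assign(genre=df['genre'].str.split('|'))  # strings -> lists of genres
            .explode('genre')                          # one row per (year, genre)
            .groupby('year')['genre']
            .value_counts()                            # raw counts: normalize=True omitted
            .unstack(fill_value=0))                    # genres become columns, gaps become 0

print(counts)
```

Dividing each row of counts by its row sum, as in the question's df_perc step, should then reproduce the proportions.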

From here you can finish by plotting an area chart:

(df.assign(genre=df['genre'].str.split('|'))
   .explode('genre')
   .groupby('year')['genre']
   .value_counts(normalize=True)
   .unstack(fill_value=0)
   .plot
   .area())
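As a possible alternative to the explode-based chain (this is not the approach used in the answer), Series.str.get_dummies can build one 0/1 indicator column per genre, which can then be summed per year. A sketch on the question's sample data; the names indicators, counts, and props are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'year': [1960, 1960, 1961, 1961, 1961],
                   'genre': ['Drama|Romance|Thriller',
                             'Spy|Mystery|Bio',
                             'Drama|Romance',
                             'Drama|Romance',
                             'Drama|Spy']})

# One indicator column per distinct genre, 1 where the row contains it
indicators = df['genre'].str.get_dummies(sep='|')

# Sum the indicators within each year to get per-year counts
counts = indicators.groupby(df['year']).sum()

# Divide each row by its total to get per-year proportions
props = counts.div(counts.sum(axis=1), axis=0)

print(props)
```

One caveat: an indicator records presence, so a genre repeated inside a single row would be counted once here, whereas the explode-based chain would count every occurrence. The sample data has no such repeats, so both approaches agree on it.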

How it works
Start by exploding the data across rows:

df.assign(genre=df['genre'].str.split('|')).explode('genre') 

   year     genre
0  1960     Drama
0  1960   Romance
0  1960  Thriller
1  1960       Spy
1  1960   Mystery
1  1960       Bio
2  1961     Drama
2  1961   Romance
3  1961     Drama
3  1961   Romance
4  1961     Drama
4  1961       Spy
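Note that the exploded frame repeats the original row labels (0, 0, 0, 1, ...). That is harmless for the groupby that follows, but if a fresh index is wanted, explode accepts ignore_index=True. A small sketch, assuming pandas 1.1 or later for that parameter; the names exploded and renumbered are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'year': [1960, 1960, 1961, 1961, 1961],
                   'genre': ['Drama|Romance|Thriller',
                             'Spy|Mystery|Bio',
                             'Drama|Romance',
                             'Drama|Romance',
                             'Drama|Spy']})

split = df.assign(genre=df['genre'].str.split('|'))

# Default behaviour: each new row keeps the label of the row it came from
exploded = split.explode('genre')
print(exploded.index.tolist())    # [0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 4, 4]

# ignore_index=True renumbers the rows 0..n-1 instead
renumbered = split.explode('genre', ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```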

Next, do a groupby and get the normalized counts:

_.groupby('year')['genre'].value_counts(normalize=True)

year  genre   
1960  Bio         0.166667
      Drama       0.166667
      Mystery     0.166667
      Romance     0.166667
      Spy         0.166667
      Thriller    0.166667
1961  Drama       0.500000
      Romance     0.333333
      Spy         0.166667
Name: genre, dtype: float64
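To make the normalization concrete: within each year, value_counts(normalize=True) should equal each genre's raw count divided by that year's total number of genre entries (6 for both years in the sample data). The sketch below checks this by hand; counts, manual, and normalized are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'year': [1960, 1960, 1961, 1961, 1961],
                   'genre': ['Drama|Romance|Thriller',
                             'Spy|Mystery|Bio',
                             'Drama|Romance',
                             'Drama|Romance',
                             'Drama|Spy']})

exploded = df.assign(genre=df['genre'].str.split('|')).explode('genre')
grouped = exploded.groupby('year')['genre']

# Raw counts per (year, genre)
counts = grouped.value_counts()

# Divide each count by the total for its year
manual = counts / counts.groupby(level='year').transform('sum')

# Built-in normalization for comparison
normalized = grouped.value_counts(normalize=True)

# Largest absolute difference between the two; expected to be zero up to rounding
print((manual.sort_index() - normalized.sort_index()).abs().max())
```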

Next, unstack the result:

_.unstack(fill_value=0)

genre       Bio     Drama   Mystery   Romance       Spy  Thriller
year                                                             
1960   0.166667  0.166667  0.166667  0.166667  0.166667  0.166667
1961   0.000000  0.500000  0.000000  0.333333  0.166667  0.000000
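The fill_value=0 argument matters for the chart: genres that never occur in a year have no row in the stacked Series, so a plain unstack leaves NaN in those cells. A brief sketch contrasting the two on the sample data, and checking that each year's shares sum to 1; wide_nan, wide, and row_sums are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'year': [1960, 1960, 1961, 1961, 1961],
                   'genre': ['Drama|Romance|Thriller',
                             'Spy|Mystery|Bio',
                             'Drama|Romance',
                             'Drama|Romance',
                             'Drama|Spy']})

stacked = (df.assign(genre=df['genre'].str.split('|'))
             .explode('genre')
             .groupby('year')['genre']
             .value_counts(normalize=True))

wide_nan = stacked.unstack()            # absent (year, genre) pairs become NaN
wide = stacked.unstack(fill_value=0)    # absent pairs become 0 instead

print(wide_nan.loc[1961, 'Bio'])        # NaN: no Bio film in 1961
print(wide.loc[1961, 'Bio'])            # zero share instead of NaN

# Each year's shares should add up to 1
row_sums = wide.sum(axis=1)
print(row_sums)
```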

Finally, plot with

_.plot.area()
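The _ in the steps above is the last-result variable of an interactive IPython or Jupyter session, so those snippets only work when run one after another in such a session. In a plain script, named intermediates can play the same role. A sketch, where exploded, shares_long, and shares are names chosen here for illustration:

```python
import pandas as pd

df = pd.DataFrame({'year': [1960, 1960, 1961, 1961, 1961],
                   'genre': ['Drama|Romance|Thriller',
                             'Spy|Mystery|Bio',
                             'Drama|Romance',
                             'Drama|Romance',
                             'Drama|Spy']})

# Step 1: one row per (year, genre)
exploded = df.assign(genre=df['genre'].str.split('|')).explode('genre')

# Step 2: share of each genre within its year
shares_long = exploded.groupby('year')['genre'].value_counts(normalize=True)

# Step 3: genres as columns, missing combinations filled with 0
shares = shares_long.unstack(fill_value=0)

print(shares.round(3))
```

Calling shares.plot.area() on the result then draws the chart, provided matplotlib is installed.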

For more on this topic, see the corresponding question on Stack Overflow: https://stackoverflow.com/questions/62973539/
