python - How do I order a categorical month variable when plotting with matplotlib?

Tags: python pandas dataframe datetime matplotlib

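I am doing some topic modelling and want to show how the average topic weight changes over time. The problem arises when I plot it with matplotlib (version 3.3.4). On the x-axis I want the categorical month_year variable, but it is not ordered in a sensible way. As suggested in other Stack Overflow posts, I have tried to make sure the dtype of the pandas column is an ordered categorical, using the following code: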

monthlabels = ['Nov 19','Dec 19','Jan 20','Feb 20','Mar 20', 'Apr 20','May 20','Jun 20','Jul 20','Aug 20', 'Sep 20', 'Okt 20','Nov 20','Dec 20','Jan 21','Feb 21']
category_month = pd.CategoricalDtype(categories = monthlabels , ordered = True)

df['month_year']=df['month_year'].astype(category_month)

df['month_year'].dtypes
### output ###
CategoricalDtype(categories=['Nov 19', 'Dec 19', 'Jan 20', 'Feb 20', 'Mar 20', 'Apr 20',
                  'May 20', 'Jun 20', 'Jul 20', 'Aug 20', 'Sep 20', 'Okt 20',
                  'Nov 20', 'Dec 20', 'Jan 21', 'Feb 21'],
                 ordered=True)

However, when I plot the average weights of the first 9 topics with the following code, the months are still out of order.

plt.figure(figsize=(14, 8), dpi=80)
for i in range(1, 10):
    plt.plot('month_year', 'average_weight', data = df[df['topic_id']==i], label = "Topic {}".format(i))
    
plt.legend()


Any ideas on how to fix this?

Edit: the following can be used to create a test data frame

test = {
    'month_year' : ['Okt 20','Okt 20','Okt 20', 
                    'Jan 20','Jan 20','Jan 20',
                    'Jan 21','Jan 21','Jan 21',
                    'Feb 21','Feb 21','Feb 21',
                    'Nov 19','Nov 19','Nov 19',
                    'Dec 19','Dec 19','Dec 19',
                    'Feb 20','Feb 20','Feb 20',
                    'Mar 20','Mar 20','Mar 20',
                    'Apr 20','Apr 20','Apr 20',
                    'May 20','May 20','May 20',
                    'Jun 20','Jun 20','Jun 20',
                    'Jul 20','Jul 20','Jul 20',
                    'Aug 20','Aug 20','Aug 20',
                    'Nov 20','Nov 20','Nov 20',
                    'Dec 20','Dec 20','Dec 20',
                    'Sep 20','Sep 20','Sep 20'],
    'topic_id' : [1, 2, 3]*16,
    'average_weight' : [0.0034448771785276057,0.00234510088697649,0.004074211769665663,0.008929628932562012,0.013741873628579272,0.0033314566617497266,0.004239432615204117,0.012250019864250835,0.013073026411569653,0.0020715684200135562,0.002658988134219096,0.00582952833829973,0.0027180065711339316,0.0057726953512965105,0.0055539998022887185,0.018381623288568776,0.0061883432074235035,0.007737642207827706,0.0045695560208211345,0.0024893487063355935,0.006388474864741931,0.004562876933516982,0.00800004672521773,0.0019508447462263016,0.0024570989697120893,0.005440877392314947,0.006958154412225271,0.035187635445394196,0.0034783523505887925,0.014961680677982096,0.005622866414385113,0.002655701866852288,0.0022439579296199314,0.007044070218804771,0.0032079321863121213,0.0025985821304469617,0.017684469631747815,0.0148618754616377,0.01631911248241339,0.0011055421114840424,0.0016653659358988743,0.01217493533488271,0.001419802304537931,0.0017606995911196841,0.006776685929581973,0.010324044291131124,0.004357617965337888,0.005569919780210301]
}
df_test = pd.DataFrame(test) 

Best answer

A reliable solution is to convert the month_year column from str to datetime and let pandas sort the values itself; there is no need for a custom CategoricalDtype:

# I have to replace 'Okt' with 'Oct' so that the English '%b' format can parse it; you may not need this line
df_test['month_year'] = df_test['month_year'].replace({'Okt': 'Oct'}, regex = True)

df_test['time'] = pd.to_datetime(df_test['month_year'], format = '%b %y')

This gives you a data frame like this:

   month_year  topic_id  average_weight       time
0      Oct 20         1        0.003445 2020-10-01
1      Oct 20         2        0.002345 2020-10-01
2      Oct 20         3        0.004074 2020-10-01
3      Jan 20         1        0.008930 2020-01-01
4      Jan 20         2        0.013742 2020-01-01
5      Jan 20         3        0.003331 2020-01-01
6      Jan 21         1        0.004239 2021-01-01
7      Jan 21         2        0.012250 2021-01-01
8      Jan 21         3        0.013073 2021-01-01
9      Feb 21         1        0.002072 2021-02-01
10     Feb 21         2        0.002659 2021-02-01
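
To illustrate why the conversion helps, here is a minimal sketch on four made-up labels (not the full test data). Sorting the raw strings gives alphabetical order, whereas sorting on the parsed `time` column gives chronological order. It assumes English month abbreviations, since `%b` is typically parsed according to the current locale.

```python
import pandas as pd

# Four illustrative labels, deliberately out of chronological order
df_demo = pd.DataFrame({'month_year': ['Oct 20', 'Jan 20', 'Jan 21', 'Nov 19']})

# Plain string sorting compares characters, so 'Jan 21' lands before 'Nov 19'
alphabetical = sorted(df_demo['month_year'])

# Parsing to datetime maps each label to the first day of its month
df_demo['time'] = pd.to_datetime(df_demo['month_year'], format='%b %y')

# Sorting on the datetime column restores the calendar order
chronological = df_demo.sort_values(by='time')['month_year'].tolist()

print(alphabetical)
print(chronological)
```

The same `sort_values(by='time')` call is what keeps each topic's line running left to right in the plot.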

Then you can plot:

import matplotlib.pyplot as plt
import matplotlib.dates as md

fig, ax = plt.subplots(figsize = (14, 8), dpi = 80)

for topic in df_test['topic_id'].unique():
    df_tmp = df_test[df_test['topic_id'] == topic].sort_values(by = 'time')
    ax.plot(df_tmp['time'], df_tmp['average_weight'], label = f'Topic {topic}')

ax.xaxis.set_major_locator(md.MonthLocator(interval = 1))
ax.xaxis.set_major_formatter(md.DateFormatter('%b %y'))

ax.legend(frameon = True)

plt.show()

matplotlib.dates.MonthLocator and matplotlib.dates.DateFormatter let you customize the x-axis tick labels as needed.

Complete code

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as md


test = {'month_year': ['Okt 20', 'Okt 20', 'Okt 20',
                       'Jan 20', 'Jan 20', 'Jan 20',
                       'Jan 21', 'Jan 21', 'Jan 21',
                       'Feb 21', 'Feb 21', 'Feb 21',
                       'Nov 19', 'Nov 19', 'Nov 19',
                       'Dec 19', 'Dec 19', 'Dec 19',
                       'Feb 20', 'Feb 20', 'Feb 20',
                       'Mar 20', 'Mar 20', 'Mar 20',
                       'Apr 20', 'Apr 20', 'Apr 20',
                       'May 20', 'May 20', 'May 20',
                       'Jun 20', 'Jun 20', 'Jun 20',
                       'Jul 20', 'Jul 20', 'Jul 20',
                       'Aug 20', 'Aug 20', 'Aug 20',
                       'Nov 20', 'Nov 20', 'Nov 20',
                       'Dec 20', 'Dec 20', 'Dec 20',
                       'Sep 20', 'Sep 20', 'Sep 20'],
        'topic_id': [1, 2, 3]*16,
        'average_weight': [0.0034448771785276057, 0.00234510088697649, 0.004074211769665663, 0.008929628932562012,
                           0.013741873628579272, 0.0033314566617497266, 0.004239432615204117, 0.012250019864250835,
                           0.013073026411569653, 0.0020715684200135562, 0.002658988134219096, 0.00582952833829973,
                           0.0027180065711339316, 0.0057726953512965105, 0.0055539998022887185, 0.018381623288568776,
                           0.0061883432074235035, 0.007737642207827706, 0.0045695560208211345, 0.0024893487063355935,
                           0.006388474864741931, 0.004562876933516982, 0.00800004672521773, 0.0019508447462263016,
                           0.0024570989697120893, 0.005440877392314947, 0.006958154412225271, 0.035187635445394196,
                           0.0034783523505887925, 0.014961680677982096, 0.005622866414385113, 0.002655701866852288,
                           0.0022439579296199314, 0.007044070218804771, 0.0032079321863121213, 0.0025985821304469617,
                           0.017684469631747815, 0.0148618754616377, 0.01631911248241339, 0.0011055421114840424,
                           0.0016653659358988743, 0.01217493533488271, 0.001419802304537931, 0.0017606995911196841,
                           0.006776685929581973, 0.010324044291131124, 0.004357617965337888, 0.005569919780210301]}
df_test = pd.DataFrame(test)


df_test['month_year'] = df_test['month_year'].replace({'Okt': 'Oct'}, regex = True)
df_test['time'] = pd.to_datetime(df_test['month_year'], format = '%b %y')


fig, ax = plt.subplots(figsize = (14, 8), dpi = 80)

for topic in df_test['topic_id'].unique():
    df_tmp = df_test[df_test['topic_id'] == topic].sort_values(by = 'time')
    ax.plot(df_tmp['time'], df_tmp['average_weight'], label = f'Topic {topic}')

ax.xaxis.set_major_locator(md.MonthLocator(interval = 1))
ax.xaxis.set_major_formatter(md.DateFormatter('%b %y'))

ax.legend(frameon = True)

plt.show()
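
If you would rather keep the ordered categorical from the question, one possible alternative (a sketch that is not part of the original answer) is to plot against the integer category codes instead of the labels. It appears that matplotlib places string categories along the axis in the order it first encounters them, so the order stored in the dtype does not reach the plot by itself. The data below is made up for illustration, and the column name `x` is an assumption.

```python
import pandas as pd

# Shortened label list; the question's full monthlabels works the same way
monthlabels = ['Nov 19', 'Dec 19', 'Jan 20', 'Feb 20']
category_month = pd.CategoricalDtype(categories=monthlabels, ordered=True)

# Illustrative rows in shuffled order
df_cat = pd.DataFrame({
    'month_year': ['Jan 20', 'Nov 19', 'Feb 20', 'Dec 19'],
    'average_weight': [0.3, 0.1, 0.4, 0.2],
})
df_cat['month_year'] = df_cat['month_year'].astype(category_month)

# Position of each label within monthlabels (0 for 'Nov 19', 1 for 'Dec 19', ...)
df_cat['x'] = df_cat['month_year'].cat.codes

# Sorting by the codes follows the order defined in the categorical dtype
df_sorted = df_cat.sort_values(by='x')
print(df_sorted)
```

The `x` and `average_weight` columns of `df_sorted` can then be passed to `ax.plot`, with `ax.set_xticks(range(len(monthlabels)))` and `ax.set_xticklabels(monthlabels)` restoring the month labels on the axis.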


This page on "python - How do I order a categorical month variable when plotting with matplotlib?" is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/70904347/
