python - How to split a DataFrame column's values into multiple columns

Tags: python pandas dataframe split

Below is my DataFrame, in which several fields are merged into a single column:

   PLUGS\nDESIGN\nGEAR
0  700\nDaewoo 8000  Gearless   
1  300\nHyundai 4400  Gearless   
2  600\nSTX 2600  Gearless   
3  200\nB170 \nGeared   
4  362 Wenchong 1700 Mk II \nGeared   
5  252\nRichMax 1550  Gearless   
6  220\nCV 1100 Plus \nGeared   
7  232\nOrskov Mk VII  Gearless   
8  119\nKouan 1000  Gearless   
9  100\nHanjin 700  Gearless

I want to split this column into three separate columns: PLUGS, DESIGN and GEAR. Is there a way to do this?

Here is the code I tried:

new_df[['PLUGS', 'DESIGN', 'GEAR']] = new_df['PLUGS\nDESIGN\nGEAR'].str.split(' ')
print(new_df)

Expected output:

   PLUGS  DESIGN               GEAR
0  700    Daewoo 8000          Gearless   
1  300    Hyundai 4400         Gearless   
2  600    STX 2600             Gearless   
3  200    B170                 Geared   
4  362    Wenchong 1700 Mk II  Geared   
5  252    RichMax 1550         Gearless   
6  220    CV 1100 Plus         Geared   
7  232    Orskov Mk VII        Gearless   
8  119    Kouan 1000           Gearless   
9  100    Hanjin 700           Gearless

Best answer

As suggested in the comments, a regular expression works well here.

Sample DataFrame:

>>> df
                   PLUGS\nDESIGN\nGEAR
0        700\nDaewoo 8000  Gearless
1       300\nHyundai 4400  Gearless
2           600\nSTX 2600  Gearless
3                200\nB170 \nGeared
4  362 Wenchong 1700 Mk II \nGeared
5       252\nRichMax 1550  Gearless
6        220\nCV 1100 Plus \nGeared
7      232\nOrskov Mk VII  Gearless
8         119\nKouan 1000  Gearless
9            100\nHanjin 700  Gearless
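To reproduce the sample frame, something like the following sketch may help. It assumes, as the patterns in this answer do, that each `\n` in the data is a literal backslash followed by the letter n rather than a real newline; raw strings are used to keep it that way.

```python
import pandas as pd

# Raw strings keep "\n" as two literal characters (backslash + n),
# which is the assumption behind the patterns used in this answer.
rows = [
    r"700\nDaewoo 8000  Gearless",
    r"300\nHyundai 4400  Gearless",
    r"600\nSTX 2600  Gearless",
    r"200\nB170 \nGeared",
    r"362 Wenchong 1700 Mk II \nGeared",
    r"252\nRichMax 1550  Gearless",
    r"220\nCV 1100 Plus \nGeared",
    r"232\nOrskov Mk VII  Gearless",
    r"119\nKouan 1000  Gearless",
    r"100\nHanjin 700  Gearless",
]

# A single column whose name also contains the literal "\n" sequences.
df = pd.DataFrame({r"PLUGS\nDESIGN\nGEAR": rows})
print(df)
```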

First, remove the \n sequences from the column name, which makes it easier to read and to work with:

>>> df.columns = df.columns.str.replace(r"\\n", " ", regex=True)

Now the column name no longer contains any special characters:

>>> df
                     PLUGS DESIGN GEAR
0        700\nDaewoo 8000  Gearless
1       300\nHyundai 4400  Gearless
2           600\nSTX 2600  Gearless
3                200\nB170 \nGeared
4  362 Wenchong 1700 Mk II \nGeared
5       252\nRichMax 1550  Gearless
6        220\nCV 1100 Plus \nGeared
7      232\nOrskov Mk VII  Gearless
8         119\nKouan 1000  Gearless
9            100\nHanjin 700  Gearless
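If the regex escaping in the rename step feels opaque, a plain literal replacement should give the same column name. This is only a sketch: the one-row frame is a stand-in for the real data, and it again assumes the header holds a literal backslash followed by n.

```python
import pandas as pd

# One-row stand-in for the real frame; the header holds literal backslash + n.
df = pd.DataFrame({r"PLUGS\nDESIGN\nGEAR": [r"700\nDaewoo 8000  Gearless"]})

# regex=False treats the pattern as plain text, so "\\n" here is simply
# the two characters backslash + n, with no regex escaping involved.
df.columns = df.columns.str.replace("\\n", " ", regex=False)

print(df.columns.tolist())
```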

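Now we can use pandas.Series.str.extract. With this method, every capture group in the regular expression becomes a column in the result: named groups use the group name as the column name, while unnamed groups get the default integer names 0, 1, 2.

The groups used here are unnamed, so we rename the resulting columns in the same statement to get the desired result, as shown below: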

>>> df = df['PLUGS DESIGN GEAR'].str.extract(r"^(\d+)[\\n\s]+([^\\]+)[\\n\s]+([\\n|^Gear][a-z]+)").rename(columns={0: 'PLUGS', 1: 'DESIGN', 2: 'GEAR'})

Result:

>>> print(df)
  PLUGS                DESIGN      GEAR
0   700          Daewoo 8000   Gearless
1   300         Hyundai 4400   Gearless
2   600             STX 2600   Gearless
3   200                 B170     Geared
4   362  Wenchong 1700 Mk II     Geared
5   252         RichMax 1550   Gearless
6   220         CV 1100 Plus     Geared
7   232        Orskov Mk VII   Gearless
8   119           Kouan 1000   Gearless
9   100           Hanjin 700   Gearless
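As a variation, the column names can come from named groups, which makes the rename step unnecessary. The sketch below is not the pattern from the answer: it swaps the character classes for an explicit alternation and uses a lazy DESIGN group so that trailing spaces are not captured. It assumes that every GEAR value starts with "Gear", as in the sample data, and it uses only three of the sample rows.

```python
import pandas as pd

# Three rows taken from the sample data, with literal backslash + n sequences.
df = pd.DataFrame({"PLUGS DESIGN GEAR": [
    r"700\nDaewoo 8000  Gearless",
    r"200\nB170 \nGeared",
    r"362 Wenchong 1700 Mk II \nGeared",
]})

# (?:\\n|\s)+ matches a run of separators: literal backslash + n, or whitespace.
# (?P<NAME>...) names a group; str.extract uses that name as the column name.
# Assumption: GEAR values always start with "Gear" (true for the sample data).
pattern = (
    r"^(?P<PLUGS>\d+)(?:\\n|\s)+"
    r"(?P<DESIGN>.+?)(?:\\n|\s)+"
    r"(?P<GEAR>Gear\w+)\s*$"
)

out = df["PLUGS DESIGN GEAR"].str.extract(pattern)
print(out)
```

The extracted PLUGS values are still strings; converting them with `out["PLUGS"].astype(int)` is one option if numbers are needed.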

Regex explanation:

You can test the pattern on regex101.com:

(\d+)[\\n\s]+([^\\]+)[\\n\s]+([\|^Gear][a-z]+)

First capturing group (\d+)

    \d matches a digit (equivalent to [0-9])
    + matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)

Separator [\\n\s]+ (between the first and second groups)

    Match a single character present in the list below [\\n\s]
    + matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
    \\ matches the character \ literally (case sensitive)
    n matches the character n literally (case sensitive)
    \s matches any whitespace character (equivalent to [\r\n\t\f\v ])

Second capturing group ([^\\]+)

    Match a single character not present in the list below [^\\]
    + matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
    \\ matches the character \ literally (case sensitive)

Separator [\\n\s]+ (between the second and third groups)

    Match a single character present in the list below [\\n\s]
    + matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
    \\ matches the character \ literally (case sensitive)
    n matches the character n literally (case sensitive)
    \s matches any whitespace character (equivalent to [\r\n\t\f\v ])

Third capturing group ([\|^Gear][a-z]+)

    Match a single character present in the list below [\|^Gear]
    \| matches the character | literally (case sensitive)
    ^Gear matches a single character in the list ^Gear (case sensitive), that is, any one of ^, G, e, a, r and not the word Gear
    Match a single character present in the list below [a-z]
    + matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
    a-z matches a single character in the range between a (index 97) and z (index 122) (case sensitive)

Global pattern flags

    g modifier: global. All matches (don't return after first match)
    m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
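One point from the breakdown is easy to miss: [\\n\s] is a character class, so it matches a backslash, the letter n, or whitespace individually, not the two-character sequence \n as a unit. The short sketch below uses only Python's re module to contrast the class with an explicit alternation. The alternation is an illustrative alternative, not part of the answer's pattern.

```python
import re

# Raw string: contains a literal backslash followed by the letter n.
text = r"700\nDaewoo"

# Character class: "\", "n" and whitespace each match on their own.
as_class = re.compile(r"[\\n\s]+")

# Alternation: the two-character sequence backslash + n, or whitespace.
as_sequence = re.compile(r"(?:\\n|\s)+")

# Both find the literal "\n" separator in the sample value.
print(as_class.findall(text))
print(as_sequence.findall(text))

# They differ on a plain "n": the class also matches the letters in "Hanjin".
print(as_class.findall("Hanjin 700"))     # ['n', 'n ']
print(as_sequence.findall("Hanjin 700"))  # [' ']
```

In the answer's pattern this looseness does no harm on the sample data, because the greedy second group backtracks only as far as it has to.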

For more on python - How to split a DataFrame column's values into multiple columns, see the similar question on Stack Overflow: https://stackoverflow.com/questions/68296245/
