python - How to split a DataFrame column's values into multiple columns

Below is my DataFrame, in which three fields are merged into a single column:

   PLUGS\nDESIGN\nGEAR
0  700\nDaewoo 8000  Gearless   
1  300\nHyundai 4400  Gearless   
2  600\nSTX 2600  Gearless   
3  200\nB170 \nGeared   
4  362 Wenchong 1700 Mk II \nGeared   
5  252\nRichMax 1550  Gearless   
6  220\nCV 1100 Plus \nGeared   
7  232\nOrskov Mk VII  Gearless   
8  119\nKouan 1000  Gearless   
9  100\nHanjin 700  Gearless

I want to split this column into three separate columns: PLUGS, DESIGN, and GEAR. Is there a way to do that?

Below is the code I tried:

new_df[['PLUGS', 'DESIGN', 'GEAR']] = new_df['PLUGS\nDESIGN\nGEAR'].str.split(' ')
print(new_df)

Expected output:

   PLUGS  DESIGN               GEAR
0  700    Daewoo 8000          Gearless   
1  300    Hyundai 4400         Gearless   
2  600    STX 2600             Gearless   
3  200    B170                 Geared   
4  362    Wenchong 1700 Mk II  Geared   
5  252    RichMax 1550         Gearless   
6  220    CV 1100 Plus         Geared   
7  232    Orskov Mk VII        Gearless   
8  119    Kouan 1000           Gearless   
9  100    Hanjin 700           Gearless

Best answer

As suggested in the comments, a regular expression should work well here.

Sample DataFrame:

>>> df
                   PLUGS\nDESIGN\nGEAR
0        700\nDaewoo 8000  Gearless
1       300\nHyundai 4400  Gearless
2           600\nSTX 2600  Gearless
3                200\nB170 \nGeared
4  362 Wenchong 1700 Mk II \nGeared
5       252\nRichMax 1550  Gearless
6        220\nCV 1100 Plus \nGeared
7      232\nOrskov Mk VII  Gearless
8         119\nKouan 1000  Gearless
9            100\nHanjin 700  Gearless

First, remove the newline markers from the column name so that it is easier to read and to work with.

>>> df.columns = df.columns.str.replace(r"\\n", " ", regex=True)

Now the column name no longer contains any special characters:

>>> df
                     PLUGS DESIGN GEAR
0        700\nDaewoo 8000  Gearless
1       300\nHyundai 4400  Gearless
2           600\nSTX 2600  Gearless
3                200\nB170 \nGeared
4  362 Wenchong 1700 Mk II \nGeared
5       252\nRichMax 1550  Gearless
6        220\nCV 1100 Plus \nGeared
7      232\nOrskov Mk VII  Gearless
8         119\nKouan 1000  Gearless
9            100\nHanjin 700  Gearless
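The renaming step can be reproduced on a small frame. This is a minimal sketch, assuming the header holds a literal backslash followed by `n` (which is what the pattern `r"\\n"` targets); the two-row sample and the extra `|\n` alternative for real newline characters are illustrative additions, not part of the original answer.

```python
import pandas as pd

# Assumption: the header contains a literal backslash followed by "n"
# (two characters), so raw strings are used to keep it literal.
df = pd.DataFrame({r"PLUGS\nDESIGN\nGEAR": [r"700\nDaewoo 8000  Gearless",
                                             r"200\nB170 \nGeared"]})

# r"\\n|\n" also covers a header that holds real newline characters.
df.columns = df.columns.str.replace(r"\\n|\n", " ", regex=True)

print(df.columns.tolist())  # ['PLUGS DESIGN GEAR']
```

Only the column label changes here; the cell values are left as they were.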

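Now we can use pandas.Series.str.extract. With this regex method, every capture group () becomes a column in the result.

Named groups would become the column names directly. The groups used here are unnamed, so the resulting columns get the default labels 0, 1, 2, and we rename them all at once to get the desired result, as shown below: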

>>> df = df['PLUGS DESIGN GEAR'].str.extract(r"^(\d+)[\\n\s]+([^\\]+)[\\n\s]+([\\n|^Gear][a-z]+)").rename(columns={0: 'PLUGS', 1: 'DESIGN', 2: 'GEAR'})

Result:

>>> print(df)
  PLUGS                DESIGN      GEAR
0   700          Daewoo 8000   Gearless
1   300         Hyundai 4400   Gearless
2   600             STX 2600   Gearless
3   200                 B170     Geared
4   362  Wenchong 1700 Mk II     Geared
5   252         RichMax 1550   Gearless
6   220         CV 1100 Plus     Geared
7   232        Orskov Mk VII   Gearless
8   119           Kouan 1000   Gearless
9   100           Hanjin 700   Gearless
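As a variation on the step above, named groups can supply the column names directly, so no `rename` is needed. The following is a sketch under stated assumptions rather than the answer's own method: the four sample rows, the `sep` helper, and the rewritten pattern are hypothetical, and the pattern accepts both a literal backslash + `n` and a real newline as separators because the printed data does not show which one the cells contain.

```python
import pandas as pd

# Hypothetical sample: three cells use a literal backslash + "n", the last
# one uses real newline characters, to cover both readings of the data.
raw = pd.Series([
    r"700\nDaewoo 8000  Gearless",
    r"200\nB170 \nGeared",
    r"362 Wenchong 1700 Mk II \nGeared",
    "220\nCV 1100 Plus \nGeared",
])

# Separator: one or more of a literal "\n" or any whitespace character.
sep = r"(?:\\n|\s)+"

# Named groups become the column names of the extracted DataFrame.
# The lazy .+? stops DESIGN before the final separator, so it carries
# no trailing whitespace.
pattern = rf"^(?P<PLUGS>\d+){sep}(?P<DESIGN>.+?){sep}(?P<GEAR>Gear[a-z]+)\s*$"

out = raw.str.extract(pattern)
print(out)
```

The extracted values are strings; if PLUGS is needed as a number, `out["PLUGS"].astype(int)` would convert it.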

Regex explanation:

You can inspect the pattern at regex101.com. The breakdown below is for the following form of the pattern; the code above additionally anchors it with ^ and uses [\\n|^Gear] as the first class of the third group, which also admits \ and n.

(\d+)[\\n\s]+([^\\]+)[\\n\s]+([\|^Gear][a-z]+)

1st capturing group (\d+)

    \d matches a digit (equivalent to [0-9])
    + matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)

Separator [\\n\s]+

    Match a single character present in the list below [\\n\s]
    + matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
    \\ matches the character \ literally (case sensitive)
    n matches the character n literally (case sensitive)
    \s matches any whitespace character (equivalent to [\r\n\t\f\v ])

2nd capturing group ([^\\]+)

    Match a single character not present in the list below [^\\]
    + matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
    \\ matches the character \ literally (case sensitive)

Separator [\\n\s]+

    Same as the separator described above.

3rd capturing group ([\|^Gear][a-z]+)

    Match a single character present in the list below [\|^Gear]
    \| matches the character | literally (case sensitive)
    ^Gear matches a single character in the list ^Gear (case sensitive)
    Match a single character present in the list below [a-z]
    + matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
    a-z matches a single character in the range between a (index 97) and z (index 122) (case sensitive)

Global pattern flags (regex101 settings, not part of the pattern; pandas.Series.str.extract only returns the first match in each string)

    g modifier: global. All matches (don't return after first match)
    m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
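To see what the breakdown above means in practice, the pattern from the answer's code can be run on a single row with Python's `re` module. This is an illustrative sketch, assuming the cell holds a literal backslash followed by `n`; the test strings "rubbish" and "Xeared" are made up to show that `[\\n|^Gear]` is a character class matching one character, not the word "Gear".

```python
import re

# The pattern from the answer's code, applied to one row.
# Assumption: the cell holds a literal backslash + "n", hence the raw string.
pattern = re.compile(r"^(\d+)[\\n\s]+([^\\]+)[\\n\s]+([\\n|^Gear][a-z]+)")
row = r"700\nDaewoo 8000  Gearless"

groups = pattern.match(row).groups()
print(groups)  # ('700', 'Daewoo 8000 ', 'Gearless'), DESIGN keeps one trailing space

# [\\n|^Gear] is a character class: it matches ONE character out of
# \ n | ^ G e a r, followed here by one or more lowercase letters.
third = re.compile(r"[\\n|^Gear][a-z]+")
print(bool(third.fullmatch("Geared")))   # True
print(bool(third.fullmatch("rubbish")))  # True: "r" is in the class
print(bool(third.fullmatch("Xeared")))   # False: "X" is not in the class
```

The pattern works on this data because every row ends in "Geared" or "Gearless", but the third group would accept other lowercase words too, and the DESIGN value may need `.str.strip()` to drop the trailing space.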

Regarding "python - How to split a DataFrame column's values into multiple columns", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/68296245/
