Below is my DataFrame, in which three columns have been merged into one:
PLUGS\nDESIGN\nGEAR
0 700\nDaewoo 8000 Gearless
1 300\nHyundai 4400 Gearless
2 600\nSTX 2600 Gearless
3 200\nB170 \nGeared
4 362 Wenchong 1700 Mk II \nGeared
5 252\nRichMax 1550 Gearless
6 220\nCV 1100 Plus \nGeared
7 232\nOrskov Mk VII Gearless
8 119\nKouan 1000 Gearless
9 100\nHanjin 700 Gearless
I want to split this column into three separate columns: PLUGS, DESIGN and GEAR. Is there a way to do that?
Here is the code I tried:
new_df[['PLUGS', 'DESIGN', 'GEAR']] = new_df['PLUGS\nDESIGN\nGEAR'].str.split(' ')
print(new_df)
Expected output:
PLUGS DESIGN GEAR
0 700 Daewoo 8000 Gearless
1 300 Hyundai 4400 Gearless
2 600 STX 2600 Gearless
3 200 B170 Geared
4 362 Wenchong 1700 Mk II Geared
5 252 RichMax 1550 Gearless
6 220 CV 1100 Plus Geared
7 232 Orskov Mk VII Gearless
8 119 Kouan 1000 Gearless
9 100 Hanjin 700 Gearless
Best answer
As suggested in the comments, a regular expression should work well here.
Sample DataFrame:
>>> df
PLUGS\nDESIGN\nGEAR
0 700\nDaewoo 8000 Gearless
1 300\nHyundai 4400 Gearless
2 600\nSTX 2600 Gearless
3 200\nB170 \nGeared
4 362 Wenchong 1700 Mk II \nGeared
5 252\nRichMax 1550 Gearless
6 220\nCV 1100 Plus \nGeared
7 232\nOrskov Mk VII Gearless
8 119\nKouan 1000 Gearless
9 100\nHanjin 700 Gearless
First, remove the \n sequences from the column name; this improves readability and makes the column easier to work with.
>>> df.columns = df.columns.str.replace(r"\\n", " ", regex=True)
Now the column name no longer contains any special characters:
>>> df
PLUGS DESIGN GEAR
0 700\nDaewoo 8000 Gearless
1 300\nHyundai 4400 Gearless
2 600\nSTX 2600 Gearless
3 200\nB170 \nGeared
4 362 Wenchong 1700 Mk II \nGeared
5 252\nRichMax 1550 Gearless
6 220\nCV 1100 Plus \nGeared
7 232\nOrskov Mk VII Gearless
8 119\nKouan 1000 Gearless
9 100\nHanjin 700 Gearless
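To reproduce the frame above, here is a minimal sketch. It assumes the \n shown in the printed output is a literal backslash followed by the letter n (two characters) rather than a real newline, which is what the r"\\n" pattern in the rename step implies. The variable name rows is only illustrative.

```python
import pandas as pd

# Assumption: each "\n" is a literal backslash + "n", so raw strings are used.
rows = [
    r"700\nDaewoo 8000 Gearless",
    r"300\nHyundai 4400 Gearless",
    r"600\nSTX 2600 Gearless",
    r"200\nB170 \nGeared",
    r"362 Wenchong 1700 Mk II \nGeared",
    r"252\nRichMax 1550 Gearless",
    r"220\nCV 1100 Plus \nGeared",
    r"232\nOrskov Mk VII Gearless",
    r"119\nKouan 1000 Gearless",
    r"100\nHanjin 700 Gearless",
]
df = pd.DataFrame({r"PLUGS\nDESIGN\nGEAR": rows})

# Replace every literal backslash-n in the column name with a space.
df.columns = df.columns.str.replace(r"\\n", " ", regex=True)
print(df.columns.tolist())  # ['PLUGS DESIGN GEAR']
```

If the real data contained actual newline characters instead, the pattern would need to be "\n" rather than r"\\n".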
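Now we can use pandas.Series.str.extract. With this method, every capture group () in the regex becomes a column in the result.
Named groups would supply the column names directly; the groups used here are unnamed, so the resulting columns get the default labels 0, 1, 2.
We can therefore rename them all at once to get the desired result, as follows: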
>>> df = df['PLUGS DESIGN GEAR'].str.extract(r"^(\d+)[\\n\s]+([^\\]+)[\\n\s]+([\\n|^Gear][a-z]+)").rename(columns={0: 'PLUGS', 1: 'DESIGN', 2: 'GEAR'})
Result:
>>> print(df)
PLUGS DESIGN GEAR
0 700 Daewoo 8000 Gearless
1 300 Hyundai 4400 Gearless
2 600 STX 2600 Gearless
3 200 B170 Geared
4 362 Wenchong 1700 Mk II Geared
5 252 RichMax 1550 Gearless
6 220 CV 1100 Plus Geared
7 232 Orskov Mk VII Gearless
8 119 Kouan 1000 Gearless
9 100 Hanjin 700 Gearless
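As a variation on the call above, the groups can be given names so that str.extract labels the columns itself and no rename is needed. This is a minimal sketch, not the answer's exact code: it reuses the answer's pattern with (?P<name>...) added, assumes the cells contain a literal backslash followed by n, and uses only four of the sample rows. The final str.strip() is an addition that drops the space left in front of \n in some rows.

```python
import pandas as pd

# Assumption: "\n" in the cells is a literal backslash + "n", hence raw strings.
s = pd.Series(
    [
        r"700\nDaewoo 8000 Gearless",
        r"200\nB170 \nGeared",
        r"362 Wenchong 1700 Mk II \nGeared",
        r"100\nHanjin 700 Gearless",
    ],
    name="PLUGS DESIGN GEAR",
)

# Same pattern as in the answer, but each group is named,
# so the group names become the column names of the result.
pattern = (
    r"^(?P<PLUGS>\d+)[\\n\s]+"
    r"(?P<DESIGN>[^\\]+)[\\n\s]+"
    r"(?P<GEAR>[\\n|^Gear][a-z]+)"
)

out = s.str.extract(pattern)
# Remove the trailing space captured before "\n" in rows such as "B170 \nGeared".
out["DESIGN"] = out["DESIGN"].str.strip()
print(out)
```

All extracted values are strings; if PLUGS should be numeric, out["PLUGS"].astype(int) can be applied afterwards.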
Regex explanation:
You can test the pattern on regex101.com:
(\d+)[\\n\s]+([^\\]+)[\\n\s]+([\\n|^Gear][a-z]+)
1st capturing group (\d+)
\d matches a digit (equivalent to [0-9])
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
Match a single character present in the list below [\\n\s]
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
\\ matches the character \ literally (case sensitive)
n matches the character n literally (case sensitive)
\s matches any whitespace character (equivalent to [\r\n\t\f\v ])
2nd capturing group ([^\\]+)
Match a single character not present in the list below [^\\]
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
\\ matches the character \ literally (case sensitive)
Match a single character present in the list below [\\n\s]
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
\\ matches the character \ literally (case sensitive)
n matches the character n literally (case sensitive)
\s matches any whitespace character (equivalent to [\r\n\t\f\v ])
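3rd capturing group ([\\n|^Gear][a-z]+)
Match a single character present in the list below [\\n|^Gear]
\\ matches the character \ literally (case sensitive)
n|^Gear matches a single character in the list n|^Gear (case sensitive); inside a character class, | and a non-leading ^ are literal characters, not alternation or an anchor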
Match a single character present in the list below [a-z]
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
a-z matches a single character in the range between a (index 97) and z (index 122) (case sensitive)
Global pattern flags
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
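As the breakdown shows, [\\n\s] is a character class: it matches a backslash, the letter n, or whitespace one character at a time, not the two-character sequence \n as a unit. The sketch below illustrates the difference using Python's re module. The pattern named strict is a hypothetical tighter alternative and is not part of the original answer, whose pattern still works on the sample rows shown.

```python
import re

loose = re.compile(r"[\\n\s]+")      # character class used in the answer
strict = re.compile(r"(?:\\n|\s)+")  # hypothetical: backslash-n as a unit, or whitespace

literal_sep = r" \n"  # a space, then a literal backslash and "n"

# Both patterns accept the real separator.
loose_ok = loose.fullmatch(literal_sep) is not None
strict_ok = strict.fullmatch(literal_sep) is not None

# Only the character class also accepts a run of plain "n" letters.
loose_on_n = loose.fullmatch("nnn") is not None
strict_on_n = strict.fullmatch("nnn") is not None

print(loose_ok, strict_ok, loose_on_n, strict_on_n)  # True True True False
```

The looser class matters only if a design name could be confused with a separator, for example one where the letter n appears directly before the gear value.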
Regarding python - how to split a DataFrame column's values into multiple columns, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/68296245/