I have a pandas column (a Series) that looks like this:
full_name = pd.Series([
'Reservoir 1 Compartment 1',
'Reservoir 1 Common Inlet',
'Reservoir 2 Compartment 1',
'Vyrnwy Line 2 Balancing Tank 1',
'Reservoir 1'
])
I want to split it into two columns. The expected output should look like this:
[['Reservoir 1', 'Compartment 1'],
['Reservoir 1', 'Common Inlet'],
['Reservoir 2', 'Compartment 1'],
['Vyrnwy Line 2', 'Balancing Tank 1'],
['Reservoir 1', None]]
I tried this:
res_compartment_split = pd.concat([full_name.str.split(r'\s\s*?(?=[A-Z])', expand=True)])
But I got this output:
[['Reservoir 1', 'Compartment 1', None, None],
['Reservoir 1', 'Common', 'Inlet', None],
['Reservoir 2', 'Compartment 1', None, None],
['Vyrnwy', 'Line 2', 'Balancing', 'Tank 1'],
['Reservoir 1', None, None, None]]
Thanks for your help.
Best answer
Try the following:
import pandas as pd
full_name = pd.Series([
'Reservoir 1 Compartment 1',
'Reservoir 1 Common Inlet',
'Reservoir 2 Compartment 1',
'Vyrnwy Line 2 Balancing Tank 1',
'Reservoir 1'
])
# Raw string so that \d and \s reach the regex engine unchanged
res = full_name.str.split(r'(?<=\d)\s+(?=[A-Z])', expand=True)
Output:
>>> res
               0                 1
0    Reservoir 1     Compartment 1
1    Reservoir 1      Common Inlet
2    Reservoir 2     Compartment 1
3  Vyrnwy Line 2  Balancing Tank 1
4    Reservoir 1              None
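Explanation of the regex pattern:
- (?<=\d) - positive lookbehind: ensures the separator is preceded by a digit, without consuming it
- \s+ - the separator: matches one or more whitespace characters
- (?=[A-Z]) - positive lookahead: ensures the separator is immediately followed by an uppercase letter (A to Z), without consuming it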
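The three parts can also be checked outside pandas. The following is a minimal sketch using the standard-library re module on a few of the sample strings; the names PATTERN, samples and parts are only illustrative and are not part of the original answer.

```python
import re

# Same pattern as in the answer: whitespace that is preceded by a digit
# and followed by an uppercase letter. The lookarounds consume nothing,
# so only the whitespace itself is removed by the split.
PATTERN = re.compile(r'(?<=\d)\s+(?=[A-Z])')

samples = [
    'Reservoir 1 Compartment 1',
    'Vyrnwy Line 2 Balancing Tank 1',
    'Reservoir 1',
]

# Split each plain string; a string with no match comes back as a one-item list.
parts = [PATTERN.split(s) for s in samples]

for s, p in zip(samples, parts):
    print(s, '->', p)
```

Because the space in "Vyrnwy Line" follows a letter and the space in "Tank 1" precedes a digit, neither one satisfies both lookarounds, so each string is split at most once.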
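Use regex101.com to see it in action.
Also, you can see here why your pattern did not work: https://regex101.com/r/nSmEEs/1 . In short, it splits at every whitespace that comes before an uppercase letter, whether or not a digit precedes it.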
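For a quick local comparison of the two patterns, here is a small sketch with the standard-library re module. The names ORIGINAL, FIXED and text are hypothetical and chosen only for this illustration.

```python
import re

# Pattern from the question: any whitespace followed by an uppercase letter.
ORIGINAL = re.compile(r'\s\s*?(?=[A-Z])')
# Pattern from the answer: the whitespace must also be preceded by a digit.
FIXED = re.compile(r'(?<=\d)\s+(?=[A-Z])')

text = 'Vyrnwy Line 2 Balancing Tank 1'

original_parts = ORIGINAL.split(text)
fixed_parts = FIXED.split(text)

print(original_parts)
print(fixed_parts)
```

The pattern from the question breaks before "Line", "Balancing" and "Tank", which is where the extra columns come from; the lookbehind in the answer leaves only the break after "2".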
Regarding "python - Splitting a column in Pandas", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/72670379/