python - Capturing numbers in a string and storing them in a dataframe in Python

I'm fairly new to Python and have been playing with pandas and numpy for a few months. This is my first post here, so please let me know if I've left anything out.

I want to extract atom counts from molecular formulas stored as a column in a dataframe. The strings look like this:

C55H85N17O25S4

The problem is that my current code extracts some atoms, such as C, H, N or O, but not S (or Cl or Br), and I can't figure out why.

My current code looks like this:

import re  # needed for re.escape() below

import numpy as np
import pandas as pd

myfile = "whatever.csv"
data = pd.read_csv(myfile, sep='|', header=0)

# create the columns for atoms
atoms = ['C', 'H', 'O', 'N', 'Cl', 'S', 'Br']
for col in atoms:
    data[col] = np.nan

# parse molecular_formula for atoms using regex and add the number into the corresponding column
# (np.where is used directly; the pd.np alias is no longer available in recent pandas)
for col in atoms:
    data[col] = np.where(data.molecular_formula.str.contains(col), data.molecular_formula.str.extract(re.escape(col) + r'(\d{1,})'), '0')

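I know that if a letter in the string is not followed by a number, I will capture NaN rather than a number, and I'm fine with that: I can replace the NaN with "1", as long as I get "0" when the atom is not in the molecular formula (though there may be a more elegant way to do this).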

For this example, my current output is:

molecular_formula   C       H       O       N       Cl      S      Br
C55H85N17O25S4      55      85      25      17      0       0      0

Whereas I would like:

molecular_formula   C       H       O       N       Cl      S      Br
C55H85N17O25S4      55      85      25      17      0       4      0

I think the problem lies in my str.extract(), because if I change the code to

data[col] = np.where(data.molecular_formula.str.contains(col), 1, 0)

I do get something like:

molecular_formula   C       H       O       N       Cl      S      Br
C55H85N17O25S4      1       1       1       1       0       1      0
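
One way to narrow this down is to run the same pattern through the standard `re` module, outside pandas. The sketch below is only a sanity check on the regex, using the example formula from the question; the `found` dictionary is a name introduced here for illustration. If the pattern finds the count for S here, that suggests the pattern itself is fine and the issue is more likely in how the `str.extract()` result is combined with `np.where`.

```python
import re

formula = "C55H85N17O25S4"
atoms = ["C", "H", "O", "N", "Cl", "S", "Br"]

# Same pattern as in the loop above: the escaped symbol followed by one or more digits.
found = {}
for atom in atoms:
    match = re.search(re.escape(atom) + r"(\d{1,})", formula)
    # Keep the captured digits as a string, or None when the symbol+digits is absent.
    found[atom] = match.group(1) if match else None

print(found)
```

With this formula, `found["S"]` is the string `"4"`, while `Cl` and `Br` map to `None` because they do not occur.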

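Update: I added a few extra lines to count single atoms, which should be counted as "1" when they sit at the end of the molecular formula, or in the middle but not followed by a number.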

#When the single atom is at the end of the molecular formula:
data.loc[data.molecular_formula.str.contains(r'[C]$') == True, 'C'] = 1
data.loc[data.molecular_formula.str.contains(r'[H]$') == True, 'H'] = 1
data.loc[data.molecular_formula.str.contains(r'[S]$') == True, 'S'] = 1
data.loc[data.molecular_formula.str.contains(r'[O]$') == True, 'O'] = 1
data.loc[data.molecular_formula.str.contains(r'[N]$') == True, 'N'] = 1
data.loc[data.molecular_formula.str.contains(r'[C][l]$') == True, 'Cl'] = 1
data.loc[data.molecular_formula.str.contains(r'[N][a]$') == True, 'Na'] = 1
data.loc[data.molecular_formula.str.contains(r'[B][r]$') == True, 'Br'] = 1

#When the single atom is somewhere inside the molecular formula:
data.loc[data.molecular_formula.str.contains(r'.*[C][l]\D') == True, 'Cl'] = 1
data.loc[data.molecular_formula.str.contains(r'.*[C]\D') == True, 'C'] = 1
data.loc[data.molecular_formula.str.contains(r'.*[B][r]\D') == True, 'Br'] = 1
data.loc[data.molecular_formula.str.contains(r'.*[N][a]\D') == True, 'Na'] = 1
data.loc[data.molecular_formula.str.contains(r'.*[N]\D') == True, 'N'] = 1
data.loc[data.molecular_formula.str.contains(r'.*[H]\D') == True, 'H'] = 1
data.loc[data.molecular_formula.str.contains(r'.*[S]\D') == True, 'S'] = 1
data.loc[data.molecular_formula.str.contains(r'.*[O]\D') == True, 'O'] = 1

#Convert the atom columns into int:
for col in atoms:
    data[col] = pd.to_numeric(data[col])

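It's quick and dirty, and I will have to go through these lines and use lazy regexes to solve the problem of atoms with two letters such as "Br" or "Na". But these lines combined with @jxc's answer give the output I want.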
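
A possible alternative for the two-letter problem is a negative lookahead rather than a lazy quantifier. The sketch below is an assumption about what would help here, not part of the original post: `has_bare_atom` is a hypothetical helper that reports whether a symbol occurs without a count and without being the first letter of a longer symbol such as Cl or Na.

```python
import re


def has_bare_atom(formula, atom):
    """Return True if `atom` occurs with no count and is not the start of a longer symbol."""
    # (?![a-z\d]) rejects a following lowercase letter (e.g. the "l" in "Cl")
    # and a following digit (an explicit count). It also succeeds at end of string.
    pattern = re.escape(atom) + r"(?![a-z\d])"
    return re.search(pattern, formula) is not None


print(has_bare_atom("C22H16ClN3OS2", "C"))   # C22 has a count, and the other C belongs to Cl
print(has_bare_atom("C22H16ClN3OS2", "Cl"))  # Cl is followed by N, so it is a single atom
print(has_bare_atom("C34H53NaO8", "N"))      # the N here belongs to Na
```

The same pattern string could be passed to `str.contains()` in place of the separate end-of-string and `\D` patterns above, since the lookahead covers both cases.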

Best answer

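If you are using pandas 0.18.0+, you can try extractall() to retrieve all atom+count combinations, then use pivot() or unstack() to move the atoms into columns. After that, reindex() and fillna() fill in the missing atoms. See the example below (tested on Pandas 0.23.4):

Update: On Pandas 0.24+, the pd.pivot() function raises a KeyError, and some changes to this function make it incompatible with version 0.23.4. Use unstack() instead in new code: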

df = pd.DataFrame([('C55H85N17O25S4',),('C23H65',),(None,), (None,), ('C22H16ClN3OS2',)
         , ('C37H42Cl2N2O6',), ('C21H30BrNO4',), ('C11H13ClN2',), ('C34H53NaO8',), ('A0',)
    ],columns=['molecular_formula'])
#  molecular_formula
#0    C55H85N17O25S4
#1            C23H65
#2              None
#3              None
#4     C22H16ClN3OS2
#5     C37H42Cl2N2O6
#6       C21H30BrNO4
#7        C11H13ClN2
#8        C34H53NaO8
#9                A0

# list of concerned atoms 
atoms = ['C', 'H', 'O', 'N', 'Cl','S','Br']

# regex pattern
atom_ptn = r'(?P<atom>' + r'|'.join(atoms) + r')(?P<cnt>\d+)'
print(atom_ptn)
#(?P<atom>C|H|O|N|Cl|S|Br)(?P<cnt>\d+)

# extract the combo of atom vs number and pivot them into desired table format 
df1 = df.molecular_formula.str.extractall(atom_ptn) \
        .reset_index(level=1, drop=True) \
        .set_index('atom', append=True) \
        .unstack(1)

# remove the level-0 from the column indexing
df1.columns = [ c[1] for c in df1.columns ]

# reindex df1 and join the result with the original df, then fillna() 
df.join(df1.reindex(columns=atoms)).fillna({c:0 for c in atoms}, downcast='infer')
#  molecular_formula   C   H   O   N Cl  S  Br
#0    C55H85N17O25S4  55  85  25  17  0  4   0
#1            C23H65  23  65   0   0  0  0   0
#2              None   0   0   0   0  0  0   0
#3              None   0   0   0   0  0  0   0
#4     C22H16ClN3OS2  22  16   0   3  0  2   0
#5     C37H42Cl2N2O6  37  42   6   2  2  0   0
#6       C21H30BrNO4  21  30   4   0  0  0   0
#7        C11H13ClN2  11  13   0   2  0  0   0
#8        C34H53NaO8  34  53   8   0  0  0   0
#9                A0   0   0   0   0  0  0   0

As of Pandas 0.24.0, we can use DataFrame.droplevel() and then do everything in one chain:

df.join(df.molecular_formula.str.extractall(atom_ptn) 
          .droplevel(1)
          .set_index('atom', append=True) 
          .unstack(1) 
          .droplevel(0, axis=1) 
          .reindex(columns=atoms) 
   ).fillna({c:0 for c in atoms}, downcast='infer')

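UPDATE-2 (May 13, 2019):

Per the comments, atoms with no number should be assigned a constant 1. See the two modifications below:

Regex:
- cnt should allow the empty string, so change (?P<cnt>\d+) to (?P<cnt>\d*)
- atoms must be sorted so that longer strings are tested before shorter ones. This matters because regex alternation tries sub-patterns from left to right. Sorting ensures Cl is tested before C; otherwise Cl would never match.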
```
# sort the list of atoms by length, longest first (the sort is stable, so ties keep their order)
atoms_sorted = sorted(atoms, key=len, reverse=True)

# the new pattern based on list of atoms_sorted and \d* on cnt
atom_ptn = r'(?P<atom>' + r'|'.join(atoms_sorted) + r')(?P<cnt>\d*)'
print(atom_ptn)
#(?P<atom>Cl|Br|C|H|O|N|S)(?P<cnt>\d*)
```
To test this, you can run df.molecular_formula.str.extractall(atom_ptn) with atom_ptn built from the sorted list and then from the unsorted list, and compare the results.
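
The effect of the ordering can also be seen without pandas. The following is a minimal sketch using the standard `re` module and one formula from the sample data; the names `unsorted_ptn`, `sorted_ptn`, `unsorted_hits` and `sorted_hits` are introduced here for illustration only, and the groups are unnamed to keep the `findall()` output short.

```python
import re

atoms = ["C", "H", "O", "N", "Cl", "S", "Br"]
formula = "C22H16ClN3OS2"

# Alternation built from the list as given: "C" is tried before "Cl".
unsorted_ptn = "(" + "|".join(atoms) + r")(\d*)"
# Alternation built from the list sorted longest-first: "Cl" is tried before "C".
sorted_ptn = "(" + "|".join(sorted(atoms, key=len, reverse=True)) + r")(\d*)"

unsorted_hits = re.findall(unsorted_ptn, formula)
sorted_hits = re.findall(sorted_ptn, formula)

print(unsorted_hits)
print(sorted_hits)
```

With the unsorted list, the chlorine atom is reported as a second `C` with an empty count; with the sorted list it is reported as `Cl`.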

Then use fillna(1) for every atom that the regex pattern above matched with zero digits, as shown below:

df.join(df.molecular_formula.str.extractall(atom_ptn)
          .fillna(1)
          .droplevel(1)
          .set_index('atom', append=True)
          .unstack(1)
          .droplevel(0, axis=1)
          .reindex(columns=atoms)
   ).fillna({c:0 for c in atoms}, downcast='infer')
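
As a cross-check on the counting rules (missing count means 1, absent atom means 0), the same logic can be sketched in plain Python. This is not the answer's method: `count_atoms` is a hypothetical helper, and it differs from the pandas chain in two ways that are assumptions made here. It adds a `(?![a-z])` lookahead so that the N in Na is not counted as nitrogen, and it sums repeated symbols instead of pivoting them.

```python
import re

ATOMS = ["C", "H", "O", "N", "Cl", "S", "Br"]

# Longest symbols first, then a lookahead so "N" does not match inside "Na",
# then an optional count.
ATOM_PTN = re.compile(
    "(" + "|".join(sorted(ATOMS, key=len, reverse=True)) + r")(?![a-z])(\d*)"
)


def count_atoms(formula):
    """Return a dict of atom counts; a missing count means 1, an absent atom means 0."""
    counts = dict.fromkeys(ATOMS, 0)
    if not isinstance(formula, str):  # None or NaN rows get all zeros
        return counts
    for atom, cnt in ATOM_PTN.findall(formula):
        counts[atom] += int(cnt) if cnt else 1
    return counts


print(count_atoms("C22H16ClN3OS2"))
print(count_atoms("C34H53NaO8"))
```

Applied per row, for example with `df.molecular_formula.apply(count_atoms)`, this would give one dictionary per formula that can be compared against the columns produced by the chain above.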

This Q&A on capturing numbers in a string and storing them in a dataframe in Python comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/56060109/
