Python 3 pandas: flagging data in a DataFrame using strings vs. regular expressions

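I have two ways of doing the same thing and would like to know which one is more efficient:

The first approach loads a list from a text file or an array and uses that list to flag rows in the DataFrame: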

import pandas as pd

ban_list = ['Al Gore', 'Kim jong-un','Kim jong un','Kim Jong Un', 'Al Sharpton','Kim jong il', 'Richard Johnson', 'Dick Johnson']

df=pd.DataFrame({'Users': [ 'Al Gore', 'Kim jong il', 'Kim jong un', 'Al Sharpton', 'James', 'Richard Johnson', 'Bill Gates', 'Alf pig', 'Dick Johnson', 'Python Monte'],
                 'Time': ['D','D','N','D','L','N', 'N','L','L','N']})

df['Banned'] = ''


for name in ban_list:
    df.loc[df.Users.str.contains(name) & (df.Banned == ''), 'Banned'] = 'Yes'

The second approach uses regular expression patterns instead of a list of names:

import pandas as pd

# Raw strings keep \s intact; the global (?i) flag must lead the pattern on Python 3.11+
ban_list = [r'(?i)^Al(\s)(Gore|Sharpton)$', r'(?i)^Kim\sjong(\s|-)(il|un)$', r'(?i)^(Dick|Richard)\sJohnson$']

df=pd.DataFrame({'Users': [ 'Al Gore', 'Kim jong il', 'Kim jong un', 'Al Sharpton', 'James', 'Richard Johnson', 'Bill Gates', 'Alf pig', 'Dick Johnson', 'Python Monte'],
                 'Time': ['D','D','N','D','L','N', 'N','L','L','N']})

df['Banned'] = ''


for pattern in ban_list:
    df.loc[df.Users.str.contains(pattern) & (df.Banned == ''), 'Banned'] = 'Yes'

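Both sets of code work and do the same thing. The problems so far are that the first one is case-sensitive (it does not ignore case), and the second one raises this warning: UserWarning: This pattern has match groups. To actually get the groups, use str.extract.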
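The warning comes from the capturing groups `( ... )` in the patterns. A minimal sketch of one way to avoid it, assuming the goal is only a yes/no match: use non-capturing groups `(?: ... )`, pass `flags=re.IGNORECASE` instead of an inline `(?i)`, and join the patterns into a single alternation so that no Python-level loop is needed. The sample user names and the `patterns`, `combined`, and `mask` variables here are illustrative, not part of the original post.

```python
import re
import pandas as pd

# Illustrative data (not the original frame): mixed case and a hyphenated variant
df = pd.DataFrame({'Users': ['Al Gore', 'Kim jong il', 'Kim Jong-Un',
                             'James', 'Alf pig', 'dick johnson']})

# (?:...) groups without capturing, so str.contains has no match groups to warn about
patterns = [r'Al\s(?:Gore|Sharpton)',
            r'Kim\sjong[\s-](?:il|un)',
            r'(?:Dick|Richard)\sJohnson']

# One anchored alternation replaces the loop over separate patterns
combined = r'^(?:' + '|'.join(patterns) + r')$'

# flags=re.IGNORECASE makes the whole match case-insensitive
mask = df['Users'].str.contains(combined, flags=re.IGNORECASE, regex=True)
df['Banned'] = mask.map({True: 'Yes', False: ''})
print(df)
```

Because the combined pattern is evaluated in one vectorized call, the column is scanned once rather than once per pattern.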

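The array in the first approach holds a large list, while the second approach uses regular expressions that involve several matching steps. Which one should I use for efficiency? Or is there another way to improve this?

Best answer

Using isin seems cleaner (at least to me), since you already have a nice list of banned users. You can then map True/False to 'Yes'/'':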

df['Banned'] = df.Users.isin(ban_list).map({True:'Yes',False:''})
print(df)

  Time            Users Banned
0    D          Al Gore    Yes
1    D      Kim jong il    Yes
2    N      Kim jong un    Yes
3    D      Al Sharpton    Yes
4    L            James       
5    N  Richard Johnson    Yes
6    N       Bill Gates       
7    L          Alf pig       
8    L     Dick Johnson    Yes
9    N     Python Monte
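Note that isin compares whole values exactly, so it is case-sensitive. If case should be ignored, one possible sketch is to normalize both sides before comparing. The `banned_lower` and `normalized` names and the sample data are assumptions made for illustration only.

```python
import pandas as pd

# Illustrative data: names that differ from the list only in case or padding
ban_list = ['Al Gore', 'Kim Jong Un', 'Dick Johnson']
df = pd.DataFrame({'Users': ['al gore', 'KIM JONG UN ', 'James', 'Alf pig']})

# Lower-case the ban list once; a set is accepted by isin
banned_lower = {name.lower() for name in ban_list}

# Strip surrounding whitespace and lower-case the column before the exact comparison
normalized = df['Users'].str.strip().str.lower()
df['Banned'] = normalized.isin(banned_lower).map({True: 'Yes', False: ''})
print(df)
```

This keeps the exact-match semantics of isin (no substring matches such as 'Alf pig' against 'Al') while removing the need to list every capitalization variant.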

Of course, if True/False is good enough, you can run just the first part of the command:

df['Banned'] = df.Users.isin(ban_list)
print(df)

  Time            Users Banned
0    D          Al Gore   True
1    D      Kim jong il   True
2    N      Kim jong un   True
3    D      Al Sharpton   True
4    L            James  False
5    N  Richard Johnson   True
6    N       Bill Gates  False
7    L          Alf pig  False
8    L     Dick Johnson   True
9    N     Python Monte  False

Edit: if you have a second list, I would do it as follows:

Adminlist = ['Bill Gates']
df['Banned'] = (df.Users.isin(ban_list).map({True:'Yes',False:''}) +
                df.Users.isin(Adminlist).map({True:'Admin',False:''}))
print(df)

  Time            Users Banned
0    D          Al Gore    Yes
1    D      Kim jong il    Yes
2    N      Kim jong un    Yes
3    D      Al Sharpton    Yes
4    L            James       
5    N  Richard Johnson    Yes
6    N       Bill Gates  Admin
7    L          Alf pig       
8    L     Dick Johnson    Yes
9    N     Python Monte
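One caveat with concatenating the two mapped columns: a user who appears in both lists would end up labeled 'YesAdmin'. As an alternative sketch (not the answer's method), numpy.select picks the first matching condition, so the order of the lists sets the priority. The `admin_list`, `conditions`, and `labels` names and the shortened data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative, shortened data
ban_list = ['Al Gore', 'Dick Johnson']
admin_list = ['Bill Gates']
df = pd.DataFrame({'Users': ['Al Gore', 'James', 'Bill Gates', 'Dick Johnson']})

# Conditions are checked in order; the first True one decides the label
conditions = [df['Users'].isin(ban_list), df['Users'].isin(admin_list)]
labels = ['Yes', 'Admin']

# Rows matching no condition fall back to the empty string
df['Banned'] = np.select(conditions, labels, default='')
print(df)
```

Adding a third list then only means appending one more condition and one more label.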

This question and answer come from a similar question on Stack Overflow: https://stackoverflow.com/questions/23530380/
