Given the following table:
df = pd.DataFrame({'code':['100M','60M10N40M','5S99M','1S25I100M','1D1S1I200M']})
It looks like this:
         code
0        100M
1   60M10N40M
2       5S99M
3   1S25I100M
4  1D1S1I200M
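I want to convert the strings in the code column to numbers, where M, N, and D each mean (multiply by 1), I means (multiply by -1), and S means (multiply by 0).
The result should look like this: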
         code  Val
0        100M  100  This is (100*1)
1   60M10N40M  110  This is (60*1)+(10*1)+(40*1)
2       5S99M   99  This is (5*0)+(99*1)
3   1S25I100M   75  This is (1*0)+(25*-1)+(100*1)
4  1D1S1I200M  200  This is (1*1)+(1*0)+(1*-1)+(200*1)
I wrote the following function for this:
import re

def String2Val(String):
    # Generate substrings: each is a run of digits followed by one letter
    sstrings = re.findall('.[^A-Z]*.', String)
    KeyDict = {'M':'*1','N':'*1','I':'*-1','S':'*0','D':'*1'}
    newlist = []
    for key, value in KeyDict.items():
        for i in sstrings:
            if key in i:
                # Turn e.g. '25I' into '25*-1' and evaluate it
                p = i.replace(key, value)
                lp = eval(p)
                newlist.append(lp)
    OutputVal = sum(newlist)
    return OutputVal

df['Val'] = df.apply(lambda row: String2Val(row['code']), axis = 1)
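After applying this function to the table, I realized that it is inefficient and takes a long time on large datasets. How can I optimize this process?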
Best answer
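Because pandas string methods are not optimized (although this no longer seems to be the case as of pandas 2.0), if you are after performance it is best to use Python string methods (which are compiled in C) in a loop. A simple loop over each string appears to give the best performance here.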
def evaluater(s):
    total, curr = 0, ''
    for e in s:
        # if a digit, concatenate it to the number collected so far
        if e.isdigit():
            curr += e
        # if a letter, look up its value in KeyDict,
        # multiply the currently collected number by it
        # and add the result to the total
        else:
            total += int(curr) * KeyDict[e]
            curr = ''
    return total

KeyDict = {'M': 1, 'N': 1, 'I': -1, 'S': 0, 'D': 1}
df['val'] = df['code'].map(evaluater)
Performance:
KeyDict1 = {'M':'*1+','N':'*1+','I':'*-1+','S':'*0+','D':'*1+'}
df = pd.DataFrame({'code':['100M','60M10N40M','5S99M','1S25I100M','1D1S1I200M']*1000})
%timeit df.assign(val=df['code'].map(evaluater))
# 12.2 ms ± 579 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.assign(val=df['code'].apply(String2Val)) # @Marcelo Paco
# 61.8 ms ± 2.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.assign(val=df['code'].replace(KeyDict1, regex=True).str.rstrip('+').apply(pd.eval)) # @Ynjxsjmh
# 4.86 s ± 155 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
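%timeit is an IPython magic, so the measurements above only run in IPython or a notebook. Outside of one, a comparable measurement can be sketched with the standard-library timeit module. This is only a rough sketch under stated assumptions: it times the loop over a plain Python list rather than a DataFrame column, the repeat count of 10 is arbitrary, and the absolute numbers will not match those quoted above.

```python
import timeit

KeyDict = {'M': 1, 'N': 1, 'I': -1, 'S': 0, 'D': 1}

def evaluater(s):
    # Same loop as in the answer: collect digits, then weight them by the letter
    total, curr = 0, ''
    for e in s:
        if e.isdigit():
            curr += e
        else:
            total += int(curr) * KeyDict[e]
            curr = ''
    return total

# Assumption: a plain list stands in for df['code'] (5 codes repeated 1000 times)
codes = ['100M', '60M10N40M', '5S99M', '1S25I100M', '1D1S1I200M'] * 1000

runs = 10  # arbitrary number of repetitions
elapsed = timeit.timeit(lambda: [evaluater(c) for c in codes], number=runs)
print(f"{elapsed / runs * 1e3:.2f} ms per pass over {len(codes)} strings")
```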
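Note that you had already implemented something similar, but the outer loop (for key, value in KeyDict.items()) is unnecessary: since KeyDict is a dictionary, use it as a lookup table instead of looping over it. Also, when only one column is relevant, .apply(axis=1) is a very poor way to loop. Select that column and call apply() on it.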
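The two points above can be sketched as a rewrite of the question's function. This is an illustration rather than the answerer's code: the names string_to_val, WEIGHTS, and TOKEN are hypothetical, and the regular expression assumes every code is a sequence of digit runs, each followed by a single uppercase letter found in the dictionary.

```python
import re

# Hypothetical names for illustration; the weights mirror the rules in the question
WEIGHTS = {'M': 1, 'N': 1, 'D': 1, 'I': -1, 'S': 0}
TOKEN = re.compile(r'(\d+)([A-Z])')  # compiled once and reused for every string

def string_to_val(code):
    # findall yields (count, letter) pairs; the dict is used as a lookup table,
    # so there is no loop over its keys and no eval()
    return sum(int(count) * WEIGHTS[letter]
               for count, letter in TOKEN.findall(code))

codes = ['100M', '60M10N40M', '5S99M', '1S25I100M', '1D1S1I200M']
print([string_to_val(c) for c in codes])  # [100, 110, 99, 75, 200]
```

On the DataFrame from the question, this would be applied to the single column, for example df['Val'] = df['code'].map(string_to_val), rather than row by row with axis=1. Whether it beats the plain character loop would need to be timed on the real data.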
This is based on a similar question on Stack Overflow, "python - How to efficiently apply a function to each row in a dataframe": https://stackoverflow.com/questions/75935256/