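This probably has a simple solution, but I am struggling to get this function to work on my dataset.
I have a salary column that contains a mix of formats. Sample dataframe below: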
ID   Income                             Desired Output
1    26000                              26000
2    45K                                45000
3    -                                  NaN
4    0                                  NaN
5    N/A                                NaN
6    2000                               2000
7    30000 - 45000                      37500 ((30000 + 45000) / 2)
8    21000 per Annum                    21000
9    50000 per annum                    50000
10   21000 to 30000                     25500 ((21000 + 30000) / 2)
11                                      NaN
12   21000 To 50000                     35500 ((21000 + 50000) / 2)
13   43000/year                         43000
14                                      NaN
15   80000/Year                         80000
16   12.40 p/h                          12896 (12.40 x 20 x 52)
17   12.40 per hour                     12896 (12.40 x 20 x 52)
18   45000.0 (this is a float value)    45000
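
The two formulas behind the desired values can be checked with a few lines. This is only a sketch: the helper names `midpoint` and `annual_from_hourly` are made up for illustration, and the 20 hours per week and 52 weeks per year are taken from the table above.

```python
def midpoint(low, high):
    # A range such as "30000 - 45000" maps to the average of its two ends
    return (low + high) / 2


def annual_from_hourly(rate, hours_per_week=20, weeks_per_year=52):
    # An hourly rate is scaled up to a yearly figure
    return rate * hours_per_week * weeks_per_year


print(midpoint(30000, 45000))
print(midpoint(21000, 30000))
print(round(annual_from_hourly(12.40)))
```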
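@user34974 was very helpful in providing a workable solution (below). However, that solution raises an error for me because the dataframe column also contains float values. Can anyone help make the function handle float values in the dataframe column? In the end, the output in the updated column should be floats.

import numpy as np
import pandas as pd

Normrep = ['N/A','per Annum','per annum','/year','/Year','p/h','per hour',35000.0]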
def clean_income(value):
    # Strip unit suffixes; this is where non-string input fails (a float has no .replace())
    for i in Normrep:
        value = value.replace(i, "")
    # '-' cannot be added to Normrep because it is also used as a range separator elsewhere in the data
    if len(value) == 0 or value.isspace() or value == '-':
        return np.nan
    elif value == '0':
        return np.nan
    # No extra letters should remain next to K by now, so it can be expanded here
    if value.endswith('K'):
        value = value.replace('K', '000')
    # Ranges written with "to", "To" or "-"
    vals = value.split(' to ')
    if len(vals) != 2:
        vals = value.split(' To ')
    if len(vals) != 2:
        vals = value.split(' - ')
    if len(vals) == 2:
        return (float(vals[0]) + float(vals[1])) / 2
    try:
        a = float(value)
        return a
    except ValueError:
        return np.nan  # Either not proper data, or an input format that still needs handling

testData = ['26000','45K','-','0','N/A','2000','30000 - 45000','21000 per Annum','','21000 to 30000','21000 To 50000','43000/year', 35000.0]
df = pd.DataFrame(testData)
print(df)
df[0] = df[0].apply(lambda x: clean_income(x))
print(df)
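
One possible way to make a function like this tolerate floats is to return numeric input early and only run the string logic on text. The sketch below is an assumption-laden variant, not the original author's code: the names `clean_income_safe`, `SUFFIXES` and `HOURLY` are hypothetical, and the hourly conversion uses the 20 hours x 52 weeks from the desired-output table.

```python
import numpy as np
import pandas as pd

# Suffixes to strip; strings only (a float in this list would break str.replace)
SUFFIXES = ['N/A', 'per Annum', 'per annum', '/year', '/Year']
# Markers for hourly rates (assumed to mean 20 hours/week, 52 weeks/year)
HOURLY = ('p/h', 'per hour')


def clean_income_safe(value):
    """Hypothetical float-tolerant variant of clean_income."""
    # Numbers pass straight through; NaN and 0 count as missing
    if isinstance(value, (int, float)):
        return np.nan if (pd.isna(value) or value == 0) else float(value)
    text = str(value).strip()
    # Hourly rates are converted to a yearly figure
    for marker in HOURLY:
        if text.endswith(marker):
            return float(text[:-len(marker)]) * 20 * 52
    for suffix in SUFFIXES:
        text = text.replace(suffix, '')
    text = text.strip()
    if text in ('', '-', '0'):
        return np.nan
    if text.endswith('K'):
        text = text[:-1] + '000'
    # Ranges become the midpoint of their two ends
    for sep in (' to ', ' To ', ' - '):
        parts = text.split(sep)
        if len(parts) == 2:
            return (float(parts[0]) + float(parts[1])) / 2
    try:
        return float(text)
    except ValueError:
        return np.nan  # unrecognised format


test_data = ['26000', '45K', '-', '0', 'N/A', '2000', '30000 - 45000',
             '21000 per Annum', '', '21000 to 30000', '21000 To 50000',
             '43000/year', '12.40 p/h', 35000.0]
df = pd.DataFrame({'Income': test_data})
df['Income'] = df['Income'].apply(clean_income_safe)
print(df)
```

Because every branch returns either a float or NaN, the resulting column ends up with a float dtype, which is what the question asks for.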
Best answer
Here is how I would do it without all the loops.
import pandas as pd
import numpy as np

c = ['ID','Income']
d = [
    [1, 26000],
    [2, '45K'],
    [3, '-'],
    [4, 0],
    [5, 'N/A'],
    [6, 2000],
    [7, '30000 - 45000'],
    [8, '21000 per Annum'],
    [9, '50000 per annum'],
    [10, '21000 to 30000'],
    [11, ''],
    [12, '21000 To 50000'],
    [13, '43000/year'],
    [14, ''],
    [15, '80000/Year'],
    [16, '12.40 p/h'],
    [17, '12.40 per hour'],
    [18, 45000.00]]

df = pd.DataFrame(d, columns=c)
# Work on a lower-cased string copy so ints, floats and strings are treated alike
df['Income1'] = df['Income'].astype(str).str.lower()
# Placeholders and zero mean "no income": mark them as missing
df['Income1'] = df['Income1'].replace({'n/a': np.nan, '': np.nan, '-': np.nan, '0': np.nan})
# Turn the remaining text into arithmetic expressions
df['Income1'] = df['Income1'].replace(
    {'k$': '000', 'to': '+', '-': '+', ' per annum': '', 'p/h': 'per hour', '/year': ''},
    regex=True)
# Hourly rate to yearly figure: 12 hours/week here (the question's expected values use 20)
df['Income1'] = df['Income1'].replace(' per hour', ' * 12 * 52', regex=True)
# Ranges become "(a + b) / 2"
mask = df['Income1'].str.contains(r'\+', na=False)
df.loc[mask, 'Income1'] = '(' + df['Income1'] + ') / 2'
# eval is only reasonable here because the strings come from known, trusted patterns
df['Income1'] = df['Income1'].apply(lambda x: eval(x) if pd.notnull(x) else x)
# Show whole numbers while keeping NaN for the missing rows
df['Income1'] = (df['Income1'].fillna(0)
                 .astype(int)
                 .astype(object)
                 .where(df['Income1'].notnull()))
print(df)
The output will be:

    ID           Income  Income1
0    1            26000    26000
1    2              45K    45000
2    3                -      NaN
3    4                0      NaN
4    5              N/A      NaN
5    6             2000     2000
6    7    30000 - 45000    37500
7    8  21000 per Annum    21000
8    9  50000 per annum    50000
9   10   21000 to 30000    25500
10  11                       NaN
11  12   21000 To 50000    35500
12  13       43000/year    43000
13  14                       NaN
14  15       80000/Year    80000
15  16        12.40 p/h     7737
16  17   12.40 per hour     7737
17  18            45000    45000
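
Since `eval` runs whatever expression it is given, it may be worth avoiding when the column could contain untrusted text. A rough eval-free sketch of the same idea follows: pull every number out of each cell with a regular expression and average them, so single values stay unchanged and ranges become midpoints. The names `income`, `value` and `HOURS_PER_WEEK` are illustrative assumptions, and the sample is a shortened version of the data above.

```python
import numpy as np
import pandas as pd

HOURS_PER_WEEK = 12  # same factor as the answer above; the question's table assumes 20
WEEKS_PER_YEAR = 52

income = pd.Series([26000, '45K', '-', 0, 'N/A', '30000 - 45000',
                    '21000 to 30000', '43000/year', '12.40 per hour', 45000.0])

s = income.astype(str).str.lower()
# Expand a trailing "k" after a digit into thousands: "45k" -> "45000"
s = s.str.replace(r'(?<=\d)k\b', '000', regex=True)
# Extract every number in each cell; the result is indexed by (row, match number)
nums = s.str.extractall(r'(\d+(?:\.\d+)?)')[0].astype(float)
# Averaging per row keeps single values and turns ranges into midpoints;
# rows without any number are restored as NaN by reindex
value = nums.groupby(level=0).mean().reindex(s.index)
# Scale hourly rates up to a yearly figure
hourly = s.str.contains(r'p/h|per hour')
value = value.where(~hourly, value * HOURS_PER_WEEK * WEEKS_PER_YEAR)
# Treat zero as missing
value = value.mask(value == 0)
print(value)
```

The result is a float column with NaN for missing entries, and no string is ever executed as code.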
A similar question about cleaning an income column in a Python dataframe can be found on Stack Overflow: https://stackoverflow.com/questions/64962033/