Here is my function:
def clean_zipcodes(df):
df.ix[df['workCountryCode'].str.contains('USA') & \
df['workZipcode'].astype(str).str.len() > 5, 'workZipcode'] = \
df['workZipcode'].astype(int).floordiv(10000)
df.ix[df['contractCountryCode'].str.contains('USA') & \
df['contractZipcode'].astype(str).str.len() > 5, 'contractZipcode'] = \
df['contractZipcode'].astype(int).floordiv(10000)
return df
Here is the test function describing what I expect:
def test_clean_zipcodes():
testDf = pandas.DataFrame({'unique_transaction_id' : ['1', '1', '1'],
'workZipcode' : [838431000, 991631000, 99163],
'contractZipcode' : [838431000, 991631000, 99163],
'workCountryCode' : ['USA: STUFF', 'NONE: STUFF', 'USA: STUFF'],
'contractCountryCode' : ['USA: STUFF', 'NONE: STUFF', 'USA: STUFF']})
resultDf = pandas.DataFrame({'unique_transaction_id' : ['1', '1', '1'],
'workZipcode' : [83843, 991631000, 99163],
'contractZipcode' : [83843, 991631000, 99163],
'workCountryCode' : ['USA: STUFF', 'NONE: STUFF', 'USA: STUFF'],
'contractCountryCode' : ['USA: STUFF', 'NONE: STUFF', 'USA: STUFF']})
assert resultDf.equals(clean_zipcodes(testDf))
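Apart from the indentation being off (it did not survive conversion to SO formatting), df.ix is not behaving as expected. It performs no conversion at all on the contractZipcode or workZipcode columns. The first row should be changed to 83843, as shown in resultDf.
Thanks in advance!
Best answer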
In [2]: import pandas as pd
In [3]: testDf = pd.DataFrame({'unique_transaction_id' : ['1', '1', '1'],
...: 'workZipcode' : [838431000, 991631000, 99163],
...: 'contractZipcode' : [838431000, 991631000, 99163],
...: 'workCountryCode' : ['USA: STUFF', 'NONE: STUFF', 'USA: STUFF'],
...: 'contractCountryCode' : ['USA: STUFF', 'NONE: STUFF', 'USA: STUFF']}
...: )
...:
...: resultDf = pd.DataFrame({'unique_transaction_id' : ['1', '1', '1'],
...: 'workZipcode' : [83843, 991631000, 99163],
...: 'contractZipcode' : [83843, 991631000, 99163],
...: 'workCountryCode' : ['USA: STUFF', 'NONE: STUFF', 'USA: STUFF'],
...: 'contractCountryCode' : ['USA: STUFF', 'NONE: STUFF', 'USA: STUFF']})
...:
...:
...:
Note that an empty slice is returned when you try to index like this:
In [4]: testDf.ix[testDf['workCountryCode'].str.contains('USA') &
testDf['workZipcode'].astype(str).str.len() > 5,
'workZipcode']
Out[4]: Series([], Name: workZipcode, dtype: int64)
If you put parentheses around each of the filters:
In [5]: testDf.ix[(testDf['workCountryCode'].str.contains('USA'))
& (testDf['workZipcode'].astype(str).str.len() > 5),
'workZipcode']
Out[5]:
0 838431000
Name: workZipcode, dtype: int64
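you get what you want. The cause is operator precedence: & binds more tightly than >, so without the parentheses the comparison with 5 is applied to the result of the & operation, which is never true here. Switching from ix to loc makes no difference; without the parentheses the slice is still empty: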
In [6]: testDf.loc[testDf['workCountryCode'].str.contains('USA') &
testDf['workZipcode'].astype(str).str.len() > 5,
'workZipcode']
Out[6]: Series([], Name: workZipcode, dtype: int64)
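The precedence issue behind the empty slices can be checked without pandas at all. The following is a minimal sketch using only the standard library; the scalar names is_usa and zip_len are hypothetical stand-ins for one row of the two filters and do not appear in the original code.

```python
import ast

# Python parses `a & b > 5` as `(a & b) > 5`, because `&` has higher
# precedence than comparison operators such as `>`.
tree = ast.parse("a & b > 5", mode="eval").body
print(type(tree).__name__)       # the outermost node is the comparison
print(type(tree.left).__name__)  # its left operand is the `&` operation

# The same effect with plain scalars (hypothetical stand-ins for one row):
is_usa = True   # stands in for str.contains('USA')
zip_len = 9     # stands in for astype(str).str.len()

unparenthesized = is_usa & zip_len > 5    # evaluated as (True & 9) > 5
parenthesized = is_usa & (zip_len > 5)    # evaluated as True & True
print(unparenthesized, parenthesized)
```

With the parentheses in place, both sides of & are boolean conditions, which is what the row filter needs.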
So here is the cleaned-up function. I added a couple of small lambdas for readability.
In [7]: def clean_zipcodes_loc(df):
...: strlen = lambda x: x.astype(str).str.len()
...: floordiv = lambda x: x.astype(int).floordiv(10000)
...:
...: df.loc[((strlen(df.workZipcode)) > 5) &
...: df.workCountryCode.str.contains("USA"),
...: 'workZipcode'] = floordiv(df.workZipcode)
...:
...: df.loc[((strlen(df.contractZipcode)) > 5) &
...: df.contractCountryCode.str.contains("USA"),
...: 'contractZipcode'] = floordiv(df.contractZipcode)
...:
...: return df
...:
In [8]: clean_zipcodes_loc(testDf) == resultDf
Out[8]:
contractCountryCode contractZipcode unique_transaction_id workCountryCode \
0 True True True True
1 True True True True
2 True True True True
workZipcode
0 True
1 True
2 True
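For readers on a recent pandas release, where .ix is no longer available (it was removed in pandas 1.0), the same fix can be written as a standalone script. This is only a sketch under a few assumptions: it loops over the two column prefixes instead of repeating the statement, it works on a copy rather than modifying the input frame, and it uses pandas.testing.assert_frame_equal for the check. None of these choices come from the original answer.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def clean_zipcodes(df):
    """Shorten long USA zipcodes by floor-dividing them by 10000."""
    df = df.copy()  # assumption: leave the caller's frame untouched
    for prefix in ("work", "contract"):
        zip_col = prefix + "Zipcode"
        country_col = prefix + "CountryCode"
        # Each condition is parenthesized so that `&` combines two
        # boolean masks instead of being evaluated before `>`.
        mask = (df[country_col].str.contains("USA")) & (
            df[zip_col].astype(str).str.len() > 5
        )
        # Floor division by 10000 drops the last four digits,
        # as in the original function.
        df.loc[mask, zip_col] = df.loc[mask, zip_col].astype(int) // 10000
    return df


test_df = pd.DataFrame({
    "unique_transaction_id": ["1", "1", "1"],
    "workZipcode": [838431000, 991631000, 99163],
    "contractZipcode": [838431000, 991631000, 99163],
    "workCountryCode": ["USA: STUFF", "NONE: STUFF", "USA: STUFF"],
    "contractCountryCode": ["USA: STUFF", "NONE: STUFF", "USA: STUFF"],
})
expected_df = test_df.copy()
expected_df.loc[0, ["workZipcode", "contractZipcode"]] = 83843

cleaned = clean_zipcodes(test_df)
assert_frame_equal(cleaned, expected_df)  # raises if the frames differ
```

The == comparison in Out[8] returns an element-by-element frame of booleans, whereas assert_frame_equal raises on a mismatch, which tends to be more convenient inside a test function.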
For more on Python Pandas df.ix not performing as expected, see the similar question on Stack Overflow: https://stackoverflow.com/questions/39358602/