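python - Unable to predefine dtype when reading data

Tags: python pandas

I am reading a pipe-delimited file with no header row into Pandas, using Pandas version 0.24.2. This is public data, so there are no confidentiality concerns.

The data looks like this: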

999778247820|R|JPMORGAN CHASE BANK, NATIONAL ASSOCIATION|7.375|113000|360|02/2001|04/2001|95|95|1|52|665|Y|P|SF|1|P|IL|601|30|FRM||1|N
999783196683|R|OTHER|7.25|59000|360|01/2001|04/2001|97|97|2|43|682|Y|P|PU|1|P|HI|967|30|FRM|676|1|N
999783470376|C|BANK OF AMERICA, N.A.|7.875|110000|360|12/2000|02/2001|74|74|2|26|700|N|P|SF|1|P|NY|125||FRM|698||N
999786911479|C|BANK OF AMERICA, N.A.|7.5|57000|360|12/2000|02/2001|90|90|1|28|699|N|P|SF|1|P|TX|781|25|FRM||1|N
999786913710|R|JPMORGAN CHASE BANK, NA|7.125|114000|360|01/2001|04/2001|73|73|2|16|745|N|C|SF|1|P|WA|992||FRM|||N
999788833695|B|OTHER|9|50000|360|10/2000|12/2000|90|90|2|40|674|N|P|SF|2|I|WI|535|25|FRM|737|1|N

This is the code I am using:

import glob

import pandas as pd

orig_files_fnma = glob.glob("/...1/Acquisition*.txt")

col_names = ["loan_id", "origination_channel","seller_name","original_interest_rate","original_upb","original_loan_term","origination_date","first_payment_date","original_ltv","original_cltv","number_of_borrowers","original_dti",
            "borrower_fico_at_origination","first_time_home_buyer_indicator", "loan_purpose","property_type","number_of_units","occupancy_type","property_state","zip_code_short","primary_mortgage_insurance_percent",
            "product_type","coborrower_fico_at_origination","mortgage_insurance_type","relocation_mortgage_indicator"]

col_type = {"loan_id": "object","origination_channel": "object","seller_name": "object","original_interest_rate": "float","original_upb": "float","original_loan_term": "int","origination_date": "object",
            "first_payment_date": "object","original_ltv": "object","original_cltv": "object","number_of_borrowers": "int","original_dti": "float","borrower_fico_at_origination": "int",
            "first_time_home_buyer_indicator": "object", "loan_purpose": "object","property_type": "object","number_of_units": "int","occupancy_type": "object","property_state": "object",
            "zip_code_short": "object","primary_mortgage_insurance_percent": "float",
            "product_type": "object","coborrower_fico_at_origination": "int","mortgage_insurance_type": "object","relocation_mortgage_indicator": "object"}

dfs = []

# Read each acquisition file with the predefined column names and dtypes.
for path in orig_files_fnma:
    temp_df = pd.read_csv(path, sep='|', header=None, names=col_names, dtype=col_type, index_col=None, parse_dates=True, verbose=True, engine='python')
    dfs.append(temp_df)

I always get the following error:

Filled 1 NA values in column original_ltv
Filled 52 NA values in column original_cltv
ValueError: Unable to convert column number_of_borrowers to type int

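I did find that it works if I do not predefine dtype and instead use .astype to change the data types after loading. But is it possible to predefine the data types up front, as in the code above?

Also, I would like to limit the object columns to a length of 20. What is the correct code for that?

Thanks a lot!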

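Best answer

I got a different error:

ValueError: Unable to convert column coborrower_fico_at_origination to type int

Import the data into Excel and you will see that 3 rows in that column are blank. The int type cannot represent missing values. You should change the column to float, at which point the blanks become NaN: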

col_type = {..., "coborrower_fico_at_origination": "float", ...}

After that, the command succeeds.
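As a rough sketch of the same idea without Excel: the snippet below first reads everything as strings to count blanks per column, then re-reads with pandas' nullable "Int64" dtype, which keeps whole numbers and stores blanks as <NA> instead of forcing float. The in-memory sample is three rows from the question cut down to four columns, which is an assumption made purely for illustration. The snippet also assumes a pandas version whose read_csv accepts "Int64" in dtype; if yours does not, "float" as shown above remains the fallback. The last line is one possible way to cap string length at 20 after loading, since object columns have no fixed width.

```python
import io

import pandas as pd

# Three rows from the question's sample, trimmed to four columns
# (the trimming is an assumption for illustration, not the real file layout).
sample = io.StringIO(
    "999778247820|1|665|\n"
    "999783196683|2|682|676\n"
    "999786913710|2|745|\n"
)
names = [
    "loan_id",
    "number_of_borrowers",
    "borrower_fico_at_origination",
    "coborrower_fico_at_origination",
]

# Step 1: read every column as a string and count the blanks per column.
raw = pd.read_csv(sample, sep="|", header=None, names=names, dtype=str)
blank_counts = raw.isna().sum()
print(blank_counts)

# Step 2: re-read with nullable integers for the columns that may be blank.
sample.seek(0)
col_type = {
    "loan_id": "object",
    "number_of_borrowers": "Int64",
    "borrower_fico_at_origination": "Int64",
    "coborrower_fico_at_origination": "Int64",
}
df = pd.read_csv(sample, sep="|", header=None, names=names, dtype=col_type)
print(df.dtypes)

# Object (string) columns have no fixed length; truncate explicitly if needed.
short_id = df["loan_id"].str.slice(0, 20)
```

Any column that reports a non-zero blank count in step 1 cannot be read as a plain "int", so it needs either "float" or "Int64" in col_type.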

Regarding "python - Unable to predefine dtype when reading data", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/59971592/
