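python - Pandas to_numeric(errors='coerce') does not convert invalid values to NaN

Tags: python pandas

I have a dataset, "app_metadata.csv", with three columns: item_id, category, and description. item_id is an integer, category is a string, and description is a string. I load the dataset with the following code: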

app_metadata_df = pd.read_csv(app_metadata_csv_path)

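However, the dataset contains corrupted data: for example, there is a row where item_id is text rather than a number. I want to drop the rows with invalid item_id values and convert the item_id column to int. Here is what I tried. First, I call pd.to_numeric with errors='coerce':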

app_metadata_df.loc["item_id"] = pd.to_numeric(app_metadata_df["item_id"], errors='coerce')

Then I drop the NA values:

app_metadata_df.loc["item_id"] = app_metadata_df.loc["item_id"].dropna()

Finally, I call astype(int) to convert the data type to int:

app_metadata_df.loc["item_id"] = app_metadata_df["item_id"].astype(int)

However, it raises the following error:

invalid literal for int() with base 10: 'So'

It looks like to_numeric did not convert some of the invalid values to NaN. Why does this happen, and how can I fix it?

Accepted answer

Try this:

app_metadata_df = pd.read_csv(app_metadata_csv_path)

app_metadata_df['item_id'] = pd.to_numeric(app_metadata_df["item_id"], errors='coerce')
app_metadata_df = app_metadata_df[app_metadata_df['item_id'].notna()].reset_index()

app_metadata_df["item_id"] = app_metadata_df["item_id"].astype(int)
>>> print(app_metadata_df)
       index   item_id            category  \
0          0  593676.0  HEALTH_AND_FITNESS   
1          1  601235.0                GAME   
2          2  860079.0       COMMUNICATION   
3          3   64855.0       VIDEO_PLAYERS   
4          4  597756.0             MEDICAL   
...      ...       ...                 ...   
98577  98594  683377.0               TOOLS   
98578  98595  862905.0             FINANCE   
98579  98596  165878.0     MUSIC_AND_AUDIO   
98580  98597  683417.0         PHOTOGRAPHY   
98581  98598  703224.0                GAME   

                                             description  
0      Abs Workout, designed by professional fitness ...  
1      The best building game on android is free to d...  
2      Tamil Actress Stickers app has 200 + Tamil her...  
3      The simplest VLC Remote you'll ever find. Peri...  
4      This is the official mobile app of the Nationa...  
...                                                  ...  
98577  endoscope app for android an app to connect wi...  
98578  Acerca de esta app<br>La App OCA está pensada ...  
98579  This app provides free downloading of audio sh...  
98580  <b>Water Paint : Colour Effect</b><br><br>Want...  
98581  DIAMOND CRUSH with spectacular graphics and ex...  

[98582 rows x 4 columns]
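The extra index column in the frame above comes from reset_index(), which by default keeps the old index as a regular column. If that column is not wanted, a variant along the following lines may be a better fit. This is only a sketch: the three-row frame is a made-up stand-in for app_metadata.csv, and dropna(subset=...) is swapped in for the notna() mask used in the answer.

```python
import pandas as pd

# Hypothetical miniature stand-in for app_metadata.csv (not the real data)
df = pd.DataFrame(
    {
        "item_id": ["593676", "So", "601235"],
        "category": ["HEALTH_AND_FITNESS", "GAME", "GAME"],
    }
)

# Unparseable entries such as "So" become NaN
df["item_id"] = pd.to_numeric(df["item_id"], errors="coerce")

# Drop the rows where item_id is NaN; drop=True discards the old index
# instead of keeping it as an extra "index" column
clean = df.dropna(subset=["item_id"]).reset_index(drop=True)

# No NaN values remain, so the cast to an integer dtype is safe
clean["item_id"] = clean["item_id"].astype("int64")

print(clean)
```

With drop=True the result has only the original columns and a fresh 0..n-1 index.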
>>> print(app_metadata_df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98582 entries, 0 to 98581
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   index        98582 non-null  int64 
 1   item_id      98582 non-null  int32 
 2   category     98582 non-null  object
 3   description  98582 non-null  object
dtypes: int32(1), int64(1), object(2)
memory usage: 2.6+ MB
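As for why the original attempt failed: to_numeric most likely did coerce the bad values, but the result never reached the column. With a single label, .loc["item_id"] addresses a row labelled "item_id", not the column, so the item_id column kept its original strings and astype(int) later tripped over 'So'. Separately, assigning the result of dropna() back to a column does not remove rows, because pandas realigns on the index and the gaps become NaN again; the rows have to be filtered out of the whole frame. The sketch below illustrates both points on a made-up three-row frame (the data and variable names are assumptions for illustration).

```python
import pandas as pd

# Hypothetical sample data, not taken from the real CSV
df = pd.DataFrame({"item_id": ["1", "So", "3"], "category": ["A", "B", "C"]})

# to_numeric on its own coerces the bad value to NaN
converted = pd.to_numeric(df["item_id"], errors="coerce")
print(converted.isna().tolist())

# .loc with a single label is a row lookup; on the default RangeIndex
# there is no row labelled "item_id", so reading it raises KeyError
try:
    df.loc["item_id"]
    row_lookup_failed = False
except KeyError:
    row_lookup_failed = True
print(row_lookup_failed)

# Plain bracket indexing targets the column
df["item_id"] = converted

# Assigning dropna() back to the column realigns on the index,
# so the frame still has three rows and one NaN
df["item_id"] = df["item_id"].dropna()
print(len(df), int(df["item_id"].isna().sum()))

# Filtering the frame itself is what removes the row
df = df[df["item_id"].notna()]
print(len(df))
```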

For more on python - Pandas to_numeric(errors='coerce') does not convert invalid values to NaN, see the similar question on Stack Overflow: https://stackoverflow.com/questions/73764054/
